[jira] [Updated] (SPARK-24828) Incompatible parquet formats - java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary

2018-07-17 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-24828:

Attachment: image-2018-07-18-13-57-21-148.png

> Incompatible parquet formats - java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
> -
>
> Key: SPARK-24828
> URL: https://issues.apache.org/jira/browse/SPARK-24828
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: Environment for creating the parquet file:
> IBM Watson Studio Apache Spark Service, V2.1.2
> Environment for reading the parquet file:
> java version "1.8.0_144"
> Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
> MacOSX 10.13.3 (17D47)
> Spark spark-2.1.2-bin-hadoop2.7 directly obtained from 
> http://spark.apache.org/downloads.html
>Reporter: Romeo Kienzer
>Priority: Minor
> Attachments: a2_m2.parquet.zip, image-2018-07-18-13-57-21-148.png
>
>
> As requested by [~hyukjin.kwon], here is a new issue - the related issue can be
> found here
>  
> Reading the attached parquet file, which was created with one Spark installation,
> using an installation obtained directly from
> [http://spark.apache.org/downloads.html] yields the following exception:
>  
> 18/07/17 07:40:38 ERROR Executor: Exception in task 3.0 in stage 1.0 (TID 4)
>  scala.MatchError: [1.0,null] (of class 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
>      at 
> org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
>      at 
> org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
>      at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>      at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>      at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:193)
>      at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
>      at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>      at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>      at org.apache.spark.scheduler.Task.run(Task.scala:99)
>      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
>      at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>      at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>      at java.lang.Thread.run(Thread.java:748)
>  18/07/17 07:40:38 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
>  java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
>      at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
>      at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
>      at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>      at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>      at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>      at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>      at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>      at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>      at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>      at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
>      at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
>      at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>      at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>      at org.apache.spark.scheduler.Task.run(Task.scala:99)
>      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
>      at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>      at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>      at java.lang.Thread.run(Thread.java:748)
>  
> The file is attached [^a2_m2.parquet.zip]
>  
> The following code reproduces the error:
> df = spark.read.parquet('a2_m2.parquet')
> from pyspark.ml.evaluation import 

[jira] [Comment Edited] (SPARK-24828) Incompatible parquet formats - java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary

2018-07-17 Thread Romeo Kienzer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547418#comment-16547418
 ] 

Romeo Kienzer edited comment on SPARK-24828 at 7/18/18 5:53 AM:


Dear [~q79969786] - thanks for pointing this out.

[~hyukjin.kwon] - I've written the following code, which casts the label field from
Integer to Long, but it results in the same error:

 

from pyspark.sql.types import LongType

df = spark.read.parquet('a2_m2.parquet')
df_hat = df \
    .withColumn("label_tmp", df["label"].cast(LongType())) \
    .drop('label') \
    .withColumnRenamed('label_tmp', 'label')

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

binEval = MulticlassClassificationEvaluator().setMetricName("accuracy") \
    .setPredictionCol("prediction").setLabelCol("label")

accuracy = binEval.evaluate(df_hat)
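One way to sidestep the mixed per-file schemas (the comments further down show the label column stored as INT32 in one part file and INT64 in another) is to read each part file with its own schema, cast, and then union. A minimal sketch, assuming the unzipped a2_m2.parquet directory layout and an existing {{spark}} session:

{code:python}
import glob
from functools import reduce

from pyspark.sql.functions import col

# Read every part file on its own so each keeps its native label type,
# then align the type before unioning. The glob pattern is an assumption
# about the unzipped attachment layout.
parts = sorted(glob.glob('a2_m2.parquet/*.parquet'))
dfs = [spark.read.parquet(p).withColumn('label', col('label').cast('long'))
       for p in parts]
df_hat = reduce(lambda a, b: a.union(b.select(a.columns)), dfs)
{code}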


was (Author: romeokienzler):
Dear [~q79969786] - thanks for pointing this out.

[~hyukjin.kwon] - I've written the following code, which casts the label field from
Integer to Long, but it results in the same error:

 

from pyspark.sql.types import LongType

df = spark.read.parquet('a2_m2.parquet')
df_cast = df.withColumn("label_tmp", df["label"].cast(LongType()))
df_hat = df_cast.drop('label').withColumnRenamed('label_tmp', 'label')

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

binEval = MulticlassClassificationEvaluator().setMetricName("accuracy") \
    .setPredictionCol("prediction").setLabelCol("label")

accuracy = binEval.evaluate(df_hat)

> Incompatible parquet formats - java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
> -
>
> Key: SPARK-24828
> URL: https://issues.apache.org/jira/browse/SPARK-24828
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: Environment for creating the parquet file:
> IBM Watson Studio Apache Spark Service, V2.1.2
> Environment for reading the parquet file:
> java version "1.8.0_144"
> Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
> MacOSX 10.13.3 (17D47)
> Spark spark-2.1.2-bin-hadoop2.7 directly obtained from 
> http://spark.apache.org/downloads.html
>Reporter: Romeo Kienzer
>Priority: Minor
> Attachments: a2_m2.parquet.zip
>
>
> As requested by [~hyukjin.kwon], here is a new issue - the related issue can be
> found here
>  
> Reading the attached parquet file, which was created with one Spark installation,
> using an installation obtained directly from
> [http://spark.apache.org/downloads.html] yields the following exception:
>  
> 18/07/17 07:40:38 ERROR Executor: Exception in task 3.0 in stage 1.0 (TID 4)
>  scala.MatchError: [1.0,null] (of class 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
>      at 
> org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
>      at 
> org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
>      at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>      at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>      at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:193)
>      at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
>      at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>      at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>      at org.apache.spark.scheduler.Task.run(Task.scala:99)
>      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
>      at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>      at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>      at java.lang.Thread.run(Thread.java:748)
>  18/07/17 07:40:38 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
>  java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
>      at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
>      at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
>      at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>      at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>      at 
> 

[jira] [Commented] (SPARK-24828) Incompatible parquet formats - java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary

2018-07-17 Thread Romeo Kienzer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547418#comment-16547418
 ] 

Romeo Kienzer commented on SPARK-24828:
---

Dear [~q79969786] - thanks for pointing this out.

[~hyukjin.kwon] - I've written the following code, which casts the label field from
Integer to Long, but it results in the same error:

 

from pyspark.sql.types import LongType

df = spark.read.parquet('a2_m2.parquet')
df_cast = df.withColumn("label_tmp", df["label"].cast(LongType()))
df_hat = df_cast.drop('label').withColumnRenamed('label_tmp', 'label')

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

binEval = MulticlassClassificationEvaluator().setMetricName("accuracy") \
    .setPredictionCol("prediction").setLabelCol("label")

accuracy = binEval.evaluate(df_hat)

> Incompatible parquet formats - java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
> -
>
> Key: SPARK-24828
> URL: https://issues.apache.org/jira/browse/SPARK-24828
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: Environment for creating the parquet file:
> IBM Watson Studio Apache Spark Service, V2.1.2
> Environment for reading the parquet file:
> java version "1.8.0_144"
> Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
> MacOSX 10.13.3 (17D47)
> Spark spark-2.1.2-bin-hadoop2.7 directly obtained from 
> http://spark.apache.org/downloads.html
>Reporter: Romeo Kienzer
>Priority: Minor
> Attachments: a2_m2.parquet.zip
>
>
> As requested by [~hyukjin.kwon], here is a new issue - the related issue can be
> found here
>  
> Reading the attached parquet file, which was created with one Spark installation,
> using an installation obtained directly from
> [http://spark.apache.org/downloads.html] yields the following exception:
>  
> 18/07/17 07:40:38 ERROR Executor: Exception in task 3.0 in stage 1.0 (TID 4)
>  scala.MatchError: [1.0,null] (of class 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
>      at 
> org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
>      at 
> org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
>      at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>      at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>      at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:193)
>      at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
>      at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>      at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>      at org.apache.spark.scheduler.Task.run(Task.scala:99)
>      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
>      at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>      at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>      at java.lang.Thread.run(Thread.java:748)
>  18/07/17 07:40:38 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
>  java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
>      at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
>      at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
>      at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>      at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>      at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>      at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>      at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>      at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>      at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>      at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
>      at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
>      at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>      at 
> 

[jira] [Assigned] (SPARK-24840) do not use dummy filter to switch codegen on/off

2018-07-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24840:


Assignee: Wenchen Fan  (was: Apache Spark)

> do not use dummy filter to switch codegen on/off
> 
>
> Key: SPARK-24840
> URL: https://issues.apache.org/jira/browse/SPARK-24840
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24840) do not use dummy filter to switch codegen on/off

2018-07-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24840:


Assignee: Apache Spark  (was: Wenchen Fan)

> do not use dummy filter to switch codegen on/off
> 
>
> Key: SPARK-24840
> URL: https://issues.apache.org/jira/browse/SPARK-24840
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24840) do not use dummy filter to switch codegen on/off

2018-07-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547387#comment-16547387
 ] 

Apache Spark commented on SPARK-24840:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/21795

> do not use dummy filter to switch codegen on/off
> 
>
> Key: SPARK-24840
> URL: https://issues.apache.org/jira/browse/SPARK-24840
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24840) do not use dummy filter to switch codegen on/off

2018-07-17 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-24840:
---

 Summary: do not use dummy filter to switch codegen on/off
 Key: SPARK-24840
 URL: https://issues.apache.org/jira/browse/SPARK-24840
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.4.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24809) Serializing LongHashedRelation in executor may result in data error

2018-07-17 Thread zenglinxi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547337#comment-16547337
 ] 

zenglinxi edited comment on SPARK-24809 at 7/18/18 4:10 AM:


[^Spark LongHashedRelation serialization.svg]

I think it's a hidden but critical bug that may cause data errors.

 
{code:java}
// code in HashedRelation.scala
private def write(
    writeBoolean: (Boolean) => Unit,
    writeLong: (Long) => Unit,
    writeBuffer: (Array[Byte], Int, Int) => Unit): Unit = {
  writeBoolean(isDense)
  writeLong(minKey)
  writeLong(maxKey)
  writeLong(numKeys)
  writeLong(numValues)
  writeLong(numKeyLookups)
  writeLong(numProbes)

  writeLong(array.length)
  writeLongArray(writeBuffer, array, array.length)
  val used = ((cursor - Platform.LONG_ARRAY_OFFSET) / 8).toInt
  writeLong(used)
  writeLongArray(writeBuffer, page, used)
}
{code}
This write function in HashedRelation.scala is called when the executor does not
have enough memory for the LongToUnsafeRowMap that holds the broadcast table's
data. However, on the executor the value of cursor may never change after it is
initialized by
{code:java}
// code placeholder
private var cursor: Long = Platform.LONG_ARRAY_OFFSET
{code}
so when the map is written to disk, "used" in the write function evaluates to
(cursor - Platform.LONG_ARRAY_OFFSET) / 8 = 0 and no page data is written.
Deserializing that on-disk data later produces a wrong pointer, and the broadcast
join can return wrong data.

 

 

 


was (Author: gostop_zlx):
[^Spark LongHashedRelation serialization.svg]

I think it's a hidden but critical bug that may cause data errors.

 
{code:java}
// code in HashedRelation.scala
private def write(
    writeBoolean: (Boolean) => Unit,
    writeLong: (Long) => Unit,
    writeBuffer: (Array[Byte], Int, Int) => Unit): Unit = {
  writeBoolean(isDense)
  writeLong(minKey)
  writeLong(maxKey)
  writeLong(numKeys)
  writeLong(numValues)
  writeLong(numKeyLookups)
  writeLong(numProbes)

  writeLong(array.length)
  writeLongArray(writeBuffer, array, array.length)
  val used = ((cursor - Platform.LONG_ARRAY_OFFSET) / 8).toInt
  writeLong(used)
  writeLongArray(writeBuffer, page, used)
}
{code}
This write function in HashedRelation.scala is called when the executor does not
have enough memory for the LongToUnsafeRowMap that holds the broadcast table's
data. However, on the executor the value of cursor may never change after it is
initialized by
{code:java}
// code placeholder
private var cursor: Long = Platform.LONG_ARRAY_OFFSET
{code}
so when the map is written to disk, "used" in the write function evaluates to
(cursor - Platform.LONG_ARRAY_OFFSET) / 8 = 0 and no page data is written.
Deserializing that on-disk data later produces a wrong pointer, and the broadcast
join can return wrong data.

 

 

 

> Serializing LongHashedRelation in executor may result in data error
> ---
>
> Key: SPARK-24809
> URL: https://issues.apache.org/jira/browse/SPARK-24809
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
> Environment: Spark 2.2.1
> hadoop 2.7.1
>Reporter: Lijia Liu
>Priority: Critical
> Attachments: Spark LongHashedRelation serialization.svg
>
>
> When the join key is long or int in a broadcast join, Spark uses
> LongHashedRelation as the broadcast value (see SPARK-14419 for details). If the
> broadcast value is abnormally big, the executor serializes it to disk, but data
> is lost during serialization.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24809) Serializing LongHashedRelation in executor may result in data error

2018-07-17 Thread zenglinxi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547337#comment-16547337
 ] 

zenglinxi edited comment on SPARK-24809 at 7/18/18 4:09 AM:


[^Spark LongHashedRelation serialization.svg]

I think it's a hidden but critical bug that may cause data errors.

 
{code:java}
// code in HashedRelation.scala
private def write(
    writeBoolean: (Boolean) => Unit,
    writeLong: (Long) => Unit,
    writeBuffer: (Array[Byte], Int, Int) => Unit): Unit = {
  writeBoolean(isDense)
  writeLong(minKey)
  writeLong(maxKey)
  writeLong(numKeys)
  writeLong(numValues)
  writeLong(numKeyLookups)
  writeLong(numProbes)

  writeLong(array.length)
  writeLongArray(writeBuffer, array, array.length)
  val used = ((cursor - Platform.LONG_ARRAY_OFFSET) / 8).toInt
  writeLong(used)
  writeLongArray(writeBuffer, page, used)
}
{code}
This write function in HashedRelation.scala is called when the executor does not
have enough memory for the LongToUnsafeRowMap that holds the broadcast table's
data. However, on the executor the value of cursor may never change after it is
initialized by
{code:java}
// code placeholder
private var cursor: Long = Platform.LONG_ARRAY_OFFSET
{code}
so when the map is written to disk, "used" in the write function evaluates to
(cursor - Platform.LONG_ARRAY_OFFSET) / 8 = 0 and no page data is written.
Deserializing that on-disk data later produces a wrong pointer, and the broadcast
join can return wrong data.

 

 

 


was (Author: gostop_zlx):
[^Spark LongHashedRelation serialization.svg]

I think it's a hidden but critical bug that may cause data errors.

> Serializing LongHashedRelation in executor may result in data error
> ---
>
> Key: SPARK-24809
> URL: https://issues.apache.org/jira/browse/SPARK-24809
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
> Environment: Spark 2.2.1
> hadoop 2.7.1
>Reporter: Lijia Liu
>Priority: Critical
> Attachments: Spark LongHashedRelation serialization.svg
>
>
> When the join key is long or int in a broadcast join, Spark uses
> LongHashedRelation as the broadcast value (see SPARK-14419 for details). If the
> broadcast value is abnormally big, the executor serializes it to disk, but data
> is lost during serialization.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24768) Have a built-in AVRO data source implementation

2018-07-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547346#comment-16547346
 ] 

Apache Spark commented on SPARK-24768:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/21801

> Have a built-in AVRO data source implementation
> ---
>
> Key: SPARK-24768
> URL: https://issues.apache.org/jira/browse/SPARK-24768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
> Attachments: Built-in AVRO Data Source In Spark 2.4.pdf
>
>
> Apache Avro (https://avro.apache.org) is a popular data serialization format. It
> is widely used in the Spark and Hadoop ecosystem, especially for Kafka-based data
> pipelines. Using the external package
> [https://github.com/databricks/spark-avro], Spark SQL can read and write Avro
> data. Making spark-avro built-in can provide a better experience for first-time
> users of Spark SQL and Structured Streaming. We expect the built-in Avro data
> source to further improve the adoption of Structured Streaming. The proposal is
> to inline the code from the spark-avro package
> ([https://github.com/databricks/spark-avro]). The target release is Spark 2.4.
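For reference, this is the usage pattern of the external package that the proposal would inline; a minimal PySpark sketch, assuming the spark-avro package is on the classpath (e.g. started with --packages com.databricks:spark-avro_2.11:4.0.0) and hypothetical file paths:

{code:python}
# Read and write Avro data through the external databricks/spark-avro package.
# The paths below are placeholders.
df = spark.read.format("com.databricks.spark.avro").load("/tmp/events.avro")
df.write.format("com.databricks.spark.avro").save("/tmp/events_copy")
{code}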



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24386) implement continuous processing coalesce(1)

2018-07-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547345#comment-16547345
 ] 

Apache Spark commented on SPARK-24386:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/21801

> implement continuous processing coalesce(1)
> ---
>
> Key: SPARK-24386
> URL: https://issues.apache.org/jira/browse/SPARK-24386
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Major
> Fix For: 2.4.0
>
>
> [~marmbrus] suggested this as a good implementation checkpoint. If we do the 
> shuffle reader and writer correctly, it should be easy to make a custom 
> coalesce(1) execution for continuous processing using them, without having to 
> implement the logic for shuffle writers finding out where shuffle readers are 
> located. (The coalesce(1) can just get the RpcEndpointRef directly from the 
> reader and pass it to the writers.)
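For context, the user-facing query shape this targets would look roughly like the sketch below (PySpark, using the continuous trigger added in Spark 2.3); supporting coalesce(1) in continuous mode is exactly what this ticket proposes, so the coalesce call is the part that does not work today:

{code:python}
# A continuous-processing query coalesced to a single partition.
stream = (spark.readStream
          .format("rate")
          .option("rowsPerSecond", 10)
          .load()
          .coalesce(1))

query = (stream.writeStream
         .format("console")
         .trigger(continuous="1 second")
         .start())
{code}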



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24809) Serializing LongHashedRelation in executor may result in data error

2018-07-17 Thread zenglinxi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547337#comment-16547337
 ] 

zenglinxi commented on SPARK-24809:
---

[^Spark LongHashedRelation serialization.svg]

I think it's a hidden but critical bug that may cause data error.

> Serializing LongHashedRelation in executor may result in data error
> ---
>
> Key: SPARK-24809
> URL: https://issues.apache.org/jira/browse/SPARK-24809
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
> Environment: Spark 2.2.1
> hadoop 2.7.1
>Reporter: Lijia Liu
>Priority: Critical
> Attachments: Spark LongHashedRelation serialization.svg
>
>
> When the join key is long or int in a broadcast join, Spark uses
> LongHashedRelation as the broadcast value (see SPARK-14419 for details). If the
> broadcast value is abnormally big, the executor serializes it to disk, but data
> is lost during serialization.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24809) Serializing LongHashedRelation in executor may result in data error

2018-07-17 Thread zenglinxi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zenglinxi updated SPARK-24809:
--
Attachment: Spark LongHashedRelation serialization.svg

> Serializing LongHashedRelation in executor may result in data error
> ---
>
> Key: SPARK-24809
> URL: https://issues.apache.org/jira/browse/SPARK-24809
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
> Environment: Spark 2.2.1
> hadoop 2.7.1
>Reporter: Lijia Liu
>Priority: Critical
> Attachments: Spark LongHashedRelation serialization.svg
>
>
> When the join key is long or int in a broadcast join, Spark uses
> LongHashedRelation as the broadcast value (see SPARK-14419 for details). If the
> broadcast value is abnormally big, the executor serializes it to disk, but data
> is lost during serialization.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24829) In Spark Thrift Server, CAST AS FLOAT inconsistent with spark-shell or spark-sql

2018-07-17 Thread zuotingbing (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zuotingbing updated SPARK-24829:

Summary: In Spark Thrift Server, CAST AS FLOAT inconsistent with 
spark-shell or spark-sql   (was: CAST AS FLOAT inconsistent with Hive)

> In Spark Thrift Server, CAST AS FLOAT inconsistent with spark-shell or 
> spark-sql 
> -
>
> Key: SPARK-24829
> URL: https://issues.apache.org/jira/browse/SPARK-24829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: zuotingbing
>Priority: Major
> Attachments: 2018-07-18_110944.png, 2018-07-18_11.png
>
>
> SELECT CAST('4.56' AS FLOAT)
> the result is 4.55942779541, but it should be 4.56
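For context, 4.56 cannot be represented exactly as a single-precision float (FloatType); a minimal Python sketch, independent of Spark, showing the nearest representable value. The inconsistency in the summary is presumably about how the different front ends render that stored value back as text.

{code:python}
import struct

# Round-trip 4.56 through a 32-bit IEEE-754 float, the storage used by FloatType.
f32 = struct.unpack('f', struct.pack('f', 4.56))[0]
print(f32)  # 4.559999942779541
{code}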



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24829) In Spark Thrift Server, CAST AS FLOAT inconsistent with spark-shell or spark-sql

2018-07-17 Thread zuotingbing (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zuotingbing updated SPARK-24829:

Attachment: (was: CAST-FLOAT.png)

> In Spark Thrift Server, CAST AS FLOAT inconsistent with spark-shell or 
> spark-sql 
> -
>
> Key: SPARK-24829
> URL: https://issues.apache.org/jira/browse/SPARK-24829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: zuotingbing
>Priority: Major
> Attachments: 2018-07-18_110944.png, 2018-07-18_11.png
>
>
> SELECT CAST('4.56' AS FLOAT)
> the result is 4.55942779541, but it should be 4.56



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24829) In Spark Thrift Server, CAST AS FLOAT inconsistent with spark-shell or spark-sql

2018-07-17 Thread zuotingbing (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zuotingbing updated SPARK-24829:

Attachment: 2018-07-18_11.png

> In Spark Thrift Server, CAST AS FLOAT inconsistent with spark-shell or 
> spark-sql 
> -
>
> Key: SPARK-24829
> URL: https://issues.apache.org/jira/browse/SPARK-24829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: zuotingbing
>Priority: Major
> Attachments: 2018-07-18_110944.png, 2018-07-18_11.png
>
>
> SELECT CAST('4.56' AS FLOAT)
> the result is 4.55942779541, but it should be 4.56



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24829) In Spark Thrift Server, CAST AS FLOAT inconsistent with spark-shell or spark-sql

2018-07-17 Thread zuotingbing (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zuotingbing updated SPARK-24829:

Attachment: 2018-07-18_110944.png

> In Spark Thrift Server, CAST AS FLOAT inconsistent with spark-shell or 
> spark-sql 
> -
>
> Key: SPARK-24829
> URL: https://issues.apache.org/jira/browse/SPARK-24829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: zuotingbing
>Priority: Major
> Attachments: 2018-07-18_110944.png, 2018-07-18_11.png
>
>
> SELECT CAST('4.56' AS FLOAT)
> the result is 4.55942779541, but it should be 4.56



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24828) Incompatible parquet formats - java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary

2018-07-17 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547299#comment-16547299
 ] 

Hyukjin Kwon commented on SPARK-24828:
--

Thanks [~q79969786]. Yeah, currently you can't mix the types. [~romeokienzler],
mind testing it after matching the schema?
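A quick way to confirm the per-file mismatch before matching the schema; a minimal sketch, assuming the unzipped a2_m2.parquet directory is in the working directory and a {{spark}} session exists:

{code:python}
import glob

# Print the schema each part file reports on its own, to see which files
# store label as an int and which as a long.
for path in sorted(glob.glob('a2_m2.parquet/*.parquet')):
    print(path)
    spark.read.parquet(path).select('label').printSchema()
{code}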

> Incompatible parquet formats - java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
> -
>
> Key: SPARK-24828
> URL: https://issues.apache.org/jira/browse/SPARK-24828
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: Environment for creating the parquet file:
> IBM Watson Studio Apache Spark Service, V2.1.2
> Environment for reading the parquet file:
> java version "1.8.0_144"
> Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
> MacOSX 10.13.3 (17D47)
> Spark spark-2.1.2-bin-hadoop2.7 directly obtained from 
> http://spark.apache.org/downloads.html
>Reporter: Romeo Kienzer
>Priority: Minor
> Attachments: a2_m2.parquet.zip
>
>
> As requested by [~hyukjin.kwon], here is a new issue - the related issue can be
> found here
>  
> Reading the attached parquet file, which was created with one Spark installation,
> using an installation obtained directly from
> [http://spark.apache.org/downloads.html] yields the following exception:
>  
> 18/07/17 07:40:38 ERROR Executor: Exception in task 3.0 in stage 1.0 (TID 4)
>  scala.MatchError: [1.0,null] (of class 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
>      at 
> org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
>      at 
> org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
>      at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>      at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>      at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:193)
>      at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
>      at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>      at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>      at org.apache.spark.scheduler.Task.run(Task.scala:99)
>      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
>      at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>      at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>      at java.lang.Thread.run(Thread.java:748)
>  18/07/17 07:40:38 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
>  java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
>      at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
>      at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
>      at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>      at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>      at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>      at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>      at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>      at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>      at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>      at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
>      at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
>      at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>      at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>      at org.apache.spark.scheduler.Task.run(Task.scala:99)
>      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
>      at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>      at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>      at java.lang.Thread.run(Thread.java:748)
>  
> The file is attached [^a2_m2.parquet.zip]
>  
> The following code reproduces the error:
> df = 

[jira] [Resolved] (SPARK-23998) It may be better to add @transient to field 'taskMemoryManager' in class Task, for it is only be set and used in executor side

2018-07-17 Thread eaton (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

eaton resolved SPARK-23998.
---
Resolution: Won't Do

> It may be better to add @transient to field 'taskMemoryManager' in class 
> Task, for it is only be set and used in executor side
> --
>
> Key: SPARK-23998
> URL: https://issues.apache.org/jira/browse/SPARK-23998
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: eaton
>Priority: Minor
>
> It may be better to add @transient to the field 'taskMemoryManager' in class
> Task, since it is only set and used on the executor side.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24828) Incompatible parquet formats - java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary

2018-07-17 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547274#comment-16547274
 ] 

Yuming Wang commented on SPARK-24828:
-

You can't read data with different schemas using a single schema.

> Incompatible parquet formats - java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
> -
>
> Key: SPARK-24828
> URL: https://issues.apache.org/jira/browse/SPARK-24828
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: Environment for creating the parquet file:
> IBM Watson Studio Apache Spark Service, V2.1.2
> Environment for reading the parquet file:
> java version "1.8.0_144"
> Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
> MacOSX 10.13.3 (17D47)
> Spark spark-2.1.2-bin-hadoop2.7 directly obtained from 
> http://spark.apache.org/downloads.html
>Reporter: Romeo Kienzer
>Priority: Minor
> Attachments: a2_m2.parquet.zip
>
>
> As requested by [~hyukjin.kwon] here a new issue - related issue can be found 
> here
>  
> Using the attached parquet file from one Spark installation, reading it using 
> an installation directly obtained from 
> [http://spark.apache.org/downloads.html] yields to the following exception:
>  
> 18/07/17 07:40:38 ERROR Executor: Exception in task 3.0 in stage 1.0 (TID 4)
>  scala.MatchError: [1.0,null] (of class 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
>      at 
> org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
>      at 
> org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
>      at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>      at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>      at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:193)
>      at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
>      at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>      at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>      at org.apache.spark.scheduler.Task.run(Task.scala:99)
>      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
>      at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>      at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>      at java.lang.Thread.run(Thread.java:748)
>  18/07/17 07:40:38 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
>  java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
>      at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
>      at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
>      at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>      at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>      at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>      at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>      at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>      at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>      at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>      at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
>      at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
>      at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>      at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>      at org.apache.spark.scheduler.Task.run(Task.scala:99)
>      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
>      at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>      at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>      at java.lang.Thread.run(Thread.java:748)
>  
> The file is attached [^a2_m2.parquet.zip]
>  
> The following code reproduces the error:
> df = spark.read.parquet('a2_m2.parquet')
> from pyspark.ml.evaluation import 

[jira] [Comment Edited] (SPARK-24828) Incompatible parquet formats - java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary

2018-07-17 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547264#comment-16547264
 ] 

Yuming Wang edited comment on SPARK-24828 at 7/18/18 1:27 AM:
--

The {{label}} column is sometimes {{integer}} type and sometimes {{long}} type:

{"name":"label","type":"integer","nullable":false,"metadata":{}}

{"name":"label","type":"long","nullable":true,"metadata":{}}

 
{noformat}
$ java -jar ./parquet-tools/target/parquet-tools-1.10.1-SNAPSHOT.jar meta 
file:///Users/data/a2_m2.parquet/
file: 
file:/Users/yumwang/data/a2_m2.parquet/part-0-1ff92a81-68c8-446b-a54e-a042a8fd7f1e.snappy.parquet
creator: parquet-mr version 1.8.1 (build 
4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)
extra: org.apache.spark.sql.parquet.row.metadata = 
{"type":"struct","fields":[{"name":"label","type":"integer","nullable":false,"metadata":{}},{"name":"CLASS","type":"long","nullable":true,"metadata":{}},{"name":"SENSORID","type":"string","nullable":true,"metadata":{}},{"name":"X","type":"double","nullable":true,"metadata":{}},{"name":"Y","type":"double","nullable":true,"metadata":{}},{"name":"Z","type":"double","nullable":true,"metadata":{}},{"name":"_id","type":"string","nullable":true,"metadata":{}},{"name":"_rev","type":"string","nullable":true,"metadata":{}},{"name":"features","type":{"type":"udt","class":"org.apache.spark.ml.linalg.VectorUDT","pyClass":"pyspark.ml.linalg.VectorUDT","sqlType":{"type":"struct","fields":[{"name":"type","type":"byte","nullable":false,"metadata":{}},{"name":"size","type":"integer","nullable":true,"metadata":{}},{"name":"indices","type":{"type":"array","elementType":"integer","containsNull":false},"nullable":true,"metadata":{}},{"name":"values","type":{"type":"array","elementType":"double","containsNull":false},"nullable":true,"metadata":{}}]}},"nullable":true,"metadata":{"ml_attr":{"attrs":{"numeric":[{"idx":0,"name":"X"},{"idx":1,"name":"Y"},{"idx":2,"name":"Z"}]},"num_attrs":3}}},{"name":"prediction","type":"double","nullable":true,"metadata":{}}]}

file schema: spark_schema

label: REQUIRED INT32 R:0 D:0
CLASS: OPTIONAL INT64 R:0 D:1
SENSORID: OPTIONAL BINARY O:UTF8 R:0 D:1
X: OPTIONAL DOUBLE R:0 D:1
Y: OPTIONAL DOUBLE R:0 D:1
Z: OPTIONAL DOUBLE R:0 D:1
_id: OPTIONAL BINARY O:UTF8 R:0 D:1
_rev: OPTIONAL BINARY O:UTF8 R:0 D:1
features: OPTIONAL F:4
.type: REQUIRED INT32 O:INT_8 R:0 D:1
.size: OPTIONAL INT32 R:0 D:2
.indices: OPTIONAL F:1
..list: REPEATED F:1
...element: REQUIRED INT32 R:1 D:3
.values: OPTIONAL F:1
..list: REPEATED F:1
...element: REQUIRED DOUBLE R:1 D:3
prediction: OPTIONAL DOUBLE R:0 D:1

row group 1: RC:893 TS:41499 OFFSET:4

label: INT32 SNAPPY DO:0 FPO:4 SZ:108/104/0.96 VC:893 
ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 0, max: 1, num_nulls: 0]
CLASS: INT64 SNAPPY DO:0 FPO:112 SZ:131/127/0.97 VC:893 
ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: 1, max: 2, num_nulls: 0]
SENSORID: BINARY SNAPPY DO:0 FPO:243 SZ:81/77/0.95 VC:893 
ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: , max: , 
num_nulls: 0]
X: DOUBLE SNAPPY DO:0 FPO:324 SZ:957/1045/1.09 VC:893 
ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: -0.85, max: 1.59, num_nulls: 0]
Y: DOUBLE SNAPPY DO:0 FPO:1281 SZ:957/1045/1.09 VC:893 
ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: -0.85, max: 1.59, num_nulls: 0]
Z: DOUBLE SNAPPY DO:0 FPO:2238 SZ:957/1045/1.09 VC:893 
ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: -0.85, max: 1.59, num_nulls: 0]
_id: BINARY SNAPPY DO:0 FPO:3195 SZ:7964/32248/4.05 VC:893 
ENC:BIT_PACKED,RLE,PLAIN ST:[no stats for this column]
_rev: BINARY SNAPPY DO:0 FPO:11159 SZ:2340/2503/1.07 VC:893 
ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[no stats for this column]
features:
.type: INT32 SNAPPY DO:0 FPO:13499 SZ:199/193/0.97 VC:893 
ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: 0, max: 1, num_nulls: 0]
.size: INT32 SNAPPY DO:0 FPO:13698 SZ:296/290/0.98 VC:893 
ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: 3, max: 3, num_nulls: 620]
.indices:
..list:
...element: INT32 SNAPPY DO:0 FPO:13994 SZ:269/265/0.99 VC:893 ENC:RLE,PLAIN 
ST:[num_nulls: 893, min/max not defined]
.values:
..list:
...element: DOUBLE SNAPPY DO:0 FPO:14263 SZ:1772/2406/1.36 VC:2133 
ENC:RLE,PLAIN_DICTIONARY ST:[min: -0.85, max: 1.59, num_nulls: 273]
prediction: DOUBLE SNAPPY DO:0 FPO:16035 SZ:156/151/0.97 VC:893 
ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: -0.0, max: 1.0, num_nulls: 0]
file: 
file:/Users/yumwang/data/a2_m2.parquet/part-0-50be1622-7a2c-43d2-b3ce-f0703ab8f458.snappy.parquet
creator: parquet-mr version 1.8.1 (build 
4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)
extra: org.apache.spark.sql.parquet.row.metadata = 

[jira] [Commented] (SPARK-24828) Incompatible parquet formats - java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary

2018-07-17 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547264#comment-16547264
 ] 

Yuming Wang commented on SPARK-24828:
-

I'm working on it.

> Incompatible parquet formats - java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
> -
>
> Key: SPARK-24828
> URL: https://issues.apache.org/jira/browse/SPARK-24828
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: Environment for creating the parquet file:
> IBM Watson Studio Apache Spark Service, V2.1.2
> Environment for reading the parquet file:
> java version "1.8.0_144"
> Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
> MacOSX 10.13.3 (17D47)
> Spark spark-2.1.2-bin-hadoop2.7 directly obtained from 
> http://spark.apache.org/downloads.html
>Reporter: Romeo Kienzer
>Priority: Minor
> Attachments: a2_m2.parquet.zip
>
>
> As requested by [~hyukjin.kwon], here is a new issue - the related issue can be
> found here
>  
> Reading the attached parquet file, which was created with one Spark installation,
> using an installation obtained directly from
> [http://spark.apache.org/downloads.html] yields the following exception:
>  
> 18/07/17 07:40:38 ERROR Executor: Exception in task 3.0 in stage 1.0 (TID 4)
>  scala.MatchError: [1.0,null] (of class 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
>      at 
> org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
>      at 
> org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
>      at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>      at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>      at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:193)
>      at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
>      at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>      at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>      at org.apache.spark.scheduler.Task.run(Task.scala:99)
>      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
>      at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>      at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>      at java.lang.Thread.run(Thread.java:748)
>  18/07/17 07:40:38 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
>  java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
>      at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
>      at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
>      at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>      at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>      at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>      at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>      at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>      at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>      at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>      at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
>      at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
>      at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>      at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>      at org.apache.spark.scheduler.Task.run(Task.scala:99)
>      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
>      at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>      at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>      at java.lang.Thread.run(Thread.java:748)
>  
> The file is attached [^a2_m2.parquet.zip]
>  
> The following code reproduces the error:
> df = spark.read.parquet('a2_m2.parquet')
> from pyspark.ml.evaluation import MulticlassClassificationEvaluator
> binEval = 

[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-17 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547263#comment-16547263
 ] 

Saisai Shao commented on SPARK-24615:
-

Sure, I will also add it, since Xiangrui raised the same concern.

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are predominant
> compared to CPUs. To make the current Spark architecture work with accelerator
> cards, Spark itself should understand that accelerators exist and know how to
> schedule tasks onto executors that are equipped with them.
> Spark's current scheduler schedules tasks based on data locality plus the
> availability of CPUs. This introduces some problems when scheduling tasks that
> require accelerators:
>  # A node usually has more CPU cores than accelerators, so using CPU cores to
> schedule accelerator-required tasks introduces a mismatch.
>  # In a cluster, we can always assume that every node has CPUs, but that is not
> true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator-required or not) requires
> the scheduler to schedule tasks in a smart way.
> So the proposal here is to improve the current scheduler to support heterogeneous
> tasks (accelerator-required or not). This can be part of the work on Project
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24402) Optimize `In` expression when only one element in the collection or collection is empty

2018-07-17 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-24402.
-
Resolution: Resolved

> Optimize `In` expression when only one element in the collection or 
> collection is empty 
> 
>
> Key: SPARK-24402
> URL: https://issues.apache.org/jira/browse/SPARK-24402
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>
> Two new rules are added to the logical plan optimizer.
> # When there is only one element in the *{{Collection}}*, the physical plan
> will be optimized to *{{EqualTo}}*, so predicate pushdown can be used.
> {code}
>  profileDF.filter( $"profileID".isInCollection(Set(6))).explain(true)
>  """
> |== Physical Plan ==|
> |*(1) Project [profileID#0]|
> |+- *(1) Filter (isnotnull(profileID#0) && (profileID#0 = 6))|
> |+- *(1) FileScan parquet [profileID#0] Batched: true, Format: Parquet,|
> |PartitionFilters: [],|
> |PushedFilters: [IsNotNull(profileID), EqualTo(profileID,6)],|
> |ReadSchema: struct
>  """.stripMargin
> {code}
> # When the *{{Set}}* is empty, and the input is nullable, the logical
> plan will be simplified to
> {code}
>  profileDF.filter( $"profileID".isInCollection(Set())).explain(true)
>  """
> |== Optimized Logical Plan ==|
> |Filter if (isnull(profileID#0)) null else false|
> |+- Relation[profileID#0] parquet
>  """.stripMargin
> {code}
> TODO:
>  # For multiple conditions with fewer elements than a certain threshold,
> we should still allow predicate pushdown.
>  # Optimize `In` using tableswitch or lookupswitch when the number of
> categories is low and they are `Int` or `Long` (see the sketch after this list).
>  # The default immutable hash trie set is slow to query, and we should
> benchmark different set implementations for faster lookups.
>  # `filter(if (condition) null else false)` can be optimized to false.
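> A rough illustration of the tableswitch idea in point 2 (a plain Scala sketch, 
> assuming the membership test is over a handful of `Int` constants; this is not 
> the optimizer change itself): a pattern match on small integer literals compiles 
> to a single tableswitch/lookupswitch instruction, which is the kind of code a 
> low-cardinality `In` could target.
> {code}
> import scala.annotation.switch
> 
> // Membership test over a small, fixed set of Int categories. The @switch
> // annotation asks scalac to warn if this does not compile to a JVM switch.
> def inCategories(x: Int): Boolean = (x: @switch) match {
>   case 1 | 3 | 7 | 9 => true
>   case _             => false
> }
> 
> inCategories(3) // true
> inCategories(4) // false
> {code}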



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure

2018-07-17 Thread shane knapp (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp reassigned SPARK-24825:
---

Assignee: Matt Cheah

> [K8S][TEST] Kubernetes integration tests don't trace the maven project 
> dependency structure
> ---
>
> Key: SPARK-24825
> URL: https://issues.apache.org/jira/browse/SPARK-24825
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Assignee: Matt Cheah
>Priority: Critical
>
> The Kubernetes integration tests currently fail if a Maven install is not 
> performed first, because the integration-test build believes it should pull 
> the Spark parent artifact from Maven Central. This is incorrect: the 
> integration test should build the Spark parent pom as a dependency in the 
> multi-module build and simply use the dynamically built artifact. Put another 
> way, the integration-test builds should never pull Spark dependencies from 
> Maven Central.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24839) Incorrect drop of lit() column results in cross join

2018-07-17 Thread marios iliofotou (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

marios iliofotou updated SPARK-24839:
-
Description: 
The problem shows up when joining on a column that has a constant (lit) value. As 
seen from the exception in the logical plan, the literal column gets dropped, 
which results in joining the two DataFrames on a column that does not exist 
(which then correctly results in a Cartesian join).

 
{code:java}
scala> val df1 = spark.createDataFrame(Seq((1, 2), (2, 4))).withColumn("index", 
lit("a"))

scala> df1.show
+---+---+-+
| _1| _2|index|
+---+---+-+
|  1|  2|a|
|  2|  4|a|
+---+---+-+
 
scala> val df2 = spark.createDataFrame(Seq(("a", 1),("b", 2))).toDF("index", 
"someval")

scala> df2.show()
+-+---+
|index|someval|
+-+---+
|a|  1|
|b|  2|
+-+---+

scala> df1.join(df2).show()
 org.apache.spark.sql.AnalysisException: Detected implicit cartesian product 
for INNER join between logical plans
 LocalRelation [_1#370, _2#371]
 and
 LocalRelation [index#335, someval#336]
 Join condition is missing or trivial.
 Either: use the CROSS JOIN syntax to allow cartesian products between these
 relations, or: enable implicit cartesian products by setting the configuration
 variable spark.sql.crossJoin.enabled=true;
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1124)
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1121)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
 at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
 at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1121)
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1103)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
 at 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
 at 
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
 at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
 at scala.collection.immutable.List.foreach(List.scala:392)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
 at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:66)
 at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:66)
 at 
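
// A possible workaround (sketch, assuming the intent is an equi-join on the shared
// "index" column): give the join an explicit condition, so the lit("a") column is
// actually used as a key and the implicit-cross-join check no longer fires.
scala> df1.join(df2, Seq("index")).show()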

[jira] [Issue Comment Deleted] (SPARK-24839) Incorrect drop of lit() column results in cross join

2018-07-17 Thread marios iliofotou (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

marios iliofotou updated SPARK-24839:
-
Comment: was deleted

(was: This might be the closest related issue.)

> Incorrect drop of lit() column results in cross join
> 
>
> Key: SPARK-24839
> URL: https://issues.apache.org/jira/browse/SPARK-24839
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.3.1
>Reporter: marios iliofotou
>Priority: Major
>
> The problem shows up when joining on a column that has a constant value. As seen 
> from the exception in the logical plan, the literal column gets dropped, 
> which results in joining the two DataFrames on a column that does not exist, 
> which correctly results in a Cartesian join.
>  
> {code:java}
> scala> val df1 = spark.createDataFrame(Seq((1, 2), (2, 
> 4))).withColumn("index", lit("a"))
> scala> df1.show
> +---+---+-+
> | _1| _2|index|
> +---+---+-+
> |  1|  2|a|
> |  2|  4|a|
> +---+---+-+
>  
> scala> val df2 = spark.createDataFrame(Seq(("a", 1),("b", 2))).toDF("index", 
> "someval")
> scala> df2.show()
> +-+---+
> |index|someval|
> +-+---+
> |a|  1|
> |b|  2|
> +-+---+
> scala> df1.join(df2).show()
>  org.apache.spark.sql.AnalysisException: Detected implicit cartesian product 
> for INNER join between logical plans
>  LocalRelation [_1#370, _2#371]
>  and
>  LocalRelation [index#335, someval#336]
>  Join condition is missing or trivial.
>  Either: use the CROSS JOIN syntax to allow cartesian products between these
>  relations, or: enable implicit cartesian products by setting the 
> configuration
>  variable spark.sql.crossJoin.enabled=true;
>  at 
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1124)
>  at 
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1121)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1121)
>  at 
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1103)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
>  at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>  at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
>  at 

[jira] [Commented] (SPARK-24839) Incorrect drop of lit() column results in cross join

2018-07-17 Thread marios iliofotou (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547246#comment-16547246
 ] 

marios iliofotou commented on SPARK-24839:
--

This might be the closest related issue.

> Incorrect drop of lit() column results in cross join
> 
>
> Key: SPARK-24839
> URL: https://issues.apache.org/jira/browse/SPARK-24839
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.3.1
>Reporter: marios iliofotou
>Priority: Major
>
> The problem shows up when joining on a column that has a constant value. As seen 
> from the exception in the logical plan, the literal column gets dropped, 
> which results in joining the two DataFrames on a column that does not exist, 
> which correctly results in a Cartesian join.
>  
> {code:java}
> scala> val df1 = spark.createDataFrame(Seq((1, 2), (2, 
> 4))).withColumn("index", lit("a"))
> scala> df1.show
> +---+---+-+
> | _1| _2|index|
> +---+---+-+
> |  1|  2|a|
> |  2|  4|a|
> +---+---+-+
>  
> scala> val df2 = spark.createDataFrame(Seq(("a", 1),("b", 2))).toDF("index", 
> "someval")
> scala> df2.show()
> +-+---+
> |index|someval|
> +-+---+
> |a|  1|
> |b|  2|
> +-+---+
> scala> df1.join(df2).show()
>  org.apache.spark.sql.AnalysisException: Detected implicit cartesian product 
> for INNER join between logical plans
>  LocalRelation [_1#370, _2#371]
>  and
>  LocalRelation [index#335, someval#336]
>  Join condition is missing or trivial.
>  Either: use the CROSS JOIN syntax to allow cartesian products between these
>  relations, or: enable implicit cartesian products by setting the 
> configuration
>  variable spark.sql.crossJoin.enabled=true;
>  at 
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1124)
>  at 
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1121)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1121)
>  at 
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1103)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
>  at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>  at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
>  at 

[jira] [Updated] (SPARK-24839) Incorrect drop of lit() column results in cross join

2018-07-17 Thread marios iliofotou (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

marios iliofotou updated SPARK-24839:
-
Description: 
The problem shows up when joining on a column that has a constant value. As seen 
from the exception in the logical plan, the literal column gets dropped, which 
results in joining the two DataFrames on a column that does not exist, which 
correctly results in a Cartesian join.

 
{code:java}
scala> val df1 = spark.createDataFrame(Seq((1, 2), (2, 4))).withColumn("index", 
lit("a"))

scala> df1.show
+---+---+-+
| _1| _2|index|
+---+---+-+
|  1|  2|a|
|  2|  4|a|
+---+---+-+
 
scala> val df2 = spark.createDataFrame(Seq(("a", 1),("b", 2))).toDF("index", 
"someval")

scala> df2.show()
+-+---+
|index|someval|
+-+---+
|a|  1|
|b|  2|
+-+---+

scala> df1.join(df2).show()
 org.apache.spark.sql.AnalysisException: Detected implicit cartesian product 
for INNER join between logical plans
 LocalRelation [_1#370, _2#371]
 and
 LocalRelation [index#335, someval#336]
 Join condition is missing or trivial.
 Either: use the CROSS JOIN syntax to allow cartesian products between these
 relations, or: enable implicit cartesian products by setting the configuration
 variable spark.sql.crossJoin.enabled=true;
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1124)
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1121)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
 at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
 at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1121)
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1103)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
 at 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
 at 
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
 at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
 at scala.collection.immutable.List.foreach(List.scala:392)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
 at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:66)
 at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:66)
 at 

[jira] [Updated] (SPARK-24839) Incorrect drop of lit() column results in cross join

2018-07-17 Thread marios iliofotou (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

marios iliofotou updated SPARK-24839:
-
Description: 
The problem shows up when joining on a column that has a constant value. As seen 
from the exception in the logical plan, the literal column gets dropped, which 
results in joining the two DataFrames on a column that does not exist, which 
correctly results in a Cartesian join.

 
{code:java}
scala> val df1 = spark.createDataFrame(Seq((1, 2), (2, 4))).withColumn("index", 
lit("a"))

scala> df1.show
+---+---+-+
| _1| _2|index|
+---+---+-+
|  1|  2|a|
|  2|  4|a|
+---+---+-+
 
scala> val df2 = spark.createDataFrame(Seq(("a", 1),("b", 2))).toDF("index", 
"someval")

scala> df2.show()
+-+---+
|index|someval|
+-+---+
|a|  1|
|b|  2|
+-+---+

scala> df1.join(df2).show()
 org.apache.spark.sql.AnalysisException: Detected implicit cartesian product 
for INNER join between logical plans
 LocalRelation [_1#370, _2#371]
 and
 LocalRelation [index#335, someval#336]
 Join condition is missing or trivial.
 Either: use the CROSS JOIN syntax to allow cartesian products between these
 relations, or: enable implicit cartesian products by setting the configuration
 variable spark.sql.crossJoin.enabled=true;
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1124)
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1121)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
 at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
 at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1121)
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1103)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
 at 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
 at 
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
 at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
 at scala.collection.immutable.List.foreach(List.scala:392)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
 at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:66)
 at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:66)
 at 

[jira] [Updated] (SPARK-24839) Incorrect drop of lit() column results in cross join

2018-07-17 Thread marios iliofotou (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

marios iliofotou updated SPARK-24839:
-
Description: 
The problem shows up when joining on a column that has a constant value. As seen 
from the exception in the logical plan, the literal column gets dropped, which 
results in joining the two DataFrames on a column that does not exist, which 
correctly results in a Cartesian join.

 
{code:java}
scala> val df1 = spark.createDataFrame(Seq((1, 2), (2, 4))).withColumn("index", 
lit("a"))

scala> df1.show
+---+---+-+
| _1| _2|index|
+---+---+-+
|  1|  2|a|
|  2|  4|a|
+---+---+-+
 
scala> val df2 = spark.createDataFrame(Seq(("a", 1),("b", 2))).toDF("index", 
"someval")

scala> df2.show()
+-+---+
|index|someval|
+-+---+
|a|  1|
|b|  2|
+-+---+

scala> df1.join(df2).show()
 org.apache.spark.sql.AnalysisException: Detected implicit cartesian product 
for INNER join between logical plans
 LocalRelation [_1#370, _2#371]
 and
 LocalRelation [index#335, someval#336]
 Join condition is missing or trivial.
 Either: use the CROSS JOIN syntax to allow cartesian products between these
 relations, or: enable implicit cartesian products by setting the configuration
 variable spark.sql.crossJoin.enabled=true;
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1124)
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1121)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
 at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
 at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1121)
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1103)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
 at 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
 at 
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
 at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
 at scala.collection.immutable.List.foreach(List.scala:392)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
 at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:66)
 at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:66)
 at 

[jira] [Assigned] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure

2018-07-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24825:


Assignee: Apache Spark

> [K8S][TEST] Kubernetes integration tests don't trace the maven project 
> dependency structure
> ---
>
> Key: SPARK-24825
> URL: https://issues.apache.org/jira/browse/SPARK-24825
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Assignee: Apache Spark
>Priority: Critical
>
> The Kubernetes integration tests currently fail if a Maven install is not 
> performed first, because the integration-test build believes it should pull 
> the Spark parent artifact from Maven Central. This is incorrect: the 
> integration test should build the Spark parent pom as a dependency in the 
> multi-module build and simply use the dynamically built artifact. Put another 
> way, the integration-test builds should never pull Spark dependencies from 
> Maven Central.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure

2018-07-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547238#comment-16547238
 ] 

Apache Spark commented on SPARK-24825:
--

User 'mccheah' has created a pull request for this issue:
https://github.com/apache/spark/pull/21800

> [K8S][TEST] Kubernetes integration tests don't trace the maven project 
> dependency structure
> ---
>
> Key: SPARK-24825
> URL: https://issues.apache.org/jira/browse/SPARK-24825
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Critical
>
> The Kubernetes integration tests currently fail if a Maven install is not 
> performed first, because the integration-test build believes it should pull 
> the Spark parent artifact from Maven Central. This is incorrect: the 
> integration test should build the Spark parent pom as a dependency in the 
> multi-module build and simply use the dynamically built artifact. Put another 
> way, the integration-test builds should never pull Spark dependencies from 
> Maven Central.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure

2018-07-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24825:


Assignee: (was: Apache Spark)

> [K8S][TEST] Kubernetes integration tests don't trace the maven project 
> dependency structure
> ---
>
> Key: SPARK-24825
> URL: https://issues.apache.org/jira/browse/SPARK-24825
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Critical
>
> The Kubernetes integration tests currently fail if a Maven install is not 
> performed first, because the integration-test build believes it should pull 
> the Spark parent artifact from Maven Central. This is incorrect: the 
> integration test should build the Spark parent pom as a dependency in the 
> multi-module build and simply use the dynamically built artifact. Put another 
> way, the integration-test builds should never pull Spark dependencies from 
> Maven Central.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24839) Incorrect drop of lit() column results in cross join

2018-07-17 Thread marios iliofotou (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

marios iliofotou updated SPARK-24839:
-
Description: 
The problem shows up when joining on a column that has a constant value. As seen 
from the exception in the logical plan, the literal column gets dropped, which 
results in joining the two DataFrames on a column that does not exist, which 
correctly results in a Cartesian join.

 
{code:java}
scala> val df1 = spark.createDataFrame(Seq((1, 2), (2, 4))).withColumn("index", 
lit("a"))

scala> df1.show
+---+---+-+
| _1| _2|index|
+---+---+-+
|  1|  2|a|
|  2|  4|a|
+---+---+-+
 
scala> val df2 = spark.createDataFrame(Seq(("a", 1),("b", 2))).toDF("index", 
"someval")

scala> df2.show()
+-+---+
|index|someval|
+-+---+
|a|  1|
|b|  2|
+-+---+

scala> df1.join(df2).show()
 org.apache.spark.sql.AnalysisException: Detected implicit cartesian product 
for INNER join between logical plans
 LocalRelation [_1#370, _2#371]
 and
 LocalRelation [index#335, someval#336]
 Join condition is missing or trivial.
 Either: use the CROSS JOIN syntax to allow cartesian products between these
 relations, or: enable implicit cartesian products by setting the configuration
 variable spark.sql.crossJoin.enabled=true;
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1124)
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1121)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
 at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
 at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1121)
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1103)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
 at 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
 at 
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
 at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
 at scala.collection.immutable.List.foreach(List.scala:392)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
 at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:66)
 at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:66)
 at 

[jira] [Updated] (SPARK-24839) Incorrect drop of lit() column results in cross join

2018-07-17 Thread marios iliofotou (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

marios iliofotou updated SPARK-24839:
-
Description: 
The problem shows up when joining on a column that has a constant value. As seen 
from the exception in the logical plan, the literal column gets dropped, which 
results in joining the two DataFrames on a column that does not exist, which 
correctly results in a Cartesian join.

 
{code:java}
scala> val df1 = spark.createDataFrame(Seq((1, 2), (2, 4))).withColumn("index", 
lit("a"))

scala> df1.show
+---+---+-+
| _1| _2|index|
+---+---+-+
|  1|  2|a|
|  2|  4|a|
+---+---+-+
 
scala> val df2 = spark.createDataFrame(Seq(("a", 1),("b", 2))).toDF("index", 
"someval")

scala> df2.show()
+-+---+
|index|someval|
+-+---+
|a|  1|
|b|  2|
+-+---+

scala> df1.join(df2).show()
 org.apache.spark.sql.AnalysisException: Detected implicit cartesian product 
for INNER join between logical plans
 LocalRelation [_1#370, _2#371]
 and
 LocalRelation [index#335, someval#336]
 Join condition is missing or trivial.
 Either: use the CROSS JOIN syntax to allow cartesian products between these
 relations, or: enable implicit cartesian products by setting the configuration
 variable spark.sql.crossJoin.enabled=true;
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1124)
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1121)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
 at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
 at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1121)
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1103)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
 at 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
 at 
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
 at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
 at scala.collection.immutable.List.foreach(List.scala:392)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
 at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:66)
 at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:66)
 at 

[jira] [Updated] (SPARK-24839) Incorrect drop of lit() column results in cross join

2018-07-17 Thread marios iliofotou (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

marios iliofotou updated SPARK-24839:
-
Description: 
The problem shows up when joining on a column that has a constant value. As seen 
from the exception in the logical plan, the literal column gets dropped, which 
results in joining the two DataFrames on a column that does not exist, which 
correctly results in a Cartesian join.

 
{code:java}
scala> val df1 = spark.createDataFrame(Seq((1, 2), (2, 4))).withColumn("index", 
lit("a"))

scala> df1.show
+---+---+-+
| _1| _2|index|
+---+---+-+
|  1|  2|a|
|  2|  4|a|
+---+---+-+
 
scala> val df2 = spark.createDataFrame(Seq(("a", 1),("b", 2))).toDF("index", 
"someval")

scala> df2.show()
+-+---+
|index|someval|
+-+---+
|a|  1|
|b|  2|
+-+---+

scala> df1.join(df2).show()
 org.apache.spark.sql.AnalysisException: Detected implicit cartesian product 
for INNER join between logical plans
 LocalRelation [_1#370, _2#371]
 and
 LocalRelation [index#335, someval#336]
 Join condition is missing or trivial.
 Either: use the CROSS JOIN syntax to allow cartesian products between these
 relations, or: enable implicit cartesian products by setting the configuration
 variable spark.sql.crossJoin.enabled=true;
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1124)
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1121)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
 at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
 at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1121)
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1103)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
 at 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
 at 
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
 at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
 at scala.collection.immutable.List.foreach(List.scala:392)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
 at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:66)
 at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:66)
 at 

[jira] [Updated] (SPARK-24839) Incorrect drop of lit() column results in cross join

2018-07-17 Thread marios iliofotou (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

marios iliofotou updated SPARK-24839:
-
Summary: Incorrect drop of lit() column results in cross join  (was: 
Incorrect drop of lit column results in cross join. )

> Incorrect drop of lit() column results in cross join
> 
>
> Key: SPARK-24839
> URL: https://issues.apache.org/jira/browse/SPARK-24839
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.3.1
>Reporter: marios iliofotou
>Priority: Major
>
> {code}
> scala> val df1 = spark.createDataFrame(Seq((1, 2), (2, 
> 4))).withColumn("index", lit("a"))
> scala> df1.show
> +---+---+-+
> | _1| _2|index|
> +---+---+-+
> |  1|  2|a|
> |  2|  4|a|
> +---+---+-+
>  
> scala> val df2 = spark.createDataFrame(Seq(("a", 1),("b", 2))).toDF("index", 
> "someval")
> scala> df2.show()
> +-+---+
> |index|someval|
> +-+---+
> |a|  1|
> |b|  2|
> +-+---+
> scala> df1.join(df2).show()
>  org.apache.spark.sql.AnalysisException: Detected implicit cartesian product 
> for INNER join between logical plans
>  LocalRelation [_1#370, _2#371]
>  and
>  LocalRelation [index#335, someval#336]
>  Join condition is missing or trivial.
>  Either: use the CROSS JOIN syntax to allow cartesian products between these
>  relations, or: enable implicit cartesian products by setting the 
> configuration
>  variable spark.sql.crossJoin.enabled=true;
>  at 
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1124)
>  at 
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1121)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1121)
>  at 
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1103)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
>  at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>  at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
>  at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
>  at 
> 

[jira] [Updated] (SPARK-24839) Incorrect drop of lit column results in cross join.

2018-07-17 Thread marios iliofotou (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

marios iliofotou updated SPARK-24839:
-
Description: 
{code}
scala> val df1 = spark.createDataFrame(Seq((1, 2), (2, 4))).withColumn("index", 
lit("a"))

scala> df1.show
+---+---+-----+
| _1| _2|index|
+---+---+-----+
|  1|  2|    a|
|  2|  4|    a|
+---+---+-----+
 
scala> val df2 = spark.createDataFrame(Seq(("a", 1),("b", 2))).toDF("index", 
"someval")

scala> df2.show()
+-----+-------+
|index|someval|
+-----+-------+
|    a|      1|
|    b|      2|
+-----+-------+

scala> df1.join(df2).show()
 org.apache.spark.sql.AnalysisException: Detected implicit cartesian product 
for INNER join between logical plans
 LocalRelation [_1#370, _2#371]
 and
 LocalRelation [index#335, someval#336]
 Join condition is missing or trivial.
 Either: use the CROSS JOIN syntax to allow cartesian products between these
 relations, or: enable implicit cartesian products by setting the configuration
 variable spark.sql.crossJoin.enabled=true;
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1124)
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1121)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
 at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
 at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1121)
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1103)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
 at 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
 at 
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
 at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
 at scala.collection.immutable.List.foreach(List.scala:392)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
 at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:66)
 at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:66)
 at 
org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
 at 
org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
 at 
org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
 at 

[jira] [Updated] (SPARK-24839) Incorrect drop of lit column results in cross join.

2018-07-17 Thread marios iliofotou (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

marios iliofotou updated SPARK-24839:
-
Description: 
{code:scala}
scala> val df1 = spark.createDataFrame(Seq((1, 2), (2, 4))).withColumn("index", 
org.apache.spark.sql.functions.lit("a"))

scala> df1.show
+---+---+-----+
| _1| _2|index|
+---+---+-----+
|  1|  2|    a|
|  2|  4|    a|
+---+---+-----+
 
scala> val df2 = spark.createDataFrame(Seq(("a", 1),("b", 2))).toDF("index", 
"someval")

scala> df2.show()
+-----+-------+
|index|someval|
+-----+-------+
|    a|      1|
|    b|      2|
+-----+-------+

scala> df1.join(df2).show()
 org.apache.spark.sql.AnalysisException: Detected implicit cartesian product 
for INNER join between logical plans
 LocalRelation [_1#370, _2#371]
 and
 LocalRelation [index#335, someval#336]
 Join condition is missing or trivial.
 Either: use the CROSS JOIN syntax to allow cartesian products between these
 relations, or: enable implicit cartesian products by setting the configuration
 variable spark.sql.crossJoin.enabled=true;
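// A minimal sketch of possible fixes, assuming the intent was an equi-join on the
// shared "index" column; these commented calls are illustrative and not part of the
// original report:
//   df1.join(df2, Seq("index")).show()                      // give the join an explicit key
//   df1.crossJoin(df2).show()                               // or request the cartesian product explicitly
//   spark.conf.set("spark.sql.crossJoin.enabled", "true")   // or allow implicit cross joins globally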
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1124)
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1121)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
 at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
 at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1121)
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1103)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
 at 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
 at 
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
 at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
 at scala.collection.immutable.List.foreach(List.scala:392)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
 at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:66)
 at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:66)
 at 
org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
 at 
org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
 at 
org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
 at 

[jira] [Created] (SPARK-24839) Incorrect drop of lit column results in cross join.

2018-07-17 Thread marios iliofotou (JIRA)
marios iliofotou created SPARK-24839:


 Summary: Incorrect drop of lit column results in cross join. 
 Key: SPARK-24839
 URL: https://issues.apache.org/jira/browse/SPARK-24839
 Project: Spark
  Issue Type: Bug
  Components: Optimizer
Affects Versions: 2.3.1
Reporter: marios iliofotou


```
 scala> val df1 = spark.createDataFrame(Seq((1, 2), (2, 
4))).withColumn("index", org.apache.spark.sql.functions.lit("a"))

scala> df1.show
+---+---+-----+
| _1| _2|index|
+---+---+-----+
|  1|  2|    a|
|  2|  4|    a|
+---+---+-----+

scala> val df2 = spark.createDataFrame(Seq(("a", 1),("b", 2))).toDF("index", 
"someval")

scala> df2.show()
+-----+-------+
|index|someval|
+-----+-------+
|    a|      1|
|    b|      2|
+-----+-------+

scala> df1.join(df2).show()
org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for 
INNER join between logical plans
LocalRelation [_1#370, _2#371]
and
LocalRelation [index#335, someval#336]
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1124)
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1121)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
 at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
 at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1121)
 at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1103)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
 at 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
 at 
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
 at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
 at scala.collection.immutable.List.foreach(List.scala:392)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
 at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:66)
 at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:66)
 at 
org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
 at 
org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
 at 

[jira] [Comment Edited] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure

2018-07-17 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547225#comment-16547225
 ] 

Stavros Kontopoulos edited comment on SPARK-24825 at 7/17/18 11:43 PM:
---

Yes [~srowen] is right, this seems to work: 

dev-run-integration-tests.sh

cd ../../../

./build/mvn -T 1C -Pscala-2.11 -Pkubernetes -pl 
resource-managers/kubernetes/integration-tests -amd integration-test 
${properties[@]}


was (Author: skonto):
Yes [~srowen] is right, this seems to work: 

./build/mvn -T 1C -Pscala-2.11 -Pkubernetes -pl 
resource-managers/kubernetes/integration-tests -amd integration-test 
${properties[@]}

> [K8S][TEST] Kubernetes integration tests don't trace the maven project 
> dependency structure
> ---
>
> Key: SPARK-24825
> URL: https://issues.apache.org/jira/browse/SPARK-24825
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Critical
>
> The Kubernetes integration tests will currently fail if maven installation is 
> not performed first, because the integration test build believes it should be 
> pulling the Spark parent artifact from maven central. However, this is 
> incorrect because the integration test should be building the Spark parent 
> pom as a dependency in the multi-module build, and the integration test 
> should just use the dynamically built artifact. Or to put it another way, the 
> integration test builds should never be pulling Spark dependencies from maven 
> central.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure

2018-07-17 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547225#comment-16547225
 ] 

Stavros Kontopoulos edited comment on SPARK-24825 at 7/17/18 11:43 PM:
---

Yes [~srowen] is right, this seems to work: 

dev-run-integration-tests.sh:

cd ../../../

./build/mvn -T 1C -Pscala-2.11 -Pkubernetes -pl 
resource-managers/kubernetes/integration-tests -amd integration-test 
${properties[@]}


was (Author: skonto):
Yes [~srowen] is right, this seems to work: 

dev-run-integration-tests.sh

cd ../../../

./build/mvn -T 1C -Pscala-2.11 -Pkubernetes -pl 
resource-managers/kubernetes/integration-tests -amd integration-test 
${properties[@]}

> [K8S][TEST] Kubernetes integration tests don't trace the maven project 
> dependency structure
> ---
>
> Key: SPARK-24825
> URL: https://issues.apache.org/jira/browse/SPARK-24825
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Critical
>
> The Kubernetes integration tests will currently fail if maven installation is 
> not performed first, because the integration test build believes it should be 
> pulling the Spark parent artifact from maven central. However, this is 
> incorrect because the integration test should be building the Spark parent 
> pom as a dependency in the multi-module build, and the integration test 
> should just use the dynamically built artifact. Or to put it another way, the 
> integration test builds should never be pulling Spark dependencies from maven 
> central.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure

2018-07-17 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547225#comment-16547225
 ] 

Stavros Kontopoulos edited comment on SPARK-24825 at 7/17/18 11:38 PM:
---

Yes [~srowen] is right, this seems to work: 

./build/mvn -T 1C -Pscala-2.11 -Pkubernetes -pl 
resource-managers/kubernetes/integration-tests -amd integration-test 
${properties[@]}


was (Author: skonto):
Yes this seems to work: 

./build/mvn -T 1C -Pscala-2.11 -Pkubernetes -pl 
resource-managers/kubernetes/integration-tests -amd integration-test 
${properties[@]}

> [K8S][TEST] Kubernetes integration tests don't trace the maven project 
> dependency structure
> ---
>
> Key: SPARK-24825
> URL: https://issues.apache.org/jira/browse/SPARK-24825
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Critical
>
> The Kubernetes integration tests will currently fail if maven installation is 
> not performed first, because the integration test build believes it should be 
> pulling the Spark parent artifact from maven central. However, this is 
> incorrect because the integration test should be building the Spark parent 
> pom as a dependency in the multi-module build, and the integration test 
> should just use the dynamically built artifact. Or to put it another way, the 
> integration test builds should never be pulling Spark dependencies from maven 
> central.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure

2018-07-17 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547225#comment-16547225
 ] 

Stavros Kontopoulos commented on SPARK-24825:
-

Yes this seems to work: 

./build/mvn -T 1C -Pscala-2.11 -Pkubernetes -pl 
resource-managers/kubernetes/integration-tests -amd integration-test 
${properties[@]}

> [K8S][TEST] Kubernetes integration tests don't trace the maven project 
> dependency structure
> ---
>
> Key: SPARK-24825
> URL: https://issues.apache.org/jira/browse/SPARK-24825
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Critical
>
> The Kubernetes integration tests will currently fail if maven installation is 
> not performed first, because the integration test build believes it should be 
> pulling the Spark parent artifact from maven central. However, this is 
> incorrect because the integration test should be building the Spark parent 
> pom as a dependency in the multi-module build, and the integration test 
> should just use the dynamically built artifact. Or to put it another way, the 
> integration test builds should never be pulling Spark dependencies from maven 
> central.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24747) Make spark.ml.util.Instrumentation class more flexible

2018-07-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547200#comment-16547200
 ] 

Apache Spark commented on SPARK-24747:
--

User 'MrBago' has created a pull request for this issue:
https://github.com/apache/spark/pull/21799

> Make spark.ml.util.Instrumentation class more flexible
> --
>
> Key: SPARK-24747
> URL: https://issues.apache.org/jira/browse/SPARK-24747
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Bago Amirbekian
>Assignee: Bago Amirbekian
>Priority: Major
> Fix For: 2.4.0
>
>
> The Instrumentation class (which is an internal private class) is somewhat 
> limited by its current APIs. The class requires that an estimator and dataset be 
> passed to the constructor, which limits how it can be used. Furthermore, the 
> current APIs make it hard to intercept failures and record anything related 
> to those failures.
> The following changes could make the instrumentation class easier to work 
> with. All these changes are for private APIs and should not be visible to 
> users.
> {code}
> // New no-argument constructor:
> Instrumentation()
> // New api to log previous constructor arguments.
> logTrainingContext(estimator: Estimator[_], dataset: Dataset[_])
> logFailure(e: Throwable): Unit
> // Log success with no arguments
> logSuccess(): Unit
> // Log result model explicitly instead of passing to logSuccess
> logModel(model: Model[_]): Unit
> // On Companion object
> Instrumentation.instrumented[T](body: (Instrumentation => T)): T
> // The above API will allow us to write instrumented methods more clearly and 
> handle logging success and failure automatically:
> def someMethod(...): T = instrumented { instr =>
>   instr.logNamedValue(name, value)
>   // more code here
>   instr.logModel(model)
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure

2018-07-17 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547189#comment-16547189
 ] 

shane knapp commented on SPARK-24825:
-

[~mcheah] is working on a patch now... 

> [K8S][TEST] Kubernetes integration tests don't trace the maven project 
> dependency structure
> ---
>
> Key: SPARK-24825
> URL: https://issues.apache.org/jira/browse/SPARK-24825
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Critical
>
> The Kubernetes integration tests will currently fail if maven installation is 
> not performed first, because the integration test build believes it should be 
> pulling the Spark parent artifact from maven central. However, this is 
> incorrect because the integration test should be building the Spark parent 
> pom as a dependency in the multi-module build, and the integration test 
> should just use the dynamically built artifact. Or to put it another way, the 
> integration test builds should never be pulling Spark dependencies from maven 
> central.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure

2018-07-17 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-24825:

Comment: was deleted

(was: [~srowen] I played with submodules but it didn't work for me, though I 
didn't spend much time on it.)

> [K8S][TEST] Kubernetes integration tests don't trace the maven project 
> dependency structure
> ---
>
> Key: SPARK-24825
> URL: https://issues.apache.org/jira/browse/SPARK-24825
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Critical
>
> The Kubernetes integration tests will currently fail if maven installation is 
> not performed first, because the integration test build believes it should be 
> pulling the Spark parent artifact from maven central. However, this is 
> incorrect because the integration test should be building the Spark parent 
> pom as a dependency in the multi-module build, and the integration test 
> should just use the dynamically built artifact. Or to put it another way, the 
> integration test builds should never be pulling Spark dependencies from maven 
> central.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure

2018-07-17 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547186#comment-16547186
 ] 

Stavros Kontopoulos commented on SPARK-24825:
-

[~srowen] I played with submodules but it didn't work for me, though I didn't 
spend much time on it.

> [K8S][TEST] Kubernetes integration tests don't trace the maven project 
> dependency structure
> ---
>
> Key: SPARK-24825
> URL: https://issues.apache.org/jira/browse/SPARK-24825
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Critical
>
> The Kubernetes integration tests will currently fail if maven installation is 
> not performed first, because the integration test build believes it should be 
> pulling the Spark parent artifact from maven central. However, this is 
> incorrect because the integration test should be building the Spark parent 
> pom as a dependency in the multi-module build, and the integration test 
> should just use the dynamically built artifact. Or to put it another way, the 
> integration test builds should never be pulling Spark dependencies from maven 
> central.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure

2018-07-17 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547142#comment-16547142
 ] 

Sean Owen commented on SPARK-24825:
---

To build and test only a child module, you can't just run tests in the child's 
dir. You run something like "mvn ... -pl child/pom.xml" from the parent. Is 
that the issue?

> [K8S][TEST] Kubernetes integration tests don't trace the maven project 
> dependency structure
> ---
>
> Key: SPARK-24825
> URL: https://issues.apache.org/jira/browse/SPARK-24825
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Critical
>
> The Kubernetes integration tests will currently fail if maven installation is 
> not performed first, because the integration test build believes it should be 
> pulling the Spark parent artifact from maven central. However, this is 
> incorrect because the integration test should be building the Spark parent 
> pom as a dependency in the multi-module build, and the integration test 
> should just use the dynamically built artifact. Or to put it another way, the 
> integration test builds should never be pulling Spark dependencies from maven 
> central.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure

2018-07-17 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547141#comment-16547141
 ] 

shane knapp commented on SPARK-24825:
-

[~vanzin] [~joshrosen] any thoughts?

> [K8S][TEST] Kubernetes integration tests don't trace the maven project 
> dependency structure
> ---
>
> Key: SPARK-24825
> URL: https://issues.apache.org/jira/browse/SPARK-24825
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Critical
>
> The Kubernetes integration tests will currently fail if maven installation is 
> not performed first, because the integration test build believes it should be 
> pulling the Spark parent artifact from maven central. However, this is 
> incorrect because the integration test should be building the Spark parent 
> pom as a dependency in the multi-module build, and the integration test 
> should just use the dynamically built artifact. Or to put it another way, the 
> integration test builds should never be pulling Spark dependencies from maven 
> central.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure

2018-07-17 Thread Matt Cheah (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547137#comment-16547137
 ] 

Matt Cheah commented on SPARK-24825:


We're looking into this now; this particular phase was built out by myself, 
[~ssuchter], [~foxish], and [~ifilonenko]. We're also consulting with some folks 
from RiseLab - [~shaneknapp]. I think we really shouldn't have to run a maven 
install; the multi-module build should pick up the other modules properly. 
We've likely configured the maven reactor incorrectly in the integration test's 
pom.xml.

> [K8S][TEST] Kubernetes integration tests don't trace the maven project 
> dependency structure
> ---
>
> Key: SPARK-24825
> URL: https://issues.apache.org/jira/browse/SPARK-24825
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Critical
>
> The Kubernetes integration tests will currently fail if maven installation is 
> not performed first, because the integration test build believes it should be 
> pulling the Spark parent artifact from maven central. However, this is 
> incorrect because the integration test should be building the Spark parent 
> pom as a dependency in the multi-module build, and the integration test 
> should just use the dynamically built artifact. Or to put it another way, the 
> integration test builds should never be pulling Spark dependencies from maven 
> central.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure

2018-07-17 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547133#comment-16547133
 ] 

Stavros Kontopoulos edited comment on SPARK-24825 at 7/17/18 9:37 PM:
--

You could remove .m2, downloads should be fast. :) I am hacking too :P

But I suspect also timestamp of published metadata is messed up.

Can we ping people who own the process? [~srowen]?

 


was (Author: skonto):
You could remove .m2, downloads should be fast. :) I am hacking too :P

But i suspect also timestamp of published metadata are messed up.

Can we ping people who own the process? [~srowen]?

 

> [K8S][TEST] Kubernetes integration tests don't trace the maven project 
> dependency structure
> ---
>
> Key: SPARK-24825
> URL: https://issues.apache.org/jira/browse/SPARK-24825
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Critical
>
> The Kubernetes integration tests will currently fail if maven installation is 
> not performed first, because the integration test build believes it should be 
> pulling the Spark parent artifact from maven central. However, this is 
> incorrect because the integration test should be building the Spark parent 
> pom as a dependency in the multi-module build, and the integration test 
> should just use the dynamically built artifact. Or to put it another way, the 
> integration test builds should never be pulling Spark dependencies from maven 
> central.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure

2018-07-17 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547133#comment-16547133
 ] 

Stavros Kontopoulos edited comment on SPARK-24825 at 7/17/18 9:37 PM:
--

You could remove .m2, downloads should be fast. :) I am hacking too :P

But i suspect also timestamp of published metadata are messed up.

Can we ping people who own the process? [~srowen]?

 


was (Author: skonto):
You could remove .m2, downloads should be fast. :) I am hacking too :P

But i suspect also timestamps of published jars are messed up.

Can we ping people who own the process? [~srowen]?

 

> [K8S][TEST] Kubernetes integration tests don't trace the maven project 
> dependency structure
> ---
>
> Key: SPARK-24825
> URL: https://issues.apache.org/jira/browse/SPARK-24825
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Critical
>
> The Kubernetes integration tests will currently fail if maven installation is 
> not performed first, because the integration test build believes it should be 
> pulling the Spark parent artifact from maven central. However, this is 
> incorrect because the integration test should be building the Spark parent 
> pom as a dependency in the multi-module build, and the integration test 
> should just use the dynamically built artifact. Or to put it another way, the 
> integration test builds should never be pulling Spark dependencies from maven 
> central.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure

2018-07-17 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547133#comment-16547133
 ] 

Stavros Kontopoulos edited comment on SPARK-24825 at 7/17/18 9:36 PM:
--

You could remove .m2, downloads should be fast. :) I am hacking too :P

But i suspect also timestamps of published jars are messed up.

Can we ping people who own the process? [~srowen]?

 


was (Author: skonto):
You could remove .m2, downloads should be fast. :) I am hacking too :P

 

> [K8S][TEST] Kubernetes integration tests don't trace the maven project 
> dependency structure
> ---
>
> Key: SPARK-24825
> URL: https://issues.apache.org/jira/browse/SPARK-24825
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Critical
>
> The Kubernetes integration tests will currently fail if maven installation is 
> not performed first, because the integration test build believes it should be 
> pulling the Spark parent artifact from maven central. However, this is 
> incorrect because the integration test should be building the Spark parent 
> pom as a dependency in the multi-module build, and the integration test 
> should just use the dynamically built artifact. Or to put it another way, the 
> integration test builds should never be pulling Spark dependencies from maven 
> central.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure

2018-07-17 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547133#comment-16547133
 ] 

Stavros Kontopoulos commented on SPARK-24825:
-

You could remove .m2, downloads should be fast. :) I am hacking too :P

 

> [K8S][TEST] Kubernetes integration tests don't trace the maven project 
> dependency structure
> ---
>
> Key: SPARK-24825
> URL: https://issues.apache.org/jira/browse/SPARK-24825
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Critical
>
> The Kubernetes integration tests will currently fail if maven installation is 
> not performed first, because the integration test build believes it should be 
> pulling the Spark parent artifact from maven central. However, this is 
> incorrect because the integration test should be building the Spark parent 
> pom as a dependency in the multi-module build, and the integration test 
> should just use the dynamically built artifact. Or to put it another way, the 
> integration test builds should never be pulling Spark dependencies from maven 
> central.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure

2018-07-17 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547128#comment-16547128
 ] 

Stavros Kontopoulos edited comment on SPARK-24825 at 7/17/18 9:34 PM:
--

As I mentioned elsewhere one thing that worked for me is modify 
./dev/make-distribution.sh to do a clean install instead of clean package. But 
not sure for the long term for this, although you could purge deps before a 
build.

Test suite has the following spark deps:

mvn -f pom.xml dependency:resolve | grep spark

[INFO] org.apache.spark:spark-kvstore_2.11:jar:2.4.0-SNAPSHOT:compile
 [INFO] org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT:compile
 [INFO] org.apache.spark:spark-unsafe_2.11:jar:2.4.0-SNAPSHOT:compile
 [INFO] org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
 [INFO] org.apache.spark:spark-launcher_2.11:jar:2.4.0-SNAPSHOT:compile
 [INFO] org.apache.spark:spark-tags_2.11:jar:2.4.0-SNAPSHOT:compile
 [INFO] org.apache.spark:spark-tags_2.11:test-jar:tests:2.4.0-SNAPSHOT:compile
 [INFO] org.spark-project.spark:unused:jar:1.0.0:compile
 [INFO] org.apache.spark:spark-network-common_2.11:jar:2.4.0-SNAPSHOT:compile


was (Author: skonto):
As I mentioned elsewhere one thing that worked for me is modify 
./dev/make-distribution.sh to do a clean install instead of clean package. But 
not sure for the long term for this, although you could purge deps.

Test suite has the following spark deps:

mvn -f pom.xml dependency:resolve | grep spark

[INFO] org.apache.spark:spark-kvstore_2.11:jar:2.4.0-SNAPSHOT:compile
 [INFO] org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT:compile
 [INFO] org.apache.spark:spark-unsafe_2.11:jar:2.4.0-SNAPSHOT:compile
 [INFO] org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
 [INFO] org.apache.spark:spark-launcher_2.11:jar:2.4.0-SNAPSHOT:compile
 [INFO] org.apache.spark:spark-tags_2.11:jar:2.4.0-SNAPSHOT:compile
 [INFO] org.apache.spark:spark-tags_2.11:test-jar:tests:2.4.0-SNAPSHOT:compile
 [INFO] org.spark-project.spark:unused:jar:1.0.0:compile
 [INFO] org.apache.spark:spark-network-common_2.11:jar:2.4.0-SNAPSHOT:compile

> [K8S][TEST] Kubernetes integration tests don't trace the maven project 
> dependency structure
> ---
>
> Key: SPARK-24825
> URL: https://issues.apache.org/jira/browse/SPARK-24825
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Critical
>
> The Kubernetes integration tests will currently fail if maven installation is 
> not performed first, because the integration test build believes it should be 
> pulling the Spark parent artifact from maven central. However, this is 
> incorrect because the integration test should be building the Spark parent 
> pom as a dependency in the multi-module build, and the integration test 
> should just use the dynamically built artifact. Or to put it another way, the 
> integration test builds should never be pulling Spark dependencies from maven 
> central.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure

2018-07-17 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547132#comment-16547132
 ] 

shane knapp commented on SPARK-24825:
-

yeah, `clean install` probably isn't a good long-term solution.  we're still 
hacking on things, so give us a little bit.  :)

> [K8S][TEST] Kubernetes integration tests don't trace the maven project 
> dependency structure
> ---
>
> Key: SPARK-24825
> URL: https://issues.apache.org/jira/browse/SPARK-24825
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Critical
>
> The Kubernetes integration tests will currently fail if maven installation is 
> not performed first, because the integration test build believes it should be 
> pulling the Spark parent artifact from maven central. However, this is 
> incorrect because the integration test should be building the Spark parent 
> pom as a dependency in the multi-module build, and the integration test 
> should just use the dynamically built artifact. Or to put it another way, the 
> integration test builds should never be pulling Spark dependencies from maven 
> central.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure

2018-07-17 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547128#comment-16547128
 ] 

Stavros Kontopoulos edited comment on SPARK-24825 at 7/17/18 9:31 PM:
--

As I mentioned elsewhere one thing that worked for me is modify 
./dev/make-distribution.sh to do a clean install instead of clean package. But 
not sure for the long term for this, although you could purge deps.

Test suite has the following spark deps:

mvn -f pom.xml dependency:resolve | grep spark

[INFO] org.apache.spark:spark-kvstore_2.11:jar:2.4.0-SNAPSHOT:compile
 [INFO] org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT:compile
 [INFO] org.apache.spark:spark-unsafe_2.11:jar:2.4.0-SNAPSHOT:compile
 [INFO] org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
 [INFO] org.apache.spark:spark-launcher_2.11:jar:2.4.0-SNAPSHOT:compile
 [INFO] org.apache.spark:spark-tags_2.11:jar:2.4.0-SNAPSHOT:compile
 [INFO] org.apache.spark:spark-tags_2.11:test-jar:tests:2.4.0-SNAPSHOT:compile
 [INFO] org.spark-project.spark:unused:jar:1.0.0:compile
 [INFO] org.apache.spark:spark-network-common_2.11:jar:2.4.0-SNAPSHOT:compile


was (Author: skonto):
As I mentioned elsewhere one thing that worked for me is modify 
./dev/make-distribution.sh to do an clean install instead of clean package. But 
not sure for the long term.

Test suite has the following spark deps:

mvn -f pom.xml dependency:resolve | grep spark

[INFO] org.apache.spark:spark-kvstore_2.11:jar:2.4.0-SNAPSHOT:compile
[INFO] org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT:compile
[INFO] org.apache.spark:spark-unsafe_2.11:jar:2.4.0-SNAPSHOT:compile
[INFO] org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
[INFO] org.apache.spark:spark-launcher_2.11:jar:2.4.0-SNAPSHOT:compile
[INFO] org.apache.spark:spark-tags_2.11:jar:2.4.0-SNAPSHOT:compile
[INFO] org.apache.spark:spark-tags_2.11:test-jar:tests:2.4.0-SNAPSHOT:compile
[INFO] org.spark-project.spark:unused:jar:1.0.0:compile
[INFO] org.apache.spark:spark-network-common_2.11:jar:2.4.0-SNAPSHOT:compile

> [K8S][TEST] Kubernetes integration tests don't trace the maven project 
> dependency structure
> ---
>
> Key: SPARK-24825
> URL: https://issues.apache.org/jira/browse/SPARK-24825
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Critical
>
> The Kubernetes integration tests will currently fail if maven installation is 
> not performed first, because the integration test build believes it should be 
> pulling the Spark parent artifact from maven central. However, this is 
> incorrect because the integration test should be building the Spark parent 
> pom as a dependency in the multi-module build, and the integration test 
> should just use the dynamically built artifact. Or to put it another way, the 
> integration test builds should never be pulling Spark dependencies from maven 
> central.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure

2018-07-17 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547128#comment-16547128
 ] 

Stavros Kontopoulos commented on SPARK-24825:
-

As I mentioned elsewhere one thing that worked for me is modify 
./dev/make-distribution.sh to do an clean install instead of clean package. But 
not sure for the long term.

Test suite has the following spark deps:

mvn -f pom.xml dependency:resolve | grep spark

[INFO] org.apache.spark:spark-kvstore_2.11:jar:2.4.0-SNAPSHOT:compile
[INFO] org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT:compile
[INFO] org.apache.spark:spark-unsafe_2.11:jar:2.4.0-SNAPSHOT:compile
[INFO] org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
[INFO] org.apache.spark:spark-launcher_2.11:jar:2.4.0-SNAPSHOT:compile
[INFO] org.apache.spark:spark-tags_2.11:jar:2.4.0-SNAPSHOT:compile
[INFO] org.apache.spark:spark-tags_2.11:test-jar:tests:2.4.0-SNAPSHOT:compile
[INFO] org.spark-project.spark:unused:jar:1.0.0:compile
[INFO] org.apache.spark:spark-network-common_2.11:jar:2.4.0-SNAPSHOT:compile

> [K8S][TEST] Kubernetes integration tests don't trace the maven project 
> dependency structure
> ---
>
> Key: SPARK-24825
> URL: https://issues.apache.org/jira/browse/SPARK-24825
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Critical
>
> The Kubernetes integration tests will currently fail if maven installation is 
> not performed first, because the integration test build believes it should be 
> pulling the Spark parent artifact from maven central. However, this is 
> incorrect because the integration test should be building the Spark parent 
> pom as a dependency in the multi-module build, and the integration test 
> should just use the dynamically built artifact. Or to put it another way, the 
> integration test builds should never be pulling Spark dependencies from maven 
> central.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24835) col function ignores drop

2018-07-17 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547123#comment-16547123
 ] 

Liang-Chi Hsieh commented on SPARK-24835:
-

`drop` actually adds a projection on top of the original dataset. So the 
following query works:

{code}
df2 = df.drop('c')
df2.where(df['c'] < 6).show()
{code}

It can be seen as a query (in Scala) like:

{code}
val df = Seq(("test1", 0), ("test2", 1)).toDF("name", "id")
val filter1 = df.select(df("name")).filter(df("id") === 0)
{code}

This is a valid query.

Regarding the query that can't be run:

{code}
df = df.drop('c')
df.where(df['c'] < 6).show()
{code}

Because you are trying to resolve column {{c}} on top of the updated {{df}}, whose 
output is now {{[a, b]}}, the column can't be resolved.
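
A minimal Scala sketch of the same point (an assumed spark-shell session mirroring 
the Python example above; not taken from the report):

{code:scala}
// drop only adds a Project on top of the original plan; the relation underneath
// still produces column c.
val df = Seq((1, 3, 5), (2, 4, 7), (0, 3, 2)).toDF("a", "b", "c")

// The analyzed plan shows Project [a, b] over the original relation.
df.drop("c").explain(true)

// A filter that references the original df's column c still resolves, because the
// attribute exists below the added projection.
df.drop("c").where(df("c") < 6).show()
{code}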



> col function ignores drop
> -
>
> Key: SPARK-24835
> URL: https://issues.apache.org/jira/browse/SPARK-24835
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: Spark 2.3.0
> Python 3.5.3
>Reporter: Michael Souder
>Priority: Minor
>
> Not sure if this is a bug or user error, but I've noticed that accessing 
> columns with the col function ignores a previous call to drop.
> {code}
> import pyspark.sql.functions as F
> df = spark.createDataFrame([(1,3,5), (2, None, 7), (0, 3, 2)], ['a', 'b', 
> 'c'])
> df.show()
> +---+----+---+
> |  a|   b|  c|
> +---+----+---+
> |  1|   3|  5|
> |  2|null|  7|
> |  0|   3|  2|
> +---+----+---+
> df = df.drop('c')
> # the col function is able to see the 'c' column even though it has been 
> dropped
> df.where(F.col('c') < 6).show()
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  3|
> |  0|  3|
> +---+---+
> # trying the same with brackets on the data frame fails with the expected 
> error
> df.where(df['c'] < 6).show()
> Py4JJavaError: An error occurred while calling o36909.apply.
> : org.apache.spark.sql.AnalysisException: Cannot resolve column name "c" 
> among (a, b);{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24835) col function ignores drop

2018-07-17 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547123#comment-16547123
 ] 

Liang-Chi Hsieh edited comment on SPARK-24835 at 7/17/18 9:25 PM:
--

`drop` actually adds a projection on top of the original dataset. So the 
following query works:
{code:java}
df2 = df.drop('c')
df2.where(df['c'] < 6).show()
{code}
It can be seen as a query (in Scala) like:
{code:java}
val df = Seq(("test1", 0), ("test2", 1)).toDF("name", "id")
df.select(df("name")).filter(df("id") === 0).show()
{code}
This is a valid query.

Regarding the query that can't be run:
{code:java}
df = df.drop('c')
df.where(df['c'] < 6).show()
{code}
Because you are trying to resolve column {{c}} on top of the updated {{df}}, whose 
output is now {{[a, b]}}, the column can't be resolved.


was (Author: viirya):
`drop` actually adds a projection on top of the original dataset. So the 
following query works:

{code}
df2 = df.drop('c')
df2.where(df['c'] < 6).show()
{code}

It can be seen as a query (in Scala) like:

{code}
val df = Seq(("test1", 0), ("test2", 1)).toDF("name", "id")
val filter1 = df.select(df("name")).filter(df("id") === 0)
{code}

This is a valid query.

Regarding the query that can't be run:

{code}
df = df.drop('c')
df.where(df['c'] < 6).show()
{code}

Because you are trying to resolve column {{c}} on top of the updated {{df}}, whose 
output is now {{[a, b]}}, the column can't be resolved.



> col function ignores drop
> -
>
> Key: SPARK-24835
> URL: https://issues.apache.org/jira/browse/SPARK-24835
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: Spark 2.3.0
> Python 3.5.3
>Reporter: Michael Souder
>Priority: Minor
>
> Not sure if this is a bug or user error, but I've noticed that accessing 
> columns with the col function ignores a previous call to drop.
> {code}
> import pyspark.sql.functions as F
> df = spark.createDataFrame([(1,3,5), (2, None, 7), (0, 3, 2)], ['a', 'b', 
> 'c'])
> df.show()
> +---+----+---+
> |  a|   b|  c|
> +---+----+---+
> |  1|   3|  5|
> |  2|null|  7|
> |  0|   3|  2|
> +---+----+---+
> df = df.drop('c')
> # the col function is able to see the 'c' column even though it has been 
> dropped
> df.where(F.col('c') < 6).show()
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  3|
> |  0|  3|
> +---+---+
> # trying the same with brackets on the data frame fails with the expected 
> error
> df.where(df['c'] < 6).show()
> Py4JJavaError: An error occurred while calling o36909.apply.
> : org.apache.spark.sql.AnalysisException: Cannot resolve column name "c" 
> among (a, b);{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24681) Cannot create a view from a table when a nested column name contains ':'

2018-07-17 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24681.
-
   Resolution: Fixed
 Assignee: Takeshi Yamamuro
Fix Version/s: 2.4.0

> Cannot create a view from a table when a nested column name contains ':'
> 
>
> Key: SPARK-24681
> URL: https://issues.apache.org/jira/browse/SPARK-24681
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: Adrian Ionescu
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 2.4.0
>
>
> Here's a patch that reproduces the issue: 
> {code:java}
> diff --git 
> a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala 
> b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala 
> index 09c1547..29bb3db 100644 
> --- 
> a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala 
> +++ 
> b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala 
> @@ -19,6 +19,7 @@ package org.apache.spark.sql.hive 
>  
> import org.apache.spark.sql.{QueryTest, Row} 
> import org.apache.spark.sql.execution.datasources.parquet.ParquetTest 
> +import org.apache.spark.sql.functions.{lit, struct} 
> import org.apache.spark.sql.hive.test.TestHiveSingleton 
>  
> case class Cases(lower: String, UPPER: String) 
> @@ -76,4 +77,21 @@ class HiveParquetSuite extends QueryTest with ParquetTest 
> with TestHiveSingleton 
>   } 
> } 
>   } 
> + 
> +  test("column names including ':' characters") { 
> +    withTempPath { path => 
> +  withTable("test_table") { 
> +    spark.range(0) 
> +  .select(struct(lit(0).as("nested:column")).as("toplevel:column")) 
> +  .write.format("parquet") 
> +  .option("path", path.getCanonicalPath) 
> +  .saveAsTable("test_table") 
> + 
> +    sql("CREATE VIEW test_view_1 AS SELECT `toplevel:column`.* FROM 
> test_table") 
> +    sql("CREATE VIEW test_view_2 AS SELECT * FROM test_table") 
> + 
> +  } 
> +    } 
> +  } 
> }{code}
> The first "CREATE VIEW" statement succeeds, but the second one fails with:
> {code:java}
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> struct
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24838) Support uncorrelated IN/EXISTS subqueries for more operators

2018-07-17 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547105#comment-16547105
 ] 

Xiao Li commented on SPARK-24838:
-

cc [~dkbiswal]

> Support uncorrelated IN/EXISTS subqueries for more operators 
> -
>
> Key: SPARK-24838
> URL: https://issues.apache.org/jira/browse/SPARK-24838
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Qifan Pu
>Priority: Major
>
> Currently, CheckAnalysis allows IN/EXISTS subqueries only in Filter operators. 
> Running a query:
> {{select name in (select * from valid_names)}}
> {{from all_names}}
> returns the error:
> {code:java}
> Error in SQL statement: AnalysisException: IN/EXISTS predicate sub-queries 
> can only be used in a Filter
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24838) Support uncorrelated IN/EXISTS subqueries for more operators

2018-07-17 Thread Qifan Pu (JIRA)
Qifan Pu created SPARK-24838:


 Summary: Support uncorrelated IN/EXISTS subqueries for more 
operators 
 Key: SPARK-24838
 URL: https://issues.apache.org/jira/browse/SPARK-24838
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Qifan Pu


Currently, CheckAnalysis allows IN/EXISTS subqueries only in Filter operators. 
Running a query:

{{select name in (select * from valid_names)}}
{{from all_names}}

returns the error:
{code:java}
Error in SQL statement: AnalysisException: IN/EXISTS predicate sub-queries can 
only be used in a Filter
{code}
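
As a point of reference, a hedged Scala sketch of one way to express the same result today by rewriting the uncorrelated IN as an outer join; the column name {{name}} in {{valid_names}} is an assumption, and IN's NULL handling is ignored for brevity:
{code:scala}
// Hedged sketch: emulate `name IN (SELECT ... FROM valid_names)` in a projection
// via a left outer join, since the IN predicate is only allowed in Filter today.
import spark.implicits._

val valid = spark.table("valid_names").select($"name".as("valid_name")).distinct()

val result = spark.table("all_names")
  .join(valid, $"name" === $"valid_name", "left_outer")
  .select($"name", $"valid_name".isNotNull.as("name_in_valid_names"))

result.show()
{code}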



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure

2018-07-17 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547061#comment-16547061
 ] 

Stavros Kontopoulos edited comment on SPARK-24825 at 7/17/18 8:54 PM:
--

To reproduce it, try building the test suite:

../../../build/mvn install
 Using `mvn` from path:...spark/build/apache-maven-3.3.9/bin/mvn
 [INFO] Scanning for projects...
 [INFO] 
 [INFO] 
 [INFO] Building Spark Project Kubernetes Integration Tests 2.4.0-SNAPSHOT
 [INFO] 
 Downloading: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/maven-metadata.xml]
 Downloaded: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/maven-metadata.xml]
 (2 KB at 0.4 KB/sec)
 Downloading: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095738-170.pom]
 [WARNING] The POM for 
org.apache.spark:spark-core_2.11:jar:2.4.0-20180717.095738-170 is missing, no 
dependency information available
 [WARNING] The POM for 
org.apache.spark:spark-core_2.11:jar:tests:2.4.0-20180717.095737-170 is 
missing, no dependency information available
 Downloading: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/maven-metadata.xml]
 Downloaded: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/maven-metadata.xml]
 (2 KB at 0.7 KB/sec)
 Downloading: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/spark-tags_2.11-2.4.0-20180717.095221-170.pom]
 [WARNING] The POM for 
org.apache.spark:spark-tags_2.11:jar:tests:2.4.0-20180717.095220-170 is 
missing, no dependency information available
 Downloading: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095738-170.jar]
 Downloading: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095737-170-tests.jar]
 Downloading: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/spark-tags_2.11-2.4.0-20180717.095220-170-tests.jar]
 Downloaded: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/spark-tags_2.11-2.4.0-20180717.095220-170-tests.jar]
 (9 KB at 5.2 KB/sec)
 Downloaded: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095737-170-tests.jar]
 (10438 KB at 577.5 KB/sec)
 [INFO] 
 [INFO] BUILD FAILURE
 [INFO] 
 [INFO] Total time: 26.346 s
 [INFO] Finished at: 2018-07-17T23:15:29+03:00
 [INFO] Final Memory: 14M/204M
 [INFO] 
 [ERROR] Failed to execute goal on project 
spark-kubernetes-integration-tests_2.11: Could not resolve dependencies for 
project 
spark-kubernetes-integration-tests:spark-kubernetes-integration-tests_2.11:jar:2.4.0-SNAPSHOT:
 Could not find artifact 
org.apache.spark:spark-core_2.11:jar:2.4.0-20180717.095738-170 in 
apache.snapshots ([https://repository.apache.org/snapshots]) -> [Help 1]
 [ERROR] 
 [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
 [ERROR] Re-run Maven using the -X switch to enable full debug logging.
 [ERROR] 
 [ERROR] For more information about the errors and possible solutions, please 
read the following articles:
 [ERROR] [Help 1] 
[http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionExceptio|http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException]

 

The metadata here shows the expected version:

[https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/maven-metadata.xml]

 
groupId: org.apache.spark
artifactId: spark-core_2.11
version: 2.4.0-SNAPSHOT
snapshot timestamp: 20180717.095738
snapshot buildNumber: 170
lastUpdated: 20180717095738

but there is no such jar uploaded.

 

Latest is: 
[spark-core_2.11-2.4.0-20180717.095737-170.jar|https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095737-170.jar]

Probably the timestamp is wrong here.


was (Author: skonto):
To reproduce it, try building the test suite:

../../../build/mvn install
 Using `mvn` from path:...spark/build/apache-maven-3.3.9/bin/mvn
 [INFO] Scanning for projects...
 [INFO] 
 [INFO] 
 [INFO] Building Spark Project Kubernetes Integration Tests 2.4.0-SNAPSHOT
 [INFO] 

[jira] [Comment Edited] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure

2018-07-17 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547061#comment-16547061
 ] 

Stavros Kontopoulos edited comment on SPARK-24825 at 7/17/18 8:54 PM:
--

To reproduce it, try building the test suite:

../../../build/mvn install
 Using `mvn` from path:...spark/build/apache-maven-3.3.9/bin/mvn
 [INFO] Scanning for projects...
 [INFO] 
 [INFO] 
 [INFO] Building Spark Project Kubernetes Integration Tests 2.4.0-SNAPSHOT
 [INFO] 
 Downloading: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/maven-metadata.xml]
 Downloaded: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/maven-metadata.xml]
 (2 KB at 0.4 KB/sec)
 Downloading: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095738-170.pom]
 [WARNING] The POM for 
org.apache.spark:spark-core_2.11:jar:2.4.0-20180717.095738-170 is missing, no 
dependency information available
 [WARNING] The POM for 
org.apache.spark:spark-core_2.11:jar:tests:2.4.0-20180717.095737-170 is 
missing, no dependency information available
 Downloading: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/maven-metadata.xml]
 Downloaded: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/maven-metadata.xml]
 (2 KB at 0.7 KB/sec)
 Downloading: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/spark-tags_2.11-2.4.0-20180717.095221-170.pom]
 [WARNING] The POM for 
org.apache.spark:spark-tags_2.11:jar:tests:2.4.0-20180717.095220-170 is 
missing, no dependency information available
 Downloading: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095738-170.jar]
 Downloading: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095737-170-tests.jar]
 Downloading: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/spark-tags_2.11-2.4.0-20180717.095220-170-tests.jar]
 Downloaded: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/spark-tags_2.11-2.4.0-20180717.095220-170-tests.jar]
 (9 KB at 5.2 KB/sec)
 Downloaded: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095737-170-tests.jar]
 (10438 KB at 577.5 KB/sec)
 [INFO] 
 [INFO] BUILD FAILURE
 [INFO] 
 [INFO] Total time: 26.346 s
 [INFO] Finished at: 2018-07-17T23:15:29+03:00
 [INFO] Final Memory: 14M/204M
 [INFO] 
 [ERROR] Failed to execute goal on project 
spark-kubernetes-integration-tests_2.11: Could not resolve dependencies for 
project 
spark-kubernetes-integration-tests:spark-kubernetes-integration-tests_2.11:jar:2.4.0-SNAPSHOT:
 Could not find artifact 
org.apache.spark:spark-core_2.11:jar:2.4.0-20180717.095738-170 in 
apache.snapshots ([https://repository.apache.org/snapshots]) -> [Help 1]
 [ERROR] 
 [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
 [ERROR] Re-run Maven using the -X switch to enable full debug logging.
 [ERROR] 
 [ERROR] For more information about the errors and possible solutions, please 
read the following articles:
 [ERROR] [Help 1] 
[http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionExceptio|http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException]

 

The metadata here shows the expected version:

[https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/maven-metadata.xml]

 
groupId: org.apache.spark
artifactId: spark-core_2.11
version: 2.4.0-SNAPSHOT
snapshot timestamp: 20180717.095738
snapshot buildNumber: 170
lastUpdated: 20180717095738

but there is no such jar uploaded.

 

Latest is: 
[spark-core_2.11-2.4.0-20180717.095737-170.jar|https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095737-170.jar]

Probably the timestamp is wrong here.


was (Author: skonto):
To reproduce it, try building the test suite:

../../../build/mvn install
 Using `mvn` from path:...spark/build/apache-maven-3.3.9/bin/mvn
 [INFO] Scanning for projects...
 [INFO] 
 [INFO] 
 [INFO] Building Spark Project Kubernetes Integration Tests 2.4.0-SNAPSHOT
 [INFO] 

[jira] [Updated] (SPARK-23723) New encoding option for json datasource

2018-07-17 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23723:

Summary: New encoding option for json datasource  (was: New charset option 
for json datasource)

> New encoding option for json datasource
> ---
>
> Key: SPARK-23723
> URL: https://issues.apache.org/jira/browse/SPARK-23723
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently the JSON reader can read json files in different charsets/encodings. The 
> JSON reader uses the jackson-json library to automatically detect the charset 
> of the input text/stream. Here you can see the method which detects the encoding: 
> [https://github.com/FasterXML/jackson-core/blob/master/src/main/java/com/fasterxml/jackson/core/json/ByteSourceJsonBootstrapper.java#L111-L174]
>  
> The detectEncoding method checks the BOM 
> ([https://en.wikipedia.org/wiki/Byte_order_mark]) at the beginning of the text. 
> The BOM can be in the file but it is not mandatory. If it is not present, the 
> auto-detection mechanism can select the wrong charset, and as a consequence the 
> user cannot read the json file. *The proposed option will allow bypassing the 
> auto-detection mechanism and setting the charset explicitly.*
>  
> The charset option is already exposed as a CSV option: 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L87-L88]
>  . I propose adding the same option for JSON.
>  
> Regarding the JSON writer, *the charset option will give the user the 
> opportunity* to read json files in a charset different from UTF-8, modify the 
> dataset and *write the results back to json files in the original encoding.* At 
> the moment this is not possible because the result can be saved in UTF-8 only.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure

2018-07-17 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547084#comment-16547084
 ] 

shane knapp commented on SPARK-24825:
-

We're trying to get things to work without uploading the jar, and instead have it 
found locally.

 

> [K8S][TEST] Kubernetes integration tests don't trace the maven project 
> dependency structure
> ---
>
> Key: SPARK-24825
> URL: https://issues.apache.org/jira/browse/SPARK-24825
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Critical
>
> The Kubernetes integration tests will currently fail if maven installation is 
> not performed first, because the integration test build believes it should be 
> pulling the Spark parent artifact from maven central. However, this is 
> incorrect because the integration test should be building the Spark parent 
> pom as a dependency in the multi-module build, and the integration test 
> should just use the dynamically built artifact. Or to put it another way, the 
> integration test builds should never be pulling Spark dependencies from maven 
> central.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure

2018-07-17 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547061#comment-16547061
 ] 

Stavros Kontopoulos edited comment on SPARK-24825 at 7/17/18 8:44 PM:
--

To reproduce it, try building the test suite:

../../../build/mvn install
 Using `mvn` from path:...spark/build/apache-maven-3.3.9/bin/mvn
 [INFO] Scanning for projects...
 [INFO] 
 [INFO] 
 [INFO] Building Spark Project Kubernetes Integration Tests 2.4.0-SNAPSHOT
 [INFO] 
 Downloading: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/maven-metadata.xml]
 Downloaded: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/maven-metadata.xml]
 (2 KB at 0.4 KB/sec)
 Downloading: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095738-170.pom]
 [WARNING] The POM for 
org.apache.spark:spark-core_2.11:jar:2.4.0-20180717.095738-170 is missing, no 
dependency information available
 [WARNING] The POM for 
org.apache.spark:spark-core_2.11:jar:tests:2.4.0-20180717.095737-170 is 
missing, no dependency information available
 Downloading: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/maven-metadata.xml]
 Downloaded: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/maven-metadata.xml]
 (2 KB at 0.7 KB/sec)
 Downloading: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/spark-tags_2.11-2.4.0-20180717.095221-170.pom]
 [WARNING] The POM for 
org.apache.spark:spark-tags_2.11:jar:tests:2.4.0-20180717.095220-170 is 
missing, no dependency information available
 Downloading: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095738-170.jar]
 Downloading: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095737-170-tests.jar]
 Downloading: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/spark-tags_2.11-2.4.0-20180717.095220-170-tests.jar]
 Downloaded: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/spark-tags_2.11-2.4.0-20180717.095220-170-tests.jar]
 (9 KB at 5.2 KB/sec)
 Downloaded: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095737-170-tests.jar]
 (10438 KB at 577.5 KB/sec)
 [INFO] 
 [INFO] BUILD FAILURE
 [INFO] 
 [INFO] Total time: 26.346 s
 [INFO] Finished at: 2018-07-17T23:15:29+03:00
 [INFO] Final Memory: 14M/204M
 [INFO] 
 [ERROR] Failed to execute goal on project 
spark-kubernetes-integration-tests_2.11: Could not resolve dependencies for 
project 
spark-kubernetes-integration-tests:spark-kubernetes-integration-tests_2.11:jar:2.4.0-SNAPSHOT:
 Could not find artifact 
org.apache.spark:spark-core_2.11:jar:2.4.0-20180717.095738-170 in 
apache.snapshots ([https://repository.apache.org/snapshots]) -> [Help 1]
 [ERROR] 
 [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
 [ERROR] Re-run Maven using the -X switch to enable full debug logging.
 [ERROR] 
 [ERROR] For more information about the errors and possible solutions, please 
read the following articles:
 [ERROR] [Help 1] 
[http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionExceptio|http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException]

 

The metadata here shows the expected version:

[https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/maven-metadata.xml]

 
groupId: org.apache.spark
artifactId: spark-core_2.11
version: 2.4.0-SNAPSHOT
snapshot timestamp: 20180717.095738
snapshot buildNumber: 170
lastUpdated: 20180717095738

but there is no such jar uploaded.


was (Author: skonto):
To reproduce it, try building the test suite:

../../../build/mvn install
 Using `mvn` from path:...spark/build/apache-maven-3.3.9/bin/mvn
 [INFO] Scanning for projects...
 [INFO] 
 [INFO] 
 [INFO] Building Spark Project Kubernetes Integration Tests 2.4.0-SNAPSHOT
 [INFO] 
 Downloading: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/maven-metadata.xml]
 Downloaded: 

[jira] [Commented] (SPARK-24536) Query with nonsensical LIMIT hits AssertionError

2018-07-17 Thread Nihar Sheth (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547074#comment-16547074
 ] 

Nihar Sheth commented on SPARK-24536:
-

I'd like to take a shot at this, if no one else is working on it.

> Query with nonsensical LIMIT hits AssertionError
> 
>
> Key: SPARK-24536
> URL: https://issues.apache.org/jira/browse/SPARK-24536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Alexander Behm
>Priority: Trivial
>  Labels: beginner
>
> SELECT COUNT(1) FROM t LIMIT CAST(NULL AS INT)
> fails in the QueryPlanner with:
> {code}
> java.lang.AssertionError: assertion failed: No plan for GlobalLimit null
> {code}
> I think this issue should be caught earlier during semantic analysis.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24837) Add kafka as spark metrics sink

2018-07-17 Thread Sandish Kumar HN (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandish Kumar HN updated SPARK-24837:
-
Description: 
Sink spark metrics to kafka producer 
spark/core/src/main/scala/org/apache/spark/metrics/sink/
 someone assign this to me?

  was:
Sink spark logs/metrics to kafka producer 
spark/core/src/main/scala/org/apache/spark/metrics/sink/
someone assign this to me?


> Add kafka as spark metrics sink
> ---
>
> Key: SPARK-24837
> URL: https://issues.apache.org/jira/browse/SPARK-24837
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Sandish Kumar HN
>Priority: Major
>
> Sink spark metrics to kafka producer 
> spark/core/src/main/scala/org/apache/spark/metrics/sink/
>  someone assign this to me?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24837) Add kafka as spark metrics sink

2018-07-17 Thread Sandish Kumar HN (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandish Kumar HN updated SPARK-24837:
-
Summary: Add kafka as spark metrics sink  (was: Add kafka as spark 
logs/metrics sink)

> Add kafka as spark metrics sink
> ---
>
> Key: SPARK-24837
> URL: https://issues.apache.org/jira/browse/SPARK-24837
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Sandish Kumar HN
>Priority: Major
>
> Sink spark logs/metrics to kafka producer 
> spark/core/src/main/scala/org/apache/spark/metrics/sink/
> someone assign this to me?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure

2018-07-17 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547061#comment-16547061
 ] 

Stavros Kontopoulos edited comment on SPARK-24825 at 7/17/18 8:20 PM:
--

To reproduce it, try building the test suite:

../../../build/mvn install
 Using `mvn` from path:...spark/build/apache-maven-3.3.9/bin/mvn
 [INFO] Scanning for projects...
 [INFO] 
 [INFO] 
 [INFO] Building Spark Project Kubernetes Integration Tests 2.4.0-SNAPSHOT
 [INFO] 
 Downloading: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/maven-metadata.xml]
 Downloaded: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/maven-metadata.xml]
 (2 KB at 0.4 KB/sec)
 Downloading: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095738-170.pom]
 [WARNING] The POM for 
org.apache.spark:spark-core_2.11:jar:2.4.0-20180717.095738-170 is missing, no 
dependency information available
 [WARNING] The POM for 
org.apache.spark:spark-core_2.11:jar:tests:2.4.0-20180717.095737-170 is 
missing, no dependency information available
 Downloading: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/maven-metadata.xml]
 Downloaded: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/maven-metadata.xml]
 (2 KB at 0.7 KB/sec)
 Downloading: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/spark-tags_2.11-2.4.0-20180717.095221-170.pom]
 [WARNING] The POM for 
org.apache.spark:spark-tags_2.11:jar:tests:2.4.0-20180717.095220-170 is 
missing, no dependency information available
 Downloading: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095738-170.jar]
 Downloading: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095737-170-tests.jar]
 Downloading: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/spark-tags_2.11-2.4.0-20180717.095220-170-tests.jar]
 Downloaded: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/spark-tags_2.11-2.4.0-20180717.095220-170-tests.jar]
 (9 KB at 5.2 KB/sec)
 Downloaded: 
[https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095737-170-tests.jar]
 (10438 KB at 577.5 KB/sec)
 [INFO] 
 [INFO] BUILD FAILURE
 [INFO] 
 [INFO] Total time: 26.346 s
 [INFO] Finished at: 2018-07-17T23:15:29+03:00
 [INFO] Final Memory: 14M/204M
 [INFO] 
 [ERROR] Failed to execute goal on project 
spark-kubernetes-integration-tests_2.11: Could not resolve dependencies for 
project 
spark-kubernetes-integration-tests:spark-kubernetes-integration-tests_2.11:jar:2.4.0-SNAPSHOT:
 Could not find artifact 
org.apache.spark:spark-core_2.11:jar:2.4.0-20180717.095738-170 in 
apache.snapshots ([https://repository.apache.org/snapshots]) -> [Help 1]
 [ERROR] 
 [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
 [ERROR] Re-run Maven using the -X switch to enable full debug logging.
 [ERROR] 
 [ERROR] For more information about the errors and possible solutions, please 
read the following articles:
 [ERROR] [Help 1] 
[http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionExceptio|http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException]

 

 

 

 

 

 

``


was (Author: skonto):
To reproduce it, try building the test suite:

```

../../../build/mvn install
Using `mvn` from path:...spark/build/apache-maven-3.3.9/bin/mvn
[INFO] Scanning for projects...
[INFO] 
[INFO] 
[INFO] Building Spark Project Kubernetes Integration Tests 2.4.0-SNAPSHOT
[INFO] 
Downloading: 
https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/maven-metadata.xml
Downloaded: 
https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/maven-metadata.xml
 (2 KB at 0.4 KB/sec)
Downloading: 
https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095738-170.pom
[WARNING] The POM for 
org.apache.spark:spark-core_2.11:jar:2.4.0-20180717.095738-170 is missing, 

[jira] [Commented] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure

2018-07-17 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547061#comment-16547061
 ] 

Stavros Kontopoulos commented on SPARK-24825:
-

To reproduce it, try building the test suite:

```

../../../build/mvn install
Using `mvn` from path:...spark/build/apache-maven-3.3.9/bin/mvn
[INFO] Scanning for projects...
[INFO] 
[INFO] 
[INFO] Building Spark Project Kubernetes Integration Tests 2.4.0-SNAPSHOT
[INFO] 
Downloading: 
https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/maven-metadata.xml
Downloaded: 
https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/maven-metadata.xml
 (2 KB at 0.4 KB/sec)
Downloading: 
https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095738-170.pom
[WARNING] The POM for 
org.apache.spark:spark-core_2.11:jar:2.4.0-20180717.095738-170 is missing, no 
dependency information available
[WARNING] The POM for 
org.apache.spark:spark-core_2.11:jar:tests:2.4.0-20180717.095737-170 is 
missing, no dependency information available
Downloading: 
https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/maven-metadata.xml
Downloaded: 
https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/maven-metadata.xml
 (2 KB at 0.7 KB/sec)
Downloading: 
https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/spark-tags_2.11-2.4.0-20180717.095221-170.pom
[WARNING] The POM for 
org.apache.spark:spark-tags_2.11:jar:tests:2.4.0-20180717.095220-170 is 
missing, no dependency information available
Downloading: 
https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095738-170.jar
Downloading: 
https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095737-170-tests.jar
Downloading: 
https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/spark-tags_2.11-2.4.0-20180717.095220-170-tests.jar
Downloaded: 
https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/spark-tags_2.11-2.4.0-20180717.095220-170-tests.jar
 (9 KB at 5.2 KB/sec)
Downloaded: 
https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095737-170-tests.jar
 (10438 KB at 577.5 KB/sec)
[INFO] 
[INFO] BUILD FAILURE
[INFO] 
[INFO] Total time: 26.346 s
[INFO] Finished at: 2018-07-17T23:15:29+03:00
[INFO] Final Memory: 14M/204M
[INFO] 
[ERROR] Failed to execute goal on project 
spark-kubernetes-integration-tests_2.11: Could not resolve dependencies for 
project 
spark-kubernetes-integration-tests:spark-kubernetes-integration-tests_2.11:jar:2.4.0-SNAPSHOT:
 Could not find artifact 
org.apache.spark:spark-core_2.11:jar:2.4.0-20180717.095738-170 in 
apache.snapshots (https://repository.apache.org/snapshots) -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please 
read the following articles:
[ERROR] [Help 1] 
[http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionExceptio|http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException]

```

 

 

 

 

 

 

``

> [K8S][TEST] Kubernetes integration tests don't trace the maven project 
> dependency structure
> ---
>
> Key: SPARK-24825
> URL: https://issues.apache.org/jira/browse/SPARK-24825
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Critical
>
> The Kubernetes integration tests will currently fail if maven installation is 
> not performed first, because the integration test build believes it should be 
> pulling the Spark parent artifact from maven central. However, this is 
> incorrect because the integration test should be building the Spark parent 
> pom as a dependency in the multi-module build, and the integration test 
> should just use the dynamically built artifact. Or to put it another way, the 
> integration test builds should never be pulling Spark dependencies from 

[jira] [Updated] (SPARK-24837) Add kafka as spark logs/metrics sink

2018-07-17 Thread Sandish Kumar HN (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandish Kumar HN updated SPARK-24837:
-
Description: 
Sink spark logs/metrics to kafka producer 
spark/core/src/main/scala/org/apache/spark/metrics/sink/
someone assign this to me?

  was:
Sink spark logs/metrics to kafka producer 
spark/core/src/main/scala/org/apache/spark/metrics/sink/


> Add kafka as spark logs/metrics sink
> 
>
> Key: SPARK-24837
> URL: https://issues.apache.org/jira/browse/SPARK-24837
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Sandish Kumar HN
>Priority: Major
>
> Sink spark logs/metrics to kafka producer 
> spark/core/src/main/scala/org/apache/spark/metrics/sink/
> someone assign this to me?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24837) Add kafka as spark logs/metrics sink

2018-07-17 Thread Sandish Kumar HN (JIRA)
Sandish Kumar HN created SPARK-24837:


 Summary: Add kafka as spark logs/metrics sink
 Key: SPARK-24837
 URL: https://issues.apache.org/jira/browse/SPARK-24837
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: Sandish Kumar HN


Sink spark logs/metrics to kafka producer 
spark/core/src/main/scala/org/apache/spark/metrics/sink/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24747) Make spark.ml.util.Instrumentation class more flexible

2018-07-17 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-24747.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21719
[https://github.com/apache/spark/pull/21719]

> Make spark.ml.util.Instrumentation class more flexible
> --
>
> Key: SPARK-24747
> URL: https://issues.apache.org/jira/browse/SPARK-24747
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Bago Amirbekian
>Assignee: Bago Amirbekian
>Priority: Major
> Fix For: 2.4.0
>
>
> The Instrumentation class (which is an internal private class) is somewhat 
> limited by its current APIs. The class requires that an estimator and dataset be 
> passed to the constructor, which limits how it can be used. Furthermore, the 
> current APIs make it hard to intercept failures and record anything related 
> to those failures.
> The following changes could make the instrumentation class easier to work 
> with. All these changes are for private APIs and should not be visible to 
> users.
> {code}
> // New no-argument constructor:
> Instrumentation()
> // New api to log previous constructor arguments.
> logTrainingContext(estimator: Estimator[_], dataset: Dataset[_])
> logFailure(e: Throwable): Unit
> // Log success with no arguments
> logSuccess(): Unit
> // Log result model explicitly instead of passing to logSuccess
> logModel(model: Model[_]): Unit
> // On Companion object
> Instrumentation.instrumented[T](body: (Instrumentation => T)): T
> // The above API will allow us to write instrumented methods more clearly and 
> handle logging success and failure automatically:
> def someMethod(...): T = instrumented { instr =>
>   instr.logNamedValue(name, value)
>   // more code here
>   instr.logModel(model)
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24644) Pyarrow exception while running pandas_udf on pyspark 2.3.1

2018-07-17 Thread Hichame El Khalfi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547031#comment-16547031
 ] 

Hichame El Khalfi commented on SPARK-24644:
---

Indeed, we were using an old version of pandas; after updating it to 0.19.2, there 
is no crash/error to report.

Thank you [~bryanc] and [~hyukjin.kwon] for your valuable help and input (y).

 

> Pyarrow exception while running pandas_udf on pyspark 2.3.1
> ---
>
> Key: SPARK-24644
> URL: https://issues.apache.org/jira/browse/SPARK-24644
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: os: centos
> pyspark 2.3.1
> spark 2.3.1
> pyarrow >= 0.8.0
>Reporter: Hichame El Khalfi
>Priority: Major
>
> Hello,
> When I try to run a `pandas_udf` on my spark dataframe, I get this error
>  
> {code:java}
>   File 
> "/mnt/ephemeral3/yarn/nm/usercache/user/appcache/application_1524574803975_205774/container_e280_1524574803975_205774_01_44/pyspark.zip/pyspark/serializers.py",
>  lin
> e 280, in load_stream
> pdf = batch.to_pandas()
>   File "pyarrow/table.pxi", line 677, in pyarrow.lib.RecordBatch.to_pandas 
> (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:43226)
> return Table.from_batches([self]).to_pandas(nthreads=nthreads)
>   File "pyarrow/table.pxi", line 1043, in pyarrow.lib.Table.to_pandas 
> (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:46331)
> mgr = pdcompat.table_to_blockmanager(options, self, memory_pool,
>   File "/usr/lib64/python2.7/site-packages/pyarrow/pandas_compat.py", line 
> 528, in table_to_blockmanager
> blocks = _table_to_blocks(options, block_table, nthreads, memory_pool)
>   File "/usr/lib64/python2.7/site-packages/pyarrow/pandas_compat.py", line 
> 622, in _table_to_blocks
> return [_reconstruct_block(item) for item in result]
>   File "/usr/lib64/python2.7/site-packages/pyarrow/pandas_compat.py", line 
> 446, in _reconstruct_block
> block = _int.make_block(block_arr, placement=placement)
> TypeError: make_block() takes at least 3 arguments (2 given)
> {code}
>  
>  More than happy to provide any additional information



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24836) New option - ignoreExtension

2018-07-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24836:


Assignee: Apache Spark

> New option - ignoreExtension
> 
>
> Key: SPARK-24836
> URL: https://issues.apache.org/jira/browse/SPARK-24836
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> Need to add a new option for the Avro datasource - *ignoreExtension*. It should 
> control whether the .avro file extension is ignored. If it is set to *true* (the 
> default), files both with and without the .avro extension should be loaded. Example 
> of usage:
> {code:scala}
> spark
>   .read
>   .option("ignoreExtension", false)
>   .avro("path to avro files")
> {code}
> The option duplicates Hadoop's config 
> avro.mapred.ignore.inputs.without.extension which is taken into account by 
> Avro datasource now and can be set like:
> {code:scala}
> spark
>   .sqlContext
>   .sparkContext
>   .hadoopConfiguration
>   .set("avro.mapred.ignore.inputs.without.extension", "true")
> {code}
> The ignoreExtension option must override 
> avro.mapred.ignore.inputs.without.extension.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24836) New option - ignoreExtension

2018-07-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24836:


Assignee: (was: Apache Spark)

> New option - ignoreExtension
> 
>
> Key: SPARK-24836
> URL: https://issues.apache.org/jira/browse/SPARK-24836
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> Need to add a new option for the Avro datasource - *ignoreExtension*. It should 
> control whether the .avro file extension is ignored. If it is set to *true* (the 
> default), files both with and without the .avro extension should be loaded. Example 
> of usage:
> {code:scala}
> spark
>   .read
>   .option("ignoreExtension", false)
>   .avro("path to avro files")
> {code}
> The option duplicates Hadoop's config 
> avro.mapred.ignore.inputs.without.extension which is taken into account by 
> Avro datasource now and can be set like:
> {code:scala}
> spark
>   .sqlContext
>   .sparkContext
>   .hadoopConfiguration
>   .set("avro.mapred.ignore.inputs.without.extension", "true")
> {code}
> The ignoreExtension option must override 
> avro.mapred.ignore.inputs.without.extension.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24836) New option - ignoreExtension

2018-07-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547010#comment-16547010
 ] 

Apache Spark commented on SPARK-24836:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21798

> New option - ignoreExtension
> 
>
> Key: SPARK-24836
> URL: https://issues.apache.org/jira/browse/SPARK-24836
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> Need to add a new option for the Avro datasource - *ignoreExtension*. It should 
> control whether the .avro file extension is ignored. If it is set to *true* (the 
> default), files both with and without the .avro extension should be loaded. Example 
> of usage:
> {code:scala}
> spark
>   .read
>   .option("ignoreExtension", false)
>   .avro("path to avro files")
> {code}
> The option duplicates Hadoop's config 
> avro.mapred.ignore.inputs.without.extension which is taken into account by 
> Avro datasource now and can be set like:
> {code:scala}
> spark
>   .sqlContext
>   .sparkContext
>   .hadoopConfiguration
>   .set("avro.mapred.ignore.inputs.without.extension", "true")
> {code}
> The ignoreExtension option must override 
> avro.mapred.ignore.inputs.without.extension.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24836) New option - ignoreExtension

2018-07-17 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24836:
--

 Summary: New option - ignoreExtension
 Key: SPARK-24836
 URL: https://issues.apache.org/jira/browse/SPARK-24836
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.1
Reporter: Maxim Gekk


Need to add a new option for the Avro datasource - *ignoreExtension*. It should 
control whether the .avro file extension is ignored. If it is set to *true* (the default), 
files both with and without the .avro extension should be loaded. Example of usage:
{code:scala}
spark
  .read
  .option("ignoreExtension", false)
  .avro("path to avro files")
{code}

The option duplicates Hadoop's config 
avro.mapred.ignore.inputs.without.extension which is taken into account by Avro 
datasource now and can be set like:
{code:scala}
spark
  .sqlContext
  .sparkContext
  .hadoopConfiguration
  .set("avro.mapred.ignore.inputs.without.extension", "true")
{code}

The ignoreExtension option must override 
avro.mapred.ignore.inputs.without.extension.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24835) col function ignores drop

2018-07-17 Thread Michael Souder (JIRA)
Michael Souder created SPARK-24835:
--

 Summary: col function ignores drop
 Key: SPARK-24835
 URL: https://issues.apache.org/jira/browse/SPARK-24835
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.0
 Environment: Spark 2.3.0

Python 3.5.3
Reporter: Michael Souder


Not sure if this is a bug or user error, but I've noticed that accessing 
columns with the col function ignores a previous call to drop.
{code}
import pyspark.sql.functions as F

df = spark.createDataFrame([(1,3,5), (2, None, 7), (0, 3, 2)], ['a', 'b', 'c'])
df.show()

+---++---+
|  a|   b|  c|
+---++---+
|  1|   3|  5|
|  2|null|  7|
|  0|   3|  2|
+---++---+

df = df.drop('c')

# the col function is able to see the 'c' column even though it has been dropped
df.where(F.col('c') < 6).show()

+---+---+
|  a|  b|
+---+---+
|  1|  3|
|  0|  3|
+---+---+

# trying the same with brackets on the data frame fails with the expected error
df.where(df['c'] < 6).show()

Py4JJavaError: An error occurred while calling o36909.apply.
: org.apache.spark.sql.AnalysisException: Cannot resolve column name "c" among 
(a, b);{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24801) Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can waste a lot of memory

2018-07-17 Thread Misha Dmitriev (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546981#comment-16546981
 ] 

Misha Dmitriev commented on SPARK-24801:


[~irashid] Yes, I'll submit a change to lazily initialize 
{{SaslEncryption$EncryptedMessage}}.{{byteChannel}}; that won't hurt anyway.

> Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can 
> waste a lot of memory
> ---
>
> Key: SPARK-24801
> URL: https://issues.apache.org/jira/browse/SPARK-24801
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Misha Dmitriev
>Priority: Major
>
> I recently analyzed another Yarn NM heap dump with jxray 
> ([www.jxray.com|http://www.jxray.com]) and found that 81% of memory is 
> wasted by empty (all zeroes) byte[] arrays. Most of these arrays are 
> referenced by 
> {{org.apache.spark.network.util.ByteArrayWritableChannel.data}}, and these in 
> turn come from 
> {{spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel}}. Here is 
> the full reference chain that leads to the problematic arrays:
> {code:java}
> 2,597,946K (64.1%): byte[]: 40583 / 100% of empty 2,597,946K (64.1%)
> ↖org.apache.spark.network.util.ByteArrayWritableChannel.data
> ↖org.apache.spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel
> ↖io.netty.channel.ChannelOutboundBuffer$Entry.msg
> ↖io.netty.channel.ChannelOutboundBuffer$Entry.{next}
> ↖io.netty.channel.ChannelOutboundBuffer.flushedEntry
> ↖io.netty.channel.socket.nio.NioSocketChannel$NioSocketChannelUnsafe.outboundBuffer
> ↖io.netty.channel.socket.nio.NioSocketChannel.unsafe
> ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.associatedChannel
> ↖{java.util.concurrent.ConcurrentHashMap}.values
> ↖org.apache.spark.network.server.OneForOneStreamManager.streams
> ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager
> ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler
> ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance{code}
>  
> Checking the code of {{SaslEncryption$EncryptedMessage}}, I see that 
> byteChannel is always initialized eagerly in the constructor:
> {code:java}
> this.byteChannel = new ByteArrayWritableChannel(maxOutboundBlockSize);{code}
> So I think to address the problem of empty byte[] arrays flooding the memory, 
> we should initialize {{byteChannel}} lazily, upon the first use. As far as I 
> can see, it's used only in one method, {{private void nextChunk()}}.
>  
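
To make the proposed change concrete, a conceptual Scala sketch of the lazy-initialization idea; the real {{EncryptedMessage}} class is Java, and names other than {{ByteArrayWritableChannel}} and {{nextChunk()}} are illustrative:
{code:scala}
import org.apache.spark.network.util.ByteArrayWritableChannel

// Conceptual sketch only: allocate the channel on first use instead of eagerly in
// the constructor, so messages that are queued but never written hold no buffer.
class EncryptedMessageSketch(maxOutboundBlockSize: Int) {
  private var byteChannel: ByteArrayWritableChannel = _

  private def channel(): ByteArrayWritableChannel = {
    if (byteChannel == null) {
      byteChannel = new ByteArrayWritableChannel(maxOutboundBlockSize)
    }
    byteChannel
  }

  private def nextChunk(): Unit = {
    val ch = channel()   // first use triggers the allocation
    // ... fill `ch` from the underlying message, as the real nextChunk() does ...
  }
}
{code}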



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24071) Micro-benchmark of Parquet Filter Pushdown

2018-07-17 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24071.
-
   Resolution: Duplicate
Fix Version/s: 2.4.0

> Micro-benchmark of Parquet Filter Pushdown
> --
>
> Key: SPARK-24071
> URL: https://issues.apache.org/jira/browse/SPARK-24071
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Priority: Major
> Fix For: 2.4.0
>
>
> Need a micro-benchmark suite for Parquet filter pushdown



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24070) TPC-DS Performance Tests for Parquet 1.10.0 Upgrade

2018-07-17 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24070.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

> TPC-DS Performance Tests for Parquet 1.10.0 Upgrade
> ---
>
> Key: SPARK-24070
> URL: https://issues.apache.org/jira/browse/SPARK-24070
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 2.4.0
>
>
> TPC-DS performance evaluation of Apache Spark with Parquet 1.8.2 and 1.10.0. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24402) Optimize `In` expression when only one element in the collection or collection is empty

2018-07-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546920#comment-16546920
 ] 

Apache Spark commented on SPARK-24402:
--

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/21797

> Optimize `In` expression when only one element in the collection or 
> collection is empty 
> 
>
> Key: SPARK-24402
> URL: https://issues.apache.org/jira/browse/SPARK-24402
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>
> Two new rules in the logical plan optimizers are added.
> # When there is only one element in the *{{Collection}}*, the physical plan
> will be optimized to *{{EqualTo}}*, so predicate pushdown can be used.
> {code}
>  profileDF.filter( $"profileID".isInCollection(Set(6))).explain(true)
>  """
> |== Physical Plan ==|
> |*(1) Project [profileID#0|#0]|
> |+- *(1) Filter (isnotnull(profileID#0) && (profileID#0 = 6))|
> |+- *(1) FileScan parquet [profileID#0|#0] Batched: true, Format: Parquet,|
> |PartitionFilters: [],|
> |PushedFilters: [IsNotNull(profileID), EqualTo(profileID,6)],|
> |ReadSchema: struct
>  """.stripMargin
> {code}
> # When the *{{Set}}* is empty, and the input is nullable, the logical
> plan will be simplified to
> {code}
>  profileDF.filter( $"profileID".isInCollection(Set())).explain(true)
>  """
> |== Optimized Logical Plan ==|
> |Filter if (isnull(profileID#0)) null else false|
> |+- Relation[profileID#0|#0] parquet
>  """.stripMargin
> {code}
> TODO:
>  # For multiple conditions with numbers less than certain thresholds,
> we should still allow predicate pushdown.
>  # Optimize the `In` using tableswitch or lookupswitch when the
> numbers of the categories are low, and they are `Int`, `Long`.
>  # The default immutable hash trees set is slow for query, and we
> should do benchmark for using different set implementation for faster
> query.
>  # `filter(if (condition) null else false)` can be optimized to false.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24801) Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can waste a lot of memory

2018-07-17 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546916#comment-16546916
 ] 

Imran Rashid commented on SPARK-24801:
--

I don't think the messages themselves are actually empty, it's just that the 
byte buffer is empty until the {{transferTo}} call, as you mentioned earlier.  
Netty can buffer these messages up after spark calls 
{{ChannelHandlerContext.write(msg)}}, but I can't believe there are so many of 
them.  I dunno if you can see anything unusual about the {{buf}} or {{region}} 
field of those encrypted messages and if they are all empty.

But in any case, I guess what you're proposing is a simple fix in the meantime 
before we figure that out.  Would you like to submit a pr?

> Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can 
> waste a lot of memory
> ---
>
> Key: SPARK-24801
> URL: https://issues.apache.org/jira/browse/SPARK-24801
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Misha Dmitriev
>Priority: Major
>
> I recently analyzed another Yarn NM heap dump with jxray 
> ([www.jxray.com|http://www.jxray.com]) and found that 81% of memory is 
> wasted by empty (all zeroes) byte[] arrays. Most of these arrays are 
> referenced by 
> {{org.apache.spark.network.util.ByteArrayWritableChannel.data}}, and these in 
> turn come from 
> {{spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel}}. Here is 
> the full reference chain that leads to the problematic arrays:
> {code:java}
> 2,597,946K (64.1%): byte[]: 40583 / 100% of empty 2,597,946K (64.1%)
> ↖org.apache.spark.network.util.ByteArrayWritableChannel.data
> ↖org.apache.spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel
> ↖io.netty.channel.ChannelOutboundBuffer$Entry.msg
> ↖io.netty.channel.ChannelOutboundBuffer$Entry.{next}
> ↖io.netty.channel.ChannelOutboundBuffer.flushedEntry
> ↖io.netty.channel.socket.nio.NioSocketChannel$NioSocketChannelUnsafe.outboundBuffer
> ↖io.netty.channel.socket.nio.NioSocketChannel.unsafe
> ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.associatedChannel
> ↖{java.util.concurrent.ConcurrentHashMap}.values
> ↖org.apache.spark.network.server.OneForOneStreamManager.streams
> ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager
> ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler
> ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance{code}
>  
> Checking the code of {{SaslEncryption$EncryptedMessage}}, I see that 
> byteChannel is always initialized eagerly in the constructor:
> {code:java}
> this.byteChannel = new ByteArrayWritableChannel(maxOutboundBlockSize);{code}
> So, to address the problem of empty byte[] arrays flooding the memory, I think 
> we should initialize {{byteChannel}} lazily, upon first use. As far as I can 
> see, it is used in only one method, {{private void nextChunk()}}.
>  
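
A minimal Scala sketch of the lazy-initialization idea (the real class is Java; the class shape here is illustrative, and only {{ByteArrayWritableChannel}} is taken from the report above):

{code:scala}
import org.apache.spark.network.util.ByteArrayWritableChannel

// Sketch only: the buffer is allocated on first use instead of eagerly in the
// constructor, so a message that never reaches nextChunk() holds no byte[].
class EncryptedMessageSketch(maxOutboundBlockSize: Int) {

  private lazy val byteChannel = new ByteArrayWritableChannel(maxOutboundBlockSize)

  private def nextChunk(): Unit = {
    val channel = byteChannel // first access triggers the allocation
    // ... encrypt the next block into `channel` and emit it ...
  }
}
{code}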



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19680) Offsets out of range with no configured reset policy for partitions

2018-07-17 Thread Gayathiri Duraikannu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546891#comment-16546891
 ] 

Gayathiri Duraikannu commented on SPARK-19680:
--

Ours is a framework that multiple consumers use to stream data, and we have 
thousands of topics to consume data from. Adding the start offset explicitly 
wouldn't work for all of our use cases. Is there any alternative approach?

> Offsets out of range with no configured reset policy for partitions
> ---
>
> Key: SPARK-19680
> URL: https://issues.apache.org/jira/browse/SPARK-19680
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Schakmann Rene
>Priority: Major
>
> I'm using Spark Streaming with Kafka to build a toplist. I want to 
> read all the messages in Kafka, so I set
>    "auto.offset.reset" -> "earliest"
> Nevertheless, when I start the job on our Spark cluster it does not work and I 
> get:
> Error:
> {code:title=error.log|borderStyle=solid}
>   Job aborted due to stage failure: Task 2 in stage 111.0 failed 4 times, 
> most recent failure: Lost task 2.3 in stage 111.0 (TID 1270, 194.232.55.23, 
> executor 2): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: 
> Offsets out of range with no configured reset policy for partitions: 
> {SearchEvents-2=161803385}
> {code}
> This seems wrong, because I did set the auto.offset.reset property.
> Setup:
> Kafka Parameter:
> {code:title=Config.Scala|borderStyle=solid}
>   def getDefaultKafkaReceiverParameter(properties: Properties):Map[String, 
> Object] = {
> Map(
>   "bootstrap.servers" -> 
> properties.getProperty("kafka.bootstrap.servers"),
>   "group.id" -> properties.getProperty("kafka.consumer.group"),
>   "auto.offset.reset" -> "earliest",
>   "spark.streaming.kafka.consumer.cache.enabled" -> "false",
>   "enable.auto.commit" -> "false",
>   "key.deserializer" -> classOf[StringDeserializer],
>   "value.deserializer" -> "at.willhaben.sid.DTOByteDeserializer")
>   }
> {code}
> Job:
> {code:title=Job.Scala|borderStyle=solid}
>   def processSearchKeyWords(stream: InputDStream[ConsumerRecord[String, 
> Array[Byte]]], windowDuration: Int, slideDuration: Int, kafkaSink: 
> Broadcast[KafkaSink[TopList]]): Unit = {
> getFilteredStream(stream.map(_.value()), windowDuration, 
> slideDuration).foreachRDD(rdd => {
>   val topList = new TopList
>   topList.setCreated(new Date())
>   topList.setTopListEntryList(rdd.take(TopListLength).toList)
>   CurrentLogger.info("TopList length: " + 
> topList.getTopListEntryList.size().toString)
>   kafkaSink.value.send(SendToTopicName, topList)
>   CurrentLogger.info("Last Run: " + System.currentTimeMillis())
> })
>   }
>   def getFilteredStream(result: DStream[Array[Byte]], windowDuration: Int, 
> slideDuration: Int): DStream[TopListEntry] = {
> val Mapper = MapperObject.readerFor[SearchEventDTO]
> result.repartition(100).map(s => Mapper.readValue[SearchEventDTO](s))
>   .filter(s => s != null && s.getSearchRequest != null && 
> s.getSearchRequest.getSearchParameters != null && s.getVertical == 
> Vertical.BAP && 
> s.getSearchRequest.getSearchParameters.containsKey(EspParameterEnum.KEYWORD.getName))
>   .map(row => {
> val name = 
> row.getSearchRequest.getSearchParameters.get(EspParameterEnum.KEYWORD.getName).getEspSearchParameterDTO.getValue.toLowerCase()
> (name, new TopListEntry(name, 1, row.getResultCount))
>   })
>   .reduceByKeyAndWindow(
> (a: TopListEntry, b: TopListEntry) => new TopListEntry(a.getKeyword, 
> a.getSearchCount + b.getSearchCount, a.getMeanSearchHits + 
> b.getMeanSearchHits),
> (a: TopListEntry, b: TopListEntry) => new TopListEntry(a.getKeyword, 
> a.getSearchCount - b.getSearchCount, a.getMeanSearchHits - 
> b.getMeanSearchHits),
> Minutes(windowDuration),
> Seconds(slideDuration))
>   .filter((x: (String, TopListEntry)) => x._2.getSearchCount > 200L)
>   .map(row => (row._2.getSearchCount, row._2))
>   .transform(rdd => rdd.sortByKey(ascending = false))
>   .map(row => new TopListEntry(row._2.getKeyword, row._2.getSearchCount, 
> row._2.getMeanSearchHits / row._2.getSearchCount))
>   }
>   def main(properties: Properties): Unit = {
> val sparkSession = SparkUtil.getDefaultSparkSession(properties, TaskName)
> val kafkaSink = 
> sparkSession.sparkContext.broadcast(KafkaSinkUtil.apply[TopList](SparkUtil.getDefaultSparkProperties(properties)))
> val kafkaParams: Map[String, Object] = 
> SparkUtil.getDefaultKafkaReceiverParameter(properties)
> val ssc = new StreamingContext(sparkSession.sparkContext, Seconds(30))
> 

[jira] [Resolved] (SPARK-21590) Structured Streaming window start time should support negative values to adjust time zone

2018-07-17 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21590.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 18903
[https://github.com/apache/spark/pull/18903]

> Structured Streaming window start time should support negative values to 
> adjust time zone
> -
>
> Key: SPARK-21590
> URL: https://issues.apache.org/jira/browse/SPARK-21590
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.2.0
> Environment: spark 2.2.0
>Reporter: Kevin Zhang
>Assignee: Kevin Zhang
>Priority: Major
>  Labels: spark-sql, spark2.2, streaming, structured, timezone, 
> window
> Fix For: 2.4.0
>
>
> I want to calculate a (unique) daily access count using Structured Streaming 
> (2.2.0). 
> Currently, Structured Streaming's window with a 1-day duration starts at 
> 00:00:00 UTC and ends at 23:59:59 UTC each day, but my local timezone is CST 
> (UTC + 8 hours) and I
> want the date boundaries to be 00:00:00 CST (that is, 00:00:00 UTC - 8 hours). 
> In Flink I can set the window offset to -8 hours to achieve this, but here in 
> Structured Streaming, if I set the start time (equivalent to the offset in 
> Flink) to -8 or any other negative value, I get the following error:
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve 'timewindow(timestamp, 864, 864, -288)' due 
> to data type mismatch: The start time (-288) must be greater than or 
> equal to 0.;;
> {code}
> because the time window checks the input parameters to guarantee each value 
> is greater than or equal to 0.
> So I'm wondering whether we can remove the restriction that the start time 
> cannot be negative.
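
For anyone hitting this before the fix, a hedged workaround sketch (the {{events}} DataFrame and {{timestamp}} column are assumptions): since the start time is effectively applied modulo the slide duration, an offset of "16 hours" aligns a 1-day window to 00:00 CST (16:00 UTC), the same boundary "-8 hours" would give.

{code:scala}
import org.apache.spark.sql.functions.{col, count, window}

// Sketch only: daily access counts with windows aligned to 00:00 CST (UTC+8).
// On releases where startTime must be non-negative, "16 hours" is equivalent
// to the requested "-8 hours" offset for a 1-day window.
val dailyCounts = events
  .groupBy(window(col("timestamp"), "1 day", "1 day", "16 hours"))
  .agg(count("*").as("accessCount"))
{code}

Once this change is available, passing "-8 hours" as the start time should work directly.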



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21590) Structured Streaming window start time should support negative values to adjust time zone

2018-07-17 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-21590:
-

Assignee: Kevin Zhang

> Structured Streaming window start time should support negative values to 
> adjust time zone
> -
>
> Key: SPARK-21590
> URL: https://issues.apache.org/jira/browse/SPARK-21590
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.2.0
> Environment: spark 2.2.0
>Reporter: Kevin Zhang
>Assignee: Kevin Zhang
>Priority: Major
>  Labels: spark-sql, spark2.2, streaming, structured, timezone, 
> window
> Fix For: 2.4.0
>
>
> I want to calculate a (unique) daily access count using Structured Streaming 
> (2.2.0). 
> Currently, Structured Streaming's window with a 1-day duration starts at 
> 00:00:00 UTC and ends at 23:59:59 UTC each day, but my local timezone is CST 
> (UTC + 8 hours) and I
> want the date boundaries to be 00:00:00 CST (that is, 00:00:00 UTC - 8 hours). 
> In Flink I can set the window offset to -8 hours to achieve this, but here in 
> Structured Streaming, if I set the start time (equivalent to the offset in 
> Flink) to -8 or any other negative value, I get the following error:
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve 'timewindow(timestamp, 864, 864, -288)' due 
> to data type mismatch: The start time (-288) must be greater than or 
> equal to 0.;;
> {code}
> because the time window checks the input parameters to guarantee each value 
> is greater than or equal to 0.
> So I'm wondering whether we can remove the restriction that the start time 
> cannot be negative.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9850) Adaptive execution in Spark

2018-07-17 Thread Michail Giannakopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546840#comment-16546840
 ] 

Michail Giannakopoulos commented on SPARK-9850:
---

Hello [~yhuai]! Are people currently working on this Epic? In other words, is 
this work in progress, or have you decided to put it on hold?
I am asking because I recently logged an issue related to adaptive execution 
(SPARK-24826). It would be nice to know if you are working on this actively, 
since it greatly reduces the number of partitions during shuffles when executing 
SQL queries (one of the main bottlenecks for Spark). Thanks a lot!

> Adaptive execution in Spark
> ---
>
> Key: SPARK-9850
> URL: https://issues.apache.org/jira/browse/SPARK-9850
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core, SQL
>Reporter: Matei Zaharia
>Assignee: Yin Huai
>Priority: Major
> Attachments: AdaptiveExecutionInSpark.pdf
>
>
> Query planning is one of the main factors in high performance, but the 
> current Spark engine requires the execution DAG for a job to be set in 
> advance. Even with cost­-based optimization, it is hard to know the behavior 
> of data and user-defined functions well enough to always get great execution 
> plans. This JIRA proposes to add adaptive query execution, so that the engine 
> can change the plan for each query as it sees what data earlier stages 
> produced.
> We propose adding this to Spark SQL / DataFrames first, using a new API in 
> the Spark engine that lets libraries run DAGs adaptively. In future JIRAs, 
> the functionality could be extended to other libraries or the RDD API, but 
> that is more difficult than adding it in SQL.
> I've attached a design doc by Yin Huai and myself explaining how it would 
> work in more detail.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24833) Allow specifying Kubernetes host name aliases in the pod specs

2018-07-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24833:


Assignee: (was: Apache Spark)

> Allow specifying Kubernetes host name aliases in the pod specs
> --
>
> Key: SPARK-24833
> URL: https://issues.apache.org/jira/browse/SPARK-24833
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.1
>Reporter: Rob Vesse
>Priority: Major
>
> For some workloads you would like to allow Spark executors to access external 
> services using host name aliases. Currently there is no way to specify host 
> name aliases 
> (https://kubernetes.io/docs/concepts/services-networking/add-entries-to-pod-etc-hosts-with-host-aliases/)
>  for the pods that Spark generates, and pod presets cannot currently be used 
> to add these at admission time (pod presets are also still an alpha feature, 
> so they are not guaranteed to be usable on any given cluster).
> Since Spark on K8S already allows secrets and volumes to be mounted via Spark 
> configuration, it should be fairly easy to use the same approach to include 
> host name aliases.
> I will look at opening a PR for this in the next couple of days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24833) Allow specifying Kubernetes host name aliases in the pod specs

2018-07-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24833:


Assignee: Apache Spark

> Allow specifying Kubernetes host name aliases in the pod specs
> --
>
> Key: SPARK-24833
> URL: https://issues.apache.org/jira/browse/SPARK-24833
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.1
>Reporter: Rob Vesse
>Assignee: Apache Spark
>Priority: Major
>
> For some workloads you would like to allow Spark executors to access external 
> services using host name aliases. Currently there is no way to specify host 
> name aliases 
> (https://kubernetes.io/docs/concepts/services-networking/add-entries-to-pod-etc-hosts-with-host-aliases/)
>  for the pods that Spark generates, and pod presets cannot currently be used 
> to add these at admission time (pod presets are also still an alpha feature, 
> so they are not guaranteed to be usable on any given cluster).
> Since Spark on K8S already allows secrets and volumes to be mounted via Spark 
> configuration, it should be fairly easy to use the same approach to include 
> host name aliases.
> I will look at opening a PR for this in the next couple of days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24833) Allow specifying Kubernetes host name aliases in the pod specs

2018-07-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546798#comment-16546798
 ] 

Apache Spark commented on SPARK-24833:
--

User 'rvesse' has created a pull request for this issue:
https://github.com/apache/spark/pull/21796

> Allow specifying Kubernetes host name aliases in the pod specs
> --
>
> Key: SPARK-24833
> URL: https://issues.apache.org/jira/browse/SPARK-24833
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.1
>Reporter: Rob Vesse
>Priority: Major
>
> For some workloads you would like to allow Spark executors to access external 
> services using host name aliases. Currently there is no way to specify host 
> name aliases 
> (https://kubernetes.io/docs/concepts/services-networking/add-entries-to-pod-etc-hosts-with-host-aliases/)
>  for the pods that Spark generates, and pod presets cannot currently be used 
> to add these at admission time (pod presets are also still an alpha feature, 
> so they are not guaranteed to be usable on any given cluster).
> Since Spark on K8S already allows secrets and volumes to be mounted via Spark 
> configuration, it should be fairly easy to use the same approach to include 
> host name aliases.
> I will look at opening a PR for this in the next couple of days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24305) Avoid serialization of private fields in new collection expressions

2018-07-17 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24305.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21352
[https://github.com/apache/spark/pull/21352]

> Avoid serialization of private fields in new collection expressions
> ---
>
> Key: SPARK-24305
> URL: https://issues.apache.org/jira/browse/SPARK-24305
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Marek Novotny
>Assignee: Marek Novotny
>Priority: Minor
> Fix For: 2.4.0
>
>
> Make sure that private fields of expression case classes in 
> _collectionOperations.scala_ are not serialized.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24305) Avoid serialization of private fields in new collection expressions

2018-07-17 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-24305:
---

Assignee: Marek Novotny

> Avoid serialization of private fields in new collection expressions
> ---
>
> Key: SPARK-24305
> URL: https://issues.apache.org/jira/browse/SPARK-24305
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Marek Novotny
>Assignee: Marek Novotny
>Priority: Minor
>
> Make sure that private fields of expression case classes in 
> _collectionOperations.scala_ are not serialized.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24165) UDF within when().otherwise() raises NullPointerException

2018-07-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546750#comment-16546750
 ] 

Apache Spark commented on SPARK-24165:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/21795

> UDF within when().otherwise() raises NullPointerException
> -
>
> Key: SPARK-24165
> URL: https://issues.apache.org/jira/browse/SPARK-24165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jingxuan Wang
>Assignee: Marek Novotny
>Priority: Major
> Fix For: 2.4.0
>
>
> I have a UDF that takes java.sql.Timestamp and String as input column types 
> and returns an Array of (Seq[case class], Double) as output. Since some values 
> in the input columns can be null, I put the UDF inside a 
> when($input.isNull, null).otherwise(UDF) expression. The function works well 
> when I test it in the Spark shell, but when it runs as a Scala jar via 
> spark-submit in YARN cluster mode, it raises a NullPointerException that 
> points to the UDF function. If I remove the when().otherwise() condition and 
> put the null check inside the UDF instead, the function works without issue 
> in spark-submit.
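
For concreteness, a self-contained Scala sketch of the two patterns described above (the data, column names, and UDF body are invented for illustration; the real UDF returns an Array of (Seq[case class], Double)):

{code:scala}
import java.sql.Timestamp

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, udf, when}

object WhenOtherwiseUdfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical data: the second row has a null timestamp.
    val df = Seq(
      (Some(Timestamp.valueOf("2018-07-17 00:00:00")), "abc"),
      (None: Option[Timestamp], "def")
    ).toDF("ts", "s")

    // Pattern reported to fail with a NullPointerException on YARN:
    // the null guard sits outside the UDF.
    val score = udf((ts: Timestamp, s: String) => s.length.toDouble)
    df.withColumn("score",
      when(col("ts").isNull, lit(null)).otherwise(score(col("ts"), col("s")))).show()

    // Reported workaround: handle nulls inside the UDF itself.
    val safeScore = udf((ts: Timestamp, s: String) =>
      if (ts == null || s == null) None else Some(s.length.toDouble))
    df.withColumn("score", safeScore(col("ts"), col("s"))).show()

    spark.stop()
  }
}
{code}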



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24834) Utils#nanSafeCompare{Double,Float} functions do not differ from normal java double/float comparison

2018-07-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24834:


Assignee: Apache Spark

> Utils#nanSafeCompare{Double,Float} functions do not differ from normal java 
> double/float comparison
> ---
>
> Key: SPARK-24834
> URL: https://issues.apache.org/jira/browse/SPARK-24834
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Benjamin Duffield
>Assignee: Apache Spark
>Priority: Minor
>
> Utils.scala contains two functions `nanSafeCompareDoubles` and 
> `nanSafeCompareFloats` which purport to have special handling of NaN values 
> in comparisons.
> The handling in these functions does not appear to differ from 
> java.lang.Double.compare and java.lang.Float.compare - they seem to produce 
> output identical to the built-in Java comparison functions.
> I think it would be clearer not to have these special Utils functions and to 
> just use the standard Java comparison functions instead.
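
For reference, a quick Scala check of the built-in comparison proposed as the replacement (this exercises only java.lang.Double.compare, not Spark's Utils code): NaN compares equal to itself and greater than every other value, including positive infinity.

{code:scala}
// java.lang.Double.compare treats NaN as equal to NaN and greater than
// everything else, matching the behaviour nanSafeCompareDoubles documents.
println(java.lang.Double.compare(Double.NaN, Double.NaN))              // 0
println(java.lang.Double.compare(Double.NaN, 1.0))                     // 1
println(java.lang.Double.compare(1.0, Double.NaN))                     // -1
println(java.lang.Double.compare(Double.NaN, Double.PositiveInfinity)) // 1
{code}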



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


