[jira] [Updated] (SPARK-24828) Incompatible parquet formats - java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
[ https://issues.apache.org/jira/browse/SPARK-24828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-24828:
    Attachment: image-2018-07-18-13-57-21-148.png

> Incompatible parquet formats - java.lang.UnsupportedOperationException:
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24828
>                 URL: https://issues.apache.org/jira/browse/SPARK-24828
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>         Environment: Environment for creating the parquet file:
> IBM Watson Studio Apache Spark Service, V2.1.2
> Environment for reading the parquet file:
> java version "1.8.0_144"
> Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
> MacOSX 10.13.3 (17D47)
> Spark spark-2.1.2-bin-hadoop2.7 directly obtained from
> http://spark.apache.org/downloads.html
>            Reporter: Romeo Kienzer
>            Priority: Minor
>         Attachments: a2_m2.parquet.zip, image-2018-07-18-13-57-21-148.png
>
> As requested by [~hyukjin.kwon], here is a new issue - the related issue can be found here.
>
> Using the attached parquet file written by one Spark installation, reading it with an
> installation obtained directly from [http://spark.apache.org/downloads.html] yields the
> following exception:
>
> 18/07/17 07:40:38 ERROR Executor: Exception in task 3.0 in stage 1.0 (TID 4)
> scala.MatchError: [1.0,null] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
>     at org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
>     at org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>     at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:193)
>     at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>     at org.apache.spark.scheduler.Task.run(Task.scala:99)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> 18/07/17 07:40:38 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
>     at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
>     at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>     at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
>     at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>     at org.apache.spark.scheduler.Task.run(Task.scala:99)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
>
> The file is attached: [^a2_m2.parquet.zip]
>
> The following code reproduces the error:
> df = spark.read.parquet('a2_m2.parquet')
> from pyspark.ml.evaluation import
[jira] [Comment Edited] (SPARK-24828) Incompatible parquet formats - java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
[ https://issues.apache.org/jira/browse/SPARK-24828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547418#comment-16547418 ]

Romeo Kienzer edited comment on SPARK-24828 at 7/18/18 5:53 AM:
----------------------------------------------------------------

Dear [~q79969786] - thanks for pointing this out.

[~hyukjin.kwon] - I've created the following code, casting the label field from Integer to Long, but it results in the same error:

from pyspark.sql.types import LongType

df = spark.read.parquet('a2_m2.parquet')
df_hat = df \
    .withColumn("label_tmp", df["label"].cast(LongType())) \
    .drop('label') \
    .withColumnRenamed('label_tmp', 'label')

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

binEval = MulticlassClassificationEvaluator() \
    .setMetricName("accuracy") \
    .setPredictionCol("prediction") \
    .setLabelCol("label")

accuracy = binEval.evaluate(df_hat)

was (Author: romeokienzler):

Dear [~q79969786] - thanks for pointing this out.

[~hyukjin.kwon] - I've created the following code, casting the label field from Integer to Long, but it results in the same error:

from pyspark.sql.types import LongType

df = spark.read.parquet('a2_m2.parquet')
df_cast = df.withColumn("label_tmp", df["label"].cast(LongType()))
df_hat = df_cast.drop('label').withColumnRenamed('label_tmp', 'label')

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

binEval = MulticlassClassificationEvaluator() \
    .setMetricName("accuracy") \
    .setPredictionCol("prediction") \
    .setLabelCol("label")

accuracy = binEval.evaluate(df_hat)
[jira] [Commented] (SPARK-24828) Incompatible parquet formats - java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
[ https://issues.apache.org/jira/browse/SPARK-24828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547418#comment-16547418 ]

Romeo Kienzer commented on SPARK-24828:
---------------------------------------

Dear [~q79969786] - thanks for pointing this out.

[~hyukjin.kwon] - I've created the following code, casting the label field from Integer to Long, but it results in the same error:

from pyspark.sql.types import LongType

df = spark.read.parquet('a2_m2.parquet')
df_cast = df.withColumn("label_tmp", df["label"].cast(LongType()))
df_hat = df_cast.drop('label').withColumnRenamed('label_tmp', 'label')

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

binEval = MulticlassClassificationEvaluator() \
    .setMetricName("accuracy") \
    .setPredictionCol("prediction") \
    .setLabelCol("label")

accuracy = binEval.evaluate(df_hat)
[jira] [Assigned] (SPARK-24840) do not use dummy filter to switch codegen on/off
[ https://issues.apache.org/jira/browse/SPARK-24840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-24840:
    Assignee: Wenchen Fan  (was: Apache Spark)

> do not use dummy filter to switch codegen on/off
> -------------------------------------------------
>
>                 Key: SPARK-24840
>                 URL: https://issues.apache.org/jira/browse/SPARK-24840
>             Project: Spark
>          Issue Type: Test
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>            Priority: Major
[jira] [Assigned] (SPARK-24840) do not use dummy filter to switch codegen on/off
[ https://issues.apache.org/jira/browse/SPARK-24840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-24840:
    Assignee: Apache Spark  (was: Wenchen Fan)
[jira] [Commented] (SPARK-24840) do not use dummy filter to switch codegen on/off
[ https://issues.apache.org/jira/browse/SPARK-24840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547387#comment-16547387 ]

Apache Spark commented on SPARK-24840:
--------------------------------------

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/21795
[jira] [Created] (SPARK-24840) do not use dummy filter to switch codegen on/off
Wenchen Fan created SPARK-24840:
-----------------------------------

             Summary: do not use dummy filter to switch codegen on/off
                 Key: SPARK-24840
                 URL: https://issues.apache.org/jira/browse/SPARK-24840
             Project: Spark
          Issue Type: Test
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Wenchen Fan
            Assignee: Wenchen Fan
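For context, a minimal sketch of the alternative the title points at, assuming the goal is simply to run the same query with and without whole-stage code generation: the spark.sql.codegen.wholeStage configuration can be toggled directly instead of injecting a dummy filter into the plan. This is only an illustrative sketch, not the change made in the associated pull request.

{code:scala}
import org.apache.spark.sql.SparkSession

object CodegenToggleSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("codegen-toggle").getOrCreate()
    import spark.implicits._

    val df = Seq(1, 2, 3).toDF("v")

    // Run the same query once with whole-stage codegen enabled and once disabled,
    // instead of wrapping the plan in a dummy filter to break codegen.
    Seq(true, false).foreach { enabled =>
      spark.conf.set("spark.sql.codegen.wholeStage", enabled)
      df.selectExpr("v + 1").collect()
    }
    spark.stop()
  }
}
{code}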
[jira] [Comment Edited] (SPARK-24809) Serializing LongHashedRelation in executor may result in data error
[ https://issues.apache.org/jira/browse/SPARK-24809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547337#comment-16547337 ]

zenglinxi edited comment on SPARK-24809 at 7/18/18 4:10 AM:
------------------------------------------------------------

[^Spark LongHashedRelation serialization.svg]

I think it's a hidden but critical bug that may cause data errors.

{code:java}
// code in HashedRelation.scala
private def write(
    writeBoolean: (Boolean) => Unit,
    writeLong: (Long) => Unit,
    writeBuffer: (Array[Byte], Int, Int) => Unit): Unit = {
  writeBoolean(isDense)
  writeLong(minKey)
  writeLong(maxKey)
  writeLong(numKeys)
  writeLong(numValues)
  writeLong(numKeyLookups)
  writeLong(numProbes)

  writeLong(array.length)
  writeLongArray(writeBuffer, array, array.length)
  val used = ((cursor - Platform.LONG_ARRAY_OFFSET) / 8).toInt
  writeLong(used)
  writeLongArray(writeBuffer, page, used)
}
{code}

This write function in HashedRelation.scala is called when the executor does not have enough memory for the LongToUnsafeRowMap that holds the broadcast table's data. However, the value of cursor on the executor may never have changed after initialization:

{code:java}
// code in HashedRelation.scala
private var cursor: Long = Platform.LONG_ARRAY_OFFSET
{code}

which makes the value of "used" in the write function zero when the map is written to disk. Deserializing that on-disk data then yields a wrong pointer, and we may finally get wrong data from the broadcast join.

> Serializing LongHashedRelation in executor may result in data error
> --------------------------------------------------------------------
>
>                 Key: SPARK-24809
>                 URL: https://issues.apache.org/jira/browse/SPARK-24809
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
>         Environment: Spark 2.2.1
> hadoop 2.7.1
>            Reporter: Lijia Liu
>            Priority: Critical
>         Attachments: Spark LongHashedRelation serialization.svg
>
> When the join key is long or int in a broadcast join, Spark will use
> LongHashedRelation as the broadcast value. For details, see SPARK-14419. But if
> the broadcast value is abnormally big, the executor will serialize it to disk,
> and data will be lost during serialization.
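To make the described failure mode concrete, here is a small self-contained sketch of the arithmetic above. The offset constant is only a stand-in for Platform.LONG_ARRAY_OFFSET (an assumption for illustration); the point is that if cursor is still at its initial value on the executor, "used" evaluates to 0 and the page would be written out empty.

{code:scala}
object UsedLengthSketch {
  // Stand-in for Platform.LONG_ARRAY_OFFSET; the concrete value is JVM-specific
  // and irrelevant here -- only the relative arithmetic matters.
  val LONG_ARRAY_OFFSET: Long = 16L

  // Mirrors `((cursor - Platform.LONG_ARRAY_OFFSET) / 8).toInt` from the write path.
  def usedWords(cursor: Long): Int = ((cursor - LONG_ARRAY_OFFSET) / 8).toInt

  def main(args: Array[String]): Unit = {
    // Freshly initialised map whose cursor never advanced: nothing would be serialized.
    println(usedWords(LONG_ARRAY_OFFSET))         // 0
    // Map that actually appended 5 longs into its page.
    println(usedWords(LONG_ARRAY_OFFSET + 5 * 8)) // 5
  }
}
{code}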
[jira] [Comment Edited] (SPARK-24809) Serializing LongHashedRelation in executor may result in data error
[ https://issues.apache.org/jira/browse/SPARK-24809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547337#comment-16547337 ]

zenglinxi edited comment on SPARK-24809 at 7/18/18 4:09 AM:
------------------------------------------------------------

[^Spark LongHashedRelation serialization.svg]

I think it's a hidden but critical bug that may cause data errors.

{code:java}
// code in HashedRelation.scala
private def write(
    writeBoolean: (Boolean) => Unit,
    writeLong: (Long) => Unit,
    writeBuffer: (Array[Byte], Int, Int) => Unit): Unit = {
  writeBoolean(isDense)
  writeLong(minKey)
  writeLong(maxKey)
  writeLong(numKeys)
  writeLong(numValues)
  writeLong(numKeyLookups)
  writeLong(numProbes)

  writeLong(array.length)
  writeLongArray(writeBuffer, array, array.length)
  val used = ((cursor - Platform.LONG_ARRAY_OFFSET) / 8).toInt
  writeLong(used)
  writeLongArray(writeBuffer, page, used)
}
{code}

This write function in HashedRelation.scala is called when the executor does not have enough memory for the LongToUnsafeRowMap that holds the broadcast table's data. However, the value of cursor on the executor may never have changed after initialization:

{code:java}
// code in HashedRelation.scala
private var cursor: Long = Platform.LONG_ARRAY_OFFSET
{code}

which makes the value of "used" in the write function zero when the map is written to disk. Deserializing that on-disk data then yields a wrong pointer, and we may finally get wrong data from the broadcast join.

was (Author: gostop_zlx):

[^Spark LongHashedRelation serialization.svg] I think it's a hidden but critical bug that may cause data errors.
[jira] [Commented] (SPARK-24768) Have a built-in AVRO data source implementation
[ https://issues.apache.org/jira/browse/SPARK-24768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547346#comment-16547346 ]

Apache Spark commented on SPARK-24768:
--------------------------------------

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/21801

> Have a built-in AVRO data source implementation
> ------------------------------------------------
>
>                 Key: SPARK-24768
>                 URL: https://issues.apache.org/jira/browse/SPARK-24768
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Gengliang Wang
>            Priority: Major
>         Attachments: Built-in AVRO Data Source In Spark 2.4.pdf
>
> Apache Avro (https://avro.apache.org) is a popular data serialization format.
> It is widely used in the Spark and Hadoop ecosystem, especially for
> Kafka-based data pipelines. Using the external package
> [https://github.com/databricks/spark-avro], Spark SQL can read and write Avro
> data. Making spark-avro built-in can provide a better experience for
> first-time users of Spark SQL and Structured Streaming, and we expect the
> built-in Avro data source to further improve the adoption of Structured
> Streaming. The proposal is to inline the code from the spark-avro package
> ([https://github.com/databricks/spark-avro]). The target release is Spark 2.4.
[jira] [Commented] (SPARK-24386) implement continuous processing coalesce(1)
[ https://issues.apache.org/jira/browse/SPARK-24386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547345#comment-16547345 ]

Apache Spark commented on SPARK-24386:
--------------------------------------

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/21801

> implement continuous processing coalesce(1)
> --------------------------------------------
>
>                 Key: SPARK-24386
>                 URL: https://issues.apache.org/jira/browse/SPARK-24386
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Structured Streaming
>    Affects Versions: 2.4.0
>            Reporter: Jose Torres
>            Assignee: Jose Torres
>            Priority: Major
>             Fix For: 2.4.0
>
> [~marmbrus] suggested this as a good implementation checkpoint. If we do the
> shuffle reader and writer correctly, it should be easy to build a custom
> coalesce(1) execution for continuous processing on top of them, without having
> to implement the logic for shuffle writers to discover where the shuffle
> readers are located. (The coalesce(1) can just get the RpcEndpointRef directly
> from the reader and pass it to the writers.)
[jira] [Commented] (SPARK-24809) Serializing LongHashedRelation in executor may result in data error
[ https://issues.apache.org/jira/browse/SPARK-24809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547337#comment-16547337 ]

zenglinxi commented on SPARK-24809:
-----------------------------------

[^Spark LongHashedRelation serialization.svg] I think it's a hidden but critical bug that may cause data errors.
[jira] [Updated] (SPARK-24809) Serializing LongHashedRelation in executor may result in data error
[ https://issues.apache.org/jira/browse/SPARK-24809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zenglinxi updated SPARK-24809:
------------------------------
    Attachment: Spark LongHashedRelation serialization.svg
[jira] [Updated] (SPARK-24829) In Spark Thrift Server, CAST AS FLOAT inconsistent with spark-shell or spark-sql
[ https://issues.apache.org/jira/browse/SPARK-24829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zuotingbing updated SPARK-24829:
--------------------------------
    Summary: In Spark Thrift Server, CAST AS FLOAT inconsistent with spark-shell or spark-sql  (was: CAST AS FLOAT inconsistent with Hive)

> In Spark Thrift Server, CAST AS FLOAT inconsistent with spark-shell or
> spark-sql
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-24829
>                 URL: https://issues.apache.org/jira/browse/SPARK-24829
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: zuotingbing
>            Priority: Major
>         Attachments: 2018-07-18_110944.png, 2018-07-18_11.png
>
> SELECT CAST('4.56' AS FLOAT)
> the result is 4.55942779541, it should be 4.56
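For reference, a small standalone Scala snippet (independent of Spark) that reproduces this kind of value: widening the single-precision float 4.56f to a double yields 4.559999942779541, which is presumably what the Thrift Server client ends up displaying, while spark-shell and spark-sql render the value as a single-precision float.

{code:scala}
object FloatCastSketch {
  def main(args: Array[String]): Unit = {
    val f: Float = "4.56".toFloat
    println(f)          // 4.56 -- rendered as single precision
    println(f.toDouble) // 4.559999942779541 -- the same value widened to double
  }
}
{code}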
[jira] [Updated] (SPARK-24829) In Spark Thrift Server, CAST AS FLOAT inconsistent with spark-shell or spark-sql
[ https://issues.apache.org/jira/browse/SPARK-24829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zuotingbing updated SPARK-24829:
--------------------------------
    Attachment: (was: CAST-FLOAT.png)
[jira] [Updated] (SPARK-24829) In Spark Thrift Server, CAST AS FLOAT inconsistent with spark-shell or spark-sql
[ https://issues.apache.org/jira/browse/SPARK-24829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zuotingbing updated SPARK-24829:
--------------------------------
    Attachment: 2018-07-18_11.png
[jira] [Updated] (SPARK-24829) In Spark Thrift Server, CAST AS FLOAT inconsistent with spark-shell or spark-sql
[ https://issues.apache.org/jira/browse/SPARK-24829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zuotingbing updated SPARK-24829:
--------------------------------
    Attachment: 2018-07-18_110944.png
[jira] [Commented] (SPARK-24828) Incompatible parquet formats - java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
[ https://issues.apache.org/jira/browse/SPARK-24828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547299#comment-16547299 ]

Hyukjin Kwon commented on SPARK-24828:
--------------------------------------

Thanks [~q79969786]. Yea, currently you can't mix the types. [~romeokienzler], mind testing it after matching the schema?
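One way to "match the schema" before evaluating is sketched below, under the assumption that the mismatch is between part-files that store label as INT32 and INT64 respectively. The part-file names are taken from the parquet-tools listing later in this thread and are purely illustrative; this is a workaround sketch, not a fix for the reader itself.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object AlignLabelTypeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("align-label-type").getOrCreate()

    // Illustrative part-file paths inside the attached a2_m2.parquet directory.
    val paths = Seq(
      "a2_m2.parquet/part-0-1ff92a81-68c8-446b-a54e-a042a8fd7f1e.snappy.parquet",
      "a2_m2.parquet/part-0-50be1622-7a2c-43d2-b3ce-f0703ab8f458.snappy.parquet"
    )

    // Read each file separately so each one keeps its own footer schema,
    // then force `label` to a single type before combining.
    val aligned = paths
      .map(p => spark.read.parquet(p).withColumn("label", col("label").cast("long")))
      .reduce(_ union _) // column order is identical across these part-files

    // Rewrite with a consistent schema so later reads see one label type.
    aligned.write.mode("overwrite").parquet("a2_m2_aligned.parquet")
    spark.stop()
  }
}
{code}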
[jira] [Resolved] (SPARK-23998) It may be better to add @transient to field 'taskMemoryManager' in class Task, for it is only be set and used in executor side
[ https://issues.apache.org/jira/browse/SPARK-23998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

eaton resolved SPARK-23998.
---------------------------
    Resolution: Won't Do

> It may be better to add @transient to field 'taskMemoryManager' in class
> Task, for it is only set and used on the executor side
> --------------------------------------------------------------------------
>
>                 Key: SPARK-23998
>                 URL: https://issues.apache.org/jira/browse/SPARK-23998
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.3.0
>            Reporter: eaton
>            Priority: Minor
>
> It may be better to add @transient to the field 'taskMemoryManager' in class
> Task, since it is only set and used on the executor side.
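As a quick illustration of what marking such a field @transient means (a generic Java-serialization sketch, not Spark's actual Task class): the transient field is simply skipped during serialization and comes back as null on the receiving side, which is acceptable for state that is only set and used locally.

{code:scala}
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object TransientFieldSketch {
  // A stand-in for Task: the manager field is local state that does not
  // need to travel with the serialized object.
  class FakeTask(@transient var taskMemoryManager: AnyRef, val id: Int) extends Serializable

  def roundTrip[A <: AnyRef](a: A): A = {
    val buf = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buf)
    out.writeObject(a)
    out.close()
    new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray)).readObject().asInstanceOf[A]
  }

  def main(args: Array[String]): Unit = {
    val restored = roundTrip(new FakeTask("not-serialized", 7))
    println(restored.taskMemoryManager) // null -- the @transient field was skipped
    println(restored.id)                // 7
  }
}
{code}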
[jira] [Commented] (SPARK-24828) Incompatible parquet formats - java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
[ https://issues.apache.org/jira/browse/SPARK-24828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547274#comment-16547274 ]

Yuming Wang commented on SPARK-24828:
-------------------------------------

You can't read data written with different schemas using a single schema.
[jira] [Comment Edited] (SPARK-24828) Incompatible parquet formats - java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
[ https://issues.apache.org/jira/browse/SPARK-24828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547264#comment-16547264 ]

Yuming Wang edited comment on SPARK-24828 at 7/18/18 1:27 AM:
--------------------------------------------------------------

The {{label}} column is sometimes {{integer}} type and sometimes {{long}} type:

{"name":"label","type":"integer","nullable":false,"metadata":{}}
{"name":"label","type":"long","nullable":true,"metadata":{}}

{noformat}
$ java -jar ./parquet-tools/target/parquet-tools-1.10.1-SNAPSHOT.jar meta file:///Users/data/a2_m2.parquet/
file:        file:/Users/yumwang/data/a2_m2.parquet/part-0-1ff92a81-68c8-446b-a54e-a042a8fd7f1e.snappy.parquet
creator:     parquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)
extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"label","type":"integer","nullable":false,"metadata":{}},{"name":"CLASS","type":"long","nullable":true,"metadata":{}},{"name":"SENSORID","type":"string","nullable":true,"metadata":{}},{"name":"X","type":"double","nullable":true,"metadata":{}},{"name":"Y","type":"double","nullable":true,"metadata":{}},{"name":"Z","type":"double","nullable":true,"metadata":{}},{"name":"_id","type":"string","nullable":true,"metadata":{}},{"name":"_rev","type":"string","nullable":true,"metadata":{}},{"name":"features","type":{"type":"udt","class":"org.apache.spark.ml.linalg.VectorUDT","pyClass":"pyspark.ml.linalg.VectorUDT","sqlType":{"type":"struct","fields":[{"name":"type","type":"byte","nullable":false,"metadata":{}},{"name":"size","type":"integer","nullable":true,"metadata":{}},{"name":"indices","type":{"type":"array","elementType":"integer","containsNull":false},"nullable":true,"metadata":{}},{"name":"values","type":{"type":"array","elementType":"double","containsNull":false},"nullable":true,"metadata":{}}]}},"nullable":true,"metadata":{"ml_attr":{"attrs":{"numeric":[{"idx":0,"name":"X"},{"idx":1,"name":"Y"},{"idx":2,"name":"Z"}]},"num_attrs":3}}},{"name":"prediction","type":"double","nullable":true,"metadata":{}}]}

file schema: spark_schema
label:       REQUIRED INT32 R:0 D:0
CLASS:       OPTIONAL INT64 R:0 D:1
SENSORID:    OPTIONAL BINARY O:UTF8 R:0 D:1
X:           OPTIONAL DOUBLE R:0 D:1
Y:           OPTIONAL DOUBLE R:0 D:1
Z:           OPTIONAL DOUBLE R:0 D:1
_id:         OPTIONAL BINARY O:UTF8 R:0 D:1
_rev:        OPTIONAL BINARY O:UTF8 R:0 D:1
features:    OPTIONAL F:4
.type:       REQUIRED INT32 O:INT_8 R:0 D:1
.size:       OPTIONAL INT32 R:0 D:2
.indices:    OPTIONAL F:1
..list:      REPEATED F:1
...element:  REQUIRED INT32 R:1 D:3
.values:     OPTIONAL F:1
..list:      REPEATED F:1
...element:  REQUIRED DOUBLE R:1 D:3
prediction:  OPTIONAL DOUBLE R:0 D:1

row group 1: RC:893 TS:41499 OFFSET:4
label:       INT32 SNAPPY DO:0 FPO:4 SZ:108/104/0.96 VC:893 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 0, max: 1, num_nulls: 0]
CLASS:       INT64 SNAPPY DO:0 FPO:112 SZ:131/127/0.97 VC:893 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: 1, max: 2, num_nulls: 0]
SENSORID:    BINARY SNAPPY DO:0 FPO:243 SZ:81/77/0.95 VC:893 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: , max: , num_nulls: 0]
X:           DOUBLE SNAPPY DO:0 FPO:324 SZ:957/1045/1.09 VC:893 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: -0.85, max: 1.59, num_nulls: 0]
Y:           DOUBLE SNAPPY DO:0 FPO:1281 SZ:957/1045/1.09 VC:893 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: -0.85, max: 1.59, num_nulls: 0]
Z:           DOUBLE SNAPPY DO:0 FPO:2238 SZ:957/1045/1.09 VC:893 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: -0.85, max: 1.59, num_nulls: 0]
_id:         BINARY SNAPPY DO:0 FPO:3195 SZ:7964/32248/4.05 VC:893 ENC:BIT_PACKED,RLE,PLAIN ST:[no stats for this column]
_rev:        BINARY SNAPPY DO:0 FPO:11159 SZ:2340/2503/1.07 VC:893 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[no stats for this column]
features:
.type:       INT32 SNAPPY DO:0 FPO:13499 SZ:199/193/0.97 VC:893 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: 0, max: 1, num_nulls: 0]
.size:       INT32 SNAPPY DO:0 FPO:13698 SZ:296/290/0.98 VC:893 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: 3, max: 3, num_nulls: 620]
.indices:
..list:
...element:  INT32 SNAPPY DO:0 FPO:13994 SZ:269/265/0.99 VC:893 ENC:RLE,PLAIN ST:[num_nulls: 893, min/max not defined]
.values:
..list:
...element:  DOUBLE SNAPPY DO:0 FPO:14263 SZ:1772/2406/1.36 VC:2133 ENC:RLE,PLAIN_DICTIONARY ST:[min: -0.85, max: 1.59, num_nulls: 273]
prediction:  DOUBLE SNAPPY DO:0 FPO:16035 SZ:156/151/0.97 VC:893 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: -0.0, max: 1.0, num_nulls: 0]

file:        file:/Users/yumwang/data/a2_m2.parquet/part-0-50be1622-7a2c-43d2-b3ce-f0703ab8f458.snappy.parquet
creator:     parquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)
extra:       org.apache.spark.sql.parquet.row.metadata =
[jira] [Commented] (SPARK-24828) Incompatible parquet formats - java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
[ https://issues.apache.org/jira/browse/SPARK-24828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547264#comment-16547264 ]

Yuming Wang commented on SPARK-24828:
-------------------------------------

I'm working on it.
[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547263#comment-16547263 ]

Saisai Shao commented on SPARK-24615:
-------------------------------------

Sure, I will add it as well, since Xiangrui raised the same concern.

> Accelerator-aware task scheduling for Spark
> --------------------------------------------
>
>                 Key: SPARK-24615
>                 URL: https://issues.apache.org/jira/browse/SPARK-24615
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Saisai Shao
>            Assignee: Saisai Shao
>            Priority: Major
>              Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are
> predominant compared to CPUs. To make the current Spark architecture work with
> accelerator cards, Spark itself should understand the existence of accelerators
> and know how to schedule tasks onto executors that are equipped with them.
> Spark's current scheduler schedules tasks based on data locality plus the
> availability of CPUs. This introduces problems when scheduling tasks that
> require accelerators:
> # CPU cores usually outnumber accelerators on a node, so using CPU cores to
> schedule accelerator-required tasks introduces a mismatch.
> # In a cluster, we can always assume that every node has CPUs, but this is not
> true of accelerator cards.
> # The existence of heterogeneous tasks (accelerator-required or not) requires
> the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous
> tasks (accelerator-required or not). This can be part of the work of Project
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation
> details, just highlights the parts that should be changed.
>
> CC [~yanboliang] [~merlintang]
[jira] [Resolved] (SPARK-24402) Optimize `In` expression when only one element in the collection or collection is empty
[ https://issues.apache.org/jira/browse/SPARK-24402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DB Tsai resolved SPARK-24402.
-----------------------------
    Resolution: Resolved

> Optimize `In` expression when only one element in the collection or
> collection is empty
> --------------------------------------------------------------------
>
>                 Key: SPARK-24402
>                 URL: https://issues.apache.org/jira/browse/SPARK-24402
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: DB Tsai
>            Assignee: DB Tsai
>            Priority: Major
>
> Two new rules are added to the logical plan optimizer.
> # When there is only one element in the *{{Collection}}*, the physical plan
> will be optimized to *{{EqualTo}}*, so predicate pushdown can be used.
> {code}
> profileDF.filter( $"profileID".isInCollection(Set(6))).explain(true)
> """
> |== Physical Plan ==
> |*(1) Project [profileID#0]
> |+- *(1) Filter (isnotnull(profileID#0) && (profileID#0 = 6))
> |   +- *(1) FileScan parquet [profileID#0] Batched: true, Format: Parquet,
> |      PartitionFilters: [],
> |      PushedFilters: [IsNotNull(profileID), EqualTo(profileID,6)],
> |      ReadSchema: struct
> """.stripMargin
> {code}
> # When the *{{Set}}* is empty and the input is nullable, the logical plan will
> be simplified to
> {code}
> profileDF.filter( $"profileID".isInCollection(Set())).explain(true)
> """
> |== Optimized Logical Plan ==
> |Filter if (isnull(profileID#0)) null else false
> |+- Relation[profileID#0] parquet
> """.stripMargin
> {code}
> TODO:
> # For multiple conditions with the number of elements below a certain
> threshold, we should still allow predicate pushdown.
> # Optimize `In` using tableswitch or lookupswitch when the number of
> categories is low and they are `Int` or `Long`.
> # The default immutable hash tree set is slow to query; we should benchmark
> different set implementations for faster lookups.
> # `filter(if (condition) null else false)` can be optimized to false.
[jira] [Assigned] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure
[ https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

shane knapp reassigned SPARK-24825:
-----------------------------------
    Assignee: Matt Cheah

> [K8S][TEST] Kubernetes integration tests don't trace the maven project
> dependency structure
> ------------------------------------------------------------------------
>
>                 Key: SPARK-24825
>                 URL: https://issues.apache.org/jira/browse/SPARK-24825
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes, Tests
>    Affects Versions: 2.4.0
>            Reporter: Matt Cheah
>            Assignee: Matt Cheah
>            Priority: Critical
>
> The Kubernetes integration tests currently fail unless a Maven install is
> performed first, because the integration test build believes it should pull
> the Spark parent artifact from Maven Central. This is incorrect: the
> integration test should build the Spark parent pom as a dependency within the
> multi-module build and simply use the dynamically built artifact. Put another
> way, the integration test builds should never pull Spark dependencies from
> Maven Central.
[jira] [Updated] (SPARK-24839) Incorrect drop of lit() column results in cross join
[ https://issues.apache.org/jira/browse/SPARK-24839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] marios iliofotou updated SPARK-24839: - Description: The problem shows up when joining a column that has constant (lit) value. As seen from the exception in the logical plan, the literal column gets dropped, which results in joining two DF on a column that does not exist (which then correctly results in a Cartesian join). {code:java} scala> val df1 = spark.createDataFrame(Seq((1, 2), (2, 4))).withColumn("index", lit("a")) scala> df1.show +---+---+-+ | _1| _2|index| +---+---+-+ | 1| 2|a| | 2| 4|a| +---+---+-+ scala> val df2 = spark.createDataFrame(Seq(("a", 1),("b", 2))).toDF("index", "someval") scala> df2.show() +-+---+ |index|someval| +-+---+ |a| 1| |b| 2| +-+---+ scala> df1.join(df2).show() org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans LocalRelation [_1#370, _2#371|#370, _2#371] and LocalRelation [index#335, someval#336|#335, someval#336] Join condition is missing or trivial. Either: use the CROSS JOIN syntax to allow cartesian products between these relations, or: enable implicit cartesian products by setting the configuration variable spark.sql.crossJoin.enabled=true; at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1124) at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1121) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1121) at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1103) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76) at scala.collection.immutable.List.foreach(List.scala:392) at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76) at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:66) at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:66) at
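Until the dropped lit() column is handled in the optimizer, the exception text itself points at two ways to let such a query run. The sketch below reuses the `df1`/`df2` frames from the repro and adds the `index` equality the description implies; it only works around the symptom and does not change the underlying behaviour.
{code:scala}
// Option 1: make the cartesian product explicit and filter afterwards.
df1.crossJoin(df2).where(df1("index") === df2("index")).show()

// Option 2: allow implicit cartesian products for the session, as the error
// message suggests (assumes `spark` is the active SparkSession).
spark.conf.set("spark.sql.crossJoin.enabled", "true")
df1.join(df2, df1("index") === df2("index")).show()
{code}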
[jira] [Issue Comment Deleted] (SPARK-24839) Incorrect drop of lit() column results in cross join
[ https://issues.apache.org/jira/browse/SPARK-24839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] marios iliofotou updated SPARK-24839: - Comment: was deleted (was: Might be the closest issues related to this. ) > Incorrect drop of lit() column results in cross join > > > Key: SPARK-24839 > URL: https://issues.apache.org/jira/browse/SPARK-24839 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.3.1 >Reporter: marios iliofotou >Priority: Major > > The problem shows up when joining a column that has constant value. As seen > from the exception in the logical plan, the literal column gets dropped, > which results in joining two DF on a column that does not exist, which > correctly results in a Cartesian join. > > {code:java} > scala> val df1 = spark.createDataFrame(Seq((1, 2), (2, > 4))).withColumn("index", lit("a")) > scala> df1.show > +---+---+-+ > | _1| _2|index| > +---+---+-+ > | 1| 2|a| > | 2| 4|a| > +---+---+-+ > > scala> val df2 = spark.createDataFrame(Seq(("a", 1),("b", 2))).toDF("index", > "someval") > scala> df2.show() > +-+---+ > |index|someval| > +-+---+ > |a| 1| > |b| 2| > +-+---+ > scala> df1.join(df2).show() > org.apache.spark.sql.AnalysisException: Detected implicit cartesian product > for INNER join between logical plans > LocalRelation [_1#370, _2#371|#370, _2#371] > and > LocalRelation [index#335, someval#336|#335, someval#336] > Join condition is missing or trivial. > Either: use the CROSS JOIN syntax to allow cartesian products between these > relations, or: enable implicit cartesian products by setting the > configuration > variable spark.sql.crossJoin.enabled=true; > at > org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1124) > at > org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1121) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1121) > at > org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1103) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84) > at > scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) > at > scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) > at
[jira] [Commented] (SPARK-24839) Incorrect drop of lit() column results in cross join
[ https://issues.apache.org/jira/browse/SPARK-24839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547246#comment-16547246 ] marios iliofotou commented on SPARK-24839: -- Might be the closest issues related to this. > Incorrect drop of lit() column results in cross join > > > Key: SPARK-24839 > URL: https://issues.apache.org/jira/browse/SPARK-24839 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.3.1 >Reporter: marios iliofotou >Priority: Major > > The problem shows up when joining a column that has constant value. As seen > from the exception in the logical plan, the literal column gets dropped, > which results in joining two DF on a column that does not exist, which > correctly results in a Cartesian join. > > {code:java} > scala> val df1 = spark.createDataFrame(Seq((1, 2), (2, > 4))).withColumn("index", lit("a")) > scala> df1.show > +---+---+-+ > | _1| _2|index| > +---+---+-+ > | 1| 2|a| > | 2| 4|a| > +---+---+-+ > > scala> val df2 = spark.createDataFrame(Seq(("a", 1),("b", 2))).toDF("index", > "someval") > scala> df2.show() > +-+---+ > |index|someval| > +-+---+ > |a| 1| > |b| 2| > +-+---+ > scala> df1.join(df2).show() > org.apache.spark.sql.AnalysisException: Detected implicit cartesian product > for INNER join between logical plans > LocalRelation [_1#370, _2#371|#370, _2#371] > and > LocalRelation [index#335, someval#336|#335, someval#336] > Join condition is missing or trivial. > Either: use the CROSS JOIN syntax to allow cartesian products between these > relations, or: enable implicit cartesian products by setting the > configuration > variable spark.sql.crossJoin.enabled=true; > at > org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1124) > at > org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1121) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1121) > at > org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1103) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84) > at > scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) > at > scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) > at
[jira] [Updated] (SPARK-24839) Incorrect drop of lit() column results in cross join
[ https://issues.apache.org/jira/browse/SPARK-24839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] marios iliofotou updated SPARK-24839: - Description: The problem shows up when joining a column that has constant value. As seen from the exception in the logical plan, the literal column gets dropped, which results in joining two DF on a column that does not exist, which correctly results in a Cartesian join. {code:java} scala> val df1 = spark.createDataFrame(Seq((1, 2), (2, 4))).withColumn("index", lit("a")) scala> df1.show +---+---+-+ | _1| _2|index| +---+---+-+ | 1| 2|a| | 2| 4|a| +---+---+-+ scala> val df2 = spark.createDataFrame(Seq(("a", 1),("b", 2))).toDF("index", "someval") scala> df2.show() +-+---+ |index|someval| +-+---+ |a| 1| |b| 2| +-+---+ scala> df1.join(df2).show() org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans LocalRelation [_1#370, _2#371|#370, _2#371] and LocalRelation [index#335, someval#336|#335, someval#336] Join condition is missing or trivial. Either: use the CROSS JOIN syntax to allow cartesian products between these relations, or: enable implicit cartesian products by setting the configuration variable spark.sql.crossJoin.enabled=true; at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1124) at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1121) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1121) at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1103) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76) at scala.collection.immutable.List.foreach(List.scala:392) at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76) at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:66) at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:66) at
[jira] [Updated] (SPARK-24839) Incorrect drop of lit() column results in cross join
[ https://issues.apache.org/jira/browse/SPARK-24839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] marios iliofotou updated SPARK-24839: - Description: The problem shows up when joining a column that has constant value. As seen from the exception in the logical plan, the literal column gets dropped, which results in joining two on column that does not exist, which correctly results in a Cartesian join. {code:java} scala> val df1 = spark.createDataFrame(Seq((1, 2), (2, 4))).withColumn("index", lit("a")) scala> df1.show +---+---+-+ | _1| _2|index| +---+---+-+ | 1| 2|a| | 2| 4|a| +---+---+-+ scala> val df2 = spark.createDataFrame(Seq(("a", 1),("b", 2))).toDF("index", "someval") scala> df2.show() +-+---+ |index|someval| +-+---+ |a| 1| |b| 2| +-+---+ scala> df1.join(df2).show() org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans LocalRelation [_1#370, _2#371|#370, _2#371] and LocalRelation [index#335, someval#336|#335, someval#336] Join condition is missing or trivial. Either: use the CROSS JOIN syntax to allow cartesian products between these relations, or: enable implicit cartesian products by setting the configuration variable spark.sql.crossJoin.enabled=true; at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1124) at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1121) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1121) at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1103) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76) at scala.collection.immutable.List.foreach(List.scala:392) at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76) at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:66) at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:66) at
[jira] [Assigned] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure
[ https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24825: Assignee: Apache Spark > [K8S][TEST] Kubernetes integration tests don't trace the maven project > dependency structure > --- > > Key: SPARK-24825 > URL: https://issues.apache.org/jira/browse/SPARK-24825 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Tests >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Assignee: Apache Spark >Priority: Critical > > The Kubernetes integration tests will currently fail if maven installation is > not performed first, because the integration test build believes it should be > pulling the Spark parent artifact from maven central. However, this is > incorrect because the integration test should be building the Spark parent > pom as a dependency in the multi-module build, and the integration test > should just use the dynamically built artifact. Or to put it another way, the > integration test builds should never be pulling Spark dependencies from maven > central. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure
[ https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547238#comment-16547238 ] Apache Spark commented on SPARK-24825: -- User 'mccheah' has created a pull request for this issue: https://github.com/apache/spark/pull/21800 > [K8S][TEST] Kubernetes integration tests don't trace the maven project > dependency structure > --- > > Key: SPARK-24825 > URL: https://issues.apache.org/jira/browse/SPARK-24825 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Tests >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Priority: Critical > > The Kubernetes integration tests will currently fail if maven installation is > not performed first, because the integration test build believes it should be > pulling the Spark parent artifact from maven central. However, this is > incorrect because the integration test should be building the Spark parent > pom as a dependency in the multi-module build, and the integration test > should just use the dynamically built artifact. Or to put it another way, the > integration test builds should never be pulling Spark dependencies from maven > central. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure
[ https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24825: Assignee: (was: Apache Spark) > [K8S][TEST] Kubernetes integration tests don't trace the maven project > dependency structure > --- > > Key: SPARK-24825 > URL: https://issues.apache.org/jira/browse/SPARK-24825 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Tests >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Priority: Critical > > The Kubernetes integration tests will currently fail if maven installation is > not performed first, because the integration test build believes it should be > pulling the Spark parent artifact from maven central. However, this is > incorrect because the integration test should be building the Spark parent > pom as a dependency in the multi-module build, and the integration test > should just use the dynamically built artifact. Or to put it another way, the > integration test builds should never be pulling Spark dependencies from maven > central. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24839) Incorrect drop of lit() column results in cross join
[ https://issues.apache.org/jira/browse/SPARK-24839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] marios iliofotou updated SPARK-24839: - Summary: Incorrect drop of lit() column results in cross join (was: Incorrect drop of lit column results in cross join. ) > Incorrect drop of lit() column results in cross join > > > Key: SPARK-24839 > URL: https://issues.apache.org/jira/browse/SPARK-24839 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.3.1 >Reporter: marios iliofotou >Priority: Major > > {code} > scala> val df1 = spark.createDataFrame(Seq((1, 2), (2, > 4))).withColumn("index", lit("a")) > scala> df1.show > +---+---+-+ > | _1| _2|index| > +---+---+-+ > | 1| 2|a| > | 2| 4|a| > +---+---+-+ > > scala> val df2 = spark.createDataFrame(Seq(("a", 1),("b", 2))).toDF("index", > "someval") > scala> df2.show() > +-+---+ > |index|someval| > +-+---+ > |a| 1| > |b| 2| > +-+---+ > scala> df1.join(df2).show() > org.apache.spark.sql.AnalysisException: Detected implicit cartesian product > for INNER join between logical plans > LocalRelation [_1#370, _2#371|#370, _2#371] > and > LocalRelation [index#335, someval#336|#335, someval#336] > Join condition is missing or trivial. > Either: use the CROSS JOIN syntax to allow cartesian products between these > relations, or: enable implicit cartesian products by setting the > configuration > variable spark.sql.crossJoin.enabled=true; > at > org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1124) > at > org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1121) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1121) > at > org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1103) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84) > at > scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) > at > scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) > at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84) > at >
[jira] [Updated] (SPARK-24839) Incorrect drop of lit column results in cross join.
[ https://issues.apache.org/jira/browse/SPARK-24839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] marios iliofotou updated SPARK-24839: - Description: {code} scala> val df1 = spark.createDataFrame(Seq((1, 2), (2, 4))).withColumn("index", lit("a")) scala> df1.show +---+---+-+ | _1| _2|index| +---+---+-+ | 1| 2|a| | 2| 4|a| +---+---+-+ scala> val df2 = spark.createDataFrame(Seq(("a", 1),("b", 2))).toDF("index", "someval") scala> df2.show() +-+---+ |index|someval| +-+---+ |a| 1| |b| 2| +-+---+ scala> df1.join(df2).show() org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans LocalRelation [_1#370, _2#371|#370, _2#371] and LocalRelation [index#335, someval#336|#335, someval#336] Join condition is missing or trivial. Either: use the CROSS JOIN syntax to allow cartesian products between these relations, or: enable implicit cartesian products by setting the configuration variable spark.sql.crossJoin.enabled=true; at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1124) at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1121) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1121) at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1103) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87) 
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76) at scala.collection.immutable.List.foreach(List.scala:392) at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76) at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:66) at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:66) at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72) at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68) at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77) at
[jira] [Updated] (SPARK-24839) Incorrect drop of lit column results in cross join.
[ https://issues.apache.org/jira/browse/SPARK-24839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] marios iliofotou updated SPARK-24839: - Description:
{code:scala}
scala> val df1 = spark.createDataFrame(Seq((1, 2), (2, 4))).withColumn("index", org.apache.spark.sql.functions.lit("a"))

scala> df1.show
+---+---+-----+
| _1| _2|index|
+---+---+-----+
|  1|  2|    a|
|  2|  4|    a|
+---+---+-----+

scala> val df2 = spark.createDataFrame(Seq(("a", 1),("b", 2))).toDF("index", "someval")

scala> df2.show()
+-----+-------+
|index|someval|
+-----+-------+
|    a|      1|
|    b|      2|
+-----+-------+

scala> df1.join(df2).show()
org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans
LocalRelation [_1#370, _2#371]
and
LocalRelation [index#335, someval#336]
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these relations, or: enable implicit cartesian products by setting the configuration variable spark.sql.crossJoin.enabled=true;
  at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1124)
  at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1121)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
  at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1121)
  at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1103)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
  at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
  at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
  at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
  at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:66)
  at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:66)
  at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
  at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
  at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
  at
[jira] [Created] (SPARK-24839) Incorrect drop of lit column results in cross join.
marios iliofotou created SPARK-24839: Summary: Incorrect drop of lit column results in cross join. Key: SPARK-24839 URL: https://issues.apache.org/jira/browse/SPARK-24839 Project: Spark Issue Type: Bug Components: Optimizer Affects Versions: 2.3.1 Reporter: marios iliofotou
```
scala> val df1 = spark.createDataFrame(Seq((1, 2), (2, 4))).withColumn("index", org.apache.spark.sql.functions.lit("a"))

scala> df1.show
+---+---+-----+
| _1| _2|index|
+---+---+-----+
|  1|  2|    a|
|  2|  4|    a|
+---+---+-----+

scala> val df2 = spark.createDataFrame(Seq(("a", 1),("b", 2))).toDF("index", "someval")

scala> df2.show()
+-----+-------+
|index|someval|
+-----+-------+
|    a|      1|
|    b|      2|
+-----+-------+

scala> df1.join(df2).show()
org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans
LocalRelation [_1#370, _2#371]
and
LocalRelation [index#335, someval#336]
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these relations, or: enable implicit cartesian products by setting the configuration variable spark.sql.crossJoin.enabled=true;
  at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1124)
  at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1121)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
  at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1121)
  at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$.apply(Optimizer.scala:1103)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
  at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
  at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
  at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
  at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:66)
  at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:66)
  at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
  at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
  at
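A minimal workaround sketch for the report above (not part of the ticket): giving the join an explicit key on the shared "index" column avoids the implicit-cartesian-product check, and the check can also be relaxed with the configuration flag quoted in the error message.

{code:scala}
// Sketch only, assuming df1 and df2 as defined in the report above (spark-shell session).
// An explicit equi-join on the shared "index" column is not treated as a cartesian product:
val joined = df1.join(df2, Seq("index"))
joined.show()

// Alternatively, if a cartesian product is really intended, the check can be disabled at runtime:
spark.conf.set("spark.sql.crossJoin.enabled", "true")
df1.join(df2).show()
{code}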
[jira] [Comment Edited] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure
[ https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547225#comment-16547225 ] Stavros Kontopoulos edited comment on SPARK-24825 at 7/17/18 11:43 PM: --- Yes [~srowen] is right, this seems to work: dev-run-integration-tests.sh cd ../../../ ./build/mvn -T 1C -Pscala-2.11 -Pkubernetes -pl resource-managers/kubernetes/integration-tests -amd integration-test $\{properties[@]} was (Author: skonto): Yes [~srowen] is right, this seems to work: ./build/mvn -T 1C -Pscala-2.11 -Pkubernetes -pl resource-managers/kubernetes/integration-tests -amd integration-test $\{properties[@]} > [K8S][TEST] Kubernetes integration tests don't trace the maven project > dependency structure > --- > > Key: SPARK-24825 > URL: https://issues.apache.org/jira/browse/SPARK-24825 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Tests >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Priority: Critical > > The Kubernetes integration tests will currently fail if maven installation is > not performed first, because the integration test build believes it should be > pulling the Spark parent artifact from maven central. However, this is > incorrect because the integration test should be building the Spark parent > pom as a dependency in the multi-module build, and the integration test > should just use the dynamically built artifact. Or to put it another way, the > integration test builds should never be pulling Spark dependencies from maven > central. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure
[ https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547225#comment-16547225 ] Stavros Kontopoulos edited comment on SPARK-24825 at 7/17/18 11:43 PM: --- Yes [~srowen] is right, this seems to work: dev-run-integration-tests.sh: cd ../../../ ./build/mvn -T 1C -Pscala-2.11 -Pkubernetes -pl resource-managers/kubernetes/integration-tests -amd integration-test $\{properties[@]} was (Author: skonto): Yes [~srowen] is right, this seems to work: dev-run-integration-tests.sh cd ../../../ ./build/mvn -T 1C -Pscala-2.11 -Pkubernetes -pl resource-managers/kubernetes/integration-tests -amd integration-test $\{properties[@]} > [K8S][TEST] Kubernetes integration tests don't trace the maven project > dependency structure > --- > > Key: SPARK-24825 > URL: https://issues.apache.org/jira/browse/SPARK-24825 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Tests >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Priority: Critical > > The Kubernetes integration tests will currently fail if maven installation is > not performed first, because the integration test build believes it should be > pulling the Spark parent artifact from maven central. However, this is > incorrect because the integration test should be building the Spark parent > pom as a dependency in the multi-module build, and the integration test > should just use the dynamically built artifact. Or to put it another way, the > integration test builds should never be pulling Spark dependencies from maven > central. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure
[ https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547225#comment-16547225 ] Stavros Kontopoulos edited comment on SPARK-24825 at 7/17/18 11:38 PM: --- Yes [~srowen] is right, this seems to work: ./build/mvn -T 1C -Pscala-2.11 -Pkubernetes -pl resource-managers/kubernetes/integration-tests -amd integration-test $\{properties[@]} was (Author: skonto): Yes this seems to work: ./build/mvn -T 1C -Pscala-2.11 -Pkubernetes -pl resource-managers/kubernetes/integration-tests -amd integration-test ${properties[@]} > [K8S][TEST] Kubernetes integration tests don't trace the maven project > dependency structure > --- > > Key: SPARK-24825 > URL: https://issues.apache.org/jira/browse/SPARK-24825 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Tests >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Priority: Critical > > The Kubernetes integration tests will currently fail if maven installation is > not performed first, because the integration test build believes it should be > pulling the Spark parent artifact from maven central. However, this is > incorrect because the integration test should be building the Spark parent > pom as a dependency in the multi-module build, and the integration test > should just use the dynamically built artifact. Or to put it another way, the > integration test builds should never be pulling Spark dependencies from maven > central. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure
[ https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547225#comment-16547225 ] Stavros Kontopoulos commented on SPARK-24825: - Yes this seems to work: ./build/mvn -T 1C -Pscala-2.11 -Pkubernetes -pl resource-managers/kubernetes/integration-tests -amd integration-test ${properties[@]} > [K8S][TEST] Kubernetes integration tests don't trace the maven project > dependency structure > --- > > Key: SPARK-24825 > URL: https://issues.apache.org/jira/browse/SPARK-24825 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Tests >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Priority: Critical > > The Kubernetes integration tests will currently fail if maven installation is > not performed first, because the integration test build believes it should be > pulling the Spark parent artifact from maven central. However, this is > incorrect because the integration test should be building the Spark parent > pom as a dependency in the multi-module build, and the integration test > should just use the dynamically built artifact. Or to put it another way, the > integration test builds should never be pulling Spark dependencies from maven > central. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24747) Make spark.ml.util.Instrumentation class more flexible
[ https://issues.apache.org/jira/browse/SPARK-24747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547200#comment-16547200 ] Apache Spark commented on SPARK-24747: -- User 'MrBago' has created a pull request for this issue: https://github.com/apache/spark/pull/21799 > Make spark.ml.util.Instrumentation class more flexible > -- > > Key: SPARK-24747 > URL: https://issues.apache.org/jira/browse/SPARK-24747 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.1 >Reporter: Bago Amirbekian >Assignee: Bago Amirbekian >Priority: Major > Fix For: 2.4.0 > > > The Instrumentation class (which is an internal private class) is some what > limited by it's current APIs. The class requires an estimator and dataset be > passed to the constructor which limits how it can be used. Furthermore, the > current APIs make it hard to intercept failures and record anything related > to those failures. > The following changes could make the instrumentation class easier to work > with. All these changes are for private APIs and should not be visible to > users. > {code} > // New no-argument constructor: > Instrumentation() > // New api to log previous constructor arguments. > logTrainingContext(estimator: Estimator[_], dataset: Dataset[_]) > logFailure(e: Throwable): Unit > // Log success with no arguments > logSuccess(): Unit > // Log result model explicitly instead of passing to logSuccess > logModel(model: Model[_]): Unit > // On Companion object > Instrumentation.instrumented[T](body: (Instrumentation => T)): T > // The above API will allow us to write instrumented methods more clearly and > handle logging success and failure automatically: > def someMethod(...): T = instrumented { instr => > instr.logNamedValue(name, value) > // more code here > instr.logModel(model) > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
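A rough sketch (not part of the ticket) of how the proposed {{instrumented}} companion helper could behave, assuming the new no-argument constructor and the {{logSuccess}}/{{logFailure}} methods described in the proposal; it illustrates how success and failure logging would be handled automatically around the instrumented body.

{code:scala}
// Sketch against the proposed (not yet existing) private API from the description above.
object Instrumentation {
  def instrumented[T](body: Instrumentation => T): T = {
    val instr = new Instrumentation()   // proposed no-argument constructor
    try {
      val result = body(instr)
      instr.logSuccess()                // proposed no-argument success logging
      result
    } catch {
      case e: Throwable =>
        instr.logFailure(e)             // proposed failure hook, so failures are recorded centrally
        throw e
    }
  }
}
{code}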
[jira] [Commented] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure
[ https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547189#comment-16547189 ] shane knapp commented on SPARK-24825: - [~mcheah] is working on a patch now... > [K8S][TEST] Kubernetes integration tests don't trace the maven project > dependency structure > --- > > Key: SPARK-24825 > URL: https://issues.apache.org/jira/browse/SPARK-24825 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Tests >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Priority: Critical > > The Kubernetes integration tests will currently fail if maven installation is > not performed first, because the integration test build believes it should be > pulling the Spark parent artifact from maven central. However, this is > incorrect because the integration test should be building the Spark parent > pom as a dependency in the multi-module build, and the integration test > should just use the dynamically built artifact. Or to put it another way, the > integration test builds should never be pulling Spark dependencies from maven > central. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure
[ https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-24825: Comment: was deleted (was: [~srowen] I played with submodules but it didnt work for me, but didnt spend much time.) > [K8S][TEST] Kubernetes integration tests don't trace the maven project > dependency structure > --- > > Key: SPARK-24825 > URL: https://issues.apache.org/jira/browse/SPARK-24825 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Tests >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Priority: Critical > > The Kubernetes integration tests will currently fail if maven installation is > not performed first, because the integration test build believes it should be > pulling the Spark parent artifact from maven central. However, this is > incorrect because the integration test should be building the Spark parent > pom as a dependency in the multi-module build, and the integration test > should just use the dynamically built artifact. Or to put it another way, the > integration test builds should never be pulling Spark dependencies from maven > central. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure
[ https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547186#comment-16547186 ] Stavros Kontopoulos commented on SPARK-24825: - [~srowen] I played with submodules but it didnt work for me, but didnt spend much time. > [K8S][TEST] Kubernetes integration tests don't trace the maven project > dependency structure > --- > > Key: SPARK-24825 > URL: https://issues.apache.org/jira/browse/SPARK-24825 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Tests >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Priority: Critical > > The Kubernetes integration tests will currently fail if maven installation is > not performed first, because the integration test build believes it should be > pulling the Spark parent artifact from maven central. However, this is > incorrect because the integration test should be building the Spark parent > pom as a dependency in the multi-module build, and the integration test > should just use the dynamically built artifact. Or to put it another way, the > integration test builds should never be pulling Spark dependencies from maven > central. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure
[ https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547142#comment-16547142 ] Sean Owen commented on SPARK-24825: --- To build and test only a child module, you can't just run tests in the child's dir. You run something like "mvn ... -pl child/pom.xml" from the parent. Is that the issue? > [K8S][TEST] Kubernetes integration tests don't trace the maven project > dependency structure > --- > > Key: SPARK-24825 > URL: https://issues.apache.org/jira/browse/SPARK-24825 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Tests >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Priority: Critical > > The Kubernetes integration tests will currently fail if maven installation is > not performed first, because the integration test build believes it should be > pulling the Spark parent artifact from maven central. However, this is > incorrect because the integration test should be building the Spark parent > pom as a dependency in the multi-module build, and the integration test > should just use the dynamically built artifact. Or to put it another way, the > integration test builds should never be pulling Spark dependencies from maven > central. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure
[ https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547141#comment-16547141 ] shane knapp commented on SPARK-24825: - [~vanzin] [~joshrosen] any thoughts? > [K8S][TEST] Kubernetes integration tests don't trace the maven project > dependency structure > --- > > Key: SPARK-24825 > URL: https://issues.apache.org/jira/browse/SPARK-24825 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Tests >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Priority: Critical > > The Kubernetes integration tests will currently fail if maven installation is > not performed first, because the integration test build believes it should be > pulling the Spark parent artifact from maven central. However, this is > incorrect because the integration test should be building the Spark parent > pom as a dependency in the multi-module build, and the integration test > should just use the dynamically built artifact. Or to put it another way, the > integration test builds should never be pulling Spark dependencies from maven > central. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure
[ https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547137#comment-16547137 ] Matt Cheah commented on SPARK-24825: We're looking into this now, this particular phase was built out by myself, [~ssuchter], [~foxish], and [~ifilonenko]. Consulting with some folks from RiseLab now also - [~shaneknapp]. I think we really shouldn't have to maven install, the multi-module build should pick up the other modules properly. We've likely configured the maven reactor incorrectly in the integration test's pom.xml. > [K8S][TEST] Kubernetes integration tests don't trace the maven project > dependency structure > --- > > Key: SPARK-24825 > URL: https://issues.apache.org/jira/browse/SPARK-24825 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Tests >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Priority: Critical > > The Kubernetes integration tests will currently fail if maven installation is > not performed first, because the integration test build believes it should be > pulling the Spark parent artifact from maven central. However, this is > incorrect because the integration test should be building the Spark parent > pom as a dependency in the multi-module build, and the integration test > should just use the dynamically built artifact. Or to put it another way, the > integration test builds should never be pulling Spark dependencies from maven > central. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure
[ https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547133#comment-16547133 ] Stavros Kontopoulos edited comment on SPARK-24825 at 7/17/18 9:37 PM: -- You could remove .m2, downloads should be fast. :) I am hacking too :P But I suspect also timestamp of published metadata is messed up. Can we ping people who own the process? [~srowen]? was (Author: skonto): You could remove .m2, downloads should be fast. :) I am hacking too :P But i suspect also timestamp of published metadata are messed up. Can we ping people who own the process? [~srowen]? > [K8S][TEST] Kubernetes integration tests don't trace the maven project > dependency structure > --- > > Key: SPARK-24825 > URL: https://issues.apache.org/jira/browse/SPARK-24825 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Tests >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Priority: Critical > > The Kubernetes integration tests will currently fail if maven installation is > not performed first, because the integration test build believes it should be > pulling the Spark parent artifact from maven central. However, this is > incorrect because the integration test should be building the Spark parent > pom as a dependency in the multi-module build, and the integration test > should just use the dynamically built artifact. Or to put it another way, the > integration test builds should never be pulling Spark dependencies from maven > central. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure
[ https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547133#comment-16547133 ] Stavros Kontopoulos edited comment on SPARK-24825 at 7/17/18 9:37 PM: -- You could remove .m2, downloads should be fast. :) I am hacking too :P But i suspect also timestamp of published metadata are messed up. Can we ping people who own the process? [~srowen]? was (Author: skonto): You could remove .m2, downloads should be fast. :) I am hacking too :P But i suspect also timestamps of published jars are messed up. Can we ping people who own the process? [~srowen]? > [K8S][TEST] Kubernetes integration tests don't trace the maven project > dependency structure > --- > > Key: SPARK-24825 > URL: https://issues.apache.org/jira/browse/SPARK-24825 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Tests >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Priority: Critical > > The Kubernetes integration tests will currently fail if maven installation is > not performed first, because the integration test build believes it should be > pulling the Spark parent artifact from maven central. However, this is > incorrect because the integration test should be building the Spark parent > pom as a dependency in the multi-module build, and the integration test > should just use the dynamically built artifact. Or to put it another way, the > integration test builds should never be pulling Spark dependencies from maven > central. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure
[ https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547133#comment-16547133 ] Stavros Kontopoulos edited comment on SPARK-24825 at 7/17/18 9:36 PM: -- You could remove .m2, downloads should be fast. :) I am hacking too :P But i suspect also timestamps of published jars are messed up. Can we ping people who own the process? [~srowen]? was (Author: skonto): You could remove .m2, downloads should be fast. :) I am hacking too :P > [K8S][TEST] Kubernetes integration tests don't trace the maven project > dependency structure > --- > > Key: SPARK-24825 > URL: https://issues.apache.org/jira/browse/SPARK-24825 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Tests >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Priority: Critical > > The Kubernetes integration tests will currently fail if maven installation is > not performed first, because the integration test build believes it should be > pulling the Spark parent artifact from maven central. However, this is > incorrect because the integration test should be building the Spark parent > pom as a dependency in the multi-module build, and the integration test > should just use the dynamically built artifact. Or to put it another way, the > integration test builds should never be pulling Spark dependencies from maven > central. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure
[ https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547133#comment-16547133 ] Stavros Kontopoulos commented on SPARK-24825: - You could remove .m2, downloads should be fast. :) I am hacking too :P > [K8S][TEST] Kubernetes integration tests don't trace the maven project > dependency structure > --- > > Key: SPARK-24825 > URL: https://issues.apache.org/jira/browse/SPARK-24825 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Tests >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Priority: Critical > > The Kubernetes integration tests will currently fail if maven installation is > not performed first, because the integration test build believes it should be > pulling the Spark parent artifact from maven central. However, this is > incorrect because the integration test should be building the Spark parent > pom as a dependency in the multi-module build, and the integration test > should just use the dynamically built artifact. Or to put it another way, the > integration test builds should never be pulling Spark dependencies from maven > central. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure
[ https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547128#comment-16547128 ] Stavros Kontopoulos edited comment on SPARK-24825 at 7/17/18 9:34 PM: -- As I mentioned elsewhere one thing that worked for me is modify ./dev/make-distribution.sh to do a clean install instead of clean package. But not sure for the long term for this, although you could purge deps before a build. Test suite has the following spark deps: mvn -f pom.xml dependency:resolve | grep spark [INFO] org.apache.spark:spark-kvstore_2.11:jar:2.4.0-SNAPSHOT:compile [INFO] org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT:compile [INFO] org.apache.spark:spark-unsafe_2.11:jar:2.4.0-SNAPSHOT:compile [INFO] org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile [INFO] org.apache.spark:spark-launcher_2.11:jar:2.4.0-SNAPSHOT:compile [INFO] org.apache.spark:spark-tags_2.11:jar:2.4.0-SNAPSHOT:compile [INFO] org.apache.spark:spark-tags_2.11:test-jar:tests:2.4.0-SNAPSHOT:compile [INFO] org.spark-project.spark:unused:jar:1.0.0:compile [INFO] org.apache.spark:spark-network-common_2.11:jar:2.4.0-SNAPSHOT:compile was (Author: skonto): As I mentioned elsewhere one thing that worked for me is modify ./dev/make-distribution.sh to do a clean install instead of clean package. But not sure for the long term for this, although you could purge deps. Test suite has the following spark deps: mvn -f pom.xml dependency:resolve | grep spark [INFO] org.apache.spark:spark-kvstore_2.11:jar:2.4.0-SNAPSHOT:compile [INFO] org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT:compile [INFO] org.apache.spark:spark-unsafe_2.11:jar:2.4.0-SNAPSHOT:compile [INFO] org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile [INFO] org.apache.spark:spark-launcher_2.11:jar:2.4.0-SNAPSHOT:compile [INFO] org.apache.spark:spark-tags_2.11:jar:2.4.0-SNAPSHOT:compile [INFO] org.apache.spark:spark-tags_2.11:test-jar:tests:2.4.0-SNAPSHOT:compile [INFO] org.spark-project.spark:unused:jar:1.0.0:compile [INFO] org.apache.spark:spark-network-common_2.11:jar:2.4.0-SNAPSHOT:compile > [K8S][TEST] Kubernetes integration tests don't trace the maven project > dependency structure > --- > > Key: SPARK-24825 > URL: https://issues.apache.org/jira/browse/SPARK-24825 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Tests >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Priority: Critical > > The Kubernetes integration tests will currently fail if maven installation is > not performed first, because the integration test build believes it should be > pulling the Spark parent artifact from maven central. However, this is > incorrect because the integration test should be building the Spark parent > pom as a dependency in the multi-module build, and the integration test > should just use the dynamically built artifact. Or to put it another way, the > integration test builds should never be pulling Spark dependencies from maven > central. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure
[ https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547132#comment-16547132 ] shane knapp commented on SPARK-24825: - yeah, `clean install` isn't probably a good long term solution. we're still hacking on things, so give us a little bit. :) > [K8S][TEST] Kubernetes integration tests don't trace the maven project > dependency structure > --- > > Key: SPARK-24825 > URL: https://issues.apache.org/jira/browse/SPARK-24825 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Tests >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Priority: Critical > > The Kubernetes integration tests will currently fail if maven installation is > not performed first, because the integration test build believes it should be > pulling the Spark parent artifact from maven central. However, this is > incorrect because the integration test should be building the Spark parent > pom as a dependency in the multi-module build, and the integration test > should just use the dynamically built artifact. Or to put it another way, the > integration test builds should never be pulling Spark dependencies from maven > central. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure
[ https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547128#comment-16547128 ] Stavros Kontopoulos edited comment on SPARK-24825 at 7/17/18 9:31 PM: -- As I mentioned elsewhere one thing that worked for me is modify ./dev/make-distribution.sh to do a clean install instead of clean package. But not sure for the long term for this, although you could purge deps. Test suite has the following spark deps: mvn -f pom.xml dependency:resolve | grep spark [INFO] org.apache.spark:spark-kvstore_2.11:jar:2.4.0-SNAPSHOT:compile [INFO] org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT:compile [INFO] org.apache.spark:spark-unsafe_2.11:jar:2.4.0-SNAPSHOT:compile [INFO] org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile [INFO] org.apache.spark:spark-launcher_2.11:jar:2.4.0-SNAPSHOT:compile [INFO] org.apache.spark:spark-tags_2.11:jar:2.4.0-SNAPSHOT:compile [INFO] org.apache.spark:spark-tags_2.11:test-jar:tests:2.4.0-SNAPSHOT:compile [INFO] org.spark-project.spark:unused:jar:1.0.0:compile [INFO] org.apache.spark:spark-network-common_2.11:jar:2.4.0-SNAPSHOT:compile was (Author: skonto): As I mentioned elsewhere one thing that worked for me is modify ./dev/make-distribution.sh to do an clean install instead of clean package. But not sure for the long term. Test suite has the following spark deps: mvn -f pom.xml dependency:resolve | grep spark [INFO] org.apache.spark:spark-kvstore_2.11:jar:2.4.0-SNAPSHOT:compile [INFO] org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT:compile [INFO] org.apache.spark:spark-unsafe_2.11:jar:2.4.0-SNAPSHOT:compile [INFO] org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile [INFO] org.apache.spark:spark-launcher_2.11:jar:2.4.0-SNAPSHOT:compile [INFO] org.apache.spark:spark-tags_2.11:jar:2.4.0-SNAPSHOT:compile [INFO] org.apache.spark:spark-tags_2.11:test-jar:tests:2.4.0-SNAPSHOT:compile [INFO] org.spark-project.spark:unused:jar:1.0.0:compile [INFO] org.apache.spark:spark-network-common_2.11:jar:2.4.0-SNAPSHOT:compile > [K8S][TEST] Kubernetes integration tests don't trace the maven project > dependency structure > --- > > Key: SPARK-24825 > URL: https://issues.apache.org/jira/browse/SPARK-24825 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Tests >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Priority: Critical > > The Kubernetes integration tests will currently fail if maven installation is > not performed first, because the integration test build believes it should be > pulling the Spark parent artifact from maven central. However, this is > incorrect because the integration test should be building the Spark parent > pom as a dependency in the multi-module build, and the integration test > should just use the dynamically built artifact. Or to put it another way, the > integration test builds should never be pulling Spark dependencies from maven > central. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure
[ https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547128#comment-16547128 ] Stavros Kontopoulos commented on SPARK-24825: - As I mentioned elsewhere one thing that worked for me is modify ./dev/make-distribution.sh to do an clean install instead of clean package. But not sure for the long term. Test suite has the following spark deps: mvn -f pom.xml dependency:resolve | grep spark [INFO] org.apache.spark:spark-kvstore_2.11:jar:2.4.0-SNAPSHOT:compile [INFO] org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT:compile [INFO] org.apache.spark:spark-unsafe_2.11:jar:2.4.0-SNAPSHOT:compile [INFO] org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile [INFO] org.apache.spark:spark-launcher_2.11:jar:2.4.0-SNAPSHOT:compile [INFO] org.apache.spark:spark-tags_2.11:jar:2.4.0-SNAPSHOT:compile [INFO] org.apache.spark:spark-tags_2.11:test-jar:tests:2.4.0-SNAPSHOT:compile [INFO] org.spark-project.spark:unused:jar:1.0.0:compile [INFO] org.apache.spark:spark-network-common_2.11:jar:2.4.0-SNAPSHOT:compile > [K8S][TEST] Kubernetes integration tests don't trace the maven project > dependency structure > --- > > Key: SPARK-24825 > URL: https://issues.apache.org/jira/browse/SPARK-24825 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Tests >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Priority: Critical > > The Kubernetes integration tests will currently fail if maven installation is > not performed first, because the integration test build believes it should be > pulling the Spark parent artifact from maven central. However, this is > incorrect because the integration test should be building the Spark parent > pom as a dependency in the multi-module build, and the integration test > should just use the dynamically built artifact. Or to put it another way, the > integration test builds should never be pulling Spark dependencies from maven > central. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24835) col function ignores drop
[ https://issues.apache.org/jira/browse/SPARK-24835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547123#comment-16547123 ] Liang-Chi Hsieh commented on SPARK-24835: - `drop` actually does to add a projection on top of original dataset. So the following query works: {code} df2 = df.drop('c') df2.where(df['c'] < 6).show() {code} It can be seen as a query (in Scala) like: {code} val df = Seq(("test1", 0), ("test2", 1)).toDF("name", "id") val filter1 = df.select(df("name")).filter(df("id") === 0) {code} This is a valid query. Regarding the query can't be run: {code} df = df.drop('c') df.where(df['c'] < 6).show() {code} Because you want to resolve column {{c}} on the top of updated {{df}} which have output {{[a, b]}} now, you can't successfully resolve this column. > col function ignores drop > - > > Key: SPARK-24835 > URL: https://issues.apache.org/jira/browse/SPARK-24835 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: Spark 2.3.0 > Python 3.5.3 >Reporter: Michael Souder >Priority: Minor > > Not sure if this is a bug or user error, but I've noticed that accessing > columns with the col function ignores a previous call to drop. > {code} > import pyspark.sql.functions as F > df = spark.createDataFrame([(1,3,5), (2, None, 7), (0, 3, 2)], ['a', 'b', > 'c']) > df.show() > +---++---+ > | a| b| c| > +---++---+ > | 1| 3| 5| > | 2|null| 7| > | 0| 3| 2| > +---++---+ > df = df.drop('c') > # the col function is able to see the 'c' column even though it has been > dropped > df.where(F.col('c') < 6).show() > +---+---+ > | a| b| > +---+---+ > | 1| 3| > | 0| 3| > +---+---+ > # trying the same with brackets on the data frame fails with the expected > error > df.where(df['c'] < 6).show() > Py4JJavaError: An error occurred while calling o36909.apply. > : org.apache.spark.sql.AnalysisException: Cannot resolve column name "c" > among (a, b);{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24835) col function ignores drop
[ https://issues.apache.org/jira/browse/SPARK-24835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547123#comment-16547123 ] Liang-Chi Hsieh edited comment on SPARK-24835 at 7/17/18 9:25 PM: -- `drop` actually does to add a projection on top of original dataset. So the following query works: {code:java} df2 = df.drop('c') df2.where(df['c'] < 6).show() {code} It can be seen as a query (in Scala) like: {code:java} val df = Seq(("test1", 0), ("test2", 1)).toDF("name", "id") df.select(df("name")).filter(df("id") === 0).show() {code} This is a valid query. Regarding the query can't be run: {code:java} df = df.drop('c') df.where(df['c'] < 6).show() {code} Because you want to resolve column {{c}} on the top of updated {{df}} which have output {{[a, b]}} now, you can't successfully resolve this column. was (Author: viirya): `drop` actually does to add a projection on top of original dataset. So the following query works: {code} df2 = df.drop('c') df2.where(df['c'] < 6).show() {code} It can be seen as a query (in Scala) like: {code} val df = Seq(("test1", 0), ("test2", 1)).toDF("name", "id") val filter1 = df.select(df("name")).filter(df("id") === 0) {code} This is a valid query. Regarding the query can't be run: {code} df = df.drop('c') df.where(df['c'] < 6).show() {code} Because you want to resolve column {{c}} on the top of updated {{df}} which have output {{[a, b]}} now, you can't successfully resolve this column. > col function ignores drop > - > > Key: SPARK-24835 > URL: https://issues.apache.org/jira/browse/SPARK-24835 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: Spark 2.3.0 > Python 3.5.3 >Reporter: Michael Souder >Priority: Minor > > Not sure if this is a bug or user error, but I've noticed that accessing > columns with the col function ignores a previous call to drop. > {code} > import pyspark.sql.functions as F > df = spark.createDataFrame([(1,3,5), (2, None, 7), (0, 3, 2)], ['a', 'b', > 'c']) > df.show() > +---++---+ > | a| b| c| > +---++---+ > | 1| 3| 5| > | 2|null| 7| > | 0| 3| 2| > +---++---+ > df = df.drop('c') > # the col function is able to see the 'c' column even though it has been > dropped > df.where(F.col('c') < 6).show() > +---+---+ > | a| b| > +---+---+ > | 1| 3| > | 0| 3| > +---+---+ > # trying the same with brackets on the data frame fails with the expected > error > df.where(df['c'] < 6).show() > Py4JJavaError: An error occurred while calling o36909.apply. > : org.apache.spark.sql.AnalysisException: Cannot resolve column name "c" > among (a, b);{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
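A small Scala sketch (not from the ticket) of the behaviour described above: an unresolved {{col("c")}}, or a column reference taken from the original DataFrame, can still be resolved through the projection that {{drop}} adds, while a reference taken from the dropped DataFrame cannot.

{code:scala}
// Sketch only; intended for spark-shell so the implicits are in scope.
import org.apache.spark.sql.functions.col

val df  = Seq((1, 3, 5), (2, 0, 7), (0, 3, 2)).toDF("a", "b", "c")
val df2 = df.drop("c")

df2.where(col("c") < 6).show()   // works: "c" is resolved through the projection added by drop
df2.where(df("c") < 6).show()    // works: df("c") is resolved against the original plan
// df2.where(df2("c") < 6)       // fails: cannot resolve column name "c" among (a, b)
{code}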
[jira] [Resolved] (SPARK-24681) Cannot create a view from a table when a nested column name contains ':'
[ https://issues.apache.org/jira/browse/SPARK-24681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24681. - Resolution: Fixed Assignee: Takeshi Yamamuro Fix Version/s: 2.4.0 > Cannot create a view from a table when a nested column name contains ':' > > > Key: SPARK-24681 > URL: https://issues.apache.org/jira/browse/SPARK-24681 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.3.0, 2.4.0 >Reporter: Adrian Ionescu >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 2.4.0 > > > Here's a patch that reproduces the issue: > {code:java} > diff --git > a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala > b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala > index 09c1547..29bb3db 100644 > --- > a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala > +++ > b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala > @@ -19,6 +19,7 @@ package org.apache.spark.sql.hive > > import org.apache.spark.sql.{QueryTest, Row} > import org.apache.spark.sql.execution.datasources.parquet.ParquetTest > +import org.apache.spark.sql.functions.{lit, struct} > import org.apache.spark.sql.hive.test.TestHiveSingleton > > case class Cases(lower: String, UPPER: String) > @@ -76,4 +77,21 @@ class HiveParquetSuite extends QueryTest with ParquetTest > with TestHiveSingleton > } > } > } > + > + test("column names including ':' characters") { > + withTempPath { path => > + withTable("test_table") { > + spark.range(0) > + .select(struct(lit(0).as("nested:column")).as("toplevel:column")) > + .write.format("parquet") > + .option("path", path.getCanonicalPath) > + .saveAsTable("test_table") > + > + sql("CREATE VIEW test_view_1 AS SELECT `toplevel:column`.* FROM > test_table") > + sql("CREATE VIEW test_view_2 AS SELECT * FROM test_table") > + > + } > + } > + } > }{code} > The first "CREATE VIEW" statement succeeds, but the second one fails with: > {code:java} > org.apache.spark.SparkException: Cannot recognize hive type string: > struct > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24838) Support uncorrelated IN/EXISTS subqueries for more operators
[ https://issues.apache.org/jira/browse/SPARK-24838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547105#comment-16547105 ] Xiao Li commented on SPARK-24838: - cc [~dkbiswal] > Support uncorrelated IN/EXISTS subqueries for more operators > - > > Key: SPARK-24838 > URL: https://issues.apache.org/jira/browse/SPARK-24838 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Qifan Pu >Priority: Major > > Currently, CheckAnalysis allows IN/EXISTS subquery only for filter operators. > Running a query: > {{select name in (select * from valid_names)}} > {{from all_names}} > returns error: > {code:java} > Error in SQL statement: AnalysisException: IN/EXISTS predicate sub-queries > can only be used in a Filter > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24838) Support uncorrelated IN/EXISTS subqueries for more operators
Qifan Pu created SPARK-24838: Summary: Support uncorrelated IN/EXISTS subqueries for more operators Key: SPARK-24838 URL: https://issues.apache.org/jira/browse/SPARK-24838 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.1 Reporter: Qifan Pu Currently, CheckAnalysis allows IN/EXISTS subquery only for filter operators. Running a query: {{select name in (select * from valid_names)}} {{from all_names}} returns error: {code:java} Error in SQL statement: AnalysisException: IN/EXISTS predicate sub-queries can only be used in a Filter {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
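A sketch (not from the ticket) of one way to express such an uncorrelated membership test today, rewriting the IN predicate in the projection as a left join plus a null check; note that this simplification ignores the three-valued NULL semantics of SQL IN.

{code:scala}
// Sketch only; the allNames/validNames DataFrames mirror the tables in the description
// and are illustrative. Intended for spark-shell so the implicits are in scope.
import org.apache.spark.sql.functions.{coalesce, col, lit}

val allNames   = Seq("alice", "bob").toDF("name")
val validNames = Seq("alice").toDF("name")

val result = allNames
  .join(validNames.withColumn("is_valid", lit(true)).distinct(), Seq("name"), "left")
  .select(col("name"), coalesce(col("is_valid"), lit(false)).as("name_in_valid"))

result.show()
{code}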
[jira] [Comment Edited] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure
[ https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547061#comment-16547061 ] Stavros Kontopoulos edited comment on SPARK-24825 at 7/17/18 8:54 PM: -- To reproduce it try build the test suite: ../../../build/mvn install Using `mvn` from path:...spark/build/apache-maven-3.3.9/bin/mvn [INFO] Scanning for projects... [INFO] [INFO] [INFO] Building Spark Project Kubernetes Integration Tests 2.4.0-SNAPSHOT [INFO] Downloading: [https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/maven-metadata.xml] Downloaded: [https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/maven-metadata.xml] (2 KB at 0.4 KB/sec) Downloading: [https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095738-170.pom] [WARNING] The POM for org.apache.spark:spark-core_2.11:jar:2.4.0-20180717.095738-170 is missing, no dependency information available [WARNING] The POM for org.apache.spark:spark-core_2.11:jar:tests:2.4.0-20180717.095737-170 is missing, no dependency information available Downloading: [https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/maven-metadata.xml] Downloaded: [https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/maven-metadata.xml] (2 KB at 0.7 KB/sec) Downloading: [https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/spark-tags_2.11-2.4.0-20180717.095221-170.pom] [WARNING] The POM for org.apache.spark:spark-tags_2.11:jar:tests:2.4.0-20180717.095220-170 is missing, no dependency information available Downloading: [https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095738-170.jar] Downloading: [https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095737-170-tests.jar] Downloading: [https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/spark-tags_2.11-2.4.0-20180717.095220-170-tests.jar] Downloaded: [https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/spark-tags_2.11-2.4.0-20180717.095220-170-tests.jar] (9 KB at 5.2 KB/sec) Downloaded: [https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095737-170-tests.jar] (10438 KB at 577.5 KB/sec) [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 26.346 s [INFO] Finished at: 2018-07-17T23:15:29+03:00 [INFO] Final Memory: 14M/204M [INFO] [ERROR] Failed to execute goal on project spark-kubernetes-integration-tests_2.11: Could not resolve dependencies for project spark-kubernetes-integration-tests:spark-kubernetes-integration-tests_2.11:jar:2.4.0-SNAPSHOT: Could not find artifact org.apache.spark:spark-core_2.11:jar:2.4.0-20180717.095738-170 in apache.snapshots ([https://repository.apache.org/snapshots]) -> [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. 
[ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] [http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionExceptio|http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException] The metadata here show expected version: [https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/maven-metadata.xml]
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.0-SNAPSHOT</version>
<snapshot>
  <timestamp>20180717.095738</timestamp>
  <buildNumber>170</buildNumber>
</snapshot>
<lastUpdated>20180717095738</lastUpdated>
but there is no such jar uploaded. Latest is: [spark-core_2.11-2.4.0-20180717.095737-170.jar|https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095737-170.jar] Probably timestamp is wrong here. was (Author: skonto): To reproduce it try build the test suite: ../../../build/mvn install Using `mvn` from path:...spark/build/apache-maven-3.3.9/bin/mvn [INFO] Scanning for projects... [INFO] [INFO] [INFO] Building Spark Project Kubernetes Integration Tests 2.4.0-SNAPSHOT [INFO]

[jira] [Comment Edited] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure
[ https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547061#comment-16547061 ] Stavros Kontopoulos edited comment on SPARK-24825 at 7/17/18 8:54 PM: -- To reproduce it try build the test suite: ../../../build/mvn install Using `mvn` from path:...spark/build/apache-maven-3.3.9/bin/mvn [INFO] Scanning for projects... [INFO] [INFO] [INFO] Building Spark Project Kubernetes Integration Tests 2.4.0-SNAPSHOT [INFO] Downloading: [https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/maven-metadata.xml] Downloaded: [https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/maven-metadata.xml] (2 KB at 0.4 KB/sec) Downloading: [https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095738-170.pom] [WARNING] The POM for org.apache.spark:spark-core_2.11:jar:2.4.0-20180717.095738-170 is missing, no dependency information available [WARNING] The POM for org.apache.spark:spark-core_2.11:jar:tests:2.4.0-20180717.095737-170 is missing, no dependency information available Downloading: [https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/maven-metadata.xml] Downloaded: [https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/maven-metadata.xml] (2 KB at 0.7 KB/sec) Downloading: [https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/spark-tags_2.11-2.4.0-20180717.095221-170.pom] [WARNING] The POM for org.apache.spark:spark-tags_2.11:jar:tests:2.4.0-20180717.095220-170 is missing, no dependency information available Downloading: [https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095738-170.jar] Downloading: [https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095737-170-tests.jar] Downloading: [https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/spark-tags_2.11-2.4.0-20180717.095220-170-tests.jar] Downloaded: [https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/spark-tags_2.11-2.4.0-20180717.095220-170-tests.jar] (9 KB at 5.2 KB/sec) Downloaded: [https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095737-170-tests.jar] (10438 KB at 577.5 KB/sec) [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 26.346 s [INFO] Finished at: 2018-07-17T23:15:29+03:00 [INFO] Final Memory: 14M/204M [INFO] [ERROR] Failed to execute goal on project spark-kubernetes-integration-tests_2.11: Could not resolve dependencies for project spark-kubernetes-integration-tests:spark-kubernetes-integration-tests_2.11:jar:2.4.0-SNAPSHOT: Could not find artifact org.apache.spark:spark-core_2.11:jar:2.4.0-20180717.095738-170 in apache.snapshots ([https://repository.apache.org/snapshots]) -> [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. 
[ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] [http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionExceptio|http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException] The metadata here show expected version: [https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/maven-metadata.xml] org.apache.spark spark-core_2.11 2.4.0-SNAPSHOT 20180717.095738 170 20180717095738 but there is no such jar uploaded. Latest is: [spark-core_2.11-2.4.0-20180717.095737-170.jar|https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095737-170.jar] Probably timestamp is wrong here: was (Author: skonto): To reproduce it try build the test suite: ../../../build/mvn install Using `mvn` from path:...spark/build/apache-maven-3.3.9/bin/mvn [INFO] Scanning for projects... [INFO] [INFO] [INFO] Building Spark Project Kubernetes Integration Tests 2.4.0-SNAPSHOT [INFO]
[jira] [Updated] (SPARK-23723) New encoding option for json datasource
[ https://issues.apache.org/jira/browse/SPARK-23723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23723: Summary: New encoding option for json datasource (was: New charset option for json datasource) > New encoding option for json datasource > --- > > Key: SPARK-23723 > URL: https://issues.apache.org/jira/browse/SPARK-23723 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 2.4.0 > > > Currently JSON Reader can read json files in different charset/encodings. The > JSON Reader uses the jackson-json library to automatically detect the charset > of input text/stream. Here you can see the method which detects encoding: > [https://github.com/FasterXML/jackson-core/blob/master/src/main/java/com/fasterxml/jackson/core/json/ByteSourceJsonBootstrapper.java#L111-L174] > > The detectEncoding method checks the BOM > ([https://en.wikipedia.org/wiki/Byte_order_mark]) at the beginning of a text. > The BOM can be in the file but it is not mandatory. If it is not present, the > auto detection mechanism can select wrong charset. And as a consequence of > that, the user cannot read the json file. *The proposed option will allow to > bypass the auto detection mechanism and set the charset explicitly.* > > The charset option is already exposed as a CSV option: > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L87-L88] > . I propose to add the same option for JSON. > > Regarding to JSON Writer, *the charset option will give to the user > opportunity* to read json files in charset different from UTF-8, modify the > dataset and *write results back to json files in the original encoding.* At > the moment it is not possible to do because the result can be saved in UTF-8 > only. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
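As a rough illustration of the proposed option (the exact option name and behaviour here are assumptions based on the description above, not a confirmed API), reading and writing JSON in an explicit, non-default encoding would look something like this:

{code:scala}
// Bypass BOM-based charset auto-detection and read the files as UTF-16LE.
val df = spark.read
  .option("encoding", "UTF-16LE")
  .option("multiLine", true)
  .json("path/to/utf16-json")

// Write the (possibly modified) result back in the original encoding
// instead of the current UTF-8-only behaviour.
df.write
  .option("encoding", "UTF-16LE")
  .json("path/to/output")
{code}

Here {{spark}} is assumed to be an existing SparkSession and the paths are placeholders.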
[jira] [Commented] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure
[ https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547084#comment-16547084 ] shane knapp commented on SPARK-24825: - we're trying to get things to work w/o uploading the jar, but instead have it found locally. > [K8S][TEST] Kubernetes integration tests don't trace the maven project > dependency structure > --- > > Key: SPARK-24825 > URL: https://issues.apache.org/jira/browse/SPARK-24825 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Tests >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Priority: Critical > > The Kubernetes integration tests will currently fail if maven installation is > not performed first, because the integration test build believes it should be > pulling the Spark parent artifact from maven central. However, this is > incorrect because the integration test should be building the Spark parent > pom as a dependency in the multi-module build, and the integration test > should just use the dynamically built artifact. Or to put it another way, the > integration test builds should never be pulling Spark dependencies from maven > central. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure
[ https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547061#comment-16547061 ] Stavros Kontopoulos edited comment on SPARK-24825 at 7/17/18 8:44 PM: -- To reproduce it try build the test suite: ../../../build/mvn install Using `mvn` from path:...spark/build/apache-maven-3.3.9/bin/mvn [INFO] Scanning for projects... [INFO] [INFO] [INFO] Building Spark Project Kubernetes Integration Tests 2.4.0-SNAPSHOT [INFO] Downloading: [https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/maven-metadata.xml] Downloaded: [https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/maven-metadata.xml] (2 KB at 0.4 KB/sec) Downloading: [https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095738-170.pom] [WARNING] The POM for org.apache.spark:spark-core_2.11:jar:2.4.0-20180717.095738-170 is missing, no dependency information available [WARNING] The POM for org.apache.spark:spark-core_2.11:jar:tests:2.4.0-20180717.095737-170 is missing, no dependency information available Downloading: [https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/maven-metadata.xml] Downloaded: [https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/maven-metadata.xml] (2 KB at 0.7 KB/sec) Downloading: [https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/spark-tags_2.11-2.4.0-20180717.095221-170.pom] [WARNING] The POM for org.apache.spark:spark-tags_2.11:jar:tests:2.4.0-20180717.095220-170 is missing, no dependency information available Downloading: [https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095738-170.jar] Downloading: [https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095737-170-tests.jar] Downloading: [https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/spark-tags_2.11-2.4.0-20180717.095220-170-tests.jar] Downloaded: [https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/spark-tags_2.11-2.4.0-20180717.095220-170-tests.jar] (9 KB at 5.2 KB/sec) Downloaded: [https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095737-170-tests.jar] (10438 KB at 577.5 KB/sec) [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 26.346 s [INFO] Finished at: 2018-07-17T23:15:29+03:00 [INFO] Final Memory: 14M/204M [INFO] [ERROR] Failed to execute goal on project spark-kubernetes-integration-tests_2.11: Could not resolve dependencies for project spark-kubernetes-integration-tests:spark-kubernetes-integration-tests_2.11:jar:2.4.0-SNAPSHOT: Could not find artifact org.apache.spark:spark-core_2.11:jar:2.4.0-20180717.095738-170 in apache.snapshots ([https://repository.apache.org/snapshots]) -> [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. 
[ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] [http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionExceptio|http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException] The metadata here show expected version: [https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/maven-metadata.xml] org.apache.spark spark-core_2.11 2.4.0-SNAPSHOT 20180717.095738 170 20180717095738 but there is no such jar uploaded. was (Author: skonto): To reproduce it try build the test suite: ../../../build/mvn install Using `mvn` from path:...spark/build/apache-maven-3.3.9/bin/mvn [INFO] Scanning for projects... [INFO] [INFO] [INFO] Building Spark Project Kubernetes Integration Tests 2.4.0-SNAPSHOT [INFO] Downloading: [https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/maven-metadata.xml] Downloaded:
[jira] [Commented] (SPARK-24536) Query with nonsensical LIMIT hits AssertionError
[ https://issues.apache.org/jira/browse/SPARK-24536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547074#comment-16547074 ] Nihar Sheth commented on SPARK-24536: - I'd like to take a shot at this, if no one else is already working on it. > Query with nonsensical LIMIT hits AssertionError > > > Key: SPARK-24536 > URL: https://issues.apache.org/jira/browse/SPARK-24536 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Alexander Behm >Priority: Trivial > Labels: beginner > > SELECT COUNT(1) FROM t LIMIT CAST(NULL AS INT) > fails in the QueryPlanner with: > {code} > java.lang.AssertionError: assertion failed: No plan for GlobalLimit null > {code} > I think this issue should be caught earlier during semantic analysis. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
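A rough sketch of the kind of analysis-time check being suggested (illustrative only, not the actual Spark code path; the real check would live in the analyzer and surface as an AnalysisException rather than an IllegalArgumentException):

{code:scala}
import org.apache.spark.sql.catalyst.expressions.Expression

// Illustrative check: a LIMIT expression must be a constant, non-null,
// non-negative integer; otherwise analysis should fail instead of the
// planner hitting "No plan for GlobalLimit null".
def validateLimitExpression(limitExpr: Expression): Unit = {
  require(limitExpr.foldable, "The limit expression must evaluate to a constant value")
  limitExpr.eval() match {
    case null =>
      throw new IllegalArgumentException("The evaluated limit expression must not be null")
    case v: Int if v < 0 =>
      throw new IllegalArgumentException(s"The limit expression must be >= 0, but got $v")
    case _ => // a valid constant limit, nothing to do
  }
}
{code}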
[jira] [Updated] (SPARK-24837) Add kafka as spark metrics sink
[ https://issues.apache.org/jira/browse/SPARK-24837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandish Kumar HN updated SPARK-24837: - Description: Sink spark metrics to kafka producer spark/core/src/main/scala/org/apache/spark/metrics/sink/ someone assign this to me? was: Sink spark logs/metrics to kafka producer spark/core/src/main/scala/org/apache/spark/metrics/sink/ someone assign this to me? > Add kafka as spark metrics sink > --- > > Key: SPARK-24837 > URL: https://issues.apache.org/jira/browse/SPARK-24837 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Sandish Kumar HN >Priority: Major > > Sink spark metrics to kafka producer > spark/core/src/main/scala/org/apache/spark/metrics/sink/ > someone assign this to me? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
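For context on what such a sink involves: Spark's metrics system loads sink classes reflectively and drives them through the internal {{Sink}} trait ({{start}}/{{stop}}/{{report}}). The following is only a hypothetical sketch of a Kafka-backed sink; the class name, property keys and push-on-report behaviour are assumptions for illustration, not an existing Spark API:

{code:scala}
package org.apache.spark.metrics.sink

import java.util.Properties

import scala.collection.JavaConverters._

import com.codahale.metrics.MetricRegistry
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

import org.apache.spark.SecurityManager

// Hypothetical sketch: publish gauge and counter values to a Kafka topic
// each time the metrics system asks the sink to report.
private[spark] class KafkaSink(
    val property: Properties,
    val registry: MetricRegistry,
    securityMgr: SecurityManager) extends Sink {

  private val topic = property.getProperty("topic", "spark-metrics")

  private val producer: KafkaProducer[String, String] = {
    val p = new Properties()
    p.put("bootstrap.servers", property.getProperty("bootstrap.servers", "localhost:9092"))
    p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    new KafkaProducer[String, String](p)
  }

  override def start(): Unit = { }

  override def stop(): Unit = producer.close()

  override def report(): Unit = {
    registry.getGauges.asScala.foreach { case (name, gauge) =>
      producer.send(new ProducerRecord[String, String](topic, name, String.valueOf(gauge.getValue)))
    }
    registry.getCounters.asScala.foreach { case (name, counter) =>
      producer.send(new ProducerRecord[String, String](topic, name, counter.getCount.toString))
    }
  }
}
{code}

A production version would more likely wrap a Codahale {{ScheduledReporter}} with a configurable polling period, the way the existing Graphite and CSV sinks do.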
[jira] [Updated] (SPARK-24837) Add kafka as spark metrics sink
[ https://issues.apache.org/jira/browse/SPARK-24837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandish Kumar HN updated SPARK-24837: - Summary: Add kafka as spark metrics sink (was: Add kafka as spark logs/metrics sink) > Add kafka as spark metrics sink > --- > > Key: SPARK-24837 > URL: https://issues.apache.org/jira/browse/SPARK-24837 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Sandish Kumar HN >Priority: Major > > Sink spark logs/metrics to kafka producer > spark/core/src/main/scala/org/apache/spark/metrics/sink/ > someone assign this to me? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure
[ https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547061#comment-16547061 ] Stavros Kontopoulos edited comment on SPARK-24825 at 7/17/18 8:20 PM: -- To reproduce it try build the test suite: ../../../build/mvn install Using `mvn` from path:...spark/build/apache-maven-3.3.9/bin/mvn [INFO] Scanning for projects... [INFO] [INFO] [INFO] Building Spark Project Kubernetes Integration Tests 2.4.0-SNAPSHOT [INFO] Downloading: [https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/maven-metadata.xml] Downloaded: [https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/maven-metadata.xml] (2 KB at 0.4 KB/sec) Downloading: [https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095738-170.pom] [WARNING] The POM for org.apache.spark:spark-core_2.11:jar:2.4.0-20180717.095738-170 is missing, no dependency information available [WARNING] The POM for org.apache.spark:spark-core_2.11:jar:tests:2.4.0-20180717.095737-170 is missing, no dependency information available Downloading: [https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/maven-metadata.xml] Downloaded: [https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/maven-metadata.xml] (2 KB at 0.7 KB/sec) Downloading: [https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/spark-tags_2.11-2.4.0-20180717.095221-170.pom] [WARNING] The POM for org.apache.spark:spark-tags_2.11:jar:tests:2.4.0-20180717.095220-170 is missing, no dependency information available Downloading: [https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095738-170.jar] Downloading: [https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095737-170-tests.jar] Downloading: [https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/spark-tags_2.11-2.4.0-20180717.095220-170-tests.jar] Downloaded: [https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/spark-tags_2.11-2.4.0-20180717.095220-170-tests.jar] (9 KB at 5.2 KB/sec) Downloaded: [https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095737-170-tests.jar] (10438 KB at 577.5 KB/sec) [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 26.346 s [INFO] Finished at: 2018-07-17T23:15:29+03:00 [INFO] Final Memory: 14M/204M [INFO] [ERROR] Failed to execute goal on project spark-kubernetes-integration-tests_2.11: Could not resolve dependencies for project spark-kubernetes-integration-tests:spark-kubernetes-integration-tests_2.11:jar:2.4.0-SNAPSHOT: Could not find artifact org.apache.spark:spark-core_2.11:jar:2.4.0-20180717.095738-170 in apache.snapshots ([https://repository.apache.org/snapshots]) -> [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. 
[ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] [http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionExceptio|http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException] `` was (Author: skonto): To reproduce it try build the test suite: ``` ../../../build/mvn install Using `mvn` from path:...spark/build/apache-maven-3.3.9/bin/mvn [INFO] Scanning for projects... [INFO] [INFO] [INFO] Building Spark Project Kubernetes Integration Tests 2.4.0-SNAPSHOT [INFO] Downloading: https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/maven-metadata.xml Downloaded: https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/maven-metadata.xml (2 KB at 0.4 KB/sec) Downloading: https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095738-170.pom [WARNING] The POM for org.apache.spark:spark-core_2.11:jar:2.4.0-20180717.095738-170 is missing,
[jira] [Commented] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure
[ https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547061#comment-16547061 ] Stavros Kontopoulos commented on SPARK-24825: - To reproduce it try build the test suite: ``` ../../../build/mvn install Using `mvn` from path:...spark/build/apache-maven-3.3.9/bin/mvn [INFO] Scanning for projects... [INFO] [INFO] [INFO] Building Spark Project Kubernetes Integration Tests 2.4.0-SNAPSHOT [INFO] Downloading: https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/maven-metadata.xml Downloaded: https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/maven-metadata.xml (2 KB at 0.4 KB/sec) Downloading: https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095738-170.pom [WARNING] The POM for org.apache.spark:spark-core_2.11:jar:2.4.0-20180717.095738-170 is missing, no dependency information available [WARNING] The POM for org.apache.spark:spark-core_2.11:jar:tests:2.4.0-20180717.095737-170 is missing, no dependency information available Downloading: https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/maven-metadata.xml Downloaded: https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/maven-metadata.xml (2 KB at 0.7 KB/sec) Downloading: https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/spark-tags_2.11-2.4.0-20180717.095221-170.pom [WARNING] The POM for org.apache.spark:spark-tags_2.11:jar:tests:2.4.0-20180717.095220-170 is missing, no dependency information available Downloading: https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095738-170.jar Downloading: https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095737-170-tests.jar Downloading: https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/spark-tags_2.11-2.4.0-20180717.095220-170-tests.jar Downloaded: https://repository.apache.org/snapshots/org/apache/spark/spark-tags_2.11/2.4.0-SNAPSHOT/spark-tags_2.11-2.4.0-20180717.095220-170-tests.jar (9 KB at 5.2 KB/sec) Downloaded: https://repository.apache.org/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/spark-core_2.11-2.4.0-20180717.095737-170-tests.jar (10438 KB at 577.5 KB/sec) [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 26.346 s [INFO] Finished at: 2018-07-17T23:15:29+03:00 [INFO] Final Memory: 14M/204M [INFO] [ERROR] Failed to execute goal on project spark-kubernetes-integration-tests_2.11: Could not resolve dependencies for project spark-kubernetes-integration-tests:spark-kubernetes-integration-tests_2.11:jar:2.4.0-SNAPSHOT: Could not find artifact org.apache.spark:spark-core_2.11:jar:2.4.0-20180717.095738-170 in apache.snapshots (https://repository.apache.org/snapshots) -> [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. 
[ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] [http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionExceptio|http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException] ``` `` > [K8S][TEST] Kubernetes integration tests don't trace the maven project > dependency structure > --- > > Key: SPARK-24825 > URL: https://issues.apache.org/jira/browse/SPARK-24825 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Tests >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Priority: Critical > > The Kubernetes integration tests will currently fail if maven installation is > not performed first, because the integration test build believes it should be > pulling the Spark parent artifact from maven central. However, this is > incorrect because the integration test should be building the Spark parent > pom as a dependency in the multi-module build, and the integration test > should just use the dynamically built artifact. Or to put it another way, the > integration test builds should never be pulling Spark dependencies from
[jira] [Updated] (SPARK-24837) Add kafka as spark logs/metrics sink
[ https://issues.apache.org/jira/browse/SPARK-24837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandish Kumar HN updated SPARK-24837: - Description: Sink spark logs/metrics to kafka producer spark/core/src/main/scala/org/apache/spark/metrics/sink/ someone assign this to me? was: Sink spark logs/metrics to kafka producer spark/core/src/main/scala/org/apache/spark/metrics/sink/ > Add kafka as spark logs/metrics sink > > > Key: SPARK-24837 > URL: https://issues.apache.org/jira/browse/SPARK-24837 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Sandish Kumar HN >Priority: Major > > Sink spark logs/metrics to kafka producer > spark/core/src/main/scala/org/apache/spark/metrics/sink/ > someone assign this to me? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24837) Add kafka as spark logs/metrics sink
Sandish Kumar HN created SPARK-24837: Summary: Add kafka as spark logs/metrics sink Key: SPARK-24837 URL: https://issues.apache.org/jira/browse/SPARK-24837 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 2.3.1 Reporter: Sandish Kumar HN Sink spark logs/metrics to kafka producer spark/core/src/main/scala/org/apache/spark/metrics/sink/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24747) Make spark.ml.util.Instrumentation class more flexible
[ https://issues.apache.org/jira/browse/SPARK-24747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-24747. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21719 [https://github.com/apache/spark/pull/21719] > Make spark.ml.util.Instrumentation class more flexible > -- > > Key: SPARK-24747 > URL: https://issues.apache.org/jira/browse/SPARK-24747 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.1 >Reporter: Bago Amirbekian >Assignee: Bago Amirbekian >Priority: Major > Fix For: 2.4.0 > > > The Instrumentation class (which is an internal private class) is some what > limited by it's current APIs. The class requires an estimator and dataset be > passed to the constructor which limits how it can be used. Furthermore, the > current APIs make it hard to intercept failures and record anything related > to those failures. > The following changes could make the instrumentation class easier to work > with. All these changes are for private APIs and should not be visible to > users. > {code} > // New no-argument constructor: > Instrumentation() > // New api to log previous constructor arguments. > logTrainingContext(estimator: Estimator[_], dataset: Dataset[_]) > logFailure(e: Throwable): Unit > // Log success with no arguments > logSuccess(): Unit > // Log result model explicitly instead of passing to logSuccess > logModel(model: Model[_]): Unit > // On Companion object > Instrumentation.instrumented[T](body: (Instrumentation => T)): T > // The above API will allow us to write instrumented methods more clearly and > handle logging success and failure automatically: > def someMethod(...): T = instrumented { instr => > instr.logNamedValue(name, value) > // more code here > instr.logModel(model) > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24644) Pyarrow exception while running pandas_udf on pyspark 2.3.1
[ https://issues.apache.org/jira/browse/SPARK-24644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547031#comment-16547031 ] Hichame El Khalfi commented on SPARK-24644: --- Indeed we were using an old version on pandas, now after updating it to 0.19.2, no crash/error to report. Thank you [~bryanc] and [~hyukjin.kwon] for you valuable help and input (y). > Pyarrow exception while running pandas_udf on pyspark 2.3.1 > --- > > Key: SPARK-24644 > URL: https://issues.apache.org/jira/browse/SPARK-24644 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 > Environment: os: centos > pyspark 2.3.1 > spark 2.3.1 > pyarrow >= 0.8.0 >Reporter: Hichame El Khalfi >Priority: Major > > Hello, > When I try to run a `pandas_udf` on my spark dataframe, I get this error > > {code:java} > File > "/mnt/ephemeral3/yarn/nm/usercache/user/appcache/application_1524574803975_205774/container_e280_1524574803975_205774_01_44/pyspark.zip/pyspark/serializers.py", > lin > e 280, in load_stream > pdf = batch.to_pandas() > File "pyarrow/table.pxi", line 677, in pyarrow.lib.RecordBatch.to_pandas > (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:43226) > return Table.from_batches([self]).to_pandas(nthreads=nthreads) > File "pyarrow/table.pxi", line 1043, in pyarrow.lib.Table.to_pandas > (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:46331) > mgr = pdcompat.table_to_blockmanager(options, self, memory_pool, > File "/usr/lib64/python2.7/site-packages/pyarrow/pandas_compat.py", line > 528, in table_to_blockmanager > blocks = _table_to_blocks(options, block_table, nthreads, memory_pool) > File "/usr/lib64/python2.7/site-packages/pyarrow/pandas_compat.py", line > 622, in _table_to_blocks > return [_reconstruct_block(item) for item in result] > File "/usr/lib64/python2.7/site-packages/pyarrow/pandas_compat.py", line > 446, in _reconstruct_block > block = _int.make_block(block_arr, placement=placement) > TypeError: make_block() takes at least 3 arguments (2 given) > {code} > > More than happy to provide any additional information -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24836) New option - ignoreExtension
[ https://issues.apache.org/jira/browse/SPARK-24836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24836: Assignee: Apache Spark > New option - ignoreExtension > > > Key: SPARK-24836 > URL: https://issues.apache.org/jira/browse/SPARK-24836 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > > Need to add new option for Avro datasource - *ignoreExtension*. It should > control ignoring of the .avro extensions. If it is set to *true* (by > default), files with and without .avro extensions should be loaded. Example > of usage: > {code:scala} > spark > .read > .option("ignoreExtension", false) > .avro("path to avro files") > {code} > The option duplicates Hadoop's config > avro.mapred.ignore.inputs.without.extension which is taken into account by > Avro datasource now and can be set like: > {code:scala} > spark > .sqlContext > .sparkContext > .hadoopConfiguration > .set("avro.mapred.ignore.inputs.without.extension", "true") > {code} > The ignoreExtension option must override > avro.mapred.ignore.inputs.without.extension. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24836) New option - ignoreExtension
[ https://issues.apache.org/jira/browse/SPARK-24836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24836: Assignee: (was: Apache Spark) > New option - ignoreExtension > > > Key: SPARK-24836 > URL: https://issues.apache.org/jira/browse/SPARK-24836 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Need to add new option for Avro datasource - *ignoreExtension*. It should > control ignoring of the .avro extensions. If it is set to *true* (by > default), files with and without .avro extensions should be loaded. Example > of usage: > {code:scala} > spark > .read > .option("ignoreExtension", false) > .avro("path to avro files") > {code} > The option duplicates Hadoop's config > avro.mapred.ignore.inputs.without.extension which is taken into account by > Avro datasource now and can be set like: > {code:scala} > spark > .sqlContext > .sparkContext > .hadoopConfiguration > .set("avro.mapred.ignore.inputs.without.extension", "true") > {code} > The ignoreExtension option must override > avro.mapred.ignore.inputs.without.extension. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24836) New option - ignoreExtension
[ https://issues.apache.org/jira/browse/SPARK-24836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547010#comment-16547010 ] Apache Spark commented on SPARK-24836: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/21798 > New option - ignoreExtension > > > Key: SPARK-24836 > URL: https://issues.apache.org/jira/browse/SPARK-24836 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Need to add new option for Avro datasource - *ignoreExtension*. It should > control ignoring of the .avro extensions. If it is set to *true* (by > default), files with and without .avro extensions should be loaded. Example > of usage: > {code:scala} > spark > .read > .option("ignoreExtension", false) > .avro("path to avro files") > {code} > The option duplicates Hadoop's config > avro.mapred.ignore.inputs.without.extension which is taken into account by > Avro datasource now and can be set like: > {code:scala} > spark > .sqlContext > .sparkContext > .hadoopConfiguration > .set("avro.mapred.ignore.inputs.without.extension", "true") > {code} > The ignoreExtension option must override > avro.mapred.ignore.inputs.without.extension. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24836) New option - ignoreExtension
Maxim Gekk created SPARK-24836: -- Summary: New option - ignoreExtension Key: SPARK-24836 URL: https://issues.apache.org/jira/browse/SPARK-24836 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.1 Reporter: Maxim Gekk Need to add new option for Avro datasource - *ignoreExtension*. It should control ignoring of the .avro extensions. If it is set to *true* (by default), files with and without .avro extensions should be loaded. Example of usage: {code:scala} spark .read .option("ignoreExtension", false) .avro("path to avro files") {code} The option duplicates Hadoop's config avro.mapred.ignore.inputs.without.extension which is taken into account by Avro datasource now and can be set like: {code:scala} spark .sqlContext .sparkContext .hadoopConfiguration .set("avro.mapred.ignore.inputs.without.extension", "true") {code} The ignoreExtension option must override avro.mapred.ignore.inputs.without.extension. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24835) col function ignores drop
Michael Souder created SPARK-24835: -- Summary: col function ignores drop Key: SPARK-24835 URL: https://issues.apache.org/jira/browse/SPARK-24835 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.3.0 Environment: Spark 2.3.0 Python 3.5.3 Reporter: Michael Souder Not sure if this is a bug or user error, but I've noticed that accessing columns with the col function ignores a previous call to drop. {code} import pyspark.sql.functions as F df = spark.createDataFrame([(1,3,5), (2, None, 7), (0, 3, 2)], ['a', 'b', 'c']) df.show() +---++---+ | a| b| c| +---++---+ | 1| 3| 5| | 2|null| 7| | 0| 3| 2| +---++---+ df = df.drop('c') # the col function is able to see the 'c' column even though it has been dropped df.where(F.col('c') < 6).show() +---+---+ | a| b| +---+---+ | 1| 3| | 0| 3| +---+---+ # trying the same with brackets on the data frame fails with the expected error df.where(df['c'] < 6).show() Py4JJavaError: An error occurred while calling o36909.apply. : org.apache.spark.sql.AnalysisException: Cannot resolve column name "c" among (a, b);{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24801) Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can waste a lot of memory
[ https://issues.apache.org/jira/browse/SPARK-24801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546981#comment-16546981 ] Misha Dmitriev commented on SPARK-24801: [~irashid] yes, I'll submit a change to lazily initialize {{SaslEncryption$EncryptedMessage}}.{{byteChannel}}, that won't hurt anyway. > Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can > waste a lot of memory > --- > > Key: SPARK-24801 > URL: https://issues.apache.org/jira/browse/SPARK-24801 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.3.0 >Reporter: Misha Dmitriev >Priority: Major > > I recently analyzed another Yarn NM heap dump with jxray > ([www.jxray.com),|http://www.jxray.com),/] and found that 81% of memory is > wasted by empty (all zeroes) byte[] arrays. Most of these arrays are > referenced by > {{org.apache.spark.network.util.ByteArrayWritableChannel.data}}, and these in > turn come from > {{spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel}}. Here is > the full reference chain that leads to the problematic arrays: > {code:java} > 2,597,946K (64.1%): byte[]: 40583 / 100% of empty 2,597,946K (64.1%) > ↖org.apache.spark.network.util.ByteArrayWritableChannel.data > ↖org.apache.spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel > ↖io.netty.channel.ChannelOutboundBuffer$Entry.msg > ↖io.netty.channel.ChannelOutboundBuffer$Entry.{next} > ↖io.netty.channel.ChannelOutboundBuffer.flushedEntry > ↖io.netty.channel.socket.nio.NioSocketChannel$NioSocketChannelUnsafe.outboundBuffer > ↖io.netty.channel.socket.nio.NioSocketChannel.unsafe > ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.associatedChannel > ↖{java.util.concurrent.ConcurrentHashMap}.values > ↖org.apache.spark.network.server.OneForOneStreamManager.streams > ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager > ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler > ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance{code} > > Checking the code of {{SaslEncryption$EncryptedMessage}}, I see that > byteChannel is always initialized eagerly in the constructor: > {code:java} > this.byteChannel = new ByteArrayWritableChannel(maxOutboundBlockSize);{code} > So I think to address the problem of empty byte[] arrays flooding the memory, > we should initialize {{byteChannel}} lazily, upon the first use. As far as I > can see, it's used only in one method, {{private void nextChunk()}}. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24071) Micro-benchmark of Parquet Filter Pushdown
[ https://issues.apache.org/jira/browse/SPARK-24071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24071. - Resolution: Duplicate Fix Version/s: 2.4.0 > Micro-benchmark of Parquet Filter Pushdown > -- > > Key: SPARK-24071 > URL: https://issues.apache.org/jira/browse/SPARK-24071 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiao Li >Priority: Major > Fix For: 2.4.0 > > > Need a micro-benchmark suite for Parquet filter pushdown -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
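To make concrete what such a micro-benchmark measures, here is a minimal sketch (the path and data sizes are made up, and a real suite would use Spark's internal benchmark harness rather than {{spark.time}}):

{code:scala}
// Write a Parquet dataset once, then time the same selective filter with
// predicate pushdown enabled and disabled to see the effect of row-group skipping.
val path = "/tmp/parquet-pushdown-benchmark"
spark.range(0, 20000000L)
  .selectExpr("id", "cast(id % 1000 as int) as key")
  .write.mode("overwrite").parquet(path)

for (pushdown <- Seq(true, false)) {
  spark.conf.set("spark.sql.parquet.filterPushdown", pushdown.toString)
  spark.time {
    spark.read.parquet(path).where("key = 42").count()
  }
}
{code}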
[jira] [Resolved] (SPARK-24070) TPC-DS Performance Tests for Parquet 1.10.0 Upgrade
[ https://issues.apache.org/jira/browse/SPARK-24070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24070. - Resolution: Fixed Fix Version/s: 2.4.0 > TPC-DS Performance Tests for Parquet 1.10.0 Upgrade > --- > > Key: SPARK-24070 > URL: https://issues.apache.org/jira/browse/SPARK-24070 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiao Li >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 2.4.0 > > > TPC-DS performance evaluation of Apache Spark Parquet 1.8.2 and 1.10.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24402) Optimize `In` expression when only one element in the collection or collection is empty
[ https://issues.apache.org/jira/browse/SPARK-24402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546920#comment-16546920 ] Apache Spark commented on SPARK-24402: -- User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/21797 > Optimize `In` expression when only one element in the collection or > collection is empty > > > Key: SPARK-24402 > URL: https://issues.apache.org/jira/browse/SPARK-24402 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > > Two new rules in the logical plan optimizers are added. > # When there is only one element in the *{{Collection}}*, the physical plan > will be optimized to *{{EqualTo}}*, so predicate pushdown can be used. > {code} > profileDF.filter( $"profileID".isInCollection(Set(6))).explain(true) > """ > |== Physical Plan ==| > |*(1) Project [profileID#0|#0]| > |+- *(1) Filter (isnotnull(profileID#0) && (profileID#0 = 6))| > |+- *(1) FileScan parquet [profileID#0|#0] Batched: true, Format: Parquet,| > |PartitionFilters: [],| > |PushedFilters: [IsNotNull(profileID), EqualTo(profileID,6)],| > |ReadSchema: struct > """.stripMargin > {code} > # When the *{{Set}}* is empty, and the input is nullable, the logical > plan will be simplified to > {code} > profileDF.filter( $"profileID".isInCollection(Set())).explain(true) > """ > |== Optimized Logical Plan ==| > |Filter if (isnull(profileID#0)) null else false| > |+- Relation[profileID#0|#0] parquet > """.stripMargin > {code} > TODO: > # For multiple conditions with numbers less than certain thresholds, > we should still allow predicate pushdown. > # Optimize the `In` using tableswitch or lookupswitch when the > numbers of the categories are low, and they are `Int`, `Long`. > # The default immutable hash trees set is slow for query, and we > should do benchmark for using different set implementation for faster > query. > # `filter(if (condition) null else false)` can be optimized to false. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24801) Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can waste a lot of memory
[ https://issues.apache.org/jira/browse/SPARK-24801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546916#comment-16546916 ] Imran Rashid commented on SPARK-24801: -- I don't think the messages themselves are actually empty, its just that the bytebuffer is empty until the {{transferTo}} call, as you mentioned earlier. Netty can buffer these messages up after spark calls {{ChannelHandlerContext.write(msg)}}, but I can't believe there are so many of them. I dunno if you can see anything unusual about the {{buf}} or {{region}} field of those encrypted messages and if they are all empty. But in any case, I guess what you're proposing is a simple fix in the meantime before we figure that out. Would you like to submit a pr? > Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can > waste a lot of memory > --- > > Key: SPARK-24801 > URL: https://issues.apache.org/jira/browse/SPARK-24801 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.3.0 >Reporter: Misha Dmitriev >Priority: Major > > I recently analyzed another Yarn NM heap dump with jxray > ([www.jxray.com),|http://www.jxray.com),/] and found that 81% of memory is > wasted by empty (all zeroes) byte[] arrays. Most of these arrays are > referenced by > {{org.apache.spark.network.util.ByteArrayWritableChannel.data}}, and these in > turn come from > {{spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel}}. Here is > the full reference chain that leads to the problematic arrays: > {code:java} > 2,597,946K (64.1%): byte[]: 40583 / 100% of empty 2,597,946K (64.1%) > ↖org.apache.spark.network.util.ByteArrayWritableChannel.data > ↖org.apache.spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel > ↖io.netty.channel.ChannelOutboundBuffer$Entry.msg > ↖io.netty.channel.ChannelOutboundBuffer$Entry.{next} > ↖io.netty.channel.ChannelOutboundBuffer.flushedEntry > ↖io.netty.channel.socket.nio.NioSocketChannel$NioSocketChannelUnsafe.outboundBuffer > ↖io.netty.channel.socket.nio.NioSocketChannel.unsafe > ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.associatedChannel > ↖{java.util.concurrent.ConcurrentHashMap}.values > ↖org.apache.spark.network.server.OneForOneStreamManager.streams > ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager > ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler > ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance{code} > > Checking the code of {{SaslEncryption$EncryptedMessage}}, I see that > byteChannel is always initialized eagerly in the constructor: > {code:java} > this.byteChannel = new ByteArrayWritableChannel(maxOutboundBlockSize);{code} > So I think to address the problem of empty byte[] arrays flooding the memory, > we should initialize {{byteChannel}} lazily, upon the first use. As far as I > can see, it's used only in one method, {{private void nextChunk()}}. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19680) Offsets out of range with no configured reset policy for partitions
[ https://issues.apache.org/jira/browse/SPARK-19680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546891#comment-16546891 ] Gayathiri Duraikannu commented on SPARK-19680: -- Ours is a framework and multiple consumers use this to stream the data. We have thousands of topics to consume the data from. Adding the start offset explicitly wouldn't work for all of our use cases. Is there any other alternate approach. > Offsets out of range with no configured reset policy for partitions > --- > > Key: SPARK-19680 > URL: https://issues.apache.org/jira/browse/SPARK-19680 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.0 >Reporter: Schakmann Rene >Priority: Major > > I'm using spark streaming with kafka to acutally create a toplist. I want to > read all the messages in kafka. So I set >"auto.offset.reset" -> "earliest" > Nevertheless when I start the job on our spark cluster it is not working I > get: > Error: > {code:title=error.log|borderStyle=solid} > Job aborted due to stage failure: Task 2 in stage 111.0 failed 4 times, > most recent failure: Lost task 2.3 in stage 111.0 (TID 1270, 194.232.55.23, > executor 2): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: > Offsets out of range with no configured reset policy for partitions: > {SearchEvents-2=161803385} > {code} > This is somehow wrong because I did set the auto.offset.reset property > Setup: > Kafka Parameter: > {code:title=Config.Scala|borderStyle=solid} > def getDefaultKafkaReceiverParameter(properties: Properties):Map[String, > Object] = { > Map( > "bootstrap.servers" -> > properties.getProperty("kafka.bootstrap.servers"), > "group.id" -> properties.getProperty("kafka.consumer.group"), > "auto.offset.reset" -> "earliest", > "spark.streaming.kafka.consumer.cache.enabled" -> "false", > "enable.auto.commit" -> "false", > "key.deserializer" -> classOf[StringDeserializer], > "value.deserializer" -> "at.willhaben.sid.DTOByteDeserializer") > } > {code} > Job: > {code:title=Job.Scala|borderStyle=solid} > def processSearchKeyWords(stream: InputDStream[ConsumerRecord[String, > Array[Byte]]], windowDuration: Int, slideDuration: Int, kafkaSink: > Broadcast[KafkaSink[TopList]]): Unit = { > getFilteredStream(stream.map(_.value()), windowDuration, > slideDuration).foreachRDD(rdd => { > val topList = new TopList > topList.setCreated(new Date()) > topList.setTopListEntryList(rdd.take(TopListLength).toList) > CurrentLogger.info("TopList length: " + > topList.getTopListEntryList.size().toString) > kafkaSink.value.send(SendToTopicName, topList) > CurrentLogger.info("Last Run: " + System.currentTimeMillis()) > }) > } > def getFilteredStream(result: DStream[Array[Byte]], windowDuration: Int, > slideDuration: Int): DStream[TopListEntry] = { > val Mapper = MapperObject.readerFor[SearchEventDTO] > result.repartition(100).map(s => Mapper.readValue[SearchEventDTO](s)) > .filter(s => s != null && s.getSearchRequest != null && > s.getSearchRequest.getSearchParameters != null && s.getVertical == > Vertical.BAP && > s.getSearchRequest.getSearchParameters.containsKey(EspParameterEnum.KEYWORD.getName)) > .map(row => { > val name = > row.getSearchRequest.getSearchParameters.get(EspParameterEnum.KEYWORD.getName).getEspSearchParameterDTO.getValue.toLowerCase() > (name, new TopListEntry(name, 1, row.getResultCount)) > }) > .reduceByKeyAndWindow( > (a: TopListEntry, b: TopListEntry) => new TopListEntry(a.getKeyword, > a.getSearchCount + b.getSearchCount, a.getMeanSearchHits + > b.getMeanSearchHits), > (a: 
TopListEntry, b: TopListEntry) => new TopListEntry(a.getKeyword, > a.getSearchCount - b.getSearchCount, a.getMeanSearchHits - > b.getMeanSearchHits), > Minutes(windowDuration), > Seconds(slideDuration)) > .filter((x: (String, TopListEntry)) => x._2.getSearchCount > 200L) > .map(row => (row._2.getSearchCount, row._2)) > .transform(rdd => rdd.sortByKey(ascending = false)) > .map(row => new TopListEntry(row._2.getKeyword, row._2.getSearchCount, > row._2.getMeanSearchHits / row._2.getSearchCount)) > } > def main(properties: Properties): Unit = { > val sparkSession = SparkUtil.getDefaultSparkSession(properties, TaskName) > val kafkaSink = > sparkSession.sparkContext.broadcast(KafkaSinkUtil.apply[TopList](SparkUtil.getDefaultSparkProperties(properties))) > val kafkaParams: Map[String, Object] = > SparkUtil.getDefaultKafkaReceiverParameter(properties) > val ssc = new StreamingContext(sparkSession.sparkContext, Seconds(30)) >
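For reference, "adding the start offset explicitly" refers to the {{ConsumerStrategies}} overload that accepts per-partition offsets. The sketch below reuses the {{ssc}} and {{kafkaParams}} values from the snippet above and assumes the topic/partition layout is known up front, which is exactly what does not scale to thousands of topics:

{code:scala}
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

// Explicit starting offsets for the partitions we already know about.
val fromOffsets: Map[TopicPartition, Long] = Map(
  new TopicPartition("SearchEvents", 2) -> 161803385L)

val stream = KafkaUtils.createDirectStream[String, Array[Byte]](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, Array[Byte]](
    Seq("SearchEvents"), kafkaParams, fromOffsets))
{code}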
[jira] [Resolved] (SPARK-21590) Structured Streaming window start time should support negative values to adjust time zone
[ https://issues.apache.org/jira/browse/SPARK-21590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21590. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 18903 [https://github.com/apache/spark/pull/18903] > Structured Streaming window start time should support negative values to > adjust time zone > - > > Key: SPARK-21590 > URL: https://issues.apache.org/jira/browse/SPARK-21590 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.2.0 > Environment: spark 2.2.0 >Reporter: Kevin Zhang >Assignee: Kevin Zhang >Priority: Major > Labels: spark-sql, spark2.2, streaming, structured, timezone, > window > Fix For: 2.4.0 > > > I want to calculate (unique) daily access count using structured streaming > (2.2.0). > Now strut streaming' s window with 1 day duration starts at > 00:00:00 UTC and ends at 23:59:59 UTC each day, but my local timezone is CST > (UTC + 8 hours) and I > want date boundaries to be 00:00:00 CST (that is 00:00:00 UTC - 8). > In Flink I can set the window offset to -8 hours to make it, but here in > struct streaming if I set the start time (same as the offset in Flink) to -8 > or any other negative values, I will get the following error: > {code:java} > Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot > resolve 'timewindow(timestamp, 864, 864, -288)' due > to data type mismatch: The start time (-288) must be greater than or > equal to 0.;; > {code} > because the time window checks the input parameters to guarantee each value > is greater than or equal to 0. > So I'm thinking about whether we can remove the limit that the start time > cannot be negative? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
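As a small usage illustration (assuming an {{events}} DataFrame with a {{timestamp}} column): with this change a day-long window can be anchored to local CST midnight by passing a negative {{startTime}}, where previously the equivalent non-negative offset "16 hours" had to be used instead.

{code:scala}
import org.apache.spark.sql.functions.{col, window}

// Daily windows aligned to 00:00:00 CST (UTC+8) rather than 00:00:00 UTC.
val dailyCounts = events
  .groupBy(window(col("timestamp"), "1 day", "1 day", "-8 hours"))
  .count()
{code}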
[jira] [Assigned] (SPARK-21590) Structured Streaming window start time should support negative values to adjust time zone
[ https://issues.apache.org/jira/browse/SPARK-21590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-21590: - Assignee: Kevin Zhang > Structured Streaming window start time should support negative values to > adjust time zone > - > > Key: SPARK-21590 > URL: https://issues.apache.org/jira/browse/SPARK-21590 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.2.0 > Environment: spark 2.2.0 >Reporter: Kevin Zhang >Assignee: Kevin Zhang >Priority: Major > Labels: spark-sql, spark2.2, streaming, structured, timezone, > window > Fix For: 2.4.0 > > > I want to calculate (unique) daily access count using structured streaming > (2.2.0). > Now strut streaming' s window with 1 day duration starts at > 00:00:00 UTC and ends at 23:59:59 UTC each day, but my local timezone is CST > (UTC + 8 hours) and I > want date boundaries to be 00:00:00 CST (that is 00:00:00 UTC - 8). > In Flink I can set the window offset to -8 hours to make it, but here in > struct streaming if I set the start time (same as the offset in Flink) to -8 > or any other negative values, I will get the following error: > {code:java} > Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot > resolve 'timewindow(timestamp, 864, 864, -288)' due > to data type mismatch: The start time (-288) must be greater than or > equal to 0.;; > {code} > because the time window checks the input parameters to guarantee each value > is greater than or equal to 0. > So I'm thinking about whether we can remove the limit that the start time > cannot be negative? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9850) Adaptive execution in Spark
[ https://issues.apache.org/jira/browse/SPARK-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546840#comment-16546840 ] Michail Giannakopoulos commented on SPARK-9850: --- Hello [~yhuai]! Are people currently working on this Epic? In other words, is this work in progress, or have you determined that it should be stalled? I am asking because I recently logged an issue related to adaptive execution (SPARK-24826). It would be nice to know whether this is being actively worked on, since it greatly reduces the number of shuffle partitions when executing SQL queries (shuffles being one of the main bottlenecks for Spark). Thanks a lot! > Adaptive execution in Spark > --- > > Key: SPARK-9850 > URL: https://issues.apache.org/jira/browse/SPARK-9850 > Project: Spark > Issue Type: Epic > Components: Spark Core, SQL >Reporter: Matei Zaharia >Assignee: Yin Huai >Priority: Major > Attachments: AdaptiveExecutionInSpark.pdf > > > Query planning is one of the main factors in high performance, but the > current Spark engine requires the execution DAG for a job to be set in > advance. Even with cost-based optimization, it is hard to know the behavior > of data and user-defined functions well enough to always get great execution > plans. This JIRA proposes to add adaptive query execution, so that the engine > can change the plan for each query as it sees what data earlier stages > produced. > We propose adding this to Spark SQL / DataFrames first, using a new API in > the Spark engine that lets libraries run DAGs adaptively. In future JIRAs, > the functionality could be extended to other libraries or the RDD API, but > that is more difficult than adding it in SQL. > I've attached a design doc by Yin Huai and myself explaining how it would > work in more detail. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
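For anyone following the epic, the piece that already exists in 2.3.x is the experimental runtime coalescing of shuffle partitions, gated behind a SQL configuration. This is only the existing switch, not the broader adaptive planning described in the attached design doc:

{code:scala}
// Experimental in 2.3.x: merge small post-shuffle partitions at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")
// Target size in bytes per post-shuffle partition used when coalescing.
spark.conf.set("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", (64 * 1024 * 1024).toString)
{code}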
[jira] [Assigned] (SPARK-24833) Allow specifying Kubernetes host name aliases in the pod specs
[ https://issues.apache.org/jira/browse/SPARK-24833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24833: Assignee: (was: Apache Spark) > Allow specifying Kubernetes host name aliases in the pod specs > -- > > Key: SPARK-24833 > URL: https://issues.apache.org/jira/browse/SPARK-24833 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.1 >Reporter: Rob Vesse >Priority: Major > > For some workloads you would like to allow Spark executors to access external > services using host name aliases. Currently there is no way to specify Host > name aliases > (https://kubernetes.io/docs/concepts/services-networking/add-entries-to-pod-etc-hosts-with-host-aliases/) > to the pods that Spark generates and pod presets cannot be used to add these > at admission time currently (plus the fact that pod presets are still an > Alpha feature so not guaranteed to be usable on any given cluster). > Since Spark on K8S already allows adding secrets and volumes to mount via > Spark configuration it should be fairly easy to use the same approach to > include host name aliases. > I will look at opening a PR for this in the next couple of days. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24833) Allow specifying Kubernetes host name aliases in the pod specs
[ https://issues.apache.org/jira/browse/SPARK-24833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24833: Assignee: Apache Spark > Allow specifying Kubernetes host name aliases in the pod specs > -- > > Key: SPARK-24833 > URL: https://issues.apache.org/jira/browse/SPARK-24833 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.1 >Reporter: Rob Vesse >Assignee: Apache Spark >Priority: Major > > For some workloads you would like to allow Spark executors to access external > services using host name aliases. Currently there is no way to specify Host > name aliases > (https://kubernetes.io/docs/concepts/services-networking/add-entries-to-pod-etc-hosts-with-host-aliases/) > to the pods that Spark generates and pod presets cannot be used to add these > at admission time currently (plus the fact that pod presets are still an > Alpha feature so not guaranteed to be usable on any given cluster). > Since Spark on K8S already allows adding secrets and volumes to mount via > Spark configuration it should be fairly easy to use the same approach to > include host name aliases. > I will look at opening a PR for this in the next couple of days. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24833) Allow specifying Kubernetes host name aliases in the pod specs
[ https://issues.apache.org/jira/browse/SPARK-24833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546798#comment-16546798 ] Apache Spark commented on SPARK-24833:
--
User 'rvesse' has created a pull request for this issue: https://github.com/apache/spark/pull/21796
> Allow specifying Kubernetes host name aliases in the pod specs
> --
>
> Key: SPARK-24833
> URL: https://issues.apache.org/jira/browse/SPARK-24833
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 2.3.1
> Reporter: Rob Vesse
> Priority: Major
>
> For some workloads you would like to allow Spark executors to access external
> services using host name aliases. Currently there is no way to specify host
> name aliases
> (https://kubernetes.io/docs/concepts/services-networking/add-entries-to-pod-etc-hosts-with-host-aliases/)
> for the pods that Spark generates, and pod presets currently cannot be used to
> add them at admission time (in addition, pod presets are still an alpha
> feature, so they are not guaranteed to be available on any given cluster).
> Since Spark on K8S already allows secrets and volumes to be mounted via Spark
> configuration, it should be fairly easy to use the same approach to include
> host name aliases.
> I will look at opening a PR for this in the next couple of days.
--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
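For illustration, a hedged sketch of the configuration-driven approach the ticket proposes. The spark.kubernetes.*.hostAliases.* keys below are hypothetical names invented for this example and are not configuration defined by Spark or by the linked PR; the master URL, image name, IP, and host names are placeholders.

// Hypothetical sketch only. SPARK-24833 proposes exposing Kubernetes hostAliases
// through Spark configuration, by analogy with the existing secret/volume mount
// settings. The hostAliases.* keys below are invented for illustration.
//
// The intent is that each entry ends up in the generated pod spec as:
//   spec:
//     hostAliases:
//     - ip: "10.1.2.3"
//       hostnames: ["metrics.internal", "db.internal"]
import org.apache.spark.SparkConf

object HostAliasConfSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("host-alias-sketch")
      .setMaster("k8s://https://kubernetes.example.com:6443")           // placeholder master
      .set("spark.kubernetes.container.image", "example/spark:2.3.1")   // placeholder image
      // Hypothetical keys: map an IP to the host names that should be added to
      // /etc/hosts of the driver and executor pods.
      .set("spark.kubernetes.driver.hostAliases.10.1.2.3", "metrics.internal,db.internal")
      .set("spark.kubernetes.executor.hostAliases.10.1.2.3", "metrics.internal,db.internal")

    // Print the assembled configuration for inspection.
    conf.getAll.foreach { case (k, v) => println(s"$k=$v") }
  }
}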
[jira] [Resolved] (SPARK-24305) Avoid serialization of private fields in new collection expressions
[ https://issues.apache.org/jira/browse/SPARK-24305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-24305.
-
Resolution: Fixed
Fix Version/s: 2.4.0
Issue resolved by pull request 21352
[https://github.com/apache/spark/pull/21352]
> Avoid serialization of private fields in new collection expressions
> ---
>
> Key: SPARK-24305
> URL: https://issues.apache.org/jira/browse/SPARK-24305
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Marek Novotny
> Assignee: Marek Novotny
> Priority: Minor
> Fix For: 2.4.0
>
> Make sure that private fields of expression case classes in
> _collectionOperations.scala_ are not serialized.
--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24305) Avoid serialization of private fields in new collection expressions
[ https://issues.apache.org/jira/browse/SPARK-24305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-24305:
---
Assignee: Marek Novotny
> Avoid serialization of private fields in new collection expressions
> ---
>
> Key: SPARK-24305
> URL: https://issues.apache.org/jira/browse/SPARK-24305
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Marek Novotny
> Assignee: Marek Novotny
> Priority: Minor
>
> Make sure that private fields of expression case classes in
> _collectionOperations.scala_ are not serialized.
--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
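A minimal sketch of the general technique involved, assuming the usual Java-serialization path: a derived private field held as a plain val is shipped with the serialized instance, while a @transient lazy val is excluded and recomputed on first access after deserialization. This is not the actual Catalyst code; Concat, totalLength, and the demo are made-up names for illustration.

// Sketch of keeping a derived private field out of the serialized form.
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

case class Concat(children: Seq[String]) {
  // Not serialized: recomputed lazily on the receiving side when first accessed.
  @transient private lazy val totalLength: Int = children.map(_.length).sum

  def describe: String = s"${children.size} children, $totalLength chars"
}

object SerializationSizeDemo extends App {
  // Measure the Java-serialized size of an object.
  def serializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.size()
  }

  val expr = Concat(Seq("a" * 1000, "b" * 1000))
  println(expr.describe)                                   // forces the lazy val
  println(s"serialized size: ${serializedSize(expr)} bytes") // excludes totalLength
}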
[jira] [Commented] (SPARK-24165) UDF within when().otherwise() raises NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-24165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546750#comment-16546750 ] Apache Spark commented on SPARK-24165:
--
User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/21795
> UDF within when().otherwise() raises NullPointerException
> -
>
> Key: SPARK-24165
> URL: https://issues.apache.org/jira/browse/SPARK-24165
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Jingxuan Wang
> Assignee: Marek Novotny
> Priority: Major
> Fix For: 2.4.0
>
> I have a UDF which takes java.sql.Timestamp and String as input column types
> and returns an Array of (Seq[case class], Double) as output. Since some of the
> values in the input columns can be null, I put the UDF inside a
> when($input.isNull, null).otherwise(UDF) guard. The function works well when I
> test it in the spark shell. But when run as a Scala jar via spark-submit in
> YARN cluster mode, it raises a NullPointerException that points to the UDF. If
> I remove the when().otherwise() condition and instead put the null check inside
> the UDF, the function works without issue under spark-submit.
--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
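A hedged reconstruction of the two patterns described above. The schema, column names, and UDF bodies are invented for illustration and are not the reporter's code; the guarded version was reported to fail only under spark-submit in YARN cluster mode, so running this locally is expected to succeed for both variants.

// Sketch of the reported pattern (guard with when/otherwise) versus the
// reported workaround (null check inside the UDF). Names are made up.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, udf, when}

object WhenOtherwiseUdfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("when-otherwise-udf")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(Some("2018-07-17 07:40:38"), None).toDF("ts")

    // Pattern reported to fail in YARN cluster mode: the UDF assumes non-null
    // input and is guarded externally by when/otherwise.
    val parseLen = udf((s: String) => s.length)
    val guarded = df.withColumn("len",
      when(col("ts").isNull, lit(null)).otherwise(parseLen(col("ts"))))

    // Reported workaround: handle null inside the UDF itself.
    val parseLenSafe = udf((s: String) => Option(s).map(_.length))
    val safe = df.withColumn("len", parseLenSafe(col("ts")))

    guarded.show()
    safe.show()
    spark.stop()
  }
}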
[jira] [Assigned] (SPARK-24834) Utils#nanSafeCompare{Double,Float} functions do not differ from normal java double/float comparison
[ https://issues.apache.org/jira/browse/SPARK-24834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24834:
Assignee: Apache Spark
> Utils#nanSafeCompare{Double,Float} functions do not differ from normal java
> double/float comparison
> ---
>
> Key: SPARK-24834
> URL: https://issues.apache.org/jira/browse/SPARK-24834
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.3.2
> Reporter: Benjamin Duffield
> Assignee: Apache Spark
> Priority: Minor
>
> Utils.scala contains two functions `nanSafeCompareDoubles` and
> `nanSafeCompareFloats` which purport to apply special handling of NaN values
> in comparisons.
> The handling in these functions does not appear to differ from
> java.lang.Double.compare and java.lang.Float.compare - they seem to produce
> output identical to the built-in Java comparison functions.
> I think it is clearer not to have these special Utils functions, and instead
> just use the standard Java comparison functions.
--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
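A quick check of the built-in semantics the description refers to. This does not reproduce Spark's Utils.nanSafeCompare* implementations; it only demonstrates that java.lang.Double.compare and java.lang.Float.compare already give NaN a total ordering, with NaN equal to NaN and greater than every other value.

// Observed behavior of the standard Java comparison functions.
object NanCompareCheck extends App {
  println(java.lang.Double.compare(Double.NaN, Double.NaN))              // 0
  println(java.lang.Double.compare(Double.NaN, Double.PositiveInfinity)) // > 0 (NaN is greatest)
  println(java.lang.Double.compare(1.0, Double.NaN))                     // < 0
  println(java.lang.Float.compare(Float.NaN, Float.NaN))                 // 0

  // Note: unlike the primitive operators, compare also distinguishes +0.0 and -0.0.
  println(java.lang.Double.compare(0.0, -0.0))                           // > 0
  println(0.0 == -0.0)                                                   // true
}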