[jira] [Updated] (SPARK-24828) Incompatible parquet formats - java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
[ https://issues.apache.org/jira/browse/SPARK-24828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Romeo Kienzer updated SPARK-24828:
----------------------------------
Description:

As requested by [~hyukjin.kwon], here is a new issue; the related issue can be found here. Using the attached parquet file written by one Spark installation, reading it with an installation obtained directly from [http://spark.apache.org/downloads.html] yields the following exception:

{code}
18/07/17 07:40:38 ERROR Executor: Exception in task 3.0 in stage 1.0 (TID 4)
scala.MatchError: [1.0,null] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
    at org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
    at org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:193)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

18/07/17 07:40:38 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
    at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
    at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
{code}

The file is attached: [^a2_m2.parquet.zip]

The following code reproduces the error:

{code}
df = spark.read.parquet('a2_m2.parquet')

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

binEval = MulticlassClassificationEvaluator() \
    .setMetricName("accuracy") \
    .setPredictionCol("prediction") \
    .setLabelCol("label")

accuracy = binEval.evaluate(df)
{code}

was: (previous description, identical to the above apart from a stale "#3" link reference; omitted)
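The UnsupportedOperationException is raised when the vectorized Parquet reader calls decodeToInt on a dictionary page of a column that was actually written as long, i.e. the two installations disagree about the column's type. As a hedged workaround sketch (not a confirmed fix from this thread; column names follow the reproducer, and an active SparkSession named `spark` is assumed), one can disable the vectorized reader or align the column types before evaluating:

```python
# Workaround sketch, assuming an active SparkSession named `spark`
# (e.g. the pyspark shell). Disabling the vectorized reader avoids the
# Dictionary.decodeToInt path that raises UnsupportedOperationException.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

# Alternatively, cast the columns to the types the evaluator expects
# (double-typed `prediction` and `label`); the scala.MatchError on
# [1.0,null] suggests null labels may also need to be dropped first.
from pyspark.sql.functions import col

df = (spark.read.parquet("a2_m2.parquet")
      .withColumn("prediction", col("prediction").cast("double"))
      .withColumn("label", col("label").cast("double"))
      .dropna(subset=["prediction", "label"]))
```

This is a configuration-level fallback rather than a fix for the underlying format mismatch; rewriting the file with consistent column types is the cleaner option.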
[jira] [Commented] (SPARK-17557) SQL query on parquet table java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary
[ https://issues.apache.org/jira/browse/SPARK-17557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546059#comment-16546059 ] Romeo Kienzer commented on SPARK-17557: --- Dear [~hyukjin.kwon] - I've done so - new issue is SPARK-24828 > SQL query on parquet table java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary > - > > Key: SPARK-17557 > URL: https://issues.apache.org/jira/browse/SPARK-17557 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Egor Pahomov >Priority: Major > Attachments: a2_m2.parquet.zip > > > Working on 1.6.2, broken on 2.0 > {code} > select * from logs.a where year=2016 and month=9 and day=14 limit 100 > {code} > {code} > java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48) > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24828) Incompatible parquet formats - java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
Romeo Kienzer created SPARK-24828:
----------------------------------
Summary: Incompatible parquet formats - java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
Key: SPARK-24828
URL: https://issues.apache.org/jira/browse/SPARK-24828
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.3.0
Environment:
Environment for creating the parquet file: IBM Watson Studio Apache Spark Service, V2.1.2
Environment for reading the parquet file:
java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
MacOSX 10.13.3 (17D47)
Spark spark-2.1.2-bin-hadoop2.7 directly obtained from http://spark.apache.org/downloads.html
Reporter: Romeo Kienzer

As requested by [~hyukjin.kwon], here is a new issue; the related issue can be found here #3. Using the attached parquet file written by one Spark installation, reading it with an installation obtained directly from [http://spark.apache.org/downloads.html] yields the following exception:

{code}
18/07/17 07:40:38 ERROR Executor: Exception in task 3.0 in stage 1.0 (TID 4)
scala.MatchError: [1.0,null] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
    at org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
    at org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:193)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

18/07/17 07:40:38 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
    at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
    at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
{code}

The file is attached: [^a2_m2.parquet.zip]

The following code reproduces the error:

{code}
df = spark.read.parquet('a2_m2.parquet')

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

binEval = MulticlassClassificationEvaluator() \
    .setMetricName("accuracy") \
    .setPredictionCol("prediction") \
    .setLabelCol("label")

accuracy = binEval.evaluate(df)
{code}
[jira] [Commented] (SPARK-24568) Code refactoring for DataType equalsXXX methods
[ https://issues.apache.org/jira/browse/SPARK-24568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546009#comment-16546009 ] Apache Spark commented on SPARK-24568:
--------------------------------------
User 'swapnilushinde' has created a pull request for this issue: https://github.com/apache/spark/pull/21787

> Code refactoring for DataType equalsXXX methods
> -----------------------------------------------
>
> Key: SPARK-24568
> URL: https://issues.apache.org/jira/browse/SPARK-24568
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Maryann Xue
> Priority: Major
>
> Right now there is a lot of code duplication among the DataType equalsXXX
> methods: {{equalsIgnoreNullability}}, {{equalsIgnoreCompatibleNullability}},
> {{equalsIgnoreCaseAndNullability}}, {{equalsStructurally}}. We can replace
> the duplicated code with a helper function.
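The refactoring idea, several near-identical recursive comparisons collapsed into one helper parameterized by what it ignores, can be sketched outside Spark. This is an illustrative Python sketch, not Spark's actual Scala code; the Field/Struct type model here is invented:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical, minimal stand-ins for Spark's StructType/StructField tree.
@dataclass
class Field:
    name: str
    dtype: object
    nullable: bool

@dataclass
class Struct:
    fields: List[Field]

def _equals(left, right, *, ignore_nullability, ignore_case):
    """Single recursive helper; each public equalsXXX variant just
    chooses which aspects of the comparison to relax."""
    if isinstance(left, Struct) and isinstance(right, Struct):
        if len(left.fields) != len(right.fields):
            return False
        for lf, rf in zip(left.fields, right.fields):
            lname = lf.name.lower() if ignore_case else lf.name
            rname = rf.name.lower() if ignore_case else rf.name
            if lname != rname:
                return False
            if not ignore_nullability and lf.nullable != rf.nullable:
                return False
            if not _equals(lf.dtype, rf.dtype,
                           ignore_nullability=ignore_nullability,
                           ignore_case=ignore_case):
                return False
        return True
    return left == right

# The former duplicated methods become one-line wrappers.
def equals_ignore_nullability(l, r):
    return _equals(l, r, ignore_nullability=True, ignore_case=False)

def equals_ignore_case_and_nullability(l, r):
    return _equals(l, r, ignore_nullability=True, ignore_case=True)
```

Each variant differs only in its flag values, which is exactly the duplication the ticket proposes to factor out.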
[jira] [Assigned] (SPARK-24568) Code refactoring for DataType equalsXXX methods
[ https://issues.apache.org/jira/browse/SPARK-24568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24568:
------------------------------------
Assignee: (was: Apache Spark)

(quoted issue description identical to the comment notification above; omitted)
[jira] [Assigned] (SPARK-24568) Code refactoring for DataType equalsXXX methods
[ https://issues.apache.org/jira/browse/SPARK-24568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24568:
------------------------------------
Assignee: Apache Spark

(quoted issue description identical to the comment notification above, now with Assignee: Apache Spark; omitted)
[jira] [Updated] (SPARK-24402) Optimize `In` expression when only one element in the collection or collection is empty
[ https://issues.apache.org/jira/browse/SPARK-24402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24402:
---------------------------------
Fix Version/s: (was: 2.4.0)

> Optimize `In` expression when only one element in the collection or collection is empty
> ---------------------------------------------------------------------------------------
>
> Key: SPARK-24402
> URL: https://issues.apache.org/jira/browse/SPARK-24402
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: DB Tsai
> Assignee: DB Tsai
> Priority: Major
>
> Two new rules are added to the logical plan optimizer.
> # When there is only one element in the *{{Collection}}*, the physical plan
> will be optimized to *{{EqualTo}}*, so predicate pushdown can be used.
> {code}
> profileDF.filter($"profileID".isInCollection(Set(6))).explain(true)
>
> == Physical Plan ==
> *(1) Project [profileID#0]
> +- *(1) Filter (isnotnull(profileID#0) && (profileID#0 = 6))
>    +- *(1) FileScan parquet [profileID#0] Batched: true, Format: Parquet,
>       PartitionFilters: [],
>       PushedFilters: [IsNotNull(profileID), EqualTo(profileID,6)],
>       ReadSchema: struct
> {code}
> # When the *{{Set}}* is empty and the input is nullable, the logical
> plan will be simplified to
> {code}
> profileDF.filter($"profileID".isInCollection(Set())).explain(true)
>
> == Optimized Logical Plan ==
> Filter if (isnull(profileID#0)) null else false
> +- Relation[profileID#0] parquet
> {code}
> TODO:
> # For multiple conditions with numbers less than certain thresholds,
> we should still allow predicate pushdown.
> # Optimize the `In` using tableswitch or lookupswitch when the
> number of categories is low and they are `Int` or `Long`.
> # The default immutable hash trie set is slow for queries; we
> should benchmark different set implementations for faster queries.
> # `filter(if (condition) null else false)` can be optimized to false.
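The two optimizer rules read naturally as pattern rewrites over an expression tree. A toy sketch follows (hypothetical Python stand-ins, not Catalyst code; the class names are illustrative only):

```python
from dataclasses import dataclass
from typing import List

# Tiny stand-ins for Catalyst expression nodes.
@dataclass
class In:
    attr: str
    values: List[object]
    nullable: bool = True

@dataclass
class EqualTo:
    attr: str
    value: object

@dataclass
class IfNullElseFalse:
    # Models: if (isnull(attr)) null else false
    attr: str

def optimize_in(expr):
    """Rule 1: single-element In -> EqualTo (pushdown-friendly).
    Rule 2: empty In on a nullable input -> if(isnull) null else false."""
    if isinstance(expr, In):
        if len(expr.values) == 1:
            return EqualTo(expr.attr, expr.values[0])
        if not expr.values and expr.nullable:
            return IfNullElseFalse(expr.attr)
    return expr
```

The last TODO item above would be one more rewrite on `IfNullElseFalse` for non-nullable inputs, folding it to a constant false.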
[jira] [Reopened] (SPARK-24402) Optimize `In` expression when only one element in the collection or collection is empty
[ https://issues.apache.org/jira/browse/SPARK-24402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-24402:
----------------------------------
This was reverted.

(quoted issue description identical to the update notification above; omitted)
[jira] [Resolved] (SPARK-24798) sortWithinPartitions(xx) will failed in java.lang.NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-24798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shengyao piao resolved SPARK-24798.
-----------------------------------
Resolution: Not A Problem

> sortWithinPartitions(xx) will failed in java.lang.NullPointerException
> ----------------------------------------------------------------------
>
> Key: SPARK-24798
> URL: https://issues.apache.org/jira/browse/SPARK-24798
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.0
> Reporter: shengyao piao
> Priority: Minor
>
> I have an issue in Spark 2.3 when I run the code below in spark-shell or spark-submit.
> I already figured out that the cause of the error is that the name field contains Some(null),
> but I believe this code ran successfully in Spark 2.2.
> Is this expected behavior in Spark 2.3?
>
> ・Spark code
> {code}
> case class Hoge(id: Int, name: Option[String])
>
> val ds = spark.createDataFrame(Array((1, "John"), (2, null)))
>   .withColumnRenamed("_1", "id")
>   .withColumnRenamed("_2", "name")
>   .map(row => Hoge(row.getAs[Int]("id"), Some(row.getAs[String]("name"))))
>
> ds.sortWithinPartitions("id").foreachPartition(iter => println(iter.isEmpty))
> {code}
> ・Error
> {code}
> java.lang.NullPointerException
>     at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:194)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.serializefromobject_doConsume$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.mapelements_doConsume$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.deserializetoobject_doConsume$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.sort_addToSorter$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>     at scala.collection.Iterator$class.isEmpty(Iterator.scala:330)
>     at scala.collection.AbstractIterator.isEmpty(Iterator.scala:1336)
>     at $line37.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:26)
>     at $line37.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:26)
>     at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:929)
>     at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:929)
>     at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2067)
>     at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2067)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>     at org.apache.spark.scheduler.Task.run(Task.scala:109)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
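The root cause generalizes beyond Spark: `Some(null)` is a non-empty Option whose contents are null, which downstream null checks that only test emptiness do not expect, whereas `Option(null)` would yield `None`. A language-neutral sketch of the distinction (illustrative Python stand-ins, not Spark or Scala library code):

```python
class Option:
    """Minimal stand-in for Scala's Option."""
    def __init__(self, value):
        self.value = value

    @staticmethod
    def apply(value):
        # Scala's Option(x) factory: None for null input, Some(x) otherwise.
        return NONE if value is None else Some(value)

class Some(Option):
    def is_empty(self):
        return False

class _None(Option):
    def __init__(self):
        super().__init__(None)

    def is_empty(self):
        return True

NONE = _None()

# Some(null): "defined" (non-empty) yet holding null, so code that
# null-checks only via emptiness (as the row serializer effectively
# does for a defined field) dereferences null downstream.
broken = Some(None)

# Option(null) collapses to None, which the empty-check handles safely.
safe = Option.apply(None)
```

This mirrors the fix implied in the ticket: construct the field with `Option(row.getAs[String]("name"))` rather than `Some(...)`.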
[jira] [Commented] (SPARK-24798) sortWithinPartitions(xx) will failed in java.lang.NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-24798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545973#comment-16545973 ] shengyao piao commented on SPARK-24798:
---------------------------------------
Hi [~mahmoudmahdi24], [~dmateusp], thank you! That helped me.

(quoted issue description identical to the resolution notification above; omitted)
[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545962#comment-16545962 ] Saisai Shao commented on SPARK-24615:
-------------------------------------
Sorry [~tgraves] for the late response. Yes, when requesting executors, the user should know whether accelerators are required. If no suitable accelerators are available, the job will stay pending or will not be launched.

> Accelerator-aware task scheduling for Spark
> -------------------------------------------
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.4.0
> Reporter: Saisai Shao
> Assignee: Saisai Shao
> Priority: Major
> Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are predominant compared to CPUs. To make the current Spark architecture work with accelerator cards, Spark itself should understand the existence of accelerators and know how to schedule tasks onto the executors where accelerators are equipped.
> Spark's current scheduler schedules tasks based on the locality of the data plus the availability of CPUs. This introduces some problems when scheduling tasks that require accelerators:
> # CPU cores usually outnumber accelerators on one node, so using CPU cores to schedule accelerator-required tasks introduces a mismatch.
> # In one cluster, we can always assume that CPUs are equipped in each node, but this is not true of accelerator cards.
> # The existence of heterogeneous tasks (accelerator-required or not) requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous tasks (accelerator-required or not). This can be part of the work of Project Hydrogen.
> Details are in the attached Google doc. It doesn't cover all the implementation details; it just highlights the parts that should change.
>
> CC [~yanboliang] [~merlintang]
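The scheduling mismatch described in the issue can be made concrete with a toy model. This is a hypothetical Python sketch, not the proposed Spark implementation: a scheduler that matches tasks only on free CPU cores would happily place a GPU task on a GPU-less executor, while a resource-aware match must check every requested resource and leave the task pending otherwise.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Executor:
    host: str
    free: Dict[str, int]   # e.g. {"cpu": 8, "gpu": 2}

@dataclass
class Task:
    name: str
    needs: Dict[str, int]  # e.g. {"cpu": 1, "gpu": 1}

def pick_executor(task: Task, executors: List[Executor]) -> Optional[Executor]:
    """Resource-aware placement: every requested resource must fit,
    not just CPU cores."""
    for ex in executors:
        if all(ex.free.get(res, 0) >= amt for res, amt in task.needs.items()):
            # Reserve the resources on the chosen executor.
            for res, amt in task.needs.items():
                ex.free[res] -= amt
            return ex
    return None  # task stays pending until a suitable executor appears
```

A CPU-only matcher would be the same loop with the `all(...)` check reduced to `ex.free["cpu"] >= task.needs["cpu"]`, which is exactly problem 1 in the list above.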
[jira] [Commented] (SPARK-17557) SQL query on parquet table java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary
[ https://issues.apache.org/jira/browse/SPARK-17557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545961#comment-16545961 ] Hyukjin Kwon commented on SPARK-17557:
--------------------------------------
Please go ahead, but I would alternatively open a separate ticket and leave a "relates to" link to this JIRA, because the reproducer and the affected version are apparently different.

(quoted issue description identical to the earlier SPARK-17557 notification; omitted)
[jira] [Commented] (SPARK-23255) Add user guide and examples for DataFrame image reading functions
[ https://issues.apache.org/jira/browse/SPARK-23255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545960#comment-16545960 ] Hyukjin Kwon commented on SPARK-23255: -- Please go ahead. > Add user guide and examples for DataFrame image reading functions > - > > Key: SPARK-23255 > URL: https://issues.apache.org/jira/browse/SPARK-23255 > Project: Spark > Issue Type: Documentation > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Nick Pentreath >Priority: Minor > > SPARK-21866 added built-in support for reading image data into a DataFrame. > This new functionality should be documented in the user guide, with example > usage. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24644) Pyarrow exception while running pandas_udf on pyspark 2.3.1
[ https://issues.apache.org/jira/browse/SPARK-24644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24644. -- Resolution: Invalid Let me leave this resolved. Please reopen this if the same issue exists in higher version of Pandas. > Pyarrow exception while running pandas_udf on pyspark 2.3.1 > --- > > Key: SPARK-24644 > URL: https://issues.apache.org/jira/browse/SPARK-24644 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 > Environment: os: centos > pyspark 2.3.1 > spark 2.3.1 > pyarrow >= 0.8.0 >Reporter: Hichame El Khalfi >Priority: Major > > Hello, > When I try to run a `pandas_udf` on my spark dataframe, I get this error > > {code:java} > File > "/mnt/ephemeral3/yarn/nm/usercache/user/appcache/application_1524574803975_205774/container_e280_1524574803975_205774_01_44/pyspark.zip/pyspark/serializers.py", > lin > e 280, in load_stream > pdf = batch.to_pandas() > File "pyarrow/table.pxi", line 677, in pyarrow.lib.RecordBatch.to_pandas > (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:43226) > return Table.from_batches([self]).to_pandas(nthreads=nthreads) > File "pyarrow/table.pxi", line 1043, in pyarrow.lib.Table.to_pandas > (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:46331) > mgr = pdcompat.table_to_blockmanager(options, self, memory_pool, > File "/usr/lib64/python2.7/site-packages/pyarrow/pandas_compat.py", line > 528, in table_to_blockmanager > blocks = _table_to_blocks(options, block_table, nthreads, memory_pool) > File "/usr/lib64/python2.7/site-packages/pyarrow/pandas_compat.py", line > 622, in _table_to_blocks > return [_reconstruct_block(item) for item in result] > File "/usr/lib64/python2.7/site-packages/pyarrow/pandas_compat.py", line > 446, in _reconstruct_block > block = _int.make_block(block_arr, placement=placement) > TypeError: make_block() takes at least 3 arguments (2 given) > {code} > > More than happy to provide any additional information -- This message 
was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
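The `make_block() takes at least 3 arguments (2 given)` failure above is a pandas/pyarrow version mismatch rather than a Spark bug, which is why the issue was resolved Invalid. A guard like the following (the minimum versions are assumptions based on Spark 2.3's documented requirements, not taken from this thread) can fail fast before any `pandas_udf` runs:

```python
# Hedged sketch: compare installed pyarrow/pandas versions against assumed
# minimums before attempting Arrow-based serialization in a pandas_udf job.
def version_tuple(v):
    """Parse a version string like '0.8.0' into (0, 8, 0) for comparison."""
    return tuple(int(p) for p in v.split(".")[:3])

def arrow_compat_ok(pyarrow_version, pandas_version,
                    min_arrow=(0, 8, 0), min_pandas=(0, 19, 2)):
    """Return True when both libraries meet the assumed Spark 2.3 minimums."""
    return (version_tuple(pyarrow_version) >= min_arrow
            and version_tuple(pandas_version) >= min_pandas)
```

In practice one would feed this `pyarrow.__version__` and `pandas.__version__`; mismatched pairs (e.g. a pyarrow older than 0.8.0) are exactly the setups that surface this kind of `TypeError` deep inside `pandas_compat.py`.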
[jira] [Resolved] (SPARK-20220) Add thrift scheduling pool config in scheduling docs
[ https://issues.apache.org/jira/browse/SPARK-20220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-20220. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21778 [https://github.com/apache/spark/pull/21778] > Add thrift scheduling pool config in scheduling docs > > > Key: SPARK-20220 > URL: https://issues.apache.org/jira/browse/SPARK-20220 > Project: Spark > Issue Type: Task > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Miklos Christine >Assignee: Miklos Christine >Priority: Trivial > Fix For: 2.4.0 > > > Spark 1.2 docs document the thrift job scheduling pool. > https://github.com/apache/spark/blob/branch-1.2/docs/sql-programming-guide.md > This configuration is no longer documented in the 2.x documentation. > Adding this back to the job scheduling docs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20220) Add thrift scheduling pool config in scheduling docs
[ https://issues.apache.org/jira/browse/SPARK-20220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-20220: Assignee: Miklos Christine > Add thrift scheduling pool config in scheduling docs > > > Key: SPARK-20220 > URL: https://issues.apache.org/jira/browse/SPARK-20220 > Project: Spark > Issue Type: Task > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Miklos Christine >Assignee: Miklos Christine >Priority: Trivial > Fix For: 2.4.0 > > > Spark 1.2 docs document the thrift job scheduling pool. > https://github.com/apache/spark/blob/branch-1.2/docs/sql-programming-guide.md > This configuration is no longer documented in the 2.x documentation. > Adding this back to the job scheduling docs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23259) Clean up legacy code around hive external catalog
[ https://issues.apache.org/jira/browse/SPARK-23259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23259. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21780 [https://github.com/apache/spark/pull/21780] > Clean up legacy code around hive external catalog > - > > Key: SPARK-23259 > URL: https://issues.apache.org/jira/browse/SPARK-23259 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Feng Liu >Assignee: Feng Liu >Priority: Major > Fix For: 2.4.0 > > > Some legacy code around the hive metastore catalog need to be removed for > further code improvement: > # in HiveExternalCatalog: The `withClient` wrapper is not necessary for the > private method `getRawTable`. > # in HiveClientImpl: The statement `runSqlHive()` is not necessary for the > `addJar` method, after the jar being added to the single class loader. > # in HiveClientImpl: There are some redundant code in both the `tableExists` > and `getTableOption` method. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23259) Clean up legacy code around hive external catalog
[ https://issues.apache.org/jira/browse/SPARK-23259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-23259: Assignee: Feng Liu > Clean up legacy code around hive external catalog > - > > Key: SPARK-23259 > URL: https://issues.apache.org/jira/browse/SPARK-23259 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Feng Liu >Assignee: Feng Liu >Priority: Major > Fix For: 2.4.0 > > > Some legacy code around the hive metastore catalog need to be removed for > further code improvement: > # in HiveExternalCatalog: The `withClient` wrapper is not necessary for the > private method `getRawTable`. > # in HiveClientImpl: The statement `runSqlHive()` is not necessary for the > `addJar` method, after the jar being added to the single class loader. > # in HiveClientImpl: There are some redundant code in both the `tableExists` > and `getTableOption` method. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21481) Add indexOf method in ml.feature.HashingTF similar to mllib.feature.HashingTF
[ https://issues.apache.org/jira/browse/SPARK-21481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chenzhiming updated SPARK-21481: Attachment: idea64.exe > Add indexOf method in ml.feature.HashingTF similar to mllib.feature.HashingTF > - > > Key: SPARK-21481 > URL: https://issues.apache.org/jira/browse/SPARK-21481 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0, 2.2.0 >Reporter: Aseem Bansal >Priority: Major > > If we want to find the index of any input based on hashing trick then it is > possible in > https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.mllib.feature.HashingTF > but not in > https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.ml.feature.HashingTF. > Should allow that for feature parity -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21481) Add indexOf method in ml.feature.HashingTF similar to mllib.feature.HashingTF
[ https://issues.apache.org/jira/browse/SPARK-21481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chenzhiming updated SPARK-21481: Attachment: (was: idea64.exe) > Add indexOf method in ml.feature.HashingTF similar to mllib.feature.HashingTF > - > > Key: SPARK-21481 > URL: https://issues.apache.org/jira/browse/SPARK-21481 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0, 2.2.0 >Reporter: Aseem Bansal >Priority: Major > > If we want to find the index of any input based on hashing trick then it is > possible in > https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.mllib.feature.HashingTF > but not in > https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.ml.feature.HashingTF. > Should allow that for feature parity -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
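The `indexOf` method requested above is a thin wrapper over the hashing trick: hash the term, then map the hash into `[0, numFeatures)`. The sketch below illustrates that shape in Python — note that Python's `hash()` stands in for Scala's `term.##` here (an assumption for illustration), so the indices it produces will not match Spark's actual bucket assignments:

```python
# Illustrative sketch of the hashing trick behind mllib.feature.HashingTF.indexOf.
# Python's hash() replaces Scala's term.##, so only the structure carries over.
def non_negative_mod(x, mod):
    """Map an integer hash into [0, mod). Python's % is already non-negative
    for a positive modulus; the defensive form mirrors the JVM, where % can
    return negative values."""
    raw = x % mod
    return raw + mod if raw < 0 else raw

def index_of(term, num_features=1 << 18):
    """Bucket index a term would hash to under a 2^18-feature space
    (2^18 is mllib's default numFeatures)."""
    return non_negative_mod(hash(term), num_features)
```

Exposing this on `ml.feature.HashingTF` would, as the reporter notes, just be feature parity: the computation already exists on the mllib side.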
[jira] [Updated] (SPARK-24615) Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-24615: -- Summary: Accelerator-aware task scheduling for Spark (was: Accelerator aware task scheduling for Spark) > Accelerator-aware task scheduling for Spark > --- > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Major > Labels: Hydrogen, SPIP > > In the machine learning area, accelerator card (GPU, FPGA, TPU) is > predominant compared to CPUs. To make the current Spark architecture to work > with accelerator cards, Spark itself should understand the existence of > accelerators and know how to schedule task onto the executors where > accelerators are equipped. > Current Spark’s scheduler schedules tasks based on the locality of the data > plus the available of CPUs. This will introduce some problems when scheduling > tasks with accelerators required. > # CPU cores are usually more than accelerators on one node, using CPU cores > to schedule accelerator required tasks will introduce the mismatch. > # In one cluster, we always assume that CPU is equipped in each node, but > this is not true of accelerator cards. > # The existence of heterogeneous tasks (accelerator required or not) > requires scheduler to schedule tasks with a smart way. > So here propose to improve the current scheduler to support heterogeneous > tasks (accelerator requires or not). This can be part of the work of Project > hydrogen. > Details is attached in google doc. It doesn't cover all the implementation > details, just highlight the parts should be changed. > > CC [~yanboliang] [~merlintang] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
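The mismatch the proposal describes — scheduling by CPU slots while tasks actually demand accelerators — can be sketched as a toy placement loop. The resource names and dict shapes below are purely illustrative, not the SPIP's actual API:

```python
# Toy sketch of accelerator-aware placement: a task requesting a GPU may only
# land on an executor that still has one free, independent of spare CPU cores.
def place(tasks, executors):
    """Greedily assign tasks to executors, honoring cpu and gpu capacity.

    tasks: list of dicts like {"id": "t1", "cpus": 1, "gpus": 1}
    executors: dict of executor id -> {"cpus": n, "gpus": n} free capacity
    Returns {task id: executor id} for the tasks that could be placed;
    tasks whose demands fit no executor are simply left out.
    """
    assignment = {}
    for task in tasks:
        for ex_id, free in executors.items():
            if free["cpus"] >= task["cpus"] and free["gpus"] >= task["gpus"]:
                free["cpus"] -= task["cpus"]
                free["gpus"] -= task["gpus"]
                assignment[task["id"]] = ex_id
                break
    return assignment
```

Even this toy shows the SPIP's two pain points: a CPU-rich, GPU-poor node silently rejects accelerator tasks, and heterogeneous task mixes need the scheduler to track more than one capacity dimension.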
[jira] [Updated] (SPARK-24826) Self-Join not working in Apache Spark 2.2.2
[ https://issues.apache.org/jira/browse/SPARK-24826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Yannakopoulos updated SPARK-24826: -- Description: Running a self-join against a table derived from a parquet file with many columns fails during the planning phase with the following stack-trace: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: Exchange(coordinator id: 331918455) hashpartitioning(_row_id#0L, 2), coordinator[target post-shuffle partition size: 67108864] +- Project [_row_id#0L, id#1L, member_id#2L, loan_amnt#3L, funded_amnt#4L, funded_amnt_inv#5L, term#6, int_rate#7, installment#8, grade#9, sub_grade#10, emp_title#11, emp_length#12, home_ownership#13, annual_inc#14, verification_status#15, issue_d#16, loan_status#17, pymnt_plan#18, url#19, desc_#20, purpose#21, title#22, zip_code#23, ... 92 more fields|#0L, id#1L, member_id#2L, loan_amnt#3L, funded_amnt#4L, funded_amnt_inv#5L, term#6, int_rate#7, installment#8, grade#9, sub_grade#10, emp_title#11, emp_length#12, home_ownership#13, annual_inc#14, verification_status#15, issue_d#16, loan_status#17, pymnt_plan#18, url#19, desc_#20, purpose#21, title#22, zip_code#23, ... 92 more fields] +- Filter isnotnull(_row_id#0L) +- FileScan parquet [_row_id#0L,id#1L,member_id#2L,loan_amnt#3L,funded_amnt#4L,funded_amnt_inv#5L,term#6,int_rate#7,installment#8,grade#9,sub_grade#10,emp_title#11,emp_length#12,home_ownership#13,annual_inc#14,verification_status#15,issue_d#16,loan_status#17,pymnt_plan#18,url#19,desc_#20,purpose#21,title#22,zip_code#23,... 92 more fields|#0L,id#1L,member_id#2L,loan_amnt#3L,funded_amnt#4L,funded_amnt_inv#5L,term#6,int_rate#7,installment#8,grade#9,sub_grade#10,emp_title#11,emp_length#12,home_ownership#13,annual_inc#14,verification_status#15,issue_d#16,loan_status#17,pymnt_plan#18,url#19,desc_#20,purpose#21,title#22,zip_code#23,... 
92 more fields] Batched: false, Format: Parquet, Location: InMemoryFileIndex[file:/c:/Users/gianna/Desktop/alpha.parquet/part-0-48210471-3088-4cee-8670-..., PartitionFilters: [], PushedFilters: [IsNotNull(_row_id)], ReadSchema: struct<_row_id:bigint,id:bigint,member_id:bigint,loan_amnt:bigint,funded_amnt:bigint,funded_amnt_... at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:115) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) at org.apache.spark.sql.execution.SortExec.doExecute(SortExec.scala:101) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) at org.apache.spark.sql.execution.joins.SortMergeJoinExec.doExecute(SortMergeJoinExec.scala:141) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138) at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) at org.apache.spark.sql.execution.ProjectExec.doExecute(basicPhysicalOperators.scala:73) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) at org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:133) at
[jira] [Created] (SPARK-24827) Some memory waste in History Server by strings in AccumulableInfo objects
Misha Dmitriev created SPARK-24827: -- Summary: Some memory waste in History Server by strings in AccumulableInfo objects Key: SPARK-24827 URL: https://issues.apache.org/jira/browse/SPARK-24827 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.2.2 Reporter: Misha Dmitriev I've analyzed a heap dump of Spark History Server with jxray ([www.jxray.com)|http://www.jxray.com)/] and found that 42% of the heap is wasted due to duplicate strings. The biggest sources of such strings are the {{name}} and {{value}} data fields of {{AccumulableInfo}} objects: {code:java} 7. Duplicate Strings: overhead 42.1% Total strings Unique strings Duplicate values Overhead 13,732,278 729,234 354,032 867,177K (42.1%) Expensive data fields: 318,421K (15.4%), 3669685 / 100% dup strings (8 unique), 3669685 dup backing arrays: ↖org.apache.spark.scheduler.AccumulableInfo.name 178,994K (8.7%), 3674403 / 99% dup strings (35640 unique), 3674403 dup backing arrays: ↖scala.Some.x 168,601K (8.2%), 3401960 / 92% dup strings (175826 unique), 3401960 dup backing arrays: ↖org.apache.spark.scheduler.AccumulableInfo.value{code} That is, 15.4% of the heap is wasted by {{AccumulableInfo.name}} and 8.2% is wasted by {{AccumulableInfo.value}}. It turns out that the problem has been partially addressed in spark 2.3+, e.g. [https://github.com/apache/spark/blob/b045315e5d87b7ea3588436053aaa4d5a7bd103f/core/src/main/scala/org/apache/spark/status/LiveEntity.scala#L590] However, this code has two minor problems: # Strings for {{AccumulableInfo.value}} are not interned in the above code, only {{AccumulableInfo.name}}. # For interning, the code in {{weakIntern(String)}} method uses a Guava interner ({{stringInterner = Interners.newWeakInterner[String]()}}). This is an old-fashioned, less efficient way of interning strings. Since some 3-4 years old JDK7 version, the built-in JVM {{String.intern()}} method is much more efficient, both in terms of CPU and memory. 
It is therefore suggested to add interning for {{value}} and replace the Guava interner with {{String.intern()}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
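The deduplication being proposed can be demonstrated with Python's `sys.intern` as a stand-in for Java's `String.intern()` — the runtime mechanics differ, but the memory effect is the same idea: many equal strings collapse to one shared instance.

```python
import sys

# Two equal strings built independently are distinct objects with separate
# backing storage -- the situation jxray flags for AccumulableInfo.name/.value.
a = "".join(["accumulator", ".", "name"])
b = "".join(["accumulator", ".", "name"])

# Interning maps equal strings to a single canonical instance, so the
# duplicate backing arrays (42.1% of the heap in the report) disappear.
ia, ib = sys.intern(a), sys.intern(b)
```

On the JVM side, modern `String.intern()` uses a native hash table sized via `-XX:StringTableSize`, which is the CPU/memory advantage over a Guava weak interner that the report alludes to.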
[jira] [Updated] (SPARK-24826) Self-Join not working in Apache Spark 2.2.2
[ https://issues.apache.org/jira/browse/SPARK-24826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Yannakopoulos updated SPARK-24826: -- Attachment: part-0-48210471-3088-4cee-8670-a332444bae66-c000.gz.parquet > Self-Join not working in Apache Spark 2.2.2 > --- > > Key: SPARK-24826 > URL: https://issues.apache.org/jira/browse/SPARK-24826 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.2.2 >Reporter: Michael Yannakopoulos >Priority: Major > Attachments: > part-0-48210471-3088-4cee-8670-a332444bae66-c000.gz.parquet > > > Running a self-join against a table derived from a parquet file with many > columns fails during the planning phase with the following stack-trace: > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: > Exchange(coordinator id: 331918455) hashpartitioning(_row_id#0L, 2), > coordinator[target post-shuffle partition size: 67108864] > +- Project [_row_id#0L, id#1L, member_id#2L, loan_amnt#3L, funded_amnt#4L, > funded_amnt_inv#5L, term#6, int_rate#7, installment#8, grade#9, sub_grade#10, > emp_title#11, emp_length#12, home_ownership#13, annual_inc#14, > verification_status#15, issue_d#16, loan_status#17, pymnt_plan#18, url#19, > desc_#20, purpose#21, title#22, zip_code#23, ... 92 more fields] > +- Filter isnotnull(_row_id#0L) > +- FileScan parquet > [_row_id#0L,id#1L,member_id#2L,loan_amnt#3L,funded_amnt#4L,funded_amnt_inv#5L,term#6,int_rate#7,installment#8,grade#9,sub_grade#10,emp_title#11,emp_length#12,home_ownership#13,annual_inc#14,verification_status#15,issue_d#16,loan_status#17,pymnt_plan#18,url#19,desc_#20,purpose#21,title#22,zip_code#23,... 
> 92 more fields] Batched: false, Format: Parquet, Location: > InMemoryFileIndex[file:/c:/Users/gianna/Desktop/alpha.parquet/part-0-48210471-3088-4cee-8670-..., > PartitionFilters: [], PushedFilters: [IsNotNull(_row_id)], ReadSchema: > struct<_row_id:bigint,id:bigint,member_id:bigint,loan_amnt:bigint,funded_amnt:bigint,funded_amnt_... > at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) > at org.apache.spark.sql.execution.SortExec.doExecute(SortExec.scala:101) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) > at > org.apache.spark.sql.execution.joins.SortMergeJoinExec.doExecute(SortMergeJoinExec.scala:141) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) > at > org.apache.spark.sql.execution.ProjectExec.doExecute(basicPhysicalOperators.scala:73) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) > at > org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:133) > at >
[jira] [Created] (SPARK-24826) Self-Join not working in Apache Spark 2.2.2
Michael Yannakopoulos created SPARK-24826: - Summary: Self-Join not working in Apache Spark 2.2.2 Key: SPARK-24826 URL: https://issues.apache.org/jira/browse/SPARK-24826 Project: Spark Issue Type: Bug Components: Optimizer Affects Versions: 2.2.2 Reporter: Michael Yannakopoulos Running a self-join against a table derived from a parquet file with many columns fails during the planning phase with the following stack-trace: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: Exchange(coordinator id: 331918455) hashpartitioning(_row_id#0L, 2), coordinator[target post-shuffle partition size: 67108864] +- Project [_row_id#0L, id#1L, member_id#2L, loan_amnt#3L, funded_amnt#4L, funded_amnt_inv#5L, term#6, int_rate#7, installment#8, grade#9, sub_grade#10, emp_title#11, emp_length#12, home_ownership#13, annual_inc#14, verification_status#15, issue_d#16, loan_status#17, pymnt_plan#18, url#19, desc_#20, purpose#21, title#22, zip_code#23, ... 92 more fields] +- Filter isnotnull(_row_id#0L) +- FileScan parquet [_row_id#0L,id#1L,member_id#2L,loan_amnt#3L,funded_amnt#4L,funded_amnt_inv#5L,term#6,int_rate#7,installment#8,grade#9,sub_grade#10,emp_title#11,emp_length#12,home_ownership#13,annual_inc#14,verification_status#15,issue_d#16,loan_status#17,pymnt_plan#18,url#19,desc_#20,purpose#21,title#22,zip_code#23,... 92 more fields] Batched: false, Format: Parquet, Location: InMemoryFileIndex[file:/c:/Users/gianna/Desktop/alpha.parquet/part-0-48210471-3088-4cee-8670-..., PartitionFilters: [], PushedFilters: [IsNotNull(_row_id)], ReadSchema: struct<_row_id:bigint,id:bigint,member_id:bigint,loan_amnt:bigint,funded_amnt:bigint,funded_amnt_... 
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:115) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) at org.apache.spark.sql.execution.SortExec.doExecute(SortExec.scala:101) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) at org.apache.spark.sql.execution.joins.SortMergeJoinExec.doExecute(SortMergeJoinExec.scala:141) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) at org.apache.spark.sql.execution.ProjectExec.doExecute(basicPhysicalOperators.scala:73) at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) at org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:133) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2865) at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2154) at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2154) at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2846) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2845) at org.apache.spark.sql.Dataset.head(Dataset.scala:2154) at
[jira] [Commented] (SPARK-24801) Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can waste a lot of memory
[ https://issues.apache.org/jira/browse/SPARK-24801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545820#comment-16545820 ] Misha Dmitriev commented on SPARK-24801: Correct, there are indeed 40583 instances of {{EncryptedMessage}} in memory. From the other section of jxray report, which shows reference chains starting from GC roots, and shows the number of objects at each level, I see the following: {code:java} 2,929,966K (72.3%) Object tree for GC root(s) Java Static org.apache.spark.network.yarn.YarnShuffleService.instance org.apache.spark.network.yarn.YarnShuffleService.blockHandler ↘ 2,753,031K (67.9%), 1 reference(s) org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager ↘ 2,753,019K (67.9%), 1 reference(s) org.apache.spark.network.server.OneForOneStreamManager.streams ↘ 2,753,019K (67.9%), 1 reference(s) {java.util.concurrent.ConcurrentHashMap}.values ↘ 2,753,008K (67.9%), 169 reference(s) org.apache.spark.network.server.OneForOneStreamManager$StreamState.associatedChannel ↘ 2,640,203K (65.1%), 32 reference(s) io.netty.channel.socket.nio.NioSocketChannel.unsafe ↘ 2,640,039K (65.1%), 32 reference(s) io.netty.channel.socket.nio.NioSocketChannel$NioSocketChannelUnsafe.outboundBuffer ↘ 2,640,037K (65.1%), 30 reference(s) io.netty.channel.ChannelOutboundBuffer.flushedEntry ↘ 2,639,382K (65.1%), 15 reference(s) io.netty.channel.ChannelOutboundBuffer$Entry.{next} ↘ 2,637,973K (65.1%), 40,583 reference(s) io.netty.channel.ChannelOutboundBuffer$Entry.msg ↘ 2,622,966K (64.7%), 40,583 reference(s) org.apache.spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel ↘ 2,598,897K (64.1%), 40,583 reference(s) org.apache.spark.network.util.ByteArrayWritableChannel.data ↘ 2,597,946K (64.1%), 40,583 reference(s) org.apache.spark.network.util.ByteArrayWritableChannel self 951K (< 0.1%), 40,583 object(s){code} So basically we have 15 netty {{ChannelOutboundBuffer}} objects, and then collectively , via linked lists starting from 
their {{flushedEntry}} data fields, they end up referencing 40,583 {{ChannelOutboundBuffer$Entry}} objects, which ultimately reference all these {{EncryptedMessage}} objects. So it looks like netty has, for some reason, accumulated (not yet sent) a very large number of messages, and thus netty is likely the main culprit. But then I wonder why all these messages are empty. > Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can > waste a lot of memory > --- > > Key: SPARK-24801 > URL: https://issues.apache.org/jira/browse/SPARK-24801 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.3.0 >Reporter: Misha Dmitriev >Priority: Major > > I recently analyzed another Yarn NM heap dump with jxray > ([www.jxray.com|http://www.jxray.com]) and found that 81% of memory is > wasted by empty (all zeroes) byte[] arrays. Most of these arrays are > referenced by > {{org.apache.spark.network.util.ByteArrayWritableChannel.data}}, and these in > turn come from > {{spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel}}. 
Here is > the full reference chain that leads to the problematic arrays: > {code:java} > 2,597,946K (64.1%): byte[]: 40583 / 100% of empty 2,597,946K (64.1%) > ↖org.apache.spark.network.util.ByteArrayWritableChannel.data > ↖org.apache.spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel > ↖io.netty.channel.ChannelOutboundBuffer$Entry.msg > ↖io.netty.channel.ChannelOutboundBuffer$Entry.{next} > ↖io.netty.channel.ChannelOutboundBuffer.flushedEntry > ↖io.netty.channel.socket.nio.NioSocketChannel$NioSocketChannelUnsafe.outboundBuffer > ↖io.netty.channel.socket.nio.NioSocketChannel.unsafe > ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.associatedChannel > ↖{java.util.concurrent.ConcurrentHashMap}.values > ↖org.apache.spark.network.server.OneForOneStreamManager.streams > ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager > ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler > ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance{code} > > Checking the code of {{SaslEncryption$EncryptedMessage}}, I see that > byteChannel is always initialized eagerly in the constructor: > {code:java} > this.byteChannel = new ByteArrayWritableChannel(maxOutboundBlockSize);{code} > So I think to address the problem of empty byte[] arrays flooding the memory, > we should initialize {{byteChannel}} lazily, upon the first use. As far as I > can see, it's used only in one method, {{private void nextChunk()}}. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To
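The lazy-initialization idea proposed above can be sketched outside of Java. The following pure-Python toy is not Spark's actual code; the 64 KB block size is an assumption, inferred from the dump's roughly 2,597,946 K across 40,583 messages. It shows why eager allocation in the constructor pins memory for every queued-but-unsent message, while lazy allocation pins nothing until first use:

```python
# Hedged sketch (the real fix would be in the Java class
# SaslEncryption$EncryptedMessage): allocate the per-message buffer lazily,
# on first use, instead of eagerly in the constructor.

MAX_OUTBOUND_BLOCK_SIZE = 64 * 1024  # assumed, matches ~64 KB per message in the dump

class EagerMessage:
    def __init__(self):
        # Eager: every queued-but-unsent message carries a full empty buffer.
        self.byte_channel = bytearray(MAX_OUTBOUND_BLOCK_SIZE)

class LazyMessage:
    def __init__(self):
        self._byte_channel = None  # nothing allocated yet

    @property
    def byte_channel(self):
        # Lazy: the buffer appears only when nextChunk() would first use it.
        if self._byte_channel is None:
            self._byte_channel = bytearray(MAX_OUTBOUND_BLOCK_SIZE)
        return self._byte_channel

# With 40,583 messages queued (as in the heap dump), eager allocation pins
# roughly 40,583 * 64 KiB of zero-filled arrays; lazy allocation pins none.
queued = 40_583
print(f"eager waste: {queued * MAX_OUTBOUND_BLOCK_SIZE / 2**20:.0f} MiB")
```

The sketch only illustrates the allocation pattern; the actual patch would also need to keep {{nextChunk()}}'s behavior unchanged after the first access.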
[jira] [Resolved] (SPARK-24402) Optimize `In` expression when only one element in the collection or collection is empty
[ https://issues.apache.org/jira/browse/SPARK-24402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24402. - Resolution: Fixed Fix Version/s: 2.4.0 > Optimize `In` expression when only one element in the collection or > collection is empty > > > Key: SPARK-24402 > URL: https://issues.apache.org/jira/browse/SPARK-24402 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > Fix For: 2.4.0 > > > Two new rules in the logical plan optimizers are added. > # When there is only one element in the *{{Collection}}*, the physical plan > will be optimized to *{{EqualTo}}*, so predicate pushdown can be used. > {code} > profileDF.filter( $"profileID".isInCollection(Set(6))).explain(true) > """ > |== Physical Plan ==| > |*(1) Project [profileID#0|#0]| > |+- *(1) Filter (isnotnull(profileID#0) && (profileID#0 = 6))| > |+- *(1) FileScan parquet [profileID#0|#0] Batched: true, Format: Parquet,| > |PartitionFilters: [],| > |PushedFilters: [IsNotNull(profileID), EqualTo(profileID,6)],| > |ReadSchema: struct > """.stripMargin > {code} > # When the *{{Set}}* is empty, and the input is nullable, the logical > plan will be simplified to > {code} > profileDF.filter( $"profileID".isInCollection(Set())).explain(true) > """ > |== Optimized Logical Plan ==| > |Filter if (isnull(profileID#0)) null else false| > |+- Relation[profileID#0|#0] parquet > """.stripMargin > {code} > TODO: > # For multiple conditions with numbers less than certain thresholds, > we should still allow predicate pushdown. > # Optimize the `In` using tableswitch or lookupswitch when the > numbers of the categories are low, and they are `Int`, `Long`. > # The default immutable hash trees set is slow for query, and we > should do benchmark for using different set implementation for faster > query. > # `filter(if (condition) null else false)` can be optimized to false. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
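The two rewrites described in SPARK-24402 can be modeled as a tiny rule outside Catalyst. This is a hedged pure-Python sketch; the tuple-based predicate descriptions are invented here for illustration and are not Spark's classes:

```python
# Hedged model of the `In` optimization: a single-element set becomes an
# EqualTo (so predicate pushdown applies); an empty set over a nullable input
# becomes `if isnull(attr) then null else false` (SQL three-valued logic).

def optimize_in(attr, values, nullable):
    """Return a simplified predicate description for `attr IN values`."""
    if len(values) == 1:
        # Single element: an equality that filter pushdown understands.
        return ("EqualTo", attr, next(iter(values)))
    if len(values) == 0:
        if nullable:
            # Never true, but must stay null when the input is null.
            return ("If", ("IsNull", attr), None, False)
        return ("Literal", False)
    return ("In", attr, frozenset(values))

print(optimize_in("profileID", {6}, nullable=True))
```

This mirrors the plans quoted above: {{isInCollection(Set(6))}} shows up as {{EqualTo(profileID,6)}} in the pushed filters, and the empty set produces the {{if (isnull(profileID)) null else false}} filter.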
[jira] [Updated] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure
[ https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt Cheah updated SPARK-24825: --- Issue Type: Bug (was: Improvement) > [K8S][TEST] Kubernetes integration tests don't trace the maven project > dependency structure > --- > > Key: SPARK-24825 > URL: https://issues.apache.org/jira/browse/SPARK-24825 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Tests >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Priority: Major > > The Kubernetes integration tests will currently fail if maven installation is > not performed first, because the integration test build believes it should be > pulling the Spark parent artifact from maven central. However, this is > incorrect because the integration test should be building the Spark parent > pom as a dependency in the multi-module build, and the integration test > should just use the dynamically built artifact. Or to put it another way, the > integration test builds should never be pulling Spark dependencies from maven > central. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure
Matt Cheah created SPARK-24825: -- Summary: [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure Key: SPARK-24825 URL: https://issues.apache.org/jira/browse/SPARK-24825 Project: Spark Issue Type: Improvement Components: Kubernetes, Tests Affects Versions: 2.4.0 Reporter: Matt Cheah The Kubernetes integration tests will currently fail if maven installation is not performed first, because the integration test build believes it should be pulling the Spark parent artifact from maven central. However, this is incorrect because the integration test should be building the Spark parent pom as a dependency in the multi-module build, and the integration test should just use the dynamically built artifact. Or to put it another way, the integration test builds should never be pulling Spark dependencies from maven central. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure
[ https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt Cheah updated SPARK-24825: --- Priority: Critical (was: Major) > [K8S][TEST] Kubernetes integration tests don't trace the maven project > dependency structure > --- > > Key: SPARK-24825 > URL: https://issues.apache.org/jira/browse/SPARK-24825 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Tests >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Priority: Critical > > The Kubernetes integration tests will currently fail if maven installation is > not performed first, because the integration test build believes it should be > pulling the Spark parent artifact from maven central. However, this is > incorrect because the integration test should be building the Spark parent > pom as a dependency in the multi-module build, and the integration test > should just use the dynamically built artifact. Or to put it another way, the > integration test builds should never be pulling Spark dependencies from maven > central. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24805) Don't ignore files without .avro extension by default
[ https://issues.apache.org/jira/browse/SPARK-24805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24805. - Resolution: Fixed Assignee: Maxim Gekk Fix Version/s: 2.4.0 > Don't ignore files without .avro extension by default > - > > Key: SPARK-24805 > URL: https://issues.apache.org/jira/browse/SPARK-24805 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 2.4.0 > > > Currently, to read files without the .avro extension, users have to set the flag > *avro.mapred.ignore.inputs.without.extension* to *false* (by default it is > *true*). This ticket aims to change the default value to *false*. The reasons > for doing so are: > - Other systems can create avro files without extensions. When users try to > read such files, they silently get only partial results. This behaviour may > confuse users. > - The current behavior differs from that of the other supported datasources, > CSV and JSON. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23901) Data Masking Functions
[ https://issues.apache.org/jira/browse/SPARK-23901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23901: Fix Version/s: (was: 2.4.0) > Data Masking Functions > -- > > Key: SPARK-23901 > URL: https://issues.apache.org/jira/browse/SPARK-23901 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Marco Gaido >Priority: Major > > - mask() > - mask_first_n() > - mask_last_n() > - mask_hash() > - mask_show_first_n() > - mask_show_last_n() > Reference: > [1] > [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DataMaskingFunctions] > [2] https://issues.apache.org/jira/browse/HIVE-13568 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23901) Data Masking Functions
[ https://issues.apache.org/jira/browse/SPARK-23901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-23901. - Resolution: Won't Fix > Data Masking Functions > -- > > Key: SPARK-23901 > URL: https://issues.apache.org/jira/browse/SPARK-23901 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Marco Gaido >Priority: Major > Fix For: 2.4.0 > > > - mask() > - mask_first_n() > - mask_last_n() > - mask_hash() > - mask_show_first_n() > - mask_show_last_n() > Reference: > [1] > [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DataMaskingFunctions] > [2] https://issues.apache.org/jira/browse/HIVE-13568 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-23901) Data Masking Functions
[ https://issues.apache.org/jira/browse/SPARK-23901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reopened SPARK-23901: - > Data Masking Functions > -- > > Key: SPARK-23901 > URL: https://issues.apache.org/jira/browse/SPARK-23901 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Marco Gaido >Priority: Major > Fix For: 2.4.0 > > > - mask() > - mask_first_n() > - mask_last_n() > - mask_hash() > - mask_show_first_n() > - mask_show_last_n() > Reference: > [1] > [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DataMaskingFunctions] > [2] https://issues.apache.org/jira/browse/HIVE-13568 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24801) Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can waste a lot of memory
[ https://issues.apache.org/jira/browse/SPARK-24801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545676#comment-16545676 ] Imran Rashid commented on SPARK-24801: -- I'm surprised there are so many {{EncryptedMessage}} objects sitting around. Are there 40,583 of them? That sounds like an extremely overloaded shuffle service -- or a leak. Your proposal would probably help some in that case, but really there is probably something else we should be doing differently. > Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can > waste a lot of memory > --- > > Key: SPARK-24801 > URL: https://issues.apache.org/jira/browse/SPARK-24801 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.3.0 >Reporter: Misha Dmitriev >Priority: Major > > I recently analyzed another Yarn NM heap dump with jxray > ([www.jxray.com|http://www.jxray.com]) and found that 81% of memory is > wasted by empty (all zeroes) byte[] arrays. Most of these arrays are > referenced by > {{org.apache.spark.network.util.ByteArrayWritableChannel.data}}, and these in > turn come from > {{spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel}}. 
Here is > the full reference chain that leads to the problematic arrays: > {code:java} > 2,597,946K (64.1%): byte[]: 40583 / 100% of empty 2,597,946K (64.1%) > ↖org.apache.spark.network.util.ByteArrayWritableChannel.data > ↖org.apache.spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel > ↖io.netty.channel.ChannelOutboundBuffer$Entry.msg > ↖io.netty.channel.ChannelOutboundBuffer$Entry.{next} > ↖io.netty.channel.ChannelOutboundBuffer.flushedEntry > ↖io.netty.channel.socket.nio.NioSocketChannel$NioSocketChannelUnsafe.outboundBuffer > ↖io.netty.channel.socket.nio.NioSocketChannel.unsafe > ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.associatedChannel > ↖{java.util.concurrent.ConcurrentHashMap}.values > ↖org.apache.spark.network.server.OneForOneStreamManager.streams > ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager > ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler > ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance{code} > > Checking the code of {{SaslEncryption$EncryptedMessage}}, I see that > byteChannel is always initialized eagerly in the constructor: > {code:java} > this.byteChannel = new ByteArrayWritableChannel(maxOutboundBlockSize);{code} > So I think to address the problem of empty byte[] arrays flooding the memory, > we should initialize {{byteChannel}} lazily, upon the first use. As far as I > can see, it's used only in one method, {{private void nextChunk()}}. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16617) Upgrade to Avro 1.8.x
[ https://issues.apache.org/jira/browse/SPARK-16617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545625#comment-16545625 ] Thomas Omans commented on SPARK-16617: -- My code hits the Schema.getLogicalType bug on Spark 2.3.x because we use Parquet and Avro. I don't think this is an improvement/enhancement; it is a bugfix for anything that needs to use Parquet and Avro after the Parquet 1.8.2 upgrade. > Upgrade to Avro 1.8.x > - > > Key: SPARK-16617 > URL: https://issues.apache.org/jira/browse/SPARK-16617 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core >Affects Versions: 2.1.0 >Reporter: Ben McCann >Priority: Major > > Avro 1.8 makes Avro objects serializable so that you can easily have an RDD > containing Avro objects. > See https://issues.apache.org/jira/browse/AVRO-1502 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24644) Pyarrow exception while running pandas_udf on pyspark 2.3.1
[ https://issues.apache.org/jira/browse/SPARK-24644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545589#comment-16545589 ] Bryan Cutler commented on SPARK-24644: -- [~helkhalfi], the error in the stack trace is coming from pandas internals and it looks like you are using a pretty old version, so my guess is that you need to upgrade pandas to solve this. For Spark, we currently test pyarrow with pandas 0.19.2 and I would recommend at least that version or higher. > Pyarrow exception while running pandas_udf on pyspark 2.3.1 > --- > > Key: SPARK-24644 > URL: https://issues.apache.org/jira/browse/SPARK-24644 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 > Environment: os: centos > pyspark 2.3.1 > spark 2.3.1 > pyarrow >= 0.8.0 >Reporter: Hichame El Khalfi >Priority: Major > > Hello, > When I try to run a `pandas_udf` on my spark dataframe, I get this error > > {code:java} > File > "/mnt/ephemeral3/yarn/nm/usercache/user/appcache/application_1524574803975_205774/container_e280_1524574803975_205774_01_44/pyspark.zip/pyspark/serializers.py", > line 280, in load_stream > pdf = batch.to_pandas() > File "pyarrow/table.pxi", line 677, in pyarrow.lib.RecordBatch.to_pandas > (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:43226) > return Table.from_batches([self]).to_pandas(nthreads=nthreads) > File "pyarrow/table.pxi", line 1043, in pyarrow.lib.Table.to_pandas > (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:46331) > mgr = pdcompat.table_to_blockmanager(options, self, memory_pool, > File "/usr/lib64/python2.7/site-packages/pyarrow/pandas_compat.py", line > 528, in table_to_blockmanager > blocks = _table_to_blocks(options, block_table, nthreads, memory_pool) > File "/usr/lib64/python2.7/site-packages/pyarrow/pandas_compat.py", line > 622, in _table_to_blocks > return [_reconstruct_block(item) for item in result] > File "/usr/lib64/python2.7/site-packages/pyarrow/pandas_compat.py", line > 446, in 
_reconstruct_block > block = _int.make_block(block_arr, placement=placement) > TypeError: make_block() takes at least 3 arguments (2 given) > {code} > > More than happy to provide any additional information -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
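As a hedged illustration of the version guard implied by the advice above (pure Python; the 0.19.2 minimum comes from the comment, and the helper names are invented for this sketch, not part of PySpark's API):

```python
# Minimal sketch of a pandas minimum-version check of the kind that would
# surface the make_block() incompatibility above as a clear error instead of
# a TypeError deep inside pyarrow's pandas_compat.

def parse_version(v):
    """Turn '0.19.2' into a comparable tuple (0, 19, 2); purely illustrative."""
    return tuple(int(p) for p in v.split(".")[:3] if p.isdigit())

def check_minimum(installed, minimum="0.19.2"):
    """Raise ImportError if the installed pandas is older than the minimum."""
    if parse_version(installed) < parse_version(minimum):
        raise ImportError(f"pandas >= {minimum} required, found {installed}")

check_minimum("0.23.4")  # ok: newer than the tested minimum
```

In real code the installed version would come from `pandas.__version__`; it is hard-coded here so the sketch stays self-contained.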
[jira] [Commented] (SPARK-24295) Purge Structured streaming FileStreamSinkLog metadata compact file data.
[ https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545577#comment-16545577 ] Iqbal Singh commented on SPARK-24295: - Hey [~XuanYuan], we are processing 3000 files every 5 minutes, 24x7, using structured streaming; the file size is 120 MB on average. * Every structured streaming batch commit file is around 800 KB to 1000 KB in size, and the compact file keeps track of all the data from the start of the process. It grows to 8 GB after 45 days, and the structured streaming process takes more than 15 minutes to compact the file every 10th batch. * We are using dynamic partitions while dumping the data, which also increases the output file count; the ratio for each micro batch is 2:3 (2 input files give us 3 output files). * Spark forces jobs to read the data using _spark_metadata files if the job's input directory is a structured streaming output, which wastes another 10-15 minutes generating the list of files from the "_spark_metadata" commit compact file. * The compact file holds data in JSON format and grows in size very fast if we have too many files to process in each batch. *File:* org/apache/spark/sql/execution/streaming/FileStreamSinkLog.scala -- The delete action is defined in the class "FileStreamSinkLog" but is not implemented anywhere in the code. {code:java} object FileStreamSinkLog { val VERSION = 1 val DELETE_ACTION = "delete" val ADD_ACTION = "add" } {code} -- The code below never executes, where sink log entries with the "DELETE" action would be removed while compacting the files: {code:java} override def compactLogs(logs: Seq[SinkFileStatus]): Seq[SinkFileStatus] = { val deletedFiles = logs.filter(_.action == FileStreamSinkLog.DELETE_ACTION).map(_.path).toSet if (deletedFiles.isEmpty) { logs } else { logs.filter(f => !deletedFiles.contains(f.path)) } }{code} -- Since the compact file does not record the batch number as a metric, it is tough to keep a defined number of batches in the file. 
We do have the modification time, though, and could use it to mark the sink metadata log records as deleted, based on a time-based retention policy. We have developed a Spark job that reads the metadata and generates the list of files needed for exactly-once guarantees, passing the list of files for a particular batch to the Spark job; it takes 60 seconds to read the compact file using Spark. We are working on an explicit data purge job for the compact file to keep its size under control. Please let me know if more details are required, or if there is anything we are missing. Appreciate your help. Thanks, Iqbal Singh > Purge Structured streaming FileStreamSinkLog metadata compact file data. > > > Key: SPARK-24295 > URL: https://issues.apache.org/jira/browse/SPARK-24295 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Iqbal Singh >Priority: Major > > FileStreamSinkLog metadata logs are concatenated into a single compact file > after a defined compact interval. > For long-running jobs, the compact file can grow to tens of GBs, causing > slowness when reading the data from the FileStreamSinkLog dir, as Spark > defaults to the "_spark_metadata" dir for the read. > We need functionality to purge the compact file. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
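The retention-by-modification-time purge discussed above can be sketched as follows. This is a hypothetical pure-Python model, not Spark's implementation; the entry field names ({{path}}, {{modificationTime}} in milliseconds, {{action}}) loosely mirror {{SinkFileStatus}} and are assumptions:

```python
import time

def purge_compact_log(entries, retention_seconds, now=None):
    """Keep only sink-log entries whose modificationTime (ms) is inside the
    retention window; everything older is dropped from the compact file."""
    now = time.time() if now is None else now
    cutoff_ms = (now - retention_seconds) * 1000
    return [e for e in entries if e["modificationTime"] >= cutoff_ms]

# Example: one entry 60 days old, one fresh, with a 45-day retention window.
now = time.time()
logs = [
    {"path": "part-0", "modificationTime": (now - 60 * 86400) * 1000, "action": "add"},
    {"path": "part-1", "modificationTime": now * 1000, "action": "add"},
]
kept = purge_compact_log(logs, retention_seconds=45 * 86400, now=now)
print([e["path"] for e in kept])  # -> ['part-1']
```

A real purge would also have to preserve exactly-once semantics for any downstream reader still consuming batches inside the dropped window, which is the hard part this sketch ignores.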
[jira] [Commented] (SPARK-17557) SQL query on parquet table java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary
[ https://issues.apache.org/jira/browse/SPARK-17557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545557#comment-16545557 ] Romeo Kienzer commented on SPARK-17557: --- [~jayadevan.m] [~hyukjin.kwon] can you please re-open? You can easily reproduce the error with the following parquet file [^a2_m2.parquet.zip] and the following code in pyspark 2.1.2, 2.1.3, 2.3.0:
df = spark.read.parquet('a2_m2.parquet')
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
binEval = MulticlassClassificationEvaluator().setMetricName("accuracy").setPredictionCol("prediction").setLabelCol("label")
accuracy = binEval.evaluate(df)
> SQL query on parquet table java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary > - > > Key: SPARK-17557 > URL: https://issues.apache.org/jira/browse/SPARK-17557 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Egor Pahomov >Priority: Major > Attachments: a2_m2.parquet.zip > > > Working on 1.6.2, broken on 2.0 > {code} > select * from logs.a where year=2016 and month=9 and day=14 limit 100 > {code} > {code} > java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48) > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
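The failure mode in the stack trace above can be modeled in a few lines: the vectorized reader asks the column's dictionary to decode values as the type in its expected read schema, and the base class raises when the file's dictionary actually holds a different physical type. A hedged pure-Python sketch follows; the method names loosely mirror parquet's {{Dictionary}} class but this is not the Java implementation:

```python
# Illustrative model of parquet's Dictionary: every decode* method raises in
# the base class unless a subclass overrides it for its own physical type, so
# asking a binary dictionary for ints reproduces the UnsupportedOperation shape.

class Dictionary:
    def decode_to_int(self, dict_id):
        raise NotImplementedError(type(self).__name__)

    def decode_to_binary(self, dict_id):
        raise NotImplementedError(type(self).__name__)

class PlainBinaryDictionary(Dictionary):
    def __init__(self, values):
        self.values = values

    def decode_to_binary(self, dict_id):
        return self.values[dict_id]

d = PlainBinaryDictionary([b"a", b"b"])
# A reader whose expected schema says "int" for this column trips the base method:
try:
    d.decode_to_int(0)
except NotImplementedError as e:
    print("unsupported:", e)
```

This is why the error usually points at a schema mismatch between the writer and the reader rather than file corruption: the data is intact, but it is being decoded as the wrong type.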
[jira] [Updated] (SPARK-17557) SQL query on parquet table java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary
[ https://issues.apache.org/jira/browse/SPARK-17557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Romeo Kienzer updated SPARK-17557: -- Attachment: a2_m2.parquet.zip > SQL query on parquet table java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary > - > > Key: SPARK-17557 > URL: https://issues.apache.org/jira/browse/SPARK-17557 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Egor Pahomov >Priority: Major > Attachments: a2_m2.parquet.zip > > > Working on 1.6.2, broken on 2.0 > {code} > select * from logs.a where year=2016 and month=9 and day=14 limit 100 > {code} > {code} > java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48) > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at 
org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23874) Upgrade apache/arrow to 0.10.0
[ https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545519#comment-16545519 ] Bryan Cutler commented on SPARK-23874: -- [~smilegator], we are aiming to have the Arrow 0.10.0 release soon, and I will pick this up again when that happens. So I think it's possible we could have the upgrade done before the code freeze. > Upgrade apache/arrow to 0.10.0 > -- > > Key: SPARK-23874 > URL: https://issues.apache.org/jira/browse/SPARK-23874 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Bryan Cutler >Priority: Major > > Version 0.10.0 will allow for the following improvements and bug fixes: > * Allow for adding BinaryType support > * Bug fix related to array serialization ARROW-1973 > * Python2 str will be made into an Arrow string instead of bytes ARROW-2101 > * Python bytearrays are supported as input to pyarrow ARROW-2141 > * Java has a common interface for reset to clean up complex vectors in Spark > ArrowWriter ARROW-1962 > * Cleanup pyarrow type equality checks ARROW-2423 > * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, > ARROW-2645 > * Improved low level handling of messages for RecordBatch ARROW-2704 > >
[jira] [Commented] (SPARK-23901) Data Masking Functions
[ https://issues.apache.org/jira/browse/SPARK-23901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545491#comment-16545491 ] Apache Spark commented on SPARK-23901: -- User 'mn-mikke' has created a pull request for this issue: https://github.com/apache/spark/pull/21786 > Data Masking Functions > -- > > Key: SPARK-23901 > URL: https://issues.apache.org/jira/browse/SPARK-23901 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Marco Gaido >Priority: Major > Fix For: 2.4.0 > > > - mask() > - mask_first_n() > - mask_last_n() > - mask_hash() > - mask_show_first_n() > - mask_show_last_n() > Reference: > [1] > [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DataMaskingFunctions] > [2] https://issues.apache.org/jira/browse/HIVE-13568 > >
[jira] [Comment Edited] (SPARK-23901) Data Masking Functions
[ https://issues.apache.org/jira/browse/SPARK-23901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545446#comment-16545446 ] Reynold Xin edited comment on SPARK-23901 at 7/16/18 4:31 PM: -- I actually feel pretty strongly we should remove them. Just so much code to maintain for something that doesn't have a clear use case and we only added them for some hypothetical Hive compatibility. was (Author: rxin): I actually feel pretty strongly we should remove them. > Data Masking Functions > -- > > Key: SPARK-23901 > URL: https://issues.apache.org/jira/browse/SPARK-23901 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Marco Gaido >Priority: Major > Fix For: 2.4.0 > > > - mask() > - mask_first_n() > - mask_last_n() > - mask_hash() > - mask_show_first_n() > - mask_show_last_n() > Reference: > [1] > [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DataMaskingFunctions] > [2] https://issues.apache.org/jira/browse/HIVE-13568 > >
[jira] [Commented] (SPARK-23901) Data Masking Functions
[ https://issues.apache.org/jira/browse/SPARK-23901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545446#comment-16545446 ] Reynold Xin commented on SPARK-23901: - I actually feel pretty strongly we should remove them. > Data Masking Functions > -- > > Key: SPARK-23901 > URL: https://issues.apache.org/jira/browse/SPARK-23901 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Marco Gaido >Priority: Major > Fix For: 2.4.0 > > > - mask() > - mask_first_n() > - mask_last_n() > - mask_hash() > - mask_show_first_n() > - mask_show_last_n() > Reference: > [1] > [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DataMaskingFunctions] > [2] https://issues.apache.org/jira/browse/HIVE-13568 > >
[jira] [Commented] (SPARK-23901) Data Masking Functions
[ https://issues.apache.org/jira/browse/SPARK-23901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545443#comment-16545443 ] Marek Novotny commented on SPARK-23901: --- Is there a consensus on getting the masking functions into version 2.4.0? If so, what about also extending the Python and R APIs to be consistent? (I volunteer for that.) > Data Masking Functions > -- > > Key: SPARK-23901 > URL: https://issues.apache.org/jira/browse/SPARK-23901 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Marco Gaido >Priority: Major > Fix For: 2.4.0 > > > - mask() > - mask_first_n() > - mask_last_n() > - mask_hash() > - mask_show_first_n() > - mask_show_last_n() > Reference: > [1] > [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DataMaskingFunctions] > [2] https://issues.apache.org/jira/browse/HIVE-13568 > >
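For readers unfamiliar with the Hive-style semantics under discussion, the masking family can be sketched in plain Python. This is a hypothetical emulation for illustration, not Spark's or Hive's implementation; it assumes the common defaults (upper-case letters masked as `X`, lower-case as `x`, digits as `n`):

```python
def _mask_char(c: str) -> str:
    # Assumed default replacements: upper -> X, lower -> x, digit -> n.
    if c.isupper():
        return "X"
    if c.islower():
        return "x"
    if c.isdigit():
        return "n"
    return c  # punctuation and spaces pass through unchanged


def mask(s: str) -> str:
    """Mask every character of s."""
    return "".join(_mask_char(c) for c in s)


def mask_first_n(s: str, n: int = 4) -> str:
    """Mask only the first n characters."""
    return mask(s[:n]) + s[n:]


def mask_show_first_n(s: str, n: int = 4) -> str:
    """Mask everything except the first n characters."""
    return s[:n] + mask(s[n:])
```

For example, `mask("Card-1234")` yields `"Xxxx-nnnn"`, while `mask_show_first_n("Card-1234", 4)` yields `"Card-nnnn"`.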
[jira] [Commented] (SPARK-24787) Events being dropped at an alarming rate due to hsync being slow for eventLogging
[ https://issues.apache.org/jira/browse/SPARK-24787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545436#comment-16545436 ] Sanket Reddy commented on SPARK-24787: -- [~vanzin] do you have any suggestions regarding this issue? [~olegd] I would rather make this configurable, or trigger it periodically? > Events being dropped at an alarming rate due to hsync being slow for > eventLogging > - > > Key: SPARK-24787 > URL: https://issues.apache.org/jira/browse/SPARK-24787 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.3.0, 2.3.1 >Reporter: Sanket Reddy >Priority: Minor > > [https://github.com/apache/spark/pull/16924/files] updates the length of the > in-progress files, allowing the history server to stay responsive. > However, we have a production job with 6 tasks per stage; because hsync is slow, > it starts dropping events, and the history server shows wrong stats as a result > of the dropped events. > A viable solution is to not sync so frequently, or to make the sync interval > configurable.
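The "sync less often" idea suggested above can be sketched as a simple rate limiter around hsync calls. This is illustrative only; names like `interval_s` are made up for the sketch and are not Spark configuration keys:

```python
def should_hsync(last_sync: float, now: float, interval_s: float) -> bool:
    # Only force data to disk if the configured interval has elapsed;
    # otherwise keep buffering events so the writer stays fast.
    return now - last_sync >= interval_s


def write_events(events, interval_s=2.0):
    # events: iterable of (timestamp, event) pairs.
    # Returns how many hsync calls the rate limiter would allow.
    last_sync, synced = 0.0, 0
    for t, _event in events:
        if should_hsync(last_sync, t, interval_s):
            last_sync, synced = t, synced + 1
    return synced
```

With a 2-second interval, five events arriving at t = 0, 1, 2.5, 3, and 5 seconds trigger only two syncs instead of five.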
[jira] [Assigned] (SPARK-24734) Fix containsNull of Concat for array type.
[ https://issues.apache.org/jira/browse/SPARK-24734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-24734: --- Assignee: Takuya Ueshin > Fix containsNull of Concat for array type. > -- > > Key: SPARK-24734 > URL: https://issues.apache.org/jira/browse/SPARK-24734 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 2.4.0 > > > Currently {{Concat}} for array type uses the data type of the first child as > its own data type, but the children might include an array containing nulls. > We should be aware of the nullabilities of all children.
[jira] [Resolved] (SPARK-24734) Fix containsNull of Concat for array type.
[ https://issues.apache.org/jira/browse/SPARK-24734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-24734. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21704 [https://github.com/apache/spark/pull/21704] > Fix containsNull of Concat for array type. > -- > > Key: SPARK-24734 > URL: https://issues.apache.org/jira/browse/SPARK-24734 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 2.4.0 > > > Currently {{Concat}} for array type uses the data type of the first child as > its own data type, but the children might include an array containing nulls. > We should be aware of the nullabilities of all children.
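The nullability fix described in the issue amounts to OR-ing `containsNull` across all children rather than copying only the first child's flag. A minimal sketch of that rule (hypothetical helper, not Spark's Scala code):

```python
def concat_array_contains_null(children):
    # children: list of array-type descriptors such as
    # {"elementType": "int", "containsNull": bool}.
    # Before the fix: the result copied children[0]["containsNull"].
    # After the fix: a nullable element anywhere makes the result nullable.
    return any(child["containsNull"] for child in children)
```

So concatenating a non-nullable array with a nullable one correctly yields a nullable result.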
[jira] [Commented] (SPARK-20174) Analyzer gives mysterious AnalysisException when posexplode used in withColumn
[ https://issues.apache.org/jira/browse/SPARK-20174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545314#comment-16545314 ] Valery Khamenya commented on SPARK-20174: - Ok, I found a combo-workaround that seems to work: {code:java} df selectExpr("*", "posexplode(s) as (p,c)") drop("s"){code} > Analyzer gives mysterious AnalysisException when posexplode used in withColumn > -- > > Key: SPARK-20174 > URL: https://issues.apache.org/jira/browse/SPARK-20174 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Minor > > Wish I knew how to even describe the issue. It appears that {{posexplode}} > cannot be used in {{withColumn}}, but the error message does not seem to say > it. > [The scaladoc of > posexplode|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$@posexplode(e:org.apache.spark.sql.Column):org.apache.spark.sql.Column] > is silent about this "limitation", too. 
> {code} > scala> codes.printSchema > root > |-- id: integer (nullable = false) > |-- rate_plan_code: array (nullable = true) > ||-- element: string (containsNull = true) > scala> codes.withColumn("code", posexplode($"rate_plan_code")).show > org.apache.spark.sql.AnalysisException: The number of aliases supplied in the > AS clause does not match the number of columns output by the UDTF expected 2 > aliases but got code ; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:90) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$.makeGeneratorOutput(Analyzer.scala:1744) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator$$anonfun$apply$20$$anonfun$56.apply(Analyzer.scala:1691) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator$$anonfun$apply$20$$anonfun$56.apply(Analyzer.scala:1679) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator$$anonfun$apply$20.applyOrElse(Analyzer.scala:1679) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator$$anonfun$apply$20.applyOrElse(Analyzer.scala:1664) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62) > at > 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator$.apply(Analyzer.scala:1664) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator$.apply(Analyzer.scala:1629) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74) > at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:70) > at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:68) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:51) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2832) > at org.apache.spark.sql.Dataset.select(Dataset.scala:1137) > at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:1882) >
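What the `selectExpr`/`drop` workaround produces can be emulated in plain Python, since posexplode is essentially `enumerate` over the array column. This is an illustrative model of the semantics, not Spark code; the row layout and the names `p`/`c` mirror the workaround's aliases:

```python
def posexplode_rows(rows, array_col, pos_name="p", col_name="c"):
    # One output row per array element, carrying the element's position
    # and value; the exploded column itself is dropped, mirroring
    # selectExpr("*", "posexplode(s) as (p,c)").drop("s").
    out = []
    for row in rows:
        for pos, elem in enumerate(row[array_col]):
            new_row = {k: v for k, v in row.items() if k != array_col}
            new_row[pos_name] = pos
            new_row[col_name] = elem
            out.append(new_row)
    return out
```

A row `{"id": 1, "s": ["a", "b"]}` expands into two rows, one per element, each tagged with its position.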
[jira] [Updated] (SPARK-24813) HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive
[ https://issues.apache.org/jira/browse/SPARK-24813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-24813: -- Affects Version/s: (was: 2.1.3) > HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive > - > > Key: SPARK-24813 > URL: https://issues.apache.org/jira/browse/SPARK-24813 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.2.2, 2.3.1 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Major > Fix For: 2.2.3, 2.3.2, 2.4.0 > > > HiveExternalCatalogVersionsSuite is still failing periodically with errors > from mirror sites. In fact, the test depends on the Spark versions it needs > being available on the mirrors, but older versions will eventually be removed. > The test should fall back to downloading from archive.apache.org if mirrors > don't have the Spark release, or aren't responding. > This has become urgent as I helpfully already purged many old Spark releases > from mirrors, as requested by the ASF, before realizing it would probably > make this test fail deterministically.
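The fallback described above can be sketched as building an ordered list of download candidates, with archive.apache.org last, since the archive retains every release even after mirrors purge old ones. The URL layout here is an assumption for illustration, not the exact path the test suite uses:

```python
def download_candidates(version, mirrors):
    # Try the (faster) mirrors first; archive.apache.org is the
    # authoritative fallback that never removes old releases.
    path = "spark/spark-{0}/spark-{0}-bin-hadoop2.7.tgz".format(version)
    urls = ["{}/{}".format(m.rstrip("/"), path) for m in mirrors]
    urls.append("https://archive.apache.org/dist/" + path)
    return urls
```

A downloader would walk this list in order and stop at the first URL that responds successfully.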
[jira] [Commented] (SPARK-24529) Add spotbugs into maven build process
[ https://issues.apache.org/jira/browse/SPARK-24529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545284#comment-16545284 ] Apache Spark commented on SPARK-24529: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/21785 > Add spotbugs into maven build process > - > > Key: SPARK-24529 > URL: https://issues.apache.org/jira/browse/SPARK-24529 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Minor > Fix For: 2.4.0 > > > We will enable a Java bytecode check tool > [spotbugs|https://spotbugs.github.io/] to avoid possible integer overflow at > multiplication. Due to the tool limitation, some other checks will be enabled. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (SPARK-18230) MatrixFactorizationModel.recommendProducts throws NoSuchElement exception when the user does not exist
[ https://issues.apache.org/jira/browse/SPARK-18230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18230. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21740 [https://github.com/apache/spark/pull/21740] > MatrixFactorizationModel.recommendProducts throws NoSuchElement exception > when the user does not exist > -- > > Key: SPARK-18230 > URL: https://issues.apache.org/jira/browse/SPARK-18230 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.0.1 >Reporter: Mikael Ståldal >Assignee: shahid >Priority: Minor > Fix For: 2.4.0 > > > When invoking {{MatrixFactorizationModel.recommendProducts(Int, Int)}} with a > non-existing user, a {{java.util.NoSuchElementException}} is thrown: > {code} > java.util.NoSuchElementException: next on empty iterator > at scala.collection.Iterator$$anon$2.next(Iterator.scala:39) > at scala.collection.Iterator$$anon$2.next(Iterator.scala:37) > at > scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63) > at scala.collection.IterableLike$class.head(IterableLike.scala:107) > at > scala.collection.mutable.WrappedArray.scala$collection$IndexedSeqOptimized$$super$head(WrappedArray.scala:35) > at > scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126) > at scala.collection.mutable.WrappedArray.head(WrappedArray.scala:35) > at > org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendProducts(MatrixFactorizationModel.scala:169) > {code} > It would be nice if it returned the empty array, or threw a more specific > exception, and if that were documented in the ScalaDoc for the method.
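The requested behavior, an empty result (or at least a clearer error) for an unknown user, can be sketched with a toy dot-product recommender. This is not MLlib's implementation; the data shapes and names here are illustrative:

```python
def recommend_products(user_features, product_features, user, num):
    # Return an empty list for an unknown user instead of letting an
    # empty-iterator lookup surface as NoSuchElementException.
    if user not in user_features:
        return []
    uf = user_features[user]
    scores = [(pid, sum(a * b for a, b in zip(uf, pf)))
              for pid, pf in product_features.items()]
    scores.sort(key=lambda t: t[1], reverse=True)
    return [pid for pid, _ in scores[:num]]
```

The guard clause at the top is the whole point: the failure mode in the stack trace comes from assuming the user's feature vector always exists.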
[jira] [Commented] (SPARK-20174) Analyzer gives mysterious AnalysisException when posexplode used in withColumn
[ https://issues.apache.org/jira/browse/SPARK-20174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545282#comment-16545282 ] Valery Khamenya commented on SPARK-20174: - Guys, I have been tracking this issue for quite some time already. Prio "Minor" is applicable if there is a workaround for Spark users. Personally, I often have the situation that I need to _append_ those two columns coming from posexplode, leaving all the other columns intact. That is, something like: {code:java} df.withColumn(Seq("p", "c"), posexplode($"a")){code} is really wanted, or an alternative combo with the same semantics is badly needed for a DataFrame. > Analyzer gives mysterious AnalysisException when posexplode used in withColumn > -- > > Key: SPARK-20174 > URL: https://issues.apache.org/jira/browse/SPARK-20174 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Minor > > Wish I knew how to even describe the issue. It appears that {{posexplode}} > cannot be used in {{withColumn}}, but the error message does not seem to say > it. > [The scaladoc of > posexplode|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$@posexplode(e:org.apache.spark.sql.Column):org.apache.spark.sql.Column] > is silent about this "limitation", too. 
> {code} > scala> codes.printSchema > root > |-- id: integer (nullable = false) > |-- rate_plan_code: array (nullable = true) > ||-- element: string (containsNull = true) > scala> codes.withColumn("code", posexplode($"rate_plan_code")).show > org.apache.spark.sql.AnalysisException: The number of aliases supplied in the > AS clause does not match the number of columns output by the UDTF expected 2 > aliases but got code ; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:90) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$.makeGeneratorOutput(Analyzer.scala:1744) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator$$anonfun$apply$20$$anonfun$56.apply(Analyzer.scala:1691) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator$$anonfun$apply$20$$anonfun$56.apply(Analyzer.scala:1679) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator$$anonfun$apply$20.applyOrElse(Analyzer.scala:1679) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator$$anonfun$apply$20.applyOrElse(Analyzer.scala:1664) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62) > at > 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator$.apply(Analyzer.scala:1664) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator$.apply(Analyzer.scala:1629) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74) > at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:70) > at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:68) > at >
[jira] [Assigned] (SPARK-18230) MatrixFactorizationModel.recommendProducts throws NoSuchElement exception when the user does not exist
[ https://issues.apache.org/jira/browse/SPARK-18230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-18230: - Assignee: shahid > MatrixFactorizationModel.recommendProducts throws NoSuchElement exception > when the user does not exist > -- > > Key: SPARK-18230 > URL: https://issues.apache.org/jira/browse/SPARK-18230 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.0.1 >Reporter: Mikael Ståldal >Assignee: shahid >Priority: Minor > Fix For: 2.4.0 > > > When invoking {{MatrixFactorizationModel.recommendProducts(Int, Int)}} with a > non-existing user, a {{java.util.NoSuchElementException}} is thrown: > {code} > java.util.NoSuchElementException: next on empty iterator > at scala.collection.Iterator$$anon$2.next(Iterator.scala:39) > at scala.collection.Iterator$$anon$2.next(Iterator.scala:37) > at > scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63) > at scala.collection.IterableLike$class.head(IterableLike.scala:107) > at > scala.collection.mutable.WrappedArray.scala$collection$IndexedSeqOptimized$$super$head(WrappedArray.scala:35) > at > scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126) > at scala.collection.mutable.WrappedArray.head(WrappedArray.scala:35) > at > org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendProducts(MatrixFactorizationModel.scala:169) > {code} > It would be nice if it returned the empty array, or threw a more specific > exception, and if that were documented in the ScalaDoc for the method.
[jira] [Commented] (SPARK-24615) Accelerator aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545253#comment-16545253 ] Thomas Graves commented on SPARK-24615: --- [~jerryshao] ^ > Accelerator aware task scheduling for Spark > --- > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Major > Labels: Hydrogen, SPIP > > In the machine learning area, accelerator cards (GPU, FPGA, TPU) are > predominant compared to CPUs. To make the current Spark architecture work > with accelerator cards, Spark itself should understand the existence of > accelerators and know how to schedule tasks onto the executors where > accelerators are equipped. > Currently Spark's scheduler schedules tasks based on the locality of the data > plus the availability of CPUs. This will introduce some problems when scheduling > tasks that require accelerators. > # CPU cores usually outnumber accelerators on one node, so using CPU cores > to schedule accelerator-required tasks will introduce a mismatch. > # In one cluster, we always assume that a CPU is equipped in each node, but > this is not true of accelerator cards. > # The existence of heterogeneous tasks (accelerator required or not) > requires the scheduler to schedule tasks in a smart way. > So here we propose to improve the current scheduler to support heterogeneous > tasks (accelerator required or not). This can be part of the work of Project > Hydrogen. > Details are attached in a Google doc. It doesn't cover all the implementation > details, just highlights the parts that should be changed. > > CC [~yanboliang] [~merlintang]
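The mismatch the SPIP describes (scheduling by CPU slots when the binding resource is an accelerator) can be illustrated with a tiny capacity check. The resource names and data shapes are hypothetical, purely for illustration:

```python
def find_executor(executors, demand):
    # Match every requested resource (cpus, gpus, ...) against free
    # capacity, instead of considering CPU cores alone.
    for name, free in executors.items():
        if all(free.get(res, 0) >= qty for res, qty in demand.items()):
            return name
    return None
```

A CPU-only scheduler would happily place a GPU task on an executor with free cores but no GPU; checking the full demand vector avoids that.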
[jira] [Commented] (SPARK-21097) Dynamic allocation will preserve cached data
[ https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545240#comment-16545240 ] Brad commented on SPARK-21097: -- Hi [~menelaus] The processing time delay is just a way to simulate different sized processing workloads. In the test I just do a map over the dataframe and spin for a few microseconds on each row as configured. In the case of 0 µs there it's like there is almost no processing on the data. There is still the time of loading the data from hadoop. The dynamic allocation without recovery benchmark is much slower in the 0 µs cached data case because it has lost all of its cached data and has to reload from hadoop. You can see the performance is similar to the initial load. Thanks Brad > Dynamic allocation will preserve cached data > > > Key: SPARK-21097 > URL: https://issues.apache.org/jira/browse/SPARK-21097 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Scheduler, Spark Core >Affects Versions: 2.2.0, 2.3.0 >Reporter: Brad >Priority: Major > Attachments: Preserving Cached Data with Dynamic Allocation.pdf > > > We want to use dynamic allocation to distribute resources among many notebook > users on our spark clusters. One difficulty is that if a user has cached data > then we are either prevented from de-allocating any of their executors, or we > are forced to drop their cached data, which can lead to a bad user experience. > We propose adding a feature to preserve cached data by copying it to other > executors before de-allocation. This behavior would be enabled by a simple > spark config. Now when an executor reaches its configured idle timeout, > instead of just killing it on the spot, we will stop sending it new tasks, > replicate all of its rdd blocks onto other executors, and then kill it. 
If > there is an issue while we replicate the data, like an error, it takes too > long, or there isn't enough space, then we will fall back to the original > behavior and drop the data and kill the executor. > This feature should allow anyone with notebook users to use their cluster > resources more efficiently. Also, since it will be completely opt-in, it is > unlikely to cause problems for other use cases.
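The proposed idle-timeout flow (stop scheduling, replicate cached blocks, then kill, with a fallback to the old drop-and-kill behavior on any failure) can be sketched as follows. The function names are illustrative, not Spark APIs:

```python
def decommission(executor, replicate, kill):
    # Try to move cached RDD blocks elsewhere before killing the
    # executor; on any failure fall back to dropping the data.
    preserved = False
    try:
        replicate(executor)
        preserved = True
    except Exception:
        preserved = False  # cached data is dropped: original behavior
    kill(executor)  # the executor is killed either way
    return preserved
```

Note that the executor is always killed in the end; the feature only changes whether its cached data survives.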
[jira] [Commented] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN
[ https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545225#comment-16545225 ] Antony commented on SPARK-15343: {{--conf spark.hadoop.yarn.timeline-service.enabled=false}} works for me > NoClassDefFoundError when initializing Spark with YARN > -- > > Key: SPARK-15343 > URL: https://issues.apache.org/jira/browse/SPARK-15343 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > > I'm trying to connect Spark 2.0 (compiled from branch-2.0) with Hadoop. > Spark compiled with: > {code} > ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver > -Dhadoop.version=2.6.0 -DskipTests > {code} > I'm getting the following error > {code} > mbrynski@jupyter:~/spark$ bin/pyspark > Python 3.4.0 (default, Apr 11 2014, 13:05:11) > [GCC 4.8.2] on linux > Type "help", "copyright", "credits" or "license" for more information. > Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" > with specified deploy mode instead. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). > 16/05/16 11:54:41 WARN SparkConf: The configuration key 'spark.yarn.jar' has > been deprecated as of Spark 2.0 and may be removed in the future. Please use > the new key 'spark.yarn.jars' instead. > 16/05/16 11:54:41 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 16/05/16 11:54:42 WARN AbstractHandler: No Server set for > org.spark_project.jetty.server.handler.ErrorHandler@f7989f6 > 16/05/16 11:54:43 WARN DomainSocketFactory: The short-circuit local reads > feature cannot be used because libhadoop cannot be loaded. 
> Traceback (most recent call last): > File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in > sc = SparkContext() > File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__ > conf, jsc, profiler_cls) > File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init > self._jsc = jsc or self._initialize_context(self._conf._jconf) > File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in > _initialize_context > return self._jvm.JavaSparkContext(jconf) > File > "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 1183, in __call__ > File > "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line > 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling > None.org.apache.spark.api.java.JavaSparkContext. > : java.lang.NoClassDefFoundError: > com/sun/jersey/api/client/config/ClientConfig > at > org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45) > at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56) > at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148) > at org.apache.spark.SparkContext.(SparkContext.scala:502) > at > org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at 
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:236) > at > py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80) > at > py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ClassNotFoundException: > com.sun.jersey.api.client.config.ClientConfig > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) >
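The workaround in the comment above disables the YARN timeline client, so YarnClientImpl never tries to instantiate the missing Jersey `ClientConfig` class. A minimal sketch of injecting that flag into a launch command (pure string handling; the helper name and the rest of the command line are illustrative — only the conf key comes from the comment):

```python
# Hypothetical helper: prepend the timeline-service kill switch to a
# spark-submit invocation. Only the conf key/value is taken from the
# comment above; everything else here is illustrative.

def with_timeline_disabled(submit_args):
    # insert the --conf pair right after the spark-submit executable
    flag = "spark.hadoop.yarn.timeline-service.enabled=false"
    return submit_args[:1] + ["--conf", flag] + submit_args[1:]

cmd = with_timeline_disabled(["spark-submit", "--master", "yarn", "app.py"])
```

With the timeline service disabled, the YARN client skips `TimelineClient.createTimelineClient`, which is the call that fails when the Jersey jars are absent from the classpath.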
[jira] [Commented] (SPARK-24182) Improve error message for client mode when AM fails
[ https://issues.apache.org/jira/browse/SPARK-24182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545167#comment-16545167 ] Apache Spark commented on SPARK-24182: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/21784 > Improve error message for client mode when AM fails > --- > > Key: SPARK-24182 > URL: https://issues.apache.org/jira/browse/SPARK-24182 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.4.0 > > > Today, when the client AM fails, there's not a lot of useful information > printed on the output. Depending on the type of failure, the information > provided by the YARN AM is also not very useful. For example, you'd see this > in the Spark shell: > {noformat} > 18/05/04 11:07:38 ERROR spark.SparkContext: Error initializing SparkContext. > org.apache.spark.SparkException: Yarn application has already ended! It might > have been killed or unable to launch application master. 
> at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:86) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:63) > at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164) > at org.apache.spark.SparkContext.(SparkContext.scala:500) > [long stack trace] > {noformat} > Similarly, on the YARN RM, for certain failures you see a generic error like > this: > {noformat} > ExitCodeException exitCode=10: at > org.apache.hadoop.util.Shell.runCommand(Shell.java:543) at > org.apache.hadoop.util.Shell.run(Shell.java:460) at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:720) at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:366) > at > [blah blah blah] > {noformat} > It would be nice if we could provide a more accurate description of what went > wrong when possible. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20050) Kafka 0.10 DirectStream doesn't commit last processed batch's offset when graceful shutdown
[ https://issues.apache.org/jira/browse/SPARK-20050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sasaki Toru resolved SPARK-20050. - Resolution: Not A Problem > Kafka 0.10 DirectStream doesn't commit last processed batch's offset when > graceful shutdown > --- > > Key: SPARK-20050 > URL: https://issues.apache.org/jira/browse/SPARK-20050 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Sasaki Toru >Priority: Major > > I use Kafka 0.10 DirectStream with the property 'enable.auto.commit=false' and > call 'DirectKafkaInputDStream#commitAsync' at the end of each batch, as below > {code} > val kafkaStream = KafkaUtils.createDirectStream[String, String](...) > kafkaStream.map { input => > "key: " + input.key.toString + " value: " + input.value.toString + " > offset: " + input.offset.toString > }.foreachRDD { rdd => > rdd.foreach { input => > println(input) > } > } > kafkaStream.foreachRDD { rdd => > val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges > kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges) > } > {code} > Some records which were processed in the last batch before a graceful shutdown of Spark Streaming are reprocessed in the first batch after restart, as below > * output of the first run of this application > {code} > key: null value: 1 offset: 101452472 > key: null value: 2 offset: 101452473 > key: null value: 3 offset: 101452474 > key: null value: 4 offset: 101452475 > key: null value: 5 offset: 101452476 > key: null value: 6 offset: 101452477 > key: null value: 7 offset: 101452478 > key: null value: 8 offset: 101452479 > key: null value: 9 offset: 101452480 // this is the last record before > shutting down Spark Streaming gracefully > {code} > * output of the re-run of this application > {code} > key: null value: 7 offset: 101452478 // duplication > key: null value: 8 offset: 101452479 // duplication > key: null value: 9 offset: 101452480 // duplication > key: null value: 10 
offset: 101452481 > {code} > The cause may be that offsets passed to commitAsync are only committed at the head of the next batch.
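The "committed at the head of the next batch" behavior can be sketched with a toy model (illustrative Python, not Spark code): commitAsync only queues offsets, and the queue is flushed when the next batch is generated, so a graceful shutdown right after the final batch never flushes that batch's offsets.

```python
# Toy model of the commit timing described above. The class and field names
# are invented for illustration; only the queue-then-flush-next-batch
# behavior mirrors what the issue describes.

class DirectStreamModel:
    def __init__(self):
        self.committed = 0   # offset durable in Kafka's commit topic
        self.pending = None  # offset queued by commitAsync, not yet flushed

    def run_batch(self, upto):
        # the previous batch's queued commit is flushed when the next
        # batch is generated
        if self.pending is not None:
            self.committed = self.pending
        records = list(range(self.committed, upto))
        self.pending = upto  # commitAsync(offsetRanges) merely queues
        return records

app = DirectStreamModel()
app.run_batch(5)   # batch 1 processes offsets 0..4
app.run_batch(9)   # batch 2 processes offsets 5..8, then graceful shutdown

restarted = DirectStreamModel()
restarted.committed = app.committed  # Kafka only saw the flushed offset (5)
replay = restarted.run_batch(9)      # offsets 5..8 are processed again
```

The restart replays offsets 5..8 even though they were already processed, matching the duplicated `value: 7..9` records in the output above.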
[jira] [Commented] (SPARK-24799) A solution of dealing with data skew in left,right,inner join
[ https://issues.apache.org/jira/browse/SPARK-24799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545032#comment-16545032 ] Apache Spark commented on SPARK-24799: -- User 'marymwu' has created a pull request for this issue: https://github.com/apache/spark/pull/21783 > A solution of dealing with data skew in left,right,inner join > - > > Key: SPARK-24799 > URL: https://issues.apache.org/jira/browse/SPARK-24799 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0 >Reporter: marymwu >Priority: Major > > For the execution of left, right, and inner join statements, this solution is mainly about dividing the partitions where data skew has occurred into several partitions with a smaller data scale, in order to execute more tasks in parallel and increase efficiency.
[jira] [Updated] (SPARK-24812) Last Access Time in the table description is not valid
[ https://issues.apache.org/jira/browse/SPARK-24812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sujith updated SPARK-24812: --- Description: Last Access Time in the table description is not valid. Test steps: Step 1 - create a table Step 2 - Run command "DESC FORMATTED table" Last Access Time will always be displayed as the wrong date, Wed Dec 31 15:59:59 PST 1969. !image-2018-07-16-15-37-28-896.png! In Hive it's displayed as "UNKNOWN", which makes more sense than displaying a wrong date. Please find the snapshot tested in Hive for the same command: !image-2018-07-16-15-38-26-717.png! This seems to be a limitation as of now; it would be better to follow the Hive behavior in this scenario. was: Last Access Time in the table description is not valid. Test steps: Step 1 - create a table Step 2 - Run command "DESC FORMATTED table" Last Access Time will always be displayed as the wrong date, Wed Dec 31 15:59:59 PST 1969. In Hive it's displayed as "UNKNOWN", which makes more sense than displaying a wrong date. This seems to be a limitation as of now; it would be better to follow the Hive behavior in this scenario. > Last Access Time in the table description is not valid > -- > > Key: SPARK-24812 > URL: https://issues.apache.org/jira/browse/SPARK-24812 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.1 >Reporter: Sujith >Priority: Minor > Attachments: image-2018-07-16-15-37-28-896.png, > image-2018-07-16-15-38-26-717.png > > > Last Access Time in the table description is not valid. > Test steps: > Step 1 - create a table > Step 2 - Run command "DESC FORMATTED table" > Last Access Time will always be displayed as the wrong date, > Wed Dec 31 15:59:59 PST 1969. > !image-2018-07-16-15-37-28-896.png! > In Hive it's displayed as "UNKNOWN", which makes more sense than displaying a > wrong date. > Please find the snapshot tested in Hive for the same command: > !image-2018-07-16-15-38-26-717.png! > > This seems to be a limitation as of now; it would be better to follow the Hive > behavior in this scenario.
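Wed Dec 31 15:59:59 PST 1969 is exactly one second before the Unix epoch in UTC-8, which suggests the metastore's "unknown" sentinel for last access time (assumed here to be -1 seconds; Hive may also store 0) is being formatted as a real timestamp instead of printed as "UNKNOWN". A quick check:

```python
from datetime import datetime, timedelta, timezone

# The -1 sentinel is an assumption for illustration; the point is that any
# near-zero sentinel, formatted as a timestamp, lands just before the epoch.
sentinel_seconds = -1

rendered_utc = datetime.fromtimestamp(sentinel_seconds, tz=timezone.utc)
pst = timezone(timedelta(hours=-8))              # PST is UTC-8
rendered_pst = datetime.fromtimestamp(sentinel_seconds, tz=pst)
```

Rendering the sentinel in PST reproduces the reported "Dec 31 15:59:59 ... 1969" string, which supports treating the value as a sentinel (and printing "UNKNOWN", as Hive does) rather than as a date.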
[jira] [Updated] (SPARK-24812) Last Access Time in the table description is not valid
[ https://issues.apache.org/jira/browse/SPARK-24812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sujith updated SPARK-24812: --- Attachment: image-2018-07-16-15-38-26-717.png > Last Access Time in the table description is not valid > -- > > Key: SPARK-24812 > URL: https://issues.apache.org/jira/browse/SPARK-24812 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.1 >Reporter: Sujith >Priority: Minor > Attachments: image-2018-07-16-15-37-28-896.png, > image-2018-07-16-15-38-26-717.png > > > Last Access Time in the table description is not valid. > Test steps: > Step 1 - create a table > Step 2 - Run command "DESC FORMATTED table" > Last Access Time will always be displayed as the wrong date, > Wed Dec 31 15:59:59 PST 1969. > In Hive it's displayed as "UNKNOWN", which makes more sense than displaying a > wrong date. > This seems to be a limitation as of now; it would be better to follow the Hive > behavior in this scenario.
[jira] [Updated] (SPARK-24812) Last Access Time in the table description is not valid
[ https://issues.apache.org/jira/browse/SPARK-24812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sujith updated SPARK-24812: --- Attachment: image-2018-07-16-15-37-28-896.png > Last Access Time in the table description is not valid > -- > > Key: SPARK-24812 > URL: https://issues.apache.org/jira/browse/SPARK-24812 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.1 >Reporter: Sujith >Priority: Minor > Attachments: image-2018-07-16-15-37-28-896.png > > > Last Access Time in the table description is not valid. > Test steps: > Step 1 - create a table > Step 2 - Run command "DESC FORMATTED table" > Last Access Time will always be displayed as the wrong date, > Wed Dec 31 15:59:59 PST 1969. > In Hive it's displayed as "UNKNOWN", which makes more sense than displaying a > wrong date. > This seems to be a limitation as of now; it would be better to follow the Hive > behavior in this scenario.
[jira] [Issue Comment Deleted] (SPARK-24816) SQL interface support repartitionByRange
[ https://issues.apache.org/jira/browse/SPARK-24816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-24816: Comment: was deleted (was: I'm working on.) > SQL interface support repartitionByRange > > > Key: SPARK-24816 > URL: https://issues.apache.org/jira/browse/SPARK-24816 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > Attachments: DISTRIBUTE_BY_SORT_BY.png, > RANGE_DISTRIBUTE_BY_SORT_BY.png > > > SQL interface support for {{repartitionByRange}} to improve data pushdown. I > have tested this feature with a big table (data size: 1.1 T, row count: > 282,001,954,428). > The test SQL is: > {code:sql} > select * from table where id=401564838907 > {code} > The test result: > |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation > MB-seconds| > |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086| > |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846| > |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620| > |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774| > |RANGE PARTITION BY |38.5 GB|75355144|45 min|13 s|14525275297| > |RANGE PARTITION BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|
[jira] [Assigned] (SPARK-24816) SQL interface support repartitionByRange
[ https://issues.apache.org/jira/browse/SPARK-24816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24816: Assignee: Apache Spark > SQL interface support repartitionByRange > > > Key: SPARK-24816 > URL: https://issues.apache.org/jira/browse/SPARK-24816 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > Attachments: DISTRIBUTE_BY_SORT_BY.png, > RANGE_DISTRIBUTE_BY_SORT_BY.png > > > SQL interface support for {{repartitionByRange}} to improve data pushdown. I > have tested this feature with a big table (data size: 1.1 T, row count: > 282,001,954,428). > The test SQL is: > {code:sql} > select * from table where id=401564838907 > {code} > The test result: > |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation > MB-seconds| > |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086| > |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846| > |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620| > |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774| > |RANGE PARTITION BY |38.5 GB|75355144|45 min|13 s|14525275297| > |RANGE PARTITION BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|
[jira] [Assigned] (SPARK-24816) SQL interface support repartitionByRange
[ https://issues.apache.org/jira/browse/SPARK-24816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24816: Assignee: (was: Apache Spark) > SQL interface support repartitionByRange > > > Key: SPARK-24816 > URL: https://issues.apache.org/jira/browse/SPARK-24816 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > Attachments: DISTRIBUTE_BY_SORT_BY.png, > RANGE_DISTRIBUTE_BY_SORT_BY.png > > > SQL interface support for {{repartitionByRange}} to improve data pushdown. I > have tested this feature with a big table (data size: 1.1 T, row count: > 282,001,954,428). > The test SQL is: > {code:sql} > select * from table where id=401564838907 > {code} > The test result: > |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation > MB-seconds| > |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086| > |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846| > |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620| > |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774| > |RANGE PARTITION BY |38.5 GB|75355144|45 min|13 s|14525275297| > |RANGE PARTITION BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|
[jira] [Commented] (SPARK-24816) SQL interface support repartitionByRange
[ https://issues.apache.org/jira/browse/SPARK-24816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544980#comment-16544980 ] Apache Spark commented on SPARK-24816: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/21782 > SQL interface support repartitionByRange > > > Key: SPARK-24816 > URL: https://issues.apache.org/jira/browse/SPARK-24816 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > Attachments: DISTRIBUTE_BY_SORT_BY.png, > RANGE_DISTRIBUTE_BY_SORT_BY.png > > > SQL interface support for {{repartitionByRange}} to improve data pushdown. I > have tested this feature with a big table (data size: 1.1 T, row count: > 282,001,954,428). > The test SQL is: > {code:sql} > select * from table where id=401564838907 > {code} > The test result: > |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation > MB-seconds| > |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086| > |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846| > |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620| > |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774| > |RANGE PARTITION BY |38.5 GB|75355144|45 min|13 s|14525275297| > |RANGE PARTITION BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|
[jira] [Commented] (SPARK-24794) DriverWrapper should have both master addresses in -Dspark.master
[ https://issues.apache.org/jira/browse/SPARK-24794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544973#comment-16544973 ] Ecaterina commented on SPARK-24794: --- Yes, I also face this problem. It would be nice if somebody could answer this. > DriverWrapper should have both master addresses in -Dspark.master > - > > Key: SPARK-24794 > URL: https://issues.apache.org/jira/browse/SPARK-24794 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.2.1 >Reporter: Behroz Sikander >Priority: Major > > In standalone cluster mode, one can launch a driver with supervise mode > enabled. Spark launches the driver with a JVM argument -Dspark.master which > is set to the [host and port of the current > master|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/StandaloneRestServer.scala#L149]. > > During the life of the context, the Spark masters can switch for any reason. > After that, if the driver dies unexpectedly and comes back up, it tries to connect > to the master which was set initially with -Dspark.master, but that master > is in STANDBY mode. The context tries multiple times to connect to the standby > and then just kills itself. > > *Suggestion:* > While launching the driver process, the Spark master should use the [spark.master > passed as > input|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/StandaloneRestServer.scala#L124] > instead of the host and port of the current master. > Log messages that we observe: > > {code:java} > 2018-07-11 13:03:21,801 INFO appclient-register-master-threadpool-0 > org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint []: > Connecting to master spark://10.100.100.22:7077.. > . > 2018-07-11 13:03:21,806 INFO netty-rpc-connection-0 > org.apache.spark.network.client.TransportClientFactory []: Successfully > created connection to /10.100.100.22:7077 after 1 ms (0 ms spent in > bootstraps) > . 
> 2018-07-11 13:03:41,802 INFO appclient-register-master-threadpool-0 > org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint []: > Connecting to master spark://10.100.100.22:7077... > . > 2018-07-11 13:04:01,802 INFO appclient-register-master-threadpool-0 > org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint []: > Connecting to master spark://10.100.100.22:7077... > . > 2018-07-11 13:04:21,806 ERROR appclient-registration-retry-thread > org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend []: Application > has been killed. Reason: All masters are unresponsive! Giving up.{code}
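The suggestion amounts to keeping every master from the submitted spark.master URL so that a restarted, supervised driver can fail over instead of retrying only the master that was active at submit time. A hypothetical sketch (the helper names and the failover loop are illustrative, not Spark internals):

```python
# Illustrative-only model of multi-master failover. Spark standalone HA URLs
# list all masters as "spark://host1:port1,host2:port2"; the bug report is
# that the driver is launched with only one of them baked in.

def parse_masters(master_url):
    # split a multi-master URL into (host, port) candidates
    assert master_url.startswith("spark://")
    masters = []
    for part in master_url[len("spark://"):].split(","):
        host, _, port = part.partition(":")
        masters.append((host, int(port or "7077")))  # 7077 is the usual default
    return masters

def pick_active(masters, is_active):
    # try each candidate until one reports ACTIVE; standbys are skipped
    for m in masters:
        if is_active(m):
            return m
    return None

masters = parse_masters("spark://10.100.100.22:7077,10.100.100.23:7077")
# pretend .22 went STANDBY and .23 is now the active master
active = pick_active(masters, lambda m: m[0].endswith(".23"))
```

With only the single submit-time address available, the equivalent of `pick_active` has nothing to fail over to, which is exactly the "All masters are unresponsive! Giving up." loop in the log above.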
[jira] [Updated] (SPARK-24816) SQL interface support repartitionByRange
[ https://issues.apache.org/jira/browse/SPARK-24816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-24816: Description: SQL interface support for {{repartitionByRange}} to improve data pushdown. I have tested this feature with a big table (data size: 1.1 T, row count: 282,001,954,428). The test SQL is: {code:sql} select * from table where id=401564838907 {code} The test result: |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation MB-seconds| |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086| |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846| |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620| |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774| |RANGE PARTITION BY |38.5 GB|75355144|45 min|13 s|14525275297| |RANGE PARTITION BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698| was: SQL interface support for {{repartitionByRange}} to improve data pushdown. I have tested this feature with a big table (data size: 1.1 T, row count: 282,001,954,428). The test SQL is: {code:sql} select * from table where id=401564838907 {code} The test result: |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation MB-seconds| |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086| |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846| |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620| |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774| |RANGE DISTRIBUTE BY |38.5 GB|75355144|45 min|13 s|14525275297| |RANGE DISTRIBUTE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698| > SQL interface support repartitionByRange > > > Key: SPARK-24816 > URL: https://issues.apache.org/jira/browse/SPARK-24816 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > Attachments: DISTRIBUTE_BY_SORT_BY.png, > RANGE_DISTRIBUTE_BY_SORT_BY.png > > > SQL interface support for {{repartitionByRange}} to improve data pushdown. I > have tested this feature with a big table (data size: 1.1 T, row count: > 282,001,954,428). > The test SQL is: > {code:sql} > select * from table where id=401564838907 > {code} > The test result: > |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation > MB-seconds| > |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086| > |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846| > |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620| > |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774| > |RANGE PARTITION BY |38.5 GB|75355144|45 min|13 s|14525275297| > |RANGE PARTITION BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|
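As background, repartitionByRange samples the key column, derives cut points, and routes each row to the partition whose key range contains it; files written per range then carry narrow min/max statistics, which is why the point filter in the test scans 38.5 GB instead of nearly a terabyte. An illustrative sketch of that partitioning scheme (pure Python, not Spark's implementation):

```python
import bisect
import random

# Sketch of range partitioning: derive boundaries from a sample of the keys,
# then assign each key to the partition whose range contains it. All names
# here are invented for illustration.

def range_boundaries(sample, num_partitions):
    # pick (num_partitions - 1) cut points at even quantiles of the sample
    s = sorted(sample)
    return [s[len(s) * i // num_partitions] for i in range(1, num_partitions)]

def partition_for(key, boundaries):
    # partition p holds keys with boundaries[p-1] <= key < boundaries[p]
    return bisect.bisect_right(boundaries, key)

random.seed(0)
keys = [random.randrange(1_000_000) for _ in range(10_000)]
bounds = range_boundaries(random.sample(keys, 500), 8)

parts = {}
for k in keys:
    parts.setdefault(partition_for(k, bounds), []).append(k)
```

Because every partition covers a disjoint key range, a point predicate like `id = 401564838907` matches at most one partition's files, and all others can be skipped on their min/max statistics alone.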
[jira] [Created] (SPARK-24824) Make Spark task speculation a per-stage config
Jiang Xingbo created SPARK-24824: Summary: Make Spark task speculation a per-stage config Key: SPARK-24824 URL: https://issues.apache.org/jira/browse/SPARK-24824 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Jiang Xingbo Make Spark task speculation a per-stage config, so we can explicitly disable task speculation for a barrier stage.
[jira] [Created] (SPARK-24823) Cancel a job that contains barrier stage(s) if the barrier tasks don't get launched within a configured time
Jiang Xingbo created SPARK-24823: Summary: Cancel a job that contains barrier stage(s) if the barrier tasks don't get launched within a configured time Key: SPARK-24823 URL: https://issues.apache.org/jira/browse/SPARK-24823 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Jiang Xingbo Cancel a job that contains barrier stage(s) if the barrier tasks don't get launched within a configured time.
[jira] [Created] (SPARK-24822) Python support for barrier execution mode
Jiang Xingbo created SPARK-24822: Summary: Python support for barrier execution mode Key: SPARK-24822 URL: https://issues.apache.org/jira/browse/SPARK-24822 Project: Spark Issue Type: New Feature Components: PySpark, Spark Core Affects Versions: 2.4.0 Reporter: Jiang Xingbo Enable launching a job containing barrier stage(s) from PySpark.
[jira] [Created] (SPARK-24821) Fail fast when submitted job compute on a subset of all the partitions for a barrier stage
Jiang Xingbo created SPARK-24821: Summary: Fail fast when submitted job compute on a subset of all the partitions for a barrier stage Key: SPARK-24821 URL: https://issues.apache.org/jira/browse/SPARK-24821 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 2.4.0 Reporter: Jiang Xingbo Detect when SparkContext.runJob() launches a barrier stage with a subset of all the partitions; one example is the `first()` operation.
[jira] [Created] (SPARK-24820) Fail fast when submitted job contains PartitionPruningRDD in a barrier stage
Jiang Xingbo created SPARK-24820: Summary: Fail fast when submitted job contains PartitionPruningRDD in a barrier stage Key: SPARK-24820 URL: https://issues.apache.org/jira/browse/SPARK-24820 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 2.4.0 Reporter: Jiang Xingbo Detect when SparkContext.runJob() launches a barrier stage that includes a PartitionPruningRDD.
[jira] [Created] (SPARK-24819) Fail fast when no enough slots to launch the barrier stage on job submitted
Jiang Xingbo created SPARK-24819: Summary: Fail fast when no enough slots to launch the barrier stage on job submitted Key: SPARK-24819 URL: https://issues.apache.org/jira/browse/SPARK-24819 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 2.4.0 Reporter: Jiang Xingbo Check all the barrier stages when a job is submitted, to see whether they require more slots (to be able to launch all the barrier tasks in the same stage together) than there are currently active slots in the cluster. If the job requires more slots than are available (both busy and free), fail the job on submission.
[jira] [Created] (SPARK-24818) Ensure all the barrier tasks in the same stage are launched together
Jiang Xingbo created SPARK-24818: Summary: Ensure all the barrier tasks in the same stage are launched together Key: SPARK-24818 URL: https://issues.apache.org/jira/browse/SPARK-24818 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 2.4.0 Reporter: Jiang Xingbo When some executors/hosts are blacklisted, it may happen that only a part of the tasks in the same barrier stage can be launched. We shall detect this case and revert the allocated resource offers.
[jira] [Created] (SPARK-24817) Implement BarrierTaskContext.barrier()
Jiang Xingbo created SPARK-24817: Summary: Implement BarrierTaskContext.barrier() Key: SPARK-24817 URL: https://issues.apache.org/jira/browse/SPARK-24817 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 2.4.0 Reporter: Jiang Xingbo Implement BarrierTaskContext.barrier(), to support global sync between all the tasks in a barrier stage. The global sync shall finish immediately once all tasks in the same barrier stage reach the same barrier.
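The intended semantics resemble a classic thread barrier: every task blocks at the barrier and is released only once all tasks in the stage have arrived. A toy model with local threads (illustrative only; the real feature synchronizes tasks across executors, not threads in one process):

```python
import threading

# Local-thread stand-in for barrier tasks. Four "tasks" each record that
# they arrived, block on the barrier, then record that they were released.
NUM_TASKS = 4
barrier = threading.Barrier(NUM_TASKS)
events = []
events_lock = threading.Lock()

def task(task_id):
    with events_lock:
        events.append(("arrived", task_id))
    barrier.wait()  # returns only once all NUM_TASKS threads are waiting
    with events_lock:
        events.append(("released", task_id))

threads = [threading.Thread(target=task, args=(i,)) for i in range(NUM_TASKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because `barrier.wait()` cannot return until every task has called it, all "arrived" events necessarily precede every "released" event, which is the global-sync property the issue describes.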
[jira] [Updated] (SPARK-24538) ByteArrayDecimalType support push down to parquet data sources
[ https://issues.apache.org/jira/browse/SPARK-24538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-24538: Summary: ByteArrayDecimalType support push down to parquet data sources (was: ByteArrayDecimalType support push down to the data sources) > ByteArrayDecimalType support push down to parquet data sources > -- > > Key: SPARK-24538 > URL: https://issues.apache.org/jira/browse/SPARK-24538 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 2.4.0 > > > Latest parquet support decimal type statistics. then we can push down to the > data sources: > {noformat} > LM-SHC-16502798:parquet-mr yumwang$ java -jar > ./parquet-tools/target/parquet-tools-1.10.10-column-index-SNAPSHOT.jar meta > /tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet > file: > file:/tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet > creator: parquet-mr version 1.10.0 (build > 031a6654009e3b82020012a18434c582bd74c73a) > extra: org.apache.spark.sql.parquet.row.metadata = > {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}},{"name":"d1","type":"decimal(9,0)","nullable":true,"metadata":{}},{"name":"d2","type":"decimal(9,2)","nullable":true,"metadata":{}},{"name":"d3","type":"decimal(18,0)","nullable":true,"metadata":{}},{"name":"d4","type":"decimal(18,4)","nullable":true,"metadata":{}},{"name":"d5","type":"decimal(38,0)","nullable":true,"metadata":{}},{"name":"d6","type":"decimal(38,18)","nullable":true,"metadata":{}}]} > file schema: spark_schema > > id: REQUIRED INT64 R:0 D:0 > d1: OPTIONAL INT32 O:DECIMAL R:0 D:1 > d2: OPTIONAL INT32 O:DECIMAL R:0 D:1 > d3: OPTIONAL INT64 O:DECIMAL R:0 D:1 > d4: OPTIONAL INT64 O:DECIMAL R:0 D:1 > d5: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1 > d6: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1 > 
row group 1: RC:241867 TS:15480513 OFFSET:4 > > id: INT64 SNAPPY DO:0 FPO:4 SZ:968154/1935071/2.00 VC:241867 > ENC:BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0] > d1: INT32 SNAPPY DO:0 FPO:968158 SZ:967555/967515/1.00 VC:241867 > ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0] > d2: INT32 SNAPPY DO:0 FPO:1935713 SZ:967558/967515/1.00 VC:241867 > ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0.00, max: 241866.00, num_nulls: 0] > d3: INT64 SNAPPY DO:0 FPO:2903271 SZ:968866/1935047/2.00 VC:241867 > ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0] > d4: INT64 SNAPPY DO:0 FPO:3872137 SZ:1247007/1935047/1.55 > VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0., max: 241866., > num_nulls: 0] > d5: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5119144 > SZ:1266850/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: > 241866, num_nulls: 0] > d6: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:6385994 > SZ:2198910/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0E-18, > max: 241866.00, num_nulls: 0] > row group 2: RC:241867 TS:15480513 OFFSET:8584904 > > id: INT64 SNAPPY DO:0 FPO:8584904 SZ:968131/1935071/2.00 VC:241867 > ENC:BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0] > d1: INT32 SNAPPY DO:0 FPO:9553035 SZ:967563/967515/1.00 VC:241867 > ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0] > d2: INT32 SNAPPY DO:0 FPO:10520598 SZ:967563/967515/1.00 VC:241867 > ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867.00, max: 483733.00, num_nulls: 0] > d3: INT64 SNAPPY DO:0 FPO:11488161 SZ:968110/1935047/2.00 > VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0] > d4: INT64 SNAPPY DO:0 FPO:12456271 SZ:1247071/1935047/1.55 > VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867., max: 483733., > num_nulls: 0] > d5: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:13703342 > SZ:1270587/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, > max: 483733, num_nulls: 0] > d6: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 
FPO:14973929 > SZ:2197306/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: > 241867.00, max: 483733.00, num_nulls: > 0]{noformat} -- This message was sent by Atlassian JIRA
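The statistics dump above is the key enabler: for the FIXED_LEN_BYTE_ARRAY columns (d5, d6) the min/max statistics hold the big-endian two's-complement unscaled value of the decimal, which a pushed-down filter can decode and compare. A minimal plain-Scala sketch of that decoding (the helper name and predicate value are illustrative, not Spark's actual code):

{code:scala}
import java.math.{BigDecimal => JBigDecimal, BigInteger}

// Decode a Parquet FIXED_LEN_BYTE_ARRAY decimal statistic:
// the bytes are the big-endian two's-complement unscaled value.
def binaryToDecimal(bytes: Array[Byte], scale: Int): JBigDecimal =
  new JBigDecimal(new BigInteger(bytes), scale)

// For d6 = decimal(38,18) the row-group min 0E-18 decodes from 16 zero bytes:
val min = binaryToDecimal(Array.fill[Byte](16)(0), 18)  // 0E-18

// A pushed-down predicate value can then be compared against the statistic
// to decide whether a whole row group may be skipped.
val predicate = new JBigDecimal("1.5")
val skipRowGroup = predicate.compareTo(min) < 0  // false here, so read it
{code}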
[jira] [Resolved] (SPARK-24538) ByteArrayDecimalType support push down to the data sources
[ https://issues.apache.org/jira/browse/SPARK-24538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-24538. - Resolution: Fixed Assignee: Yuming Wang Fix Version/s: 2.4.0 Target Version/s: 2.4.0 > ByteArrayDecimalType support push down to the data sources > -- > > Key: SPARK-24538 > URL: https://issues.apache.org/jira/browse/SPARK-24538 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 2.4.0 > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (SPARK-24549) Support DecimalType push down to the parquet data sources
[ https://issues.apache.org/jira/browse/SPARK-24549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-24549: --- Assignee: Yuming Wang > Support DecimalType push down to the parquet data sources > - > > Key: SPARK-24549 > URL: https://issues.apache.org/jira/browse/SPARK-24549 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21791) ORC should support column names with dot
[ https://issues.apache.org/jira/browse/SPARK-21791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544899#comment-16544899 ] Furcy Pin commented on SPARK-21791: --- Indeed, it works like this. Awesome! thanks! > ORC should support column names with dot > > > Key: SPARK-21791 > URL: https://issues.apache.org/jira/browse/SPARK-21791 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Dongjoon Hyun >Priority: Major > Fix For: 2.3.0 > > > *PARQUET* > {code} > scala> Seq(Some(1), None).toDF("col.dots").write.parquet("/tmp/parquet_dot") > scala> spark.read.parquet("/tmp/parquet_dot").show > +--------+ > |col.dots| > +--------+ > | 1| > | null| > +--------+ > {code} > *ORC* > {code} > scala> Seq(Some(1), None).toDF("col.dots").write.orc("/tmp/orc_dot") > scala> spark.read.orc("/tmp/orc_dot").show > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '.' expecting ':'(line 1, pos 10) > == SQL == > struct<col.dots:int> > ----------^^^ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
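With the fix in 2.3.0, the ORC path matches the Parquet behaviour shown in the issue. A quick spark-shell sketch to verify (assumes a local session on Spark 2.3.0 or later):

{code:scala}
// Write and read back a column whose name contains a dot.
Seq(Some(1), None).toDF("col.dots").write.orc("/tmp/orc_dot")

// No more ParseException on read; note that referencing the column
// explicitly still needs backticks because of the dot in its name.
spark.read.orc("/tmp/orc_dot").select("`col.dots`").show()
{code}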
[jira] [Updated] (SPARK-24816) SQL interface support repartitionByRange
[ https://issues.apache.org/jira/browse/SPARK-24816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-24816: Attachment: DISTRIBUTE_BY_SORT_BY.png RANGE_DISTRIBUTE_BY_SORT_BY.png > SQL interface support repartitionByRange > > > Key: SPARK-24816 > URL: https://issues.apache.org/jira/browse/SPARK-24816 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > Attachments: DISTRIBUTE_BY_SORT_BY.png, > RANGE_DISTRIBUTE_BY_SORT_BY.png > > > SQL interface support {{repartitionByRange}} to improvement data pushdown. I > have test this feature with a big table(data size: 1.1 T, row count: > 282,001,954,428) . > The test sql is: > {code:sql} > select * from table where id=401564838907 > {code} > The test result: > |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation > MB-seconds| > |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086| > |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846| > |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620| > |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774| > |RANGE DISTRIBUTE BY |38.5 GB|75355144|45 min|13 s|14525275297| > |RANGE DISTRIBUTE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698| -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24558) Driver prints the wrong info in the log when the executor which holds cacheBlock is IDLE.Time-out value displayed is not as per configuration value.
[ https://issues.apache.org/jira/browse/SPARK-24558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandeep katta updated SPARK-24558: -- Affects Version/s: 2.2.1 2.2.2 > Driver prints the wrong info in the log when the executor which holds > cacheBlock is IDLE.Time-out value displayed is not as per configuration value. > > > Key: SPARK-24558 > URL: https://issues.apache.org/jira/browse/SPARK-24558 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1, 2.2.2, 2.3.1 >Reporter: sandeep katta >Assignee: sandeep katta >Priority: Minor > Fix For: 2.4.0 > > > > launch spark-sql > spark-sql>cache table sample; > 2018-05-15 14:02:28 INFO ExecutorAllocationManager:54 - New executor 2 has > registered (new total is 1) > 2018-05-15 14:02:28 INFO TaskSetManager:54 - Starting task 0.0 in stage 0.0 > (TID 0, vm3, executor 2, partition 0, NODE_LOCAL, 7956 bytes) > 2018-05-15 14:02:28 INFO BlockManagerMasterEndpoint:54 - Registering block > manager vm3:53439 with 93.3 MB RAM, BlockManagerId(2, vm3, 53439, None) > 2018-05-15 14:02:28 INFO BlockManagerInfo:54 - Added broadcast_1_piece0 in > memory on vm3:53439 (size: 9.5 KB, free: 93.3 MB) > 2018-05-15 14:02:29 INFO BlockManagerInfo:54 - Added broadcast_0_piece0 in > memory on vm3:53439 (size: 33.8 KB, free: 93.3 MB) > 2018-05-15 14:02:29 INFO YarnSchedulerBackend$YarnDriverEndpoint:54 - > Registered executor NettyRpcEndpointRef(spark-client://Executor) > (10.18.99.35:44288) with ID 1 > 2018-05-15 14:02:29 INFO ExecutorAllocationManager:54 - New executor 1 has > registered (new total is 2) > > ... 
> 2018-05-15 14:04:31 INFO ExecutorAllocationManager:54 - Request to remove > executorIds: 2 > 2018-05-15 14:04:31 INFO YarnClientSchedulerBackend:54 - Requesting to kill > executor(s) 2 > 2018-05-15 14:04:31 INFO YarnClientSchedulerBackend:54 - Actual list of > executor(s) to be killed is 2 > 2018-05-15 14:04:31 INFO ExecutorAllocationManager:54 - Removing executor 2 > because it has been idle for 60 seconds (new desired total will be 1) *//It > should be 120 not 60* > 2018-05-15 14:04:32 INFO ExecutorAllocationManager:54 - Request to remove > executorIds: 1 > 2018-05-15 14:04:32 INFO YarnClientSchedulerBackend:54 - Requesting to kill > executor(s) 1 > 2018-05-15 14:04:32 INFO YarnClientSchedulerBackend:54 - Actual list of > executor(s) to be killed is 1 > 2018-05-15 14:04:32 INFO ExecutorAllocationManager:54 - Removing executor 1 > because it has been idle for 60 seconds (new desired total will be 0) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
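Two separate settings are involved here: an executor holding cached blocks is governed by the cached-executor idle timeout, yet the removal message reports the plain idle timeout. A sketch of a configuration that would reproduce the report (the 120 s value is an assumption based on the "should be 120 not 60" note in the log above):

{noformat}
spark-sql \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.executorIdleTimeout=60s \
  --conf spark.dynamicAllocation.cachedExecutorIdleTimeout=120s
{noformat}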
[jira] [Assigned] (SPARK-24558) Driver prints the wrong info in the log when the executor which holds cacheBlock is IDLE.Time-out value displayed is not as per configuration value.
[ https://issues.apache.org/jira/browse/SPARK-24558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-24558: --- Assignee: sandeep katta > Driver prints the wrong info in the log when the executor which holds > cacheBlock is IDLE.Time-out value displayed is not as per configuration value. > > > Key: SPARK-24558 > URL: https://issues.apache.org/jira/browse/SPARK-24558 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: sandeep katta >Assignee: sandeep katta >Priority: Minor > Fix For: 2.4.0 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24558) Driver prints the wrong info in the log when the executor which holds cacheBlock is IDLE.Time-out value displayed is not as per configuration value.
[ https://issues.apache.org/jira/browse/SPARK-24558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-24558. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21565 [https://github.com/apache/spark/pull/21565] > Driver prints the wrong info in the log when the executor which holds > cacheBlock is IDLE.Time-out value displayed is not as per configuration value. > > > Key: SPARK-24558 > URL: https://issues.apache.org/jira/browse/SPARK-24558 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: sandeep katta >Priority: Minor > Fix For: 2.4.0 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24816) SQL interface support repartitionByRange
[ https://issues.apache.org/jira/browse/SPARK-24816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-24816: Description: SQL interface support {{repartitionByRange}} to improvement data pushdown. I have test this feature with a big table(data size: 1.1 T, row count: 282,001,954,428) . The test sql is: {code:sql} select * from table where id=401564838907 {code} The test result: |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation MB-seconds| |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086| |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846| |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620| |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774| |RANGE DISTRIBUTE BY |38.5 GB|75355144|45 min|13 s|14525275297| |RANGE DISTRIBUTE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698| was: SQL interface support {{repartitionByRange}} to improvement data pushdown. I have test this feature with a big table(data size: 1.1 T, row count: 282,001,954,428) . The test sql is: {code:sql} select * from table where id=401564838907 {code} The test result: |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation MB-seconds| |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086| |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846| |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620| |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774| |RANGE BY |38.5 GB|75355144|45 min|13 s|14525275297| |RANGE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698| > SQL interface support repartitionByRange > > > Key: SPARK-24816 > URL: https://issues.apache.org/jira/browse/SPARK-24816 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
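The Dataset API side of this already exists; the improvement is to expose the same range partitioning through SQL. A sketch of the existing API that the "RANGE DISTRIBUTE BY + SORT BY" mode corresponds to (output path and partition count are illustrative):

{code:scala}
// Range-partition by the filter column so each output file covers a narrow
// id range; combined with a sort, min/max statistics become highly selective.
df.repartitionByRange(200, $"id")
  .sortWithinPartitions($"id")
  .write.parquet("/tmp/table_range")
{code}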
[jira] [Commented] (SPARK-23874) Upgrade apache/arrow to 0.10.0
[ https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544841#comment-16544841 ] Xiao Li commented on SPARK-23874: - [~bryanc] [~icexelloss] I saw you are working on the JIRA https://issues.apache.org/jira/browse/ARROW-2704. Since our code freeze of Spark 2.4 release is Aug 1st. Does it mean we will miss the upgrade of arrow in Spark 2.4? > Upgrade apache/arrow to 0.10.0 > -- > > Key: SPARK-23874 > URL: https://issues.apache.org/jira/browse/SPARK-23874 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Bryan Cutler >Priority: Major > > Version 0.10.0 will allow for the following improvements and bug fixes: > * Allow for adding BinaryType support > * Bug fix related to array serialization ARROW-1973 > * Python2 str will be made into an Arrow string instead of bytes ARROW-2101 > * Python bytearrays are supported in as input to pyarrow ARROW-2141 > * Java has common interface for reset to cleanup complex vectors in Spark > ArrowWriter ARROW-1962 > * Cleanup pyarrow type equality checks ARROW-2423 > * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, > ARROW-2645 > * Improved low level handling of messages for RecordBatch ARROW-2704 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24810) Fix paths to resource files in AvroSuite
[ https://issues.apache.org/jira/browse/SPARK-24810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24810. - Resolution: Fixed Assignee: Maxim Gekk Fix Version/s: 2.4.0 > Fix paths to resource files in AvroSuite > > > Key: SPARK-24810 > URL: https://issues.apache.org/jira/browse/SPARK-24810 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 2.4.0 > > Attachments: Screen Shot 2018-07-15 at 15.28.13.png > > > Currently paths to tests files from resource folder are relative in > AvroSuite. It causes problems like impossibility for running tests from IDE. > Need to wrap test files by: > {code:scala} > def testFile(fileName: String): String = { > > Thread.currentThread().getContextClassLoader.getResource(fileName).toString > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
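A sketch of how the wrapper is used inside a test (the resource file name is an assumption, not taken from the suite):

{code:scala}
def testFile(fileName: String): String =
  Thread.currentThread().getContextClassLoader.getResource(fileName).toString

// The returned string is an absolute URL, so the test resolves the file
// regardless of the working directory, including when launched from an IDE.
val df = spark.read.format("avro").load(testFile("episodes.avro"))
{code}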