[jira] [Resolved] (SPARK-36561) Remove `ColumnVector.numNulls`
[ https://issues.apache.org/jira/browse/SPARK-36561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-36561. -- Resolution: Invalid > Remove `ColumnVector.numNulls` > -- > > Key: SPARK-36561 > URL: https://issues.apache.org/jira/browse/SPARK-36561 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Dmitry Sysolyatin >Priority: Major > > Hi! > When I was implementing the ColumnVector abstract class, I started to check where > `numNulls` is used in the Spark source code. I didn't find any places where it is > used except tests. > Are there any plans to use it in the future? If not, I propose removing the > `numNulls` method. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36561) Remove `ColumnVector.numNulls`
[ https://issues.apache.org/jira/browse/SPARK-36561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403528#comment-17403528 ] Hyukjin Kwon commented on SPARK-36561: -- {{ColumnVector}} is an API. Let's keep them. > Remove `ColumnVector.numNulls` > -- > > Key: SPARK-36561 > URL: https://issues.apache.org/jira/browse/SPARK-36561 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Dmitry Sysolyatin >Priority: Major > > Hi! > When I was implementing the ColumnVector abstract class, I started to check where > `numNulls` is used in the Spark source code. I didn't find any places where it is > used except tests. > Are there any plans to use it in the future? If not, I propose removing the > `numNulls` method. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36562) Improve InsertIntoHadoopFsRelation file commit logic
[ https://issues.apache.org/jira/browse/SPARK-36562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-36562: - Affects Version/s: 3.3.0 > Improve InsertIntoHadoopFsRelation file commit logic > > > Key: SPARK-36562 > URL: https://issues.apache.org/jira/browse/SPARK-36562 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.3.0 >Reporter: angerszhu >Priority: Major > > Improve InsertIntoHadoopFsRelation file commit logic -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36562) Improve InsertIntoHadoopFsRelation file commit logic
[ https://issues.apache.org/jira/browse/SPARK-36562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-36562: - Fix Version/s: (was: 3.3.0) > Improve InsertIntoHadoopFsRelation file commit logic > > > Key: SPARK-36562 > URL: https://issues.apache.org/jira/browse/SPARK-36562 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: angerszhu >Priority: Major > > Improve InsertIntoHadoopFsRelation file commit logic -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36537) Take care of other tests disabled for CategoricalDtype.
[ https://issues.apache.org/jira/browse/SPARK-36537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36537: Assignee: Apache Spark > Take care of other tests disabled for CategoricalDtype. > --- > > Key: SPARK-36537 > URL: https://issues.apache.org/jira/browse/SPARK-36537 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Assignee: Apache Spark >Priority: Major > > There are some more tests disabled for CategoricalDtype. > They seem like pandas' bugs or not maintained anymore because inplace > updates with CategoricalDtype are deprecated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36537) Take care of other tests disabled for CategoricalDtype.
[ https://issues.apache.org/jira/browse/SPARK-36537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36537: Assignee: (was: Apache Spark) > Take care of other tests disabled for CategoricalDtype. > --- > > Key: SPARK-36537 > URL: https://issues.apache.org/jira/browse/SPARK-36537 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Priority: Major > > There are some more tests disabled for CategoricalDtype. > They seem like pandas' bugs or not maintained anymore because inplace > updates with CategoricalDtype are deprecated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36537) Take care of other tests disabled for CategoricalDtype.
[ https://issues.apache.org/jira/browse/SPARK-36537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403527#comment-17403527 ] Apache Spark commented on SPARK-36537: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/33817 > Take care of other tests disabled for CategoricalDtype. > --- > > Key: SPARK-36537 > URL: https://issues.apache.org/jira/browse/SPARK-36537 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Priority: Major > > There are some more tests disabled for CategoricalDtype. > They seem like pandas' bugs or not maintained anymore because inplace > updates with CategoricalDtype are deprecated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36537) Take care of other tests disabled for CategoricalDtype.
[ https://issues.apache.org/jira/browse/SPARK-36537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-36537: Description: There are some more tests disabled for CategoricalDtype. They seem like pandas' bugs or not maintained anymore because inplace updates with CategoricalDtype are deprecated. was: There are some more tests disabled related to inplace updates with CategoricalDtype. They seem like pandas' bugs or not maintained anymore because inplace updates with CategoricalDtype are deprecated. > Take care of other tests disabled for CategoricalDtype. > --- > > Key: SPARK-36537 > URL: https://issues.apache.org/jira/browse/SPARK-36537 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Priority: Major > > There are some more tests disabled for CategoricalDtype. > They seem like pandas' bugs or not maintained anymore because inplace > updates with CategoricalDtype are deprecated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36537) Take care of other tests disabled for CategoricalDtype.
[ https://issues.apache.org/jira/browse/SPARK-36537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-36537: Summary: Take care of other tests disabled for CategoricalDtype. (was: Take care of other tests disabled related to inplace updates with CategoricalDtype.) > Take care of other tests disabled for CategoricalDtype. > --- > > Key: SPARK-36537 > URL: https://issues.apache.org/jira/browse/SPARK-36537 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Priority: Major > > There are some more tests disabled related to inplace updates with > CategoricalDtype. > They seem like pandas' bugs or not maintained anymore because inplace updates > with CategoricalDtype are deprecated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
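For context on the deprecation mentioned in SPARK-36537, a minimal pandas sketch of the pattern behind the disabled tests (assuming pandas >= 1.3, where the inplace parameter of the categorical accessor methods is deprecated; the values are illustrative, not taken from the tests):
{code:python}
import pandas as pd

ser = pd.Series(["a", "b", "a"], dtype="category")

# Deprecated style: mutate the categories in place. On pandas >= 1.3 this emits
# a FutureWarning and is slated for removal.
# ser.cat.add_categories(["c"], inplace=True)

# Recommended style: reassign the result instead of updating in place.
ser = ser.cat.add_categories(["c"])
print(ser.cat.categories)  # Index(['a', 'b', 'c'], dtype='object')
{code}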
[jira] [Commented] (SPARK-36558) Stage has all tasks finished but with ongoing finalization can cause job hang
[ https://issues.apache.org/jira/browse/SPARK-36558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403508#comment-17403508 ] Venkata krishnan Sowrirajan commented on SPARK-36558: - One thing I missed here is not setting the newly constructed `DAGScheduler` into the SparkContext. Let me try that and update here. > Stage has all tasks finished but with ongoing finalization can cause job hang > - > > Key: SPARK-36558 > URL: https://issues.apache.org/jira/browse/SPARK-36558 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.2.0, 3.3.0 >Reporter: wuyi >Priority: Blocker > > > For a stage that all tasks are finished but with ongoing finalization can > lead to job hang. The problem is that such stage is considered as a "missing" > stage (see > [https://github.com/apache/spark/blob/a47ceaf5492040063e31e17570678dc06846c36c/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L719-L721).] > And it breaks the original assumption that a "missing" stage must have tasks > to run. > Normally, if stage A is the parent of (result) stage B and all tasks have > finished in stage A, stage A will be skipped directly when submitting stage > B. However, with this bug, stage A will be submitted. And submitting a stage > with no tasks to run would not be able to add its child stage into the > waiting stage list, which leads to the job hang in the end. > > The example to reproduce: > First, change `MyRDD` to allow it to compute: > {code:java} > override def compute(split: Partition, context: TaskContext): Iterator[(Int, > Int)] = { > Iterator.single((1, 1)) > }{code} > Then run this test: > {code:java} > test("Job hang") { > initPushBasedShuffleConfs(conf) > conf.set(config.SHUFFLE_MERGER_LOCATIONS_MIN_STATIC_THRESHOLD, 5) > DAGSchedulerSuite.clearMergerLocs > DAGSchedulerSuite.addMergerLocs(Seq("host1", "host2", "host3", "host4", > "host5")) > val latch = new CountDownLatch(1) > val myDAGScheduler = new MyDAGScheduler( > sc, > sc.dagScheduler.taskScheduler, > sc.listenerBus, > sc.env.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster], > sc.env.blockManager.master, > sc.env) { > override def scheduleShuffleMergeFinalize(stage: ShuffleMapStage): Unit = > { > // By this, we can mimic a stage with all tasks finished > // but finalization is incomplete. > latch.countDown() > } > } > sc.dagScheduler = myDAGScheduler > sc.taskScheduler.setDAGScheduler(myDAGScheduler) > val parts = 20 > val shuffleMapRdd = new MyRDD(sc, parts, Nil) > val shuffleDep = new ShuffleDependency(shuffleMapRdd, new > HashPartitioner(parts)) > val reduceRdd1 = new MyRDD(sc, parts, List(shuffleDep), tracker = > mapOutputTracker) > reduceRdd1.countAsync() > latch.await() > // scalastyle:off > println("=after wait==") > // set _shuffleMergedFinalized to true can avoid the hang. > // shuffleDep._shuffleMergedFinalized = true > val reduceRdd2 = new MyRDD(sc, parts, List(shuffleDep)) > reduceRdd2.count() > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36558) Stage has all tasks finished but with ongoing finalization can cause job hang
[ https://issues.apache.org/jira/browse/SPARK-36558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403507#comment-17403507 ] Venkata krishnan Sowrirajan commented on SPARK-36558: - [~Ngone51] I am not sure if you can run the tests that way, because then the `DAGSchedulerSuite` custom `DAGScheduler` or `DAGSchedulerEventLoopTester` won't be used therefore it submits the job with the default `DAGScheduler` and waits in the `finalizeShuffleMerge` which cannot complete because of our mocked up merger locations. Btw below is the test code along with the above changes you mentioned added to `MyRDD`. Am I missing something? {code:java} test("Job hang") { initPushBasedShuffleConfs(conf) conf.set("spark.shuffle.push.mergerLocations.minThreshold", "5") DAGSchedulerSuite.clearMergerLocs DAGSchedulerSuite.addMergerLocs(Seq("host1", "host2", "host3", "host4", "host5")) val latch = new CountDownLatch(1) scheduler = new MyDAGScheduler( sc, taskScheduler, sc.listenerBus, mapOutputTracker, blockManagerMaster, sc.env) { override private[spark] def scheduleShuffleMergeFinalize( stage: ShuffleMapStage): Unit = { // By this, we can mimic a stage with all tasks finished // but finalization is incomplete. latch.countDown() super.scheduleShuffleMergeFinalize(stage) } } dagEventProcessLoopTester = new DAGSchedulerEventProcessLoopTester(scheduler) val parts = 5 val shuffleMapRdd = new MyRDD(sc, parts, Nil) val shuffleDep = new ShuffleDependency(shuffleMapRdd, new HashPartitioner(parts)) val reduceRdd1 = new MyRDD(sc, parts, List(shuffleDep), tracker = mapOutputTracker) reduceRdd1.count() latch.await() // scalastyle:off println("=after wait==") // set _shuffleMergedFinalized to true can avoid the hang. val reduceRdd2 = new MyRDD(sc, parts, List(shuffleDep)) reduceRdd2.count() } {code} > Stage has all tasks finished but with ongoing finalization can cause job hang > - > > Key: SPARK-36558 > URL: https://issues.apache.org/jira/browse/SPARK-36558 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.2.0, 3.3.0 >Reporter: wuyi >Priority: Blocker > > > For a stage that all tasks are finished but with ongoing finalization can > lead to job hang. The problem is that such stage is considered as a "missing" > stage (see > [https://github.com/apache/spark/blob/a47ceaf5492040063e31e17570678dc06846c36c/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L719-L721).] > And it breaks the original assumption that a "missing" stage must have tasks > to run. > Normally, if stage A is the parent of (result) stage B and all tasks have > finished in stage A, stage A will be skipped directly when submitting stage > B. However, with this bug, stage A will be submitted. And submitting a stage > with no tasks to run would not be able to add its child stage into the > waiting stage list, which leads to the job hang in the end. 
> > The example to reproduce: > First, change `MyRDD` to allow it to compute: > {code:java} > override def compute(split: Partition, context: TaskContext): Iterator[(Int, > Int)] = { > Iterator.single((1, 1)) > }{code} > Then run this test: > {code:java} > test("Job hang") { > initPushBasedShuffleConfs(conf) > conf.set(config.SHUFFLE_MERGER_LOCATIONS_MIN_STATIC_THRESHOLD, 5) > DAGSchedulerSuite.clearMergerLocs > DAGSchedulerSuite.addMergerLocs(Seq("host1", "host2", "host3", "host4", > "host5")) > val latch = new CountDownLatch(1) > val myDAGScheduler = new MyDAGScheduler( > sc, > sc.dagScheduler.taskScheduler, > sc.listenerBus, > sc.env.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster], > sc.env.blockManager.master, > sc.env) { > override def scheduleShuffleMergeFinalize(stage: ShuffleMapStage): Unit = > { > // By this, we can mimic a stage with all tasks finished > // but finalization is incomplete. > latch.countDown() > } > } > sc.dagScheduler = myDAGScheduler > sc.taskScheduler.setDAGScheduler(myDAGScheduler) > val parts = 20 > val shuffleMapRdd = new MyRDD(sc, parts, Nil) > val shuffleDep = new ShuffleDependency(shuffleMapRdd, new > HashPartitioner(parts)) > val reduceRdd1 = new MyRDD(sc, parts, List(shuffleDep), tracker = > mapOutputTracker) > reduceRdd1.countAsync() > latch.await() > // scalastyle:off > println("=after wait==") > // set _shuffleMergedFinalized to true can avoid the hang. > // shuffleDep._shuffleMergedFinalized = true > val reduceRdd2 = new MyRDD(sc, parts, List(shuffleDep)) >
[jira] [Commented] (SPARK-36565) "Unscaled value too large for precision" while reading simple parquet file readable with parquet-tools and pandas
[ https://issues.apache.org/jira/browse/SPARK-36565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403475#comment-17403475 ] Fu Chen commented on SPARK-36565: - How to generate this parquet file? > "Unscaled value too large for precision" while reading simple parquet file > readable with parquet-tools and pandas > - > > Key: SPARK-36565 > URL: https://issues.apache.org/jira/browse/SPARK-36565 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 3.1.2 >Reporter: Lorenzo Martini >Priority: Minor > Attachments: broken_parquet_file.parquet > > > I have a simple parquet file (attached to the ticket) with 2 columns > (array, decimal) that can be read and viewed correctly using pandas > or parquet-tools. Reading the parquet file in spark (and pyspark) seems to > work, but calling `.show()` throws the exception (1) with > {code:java} > Caused by: java.lang.ArithmeticException: Unscaled value too large for > precision at. > {code} > > > Another interesting detail is that reading the parquet file and doing a > select on individual columns allows for `show()` to work correctly without > throwing. > {code:java} > >>> repro = spark.read.parquet(".../broken_file.parquet") > >>> repro.printSchema() > root > |-- column_a: array (nullable = true) > ||-- element: string (containsNull = true) > |-- column_b: decimal(4,0) (nullable = true) > >>> repro.select("column_a").show() > ++ > |column_a| > ++ > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > ++ > >>> repro.select("column_b").show() > ++ > |column_b| > ++ > | 11590| > | 11590| > | 11590| > | 11590| > | 11590| > | 11590| > | 11590| > | 11590| > | 11590| > | 11590| > ++ > >>> repro.show() // THIS ONE THROWS EXCEPTION (1) > {code} > Using `parquet-tools` shows the dataset correctly > {code:java} > >>> parquet-tools show broken_file.parquet > +++ > | column_a | column_b | > |+| > || 11590 | > || 11590 | > || 11590 | > || 11590 | > || 11590 | > || 11590 | > || 11590 | > || 11590 | > || 11590 | > || 11590 | > +++ > {code} > > > And the same with `pandas` > {code:java} > >>> import pandas as pd > >>> pd.read_parquet(".../broken_file.parquet") > column_a column_b > 0 None11590 > 1 None11590 > 2 None11590 > 3 None11590 > 4 None11590 > 5 None11590 > 6 None11590 > 7 None11590 > 8 None11590 > 9 None11590 > {code} > > I have also verified this affects all versions of spark between 2.4.0 and > 3.1.2 > Here the Exception (1) thrown (sorry about the poor formatting, didn't seem > to manage to make it work): > > {code:java} > >>> spark.version '3.1.2' > >>> df = spark.read.parquet(".../broken_parquet_file.parquet") > >>> df > DataFrame[column_a: array, column_b: decimal(4,0)] > >>> df.show() 21/08/23 18:39:36 > ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)/ 1] > org.apache.spark.sql.execution.QueryExecutionException: Encounter error while > reading parquet files. One possible cause: Parquet column cannot be converted > in the corresponding files. 
Details: at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345) > at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at > org.apache.spark.scheduler.Task.run(Task.scala:131) at >
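A quick way to dig into the mismatch reported in SPARK-36565 is to compare the decimal type recorded in the file footer (decimal(4,0)) against the stored values (11590 needs five digits). Below is a minimal diagnostic sketch with pyarrow, assuming the attached file is available locally; the path is a placeholder:
{code:python}
import pyarrow.parquet as pq

path = "broken_parquet_file.parquet"  # placeholder path to the attached file

# The footer schema shows the declared logical type of each column.
print(pq.read_schema(path))

# Reading the values without Spark mirrors what parquet-tools and pandas show above.
print(pq.read_table(path).to_pandas())
{code}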
[jira] [Created] (SPARK-36571) Optimized FileOutputCommitter with StagingDir
angerszhu created SPARK-36571: - Summary: Optimized FileOutputCommitter with StagingDir Key: SPARK-36571 URL: https://issues.apache.org/jira/browse/SPARK-36571 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: angerszhu -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36558) Stage has all tasks finished but with ongoing finalization can cause job hang
[ https://issues.apache.org/jira/browse/SPARK-36558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-36558: - Description: For a stage that all tasks are finished but with ongoing finalization can lead to job hang. The problem is that such stage is considered as a "missing" stage (see [https://github.com/apache/spark/blob/a47ceaf5492040063e31e17570678dc06846c36c/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L719-L721).] And it breaks the original assumption that a "missing" stage must have tasks to run. Normally, if stage A is the parent of (result) stage B and all tasks have finished in stage A, stage A will be skipped directly when submitting stage B. However, with this bug, stage A will be submitted. And submitting a stage with no tasks to run would not be able to add its child stage into the waiting stage list, which leads to the job hang in the end. The example to reproduce: First, change `MyRDD` to allow it to compute: {code:java} override def compute(split: Partition, context: TaskContext): Iterator[(Int, Int)] = { Iterator.single((1, 1)) }{code} Then run this test: {code:java} test("Job hang") { initPushBasedShuffleConfs(conf) conf.set(config.SHUFFLE_MERGER_LOCATIONS_MIN_STATIC_THRESHOLD, 5) DAGSchedulerSuite.clearMergerLocs DAGSchedulerSuite.addMergerLocs(Seq("host1", "host2", "host3", "host4", "host5")) val latch = new CountDownLatch(1) val myDAGScheduler = new MyDAGScheduler( sc, sc.dagScheduler.taskScheduler, sc.listenerBus, sc.env.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster], sc.env.blockManager.master, sc.env) { override def scheduleShuffleMergeFinalize(stage: ShuffleMapStage): Unit = { // By this, we can mimic a stage with all tasks finished // but finalization is incomplete. latch.countDown() } } sc.dagScheduler = myDAGScheduler sc.taskScheduler.setDAGScheduler(myDAGScheduler) val parts = 20 val shuffleMapRdd = new MyRDD(sc, parts, Nil) val shuffleDep = new ShuffleDependency(shuffleMapRdd, new HashPartitioner(parts)) val reduceRdd1 = new MyRDD(sc, parts, List(shuffleDep), tracker = mapOutputTracker) reduceRdd1.countAsync() latch.await() // scalastyle:off println("=after wait==") // set _shuffleMergedFinalized to true can avoid the hang. // shuffleDep._shuffleMergedFinalized = true val reduceRdd2 = new MyRDD(sc, parts, List(shuffleDep)) reduceRdd2.count() } {code} was: For a stage that all tasks are finished but with ongoing finalization can lead to job hang. The problem is that such stage is considered as a "missing" stage (see [https://github.com/apache/spark/blob/a47ceaf5492040063e31e17570678dc06846c36c/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L719-L721).] And it breaks the original assumption that a "missing" stage must have tasks to run. Normally, if stage A is the parent of (result) stage B and all tasks have finished in stage A, stage A will be skipped directly when submitting stage B. However, with this bug, stage A will be submitted. And submitting a stage with no tasks to run would not be able to add its child stage into the waiting stage list, which leads to the job hang in the end. 
The example to reproduce: First, change `MyRDD` to allow it compute: override def compute(split: Partition, context: TaskContext): Iterator[(Int, Int)] = \{ Iterator.single((1, 1)) } Then run this test: {code:java} test("Job hang") { initPushBasedShuffleConfs(conf) conf.set(config.SHUFFLE_MERGER_LOCATIONS_MIN_STATIC_THRESHOLD, 5) DAGSchedulerSuite.clearMergerLocs DAGSchedulerSuite.addMergerLocs(Seq("host1", "host2", "host3", "host4", "host5")) val latch = new CountDownLatch(1) val myDAGScheduler = new MyDAGScheduler( sc, sc.dagScheduler.taskScheduler, sc.listenerBus, sc.env.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster], sc.env.blockManager.master, sc.env) { override def scheduleShuffleMergeFinalize(stage: ShuffleMapStage): Unit = { // By this, we can mimic a stage with all tasks finished // but finalization is incomplete. latch.countDown() } } sc.dagScheduler = myDAGScheduler sc.taskScheduler.setDAGScheduler(myDAGScheduler) val parts = 20 val shuffleMapRdd = new MyRDD(sc, parts, Nil) val shuffleDep = new ShuffleDependency(shuffleMapRdd, new HashPartitioner(parts)) val reduceRdd1 = new MyRDD(sc, parts, List(shuffleDep), tracker = mapOutputTracker) reduceRdd1.countAsync() latch.await() // scalastyle:off println("=after wait==") // set _shuffleMergedFinalized to true can avoid the hang. // shuffleDep._shuffleMergedFinalized = true val reduceRdd2 = new MyRDD(sc, parts, List(shuffleDep)) reduceRdd2.count() } {code} > Stage has all tasks finished but with ongoing finalization can cause job hang >
[jira] [Updated] (SPARK-36558) Stage has all tasks finished but with ongoing finalization can cause job hang
[ https://issues.apache.org/jira/browse/SPARK-36558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-36558: - Description: For a stage that all tasks are finished but with ongoing finalization can lead to job hang. The problem is that such stage is considered as a "missing" stage (see [https://github.com/apache/spark/blob/a47ceaf5492040063e31e17570678dc06846c36c/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L719-L721).] And it breaks the original assumption that a "missing" stage must have tasks to run. Normally, if stage A is the parent of (result) stage B and all tasks have finished in stage A, stage A will be skipped directly when submitting stage B. However, with this bug, stage A will be submitted. And submitting a stage with no tasks to run would not be able to add its child stage into the waiting stage list, which leads to the job hang in the end. The example to reproduce: First, change `MyRDD` to allow it compute: override def compute(split: Partition, context: TaskContext): Iterator[(Int, Int)] = \{ Iterator.single((1, 1)) } Then run this test: {code:java} test("Job hang") { initPushBasedShuffleConfs(conf) conf.set(config.SHUFFLE_MERGER_LOCATIONS_MIN_STATIC_THRESHOLD, 5) DAGSchedulerSuite.clearMergerLocs DAGSchedulerSuite.addMergerLocs(Seq("host1", "host2", "host3", "host4", "host5")) val latch = new CountDownLatch(1) val myDAGScheduler = new MyDAGScheduler( sc, sc.dagScheduler.taskScheduler, sc.listenerBus, sc.env.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster], sc.env.blockManager.master, sc.env) { override def scheduleShuffleMergeFinalize(stage: ShuffleMapStage): Unit = { // By this, we can mimic a stage with all tasks finished // but finalization is incomplete. latch.countDown() } } sc.dagScheduler = myDAGScheduler sc.taskScheduler.setDAGScheduler(myDAGScheduler) val parts = 20 val shuffleMapRdd = new MyRDD(sc, parts, Nil) val shuffleDep = new ShuffleDependency(shuffleMapRdd, new HashPartitioner(parts)) val reduceRdd1 = new MyRDD(sc, parts, List(shuffleDep), tracker = mapOutputTracker) reduceRdd1.countAsync() latch.await() // scalastyle:off println("=after wait==") // set _shuffleMergedFinalized to true can avoid the hang. // shuffleDep._shuffleMergedFinalized = true val reduceRdd2 = new MyRDD(sc, parts, List(shuffleDep)) reduceRdd2.count() } {code} was: For a stage that all tasks are finished but with ongoing finalization can lead to job hang. The problem is that such stage is considered as a "missing" stage (see [https://github.com/apache/spark/blob/a47ceaf5492040063e31e17570678dc06846c36c/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L719-L721).] And it breaks the original assumption that a "missing" stage must have tasks to run. Normally, if stage A is the parent of (result) stage B and all tasks have finished in stage A, stage A will be skipped directly when submitting stage B. However, with this bug, stage A will be submitted. And submitting a stage with no tasks to run would not be able to add its child stage into the waiting stage list, which leads to the job hang in the end. 
The example to reproduce: {code:java} test("Job hang") { initPushBasedShuffleConfs(conf) conf.set(config.SHUFFLE_MERGER_LOCATIONS_MIN_STATIC_THRESHOLD, 5) DAGSchedulerSuite.clearMergerLocs DAGSchedulerSuite.addMergerLocs(Seq("host1", "host2", "host3", "host4", "host5")) val latch = new CountDownLatch(1) val myDAGScheduler = new MyDAGScheduler( sc, sc.dagScheduler.taskScheduler, sc.listenerBus, sc.env.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster], sc.env.blockManager.master, sc.env) { override def scheduleShuffleMergeFinalize(stage: ShuffleMapStage): Unit = { // By this, we can mimic a stage with all tasks finished // but finalization is incomplete. latch.countDown() } } sc.dagScheduler = myDAGScheduler sc.taskScheduler.setDAGScheduler(myDAGScheduler) val parts = 20 val shuffleMapRdd = new MyRDD(sc, parts, Nil) val shuffleDep = new ShuffleDependency(shuffleMapRdd, new HashPartitioner(parts)) val reduceRdd1 = new MyRDD(sc, parts, List(shuffleDep), tracker = mapOutputTracker) reduceRdd1.countAsync() latch.await() // scalastyle:off println("=after wait==") // set _shuffleMergedFinalized to true can avoid the hang. // shuffleDep._shuffleMergedFinalized = true val reduceRdd2 = new MyRDD(sc, parts, List(shuffleDep)) reduceRdd2.count() } {code} > Stage has all tasks finished but with ongoing finalization can cause job hang > - > > Key: SPARK-36558 > URL: https://issues.apache.org/jira/browse/SPARK-36558 >
[jira] [Updated] (SPARK-36558) Stage has all tasks finished but with ongoing finalization can cause job hang
[ https://issues.apache.org/jira/browse/SPARK-36558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-36558: - Description: For a stage that all tasks are finished but with ongoing finalization can lead to job hang. The problem is that such stage is considered as a "missing" stage (see [https://github.com/apache/spark/blob/a47ceaf5492040063e31e17570678dc06846c36c/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L719-L721).] And it breaks the original assumption that a "missing" stage must have tasks to run. Normally, if stage A is the parent of (result) stage B and all tasks have finished in stage A, stage A will be skipped directly when submitting stage B. However, with this bug, stage A will be submitted. And submitting a stage with no tasks to run would not be able to add its child stage into the waiting stage list, which leads to the job hang in the end. The example to reproduce: {code:java} test("Job hang") { initPushBasedShuffleConfs(conf) conf.set(config.SHUFFLE_MERGER_LOCATIONS_MIN_STATIC_THRESHOLD, 5) DAGSchedulerSuite.clearMergerLocs DAGSchedulerSuite.addMergerLocs(Seq("host1", "host2", "host3", "host4", "host5")) val latch = new CountDownLatch(1) val myDAGScheduler = new MyDAGScheduler( sc, sc.dagScheduler.taskScheduler, sc.listenerBus, sc.env.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster], sc.env.blockManager.master, sc.env) { override def scheduleShuffleMergeFinalize(stage: ShuffleMapStage): Unit = { // By this, we can mimic a stage with all tasks finished // but finalization is incomplete. latch.countDown() } } sc.dagScheduler = myDAGScheduler sc.taskScheduler.setDAGScheduler(myDAGScheduler) val parts = 20 val shuffleMapRdd = new MyRDD(sc, parts, Nil) val shuffleDep = new ShuffleDependency(shuffleMapRdd, new HashPartitioner(parts)) val reduceRdd1 = new MyRDD(sc, parts, List(shuffleDep), tracker = mapOutputTracker) reduceRdd1.countAsync() latch.await() // scalastyle:off println("=after wait==") // set _shuffleMergedFinalized to true can avoid the hang. // shuffleDep._shuffleMergedFinalized = true val reduceRdd2 = new MyRDD(sc, parts, List(shuffleDep)) reduceRdd2.count() } {code} was: For a stage that all tasks are finished but with ongoing finalization can lead to job hang. The problem is that such stage is considered as a "missing" stage (see [https://github.com/apache/spark/blob/a47ceaf5492040063e31e17570678dc06846c36c/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L719-L721).] And it breaks the original assumption that a "missing" stage must have tasks to run. Normally, if stage A is the parent of (result) stage B and all tasks have finished in stage A, stage A will be skipped directly when submitting stage B. However, with this bug, stage A will be submitted. And submitting a stage with no tasks to run would not be able to add its child stage into the waiting stage list, which leads to the job hang in the end. 
The example to reproduce: {code:java} test("Job hang") { initPushBasedShuffleConfs(conf) conf.set(config.SHUFFLE_MERGER_LOCATIONS_MIN_STATIC_THRESHOLD, 5) DAGSchedulerSuite.clearMergerLocs DAGSchedulerSuite.addMergerLocs(Seq("host1", "host2", "host3", "host4", "host5")) val latch = new CountDownLatch(1) val myDAGScheduler = new MyDAGScheduler( sc, sc.dagScheduler.taskScheduler, sc.listenerBus, sc.env.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster], sc.env.blockManager.master, sc.env) { override def scheduleShuffleMergeFinalize(stage: ShuffleMapStage): Unit = { // By this, we can mimic a stage with all tasks finished // but finalization is incomplete. latch.countDown() } } sc.dagScheduler = myDAGScheduler sc.taskScheduler.setDAGScheduler(myDAGScheduler) val parts = 20 val shuffleMapRdd = new MyRDD(sc, parts, Nil) val shuffleDep = new ShuffleDependency(shuffleMapRdd, new HashPartitioner(parts)) val reduceRdd1 = new MyRDD(sc, parts, List(shuffleDep), tracker = mapOutputTracker) reduceRdd1.countAsync() latch.await() // set _shuffleMergedFinalized to true can avoid the hang. // shuffleDep._shuffleMergedFinalized = true val reduceRdd2 = new MyRDD(sc, parts, List(shuffleDep)) reduceRdd2.count() } {code} > Stage has all tasks finished but with ongoing finalization can cause job hang > - > > Key: SPARK-36558 > URL: https://issues.apache.org/jira/browse/SPARK-36558 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.2.0, 3.3.0 >Reporter: wuyi >Priority: Blocker > > > For a stage that all tasks are finished but with ongoing
[jira] [Commented] (SPARK-36558) Stage has all tasks finished but with ongoing finalization can cause job hang
[ https://issues.apache.org/jira/browse/SPARK-36558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403456#comment-17403456 ] wuyi commented on SPARK-36558: -- [~vsowrirajan] Sorry, I missed one tweaked change in `MyRDD`. We should allow `MyRDD` to compute: {code:java} override def compute(split: Partition, context: TaskContext): Iterator[(Int, Int)] = { Iterator.single((1, 1)) }{code} Could you help verify it again? > Stage has all tasks finished but with ongoing finalization can cause job hang > - > > Key: SPARK-36558 > URL: https://issues.apache.org/jira/browse/SPARK-36558 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.2.0, 3.3.0 >Reporter: wuyi >Priority: Blocker > > > For a stage that all tasks are finished but with ongoing finalization can > lead to job hang. The problem is that such stage is considered as a "missing" > stage (see > [https://github.com/apache/spark/blob/a47ceaf5492040063e31e17570678dc06846c36c/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L719-L721).] > And it breaks the original assumption that a "missing" stage must have tasks > to run. > Normally, if stage A is the parent of (result) stage B and all tasks have > finished in stage A, stage A will be skipped directly when submitting stage > B. However, with this bug, stage A will be submitted. And submitting a stage > with no tasks to run would not be able to add its child stage into the > waiting stage list, which leads to the job hang in the end. > > The example to reproduce: > {code:java} > test("Job hang") { > initPushBasedShuffleConfs(conf) > conf.set(config.SHUFFLE_MERGER_LOCATIONS_MIN_STATIC_THRESHOLD, 5) > DAGSchedulerSuite.clearMergerLocs > DAGSchedulerSuite.addMergerLocs(Seq("host1", "host2", "host3", "host4", > "host5")) > val latch = new CountDownLatch(1) > val myDAGScheduler = new MyDAGScheduler( > sc, > sc.dagScheduler.taskScheduler, > sc.listenerBus, > sc.env.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster], > sc.env.blockManager.master, > sc.env) { > override def scheduleShuffleMergeFinalize(stage: ShuffleMapStage): Unit = > { > // By this, we can mimic a stage with all tasks finished > // but finalization is incomplete. > latch.countDown() > } > } > sc.dagScheduler = myDAGScheduler > sc.taskScheduler.setDAGScheduler(myDAGScheduler) > val parts = 20 > val shuffleMapRdd = new MyRDD(sc, parts, Nil) > val shuffleDep = new ShuffleDependency(shuffleMapRdd, new > HashPartitioner(parts)) > val reduceRdd1 = new MyRDD(sc, parts, List(shuffleDep), tracker = > mapOutputTracker) > reduceRdd1.countAsync() > latch.await() > // set _shuffleMergedFinalized to true can avoid the hang. > // shuffleDep._shuffleMergedFinalized = true > val reduceRdd2 = new MyRDD(sc, parts, List(shuffleDep)) > reduceRdd2.count() > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36560) Deflake PySpark coverage report
[ https://issues.apache.org/jira/browse/SPARK-36560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-36560. -- Fix Version/s: 3.3.0 Assignee: Hyukjin Kwon Resolution: Fixed Fixed in https://github.com/apache/spark/pull/33808 > Deflake PySpark coverage report > --- > > Key: SPARK-36560 > URL: https://issues.apache.org/jira/browse/SPARK-36560 > Project: Spark > Issue Type: Improvement > Components: Project Infra, PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.3.0 > > > https://github.com/apache/spark/runs/3388727798?check_suite_focus=true > https://github.com/apache/spark/runs/3392972609?check_suite_focus=true > https://github.com/apache/spark/runs/3359880048?check_suite_focus=true > https://github.com/apache/spark/runs/3338876122?check_suite_focus=true > PySpark scheduled coverage jobs are flaky. We should deflake them -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403446#comment-17403446 ] Dongjoon Hyun commented on SPARK-25075: --- FYI, here is the pre-built Apache Spark 3.2.0 RC1 spark-shell (Scala 2.13.5) result. {code} $ bin/spark-shell ... Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.2.0 /_/ Using Scala version 2.13.5 (OpenJDK 64-Bit Server VM, Java 1.8.0_302) Type in expressions to have them evaluated. Type :help for more information. Spark context Web UI available at http://localhost:4040 Spark context available as 'sc' (master = local[*], app id = local-1629769598806). Spark session available as 'spark'. {code} > Build and test Spark against Scala 2.13 > --- > > Key: SPARK-25075 > URL: https://issues.apache.org/jira/browse/SPARK-25075 > Project: Spark > Issue Type: Umbrella > Components: Build, MLlib, Project Infra, Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Guillaume Massé >Priority: Major > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.13 milestone. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403445#comment-17403445 ] Dongjoon Hyun commented on SPARK-25075: --- [~ekrich]. Yes, SPARK-34218 completed it already for Apache Spark 3.2.0. You can see the pre-built binaries at Apache Spark 3.2.0 RC1 vote artifacts. - https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-bin/ {code} spark-3.2.0-bin-hadoop3.2-scala2.13.tgz spark-3.2.0-bin-hadoop3.2-scala2.13.tgz.asc spark-3.2.0-bin-hadoop3.2-scala2.13.tgz.sha512 {code} The RC1 vote failed and we will proceed to RC2. Please download the Scala 2.13 pre-build artifacts and provide your feedbacks, [~ekrich]. > Build and test Spark against Scala 2.13 > --- > > Key: SPARK-25075 > URL: https://issues.apache.org/jira/browse/SPARK-25075 > Project: Spark > Issue Type: Umbrella > Components: Build, MLlib, Project Infra, Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Guillaume Massé >Priority: Major > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.13 milestone. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36570) Investigate native support for raw data containing commas
Xinrong Meng created SPARK-36570: Summary: Investigate native support for raw data containing commas Key: SPARK-36570 URL: https://issues.apache.org/jira/browse/SPARK-36570 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: Xinrong Meng For raw data containing commas as thousands separators, pandas handles them automatically. We should recognize them as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
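A minimal pandas-on-Spark sketch of the manual clean-up that native support would make unnecessary (the sample values and the str.replace-based workaround are illustrative assumptions, not part of the ticket):
{code:python}
import pyspark.pandas as ps

psser = ps.Series(["1,000", "2,500,000", "75"])

# Today the thousands separators have to be stripped by hand before converting;
# SPARK-36570 asks for them to be recognized natively, as pandas can do for the
# reporter's use case.
cleaned = psser.str.replace(",", "").astype(int)
print(cleaned.to_list())  # [1000, 2500000, 75]
{code}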
[jira] [Created] (SPARK-36569) Support errors='coerce' for ps.to_numeric
Xinrong Meng created SPARK-36569: Summary: Support errors='coerce' for ps.to_numeric Key: SPARK-36569 URL: https://issues.apache.org/jira/browse/SPARK-36569 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: Xinrong Meng -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
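SPARK-36569 carries no description, but the pandas semantics it refers to are simple: with errors='coerce', values that cannot be parsed become NaN instead of raising. A short sketch of the target behavior; the pandas-on-Spark call is shown commented out since it is exactly what this ticket would enable:
{code:python}
import pandas as pd

# pandas semantics that ps.to_numeric would match: unparsable values are
# coerced to NaN rather than raising a ValueError.
print(pd.to_numeric(pd.Series(["1.0", "2", "oops"]), errors="coerce"))
# 0    1.0
# 1    2.0
# 2    NaN
# dtype: float64

# The equivalent pandas-on-Spark call this ticket would add support for:
#   import pyspark.pandas as ps
#   ps.to_numeric(ps.Series(["1.0", "2", "oops"]), errors="coerce")
{code}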
[jira] [Created] (SPARK-36568) Missed broadcast join in V2 plan
Bruce Robbins created SPARK-36568: - Summary: Missed broadcast join in V2 plan Key: SPARK-36568 URL: https://issues.apache.org/jira/browse/SPARK-36568 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: Bruce Robbins There are some joins that use broadcast hash join with DataSourceV1 but sort merge join with DataSourceV2, even though the two joins are loading the same files [1]. Example: Create data: {noformat} import scala.util.Random val rand = new Random(245665L) val df = spark.range(1, 2).map { x => (x, rand.alphanumeric.take(20).mkString, rand.alphanumeric.take(20).mkString, rand.alphanumeric.take(20).mkString ) }.toDF("key", "col1", "col2", "col3") df.write.mode("overwrite").format("parquet").save("/tmp/tbl") df.write.mode("overwrite").format("parquet").save("/tmp/lookup") {noformat} Run this code: {noformat} bin/spark-shell --conf spark.sql.autoBroadcastJoinThreshold=40 spark.read.format("parquet").load("/tmp/tbl").createOrReplaceTempView("tbl") spark.read.format("parquet").load("/tmp/lookup").createOrReplaceTempView("lookup") sql("""select t.key, t.col1, t.col2, t.col3 from tbl t join lookup l on t.key = l.key""").explain {noformat} For V2, do the same, except set {{spark.sql.sources.useV1SourceList=""}}. For V1, the result is: {noformat} == Physical Plan == AdaptiveSparkPlan isFinalPlan=false +- Project [key#0L, col1#1, col2#2, col3#3] +- BroadcastHashJoin [key#0L], [key#8L], Inner, BuildRight, false :- Filter isnotnull(key#0L) : +- FileScan parquet [key#0L,col1#1,col2#2,col3#3] Batched: true, DataFilters: [isnotnull(key#0L)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/tbl], PartitionFilters: [], PushedFilters: [IsNotNull(key)], ReadSchema: struct +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]),false), [id=#32] +- Filter isnotnull(key#8L) +- FileScan parquet [key#8L] Batched: true, DataFilters: [isnotnull(key#8L)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/lookup], PartitionFilters: [], PushedFilters: [IsNotNull(key)], ReadSchema: struct {noformat} For V2, the result is: {noformat} == Physical Plan == AdaptiveSparkPlan isFinalPlan=false +- Project [key#0L, col1#1, col2#2, col3#3] +- SortMergeJoin [key#0L], [key#8L], Inner :- Sort [key#0L ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(key#0L, 200), ENSURE_REQUIREMENTS, [id=#33] : +- Filter isnotnull(key#0L) :+- BatchScan[key#0L, col1#1, col2#2, col3#3] ParquetScan DataFilters: [isnotnull(key#0L)], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/tbl], PartitionFilters: [], PushedFilters: [IsNotNull(key)], ReadSchema: struct, PushedFilters: [IsNotNull(key)] RuntimeFilters: [] +- Sort [key#8L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(key#8L, 200), ENSURE_REQUIREMENTS, [id=#34] +- Filter isnotnull(key#8L) +- BatchScan[key#8L] ParquetScan DataFilters: [isnotnull(key#8L)], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/lookup], PartitionFilters: [], PushedFilters: [IsNotNull(key)], ReadSchema: struct, PushedFilters: [IsNotNull(key)] RuntimeFilters: [] {noformat} The initial plan with V1 uses broadcast hash join, but the initial plan with V2 uses sort merge join. The V1 logical plan contains a projection over the relation for {{lookup}}, which restricts the output columns to just {{key}}. 
As a result, {{SizeInBytesOnlyStatsPlanVisitor#visitUnaryNode}}, when visiting the project node, reduces sizeInBytes based on the pruning: {noformat} Project [key#0L, col1#1, col2#2, col3#3] +- Join Inner, (key#0L = key#8L) :- Filter isnotnull(key#0L) : +- Relation [key#0L,col1#1,col2#2,col3#3] parquet +- Project [key#8L] +- Filter isnotnull(key#8L) +- Relation [key#8L,col1#9,col2#10,col3#11] parquet {noformat} The V2 logical plan does not contain this projection: {noformat} +- Join Inner, (key#0L = key#8L) :- Filter isnotnull(key#0L) : +- RelationV2[key#0L, col1#1, col2#2, col3#3] parquet file:/tmp/tbl +- Filter isnotnull(key#8L) +- RelationV2[key#8L] parquet file:/tmp/lookup {noformat} [1] With my example, AQE converts the join to a broadcast hash join at run time for the V2 case. However, if AQE was disabled, it would obviously remain a sort merge join. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
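For anyone reproducing SPARK-36568 from PySpark, here is a sketch of the same setup that also covers footnote [1]: with AQE disabled, the V2 plan keeps the sort merge join at runtime too. The settings are standard Spark SQL configs and the paths/threshold mirror the report; this is an illustration, not a definitive test:
{code:python}
# Launch with the V1 source list emptied so the V2 path is exercised, e.g.:
#   bin/pyspark --conf spark.sql.autoBroadcastJoinThreshold=40 \
#               --conf spark.sql.sources.useV1SourceList=""
# Disabling AQE keeps the initial sort merge join as the runtime plan as well.
spark.conf.set("spark.sql.adaptive.enabled", "false")

spark.read.format("parquet").load("/tmp/tbl").createOrReplaceTempView("tbl")
spark.read.format("parquet").load("/tmp/lookup").createOrReplaceTempView("lookup")

spark.sql("""
    select t.key, t.col1, t.col2, t.col3
    from tbl t join lookup l on t.key = l.key
""").explain()
{code}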
[jira] [Commented] (SPARK-36558) Stage has all tasks finished but with ongoing finalization can cause job hang
[ https://issues.apache.org/jira/browse/SPARK-36558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403404#comment-17403404 ] Venkata krishnan Sowrirajan commented on SPARK-36558: - [~Ngone51] I tried the above test which you shared but I needed to make few changes to basically have shuffle merge enabled properly, without that what the `scheduleShuffleMergeFinalize` won't get invoked therefore it can infinitely wait with the CountDownLatch. Please take a look at the modified test below. 1. Keep `numMergerLocs` equal to `numParts` for the stage to have shuffleMergeEnabled 2. Also use `submit` option inside `DAGSchedulerSuite` to run an action on an RDD. Since the 'runningStages` set would have the stage currently in merge finalization (but all the partitions are available) step therefore the same stage won't get submitted with empty partitions to be computed. This check [here|[https://github.com/apache/spark/blob/a47ceaf5492040063e31e17570678dc06846c36c/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1253]] would prevent the same stage getting submitted again. Also below is the fixed unit test and this seems to run fine. {code:java} test("Job hang") { initPushBasedShuffleConfs(conf) conf.set("spark.shuffle.push.mergerLocations.minThreshold", "5") DAGSchedulerSuite.clearMergerLocs DAGSchedulerSuite.addMergerLocs(Seq("host1", "host2", "host3", "host4", "host5")) val latch = new CountDownLatch(1) scheduler = new MyDAGScheduler( sc, taskScheduler, sc.listenerBus, mapOutputTracker, blockManagerMaster, sc.env) { override private[spark] def scheduleShuffleMergeFinalize( stage: ShuffleMapStage): Unit = { // By this, we can mimic a stage with all tasks finished // but finalization is incomplete. latch.countDown() // super.scheduleShuffleMergeFinalize(stage) } } dagEventProcessLoopTester = new DAGSchedulerEventProcessLoopTester(scheduler) val parts = 5 val shuffleMapRdd = new MyRDD(sc, parts, Nil) val shuffleDep = new ShuffleDependency(shuffleMapRdd, new HashPartitioner(parts)) val reduceRdd1 = new MyRDD(sc, parts, List(shuffleDep), tracker = mapOutputTracker) submit(reduceRdd1, (0 until parts).toArray) completeShuffleMapStageSuccessfully(0, 0, parts) completeNextResultStageWithSuccess(1, 0) latch.await() // set _shuffleMergedFinalized to true can avoid the hang. val reduceRdd2 = new MyRDD(sc, parts, List(shuffleDep)) submit(reduceRdd2, (0 until parts).toArray) completeNextResultStageWithSuccess(3, 0) } {code} Let me know your thoughts. Am I missing something here? cc [~mshen] [~mridulm80] > Stage has all tasks finished but with ongoing finalization can cause job hang > - > > Key: SPARK-36558 > URL: https://issues.apache.org/jira/browse/SPARK-36558 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.2.0, 3.3.0 >Reporter: wuyi >Priority: Blocker > > > For a stage that all tasks are finished but with ongoing finalization can > lead to job hang. The problem is that such stage is considered as a "missing" > stage (see > [https://github.com/apache/spark/blob/a47ceaf5492040063e31e17570678dc06846c36c/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L719-L721).] > And it breaks the original assumption that a "missing" stage must have tasks > to run. > Normally, if stage A is the parent of (result) stage B and all tasks have > finished in stage A, stage A will be skipped directly when submitting stage > B. However, with this bug, stage A will be submitted. 
And submitting a stage > with no tasks to run would not be able to add its child stage into the > waiting stage list, which leads to the job hang in the end. > > The example to reproduce: > {code:java} > test("Job hang") { > initPushBasedShuffleConfs(conf) > conf.set(config.SHUFFLE_MERGER_LOCATIONS_MIN_STATIC_THRESHOLD, 5) > DAGSchedulerSuite.clearMergerLocs > DAGSchedulerSuite.addMergerLocs(Seq("host1", "host2", "host3", "host4", > "host5")) > val latch = new CountDownLatch(1) > val myDAGScheduler = new MyDAGScheduler( > sc, > sc.dagScheduler.taskScheduler, > sc.listenerBus, > sc.env.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster], > sc.env.blockManager.master, > sc.env) { > override def scheduleShuffleMergeFinalize(stage: ShuffleMapStage): Unit = > { > // By this, we can mimic a stage with all tasks finished > // but finalization is incomplete. > latch.countDown() > } > } > sc.dagScheduler = myDAGScheduler > sc.taskScheduler.setDAGScheduler(myDAGScheduler) > val parts =
[jira] [Assigned] (SPARK-36567) Support foldable special datetime values in CAST
[ https://issues.apache.org/jira/browse/SPARK-36567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36567: Assignee: Apache Spark (was: Max Gekk) > Support foldable special datetime values in CAST > > > Key: SPARK-36567 > URL: https://issues.apache.org/jira/browse/SPARK-36567 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > The PR https://github.com/apache/spark/pull/32714 disallowed special datetime > values in the CAST expression, and allowed them only in typed literals. So, > the following code doesn't work anymore: > {code:sql} > spark-sql> select date('today'); > NULL > spark-sql> select cast('today' as date); > NULL > {code} > Though, users can replace them by: > {code:sql} > spark-sql> select date'today'; > 2021-08-23 > {code} > but the changes can break users apps. The purpose of this ticket is to > support special datetime strings when the strings in casts are foldable. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36567) Support foldable special datetime values in CAST
[ https://issues.apache.org/jira/browse/SPARK-36567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36567: Assignee: Max Gekk (was: Apache Spark) > Support foldable special datetime values in CAST > > > Key: SPARK-36567 > URL: https://issues.apache.org/jira/browse/SPARK-36567 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > The PR https://github.com/apache/spark/pull/32714 disallowed special datetime > values in the CAST expression, and allowed them only in typed literals. So, > the following code doesn't work anymore: > {code:sql} > spark-sql> select date('today'); > NULL > spark-sql> select cast('today' as date); > NULL > {code} > Though, users can replace them by: > {code:sql} > spark-sql> select date'today'; > 2021-08-23 > {code} > but the changes can break users apps. The purpose of this ticket is to > support special datetime strings when the strings in casts are foldable. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36567) Support foldable special datetime values in CAST
[ https://issues.apache.org/jira/browse/SPARK-36567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403365#comment-17403365 ] Apache Spark commented on SPARK-36567: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/33816 > Support foldable special datetime values in CAST > > > Key: SPARK-36567 > URL: https://issues.apache.org/jira/browse/SPARK-36567 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > The PR https://github.com/apache/spark/pull/32714 disallowed special datetime > values in the CAST expression, and allowed them only in typed literals. So, > the following code doesn't work anymore: > {code:sql} > spark-sql> select date('today'); > NULL > spark-sql> select cast('today' as date); > NULL > {code} > Though, users can replace them by: > {code:sql} > spark-sql> select date'today'; > 2021-08-23 > {code} > but the changes can break users apps. The purpose of this ticket is to > support special datetime strings when the strings in casts are foldable. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36567) Support foldable special datetime values in CAST
[ https://issues.apache.org/jira/browse/SPARK-36567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403366#comment-17403366 ] Apache Spark commented on SPARK-36567: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/33816 > Support foldable special datetime values in CAST > > > Key: SPARK-36567 > URL: https://issues.apache.org/jira/browse/SPARK-36567 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > The PR https://github.com/apache/spark/pull/32714 disallowed special datetime > values in the CAST expression, and allowed them only in typed literals. So, > the following code doesn't work anymore: > {code:sql} > spark-sql> select date('today'); > NULL > spark-sql> select cast('today' as date); > NULL > {code} > Though, users can replace them by: > {code:sql} > spark-sql> select date'today'; > 2021-08-23 > {code} > but the changes can break users apps. The purpose of this ticket is to > support special datetime strings when the strings in casts are foldable. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36567) Support foldable special datetime values in CAST
Max Gekk created SPARK-36567: Summary: Support foldable special datetime values in CAST Key: SPARK-36567 URL: https://issues.apache.org/jira/browse/SPARK-36567 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Max Gekk Assignee: Max Gekk The PR https://github.com/apache/spark/pull/32714 disallowed special datetime values in the CAST expression, and allowed them only in typed literals. So, the following code doesn't work anymore: {code:sql} spark-sql> select date('today'); NULL spark-sql> select cast('today' as date); NULL {code} Though, users can replace them by: {code:sql} spark-sql> select date'today'; 2021-08-23 {code} but the changes can break users apps. The purpose of this ticket is to support special datetime strings when the strings in casts are foldable. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
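As a concrete illustration of the behavior described in SPARK-36567, the snippet below contrasts the typed literal (still resolved) with the cast of a foldable string (currently returning NULL). This is a minimal sketch assuming a 3.2.0-era session; the actual dates printed depend on the session time zone and the day it is run.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("special-datetime-cast").getOrCreate()

# Typed literal: the special string 'today' is resolved at analysis time.
spark.sql("SELECT DATE'today' AS d").show()

# Cast of a foldable string literal: after the change referenced above this
# yields NULL; the ticket proposes resolving such foldable inputs again.
spark.sql("SELECT CAST('today' AS DATE) AS d").show()
{code}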
[jira] [Commented] (SPARK-32333) Drop references to Master
[ https://issues.apache.org/jira/browse/SPARK-32333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403348#comment-17403348 ] Neil Shah-Quinn commented on SPARK-32333: - I like "leader" too. > Drop references to Master > - > > Key: SPARK-32333 > URL: https://issues.apache.org/jira/browse/SPARK-32333 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > We have a lot of references to "master" in the code base. It will be > beneficial to remove references to problematic language that can alienate > potential community members. > SPARK-32004 removed references to slave > > Here is a IETF draft to fix up some of the most egregious examples > (master/slave, whitelist/backlist) with proposed alternatives. > https://tools.ietf.org/id/draft-knodel-terminology-00.html#rfc.section.1.1.1 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29223) Kafka source: offset by timestamp - allow specifying timestamp for "all partitions"
[ https://issues.apache.org/jira/browse/SPARK-29223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403341#comment-17403341 ] Andrew Grigorev commented on SPARK-29223: - Just an idea - couldn't this be implemented in "timestamp field filter pushdown"-like way? > Kafka source: offset by timestamp - allow specifying timestamp for "all > partitions" > --- > > Key: SPARK-29223 > URL: https://issues.apache.org/jira/browse/SPARK-29223 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Minor > Fix For: 3.2.0 > > > This issue is a follow-up of SPARK-26848. > In SPARK-26848, we decided to open possibility to let end users set > individual timestamp per partition. But in many cases, specifying timestamp > represents the intention that we would want to go back to specific timestamp > and reprocess records, which should be applied to all topics and partitions. > According to the format of > `startingOffsetsByTimestamp`/`endingOffsetsByTimestamp`, while it's not > intuitive to provide an option to set a global timestamp across topic, it's > still intuitive to provide an option to set a global timestamp across > partitions in a topic. > This issue tracks the efforts to deal with this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
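For context, the sketch below shows how per-partition timestamps are passed today through `startingOffsetsByTimestamp`; the bootstrap servers, topic, and epoch-millisecond values are placeholders, and the single "all partitions" timestamp discussed in this ticket is not assumed to exist.

{code:python}
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-offsets-by-timestamp").getOrCreate()

# Current API: an epoch-millisecond timestamp must be listed for every
# partition of every subscribed topic.
starting = {"events": {"0": 1629700000000, "1": 1629700000000}}

stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder
    .option("subscribe", "events")                         # placeholder topic
    .option("startingOffsetsByTimestamp", json.dumps(starting))
    .load()
)
{code}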
[jira] [Created] (SPARK-36566) Add Spark appname as a label to the executor pods
Holden Karau created SPARK-36566: Summary: Add Spark appname as a label to the executor pods Key: SPARK-36566 URL: https://issues.apache.org/jira/browse/SPARK-36566 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.3.0 Reporter: Holden Karau Adding the appName as a label to the executor pods could simplify debugging. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
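Until an automatic label exists, the application name can already be propagated by hand through the existing `spark.kubernetes.executor.label.*` configuration. A minimal sketch follows; the master URL, container image, and label key are placeholders rather than anything this ticket prescribes.

{code:python}
from pyspark.sql import SparkSession

# Manually attach the app name as an executor pod label via existing config.
spark = (
    SparkSession.builder
    .appName("my-pipeline")
    .master("k8s://https://kubernetes.default.svc")                     # placeholder
    .config("spark.kubernetes.container.image", "example/spark:3.3.0")  # placeholder
    .config("spark.kubernetes.executor.label.spark-app-name", "my-pipeline")
    .getOrCreate()
)
{code}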
[jira] [Commented] (SPARK-36488) "Invalid usage of '*' in expression" error due to the feature of 'quotedRegexColumnNames' in some scenarios.
[ https://issues.apache.org/jira/browse/SPARK-36488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403329#comment-17403329 ] Pablo Langa Blanco commented on SPARK-36488: Yes, I agree that it's not very intuitive and that it has room for improvement. I think it's worth taking a closer look at supporting more expressions and getting closer to the Hive feature. Thanks! > "Invalid usage of '*' in expression" error due to the feature of > 'quotedRegexColumnNames' in some scenarios. > > > Key: SPARK-36488 > URL: https://issues.apache.org/jira/browse/SPARK-36488 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.4.8, 3.1.2 >Reporter: merrily01 >Priority: Major > > In some cases, the error happens when the following property is set. > {code:java} > spark.sql("set spark.sql.parser.quotedRegexColumnNames=true") > {code} > *case 1:* > {code:java} > spark-sql> create table tb_test as select 1 as col_a, 2 as col_b; > spark-sql> select `tb_test`.`col_a` from tb_test; > 1 > spark-sql> set spark.sql.parser.quotedRegexColumnNames=true; > spark-sql> select `tb_test`.`col_a` from tb_test; > Error in query: Invalid usage of '*' in expression 'unresolvedextractvalue' > {code} > > *case 2:* > {code:java} > > select `col_a`/`col_b` as `col_c` from (select 3 as `col_a` , > 3.14 as `col_b`); > 0.955414 > spark-sql> set spark.sql.parser.quotedRegexColumnNames=true; > spark-sql> select `col_a`/`col_b` as `col_c` from (select 3 as `col_a` , > 3.14 as `col_b`); > Error in query: Invalid usage of '*' in expression 'divide' > {code} > > This problem exists in 3.X, 2.4.X and master versions. > > Related issue : > https://issues.apache.org/jira/browse/SPARK-12139 > (As can be seen in the latest comments, some people have encountered the same > problem) > > Similar problems: > https://issues.apache.org/jira/browse/SPARK-28897 > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
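To make the trade-off concrete, the sketch below shows what the flag is intended to enable (regex column selection, following the Hive convention the feature borrows from) next to the qualified-name query that currently fails; the view name is made up for illustration.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quoted-regex-columns").getOrCreate()
spark.range(1).selectExpr("1 AS col_a", "2 AS col_b").createOrReplaceTempView("tb_test")

# With the flag on, back-quoted identifiers are treated as column regexes.
spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true")
spark.sql("SELECT `col_.*` FROM tb_test").show()

# The same interpretation is what breaks qualified references such as
# `tb_test`.`col_a`; turning the flag back off is the immediate workaround.
spark.sql("SET spark.sql.parser.quotedRegexColumnNames=false")
spark.sql("SELECT `tb_test`.`col_a` FROM tb_test").show()
{code}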
[jira] [Updated] (SPARK-36565) "Unscaled value too large for precision" while reading simple parquet file readable with parquet-tools and pandas
[ https://issues.apache.org/jira/browse/SPARK-36565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lorenzo Martini updated SPARK-36565: Attachment: broken_parquet_file.parquet > "Unscaled value too large for precision" while reading simple parquet file > readable with parquet-tools and pandas > - > > Key: SPARK-36565 > URL: https://issues.apache.org/jira/browse/SPARK-36565 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 3.1.2 >Reporter: Lorenzo Martini >Priority: Minor > Attachments: broken_parquet_file.parquet > > > I have a simple parquet file (attached to the ticket) with 2 columns > (array, decimal) that can be read and viewed correctly using pandas > or parquet-tools. Reading the parquet file in spark (and pyspark) seems to > work, but calling `.show()` throws the exception (1) with > {code:java} > Caused by: java.lang.ArithmeticException: Unscaled value too large for > precision at. > {code} > > > Another interesting detail is that reading the parquet file and doing a > select on individual columns allows for `show()` to work correctly without > throwing. > {code:java} > >>> repro = spark.read.parquet(".../broken_file.parquet") > >>> repro.printSchema() > root > |-- column_a: array (nullable = true) > ||-- element: string (containsNull = true) > |-- column_b: decimal(4,0) (nullable = true) > >>> repro.select("column_a").show() > ++ > |column_a| > ++ > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > ++ > >>> repro.select("column_b").show() > ++ > |column_b| > ++ > | 11590| > | 11590| > | 11590| > | 11590| > | 11590| > | 11590| > | 11590| > | 11590| > | 11590| > | 11590| > ++ > >>> repro.show() // THIS ONE THROWS EXCEPTION (1) > {code} > Using `parquet-tools` shows the dataset correctly > {code:java} > >>> parquet-tools show broken_file.parquet > +++ > | column_a | column_b | > |+| > || 11590 | > || 11590 | > || 11590 | > || 11590 | > || 11590 | > || 11590 | > || 11590 | > || 11590 | > || 11590 | > || 11590 | > +++ > {code} > > > And the same with `pandas` > {code:java} > >>> import pandas as pd > >>> pd.read_parquet(".../broken_file.parquet") > column_a column_b > 0 None11590 > 1 None11590 > 2 None11590 > 3 None11590 > 4 None11590 > 5 None11590 > 6 None11590 > 7 None11590 > 8 None11590 > 9 None11590 > {code} > > I have also verified this affects all versions of spark between 2.4.0 and > 3.1.2 > Here the Exception (1) thrown (sorry about the poor formatting, didn't seem > to manage to make it work): > > {code:java} > >>> spark.version '3.1.2' > >>> df = spark.read.parquet(".../broken_parquet_file.parquet") > >>> df > DataFrame[column_a: array, column_b: decimal(4,0)] > >>> df.show() 21/08/23 18:39:36 > ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)/ 1] > org.apache.spark.sql.execution.QueryExecutionException: Encounter error while > reading parquet files. One possible cause: Parquet column cannot be converted > in the corresponding files. 
Details: at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345) > at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at > org.apache.spark.scheduler.Task.run(Task.scala:131) at >
[jira] [Created] (SPARK-36565) "Unscaled value too large for precision" while reading simple parquet file readable with parquet-tools and pandas
Lorenzo Martini created SPARK-36565: --- Summary: "Unscaled value too large for precision" while reading simple parquet file readable with parquet-tools and pandas Key: SPARK-36565 URL: https://issues.apache.org/jira/browse/SPARK-36565 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.1.2, 2.4.0 Reporter: Lorenzo Martini Attachments: broken_parquet_file.parquet I have a simple parquet file (attached to the ticket) with 2 columns (array, decimal) that can be read and viewed correctly using pandas or parquet-tools. Reading the parquet file in spark (and pyspark) seems to work, but calling `.show()` throws the exception (1) with {code:java} Caused by: java.lang.ArithmeticException: Unscaled value too large for precision at. {code} Another interesting detail is that reading the parquet file and doing a select on individual columns allows for `show()` to work correctly without throwing. {code:java} >>> repro = spark.read.parquet(".../broken_file.parquet") >>> repro.printSchema() root |-- column_a: array (nullable = true) ||-- element: string (containsNull = true) |-- column_b: decimal(4,0) (nullable = true) >>> repro.select("column_a").show() ++ |column_a| ++ |null| |null| |null| |null| |null| |null| |null| |null| |null| |null| ++ >>> repro.select("column_b").show() ++ |column_b| ++ | 11590| | 11590| | 11590| | 11590| | 11590| | 11590| | 11590| | 11590| | 11590| | 11590| ++ >>> repro.show() // THIS ONE THROWS EXCEPTION (1) {code} Using `parquet-tools` shows the dataset correctly {code:java} >>> parquet-tools show broken_file.parquet +++ | column_a | column_b | |+| || 11590 | || 11590 | || 11590 | || 11590 | || 11590 | || 11590 | || 11590 | || 11590 | || 11590 | || 11590 | +++ {code} And the same with `pandas` {code:java} >>> import pandas as pd >>> pd.read_parquet(".../broken_file.parquet") column_a column_b 0 None11590 1 None11590 2 None11590 3 None11590 4 None11590 5 None11590 6 None11590 7 None11590 8 None11590 9 None11590 {code} I have also verified this affects all versions of spark between 2.4.0 and 3.1.2 Here the Exception (1) thrown (sorry about the poor formatting, didn't seem to manage to make it work): {code:java} >>> spark.version '3.1.2' >>> df = spark.read.parquet(".../broken_parquet_file.parquet") >>> df DataFrame[column_a: array, column_b: decimal(4,0)] >>> df.show() 21/08/23 18:39:36 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)/ 1] org.apache.spark.sql.execution.QueryExecutionException: Encounter error while reading parquet files. One possible cause: Parquet column cannot be converted in the corresponding files. 
Details: at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:131) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) at java.base/java.lang.Thread.run(Thread.java:832) Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file
[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403305#comment-17403305 ] Eric Richardson commented on SPARK-25075: - Is it possible to have both the 2.12 and 2.13 versions pre-built for download? I think this could accelerate adoption to 2.13 and thus Scala 3. It is fine if the 2.12 is considered production and 2.13 is experimental. > Build and test Spark against Scala 2.13 > --- > > Key: SPARK-25075 > URL: https://issues.apache.org/jira/browse/SPARK-25075 > Project: Spark > Issue Type: Umbrella > Components: Build, MLlib, Project Infra, Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Guillaume Massé >Priority: Major > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.13 milestone. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34952) DS V2 Aggregate push down
[ https://issues.apache.org/jira/browse/SPARK-34952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403292#comment-17403292 ] Apache Spark commented on SPARK-34952: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/33815 > DS V2 Aggregate push down > - > > Key: SPARK-34952 > URL: https://issues.apache.org/jira/browse/SPARK-34952 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.2.0 > > > Push down aggregate to data source for better performance. This will be done > in two steps: > 1. add aggregate push down APIs and the implementation in JDBC > 2. add the implementation in Parquet. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34952) DS V2 Aggregate push down
[ https://issues.apache.org/jira/browse/SPARK-34952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403291#comment-17403291 ] Apache Spark commented on SPARK-34952: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/33815 > DS V2 Aggregate push down > - > > Key: SPARK-34952 > URL: https://issues.apache.org/jira/browse/SPARK-34952 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.2.0 > > > Push down aggregate to data source for better performance. This will be done > in two steps: > 1. add aggregate push down APIs and the implementation in JDBC > 2. add the implementation in Parquet. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
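For readers new to the feature, the sketch below shows the shape of query the JDBC step targets: a grouped MIN/MAX/SUM/COUNT-style aggregate over a JDBC table. The connection details are placeholders, and whether the aggregate is actually pushed to the database depends on the data source path and the Spark version in use.

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ds-v2-agg-pushdown").getOrCreate()

# Placeholder JDBC connection details.
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db.example.com/shop")
    .option("dbtable", "orders")
    .option("user", "reader")
    .option("password", "secret")
    .load()
)

# Grouped aggregates like this are the candidates for push-down; explain()
# shows whether the scan reports a pushed aggregate.
orders.groupBy("customer_id").agg(F.sum("amount").alias("total")).explain()
{code}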
[jira] [Assigned] (SPARK-36396) Implement DataFrame.cov
[ https://issues.apache.org/jira/browse/SPARK-36396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36396: Assignee: (was: Apache Spark) > Implement DataFrame.cov > --- > > Key: SPARK-36396 > URL: https://issues.apache.org/jira/browse/SPARK-36396 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36396) Implement DataFrame.cov
[ https://issues.apache.org/jira/browse/SPARK-36396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36396: Assignee: Apache Spark > Implement DataFrame.cov > --- > > Key: SPARK-36396 > URL: https://issues.apache.org/jira/browse/SPARK-36396 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35991) Add PlanStability suite for TPCH
[ https://issues.apache.org/jira/browse/SPARK-35991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403241#comment-17403241 ] Apache Spark commented on SPARK-35991: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/33813 > Add PlanStability suite for TPCH > > > Key: SPARK-35991 > URL: https://issues.apache.org/jira/browse/SPARK-35991 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35991) Add PlanStability suite for TPCH
[ https://issues.apache.org/jira/browse/SPARK-35991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403243#comment-17403243 ] Apache Spark commented on SPARK-35991: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/33813 > Add PlanStability suite for TPCH > > > Key: SPARK-35991 > URL: https://issues.apache.org/jira/browse/SPARK-35991 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36396) Implement DataFrame.cov
[ https://issues.apache.org/jira/browse/SPARK-36396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403240#comment-17403240 ] Apache Spark commented on SPARK-36396: -- User 'dgd-contributor' has created a pull request for this issue: https://github.com/apache/spark/pull/33814 > Implement DataFrame.cov > --- > > Key: SPARK-36396 > URL: https://issues.apache.org/jira/browse/SPARK-36396 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
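Since the ticket tracks parity with the pandas API, the reference semantics are those of `pandas.DataFrame.cov` (pairwise covariance of the numeric columns). The short example below uses plain pandas, as the pandas-on-Spark counterpart this PR would add may not exist in a given Spark version.

{code:python}
import pandas as pd

# Pairwise covariance of numeric columns: the behavior the pandas-on-Spark
# DataFrame.cov implementation aims to match.
pdf = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 1.5, 3.5, 5.0]})
print(pdf.cov())
{code}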
[jira] [Assigned] (SPARK-36564) LiveRDDDistribution.toApi throws NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-36564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36564: Assignee: Apache Spark > LiveRDDDistribution.toApi throws NullPointerException > - > > Key: SPARK-36564 > URL: https://issues.apache.org/jira/browse/SPARK-36564 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0 >Reporter: wuyi >Assignee: Apache Spark >Priority: Major > > {code:java} > 21/08/23 12:26:29 ERROR AsyncEventQueue: Listener AppStatusListener threw an > exception > java.lang.NullPointerException > at > com.google.common.base.Preconditions.checkNotNull(Preconditions.java:192) > at > com.google.common.collect.MapMakerInternalMap.putIfAbsent(MapMakerInternalMap.java:3507) > at > com.google.common.collect.Interners$WeakInterner.intern(Interners.java:85) > at > org.apache.spark.status.LiveEntityHelpers$.weakIntern(LiveEntity.scala:696) > at > org.apache.spark.status.LiveRDDDistribution.toApi(LiveEntity.scala:563) > at > org.apache.spark.status.LiveRDD.$anonfun$doUpdate$4(LiveEntity.scala:629) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at > scala.collection.mutable.HashMap$$anon$2.$anonfun$foreach$3(HashMap.scala:158) > at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) > at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) > at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:158) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at org.apache.spark.status.LiveRDD.doUpdate(LiveEntity.scala:629) > at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:51) > at > org.apache.spark.status.AppStatusListener.update(AppStatusListener.scala:1206) > at > org.apache.spark.status.AppStatusListener.maybeUpdate(AppStatusListener.scala:1212) > at > org.apache.spark.status.AppStatusListener.$anonfun$onExecutorMetricsUpdate$6(AppStatusListener.scala:956) > at > org.apache.spark.status.AppStatusListener.$anonfun$onExecutorMetricsUpdate$6$adapted(AppStatusListener.scala:956) > at > scala.collection.mutable.HashMap$$anon$2.$anonfun$foreach$3(HashMap.scala:158) > at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) > at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) > at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:158) > at > org.apache.spark.status.AppStatusListener.flush(AppStatusListener.scala:1015) > at > org.apache.spark.status.AppStatusListener.onExecutorMetricsUpdate(AppStatusListener.scala:956) > at > org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:59) > at > org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) > at > org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) > at > org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) > at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:119) > at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:103) > at > org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105) > at > 
org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105) > at > scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) > at > org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100) > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1585) > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36564) LiveRDDDistribution.toApi throws NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-36564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36564: Assignee: (was: Apache Spark) > LiveRDDDistribution.toApi throws NullPointerException > - > > Key: SPARK-36564 > URL: https://issues.apache.org/jira/browse/SPARK-36564 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0 >Reporter: wuyi >Priority: Major > > {code:java} > 21/08/23 12:26:29 ERROR AsyncEventQueue: Listener AppStatusListener threw an > exception > java.lang.NullPointerException > at > com.google.common.base.Preconditions.checkNotNull(Preconditions.java:192) > at > com.google.common.collect.MapMakerInternalMap.putIfAbsent(MapMakerInternalMap.java:3507) > at > com.google.common.collect.Interners$WeakInterner.intern(Interners.java:85) > at > org.apache.spark.status.LiveEntityHelpers$.weakIntern(LiveEntity.scala:696) > at > org.apache.spark.status.LiveRDDDistribution.toApi(LiveEntity.scala:563) > at > org.apache.spark.status.LiveRDD.$anonfun$doUpdate$4(LiveEntity.scala:629) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at > scala.collection.mutable.HashMap$$anon$2.$anonfun$foreach$3(HashMap.scala:158) > at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) > at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) > at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:158) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at org.apache.spark.status.LiveRDD.doUpdate(LiveEntity.scala:629) > at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:51) > at > org.apache.spark.status.AppStatusListener.update(AppStatusListener.scala:1206) > at > org.apache.spark.status.AppStatusListener.maybeUpdate(AppStatusListener.scala:1212) > at > org.apache.spark.status.AppStatusListener.$anonfun$onExecutorMetricsUpdate$6(AppStatusListener.scala:956) > at > org.apache.spark.status.AppStatusListener.$anonfun$onExecutorMetricsUpdate$6$adapted(AppStatusListener.scala:956) > at > scala.collection.mutable.HashMap$$anon$2.$anonfun$foreach$3(HashMap.scala:158) > at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) > at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) > at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:158) > at > org.apache.spark.status.AppStatusListener.flush(AppStatusListener.scala:1015) > at > org.apache.spark.status.AppStatusListener.onExecutorMetricsUpdate(AppStatusListener.scala:956) > at > org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:59) > at > org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) > at > org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) > at > org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) > at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:119) > at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:103) > at > org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105) > at > 
org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105) > at > scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) > at > org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100) > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1585) > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36564) LiveRDDDistribution.toApi throws NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-36564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403187#comment-17403187 ] Apache Spark commented on SPARK-36564: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/33812 > LiveRDDDistribution.toApi throws NullPointerException > - > > Key: SPARK-36564 > URL: https://issues.apache.org/jira/browse/SPARK-36564 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0 >Reporter: wuyi >Priority: Major > > {code:java} > 21/08/23 12:26:29 ERROR AsyncEventQueue: Listener AppStatusListener threw an > exception > java.lang.NullPointerException > at > com.google.common.base.Preconditions.checkNotNull(Preconditions.java:192) > at > com.google.common.collect.MapMakerInternalMap.putIfAbsent(MapMakerInternalMap.java:3507) > at > com.google.common.collect.Interners$WeakInterner.intern(Interners.java:85) > at > org.apache.spark.status.LiveEntityHelpers$.weakIntern(LiveEntity.scala:696) > at > org.apache.spark.status.LiveRDDDistribution.toApi(LiveEntity.scala:563) > at > org.apache.spark.status.LiveRDD.$anonfun$doUpdate$4(LiveEntity.scala:629) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at > scala.collection.mutable.HashMap$$anon$2.$anonfun$foreach$3(HashMap.scala:158) > at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) > at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) > at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:158) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at org.apache.spark.status.LiveRDD.doUpdate(LiveEntity.scala:629) > at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:51) > at > org.apache.spark.status.AppStatusListener.update(AppStatusListener.scala:1206) > at > org.apache.spark.status.AppStatusListener.maybeUpdate(AppStatusListener.scala:1212) > at > org.apache.spark.status.AppStatusListener.$anonfun$onExecutorMetricsUpdate$6(AppStatusListener.scala:956) > at > org.apache.spark.status.AppStatusListener.$anonfun$onExecutorMetricsUpdate$6$adapted(AppStatusListener.scala:956) > at > scala.collection.mutable.HashMap$$anon$2.$anonfun$foreach$3(HashMap.scala:158) > at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) > at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) > at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:158) > at > org.apache.spark.status.AppStatusListener.flush(AppStatusListener.scala:1015) > at > org.apache.spark.status.AppStatusListener.onExecutorMetricsUpdate(AppStatusListener.scala:956) > at > org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:59) > at > org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) > at > org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) > at > org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) > at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:119) > at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:103) > at > 
org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105) > at > org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105) > at > scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) > at > org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100) > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1585) > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36564) LiveRDDDistribution.toApi throws NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-36564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403188#comment-17403188 ] Apache Spark commented on SPARK-36564: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/33812 > LiveRDDDistribution.toApi throws NullPointerException > - > > Key: SPARK-36564 > URL: https://issues.apache.org/jira/browse/SPARK-36564 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0 >Reporter: wuyi >Priority: Major > > {code:java} > 21/08/23 12:26:29 ERROR AsyncEventQueue: Listener AppStatusListener threw an > exception > java.lang.NullPointerException > at > com.google.common.base.Preconditions.checkNotNull(Preconditions.java:192) > at > com.google.common.collect.MapMakerInternalMap.putIfAbsent(MapMakerInternalMap.java:3507) > at > com.google.common.collect.Interners$WeakInterner.intern(Interners.java:85) > at > org.apache.spark.status.LiveEntityHelpers$.weakIntern(LiveEntity.scala:696) > at > org.apache.spark.status.LiveRDDDistribution.toApi(LiveEntity.scala:563) > at > org.apache.spark.status.LiveRDD.$anonfun$doUpdate$4(LiveEntity.scala:629) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at > scala.collection.mutable.HashMap$$anon$2.$anonfun$foreach$3(HashMap.scala:158) > at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) > at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) > at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:158) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at org.apache.spark.status.LiveRDD.doUpdate(LiveEntity.scala:629) > at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:51) > at > org.apache.spark.status.AppStatusListener.update(AppStatusListener.scala:1206) > at > org.apache.spark.status.AppStatusListener.maybeUpdate(AppStatusListener.scala:1212) > at > org.apache.spark.status.AppStatusListener.$anonfun$onExecutorMetricsUpdate$6(AppStatusListener.scala:956) > at > org.apache.spark.status.AppStatusListener.$anonfun$onExecutorMetricsUpdate$6$adapted(AppStatusListener.scala:956) > at > scala.collection.mutable.HashMap$$anon$2.$anonfun$foreach$3(HashMap.scala:158) > at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) > at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) > at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:158) > at > org.apache.spark.status.AppStatusListener.flush(AppStatusListener.scala:1015) > at > org.apache.spark.status.AppStatusListener.onExecutorMetricsUpdate(AppStatusListener.scala:956) > at > org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:59) > at > org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) > at > org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) > at > org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) > at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:119) > at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:103) > at > 
org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105) > at > org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105) > at > scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) > at > org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100) > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1585) > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36564) LiveRDDDistribution.toApi throws NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-36564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-36564: - Summary: LiveRDDDistribution.toApi throws NullPointerException (was: LiveRDD.doUpdate throws NullPointerException) > LiveRDDDistribution.toApi throws NullPointerException > - > > Key: SPARK-36564 > URL: https://issues.apache.org/jira/browse/SPARK-36564 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0 >Reporter: wuyi >Priority: Major > > {code:java} > 21/08/23 12:26:29 ERROR AsyncEventQueue: Listener AppStatusListener threw an > exception > java.lang.NullPointerException > at > com.google.common.base.Preconditions.checkNotNull(Preconditions.java:192) > at > com.google.common.collect.MapMakerInternalMap.putIfAbsent(MapMakerInternalMap.java:3507) > at > com.google.common.collect.Interners$WeakInterner.intern(Interners.java:85) > at > org.apache.spark.status.LiveEntityHelpers$.weakIntern(LiveEntity.scala:696) > at > org.apache.spark.status.LiveRDDDistribution.toApi(LiveEntity.scala:563) > at > org.apache.spark.status.LiveRDD.$anonfun$doUpdate$4(LiveEntity.scala:629) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at > scala.collection.mutable.HashMap$$anon$2.$anonfun$foreach$3(HashMap.scala:158) > at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) > at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) > at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:158) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at org.apache.spark.status.LiveRDD.doUpdate(LiveEntity.scala:629) > at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:51) > at > org.apache.spark.status.AppStatusListener.update(AppStatusListener.scala:1206) > at > org.apache.spark.status.AppStatusListener.maybeUpdate(AppStatusListener.scala:1212) > at > org.apache.spark.status.AppStatusListener.$anonfun$onExecutorMetricsUpdate$6(AppStatusListener.scala:956) > at > org.apache.spark.status.AppStatusListener.$anonfun$onExecutorMetricsUpdate$6$adapted(AppStatusListener.scala:956) > at > scala.collection.mutable.HashMap$$anon$2.$anonfun$foreach$3(HashMap.scala:158) > at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) > at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) > at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:158) > at > org.apache.spark.status.AppStatusListener.flush(AppStatusListener.scala:1015) > at > org.apache.spark.status.AppStatusListener.onExecutorMetricsUpdate(AppStatusListener.scala:956) > at > org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:59) > at > org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) > at > org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) > at > org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) > at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:119) > at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:103) > at > org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105) 
> at > org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105) > at > scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) > at > org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100) > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1585) > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36564) LiveRDD.doUpdate throws NullPointerException
wuyi created SPARK-36564: Summary: LiveRDD.doUpdate throws NullPointerException Key: SPARK-36564 URL: https://issues.apache.org/jira/browse/SPARK-36564 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.1.2, 3.0.3, 3.2.0, 3.3.0 Reporter: wuyi {code:java} 21/08/23 12:26:29 ERROR AsyncEventQueue: Listener AppStatusListener threw an exception java.lang.NullPointerException at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:192) at com.google.common.collect.MapMakerInternalMap.putIfAbsent(MapMakerInternalMap.java:3507) at com.google.common.collect.Interners$WeakInterner.intern(Interners.java:85) at org.apache.spark.status.LiveEntityHelpers$.weakIntern(LiveEntity.scala:696) at org.apache.spark.status.LiveRDDDistribution.toApi(LiveEntity.scala:563) at org.apache.spark.status.LiveRDD.$anonfun$doUpdate$4(LiveEntity.scala:629) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at scala.collection.mutable.HashMap$$anon$2.$anonfun$foreach$3(HashMap.scala:158) at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:158) at scala.collection.TraversableLike.map(TraversableLike.scala:238) at scala.collection.TraversableLike.map$(TraversableLike.scala:231) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.status.LiveRDD.doUpdate(LiveEntity.scala:629) at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:51) at org.apache.spark.status.AppStatusListener.update(AppStatusListener.scala:1206) at org.apache.spark.status.AppStatusListener.maybeUpdate(AppStatusListener.scala:1212) at org.apache.spark.status.AppStatusListener.$anonfun$onExecutorMetricsUpdate$6(AppStatusListener.scala:956) at org.apache.spark.status.AppStatusListener.$anonfun$onExecutorMetricsUpdate$6$adapted(AppStatusListener.scala:956) at scala.collection.mutable.HashMap$$anon$2.$anonfun$foreach$3(HashMap.scala:158) at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:158) at org.apache.spark.status.AppStatusListener.flush(AppStatusListener.scala:1015) at org.apache.spark.status.AppStatusListener.onExecutorMetricsUpdate(AppStatusListener.scala:956) at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:59) at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:119) at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:103) at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105) at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105) at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100) at 
org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96) at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1585) at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32838) Cannot insert overwrite different partition with same table
[ https://issues.apache.org/jira/browse/SPARK-32838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-32838: -- Parent: SPARK-36562 Issue Type: Sub-task (was: Bug) > Cannot insert overwrite different partition with same table > -- > > Key: SPARK-32838 > URL: https://issues.apache.org/jira/browse/SPARK-32838 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 > Environment: hadoop 2.7 + spark 3.0.0 >Reporter: CHC >Priority: Major > > When: > {code:java} > CREATE TABLE tmp.spark3_snap ( > id string > ) > PARTITIONED BY (dt string) > STORED AS ORC > ; > insert overwrite table tmp.spark3_snap partition(dt='2020-09-09') > select 10; > insert overwrite table tmp.spark3_snap partition(dt='2020-09-10') > select 1; > insert overwrite table tmp.spark3_snap partition(dt='2020-09-10') > select id from tmp.spark3_snap where dt='2020-09-09'; > {code} > it fails with the error: "Cannot overwrite a path that is also being read > from" > related: https://issues.apache.org/jira/browse/SPARK-24194 > This works on Spark 2.4.3 and does not work on Spark 3.0.0 > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36563) dynamicPartitionOverwrite can direct rename to targetPath instead of partition path one by one
[ https://issues.apache.org/jira/browse/SPARK-36563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403169#comment-17403169 ] Apache Spark commented on SPARK-36563: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/33811 > dynamicPartitionOverwrite can direct rename to targetPath instead of > partition path one by one > -- > > Key: SPARK-36563 > URL: https://issues.apache.org/jira/browse/SPARK-36563 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: angerszhu >Priority: Major > Fix For: 3.3.0 > > > dynamicPartitionOverwrite can direct rename to targetPath instead of > partition path one by one -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36563) dynamicPartitionOverwrite can direct rename to targetPath instead of partition path one by one
[ https://issues.apache.org/jira/browse/SPARK-36563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36563: Assignee: (was: Apache Spark) > dynamicPartitionOverwrite can direct rename to targetPath instead of > partition path one by one > -- > > Key: SPARK-36563 > URL: https://issues.apache.org/jira/browse/SPARK-36563 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: angerszhu >Priority: Major > Fix For: 3.3.0 > > > dynamicPartitionOverwrite can direct rename to targetPath instead of > partition path one by one -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36563) dynamicPartitionOverwrite can direct rename to targetPath instead of partition path one by one
[ https://issues.apache.org/jira/browse/SPARK-36563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36563: Assignee: Apache Spark > dynamicPartitionOverwrite can direct rename to targetPath instead of > partition path one by one > -- > > Key: SPARK-36563 > URL: https://issues.apache.org/jira/browse/SPARK-36563 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: angerszhu >Assignee: Apache Spark >Priority: Major > Fix For: 3.3.0 > > > dynamicPartitionOverwrite can direct rename to targetPath instead of > partition path one by one -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36563) dynamicPartitionOverwrite can direct rename to targetPath instead of partition path one by one
angerszhu created SPARK-36563: - Summary: dynamicPartitionOverwrite can direct rename to targetPath instead of partition path one by one Key: SPARK-36563 URL: https://issues.apache.org/jira/browse/SPARK-36563 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.2, 3.2.0 Reporter: angerszhu Fix For: 3.3.0 dynamicPartitionOverwrite can direct rename to targetPath instead of partition path one by one -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
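For context on the commit step being optimized, the sketch below shows a dynamic partition overwrite from the user side (the output path and column names are placeholders); the ticket is about how staged files are renamed into the target directory at commit time, not about this API itself.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dynamic-partition-overwrite").getOrCreate()
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

df = spark.createDataFrame([(1, "2021-08-23"), (2, "2021-08-24")], ["id", "dt"])

# Only the partitions present in df are replaced; other partitions under the
# target path are left untouched, each staged partition being moved at commit.
df.write.mode("overwrite").partitionBy("dt").parquet("/tmp/example_output")
{code}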
[jira] [Created] (SPARK-36562) Improve InsertIntoHadoopFsRelation file commit logic
angerszhu created SPARK-36562: - Summary: Improve InsertIntoHadoopFsRelation file commit logic Key: SPARK-36562 URL: https://issues.apache.org/jira/browse/SPARK-36562 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.2, 3.2.0 Reporter: angerszhu Fix For: 3.3.0 Improve InsertIntoHadoopFsRelation file commit logic -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403154#comment-17403154 ] Yang Jie commented on SPARK-25075: -- [~dongjoon] Thank you for your reply > Build and test Spark against Scala 2.13 > --- > > Key: SPARK-25075 > URL: https://issues.apache.org/jira/browse/SPARK-25075 > Project: Spark > Issue Type: Umbrella > Components: Build, MLlib, Project Infra, Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Guillaume Massé >Priority: Major > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.13 milestone. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35876) array_zip unexpected column names
[ https://issues.apache.org/jira/browse/SPARK-35876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403085#comment-17403085 ] Apache Spark commented on SPARK-35876: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/33810 > array_zip unexpected column names > - > > Key: SPARK-35876 > URL: https://issues.apache.org/jira/browse/SPARK-35876 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Derk Crezee >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.2.0 > > > {{When I'm using the array_zip function in combination with renamed columns, > I get an unexpected schema written to disk.}} > {code:java} > // code placeholder > from pyspark.sql import * > from pyspark.sql.functions import * > spark = SparkSession.builder.getOrCreate() > data = [ > Row(a1=["a", "a"], b1=["b", "b"]), > ] > df = ( > spark.sparkContext.parallelize(data).toDF() > .withColumnRenamed("a1", "a2") > .withColumnRenamed("b1", "b2") > .withColumn("zipped", arrays_zip(col("a2"), col("b2"))) > ) > df.printSchema() > // root > // |-- a2: array (nullable = true) > // ||-- element: string (containsNull = true) > // |-- b2: array (nullable = true) > // ||-- element: string (containsNull = true) > // |-- zipped: array (nullable = true) > // ||-- element: struct (containsNull = false) > // |||-- a2: string (nullable = true) > // |||-- b2: string (nullable = true) > df.write.save("test.parquet") > spark.read.load("test.parquet").printSchema() > // root > // |-- a2: array (nullable = true) > // ||-- element: string (containsNull = true) > // |-- b2: array (nullable = true) > // ||-- element: string (containsNull = true) > // |-- zipped: array (nullable = true) > // ||-- element: struct (containsNull = true) > // |||-- a1: string (nullable = true) > // |||-- b1: string (nullable = true){code} > I would expect the schema of the DataFrame written to disk to be the same as > that printed out. It seems that instead of using the renamed version of the > column names, it uses the old column names. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35876) array_zip unexpected column names
[ https://issues.apache.org/jira/browse/SPARK-35876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403084#comment-17403084 ] Apache Spark commented on SPARK-35876: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/33810 > array_zip unexpected column names > - > > Key: SPARK-35876 > URL: https://issues.apache.org/jira/browse/SPARK-35876 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Derk Crezee >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.2.0 > > > {{When I'm using the array_zip function in combination with renamed columns, > I get an unexpected schema written to disk.}} > {code:java} > // code placeholder > from pyspark.sql import * > from pyspark.sql.functions import * > spark = SparkSession.builder.getOrCreate() > data = [ > Row(a1=["a", "a"], b1=["b", "b"]), > ] > df = ( > spark.sparkContext.parallelize(data).toDF() > .withColumnRenamed("a1", "a2") > .withColumnRenamed("b1", "b2") > .withColumn("zipped", arrays_zip(col("a2"), col("b2"))) > ) > df.printSchema() > // root > // |-- a2: array (nullable = true) > // ||-- element: string (containsNull = true) > // |-- b2: array (nullable = true) > // ||-- element: string (containsNull = true) > // |-- zipped: array (nullable = true) > // ||-- element: struct (containsNull = false) > // |||-- a2: string (nullable = true) > // |||-- b2: string (nullable = true) > df.write.save("test.parquet") > spark.read.load("test.parquet").printSchema() > // root > // |-- a2: array (nullable = true) > // ||-- element: string (containsNull = true) > // |-- b2: array (nullable = true) > // ||-- element: string (containsNull = true) > // |-- zipped: array (nullable = true) > // ||-- element: struct (containsNull = true) > // |||-- a1: string (nullable = true) > // |||-- b1: string (nullable = true){code} > I would expect the schema of the DataFrame written to disk to be the same as > that printed out. It seems that instead of using the renamed version of the > column names, it uses the old column names. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-36488) "Invalid usage of '*' in expression" error due to the feature of 'quotedRegexColumnNames' in some scenarios.
[ https://issues.apache.org/jira/browse/SPARK-36488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403050#comment-17403050 ] merrily01 edited comment on SPARK-36488 at 8/23/21, 9:02 AM: - Hi [~planga82] , Thanks for your attention and sorry for the late reply. It's OK to improve the error message, but I don't quite agree that this is not a bug. Firstly, for a very common SQL query, the results differ depending on whether the parameter is turned on or off, which is not easy to accept. At least it proves that this feature is not perfect. (I have always felt that the regular-expression handling of this feature should not apply to the 'database' part.) Secondly, this feature was originally meant to be compatible with, and was modeled on, the Hive feature. But the cases above execute normally when the feature is turned on in Hive, while Spark has a problem. was (Author: merrily01): Hi [~planga82] , Thanks for your attention and sorry for the late reply. It's OK to improve the error message, but I don't quite agree that this is not a bug. Firstly, for a very common SQL query, the results differ depending on whether the parameter is turned on or off, which is not easy to accept. At least it proves that this function is not perfect. (I have always felt that the regular-expression handling of this function should not apply to the 'database' part.) Secondly, this feature was originally meant to be compatible with, and was modeled on, the Hive feature. But the cases above execute normally when the feature is turned on in Hive, while Spark has a problem. > "Invalid usage of '*' in expression" error due to the feature of > 'quotedRegexColumnNames' in some scenarios. > > > Key: SPARK-36488 > URL: https://issues.apache.org/jira/browse/SPARK-36488 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.4.8, 3.1.2 >Reporter: merrily01 >Priority: Major > > In some cases, the error happens when the following property is set. > {code:java} > spark.sql("set spark.sql.parser.quotedRegexColumnNames=true") > {code} > *case 1:* > {code:java} > spark-sql> create table tb_test as select 1 as col_a, 2 as col_b; > spark-sql> select `tb_test`.`col_a` from tb_test; > 1 > spark-sql> set spark.sql.parser.quotedRegexColumnNames=true; > spark-sql> select `tb_test`.`col_a` from tb_test; > Error in query: Invalid usage of '*' in expression 'unresolvedextractvalue' > {code} > > *case 2:* > {code:java} > > select `col_a`/`col_b` as `col_c` from (select 3 as `col_a` , > 3.14 as `col_b`); > 0.955414 > spark-sql> set spark.sql.parser.quotedRegexColumnNames=true; > spark-sql> select `col_a`/`col_b` as `col_c` from (select 3 as `col_a` , > 3.14 as `col_b`); > Error in query: Invalid usage of '*' in expression 'divide' > {code} > > This problem exists in 3.X, 2.4.X and master versions. > > Related issue : > https://issues.apache.org/jira/browse/SPARK-12139 > (As can be seen in the latest comments, some people have encountered the same > problem) > > Similar problems: > https://issues.apache.org/jira/browse/SPARK-28897 > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
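For context, a minimal PySpark sketch of what {{spark.sql.parser.quotedRegexColumnNames}} is meant to enable, assuming a hypothetical {{tb_test}} view with columns {{col_a}} and {{col_b}} mirroring case 1 above: with the flag on, a backquoted identifier in the select list is treated as a regular expression over column names, which is also why qualified references such as {{`tb_test`.`col_a`}} start failing.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical data mirroring case 1 above.
spark.sql("CREATE OR REPLACE TEMPORARY VIEW tb_test AS SELECT 1 AS col_a, 2 AS col_b")

spark.conf.set("spark.sql.parser.quotedRegexColumnNames", "true")

# Intended use of the feature: the quoted identifier is a regex that expands
# to every matching column, here both col_a and col_b.
spark.sql("SELECT `col_.*` FROM tb_test").show()

# The reported problem: with the same flag on, a qualified reference such as
# `tb_test`.`col_a` fails with "Invalid usage of '*' in expression".
{code}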
[jira] [Commented] (SPARK-36536) Split the JSON/CSV option of datetime format to in read and in write
[ https://issues.apache.org/jira/browse/SPARK-36536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403051#comment-17403051 ] Apache Spark commented on SPARK-36536: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/33809 > Split the JSON/CSV option of datetime format to in read and in write > > > Key: SPARK-36536 > URL: https://issues.apache.org/jira/browse/SPARK-36536 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.3.0 > > > This is a follow-up of https://issues.apache.org/jira/browse/SPARK-36418. > We need to split the JSON and CSV options *dateFormat* and *timestampFormat*. In > write, the behavior should stay the same, but in read the option shouldn't be set to a > default value. In this way, DateFormatter and TimestampFormatter will use the > CAST logic. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
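To make the proposed read/write split concrete, a hedged PySpark sketch (the {{/tmp/ts_csv}} path and the sample data are made up for illustration): today a single {{timestampFormat}} option covers both sides, while the idea here is that an unset option in read falls back to the CAST parsing logic and write keeps its current default pattern.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("2021-08-23 10:11:12",)], ["ts_str"]) \
    .selectExpr("CAST(ts_str AS TIMESTAMP) AS ts")

# Write: an explicit (or default) pattern is applied when formatting timestamps.
df.write.mode("overwrite") \
    .option("timestampFormat", "yyyy-MM-dd HH:mm:ss") \
    .csv("/tmp/ts_csv")

# Read: when the option is left unset, the proposal is to parse timestamp
# strings via the CAST logic instead of a hard-coded default pattern.
spark.read.schema("ts TIMESTAMP").csv("/tmp/ts_csv").show(truncate=False)
{code}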
[jira] [Commented] (SPARK-36488) "Invalid usage of '*' in expression" error due to the feature of 'quotedRegexColumnNames' in some scenarios.
[ https://issues.apache.org/jira/browse/SPARK-36488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403050#comment-17403050 ] merrily01 commented on SPARK-36488: --- Hi [~planga82] , Thanks for your attention and sorry for the late reply. It's OK to improve the error message, but I don't quite agree that this is not a bug. Firstly, for a very common SQL query, the results differ depending on whether the parameter is turned on or off, which is not easy to accept. At least it proves that this function is not perfect. (I have always felt that the regular-expression handling of this function should not apply to the 'database' part.) Secondly, this feature was originally meant to be compatible with, and was modeled on, the Hive feature. But the cases above execute normally when the feature is turned on in Hive, while Spark has a problem. > "Invalid usage of '*' in expression" error due to the feature of > 'quotedRegexColumnNames' in some scenarios. > > > Key: SPARK-36488 > URL: https://issues.apache.org/jira/browse/SPARK-36488 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.4.8, 3.1.2 >Reporter: merrily01 >Priority: Major > > In some cases, the error happens when the following property is set. > {code:java} > spark.sql("set spark.sql.parser.quotedRegexColumnNames=true") > {code} > *case 1:* > {code:java} > spark-sql> create table tb_test as select 1 as col_a, 2 as col_b; > spark-sql> select `tb_test`.`col_a` from tb_test; > 1 > spark-sql> set spark.sql.parser.quotedRegexColumnNames=true; > spark-sql> select `tb_test`.`col_a` from tb_test; > Error in query: Invalid usage of '*' in expression 'unresolvedextractvalue' > {code} > > *case 2:* > {code:java} > > select `col_a`/`col_b` as `col_c` from (select 3 as `col_a` , > 3.14 as `col_b`); > 0.955414 > spark-sql> set spark.sql.parser.quotedRegexColumnNames=true; > spark-sql> select `col_a`/`col_b` as `col_c` from (select 3 as `col_a` , > 3.14 as `col_b`); > Error in query: Invalid usage of '*' in expression 'divide' > {code} > > This problem exists in 3.X, 2.4.X and master versions. > > Related issue : > https://issues.apache.org/jira/browse/SPARK-12139 > (As can be seen in the latest comments, some people have encountered the same > problem) > > Similar problems: > https://issues.apache.org/jira/browse/SPARK-28897 > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36418) Use CAST in parsing of dates/timestamps with default pattern
[ https://issues.apache.org/jira/browse/SPARK-36418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403048#comment-17403048 ] Apache Spark commented on SPARK-36418: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/33809 > Use CAST in parsing of dates/timestamps with default pattern > > > Key: SPARK-36418 > URL: https://issues.apache.org/jira/browse/SPARK-36418 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.3.0 > > > In functions, CSV/JSON datasources and other places, when the pattern is the > default one, use the CAST logic in parsing strings to dates/timestamps. > Currently, TimestampFormatter.getFormatter() applies the default pattern > *yyyy-MM-dd HH:mm:ss* when the pattern is not set, see > https://github.com/apache/spark/blob/f2492772baf1d00d802e704f84c22a9c410929e9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala#L344 > . Instead of that, we need to create a special formatter which invokes the CAST > logic. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
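As a rough illustration of the difference (a sketch, not the actual implementation), CAST accepts a wider family of date/timestamp strings than a fixed {{yyyy-MM-dd HH:mm:ss}} pattern does, which is what routing the default case through the CAST logic would buy:
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_timestamp

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("2021-08-23 10:11:12",), ("2021-08-23T10:11:12",), ("2021-08-23",)],
    ["s"],
)

# CAST-style parsing handles several ISO-like variants.
df.select(col("s").cast("timestamp").alias("via_cast")).show(truncate=False)

# Fixed-pattern parsing: with default (non-ANSI) settings only the first row
# matches yyyy-MM-dd HH:mm:ss; the others come back as NULL.
df.select(to_timestamp(col("s"), "yyyy-MM-dd HH:mm:ss").alias("via_pattern")).show(truncate=False)
{code}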
[jira] [Commented] (SPARK-36536) Split the JSON/CSV option of datetime format to in read and in write
[ https://issues.apache.org/jira/browse/SPARK-36536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403049#comment-17403049 ] Apache Spark commented on SPARK-36536: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/33809 > Split the JSON/CSV option of datetime format to in read and in write > > > Key: SPARK-36536 > URL: https://issues.apache.org/jira/browse/SPARK-36536 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.3.0 > > > This is a follow-up of https://issues.apache.org/jira/browse/SPARK-36418. > We need to split the JSON and CSV options *dateFormat* and *timestampFormat*. In > write, the behavior should stay the same, but in read the option shouldn't be set to a > default value. In this way, DateFormatter and TimestampFormatter will use the > CAST logic. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31936) Implement ScriptTransform in sql/core
[ https://issues.apache.org/jira/browse/SPARK-31936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403034#comment-17403034 ] Gengliang Wang commented on SPARK-31936: Thank you [~angerszhuuu] > Implement ScriptTransform in sql/core > - > > Key: SPARK-31936 > URL: https://issues.apache.org/jira/browse/SPARK-31936 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.2.0 > > > ScriptTransformation currently relies on Hive internals. It'd be great if we > can implement a native ScriptTransformation in sql/core module to remove the > extra Hive dependency here. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
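For readers unfamiliar with the feature, script transformation is the {{TRANSFORM ... USING}} SQL syntax that pipes rows through an external process. Below is a hedged sketch of the kind of statement covered by the native implementation; the {{/bin/cat}} command and the {{nums}} view are purely illustrative, and a Unix-like environment is assumed. Per the description above, the point of the JIRA is that this no longer needs to go through Hive internals.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.range(3).selectExpr("id", "id * 2 AS doubled").createOrReplaceTempView("nums")

# TRANSFORM writes each input row to the external command's stdin and reads
# transformed rows back from its stdout; /bin/cat simply echoes the rows.
spark.sql("""
    SELECT TRANSFORM(id, doubled)
    USING '/bin/cat'
    AS (id STRING, doubled STRING)
    FROM nums
""").show()
{code}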
[jira] [Updated] (SPARK-31936) Implement ScriptTransform in sql/core
[ https://issues.apache.org/jira/browse/SPARK-31936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-31936: -- Fix Version/s: 3.2.0 > Implement ScriptTransform in sql/core > - > > Key: SPARK-31936 > URL: https://issues.apache.org/jira/browse/SPARK-31936 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.2.0 > > > ScriptTransformation currently relies on Hive internals. It'd be great if we > can implement a native ScriptTransformation in sql/core module to remove the > extra Hive dependency here. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31936) Implement ScriptTransform in sql/core
[ https://issues.apache.org/jira/browse/SPARK-31936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403030#comment-17403030 ] angerszhu commented on SPARK-31936: --- [~Gengliang.Wang] The main work has been done, we can close this. > Implement ScriptTransform in sql/core > - > > Key: SPARK-31936 > URL: https://issues.apache.org/jira/browse/SPARK-31936 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > > ScriptTransformation currently relies on Hive internals. It'd be great if we > can implement a native ScriptTransformation in sql/core module to remove the > extra Hive dependency here. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31936) Implement ScriptTransform in sql/core
[ https://issues.apache.org/jira/browse/SPARK-31936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu resolved SPARK-31936. --- Resolution: Implemented > Implement ScriptTransform in sql/core > - > > Key: SPARK-31936 > URL: https://issues.apache.org/jira/browse/SPARK-31936 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > > ScriptTransformation currently relies on Hive internals. It'd be great if we > can implement a native ScriptTransformation in sql/core module to remove the > extra Hive dependency here. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31936) Implement ScriptTransform in sql/core
[ https://issues.apache.org/jira/browse/SPARK-31936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403025#comment-17403025 ] Gengliang Wang commented on SPARK-31936: [~angerszhuuu] Is this finished? > Implement ScriptTransform in sql/core > - > > Key: SPARK-31936 > URL: https://issues.apache.org/jira/browse/SPARK-31936 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > > ScriptTransformation currently relies on Hive internals. It'd be great if we > can implement a native ScriptTransformation in sql/core module to remove the > extra Hive dependency here. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36561) Remove `ColumnVector.numNulls`
Dmitry Sysolyatin created SPARK-36561: - Summary: Remove `ColumnVector.numNulls` Key: SPARK-36561 URL: https://issues.apache.org/jira/browse/SPARK-36561 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.2.0 Reporter: Dmitry Sysolyatin Hi! When I was implementing the ColumnVector abstract class, I started to check where `numNulls` is used in the Spark source code. I didn't find any places where it is used except tests. Are there any plans to use it in the future? If not, then I propose to remove the `numNulls` method. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36560) Deflake PySpark coverage report
[ https://issues.apache.org/jira/browse/SPARK-36560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17402987#comment-17402987 ] Apache Spark commented on SPARK-36560: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/33808 > Deflake PySpark coverage report > --- > > Key: SPARK-36560 > URL: https://issues.apache.org/jira/browse/SPARK-36560 > Project: Spark > Issue Type: Improvement > Components: Project Infra, PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Priority: Minor > > https://github.com/apache/spark/runs/3388727798?check_suite_focus=true > https://github.com/apache/spark/runs/3392972609?check_suite_focus=true > https://github.com/apache/spark/runs/3359880048?check_suite_focus=true > https://github.com/apache/spark/runs/3338876122?check_suite_focus=true > PySpark scheduled coverage jobs are flaky. We should deflake them -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36560) Deflake PySpark coverage report
[ https://issues.apache.org/jira/browse/SPARK-36560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36560: Assignee: Apache Spark > Deflake PySpark coverage report > --- > > Key: SPARK-36560 > URL: https://issues.apache.org/jira/browse/SPARK-36560 > Project: Spark > Issue Type: Improvement > Components: Project Infra, PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Minor > > https://github.com/apache/spark/runs/3388727798?check_suite_focus=true > https://github.com/apache/spark/runs/3392972609?check_suite_focus=true > https://github.com/apache/spark/runs/3359880048?check_suite_focus=true > https://github.com/apache/spark/runs/3338876122?check_suite_focus=true > PySpark scheduled coverage jobs are flaky. We should deflake them -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36560) Deflake PySpark coverage report
[ https://issues.apache.org/jira/browse/SPARK-36560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17402986#comment-17402986 ] Apache Spark commented on SPARK-36560: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/33808 > Deflake PySpark coverage report > --- > > Key: SPARK-36560 > URL: https://issues.apache.org/jira/browse/SPARK-36560 > Project: Spark > Issue Type: Improvement > Components: Project Infra, PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Priority: Minor > > https://github.com/apache/spark/runs/3388727798?check_suite_focus=true > https://github.com/apache/spark/runs/3392972609?check_suite_focus=true > https://github.com/apache/spark/runs/3359880048?check_suite_focus=true > https://github.com/apache/spark/runs/3338876122?check_suite_focus=true > PySpark scheduled coverage jobs are flaky. We should deflake them -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36560) Deflake PySpark coverage report
[ https://issues.apache.org/jira/browse/SPARK-36560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36560: Assignee: (was: Apache Spark) > Deflake PySpark coverage report > --- > > Key: SPARK-36560 > URL: https://issues.apache.org/jira/browse/SPARK-36560 > Project: Spark > Issue Type: Improvement > Components: Project Infra, PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Priority: Minor > > https://github.com/apache/spark/runs/3388727798?check_suite_focus=true > https://github.com/apache/spark/runs/3392972609?check_suite_focus=true > https://github.com/apache/spark/runs/3359880048?check_suite_focus=true > https://github.com/apache/spark/runs/3338876122?check_suite_focus=true > PySpark scheduled coverage jobs are flaky. We should deflake them -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36560) Deflake PySpark coverage report
[ https://issues.apache.org/jira/browse/SPARK-36560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-36560: - Priority: Minor (was: Major) > Deflake PySpark coverage report > --- > > Key: SPARK-36560 > URL: https://issues.apache.org/jira/browse/SPARK-36560 > Project: Spark > Issue Type: Improvement > Components: Project Infra, PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Priority: Minor > > https://github.com/apache/spark/runs/3388727798?check_suite_focus=true > https://github.com/apache/spark/runs/3392972609?check_suite_focus=true > https://github.com/apache/spark/runs/3359880048?check_suite_focus=true > https://github.com/apache/spark/runs/3338876122?check_suite_focus=true > PySpark scheduled coverage jobs are flaky. We should deflake them -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36560) Deflake PySpark coverage report
[ https://issues.apache.org/jira/browse/SPARK-36560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-36560: - Description: https://github.com/apache/spark/runs/3388727798?check_suite_focus=true https://github.com/apache/spark/runs/3392972609?check_suite_focus=true https://github.com/apache/spark/runs/3359880048?check_suite_focus=true https://github.com/apache/spark/runs/3338876122?check_suite_focus=true PySpark scheduled coverage jobs are flaky. We should deflake them was: https://github.com/apache/spark/runs/3388727798?check_suite_focus=true https://github.com/apache/spark/runs/3392972609?check_suite_focus=true https://github.com/apache/spark/runs/3359880048?check_suite_focus=true https://github.com/apache/spark/runs/3338876122?check_suite_focus=true > Deflake PySpark coverage report > --- > > Key: SPARK-36560 > URL: https://issues.apache.org/jira/browse/SPARK-36560 > Project: Spark > Issue Type: Improvement > Components: Project Infra, PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Priority: Major > > https://github.com/apache/spark/runs/3388727798?check_suite_focus=true > https://github.com/apache/spark/runs/3392972609?check_suite_focus=true > https://github.com/apache/spark/runs/3359880048?check_suite_focus=true > https://github.com/apache/spark/runs/3338876122?check_suite_focus=true > PySpark scheduled coverage jobs are flaky. We should deflake them -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36560) Deflake PySpark coverage report
Hyukjin Kwon created SPARK-36560: Summary: Deflake PySpark coverage report Key: SPARK-36560 URL: https://issues.apache.org/jira/browse/SPARK-36560 Project: Spark Issue Type: Improvement Components: Project Infra, PySpark Affects Versions: 3.2.0 Reporter: Hyukjin Kwon https://github.com/apache/spark/runs/3388727798?check_suite_focus=true https://github.com/apache/spark/runs/3392972609?check_suite_focus=true https://github.com/apache/spark/runs/3359880048?check_suite_focus=true https://github.com/apache/spark/runs/3338876122?check_suite_focus=true -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36559) Allow column pruning on distributed sequence index (pandas API on Spark)
[ https://issues.apache.org/jira/browse/SPARK-36559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36559: Assignee: (was: Apache Spark) > Allow column pruning on distributed sequence index (pandas API on Spark) > > > Key: SPARK-36559 > URL: https://issues.apache.org/jira/browse/SPARK-36559 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-36338 implemented the distributed > sequence index on the JVM side. However, it prevents leveraging the Spark > SQL engine because it creates an RDD directly and truncates the SQL plans. > We should move the logic into a proper SQL plan to leverage other > optimizations such as column pruning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36559) Allow column pruning on distributed sequence index (pandas API on Spark)
[ https://issues.apache.org/jira/browse/SPARK-36559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36559: Assignee: Apache Spark > Allow column pruning on distributed sequence index (pandas API on Spark) > > > Key: SPARK-36559 > URL: https://issues.apache.org/jira/browse/SPARK-36559 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-36338 implemented the distributed > sequence index on the JVM side. However, it prevents leveraging the Spark > SQL engine because it creates an RDD directly and truncates the SQL plans. > We should move the logic into a proper SQL plan to leverage other > optimizations such as column pruning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36559) Allow column pruning on distributed sequence index (pandas API on Spark)
[ https://issues.apache.org/jira/browse/SPARK-36559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17402972#comment-17402972 ] Apache Spark commented on SPARK-36559: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/33807 > Allow column pruning on distributed sequence index (pandas API on Spark) > > > Key: SPARK-36559 > URL: https://issues.apache.org/jira/browse/SPARK-36559 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-36338 implemented the distributed > sequence index on the JVM side. However, it prevents leveraging the Spark > SQL engine because it creates an RDD directly and truncates the SQL plans. > We should move the logic into a proper SQL plan to leverage other > optimizations such as column pruning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36559) Allow column pruning on distributed sequence index (pandas API on Spark)
Hyukjin Kwon created SPARK-36559: Summary: Allow column pruning on distributed sequence index (pandas API on Spark) Key: SPARK-36559 URL: https://issues.apache.org/jira/browse/SPARK-36559 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 3.2.0 Reporter: Hyukjin Kwon https://issues.apache.org/jira/browse/SPARK-36338 implemented the distributed sequence index on the JVM side. However, it prevents leveraging the Spark SQL engine because it creates an RDD directly and truncates the SQL plans. We should move the logic into a proper SQL plan to leverage other optimizations such as column pruning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
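A hedged sketch of where this shows up, using the pandas-on-Spark default index option (option name as documented for the pandas API on Spark; the tiny frame is illustrative): once the distributed-sequence index is attached inside the SQL plan rather than via a raw RDD, selecting a single column should let the optimizer prune the untouched ones, which can be checked in the plan output.
{code:python}
import pyspark.pandas as ps

# Use the distributed-sequence default index for new pandas-on-Spark frames.
ps.set_option("compute.default_index_type", "distributed-sequence")

psdf = ps.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# Only column "a" is needed; with the index generated inside the SQL plan,
# column pruning can drop "b" and "c" from the scan.
psdf[["a"]].to_spark().explain()
{code}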
[jira] [Updated] (SPARK-36457) Review and fix issues in API docs
[ https://issues.apache.org/jira/browse/SPARK-36457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-36457: Priority: Blocker (was: Critical) > Review and fix issues in API docs > - > > Key: SPARK-36457 > URL: https://issues.apache.org/jira/browse/SPARK-36457 > Project: Spark > Issue Type: Improvement > Components: docs >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Blocker > > Compare the 3.2.0 API doc with the latest release version 3.1.2. Fix the > following issues: > * Add missing `Since` annotation for new APIs > * Remove the leaking class/object in API doc -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
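On the Python side of the same audit, the counterpart of the {{Since}} annotation is the {{versionadded}} directive in the docstring, which the API docs render as the "New in version" note. The function below is a made-up placeholder, shown only to illustrate the convention.
{code:python}
def some_new_public_api(value: int) -> int:
    """Illustrative placeholder for a new public PySpark API.

    .. versionadded:: 3.2.0

    Parameters
    ----------
    value : int
        An example parameter.

    Returns
    -------
    int
        The input value, unchanged.
    """
    return value
{code}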
[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17402964#comment-17402964 ] Dongjoon Hyun commented on SPARK-25075: --- [~LuciferYang]. Not yet, although I believe the Apache Spark community can switch the default in 2022. We will see the adoption of Scala 2.13 after the Apache Spark 3.2.0 release and stabilize our releases first at 3.2.1 and 3.2.2. We can start to build a consensus early next year and target it for Apache Spark 3.3.0. > Build and test Spark against Scala 2.13 > --- > > Key: SPARK-25075 > URL: https://issues.apache.org/jira/browse/SPARK-25075 > Project: Spark > Issue Type: Umbrella > Components: Build, MLlib, Project Infra, Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Guillaume Massé >Priority: Major > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.13 milestone. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org