[jira] [Resolved] (SPARK-36561) Remove `ColumnVector.numNulls`
[ https://issues.apache.org/jira/browse/SPARK-36561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-36561. -- Resolution: Invalid > Remove `ColumnVector.numNulls` > -- > > Key: SPARK-36561 > URL: https://issues.apache.org/jira/browse/SPARK-36561 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Dmitry Sysolyatin >Priority: Major > > Hi! > When I was implementing the ColumnVector abstract class, I started to check where > `numNulls` is used in the Spark source code. I didn't find any places where it is > used except tests. > Are there any plans to use it in the future? If not, I propose removing the > `numNulls` method. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36561) Remove `ColumnVector.numNulls`
[ https://issues.apache.org/jira/browse/SPARK-36561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403528#comment-17403528 ] Hyukjin Kwon commented on SPARK-36561: -- {{ColumnVector}} is an API. Let's keep them. > Remove `ColumnVector.numNulls` > -- > > Key: SPARK-36561 > URL: https://issues.apache.org/jira/browse/SPARK-36561 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Dmitry Sysolyatin >Priority: Major > > Hi! > When I was implementing the ColumnVector abstract class, I started to check where > `numNulls` is used in the Spark source code. I didn't find any places where it is > used except tests. > Are there any plans to use it in the future? If not, I propose removing the > `numNulls` method. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36562) Improve InsertIntoHadoopFsRelation file commit logic
[ https://issues.apache.org/jira/browse/SPARK-36562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-36562: - Affects Version/s: 3.3.0 > Improve InsertIntoHadoopFsRelation file commit logic > > > Key: SPARK-36562 > URL: https://issues.apache.org/jira/browse/SPARK-36562 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.3.0 >Reporter: angerszhu >Priority: Major > > Improve InsertIntoHadoopFsRelation file commit logic -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36562) Improve InsertIntoHadoopFsRelation file commit logic
[ https://issues.apache.org/jira/browse/SPARK-36562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-36562: - Fix Version/s: (was: 3.3.0) > Improve InsertIntoHadoopFsRelation file commit logic > > > Key: SPARK-36562 > URL: https://issues.apache.org/jira/browse/SPARK-36562 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: angerszhu >Priority: Major > > Improve InsertIntoHadoopFsRelation file commit logic -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36537) Take care of other tests disabled for CategoricalDtype.
[ https://issues.apache.org/jira/browse/SPARK-36537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36537: Assignee: Apache Spark > Take care of other tests disabled for CategoricalDtype. > --- > > Key: SPARK-36537 > URL: https://issues.apache.org/jira/browse/SPARK-36537 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Assignee: Apache Spark >Priority: Major > > There are some more tests disabled for CategoricalDtype. > They seem like pandas' bugs or not maintained anymore because inplace > updates with CategoricalDtype are deprecated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36537) Take care of other tests disabled for CategoricalDtype.
[ https://issues.apache.org/jira/browse/SPARK-36537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36537: Assignee: (was: Apache Spark) > Take care of other tests disabled for CategoricalDtype. > --- > > Key: SPARK-36537 > URL: https://issues.apache.org/jira/browse/SPARK-36537 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Priority: Major > > There are some more tests disabled for CategoricalDtype. > They seem like pandas' bugs or not maintained anymore because inplace > updates with CategoricalDtype are deprecated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36537) Take care of other tests disabled for CategoricalDtype.
[ https://issues.apache.org/jira/browse/SPARK-36537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403527#comment-17403527 ] Apache Spark commented on SPARK-36537: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/33817 > Take care of other tests disabled for CategoricalDtype. > --- > > Key: SPARK-36537 > URL: https://issues.apache.org/jira/browse/SPARK-36537 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Priority: Major > > There are some more tests disabled for CategoricalDtype. > They seem like pandas' bugs or not maintained anymore because inplace > updates with CategoricalDtype are deprecated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36537) Take care of other tests disabled for CategoricalDtype.
[ https://issues.apache.org/jira/browse/SPARK-36537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-36537: Description: There are some more tests disabled for CategoricalDtype. They seem like pandas' bugs or not maintained anymore because inplace updates with CategoricalDtype are deprecated. was: There are some more tests disabled related to inplace updates with CategoricalDtype. They seem like pandas' bugs or not maintained anymore because inplace updates with CategoricalDtype are deprecated. > Take care of other tests disabled for CategoricalDtype. > --- > > Key: SPARK-36537 > URL: https://issues.apache.org/jira/browse/SPARK-36537 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Priority: Major > > There are some more tests disabled for CategoricalDtype. > They seem like pandas' bugs or not maintained anymore because inplace > updates with CategoricalDtype are deprecated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36537) Take care of other tests disabled for CategoricalDtype.
[ https://issues.apache.org/jira/browse/SPARK-36537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-36537: Summary: Take care of other tests disabled for CategoricalDtype. (was: Take care of other tests disabled related to inplace updates with CategoricalDtype.) > Take care of other tests disabled for CategoricalDtype. > --- > > Key: SPARK-36537 > URL: https://issues.apache.org/jira/browse/SPARK-36537 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Priority: Major > > There are some more tests disabled related to inplace updates with > CategoricalDtype. > They seem like pandas' bugs or not maintained anymore because inplace updates > with CategoricalDtype are deprecated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
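For context on the deprecation mentioned in SPARK-36537, a minimal pandas sketch of the pattern behind the disabled tests (assuming pandas >= 1.3, where the inplace parameter of the categorical accessor methods is deprecated; the values are illustrative, not taken from the tests):
{code:python}
import pandas as pd

ser = pd.Series(["a", "b", "a"], dtype="category")

# Deprecated style: mutate the categories in place. On pandas >= 1.3 this emits
# a FutureWarning and is slated for removal.
# ser.cat.add_categories(["c"], inplace=True)

# Recommended style: reassign the result instead of updating in place.
ser = ser.cat.add_categories(["c"])
print(ser.cat.categories)  # Index(['a', 'b', 'c'], dtype='object')
{code}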
[jira] [Commented] (SPARK-36558) Stage has all tasks finished but with ongoing finalization can cause job hang
[ https://issues.apache.org/jira/browse/SPARK-36558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403508#comment-17403508 ] Venkata krishnan Sowrirajan commented on SPARK-36558: - One thing I missed here is not setting the newly constructed `DAGScheduler` into the SparkContext. Let me try that and update here. > Stage has all tasks finished but with ongoing finalization can cause job hang > - > > Key: SPARK-36558 > URL: https://issues.apache.org/jira/browse/SPARK-36558 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.2.0, 3.3.0 >Reporter: wuyi >Priority: Blocker > > > For a stage that all tasks are finished but with ongoing finalization can > lead to job hang. The problem is that such stage is considered as a "missing" > stage (see > [https://github.com/apache/spark/blob/a47ceaf5492040063e31e17570678dc06846c36c/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L719-L721).] > And it breaks the original assumption that a "missing" stage must have tasks > to run. > Normally, if stage A is the parent of (result) stage B and all tasks have > finished in stage A, stage A will be skipped directly when submitting stage > B. However, with this bug, stage A will be submitted. And submitting a stage > with no tasks to run would not be able to add its child stage into the > waiting stage list, which leads to the job hang in the end. > > The example to reproduce: > First, change `MyRDD` to allow it to compute: > {code:java} > override def compute(split: Partition, context: TaskContext): Iterator[(Int, > Int)] = { > Iterator.single((1, 1)) > }{code} > Then run this test: > {code:java} > test("Job hang") { > initPushBasedShuffleConfs(conf) > conf.set(config.SHUFFLE_MERGER_LOCATIONS_MIN_STATIC_THRESHOLD, 5) > DAGSchedulerSuite.clearMergerLocs > DAGSchedulerSuite.addMergerLocs(Seq("host1", "host2", "host3", "host4", > "host5")) > val latch = new CountDownLatch(1) > val myDAGScheduler = new MyDAGScheduler( > sc, > sc.dagScheduler.taskScheduler, > sc.listenerBus, > sc.env.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster], > sc.env.blockManager.master, > sc.env) { > override def scheduleShuffleMergeFinalize(stage: ShuffleMapStage): Unit = > { > // By this, we can mimic a stage with all tasks finished > // but finalization is incomplete. > latch.countDown() > } > } > sc.dagScheduler = myDAGScheduler > sc.taskScheduler.setDAGScheduler(myDAGScheduler) > val parts = 20 > val shuffleMapRdd = new MyRDD(sc, parts, Nil) > val shuffleDep = new ShuffleDependency(shuffleMapRdd, new > HashPartitioner(parts)) > val reduceRdd1 = new MyRDD(sc, parts, List(shuffleDep), tracker = > mapOutputTracker) > reduceRdd1.countAsync() > latch.await() > // scalastyle:off > println("=after wait==") > // set _shuffleMergedFinalized to true can avoid the hang. > // shuffleDep._shuffleMergedFinalized = true > val reduceRdd2 = new MyRDD(sc, parts, List(shuffleDep)) > reduceRdd2.count() > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36558) Stage has all tasks finished but with ongoing finalization can cause job hang
[ https://issues.apache.org/jira/browse/SPARK-36558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403507#comment-17403507 ] Venkata krishnan Sowrirajan commented on SPARK-36558: - [~Ngone51] I am not sure if you can run the tests that way, because then the `DAGSchedulerSuite` custom `DAGScheduler` or `DAGSchedulerEventLoopTester` won't be used therefore it submits the job with the default `DAGScheduler` and waits in the `finalizeShuffleMerge` which cannot complete because of our mocked up merger locations. Btw below is the test code along with the above changes you mentioned added to `MyRDD`. Am I missing something? {code:java} test("Job hang") { initPushBasedShuffleConfs(conf) conf.set("spark.shuffle.push.mergerLocations.minThreshold", "5") DAGSchedulerSuite.clearMergerLocs DAGSchedulerSuite.addMergerLocs(Seq("host1", "host2", "host3", "host4", "host5")) val latch = new CountDownLatch(1) scheduler = new MyDAGScheduler( sc, taskScheduler, sc.listenerBus, mapOutputTracker, blockManagerMaster, sc.env) { override private[spark] def scheduleShuffleMergeFinalize( stage: ShuffleMapStage): Unit = { // By this, we can mimic a stage with all tasks finished // but finalization is incomplete. latch.countDown() super.scheduleShuffleMergeFinalize(stage) } } dagEventProcessLoopTester = new DAGSchedulerEventProcessLoopTester(scheduler) val parts = 5 val shuffleMapRdd = new MyRDD(sc, parts, Nil) val shuffleDep = new ShuffleDependency(shuffleMapRdd, new HashPartitioner(parts)) val reduceRdd1 = new MyRDD(sc, parts, List(shuffleDep), tracker = mapOutputTracker) reduceRdd1.count() latch.await() // scalastyle:off println("=after wait==") // set _shuffleMergedFinalized to true can avoid the hang. val reduceRdd2 = new MyRDD(sc, parts, List(shuffleDep)) reduceRdd2.count() } {code} > Stage has all tasks finished but with ongoing finalization can cause job hang > - > > Key: SPARK-36558 > URL: https://issues.apache.org/jira/browse/SPARK-36558 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.2.0, 3.3.0 >Reporter: wuyi >Priority: Blocker > > > For a stage that all tasks are finished but with ongoing finalization can > lead to job hang. The problem is that such stage is considered as a "missing" > stage (see > [https://github.com/apache/spark/blob/a47ceaf5492040063e31e17570678dc06846c36c/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L719-L721).] > And it breaks the original assumption that a "missing" stage must have tasks > to run. > Normally, if stage A is the parent of (result) stage B and all tasks have > finished in stage A, stage A will be skipped directly when submitting stage > B. However, with this bug, stage A will be submitted. And submitting a stage > with no tasks to run would not be able to add its child stage into the > waiting stage list, which leads to the job hang in the end. 
> > The example to reproduce: > First, change `MyRDD` to allow it to compute: > {code:java} > override def compute(split: Partition, context: TaskContext): Iterator[(Int, > Int)] = { > Iterator.single((1, 1)) > }{code} > Then run this test: > {code:java} > test("Job hang") { > initPushBasedShuffleConfs(conf) > conf.set(config.SHUFFLE_MERGER_LOCATIONS_MIN_STATIC_THRESHOLD, 5) > DAGSchedulerSuite.clearMergerLocs > DAGSchedulerSuite.addMergerLocs(Seq("host1", "host2", "host3", "host4", > "host5")) > val latch = new CountDownLatch(1) > val myDAGScheduler = new MyDAGScheduler( > sc, > sc.dagScheduler.taskScheduler, > sc.listenerBus, > sc.env.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster], > sc.env.blockManager.master, > sc.env) { > override def scheduleShuffleMergeFinalize(stage: ShuffleMapStage): Unit = > { > // By this, we can mimic a stage with all tasks finished > // but finalization is incomplete. > latch.countDown() > } > } > sc.dagScheduler = myDAGScheduler > sc.taskScheduler.setDAGScheduler(myDAGScheduler) > val parts = 20 > val shuffleMapRdd = new MyRDD(sc, parts, Nil) > val shuffleDep = new ShuffleDependency(shuffleMapRdd, new > HashPartitioner(parts)) > val reduceRdd1 = new MyRDD(sc, parts, List(shuffleDep), tracker = > mapOutputTracker) > reduceRdd1.countAsync() > latch.await() > // scalastyle:off > println("=after wait==") > // set _shuffleMergedFinalized to true can avoid the hang. > // shuffleDep._shuffleMergedFinalized = true > val reduceRdd2 = new MyRDD(sc, parts, List(shuffleDep)) >
[jira] [Commented] (SPARK-36565) "Unscaled value too large for precision" while reading simple parquet file readable with parquet-tools and pandas
[ https://issues.apache.org/jira/browse/SPARK-36565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403475#comment-17403475 ] Fu Chen commented on SPARK-36565: - How to generate this parquet file? > "Unscaled value too large for precision" while reading simple parquet file > readable with parquet-tools and pandas > - > > Key: SPARK-36565 > URL: https://issues.apache.org/jira/browse/SPARK-36565 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 3.1.2 >Reporter: Lorenzo Martini >Priority: Minor > Attachments: broken_parquet_file.parquet > > > I have a simple parquet file (attached to the ticket) with 2 columns > (array, decimal) that can be read and viewed correctly using pandas > or parquet-tools. Reading the parquet file in spark (and pyspark) seems to > work, but calling `.show()` throws the exception (1) with > {code:java} > Caused by: java.lang.ArithmeticException: Unscaled value too large for > precision at. > {code} > > > Another interesting detail is that reading the parquet file and doing a > select on individual columns allows for `show()` to work correctly without > throwing. > {code:java} > >>> repro = spark.read.parquet(".../broken_file.parquet") > >>> repro.printSchema() > root > |-- column_a: array (nullable = true) > ||-- element: string (containsNull = true) > |-- column_b: decimal(4,0) (nullable = true) > >>> repro.select("column_a").show() > ++ > |column_a| > ++ > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > ++ > >>> repro.select("column_b").show() > ++ > |column_b| > ++ > | 11590| > | 11590| > | 11590| > | 11590| > | 11590| > | 11590| > | 11590| > | 11590| > | 11590| > | 11590| > ++ > >>> repro.show() // THIS ONE THROWS EXCEPTION (1) > {code} > Using `parquet-tools` shows the dataset correctly > {code:java} > >>> parquet-tools show broken_file.parquet > +++ > | column_a | column_b | > |+| > || 11590 | > || 11590 | > || 11590 | > || 11590 | > || 11590 | > || 11590 | > || 11590 | > || 11590 | > || 11590 | > || 11590 | > +++ > {code} > > > And the same with `pandas` > {code:java} > >>> import pandas as pd > >>> pd.read_parquet(".../broken_file.parquet") > column_a column_b > 0 None11590 > 1 None11590 > 2 None11590 > 3 None11590 > 4 None11590 > 5 None11590 > 6 None11590 > 7 None11590 > 8 None11590 > 9 None11590 > {code} > > I have also verified this affects all versions of spark between 2.4.0 and > 3.1.2 > Here the Exception (1) thrown (sorry about the poor formatting, didn't seem > to manage to make it work): > > {code:java} > >>> spark.version '3.1.2' > >>> df = spark.read.parquet(".../broken_parquet_file.parquet") > >>> df > DataFrame[column_a: array, column_b: decimal(4,0)] > >>> df.show() 21/08/23 18:39:36 > ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)/ 1] > org.apache.spark.sql.execution.QueryExecutionException: Encounter error while > reading parquet files. One possible cause: Parquet column cannot be converted > in the corresponding files. 
Details: at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345) > at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at > org.apache.spark.scheduler.Task.run(Task.scala:131) at >
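A quick way to dig into the mismatch reported in SPARK-36565 is to compare the decimal type recorded in the file footer (decimal(4,0)) against the stored values (11590 needs five digits). Below is a minimal diagnostic sketch with pyarrow, assuming the attached file is available locally; the path is a placeholder:
{code:python}
import pyarrow.parquet as pq

path = "broken_parquet_file.parquet"  # placeholder path to the attached file

# The footer schema shows the declared logical type of each column.
print(pq.read_schema(path))

# Reading the values without Spark mirrors what parquet-tools and pandas show above.
print(pq.read_table(path).to_pandas())
{code}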
[jira] [Created] (SPARK-36571) Optimized FileOutputCommitter with StagingDir
angerszhu created SPARK-36571: - Summary: Optimized FileOutputCommitter with StagingDir Key: SPARK-36571 URL: https://issues.apache.org/jira/browse/SPARK-36571 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: angerszhu -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36558) Stage has all tasks finished but with ongoing finalization can cause job hang
[ https://issues.apache.org/jira/browse/SPARK-36558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-36558: - Description: For a stage that all tasks are finished but with ongoing finalization can lead to job hang. The problem is that such stage is considered as a "missing" stage (see [https://github.com/apache/spark/blob/a47ceaf5492040063e31e17570678dc06846c36c/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L719-L721).] And it breaks the original assumption that a "missing" stage must have tasks to run. Normally, if stage A is the parent of (result) stage B and all tasks have finished in stage A, stage A will be skipped directly when submitting stage B. However, with this bug, stage A will be submitted. And submitting a stage with no tasks to run would not be able to add its child stage into the waiting stage list, which leads to the job hang in the end. The example to reproduce: First, change `MyRDD` to allow it to compute: {code:java} override def compute(split: Partition, context: TaskContext): Iterator[(Int, Int)] = { Iterator.single((1, 1)) }{code} Then run this test: {code:java} test("Job hang") { initPushBasedShuffleConfs(conf) conf.set(config.SHUFFLE_MERGER_LOCATIONS_MIN_STATIC_THRESHOLD, 5) DAGSchedulerSuite.clearMergerLocs DAGSchedulerSuite.addMergerLocs(Seq("host1", "host2", "host3", "host4", "host5")) val latch = new CountDownLatch(1) val myDAGScheduler = new MyDAGScheduler( sc, sc.dagScheduler.taskScheduler, sc.listenerBus, sc.env.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster], sc.env.blockManager.master, sc.env) { override def scheduleShuffleMergeFinalize(stage: ShuffleMapStage): Unit = { // By this, we can mimic a stage with all tasks finished // but finalization is incomplete. latch.countDown() } } sc.dagScheduler = myDAGScheduler sc.taskScheduler.setDAGScheduler(myDAGScheduler) val parts = 20 val shuffleMapRdd = new MyRDD(sc, parts, Nil) val shuffleDep = new ShuffleDependency(shuffleMapRdd, new HashPartitioner(parts)) val reduceRdd1 = new MyRDD(sc, parts, List(shuffleDep), tracker = mapOutputTracker) reduceRdd1.countAsync() latch.await() // scalastyle:off println("=after wait==") // set _shuffleMergedFinalized to true can avoid the hang. // shuffleDep._shuffleMergedFinalized = true val reduceRdd2 = new MyRDD(sc, parts, List(shuffleDep)) reduceRdd2.count() } {code} was: For a stage that all tasks are finished but with ongoing finalization can lead to job hang. The problem is that such stage is considered as a "missing" stage (see [https://github.com/apache/spark/blob/a47ceaf5492040063e31e17570678dc06846c36c/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L719-L721).] And it breaks the original assumption that a "missing" stage must have tasks to run. Normally, if stage A is the parent of (result) stage B and all tasks have finished in stage A, stage A will be skipped directly when submitting stage B. However, with this bug, stage A will be submitted. And submitting a stage with no tasks to run would not be able to add its child stage into the waiting stage list, which leads to the job hang in the end. 
The example to reproduce: First, change `MyRDD` to allow it compute: override def compute(split: Partition, context: TaskContext): Iterator[(Int, Int)] = \{ Iterator.single((1, 1)) } Then run this test: {code:java} test("Job hang") { initPushBasedShuffleConfs(conf) conf.set(config.SHUFFLE_MERGER_LOCATIONS_MIN_STATIC_THRESHOLD, 5) DAGSchedulerSuite.clearMergerLocs DAGSchedulerSuite.addMergerLocs(Seq("host1", "host2", "host3", "host4", "host5")) val latch = new CountDownLatch(1) val myDAGScheduler = new MyDAGScheduler( sc, sc.dagScheduler.taskScheduler, sc.listenerBus, sc.env.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster], sc.env.blockManager.master, sc.env) { override def scheduleShuffleMergeFinalize(stage: ShuffleMapStage): Unit = { // By this, we can mimic a stage with all tasks finished // but finalization is incomplete. latch.countDown() } } sc.dagScheduler = myDAGScheduler sc.taskScheduler.setDAGScheduler(myDAGScheduler) val parts = 20 val shuffleMapRdd = new MyRDD(sc, parts, Nil) val shuffleDep = new ShuffleDependency(shuffleMapRdd, new HashPartitioner(parts)) val reduceRdd1 = new MyRDD(sc, parts, List(shuffleDep), tracker = mapOutputTracker) reduceRdd1.countAsync() latch.await() // scalastyle:off println("=after wait==") // set _shuffleMergedFinalized to true can avoid the hang. // shuffleDep._shuffleMergedFinalized = true val reduceRdd2 = new MyRDD(sc, parts, List(shuffleDep)) reduceRdd2.count() } {code} > Stage has all tasks finished but with ongoing finalization can cause job hang >
[jira] [Updated] (SPARK-36558) Stage has all tasks finished but with ongoing finalization can cause job hang
[ https://issues.apache.org/jira/browse/SPARK-36558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-36558: - Description: For a stage that all tasks are finished but with ongoing finalization can lead to job hang. The problem is that such stage is considered as a "missing" stage (see [https://github.com/apache/spark/blob/a47ceaf5492040063e31e17570678dc06846c36c/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L719-L721).] And it breaks the original assumption that a "missing" stage must have tasks to run. Normally, if stage A is the parent of (result) stage B and all tasks have finished in stage A, stage A will be skipped directly when submitting stage B. However, with this bug, stage A will be submitted. And submitting a stage with no tasks to run would not be able to add its child stage into the waiting stage list, which leads to the job hang in the end. The example to reproduce: First, change `MyRDD` to allow it compute: override def compute(split: Partition, context: TaskContext): Iterator[(Int, Int)] = \{ Iterator.single((1, 1)) } Then run this test: {code:java} test("Job hang") { initPushBasedShuffleConfs(conf) conf.set(config.SHUFFLE_MERGER_LOCATIONS_MIN_STATIC_THRESHOLD, 5) DAGSchedulerSuite.clearMergerLocs DAGSchedulerSuite.addMergerLocs(Seq("host1", "host2", "host3", "host4", "host5")) val latch = new CountDownLatch(1) val myDAGScheduler = new MyDAGScheduler( sc, sc.dagScheduler.taskScheduler, sc.listenerBus, sc.env.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster], sc.env.blockManager.master, sc.env) { override def scheduleShuffleMergeFinalize(stage: ShuffleMapStage): Unit = { // By this, we can mimic a stage with all tasks finished // but finalization is incomplete. latch.countDown() } } sc.dagScheduler = myDAGScheduler sc.taskScheduler.setDAGScheduler(myDAGScheduler) val parts = 20 val shuffleMapRdd = new MyRDD(sc, parts, Nil) val shuffleDep = new ShuffleDependency(shuffleMapRdd, new HashPartitioner(parts)) val reduceRdd1 = new MyRDD(sc, parts, List(shuffleDep), tracker = mapOutputTracker) reduceRdd1.countAsync() latch.await() // scalastyle:off println("=after wait==") // set _shuffleMergedFinalized to true can avoid the hang. // shuffleDep._shuffleMergedFinalized = true val reduceRdd2 = new MyRDD(sc, parts, List(shuffleDep)) reduceRdd2.count() } {code} was: For a stage that all tasks are finished but with ongoing finalization can lead to job hang. The problem is that such stage is considered as a "missing" stage (see [https://github.com/apache/spark/blob/a47ceaf5492040063e31e17570678dc06846c36c/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L719-L721).] And it breaks the original assumption that a "missing" stage must have tasks to run. Normally, if stage A is the parent of (result) stage B and all tasks have finished in stage A, stage A will be skipped directly when submitting stage B. However, with this bug, stage A will be submitted. And submitting a stage with no tasks to run would not be able to add its child stage into the waiting stage list, which leads to the job hang in the end. 
The example to reproduce: {code:java} test("Job hang") { initPushBasedShuffleConfs(conf) conf.set(config.SHUFFLE_MERGER_LOCATIONS_MIN_STATIC_THRESHOLD, 5) DAGSchedulerSuite.clearMergerLocs DAGSchedulerSuite.addMergerLocs(Seq("host1", "host2", "host3", "host4", "host5")) val latch = new CountDownLatch(1) val myDAGScheduler = new MyDAGScheduler( sc, sc.dagScheduler.taskScheduler, sc.listenerBus, sc.env.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster], sc.env.blockManager.master, sc.env) { override def scheduleShuffleMergeFinalize(stage: ShuffleMapStage): Unit = { // By this, we can mimic a stage with all tasks finished // but finalization is incomplete. latch.countDown() } } sc.dagScheduler = myDAGScheduler sc.taskScheduler.setDAGScheduler(myDAGScheduler) val parts = 20 val shuffleMapRdd = new MyRDD(sc, parts, Nil) val shuffleDep = new ShuffleDependency(shuffleMapRdd, new HashPartitioner(parts)) val reduceRdd1 = new MyRDD(sc, parts, List(shuffleDep), tracker = mapOutputTracker) reduceRdd1.countAsync() latch.await() // scalastyle:off println("=after wait==") // set _shuffleMergedFinalized to true can avoid the hang. // shuffleDep._shuffleMergedFinalized = true val reduceRdd2 = new MyRDD(sc, parts, List(shuffleDep)) reduceRdd2.count() } {code} > Stage has all tasks finished but with ongoing finalization can cause job hang > - > > Key: SPARK-36558 > URL: https://issues.apache.org/jira/browse/SPARK-36558 >
[jira] [Updated] (SPARK-36558) Stage has all tasks finished but with ongoing finalization can cause job hang
[ https://issues.apache.org/jira/browse/SPARK-36558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-36558: - Description: For a stage that all tasks are finished but with ongoing finalization can lead to job hang. The problem is that such stage is considered as a "missing" stage (see [https://github.com/apache/spark/blob/a47ceaf5492040063e31e17570678dc06846c36c/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L719-L721).] And it breaks the original assumption that a "missing" stage must have tasks to run. Normally, if stage A is the parent of (result) stage B and all tasks have finished in stage A, stage A will be skipped directly when submitting stage B. However, with this bug, stage A will be submitted. And submitting a stage with no tasks to run would not be able to add its child stage into the waiting stage list, which leads to the job hang in the end. The example to reproduce: {code:java} test("Job hang") { initPushBasedShuffleConfs(conf) conf.set(config.SHUFFLE_MERGER_LOCATIONS_MIN_STATIC_THRESHOLD, 5) DAGSchedulerSuite.clearMergerLocs DAGSchedulerSuite.addMergerLocs(Seq("host1", "host2", "host3", "host4", "host5")) val latch = new CountDownLatch(1) val myDAGScheduler = new MyDAGScheduler( sc, sc.dagScheduler.taskScheduler, sc.listenerBus, sc.env.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster], sc.env.blockManager.master, sc.env) { override def scheduleShuffleMergeFinalize(stage: ShuffleMapStage): Unit = { // By this, we can mimic a stage with all tasks finished // but finalization is incomplete. latch.countDown() } } sc.dagScheduler = myDAGScheduler sc.taskScheduler.setDAGScheduler(myDAGScheduler) val parts = 20 val shuffleMapRdd = new MyRDD(sc, parts, Nil) val shuffleDep = new ShuffleDependency(shuffleMapRdd, new HashPartitioner(parts)) val reduceRdd1 = new MyRDD(sc, parts, List(shuffleDep), tracker = mapOutputTracker) reduceRdd1.countAsync() latch.await() // scalastyle:off println("=after wait==") // set _shuffleMergedFinalized to true can avoid the hang. // shuffleDep._shuffleMergedFinalized = true val reduceRdd2 = new MyRDD(sc, parts, List(shuffleDep)) reduceRdd2.count() } {code} was: For a stage that all tasks are finished but with ongoing finalization can lead to job hang. The problem is that such stage is considered as a "missing" stage (see [https://github.com/apache/spark/blob/a47ceaf5492040063e31e17570678dc06846c36c/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L719-L721).] And it breaks the original assumption that a "missing" stage must have tasks to run. Normally, if stage A is the parent of (result) stage B and all tasks have finished in stage A, stage A will be skipped directly when submitting stage B. However, with this bug, stage A will be submitted. And submitting a stage with no tasks to run would not be able to add its child stage into the waiting stage list, which leads to the job hang in the end. 
The example to reproduce: {code:java} test("Job hang") { initPushBasedShuffleConfs(conf) conf.set(config.SHUFFLE_MERGER_LOCATIONS_MIN_STATIC_THRESHOLD, 5) DAGSchedulerSuite.clearMergerLocs DAGSchedulerSuite.addMergerLocs(Seq("host1", "host2", "host3", "host4", "host5")) val latch = new CountDownLatch(1) val myDAGScheduler = new MyDAGScheduler( sc, sc.dagScheduler.taskScheduler, sc.listenerBus, sc.env.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster], sc.env.blockManager.master, sc.env) { override def scheduleShuffleMergeFinalize(stage: ShuffleMapStage): Unit = { // By this, we can mimic a stage with all tasks finished // but finalization is incomplete. latch.countDown() } } sc.dagScheduler = myDAGScheduler sc.taskScheduler.setDAGScheduler(myDAGScheduler) val parts = 20 val shuffleMapRdd = new MyRDD(sc, parts, Nil) val shuffleDep = new ShuffleDependency(shuffleMapRdd, new HashPartitioner(parts)) val reduceRdd1 = new MyRDD(sc, parts, List(shuffleDep), tracker = mapOutputTracker) reduceRdd1.countAsync() latch.await() // set _shuffleMergedFinalized to true can avoid the hang. // shuffleDep._shuffleMergedFinalized = true val reduceRdd2 = new MyRDD(sc, parts, List(shuffleDep)) reduceRdd2.count() } {code} > Stage has all tasks finished but with ongoing finalization can cause job hang > - > > Key: SPARK-36558 > URL: https://issues.apache.org/jira/browse/SPARK-36558 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.2.0, 3.3.0 >Reporter: wuyi >Priority: Blocker > > > For a stage that all tasks are finished but with ongoing
[jira] [Commented] (SPARK-36558) Stage has all tasks finished but with ongoing finalization can cause job hang
[ https://issues.apache.org/jira/browse/SPARK-36558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403456#comment-17403456 ] wuyi commented on SPARK-36558: -- [~vsowrirajan] Sorry, I missed one tweaked change in `MyRDD`. We should allow `MyRDD` to compute: {code:java} override def compute(split: Partition, context: TaskContext): Iterator[(Int, Int)] = { Iterator.single((1, 1)) }{code} Could you help verify it again? > Stage has all tasks finished but with ongoing finalization can cause job hang > - > > Key: SPARK-36558 > URL: https://issues.apache.org/jira/browse/SPARK-36558 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.2.0, 3.3.0 >Reporter: wuyi >Priority: Blocker > > > For a stage that all tasks are finished but with ongoing finalization can > lead to job hang. The problem is that such stage is considered as a "missing" > stage (see > [https://github.com/apache/spark/blob/a47ceaf5492040063e31e17570678dc06846c36c/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L719-L721).] > And it breaks the original assumption that a "missing" stage must have tasks > to run. > Normally, if stage A is the parent of (result) stage B and all tasks have > finished in stage A, stage A will be skipped directly when submitting stage > B. However, with this bug, stage A will be submitted. And submitting a stage > with no tasks to run would not be able to add its child stage into the > waiting stage list, which leads to the job hang in the end. > > The example to reproduce: > {code:java} > test("Job hang") { > initPushBasedShuffleConfs(conf) > conf.set(config.SHUFFLE_MERGER_LOCATIONS_MIN_STATIC_THRESHOLD, 5) > DAGSchedulerSuite.clearMergerLocs > DAGSchedulerSuite.addMergerLocs(Seq("host1", "host2", "host3", "host4", > "host5")) > val latch = new CountDownLatch(1) > val myDAGScheduler = new MyDAGScheduler( > sc, > sc.dagScheduler.taskScheduler, > sc.listenerBus, > sc.env.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster], > sc.env.blockManager.master, > sc.env) { > override def scheduleShuffleMergeFinalize(stage: ShuffleMapStage): Unit = > { > // By this, we can mimic a stage with all tasks finished > // but finalization is incomplete. > latch.countDown() > } > } > sc.dagScheduler = myDAGScheduler > sc.taskScheduler.setDAGScheduler(myDAGScheduler) > val parts = 20 > val shuffleMapRdd = new MyRDD(sc, parts, Nil) > val shuffleDep = new ShuffleDependency(shuffleMapRdd, new > HashPartitioner(parts)) > val reduceRdd1 = new MyRDD(sc, parts, List(shuffleDep), tracker = > mapOutputTracker) > reduceRdd1.countAsync() > latch.await() > // set _shuffleMergedFinalized to true can avoid the hang. > // shuffleDep._shuffleMergedFinalized = true > val reduceRdd2 = new MyRDD(sc, parts, List(shuffleDep)) > reduceRdd2.count() > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36560) Deflake PySpark coverage report
[ https://issues.apache.org/jira/browse/SPARK-36560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-36560. -- Fix Version/s: 3.3.0 Assignee: Hyukjin Kwon Resolution: Fixed Fixed in https://github.com/apache/spark/pull/33808 > Deflake PySpark coverage report > --- > > Key: SPARK-36560 > URL: https://issues.apache.org/jira/browse/SPARK-36560 > Project: Spark > Issue Type: Improvement > Components: Project Infra, PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.3.0 > > > https://github.com/apache/spark/runs/3388727798?check_suite_focus=true > https://github.com/apache/spark/runs/3392972609?check_suite_focus=true > https://github.com/apache/spark/runs/3359880048?check_suite_focus=true > https://github.com/apache/spark/runs/3338876122?check_suite_focus=true > PySpark scheduled coverage jobs are flaky. We should deflake them -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403446#comment-17403446 ] Dongjoon Hyun commented on SPARK-25075: --- FYI, here is the pre-built Apache Spark 3.2.0 RC1 spark-shell (Scala 2.13.5) result. {code} $ bin/spark-shell ... Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.2.0 /_/ Using Scala version 2.13.5 (OpenJDK 64-Bit Server VM, Java 1.8.0_302) Type in expressions to have them evaluated. Type :help for more information. Spark context Web UI available at http://localhost:4040 Spark context available as 'sc' (master = local[*], app id = local-1629769598806). Spark session available as 'spark'. {code} > Build and test Spark against Scala 2.13 > --- > > Key: SPARK-25075 > URL: https://issues.apache.org/jira/browse/SPARK-25075 > Project: Spark > Issue Type: Umbrella > Components: Build, MLlib, Project Infra, Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Guillaume Massé >Priority: Major > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.13 milestone. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403445#comment-17403445 ] Dongjoon Hyun commented on SPARK-25075: --- [~ekrich]. Yes, SPARK-34218 completed it already for Apache Spark 3.2.0. You can see the pre-built binaries at Apache Spark 3.2.0 RC1 vote artifacts. - https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-bin/ {code} spark-3.2.0-bin-hadoop3.2-scala2.13.tgz spark-3.2.0-bin-hadoop3.2-scala2.13.tgz.asc spark-3.2.0-bin-hadoop3.2-scala2.13.tgz.sha512 {code} The RC1 vote failed and we will proceed to RC2. Please download the Scala 2.13 pre-build artifacts and provide your feedbacks, [~ekrich]. > Build and test Spark against Scala 2.13 > --- > > Key: SPARK-25075 > URL: https://issues.apache.org/jira/browse/SPARK-25075 > Project: Spark > Issue Type: Umbrella > Components: Build, MLlib, Project Infra, Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Guillaume Massé >Priority: Major > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.13 milestone. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36570) Investigate native support for raw data containing commas
Xinrong Meng created SPARK-36570: Summary: Investigate native support for raw data containing commas Key: SPARK-36570 URL: https://issues.apache.org/jira/browse/SPARK-36570 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: Xinrong Meng For raw data containing commas as thousands separators, pandas handles them automatically. We should recognize them as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
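A minimal pandas-on-Spark sketch of the manual clean-up that native support would make unnecessary (the sample values and the str.replace-based workaround are illustrative assumptions, not part of the ticket):
{code:python}
import pyspark.pandas as ps

psser = ps.Series(["1,000", "2,500,000", "75"])

# Today the thousands separators have to be stripped by hand before converting;
# SPARK-36570 asks for them to be recognized natively, as pandas can do for the
# reporter's use case.
cleaned = psser.str.replace(",", "").astype(int)
print(cleaned.to_list())  # [1000, 2500000, 75]
{code}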
[jira] [Created] (SPARK-36569) Support errors='coerce' for ps.to_numeric
Xinrong Meng created SPARK-36569: Summary: Support errors='coerce' for ps.to_numeric Key: SPARK-36569 URL: https://issues.apache.org/jira/browse/SPARK-36569 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: Xinrong Meng -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
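SPARK-36569 carries no description, but the pandas semantics it refers to are simple: with errors='coerce', values that cannot be parsed become NaN instead of raising. A short sketch of the target behavior; the pandas-on-Spark call is shown commented out since it is exactly what this ticket would enable:
{code:python}
import pandas as pd

# pandas semantics that ps.to_numeric would match: unparsable values are
# coerced to NaN rather than raising a ValueError.
print(pd.to_numeric(pd.Series(["1.0", "2", "oops"]), errors="coerce"))
# 0    1.0
# 1    2.0
# 2    NaN
# dtype: float64

# The equivalent pandas-on-Spark call this ticket would add support for:
#   import pyspark.pandas as ps
#   ps.to_numeric(ps.Series(["1.0", "2", "oops"]), errors="coerce")
{code}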
[jira] [Created] (SPARK-36568) Missed broadcast join in V2 plan
Bruce Robbins created SPARK-36568: - Summary: Missed broadcast join in V2 plan Key: SPARK-36568 URL: https://issues.apache.org/jira/browse/SPARK-36568 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: Bruce Robbins There are some joins that use broadcast hash join with DataSourceV1 but sort merge join with DataSourceV2, even though the two joins are loading the same files [1]. Example: Create data: {noformat} import scala.util.Random val rand = new Random(245665L) val df = spark.range(1, 2).map { x => (x, rand.alphanumeric.take(20).mkString, rand.alphanumeric.take(20).mkString, rand.alphanumeric.take(20).mkString ) }.toDF("key", "col1", "col2", "col3") df.write.mode("overwrite").format("parquet").save("/tmp/tbl") df.write.mode("overwrite").format("parquet").save("/tmp/lookup") {noformat} Run this code: {noformat} bin/spark-shell --conf spark.sql.autoBroadcastJoinThreshold=40 spark.read.format("parquet").load("/tmp/tbl").createOrReplaceTempView("tbl") spark.read.format("parquet").load("/tmp/lookup").createOrReplaceTempView("lookup") sql("""select t.key, t.col1, t.col2, t.col3 from tbl t join lookup l on t.key = l.key""").explain {noformat} For V2, do the same, except set {{spark.sql.sources.useV1SourceList=""}}. For V1, the result is: {noformat} == Physical Plan == AdaptiveSparkPlan isFinalPlan=false +- Project [key#0L, col1#1, col2#2, col3#3] +- BroadcastHashJoin [key#0L], [key#8L], Inner, BuildRight, false :- Filter isnotnull(key#0L) : +- FileScan parquet [key#0L,col1#1,col2#2,col3#3] Batched: true, DataFilters: [isnotnull(key#0L)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/tbl], PartitionFilters: [], PushedFilters: [IsNotNull(key)], ReadSchema: struct +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]),false), [id=#32] +- Filter isnotnull(key#8L) +- FileScan parquet [key#8L] Batched: true, DataFilters: [isnotnull(key#8L)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/lookup], PartitionFilters: [], PushedFilters: [IsNotNull(key)], ReadSchema: struct {noformat} For V2, the result is: {noformat} == Physical Plan == AdaptiveSparkPlan isFinalPlan=false +- Project [key#0L, col1#1, col2#2, col3#3] +- SortMergeJoin [key#0L], [key#8L], Inner :- Sort [key#0L ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(key#0L, 200), ENSURE_REQUIREMENTS, [id=#33] : +- Filter isnotnull(key#0L) :+- BatchScan[key#0L, col1#1, col2#2, col3#3] ParquetScan DataFilters: [isnotnull(key#0L)], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/tbl], PartitionFilters: [], PushedFilters: [IsNotNull(key)], ReadSchema: struct, PushedFilters: [IsNotNull(key)] RuntimeFilters: [] +- Sort [key#8L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(key#8L, 200), ENSURE_REQUIREMENTS, [id=#34] +- Filter isnotnull(key#8L) +- BatchScan[key#8L] ParquetScan DataFilters: [isnotnull(key#8L)], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/lookup], PartitionFilters: [], PushedFilters: [IsNotNull(key)], ReadSchema: struct, PushedFilters: [IsNotNull(key)] RuntimeFilters: [] {noformat} The initial plan with V1 uses broadcast hash join, but the initial plan with V2 uses sort merge join. The V1 logical plan contains a projection over the relation for {{lookup}}, which restricts the output columns to just {{key}}. 
As a result, {{SizeInBytesOnlyStatsPlanVisitor#visitUnaryNode}}, when visiting the project node, reduces sizeInBytes based on the pruning: {noformat} Project [key#0L, col1#1, col2#2, col3#3] +- Join Inner, (key#0L = key#8L) :- Filter isnotnull(key#0L) : +- Relation [key#0L,col1#1,col2#2,col3#3] parquet +- Project [key#8L] +- Filter isnotnull(key#8L) +- Relation [key#8L,col1#9,col2#10,col3#11] parquet {noformat} The V2 logical plan does not contain this projection: {noformat} +- Join Inner, (key#0L = key#8L) :- Filter isnotnull(key#0L) : +- RelationV2[key#0L, col1#1, col2#2, col3#3] parquet file:/tmp/tbl +- Filter isnotnull(key#8L) +- RelationV2[key#8L] parquet file:/tmp/lookup {noformat} [1] With my example, AQE converts the join to a broadcast hash join at run time for the V2 case. However, if AQE was disabled, it would obviously remain a sort merge join. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
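For anyone reproducing SPARK-36568 from PySpark, here is a sketch of the same setup that also covers footnote [1]: with AQE disabled, the V2 plan keeps the sort merge join at runtime too. The settings are standard Spark SQL configs and the paths/threshold mirror the report; this is an illustration, not a definitive test:
{code:python}
# Launch with the V1 source list emptied so the V2 path is exercised, e.g.:
#   bin/pyspark --conf spark.sql.autoBroadcastJoinThreshold=40 \
#               --conf spark.sql.sources.useV1SourceList=""
# Disabling AQE keeps the initial sort merge join as the runtime plan as well.
spark.conf.set("spark.sql.adaptive.enabled", "false")

spark.read.format("parquet").load("/tmp/tbl").createOrReplaceTempView("tbl")
spark.read.format("parquet").load("/tmp/lookup").createOrReplaceTempView("lookup")

spark.sql("""
    select t.key, t.col1, t.col2, t.col3
    from tbl t join lookup l on t.key = l.key
""").explain()
{code}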
[jira] [Commented] (SPARK-36558) Stage has all tasks finished but with ongoing finalization can cause job hang
[ https://issues.apache.org/jira/browse/SPARK-36558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403404#comment-17403404 ] Venkata krishnan Sowrirajan commented on SPARK-36558: - [~Ngone51] I tried the above test which you shared but I needed to make few changes to basically have shuffle merge enabled properly, without that what the `scheduleShuffleMergeFinalize` won't get invoked therefore it can infinitely wait with the CountDownLatch. Please take a look at the modified test below. 1. Keep `numMergerLocs` equal to `numParts` for the stage to have shuffleMergeEnabled 2. Also use `submit` option inside `DAGSchedulerSuite` to run an action on an RDD. Since the 'runningStages` set would have the stage currently in merge finalization (but all the partitions are available) step therefore the same stage won't get submitted with empty partitions to be computed. This check [here|[https://github.com/apache/spark/blob/a47ceaf5492040063e31e17570678dc06846c36c/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1253]] would prevent the same stage getting submitted again. Also below is the fixed unit test and this seems to run fine. {code:java} test("Job hang") { initPushBasedShuffleConfs(conf) conf.set("spark.shuffle.push.mergerLocations.minThreshold", "5") DAGSchedulerSuite.clearMergerLocs DAGSchedulerSuite.addMergerLocs(Seq("host1", "host2", "host3", "host4", "host5")) val latch = new CountDownLatch(1) scheduler = new MyDAGScheduler( sc, taskScheduler, sc.listenerBus, mapOutputTracker, blockManagerMaster, sc.env) { override private[spark] def scheduleShuffleMergeFinalize( stage: ShuffleMapStage): Unit = { // By this, we can mimic a stage with all tasks finished // but finalization is incomplete. latch.countDown() // super.scheduleShuffleMergeFinalize(stage) } } dagEventProcessLoopTester = new DAGSchedulerEventProcessLoopTester(scheduler) val parts = 5 val shuffleMapRdd = new MyRDD(sc, parts, Nil) val shuffleDep = new ShuffleDependency(shuffleMapRdd, new HashPartitioner(parts)) val reduceRdd1 = new MyRDD(sc, parts, List(shuffleDep), tracker = mapOutputTracker) submit(reduceRdd1, (0 until parts).toArray) completeShuffleMapStageSuccessfully(0, 0, parts) completeNextResultStageWithSuccess(1, 0) latch.await() // set _shuffleMergedFinalized to true can avoid the hang. val reduceRdd2 = new MyRDD(sc, parts, List(shuffleDep)) submit(reduceRdd2, (0 until parts).toArray) completeNextResultStageWithSuccess(3, 0) } {code} Let me know your thoughts. Am I missing something here? cc [~mshen] [~mridulm80] > Stage has all tasks finished but with ongoing finalization can cause job hang > - > > Key: SPARK-36558 > URL: https://issues.apache.org/jira/browse/SPARK-36558 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.2.0, 3.3.0 >Reporter: wuyi >Priority: Blocker > > > For a stage that all tasks are finished but with ongoing finalization can > lead to job hang. The problem is that such stage is considered as a "missing" > stage (see > [https://github.com/apache/spark/blob/a47ceaf5492040063e31e17570678dc06846c36c/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L719-L721).] > And it breaks the original assumption that a "missing" stage must have tasks > to run. > Normally, if stage A is the parent of (result) stage B and all tasks have > finished in stage A, stage A will be skipped directly when submitting stage > B. However, with this bug, stage A will be submitted. 
And submitting a stage > with no tasks to run would not be able to add its child stage into the > waiting stage list, which leads to the job hang in the end. > > The example to reproduce: > {code:java} > test("Job hang") { > initPushBasedShuffleConfs(conf) > conf.set(config.SHUFFLE_MERGER_LOCATIONS_MIN_STATIC_THRESHOLD, 5) > DAGSchedulerSuite.clearMergerLocs > DAGSchedulerSuite.addMergerLocs(Seq("host1", "host2", "host3", "host4", > "host5")) > val latch = new CountDownLatch(1) > val myDAGScheduler = new MyDAGScheduler( > sc, > sc.dagScheduler.taskScheduler, > sc.listenerBus, > sc.env.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster], > sc.env.blockManager.master, > sc.env) { > override def scheduleShuffleMergeFinalize(stage: ShuffleMapStage): Unit = > { > // By this, we can mimic a stage with all tasks finished > // but finalization is incomplete. > latch.countDown() > } > } > sc.dagScheduler = myDAGScheduler > sc.taskScheduler.setDAGScheduler(myDAGScheduler) > val parts =
[jira] [Assigned] (SPARK-36567) Support foldable special datetime values in CAST
[ https://issues.apache.org/jira/browse/SPARK-36567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36567: Assignee: Apache Spark (was: Max Gekk) > Support foldable special datetime values in CAST > > > Key: SPARK-36567 > URL: https://issues.apache.org/jira/browse/SPARK-36567 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > The PR https://github.com/apache/spark/pull/32714 disallowed special datetime > values in the CAST expression, and allowed them only in typed literals. So, > the following code doesn't work anymore: > {code:sql} > spark-sql> select date('today'); > NULL > spark-sql> select cast('today' as date); > NULL > {code} > Though, users can replace them by: > {code:sql} > spark-sql> select date'today'; > 2021-08-23 > {code} > but the changes can break users apps. The purpose of this ticket is to > support special datetime strings when the strings in casts are foldable. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36567) Support foldable special datetime values in CAST
[ https://issues.apache.org/jira/browse/SPARK-36567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36567: Assignee: Max Gekk (was: Apache Spark) > Support foldable special datetime values in CAST > > > Key: SPARK-36567 > URL: https://issues.apache.org/jira/browse/SPARK-36567 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > The PR https://github.com/apache/spark/pull/32714 disallowed special datetime > values in the CAST expression, and allowed them only in typed literals. So, > the following code doesn't work anymore: > {code:sql} > spark-sql> select date('today'); > NULL > spark-sql> select cast('today' as date); > NULL > {code} > Though, users can replace them by: > {code:sql} > spark-sql> select date'today'; > 2021-08-23 > {code} > but the changes can break users apps. The purpose of this ticket is to > support special datetime strings when the strings in casts are foldable. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36567) Support foldable special datetime values in CAST
[ https://issues.apache.org/jira/browse/SPARK-36567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403365#comment-17403365 ] Apache Spark commented on SPARK-36567: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/33816 > Support foldable special datetime values in CAST > > > Key: SPARK-36567 > URL: https://issues.apache.org/jira/browse/SPARK-36567 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > The PR https://github.com/apache/spark/pull/32714 disallowed special datetime > values in the CAST expression, and allowed them only in typed literals. So, > the following code doesn't work anymore: > {code:sql} > spark-sql> select date('today'); > NULL > spark-sql> select cast('today' as date); > NULL > {code} > Though, users can replace them by: > {code:sql} > spark-sql> select date'today'; > 2021-08-23 > {code} > but the changes can break users apps. The purpose of this ticket is to > support special datetime strings when the strings in casts are foldable. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36567) Support foldable special datetime values in CAST
[ https://issues.apache.org/jira/browse/SPARK-36567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403366#comment-17403366 ] Apache Spark commented on SPARK-36567: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/33816 > Support foldable special datetime values in CAST > > > Key: SPARK-36567 > URL: https://issues.apache.org/jira/browse/SPARK-36567 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > The PR https://github.com/apache/spark/pull/32714 disallowed special datetime > values in the CAST expression, and allowed them only in typed literals. So, > the following code doesn't work anymore: > {code:sql} > spark-sql> select date('today'); > NULL > spark-sql> select cast('today' as date); > NULL > {code} > Though, users can replace them by: > {code:sql} > spark-sql> select date'today'; > 2021-08-23 > {code} > but the changes can break users apps. The purpose of this ticket is to > support special datetime strings when the strings in casts are foldable. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36567) Support foldable special datetime values in CAST
Max Gekk created SPARK-36567: Summary: Support foldable special datetime values in CAST Key: SPARK-36567 URL: https://issues.apache.org/jira/browse/SPARK-36567 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Max Gekk Assignee: Max Gekk The PR https://github.com/apache/spark/pull/32714 disallowed special datetime values in the CAST expression, and allowed them only in typed literals. So, the following code doesn't work anymore: {code:sql} spark-sql> select date('today'); NULL spark-sql> select cast('today' as date); NULL {code} Though, users can replace them by: {code:sql} spark-sql> select date'today'; 2021-08-23 {code} but the changes can break users apps. The purpose of this ticket is to support special datetime strings when the strings in casts are foldable. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
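As a concrete illustration of the behavior described in SPARK-36567, the snippet below contrasts the typed literal (still resolved) with the cast of a foldable string (currently returning NULL). This is a minimal sketch assuming a 3.2.0-era session; the actual dates printed depend on the session time zone and the day it is run.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("special-datetime-cast").getOrCreate()

# Typed literal: the special string 'today' is resolved at analysis time.
spark.sql("SELECT DATE'today' AS d").show()

# Cast of a foldable string literal: after the change referenced above this
# yields NULL; the ticket proposes resolving such foldable inputs again.
spark.sql("SELECT CAST('today' AS DATE) AS d").show()
{code}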
[jira] [Commented] (SPARK-32333) Drop references to Master
[ https://issues.apache.org/jira/browse/SPARK-32333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403348#comment-17403348 ] Neil Shah-Quinn commented on SPARK-32333: - I like "leader" too. > Drop references to Master > - > > Key: SPARK-32333 > URL: https://issues.apache.org/jira/browse/SPARK-32333 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > We have a lot of references to "master" in the code base. It will be > beneficial to remove references to problematic language that can alienate > potential community members. > SPARK-32004 removed references to slave > > Here is a IETF draft to fix up some of the most egregious examples > (master/slave, whitelist/backlist) with proposed alternatives. > https://tools.ietf.org/id/draft-knodel-terminology-00.html#rfc.section.1.1.1 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29223) Kafka source: offset by timestamp - allow specifying timestamp for "all partitions"
[ https://issues.apache.org/jira/browse/SPARK-29223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403341#comment-17403341 ] Andrew Grigorev commented on SPARK-29223: - Just an idea - couldn't this be implemented in "timestamp field filter pushdown"-like way? > Kafka source: offset by timestamp - allow specifying timestamp for "all > partitions" > --- > > Key: SPARK-29223 > URL: https://issues.apache.org/jira/browse/SPARK-29223 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Minor > Fix For: 3.2.0 > > > This issue is a follow-up of SPARK-26848. > In SPARK-26848, we decided to open possibility to let end users set > individual timestamp per partition. But in many cases, specifying timestamp > represents the intention that we would want to go back to specific timestamp > and reprocess records, which should be applied to all topics and partitions. > According to the format of > `startingOffsetsByTimestamp`/`endingOffsetsByTimestamp`, while it's not > intuitive to provide an option to set a global timestamp across topic, it's > still intuitive to provide an option to set a global timestamp across > partitions in a topic. > This issue tracks the efforts to deal with this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
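For context, the sketch below shows how per-partition timestamps are passed today through `startingOffsetsByTimestamp`; the bootstrap servers, topic, and epoch-millisecond values are placeholders, and the single "all partitions" timestamp discussed in this ticket is not assumed to exist.

{code:python}
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-offsets-by-timestamp").getOrCreate()

# Current API: an epoch-millisecond timestamp must be listed for every
# partition of every subscribed topic.
starting = {"events": {"0": 1629700000000, "1": 1629700000000}}

stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder
    .option("subscribe", "events")                         # placeholder topic
    .option("startingOffsetsByTimestamp", json.dumps(starting))
    .load()
)
{code}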
[jira] [Created] (SPARK-36566) Add Spark appname as a label to the executor pods
Holden Karau created SPARK-36566: Summary: Add Spark appname as a label to the executor pods Key: SPARK-36566 URL: https://issues.apache.org/jira/browse/SPARK-36566 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.3.0 Reporter: Holden Karau Adding the appName as a label to the executor pods could simplify debugging. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
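Until an automatic label exists, the application name can already be propagated by hand through the existing `spark.kubernetes.executor.label.*` configuration. A minimal sketch follows; the master URL, container image, and label key are placeholders rather than anything this ticket prescribes.

{code:python}
from pyspark.sql import SparkSession

# Manually attach the app name as an executor pod label via existing config.
spark = (
    SparkSession.builder
    .appName("my-pipeline")
    .master("k8s://https://kubernetes.default.svc")                     # placeholder
    .config("spark.kubernetes.container.image", "example/spark:3.3.0")  # placeholder
    .config("spark.kubernetes.executor.label.spark-app-name", "my-pipeline")
    .getOrCreate()
)
{code}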
[jira] [Commented] (SPARK-36488) "Invalid usage of '*' in expression" error due to the feature of 'quotedRegexColumnNames' in some scenarios.
[ https://issues.apache.org/jira/browse/SPARK-36488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403329#comment-17403329 ] Pablo Langa Blanco commented on SPARK-36488: Yes, I agree that it's not very intuitive and that it has room for improvement. I think it's worth taking a closer look at supporting more expressions and getting closer to the Hive feature. Thanks! > "Invalid usage of '*' in expression" error due to the feature of > 'quotedRegexColumnNames' in some scenarios. > > > Key: SPARK-36488 > URL: https://issues.apache.org/jira/browse/SPARK-36488 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.4.8, 3.1.2 >Reporter: merrily01 >Priority: Major > > In some cases, the error happens when the following property is set. > {code:java} > spark.sql("set spark.sql.parser.quotedRegexColumnNames=true") > {code} > *case 1:* > {code:java} > spark-sql> create table tb_test as select 1 as col_a, 2 as col_b; > spark-sql> select `tb_test`.`col_a` from tb_test; > 1 > spark-sql> set spark.sql.parser.quotedRegexColumnNames=true; > spark-sql> select `tb_test`.`col_a` from tb_test; > Error in query: Invalid usage of '*' in expression 'unresolvedextractvalue' > {code} > > *case 2:* > {code:java} > > select `col_a`/`col_b` as `col_c` from (select 3 as `col_a` , > 3.14 as `col_b`); > 0.955414 > spark-sql> set spark.sql.parser.quotedRegexColumnNames=true; > spark-sql> select `col_a`/`col_b` as `col_c` from (select 3 as `col_a` , > 3.14 as `col_b`); > Error in query: Invalid usage of '*' in expression 'divide' > {code} > > This problem exists in 3.X, 2.4.X and master versions. > > Related issue : > https://issues.apache.org/jira/browse/SPARK-12139 > (As can be seen in the latest comments, some people have encountered the same > problem) > > Similar problems: > https://issues.apache.org/jira/browse/SPARK-28897 > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
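To make the trade-off concrete, the sketch below shows what the flag is intended to enable (regex column selection, following the Hive convention the feature borrows from) next to the qualified-name query that currently fails; the view name is made up for illustration.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quoted-regex-columns").getOrCreate()
spark.range(1).selectExpr("1 AS col_a", "2 AS col_b").createOrReplaceTempView("tb_test")

# With the flag on, back-quoted identifiers are treated as column regexes.
spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true")
spark.sql("SELECT `col_.*` FROM tb_test").show()

# The same interpretation is what breaks qualified references such as
# `tb_test`.`col_a`; turning the flag back off is the immediate workaround.
spark.sql("SET spark.sql.parser.quotedRegexColumnNames=false")
spark.sql("SELECT `tb_test`.`col_a` FROM tb_test").show()
{code}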
[jira] [Updated] (SPARK-36565) "Unscaled value too large for precision" while reading simple parquet file readable with parquet-tools and pandas
[ https://issues.apache.org/jira/browse/SPARK-36565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lorenzo Martini updated SPARK-36565: Attachment: broken_parquet_file.parquet > "Unscaled value too large for precision" while reading simple parquet file > readable with parquet-tools and pandas > - > > Key: SPARK-36565 > URL: https://issues.apache.org/jira/browse/SPARK-36565 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 3.1.2 >Reporter: Lorenzo Martini >Priority: Minor > Attachments: broken_parquet_file.parquet > > > I have a simple parquet file (attached to the ticket) with 2 columns > (array, decimal) that can be read and viewed correctly using pandas > or parquet-tools. Reading the parquet file in spark (and pyspark) seems to > work, but calling `.show()` throws the exception (1) with > {code:java} > Caused by: java.lang.ArithmeticException: Unscaled value too large for > precision at. > {code} > > > Another interesting detail is that reading the parquet file and doing a > select on individual columns allows for `show()` to work correctly without > throwing. > {code:java} > >>> repro = spark.read.parquet(".../broken_file.parquet") > >>> repro.printSchema() > root > |-- column_a: array (nullable = true) > ||-- element: string (containsNull = true) > |-- column_b: decimal(4,0) (nullable = true) > >>> repro.select("column_a").show() > ++ > |column_a| > ++ > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > ++ > >>> repro.select("column_b").show() > ++ > |column_b| > ++ > | 11590| > | 11590| > | 11590| > | 11590| > | 11590| > | 11590| > | 11590| > | 11590| > | 11590| > | 11590| > ++ > >>> repro.show() // THIS ONE THROWS EXCEPTION (1) > {code} > Using `parquet-tools` shows the dataset correctly > {code:java} > >>> parquet-tools show broken_file.parquet > +++ > | column_a | column_b | > |+| > || 11590 | > || 11590 | > || 11590 | > || 11590 | > || 11590 | > || 11590 | > || 11590 | > || 11590 | > || 11590 | > || 11590 | > +++ > {code} > > > And the same with `pandas` > {code:java} > >>> import pandas as pd > >>> pd.read_parquet(".../broken_file.parquet") > column_a column_b > 0 None11590 > 1 None11590 > 2 None11590 > 3 None11590 > 4 None11590 > 5 None11590 > 6 None11590 > 7 None11590 > 8 None11590 > 9 None11590 > {code} > > I have also verified this affects all versions of spark between 2.4.0 and > 3.1.2 > Here the Exception (1) thrown (sorry about the poor formatting, didn't seem > to manage to make it work): > > {code:java} > >>> spark.version '3.1.2' > >>> df = spark.read.parquet(".../broken_parquet_file.parquet") > >>> df > DataFrame[column_a: array, column_b: decimal(4,0)] > >>> df.show() 21/08/23 18:39:36 > ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)/ 1] > org.apache.spark.sql.execution.QueryExecutionException: Encounter error while > reading parquet files. One possible cause: Parquet column cannot be converted > in the corresponding files. 
Details: at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345) > at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at > org.apache.spark.scheduler.Task.run(Task.scala:131) at >
[jira] [Created] (SPARK-36565) "Unscaled value too large for precision" while reading simple parquet file readable with parquet-tools and pandas
Lorenzo Martini created SPARK-36565: --- Summary: "Unscaled value too large for precision" while reading simple parquet file readable with parquet-tools and pandas Key: SPARK-36565 URL: https://issues.apache.org/jira/browse/SPARK-36565 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.1.2, 2.4.0 Reporter: Lorenzo Martini Attachments: broken_parquet_file.parquet I have a simple parquet file (attached to the ticket) with 2 columns (array, decimal) that can be read and viewed correctly using pandas or parquet-tools. Reading the parquet file in spark (and pyspark) seems to work, but calling `.show()` throws the exception (1) with {code:java} Caused by: java.lang.ArithmeticException: Unscaled value too large for precision at. {code} Another interesting detail is that reading the parquet file and doing a select on individual columns allows for `show()` to work correctly without throwing. {code:java} >>> repro = spark.read.parquet(".../broken_file.parquet") >>> repro.printSchema() root |-- column_a: array (nullable = true) ||-- element: string (containsNull = true) |-- column_b: decimal(4,0) (nullable = true) >>> repro.select("column_a").show() ++ |column_a| ++ |null| |null| |null| |null| |null| |null| |null| |null| |null| |null| ++ >>> repro.select("column_b").show() ++ |column_b| ++ | 11590| | 11590| | 11590| | 11590| | 11590| | 11590| | 11590| | 11590| | 11590| | 11590| ++ >>> repro.show() // THIS ONE THROWS EXCEPTION (1) {code} Using `parquet-tools` shows the dataset correctly {code:java} >>> parquet-tools show broken_file.parquet +++ | column_a | column_b | |+| || 11590 | || 11590 | || 11590 | || 11590 | || 11590 | || 11590 | || 11590 | || 11590 | || 11590 | || 11590 | +++ {code} And the same with `pandas` {code:java} >>> import pandas as pd >>> pd.read_parquet(".../broken_file.parquet") column_a column_b 0 None11590 1 None11590 2 None11590 3 None11590 4 None11590 5 None11590 6 None11590 7 None11590 8 None11590 9 None11590 {code} I have also verified this affects all versions of spark between 2.4.0 and 3.1.2 Here the Exception (1) thrown (sorry about the poor formatting, didn't seem to manage to make it work): {code:java} >>> spark.version '3.1.2' >>> df = spark.read.parquet(".../broken_parquet_file.parquet") >>> df DataFrame[column_a: array, column_b: decimal(4,0)] >>> df.show() 21/08/23 18:39:36 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)/ 1] org.apache.spark.sql.execution.QueryExecutionException: Encounter error while reading parquet files. One possible cause: Parquet column cannot be converted in the corresponding files. 
Details: at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:131) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) at java.base/java.lang.Thread.run(Thread.java:832) Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file
[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403305#comment-17403305 ] Eric Richardson commented on SPARK-25075: - Is it possible to have both the 2.12 and 2.13 versions pre-built for download? I think this could accelerate adoption to 2.13 and thus Scala 3. It is fine if the 2.12 is considered production and 2.13 is experimental. > Build and test Spark against Scala 2.13 > --- > > Key: SPARK-25075 > URL: https://issues.apache.org/jira/browse/SPARK-25075 > Project: Spark > Issue Type: Umbrella > Components: Build, MLlib, Project Infra, Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Guillaume Massé >Priority: Major > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.13 milestone. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34952) DS V2 Aggregate push down
[ https://issues.apache.org/jira/browse/SPARK-34952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403292#comment-17403292 ] Apache Spark commented on SPARK-34952: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/33815 > DS V2 Aggregate push down > - > > Key: SPARK-34952 > URL: https://issues.apache.org/jira/browse/SPARK-34952 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.2.0 > > > Push down aggregate to data source for better performance. This will be done > in two steps: > 1. add aggregate push down APIs and the implementation in JDBC > 2. add the implementation in Parquet. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34952) DS V2 Aggregate push down
[ https://issues.apache.org/jira/browse/SPARK-34952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403291#comment-17403291 ] Apache Spark commented on SPARK-34952: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/33815 > DS V2 Aggregate push down > - > > Key: SPARK-34952 > URL: https://issues.apache.org/jira/browse/SPARK-34952 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.2.0 > > > Push down aggregate to data source for better performance. This will be done > in two steps: > 1. add aggregate push down APIs and the implementation in JDBC > 2. add the implementation in Parquet. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
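For readers new to the feature, the sketch below shows the shape of query the JDBC step targets: a grouped MIN/MAX/SUM/COUNT-style aggregate over a JDBC table. The connection details are placeholders, and whether the aggregate is actually pushed to the database depends on the data source path and the Spark version in use.

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ds-v2-agg-pushdown").getOrCreate()

# Placeholder JDBC connection details.
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db.example.com/shop")
    .option("dbtable", "orders")
    .option("user", "reader")
    .option("password", "secret")
    .load()
)

# Grouped aggregates like this are the candidates for push-down; explain()
# shows whether the scan reports a pushed aggregate.
orders.groupBy("customer_id").agg(F.sum("amount").alias("total")).explain()
{code}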
[jira] [Assigned] (SPARK-36396) Implement DataFrame.cov
[ https://issues.apache.org/jira/browse/SPARK-36396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36396: Assignee: (was: Apache Spark) > Implement DataFrame.cov > --- > > Key: SPARK-36396 > URL: https://issues.apache.org/jira/browse/SPARK-36396 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36396) Implement DataFrame.cov
[ https://issues.apache.org/jira/browse/SPARK-36396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36396: Assignee: Apache Spark > Implement DataFrame.cov > --- > > Key: SPARK-36396 > URL: https://issues.apache.org/jira/browse/SPARK-36396 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35991) Add PlanStability suite for TPCH
[ https://issues.apache.org/jira/browse/SPARK-35991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403241#comment-17403241 ] Apache Spark commented on SPARK-35991: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/33813 > Add PlanStability suite for TPCH > > > Key: SPARK-35991 > URL: https://issues.apache.org/jira/browse/SPARK-35991 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35991) Add PlanStability suite for TPCH
[ https://issues.apache.org/jira/browse/SPARK-35991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403243#comment-17403243 ] Apache Spark commented on SPARK-35991: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/33813 > Add PlanStability suite for TPCH > > > Key: SPARK-35991 > URL: https://issues.apache.org/jira/browse/SPARK-35991 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36396) Implement DataFrame.cov
[ https://issues.apache.org/jira/browse/SPARK-36396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403240#comment-17403240 ] Apache Spark commented on SPARK-36396: -- User 'dgd-contributor' has created a pull request for this issue: https://github.com/apache/spark/pull/33814 > Implement DataFrame.cov > --- > > Key: SPARK-36396 > URL: https://issues.apache.org/jira/browse/SPARK-36396 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
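Since the ticket tracks parity with the pandas API, the reference semantics are those of `pandas.DataFrame.cov` (pairwise covariance of the numeric columns). The short example below uses plain pandas, as the pandas-on-Spark counterpart this PR would add may not exist in a given Spark version.

{code:python}
import pandas as pd

# Pairwise covariance of numeric columns: the behavior the pandas-on-Spark
# DataFrame.cov implementation aims to match.
pdf = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 1.5, 3.5, 5.0]})
print(pdf.cov())
{code}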
[jira] [Assigned] (SPARK-36564) LiveRDDDistribution.toApi throws NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-36564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36564: Assignee: Apache Spark > LiveRDDDistribution.toApi throws NullPointerException > - > > Key: SPARK-36564 > URL: https://issues.apache.org/jira/browse/SPARK-36564 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0 >Reporter: wuyi >Assignee: Apache Spark >Priority: Major > > {code:java} > 21/08/23 12:26:29 ERROR AsyncEventQueue: Listener AppStatusListener threw an > exception > java.lang.NullPointerException > at > com.google.common.base.Preconditions.checkNotNull(Preconditions.java:192) > at > com.google.common.collect.MapMakerInternalMap.putIfAbsent(MapMakerInternalMap.java:3507) > at > com.google.common.collect.Interners$WeakInterner.intern(Interners.java:85) > at > org.apache.spark.status.LiveEntityHelpers$.weakIntern(LiveEntity.scala:696) > at > org.apache.spark.status.LiveRDDDistribution.toApi(LiveEntity.scala:563) > at > org.apache.spark.status.LiveRDD.$anonfun$doUpdate$4(LiveEntity.scala:629) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at > scala.collection.mutable.HashMap$$anon$2.$anonfun$foreach$3(HashMap.scala:158) > at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) > at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) > at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:158) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at org.apache.spark.status.LiveRDD.doUpdate(LiveEntity.scala:629) > at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:51) > at > org.apache.spark.status.AppStatusListener.update(AppStatusListener.scala:1206) > at > org.apache.spark.status.AppStatusListener.maybeUpdate(AppStatusListener.scala:1212) > at > org.apache.spark.status.AppStatusListener.$anonfun$onExecutorMetricsUpdate$6(AppStatusListener.scala:956) > at > org.apache.spark.status.AppStatusListener.$anonfun$onExecutorMetricsUpdate$6$adapted(AppStatusListener.scala:956) > at > scala.collection.mutable.HashMap$$anon$2.$anonfun$foreach$3(HashMap.scala:158) > at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) > at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) > at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:158) > at > org.apache.spark.status.AppStatusListener.flush(AppStatusListener.scala:1015) > at > org.apache.spark.status.AppStatusListener.onExecutorMetricsUpdate(AppStatusListener.scala:956) > at > org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:59) > at > org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) > at > org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) > at > org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) > at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:119) > at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:103) > at > org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105) > at > 
org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105) > at > scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) > at > org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100) > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1585) > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36564) LiveRDDDistribution.toApi throws NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-36564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36564: Assignee: (was: Apache Spark) > LiveRDDDistribution.toApi throws NullPointerException > - > > Key: SPARK-36564 > URL: https://issues.apache.org/jira/browse/SPARK-36564 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0 >Reporter: wuyi >Priority: Major > > {code:java} > 21/08/23 12:26:29 ERROR AsyncEventQueue: Listener AppStatusListener threw an > exception > java.lang.NullPointerException > at > com.google.common.base.Preconditions.checkNotNull(Preconditions.java:192) > at > com.google.common.collect.MapMakerInternalMap.putIfAbsent(MapMakerInternalMap.java:3507) > at > com.google.common.collect.Interners$WeakInterner.intern(Interners.java:85) > at > org.apache.spark.status.LiveEntityHelpers$.weakIntern(LiveEntity.scala:696) > at > org.apache.spark.status.LiveRDDDistribution.toApi(LiveEntity.scala:563) > at > org.apache.spark.status.LiveRDD.$anonfun$doUpdate$4(LiveEntity.scala:629) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at > scala.collection.mutable.HashMap$$anon$2.$anonfun$foreach$3(HashMap.scala:158) > at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) > at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) > at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:158) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at org.apache.spark.status.LiveRDD.doUpdate(LiveEntity.scala:629) > at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:51) > at > org.apache.spark.status.AppStatusListener.update(AppStatusListener.scala:1206) > at > org.apache.spark.status.AppStatusListener.maybeUpdate(AppStatusListener.scala:1212) > at > org.apache.spark.status.AppStatusListener.$anonfun$onExecutorMetricsUpdate$6(AppStatusListener.scala:956) > at > org.apache.spark.status.AppStatusListener.$anonfun$onExecutorMetricsUpdate$6$adapted(AppStatusListener.scala:956) > at > scala.collection.mutable.HashMap$$anon$2.$anonfun$foreach$3(HashMap.scala:158) > at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) > at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) > at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:158) > at > org.apache.spark.status.AppStatusListener.flush(AppStatusListener.scala:1015) > at > org.apache.spark.status.AppStatusListener.onExecutorMetricsUpdate(AppStatusListener.scala:956) > at > org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:59) > at > org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) > at > org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) > at > org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) > at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:119) > at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:103) > at > org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105) > at > 
org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105) > at > scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) > at > org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100) > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1585) > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36564) LiveRDDDistribution.toApi throws NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-36564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403187#comment-17403187 ] Apache Spark commented on SPARK-36564: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/33812 > LiveRDDDistribution.toApi throws NullPointerException > - > > Key: SPARK-36564 > URL: https://issues.apache.org/jira/browse/SPARK-36564 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0 >Reporter: wuyi >Priority: Major > > {code:java} > 21/08/23 12:26:29 ERROR AsyncEventQueue: Listener AppStatusListener threw an > exception > java.lang.NullPointerException > at > com.google.common.base.Preconditions.checkNotNull(Preconditions.java:192) > at > com.google.common.collect.MapMakerInternalMap.putIfAbsent(MapMakerInternalMap.java:3507) > at > com.google.common.collect.Interners$WeakInterner.intern(Interners.java:85) > at > org.apache.spark.status.LiveEntityHelpers$.weakIntern(LiveEntity.scala:696) > at > org.apache.spark.status.LiveRDDDistribution.toApi(LiveEntity.scala:563) > at > org.apache.spark.status.LiveRDD.$anonfun$doUpdate$4(LiveEntity.scala:629) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at > scala.collection.mutable.HashMap$$anon$2.$anonfun$foreach$3(HashMap.scala:158) > at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) > at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) > at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:158) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at org.apache.spark.status.LiveRDD.doUpdate(LiveEntity.scala:629) > at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:51) > at > org.apache.spark.status.AppStatusListener.update(AppStatusListener.scala:1206) > at > org.apache.spark.status.AppStatusListener.maybeUpdate(AppStatusListener.scala:1212) > at > org.apache.spark.status.AppStatusListener.$anonfun$onExecutorMetricsUpdate$6(AppStatusListener.scala:956) > at > org.apache.spark.status.AppStatusListener.$anonfun$onExecutorMetricsUpdate$6$adapted(AppStatusListener.scala:956) > at > scala.collection.mutable.HashMap$$anon$2.$anonfun$foreach$3(HashMap.scala:158) > at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) > at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) > at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:158) > at > org.apache.spark.status.AppStatusListener.flush(AppStatusListener.scala:1015) > at > org.apache.spark.status.AppStatusListener.onExecutorMetricsUpdate(AppStatusListener.scala:956) > at > org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:59) > at > org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) > at > org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) > at > org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) > at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:119) > at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:103) > at > 
org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105) > at > org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105) > at > scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) > at > org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100) > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1585) > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36564) LiveRDDDistribution.toApi throws NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-36564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403188#comment-17403188 ] Apache Spark commented on SPARK-36564: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/33812 > LiveRDDDistribution.toApi throws NullPointerException > - > > Key: SPARK-36564 > URL: https://issues.apache.org/jira/browse/SPARK-36564 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0 >Reporter: wuyi >Priority: Major > > {code:java} > 21/08/23 12:26:29 ERROR AsyncEventQueue: Listener AppStatusListener threw an > exception > java.lang.NullPointerException > at > com.google.common.base.Preconditions.checkNotNull(Preconditions.java:192) > at > com.google.common.collect.MapMakerInternalMap.putIfAbsent(MapMakerInternalMap.java:3507) > at > com.google.common.collect.Interners$WeakInterner.intern(Interners.java:85) > at > org.apache.spark.status.LiveEntityHelpers$.weakIntern(LiveEntity.scala:696) > at > org.apache.spark.status.LiveRDDDistribution.toApi(LiveEntity.scala:563) > at > org.apache.spark.status.LiveRDD.$anonfun$doUpdate$4(LiveEntity.scala:629) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at > scala.collection.mutable.HashMap$$anon$2.$anonfun$foreach$3(HashMap.scala:158) > at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) > at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) > at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:158) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at org.apache.spark.status.LiveRDD.doUpdate(LiveEntity.scala:629) > at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:51) > at > org.apache.spark.status.AppStatusListener.update(AppStatusListener.scala:1206) > at > org.apache.spark.status.AppStatusListener.maybeUpdate(AppStatusListener.scala:1212) > at > org.apache.spark.status.AppStatusListener.$anonfun$onExecutorMetricsUpdate$6(AppStatusListener.scala:956) > at > org.apache.spark.status.AppStatusListener.$anonfun$onExecutorMetricsUpdate$6$adapted(AppStatusListener.scala:956) > at > scala.collection.mutable.HashMap$$anon$2.$anonfun$foreach$3(HashMap.scala:158) > at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) > at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) > at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:158) > at > org.apache.spark.status.AppStatusListener.flush(AppStatusListener.scala:1015) > at > org.apache.spark.status.AppStatusListener.onExecutorMetricsUpdate(AppStatusListener.scala:956) > at > org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:59) > at > org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) > at > org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) > at > org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) > at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:119) > at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:103) > at > 
org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105) > at > org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105) > at > scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) > at > org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100) > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1585) > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36564) LiveRDDDistribution.toApi throws NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-36564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-36564: - Summary: LiveRDDDistribution.toApi throws NullPointerException (was: LiveRDD.doUpdate throws NullPointerException) > LiveRDDDistribution.toApi throws NullPointerException > - > > Key: SPARK-36564 > URL: https://issues.apache.org/jira/browse/SPARK-36564 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0 >Reporter: wuyi >Priority: Major > > {code:java} > 21/08/23 12:26:29 ERROR AsyncEventQueue: Listener AppStatusListener threw an > exception > java.lang.NullPointerException > at > com.google.common.base.Preconditions.checkNotNull(Preconditions.java:192) > at > com.google.common.collect.MapMakerInternalMap.putIfAbsent(MapMakerInternalMap.java:3507) > at > com.google.common.collect.Interners$WeakInterner.intern(Interners.java:85) > at > org.apache.spark.status.LiveEntityHelpers$.weakIntern(LiveEntity.scala:696) > at > org.apache.spark.status.LiveRDDDistribution.toApi(LiveEntity.scala:563) > at > org.apache.spark.status.LiveRDD.$anonfun$doUpdate$4(LiveEntity.scala:629) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at > scala.collection.mutable.HashMap$$anon$2.$anonfun$foreach$3(HashMap.scala:158) > at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) > at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) > at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:158) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at org.apache.spark.status.LiveRDD.doUpdate(LiveEntity.scala:629) > at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:51) > at > org.apache.spark.status.AppStatusListener.update(AppStatusListener.scala:1206) > at > org.apache.spark.status.AppStatusListener.maybeUpdate(AppStatusListener.scala:1212) > at > org.apache.spark.status.AppStatusListener.$anonfun$onExecutorMetricsUpdate$6(AppStatusListener.scala:956) > at > org.apache.spark.status.AppStatusListener.$anonfun$onExecutorMetricsUpdate$6$adapted(AppStatusListener.scala:956) > at > scala.collection.mutable.HashMap$$anon$2.$anonfun$foreach$3(HashMap.scala:158) > at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) > at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) > at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:158) > at > org.apache.spark.status.AppStatusListener.flush(AppStatusListener.scala:1015) > at > org.apache.spark.status.AppStatusListener.onExecutorMetricsUpdate(AppStatusListener.scala:956) > at > org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:59) > at > org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) > at > org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) > at > org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) > at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:119) > at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:103) > at > org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105) 
> at > org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105) > at > scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) > at > org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100) > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1585) > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36564) LiveRDD.doUpdate throws NullPointerException
wuyi created SPARK-36564: Summary: LiveRDD.doUpdate throws NullPointerException Key: SPARK-36564 URL: https://issues.apache.org/jira/browse/SPARK-36564 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.1.2, 3.0.3, 3.2.0, 3.3.0 Reporter: wuyi {code:java} 21/08/23 12:26:29 ERROR AsyncEventQueue: Listener AppStatusListener threw an exception java.lang.NullPointerException at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:192) at com.google.common.collect.MapMakerInternalMap.putIfAbsent(MapMakerInternalMap.java:3507) at com.google.common.collect.Interners$WeakInterner.intern(Interners.java:85) at org.apache.spark.status.LiveEntityHelpers$.weakIntern(LiveEntity.scala:696) at org.apache.spark.status.LiveRDDDistribution.toApi(LiveEntity.scala:563) at org.apache.spark.status.LiveRDD.$anonfun$doUpdate$4(LiveEntity.scala:629) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at scala.collection.mutable.HashMap$$anon$2.$anonfun$foreach$3(HashMap.scala:158) at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:158) at scala.collection.TraversableLike.map(TraversableLike.scala:238) at scala.collection.TraversableLike.map$(TraversableLike.scala:231) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.status.LiveRDD.doUpdate(LiveEntity.scala:629) at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:51) at org.apache.spark.status.AppStatusListener.update(AppStatusListener.scala:1206) at org.apache.spark.status.AppStatusListener.maybeUpdate(AppStatusListener.scala:1212) at org.apache.spark.status.AppStatusListener.$anonfun$onExecutorMetricsUpdate$6(AppStatusListener.scala:956) at org.apache.spark.status.AppStatusListener.$anonfun$onExecutorMetricsUpdate$6$adapted(AppStatusListener.scala:956) at scala.collection.mutable.HashMap$$anon$2.$anonfun$foreach$3(HashMap.scala:158) at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:158) at org.apache.spark.status.AppStatusListener.flush(AppStatusListener.scala:1015) at org.apache.spark.status.AppStatusListener.onExecutorMetricsUpdate(AppStatusListener.scala:956) at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:59) at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:119) at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:103) at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105) at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105) at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100) at 
org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96) at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1585) at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32838) Cannot insert overwrite different partition with same table
[ https://issues.apache.org/jira/browse/SPARK-32838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-32838: -- Parent: SPARK-36562 Issue Type: Sub-task (was: Bug) > Cannot insert overwrite different partition with same table > -- > > Key: SPARK-32838 > URL: https://issues.apache.org/jira/browse/SPARK-32838 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 > Environment: hadoop 2.7 + spark 3.0.0 >Reporter: CHC >Priority: Major > > When: > {code:java} > CREATE TABLE tmp.spark3_snap ( > id string > ) > PARTITIONED BY (dt string) > STORED AS ORC > ; > insert overwrite table tmp.spark3_snap partition(dt='2020-09-09') > select 10; > insert overwrite table tmp.spark3_snap partition(dt='2020-09-10') > select 1; > insert overwrite table tmp.spark3_snap partition(dt='2020-09-10') > select id from tmp.spark3_snap where dt='2020-09-09'; > {code} > it fails with the error: "Cannot overwrite a path that is also being read > from" > related: https://issues.apache.org/jira/browse/SPARK-24194 > This works on Spark 2.4.3 and does not work on Spark 3.0.0 > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36563) dynamicPartitionOverwrite can direct rename to targetPath instead of partition path one by one
[ https://issues.apache.org/jira/browse/SPARK-36563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403169#comment-17403169 ] Apache Spark commented on SPARK-36563: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/33811 > dynamicPartitionOverwrite can direct rename to targetPath instead of > partition path one by one > -- > > Key: SPARK-36563 > URL: https://issues.apache.org/jira/browse/SPARK-36563 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: angerszhu >Priority: Major > Fix For: 3.3.0 > > > dynamicPartitionOverwrite can direct rename to targetPath instead of > partition path one by one -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36563) dynamicPartitionOverwrite can direct rename to targetPath instead of partition path one by one
[ https://issues.apache.org/jira/browse/SPARK-36563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36563: Assignee: (was: Apache Spark) > dynamicPartitionOverwrite can direct rename to targetPath instead of > partition path one by one > -- > > Key: SPARK-36563 > URL: https://issues.apache.org/jira/browse/SPARK-36563 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: angerszhu >Priority: Major > Fix For: 3.3.0 > > > dynamicPartitionOverwrite can direct rename to targetPath instead of > partition path one by one -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36563) dynamicPartitionOverwrite can direct rename to targetPath instead of partition path one by one
[ https://issues.apache.org/jira/browse/SPARK-36563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36563: Assignee: Apache Spark > dynamicPartitionOverwrite can direct rename to targetPath instead of > partition path one by one > -- > > Key: SPARK-36563 > URL: https://issues.apache.org/jira/browse/SPARK-36563 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: angerszhu >Assignee: Apache Spark >Priority: Major > Fix For: 3.3.0 > > > dynamicPartitionOverwrite can direct rename to targetPath instead of > partition path one by one -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36563) dynamicPartitionOverwrite can direct rename to targetPath instead of partition path one by one
angerszhu created SPARK-36563: - Summary: dynamicPartitionOverwrite can direct rename to targetPath instead of partition path one by one Key: SPARK-36563 URL: https://issues.apache.org/jira/browse/SPARK-36563 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.2, 3.2.0 Reporter: angerszhu Fix For: 3.3.0 dynamicPartitionOverwrite can direct rename to targetPath instead of partition path one by one -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
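For context on the commit step being optimized, the sketch below shows a dynamic partition overwrite from the user side (the output path and column names are placeholders); the ticket is about how staged files are renamed into the target directory at commit time, not about this API itself.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dynamic-partition-overwrite").getOrCreate()
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

df = spark.createDataFrame([(1, "2021-08-23"), (2, "2021-08-24")], ["id", "dt"])

# Only the partitions present in df are replaced; other partitions under the
# target path are left untouched, each staged partition being moved at commit.
df.write.mode("overwrite").partitionBy("dt").parquet("/tmp/example_output")
{code}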
[jira] [Created] (SPARK-36562) Improve InsertIntoHadoopFsRelation file commit logic
angerszhu created SPARK-36562: - Summary: Improve InsertIntoHadoopFsRelation file commit logic Key: SPARK-36562 URL: https://issues.apache.org/jira/browse/SPARK-36562 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.2, 3.2.0 Reporter: angerszhu Fix For: 3.3.0 Improve InsertIntoHadoopFsRelation file commit logic -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403154#comment-17403154 ] Yang Jie commented on SPARK-25075: -- [~dongjoon] Thank you for your reply > Build and test Spark against Scala 2.13 > --- > > Key: SPARK-25075 > URL: https://issues.apache.org/jira/browse/SPARK-25075 > Project: Spark > Issue Type: Umbrella > Components: Build, MLlib, Project Infra, Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Guillaume Massé >Priority: Major > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.13 milestone. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35876) array_zip unexpected column names
[ https://issues.apache.org/jira/browse/SPARK-35876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403085#comment-17403085 ] Apache Spark commented on SPARK-35876: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/33810 > array_zip unexpected column names > - > > Key: SPARK-35876 > URL: https://issues.apache.org/jira/browse/SPARK-35876 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Derk Crezee >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.2.0 > > > {{When I'm using the array_zip function in combination with renamed columns, > I get an unexpected schema written to disk.}} > {code:java} > // code placeholder > from pyspark.sql import * > from pyspark.sql.functions import * > spark = SparkSession.builder.getOrCreate() > data = [ > Row(a1=["a", "a"], b1=["b", "b"]), > ] > df = ( > spark.sparkContext.parallelize(data).toDF() > .withColumnRenamed("a1", "a2") > .withColumnRenamed("b1", "b2") > .withColumn("zipped", arrays_zip(col("a2"), col("b2"))) > ) > df.printSchema() > // root > // |-- a2: array (nullable = true) > // ||-- element: string (containsNull = true) > // |-- b2: array (nullable = true) > // ||-- element: string (containsNull = true) > // |-- zipped: array (nullable = true) > // ||-- element: struct (containsNull = false) > // |||-- a2: string (nullable = true) > // |||-- b2: string (nullable = true) > df.write.save("test.parquet") > spark.read.load("test.parquet").printSchema() > // root > // |-- a2: array (nullable = true) > // ||-- element: string (containsNull = true) > // |-- b2: array (nullable = true) > // ||-- element: string (containsNull = true) > // |-- zipped: array (nullable = true) > // ||-- element: struct (containsNull = true) > // |||-- a1: string (nullable = true) > // |||-- b1: string (nullable = true){code} > I would expect the schema of the DataFrame written to disk to be the same as > that printed out. It seems that instead of using the renamed version of the > column names, it uses the old column names. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35876) array_zip unexpected column names
[ https://issues.apache.org/jira/browse/SPARK-35876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403084#comment-17403084 ] Apache Spark commented on SPARK-35876: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/33810 > array_zip unexpected column names > - > > Key: SPARK-35876 > URL: https://issues.apache.org/jira/browse/SPARK-35876 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Derk Crezee >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.2.0 > > > {{When I'm using the array_zip function in combination with renamed columns, > I get an unexpected schema written to disk.}} > {code:java} > // code placeholder > from pyspark.sql import * > from pyspark.sql.functions import * > spark = SparkSession.builder.getOrCreate() > data = [ > Row(a1=["a", "a"], b1=["b", "b"]), > ] > df = ( > spark.sparkContext.parallelize(data).toDF() > .withColumnRenamed("a1", "a2") > .withColumnRenamed("b1", "b2") > .withColumn("zipped", arrays_zip(col("a2"), col("b2"))) > ) > df.printSchema() > // root > // |-- a2: array (nullable = true) > // ||-- element: string (containsNull = true) > // |-- b2: array (nullable = true) > // ||-- element: string (containsNull = true) > // |-- zipped: array (nullable = true) > // ||-- element: struct (containsNull = false) > // |||-- a2: string (nullable = true) > // |||-- b2: string (nullable = true) > df.write.save("test.parquet") > spark.read.load("test.parquet").printSchema() > // root > // |-- a2: array (nullable = true) > // ||-- element: string (containsNull = true) > // |-- b2: array (nullable = true) > // ||-- element: string (containsNull = true) > // |-- zipped: array (nullable = true) > // ||-- element: struct (containsNull = true) > // |||-- a1: string (nullable = true) > // |||-- b1: string (nullable = true){code} > I would expect the schema of the DataFrame written to disk to be the same as > that printed out. It seems that instead of using the renamed version of the > column names, it uses the old column names. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-36488) "Invalid usage of '*' in expression" error due to the feature of 'quotedRegexColumnNames' in some scenarios.
[ https://issues.apache.org/jira/browse/SPARK-36488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403050#comment-17403050 ] merrily01 edited comment on SPARK-36488 at 8/23/21, 9:02 AM: - Hi [~planga82] , Thanks for your attention and sorry for the late reply. It's OK to improve the error message, but I don't quite agree that this is not a bug. Firstly, for a very common SQL query, the results differ depending on whether the parameter is turned on or off, which is not easy to accept. At least it proves that this feature is not perfect. (I have always felt that the regular-expression handling of this feature should not apply to the 'database' part.) Secondly, this feature was originally meant to be compatible with, and was modeled on, the Hive feature. But the cases above execute normally when the feature is turned on in Hive, while Spark has a problem. was (Author: merrily01): Hi [~planga82] , Thanks for your attention and sorry for the late reply. It's OK to improve the error message, but I don't quite agree that this is not a bug. Firstly, for a very common SQL query, the results differ depending on whether the parameter is turned on or off, which is not easy to accept. At least it proves that this function is not perfect. (I have always felt that the regular-expression handling of this function should not apply to the 'database' part.) Secondly, this feature was originally meant to be compatible with, and was modeled on, the Hive feature. But the cases above execute normally when the feature is turned on in Hive, while Spark has a problem. > "Invalid usage of '*' in expression" error due to the feature of > 'quotedRegexColumnNames' in some scenarios. > > > Key: SPARK-36488 > URL: https://issues.apache.org/jira/browse/SPARK-36488 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.4.8, 3.1.2 >Reporter: merrily01 >Priority: Major > > In some cases, the error happens when the following property is set. > {code:java} > spark.sql("set spark.sql.parser.quotedRegexColumnNames=true") > {code} > *case 1:* > {code:java} > spark-sql> create table tb_test as select 1 as col_a, 2 as col_b; > spark-sql> select `tb_test`.`col_a` from tb_test; > 1 > spark-sql> set spark.sql.parser.quotedRegexColumnNames=true; > spark-sql> select `tb_test`.`col_a` from tb_test; > Error in query: Invalid usage of '*' in expression 'unresolvedextractvalue' > {code} > > *case 2:* > {code:java} > > select `col_a`/`col_b` as `col_c` from (select 3 as `col_a` , > 3.14 as `col_b`); > 0.955414 > spark-sql> set spark.sql.parser.quotedRegexColumnNames=true; > spark-sql> select `col_a`/`col_b` as `col_c` from (select 3 as `col_a` , > 3.14 as `col_b`); > Error in query: Invalid usage of '*' in expression 'divide' > {code} > > This problem exists in 3.X, 2.4.X and master versions. > > Related issue : > https://issues.apache.org/jira/browse/SPARK-12139 > (As can be seen in the latest comments, some people have encountered the same > problem) > > Similar problems: > https://issues.apache.org/jira/browse/SPARK-28897 > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
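For context, a minimal PySpark sketch of what {{spark.sql.parser.quotedRegexColumnNames}} is meant to enable, assuming a hypothetical {{tb_test}} view with columns {{col_a}} and {{col_b}} mirroring case 1 above: with the flag on, a backquoted identifier in the select list is treated as a regular expression over column names, which is also why qualified references such as {{`tb_test`.`col_a`}} start failing.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical data mirroring case 1 above.
spark.sql("CREATE OR REPLACE TEMPORARY VIEW tb_test AS SELECT 1 AS col_a, 2 AS col_b")

spark.conf.set("spark.sql.parser.quotedRegexColumnNames", "true")

# Intended use of the feature: the quoted identifier is a regex that expands
# to every matching column, here both col_a and col_b.
spark.sql("SELECT `col_.*` FROM tb_test").show()

# The reported problem: with the same flag on, a qualified reference such as
# `tb_test`.`col_a` fails with "Invalid usage of '*' in expression".
{code}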
[jira] [Commented] (SPARK-36536) Split the JSON/CSV option of datetime format to in read and in write
[ https://issues.apache.org/jira/browse/SPARK-36536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403051#comment-17403051 ] Apache Spark commented on SPARK-36536: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/33809 > Split the JSON/CSV option of datetime format to in read and in write > > > Key: SPARK-36536 > URL: https://issues.apache.org/jira/browse/SPARK-36536 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.3.0 > > > This is a follow-up of https://issues.apache.org/jira/browse/SPARK-36418. > We need to split the JSON and CSV options *dateFormat* and *timestampFormat*. In > write, the behavior should stay the same, but in read the option shouldn't be set to a > default value. In this way, DateFormatter and TimestampFormatter will use the > CAST logic. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
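To make the proposed read/write split concrete, a hedged PySpark sketch (the {{/tmp/ts_csv}} path and the sample data are made up for illustration): today a single {{timestampFormat}} option covers both sides, while the idea here is that an unset option in read falls back to the CAST parsing logic and write keeps its current default pattern.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("2021-08-23 10:11:12",)], ["ts_str"]) \
    .selectExpr("CAST(ts_str AS TIMESTAMP) AS ts")

# Write: an explicit (or default) pattern is applied when formatting timestamps.
df.write.mode("overwrite") \
    .option("timestampFormat", "yyyy-MM-dd HH:mm:ss") \
    .csv("/tmp/ts_csv")

# Read: when the option is left unset, the proposal is to parse timestamp
# strings via the CAST logic instead of a hard-coded default pattern.
spark.read.schema("ts TIMESTAMP").csv("/tmp/ts_csv").show(truncate=False)
{code}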
[jira] [Commented] (SPARK-36488) "Invalid usage of '*' in expression" error due to the feature of 'quotedRegexColumnNames' in some scenarios.
[ https://issues.apache.org/jira/browse/SPARK-36488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403050#comment-17403050 ] merrily01 commented on SPARK-36488: --- Hi [~planga82] , Thanks for your attention and sorry for the late reply. It's OK to improve the error message, but I don't quite agree that this is not a bug. Firstly, for a very common SQL query, the results differ depending on whether the parameter is turned on or off, which is not easy to accept. At least it proves that this function is not perfect. (I have always felt that the regular-expression handling of this function should not apply to the 'database' part.) Secondly, this feature was originally meant to be compatible with, and was modeled on, the Hive feature. But the cases above execute normally when the feature is turned on in Hive, while Spark has a problem. > "Invalid usage of '*' in expression" error due to the feature of > 'quotedRegexColumnNames' in some scenarios. > > > Key: SPARK-36488 > URL: https://issues.apache.org/jira/browse/SPARK-36488 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.4.8, 3.1.2 >Reporter: merrily01 >Priority: Major > > In some cases, the error happens when the following property is set. > {code:java} > spark.sql("set spark.sql.parser.quotedRegexColumnNames=true") > {code} > *case 1:* > {code:java} > spark-sql> create table tb_test as select 1 as col_a, 2 as col_b; > spark-sql> select `tb_test`.`col_a` from tb_test; > 1 > spark-sql> set spark.sql.parser.quotedRegexColumnNames=true; > spark-sql> select `tb_test`.`col_a` from tb_test; > Error in query: Invalid usage of '*' in expression 'unresolvedextractvalue' > {code} > > *case 2:* > {code:java} > > select `col_a`/`col_b` as `col_c` from (select 3 as `col_a` , > 3.14 as `col_b`); > 0.955414 > spark-sql> set spark.sql.parser.quotedRegexColumnNames=true; > spark-sql> select `col_a`/`col_b` as `col_c` from (select 3 as `col_a` , > 3.14 as `col_b`); > Error in query: Invalid usage of '*' in expression 'divide' > {code} > > This problem exists in 3.X, 2.4.X and master versions. > > Related issue : > https://issues.apache.org/jira/browse/SPARK-12139 > (As can be seen in the latest comments, some people have encountered the same > problem) > > Similar problems: > https://issues.apache.org/jira/browse/SPARK-28897 > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36418) Use CAST in parsing of dates/timestamps with default pattern
[ https://issues.apache.org/jira/browse/SPARK-36418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403048#comment-17403048 ] Apache Spark commented on SPARK-36418: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/33809 > Use CAST in parsing of dates/timestamps with default pattern > > > Key: SPARK-36418 > URL: https://issues.apache.org/jira/browse/SPARK-36418 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.3.0 > > > In functions, CSV/JSON datasources and other places, when the pattern is the > default one, use the CAST logic in parsing strings to dates/timestamps. > Currently, TimestampFormatter.getFormatter() applies the default pattern > *yyyy-MM-dd HH:mm:ss* when the pattern is not set, see > https://github.com/apache/spark/blob/f2492772baf1d00d802e704f84c22a9c410929e9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala#L344 > . Instead of that, we need to create a special formatter which invokes the CAST > logic. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
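As a rough illustration of the difference (a sketch, not the actual implementation), CAST accepts a wider family of date/timestamp strings than a fixed {{yyyy-MM-dd HH:mm:ss}} pattern does, which is what routing the default case through the CAST logic would buy:
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_timestamp

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("2021-08-23 10:11:12",), ("2021-08-23T10:11:12",), ("2021-08-23",)],
    ["s"],
)

# CAST-style parsing handles several ISO-like variants.
df.select(col("s").cast("timestamp").alias("via_cast")).show(truncate=False)

# Fixed-pattern parsing: with default (non-ANSI) settings only the first row
# matches yyyy-MM-dd HH:mm:ss; the others come back as NULL.
df.select(to_timestamp(col("s"), "yyyy-MM-dd HH:mm:ss").alias("via_pattern")).show(truncate=False)
{code}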
[jira] [Commented] (SPARK-36536) Split the JSON/CSV option of datetime format to in read and in write
[ https://issues.apache.org/jira/browse/SPARK-36536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403049#comment-17403049 ] Apache Spark commented on SPARK-36536: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/33809 > Split the JSON/CSV option of datetime format to in read and in write > > > Key: SPARK-36536 > URL: https://issues.apache.org/jira/browse/SPARK-36536 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.3.0 > > > This is a follow-up of https://issues.apache.org/jira/browse/SPARK-36418. > We need to split the JSON and CSV options *dateFormat* and *timestampFormat*. In > write, the behavior should stay the same, but in read the option shouldn't be set to a > default value. In this way, DateFormatter and TimestampFormatter will use the > CAST logic. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31936) Implement ScriptTransform in sql/core
[ https://issues.apache.org/jira/browse/SPARK-31936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403034#comment-17403034 ] Gengliang Wang commented on SPARK-31936: Thank you [~angerszhuuu] > Implement ScriptTransform in sql/core > - > > Key: SPARK-31936 > URL: https://issues.apache.org/jira/browse/SPARK-31936 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.2.0 > > > ScriptTransformation currently relies on Hive internals. It'd be great if we > can implement a native ScriptTransformation in sql/core module to remove the > extra Hive dependency here. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
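For readers unfamiliar with the feature, script transformation is the {{TRANSFORM ... USING}} SQL syntax that pipes rows through an external process. Below is a hedged sketch of the kind of statement covered by the native implementation; the {{/bin/cat}} command and the {{nums}} view are purely illustrative, and a Unix-like environment is assumed. Per the description above, the point of the JIRA is that this no longer needs to go through Hive internals.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.range(3).selectExpr("id", "id * 2 AS doubled").createOrReplaceTempView("nums")

# TRANSFORM writes each input row to the external command's stdin and reads
# transformed rows back from its stdout; /bin/cat simply echoes the rows.
spark.sql("""
    SELECT TRANSFORM(id, doubled)
    USING '/bin/cat'
    AS (id STRING, doubled STRING)
    FROM nums
""").show()
{code}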
[jira] [Updated] (SPARK-31936) Implement ScriptTransform in sql/core
[ https://issues.apache.org/jira/browse/SPARK-31936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-31936: -- Fix Version/s: 3.2.0 > Implement ScriptTransform in sql/core > - > > Key: SPARK-31936 > URL: https://issues.apache.org/jira/browse/SPARK-31936 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.2.0 > > > ScriptTransformation currently relies on Hive internals. It'd be great if we > can implement a native ScriptTransformation in sql/core module to remove the > extra Hive dependency here. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31936) Implement ScriptTransform in sql/core
[ https://issues.apache.org/jira/browse/SPARK-31936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403030#comment-17403030 ] angerszhu commented on SPARK-31936: --- [~Gengliang.Wang] The main work has been done, we can close this. > Implement ScriptTransform in sql/core > - > > Key: SPARK-31936 > URL: https://issues.apache.org/jira/browse/SPARK-31936 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > > ScriptTransformation currently relies on Hive internals. It'd be great if we > can implement a native ScriptTransformation in sql/core module to remove the > extra Hive dependency here. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31936) Implement ScriptTransform in sql/core
[ https://issues.apache.org/jira/browse/SPARK-31936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu resolved SPARK-31936. --- Resolution: Implemented > Implement ScriptTransform in sql/core > - > > Key: SPARK-31936 > URL: https://issues.apache.org/jira/browse/SPARK-31936 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > > ScriptTransformation currently relies on Hive internals. It'd be great if we > can implement a native ScriptTransformation in sql/core module to remove the > extra Hive dependency here. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31936) Implement ScriptTransform in sql/core
[ https://issues.apache.org/jira/browse/SPARK-31936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403025#comment-17403025 ] Gengliang Wang commented on SPARK-31936: [~angerszhuuu] Is this finished? > Implement ScriptTransform in sql/core > - > > Key: SPARK-31936 > URL: https://issues.apache.org/jira/browse/SPARK-31936 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > > ScriptTransformation currently relies on Hive internals. It'd be great if we > can implement a native ScriptTransformation in sql/core module to remove the > extra Hive dependency here. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36561) Remove `ColumnVector.numNulls`
Dmitry Sysolyatin created SPARK-36561: - Summary: Remove `ColumnVector.numNulls` Key: SPARK-36561 URL: https://issues.apache.org/jira/browse/SPARK-36561 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.2.0 Reporter: Dmitry Sysolyatin Hi! When I was implementing the ColumnVector abstract class, I started to check where `numNulls` is used in the Spark source code. I didn't find any places where it is used except tests. Are there any plans to use it in the future? If not, then I propose to remove the `numNulls` method. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36560) Deflake PySpark coverage report
[ https://issues.apache.org/jira/browse/SPARK-36560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17402987#comment-17402987 ] Apache Spark commented on SPARK-36560: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/33808 > Deflake PySpark coverage report > --- > > Key: SPARK-36560 > URL: https://issues.apache.org/jira/browse/SPARK-36560 > Project: Spark > Issue Type: Improvement > Components: Project Infra, PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Priority: Minor > > https://github.com/apache/spark/runs/3388727798?check_suite_focus=true > https://github.com/apache/spark/runs/3392972609?check_suite_focus=true > https://github.com/apache/spark/runs/3359880048?check_suite_focus=true > https://github.com/apache/spark/runs/3338876122?check_suite_focus=true > PySpark scheduled coverage jobs are flaky. We should deflake them -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36560) Deflake PySpark coverage report
[ https://issues.apache.org/jira/browse/SPARK-36560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36560: Assignee: Apache Spark > Deflake PySpark coverage report > --- > > Key: SPARK-36560 > URL: https://issues.apache.org/jira/browse/SPARK-36560 > Project: Spark > Issue Type: Improvement > Components: Project Infra, PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Minor > > https://github.com/apache/spark/runs/3388727798?check_suite_focus=true > https://github.com/apache/spark/runs/3392972609?check_suite_focus=true > https://github.com/apache/spark/runs/3359880048?check_suite_focus=true > https://github.com/apache/spark/runs/3338876122?check_suite_focus=true > PySpark scheduled coverage jobs are flaky. We should deflake them -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36560) Deflake PySpark coverage report
[ https://issues.apache.org/jira/browse/SPARK-36560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17402986#comment-17402986 ] Apache Spark commented on SPARK-36560: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/33808 > Deflake PySpark coverage report > --- > > Key: SPARK-36560 > URL: https://issues.apache.org/jira/browse/SPARK-36560 > Project: Spark > Issue Type: Improvement > Components: Project Infra, PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Priority: Minor > > https://github.com/apache/spark/runs/3388727798?check_suite_focus=true > https://github.com/apache/spark/runs/3392972609?check_suite_focus=true > https://github.com/apache/spark/runs/3359880048?check_suite_focus=true > https://github.com/apache/spark/runs/3338876122?check_suite_focus=true > PySpark scheduled coverage jobs are flaky. We should deflake them -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36560) Deflake PySpark coverage report
[ https://issues.apache.org/jira/browse/SPARK-36560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36560: Assignee: (was: Apache Spark) > Deflake PySpark coverage report > --- > > Key: SPARK-36560 > URL: https://issues.apache.org/jira/browse/SPARK-36560 > Project: Spark > Issue Type: Improvement > Components: Project Infra, PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Priority: Minor > > https://github.com/apache/spark/runs/3388727798?check_suite_focus=true > https://github.com/apache/spark/runs/3392972609?check_suite_focus=true > https://github.com/apache/spark/runs/3359880048?check_suite_focus=true > https://github.com/apache/spark/runs/3338876122?check_suite_focus=true > PySpark scheduled coverage jobs are flaky. We should deflake them -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36560) Deflake PySpark coverage report
[ https://issues.apache.org/jira/browse/SPARK-36560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-36560: - Priority: Minor (was: Major) > Deflake PySpark coverage report > --- > > Key: SPARK-36560 > URL: https://issues.apache.org/jira/browse/SPARK-36560 > Project: Spark > Issue Type: Improvement > Components: Project Infra, PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Priority: Minor > > https://github.com/apache/spark/runs/3388727798?check_suite_focus=true > https://github.com/apache/spark/runs/3392972609?check_suite_focus=true > https://github.com/apache/spark/runs/3359880048?check_suite_focus=true > https://github.com/apache/spark/runs/3338876122?check_suite_focus=true > PySpark scheduled coverage jobs are flaky. We should deflake them -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36560) Deflake PySpark coverage report
[ https://issues.apache.org/jira/browse/SPARK-36560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-36560: - Description: https://github.com/apache/spark/runs/3388727798?check_suite_focus=true https://github.com/apache/spark/runs/3392972609?check_suite_focus=true https://github.com/apache/spark/runs/3359880048?check_suite_focus=true https://github.com/apache/spark/runs/3338876122?check_suite_focus=true PySpark scheduled coverage jobs are flaky. We should deflake them was: https://github.com/apache/spark/runs/3388727798?check_suite_focus=true https://github.com/apache/spark/runs/3392972609?check_suite_focus=true https://github.com/apache/spark/runs/3359880048?check_suite_focus=true https://github.com/apache/spark/runs/3338876122?check_suite_focus=true > Deflake PySpark coverage report > --- > > Key: SPARK-36560 > URL: https://issues.apache.org/jira/browse/SPARK-36560 > Project: Spark > Issue Type: Improvement > Components: Project Infra, PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Priority: Major > > https://github.com/apache/spark/runs/3388727798?check_suite_focus=true > https://github.com/apache/spark/runs/3392972609?check_suite_focus=true > https://github.com/apache/spark/runs/3359880048?check_suite_focus=true > https://github.com/apache/spark/runs/3338876122?check_suite_focus=true > PySpark scheduled coverage jobs are flaky. We should deflake them -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36560) Deflake PySpark coverage report
Hyukjin Kwon created SPARK-36560: Summary: Deflake PySpark coverage report Key: SPARK-36560 URL: https://issues.apache.org/jira/browse/SPARK-36560 Project: Spark Issue Type: Improvement Components: Project Infra, PySpark Affects Versions: 3.2.0 Reporter: Hyukjin Kwon https://github.com/apache/spark/runs/3388727798?check_suite_focus=true https://github.com/apache/spark/runs/3392972609?check_suite_focus=true https://github.com/apache/spark/runs/3359880048?check_suite_focus=true https://github.com/apache/spark/runs/3338876122?check_suite_focus=true -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36559) Allow column pruning on distributed sequence index (pandas API on Spark)
[ https://issues.apache.org/jira/browse/SPARK-36559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36559: Assignee: (was: Apache Spark) > Allow column pruning on distributed sequence index (pandas API on Spark) > > > Key: SPARK-36559 > URL: https://issues.apache.org/jira/browse/SPARK-36559 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-36338 implemented the distributed > sequence index on the JVM side. However, it prevents leveraging the Spark > SQL engine because it creates an RDD directly and truncates the SQL plans. > We should move the logic into a proper SQL plan to leverage other > optimizations such as column pruning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36559) Allow column pruning on distributed sequence index (pandas API on Spark)
[ https://issues.apache.org/jira/browse/SPARK-36559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36559: Assignee: Apache Spark > Allow column pruning on distributed sequence index (pandas API on Spark) > > > Key: SPARK-36559 > URL: https://issues.apache.org/jira/browse/SPARK-36559 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-36338 implemented the distributed > sequence index on the JVM side. However, it prevents leveraging the Spark > SQL engine because it creates an RDD directly and truncates the SQL plans. > We should move the logic into a proper SQL plan to leverage other > optimizations such as column pruning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36559) Allow column pruning on distributed sequence index (pandas API on Spark)
[ https://issues.apache.org/jira/browse/SPARK-36559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17402972#comment-17402972 ] Apache Spark commented on SPARK-36559: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/33807 > Allow column pruning on distributed sequence index (pandas API on Spark) > > > Key: SPARK-36559 > URL: https://issues.apache.org/jira/browse/SPARK-36559 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-36338 implemented the distributed > sequence index on the JVM side. However, it prevents leveraging the Spark > SQL engine because it creates an RDD directly and truncates the SQL plans. > We should move the logic into a proper SQL plan to leverage other > optimizations such as column pruning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36559) Allow column pruning on distributed sequence index (pandas API on Spark)
Hyukjin Kwon created SPARK-36559: Summary: Allow column pruning on distributed sequence index (pandas API on Spark) Key: SPARK-36559 URL: https://issues.apache.org/jira/browse/SPARK-36559 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 3.2.0 Reporter: Hyukjin Kwon https://issues.apache.org/jira/browse/SPARK-36338 implemented the distributed sequence index on the JVM side. However, it prevents leveraging the Spark SQL engine because it creates an RDD directly and truncates the SQL plans. We should move the logic into a proper SQL plan to leverage other optimizations such as column pruning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
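A hedged sketch of where this shows up, using the pandas-on-Spark default index option (option name as documented for the pandas API on Spark; the tiny frame is illustrative): once the distributed-sequence index is attached inside the SQL plan rather than via a raw RDD, selecting a single column should let the optimizer prune the untouched ones, which can be checked in the plan output.
{code:python}
import pyspark.pandas as ps

# Use the distributed-sequence default index for new pandas-on-Spark frames.
ps.set_option("compute.default_index_type", "distributed-sequence")

psdf = ps.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# Only column "a" is needed; with the index generated inside the SQL plan,
# column pruning can drop "b" and "c" from the scan.
psdf[["a"]].to_spark().explain()
{code}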
[jira] [Updated] (SPARK-36457) Review and fix issues in API docs
[ https://issues.apache.org/jira/browse/SPARK-36457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-36457: Priority: Blocker (was: Critical) > Review and fix issues in API docs > - > > Key: SPARK-36457 > URL: https://issues.apache.org/jira/browse/SPARK-36457 > Project: Spark > Issue Type: Improvement > Components: docs >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Blocker > > Compare the 3.2.0 API doc with the latest release version 3.1.2. Fix the > following issues: > * Add missing `Since` annotation for new APIs > * Remove the leaking class/object in API doc -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
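On the Python side of the same audit, the counterpart of the {{Since}} annotation is the {{versionadded}} directive in the docstring, which the API docs render as the "New in version" note. The function below is a made-up placeholder, shown only to illustrate the convention.
{code:python}
def some_new_public_api(value: int) -> int:
    """Illustrative placeholder for a new public PySpark API.

    .. versionadded:: 3.2.0

    Parameters
    ----------
    value : int
        An example parameter.

    Returns
    -------
    int
        The input value, unchanged.
    """
    return value
{code}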
[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17402964#comment-17402964 ] Dongjoon Hyun commented on SPARK-25075: --- [~LuciferYang]. Not yet, although I believe the Apache Spark community can switch the default in 2022. We will see the adoption of Scala 2.13 after the Apache Spark 3.2.0 release and stabilize our releases first at 3.2.1 and 3.2.2. We can start to build a consensus early next year and target it for Apache Spark 3.3.0. > Build and test Spark against Scala 2.13 > --- > > Key: SPARK-25075 > URL: https://issues.apache.org/jira/browse/SPARK-25075 > Project: Spark > Issue Type: Umbrella > Components: Build, MLlib, Project Infra, Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Guillaume Massé >Priority: Major > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.13 milestone. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org