[jira] [Created] (SPARK-40620) Deduplication of WorkerOffer build in CoarseGrainedSchedulerBackend

2022-09-30 Thread Khalid Mammadov (Jira)
Khalid Mammadov created SPARK-40620:
---

 Summary: Deduplication of WorkerOffer build in 
CoarseGrainedSchedulerBackend
 Key: SPARK-40620
 URL: https://issues.apache.org/jira/browse/SPARK-40620
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Khalid Mammadov


The WorkerOffer construction in CoarseGrainedSchedulerBackend is repeated in two 
different places with exactly the same parameters. We can deduplicate it and 
improve readability by moving that construction into a private function (a rough 
sketch of the idea follows below).
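
A minimal, self-contained sketch of the intended shape of the refactor (the types 
below are simplified stand-ins, not Spark's real internals, and the helper name is 
illustrative):

{code:java}
// Sketch only: two call sites that used to build the offer inline now share
// one private helper, so the parameter list lives in a single place.
case class ExecutorInfo(host: String, freeCores: Int, resourceProfileId: Int)
case class WorkerOffer(executorId: String, host: String, cores: Int, resourceProfileId: Int)

class SchedulerBackendSketch(executors: Map[String, ExecutorInfo]) {

  // Single place that knows how to turn executor bookkeeping into an offer.
  private def buildWorkerOffer(id: String, info: ExecutorInfo): WorkerOffer =
    WorkerOffer(id, info.host, info.freeCores, info.resourceProfileId)

  // Call site 1: offer every known executor.
  def makeOffers(): Seq[WorkerOffer] =
    executors.map { case (id, info) => buildWorkerOffer(id, info) }.toSeq

  // Call site 2: offer a single executor (e.g. after it finishes a task).
  def makeOffers(executorId: String): Seq[WorkerOffer] =
    executors.get(executorId).map(buildWorkerOffer(executorId, _)).toSeq
}
{code}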



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40620) Deduplication of WorkerOffer build in CoarseGrainedSchedulerBackend

2022-09-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40620:


Assignee: Apache Spark

> Deduplication of WorkerOffer build in CoarseGrainedSchedulerBackend
> ---
>
> Key: SPARK-40620
> URL: https://issues.apache.org/jira/browse/SPARK-40620
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Khalid Mammadov
>Assignee: Apache Spark
>Priority: Trivial
>
> The WorkerOffer construction in CoarseGrainedSchedulerBackend is repeated in 
> two different places with exactly the same parameters. We can deduplicate it 
> and improve readability by moving that construction into a private function.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40619) HivePartitionFilteringSuites test aborted due to `java.lang.OutOfMemoryError: Metaspace`

2022-09-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40619.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38057
[https://github.com/apache/spark/pull/38057]
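
For context, suite-level Metaspace exhaustion in long-lived forked test JVMs is 
usually mitigated by giving the forked JVM more Metaspace. A minimal sbt sketch of 
that kind of setting (illustrative values only; this is not necessarily what the 
linked pull request changes):

{code:java}
// build.sbt sketch (assumption, for illustration): run tests in a forked JVM
// and raise its Metaspace ceiling so suites that repeatedly load isolated
// Hive client classloaders do not abort with OutOfMemoryError: Metaspace.
Test / fork := true
Test / javaOptions ++= Seq(
  "-Xmx4g",
  "-XX:MaxMetaspaceSize=2g"  // example value; tune for the build in question
)
{code}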

> HivePartitionFilteringSuites test aborted due to 
> `java.lang.OutOfMemoryError: Metaspace`
> -
>
> Key: SPARK-40619
> URL: https://issues.apache.org/jira/browse/SPARK-40619
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>
> for example: 
> [https://pipelines.actions.githubusercontent.com/serviceHosts/c184045e-b556-4e78-b8ef-fb37b2eda9a3/_apis/pipelines/1/runs/46804/signedlogcontent/18?urlExpires=2022-09-30T03%3A50%3A17.2786839Z&urlSigningMethod=HMACV1&urlSignature=z5biG0fhc482vVPl3u74twzUWTssQJGny3N1xHCk43c%3D]
>  
> {code:java}
> 2022-09-29T16:23:50.4263170Z [info] 
> org.apache.spark.sql.hive.client.HivePartitionFilteringSuites *** ABORTED *** 
> (26 minutes, 32 seconds)
> 2022-09-29T16:23:50.4340944Z [info]   
> java.lang.reflect.InvocationTargetException:
> 2022-09-29T16:23:50.4341736Z [info]   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> 2022-09-29T16:23:50.4342537Z [info]   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> 2022-09-29T16:23:50.4343543Z [info]   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> 2022-09-29T16:23:50.4344319Z [info]   at 
> java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> 2022-09-29T16:23:50.4345108Z [info]   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:315)
> 2022-09-29T16:23:50.4346070Z [info]   at 
> org.apache.spark.sql.hive.client.HiveClientBuilder$.buildClient(HiveClientBuilder.scala:50)
> 2022-09-29T16:23:50.4347512Z [info]   at 
> org.apache.spark.sql.hive.client.HiveVersionSuite.buildClient(HiveVersionSuite.scala:48)
> 2022-09-29T16:23:50.4348463Z [info]   at 
> org.apache.spark.sql.hive.client.HivePartitionFilteringSuite.init(HivePartitionFilteringSuite.scala:73)
> 2022-09-29T16:23:50.4349656Z [info]   at 
> org.apache.spark.sql.hive.client.HivePartitionFilteringSuite.beforeAll(HivePartitionFilteringSuite.scala:118)
> 2022-09-29T16:23:50.4350533Z [info]   at 
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
> 2022-09-29T16:23:50.4351500Z [info]   at 
> org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
> 2022-09-29T16:23:50.4352219Z [info]   at 
> org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
> 2022-09-29T16:23:50.4353147Z [info]   at 
> org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:66)
> 2022-09-29T16:23:50.4353841Z [info]   at 
> org.scalatest.Suite.callExecuteOnSuite$1(Suite.scala:1178)
> 2022-09-29T16:23:50.4354737Z [info]   at 
> org.scalatest.Suite.$anonfun$runNestedSuites$1(Suite.scala:1225)
> 2022-09-29T16:23:50.4355475Z [info]   at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
> 2022-09-29T16:23:50.4356464Z [info]   at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
> 2022-09-29T16:23:50.4357212Z [info]   at 
> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
> 2022-09-29T16:23:50.4358108Z [info]   at 
> org.scalatest.Suite.runNestedSuites(Suite.scala:1223)
> 2022-09-29T16:23:50.4358777Z [info]   at 
> org.scalatest.Suite.runNestedSuites$(Suite.scala:1156)
> 2022-09-29T16:23:50.4359870Z [info]   at 
> org.apache.spark.sql.hive.client.HivePartitionFilteringSuites.runNestedSuites(HivePartitionFilteringSuites.scala:24)
> 2022-09-29T16:23:50.4360679Z [info]   at 
> org.scalatest.Suite.run(Suite.scala:)
> 2022-09-29T16:23:50.4361498Z [info]   at 
> org.scalatest.Suite.run$(Suite.scala:1096)
> 2022-09-29T16:23:50.4362487Z [info]   at 
> org.apache.spark.sql.hive.client.HivePartitionFilteringSuites.run(HivePartitionFilteringSuites.scala:24)
> 2022-09-29T16:23:50.4363571Z [info]   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:321)
> 2022-09-29T16:23:50.4364320Z [info]   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:516)
> 2022-09-29T16:23:50.4365208Z [info]   at 
> sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413)
> 2022-09-29T16:23:50.4365870Z [info]   at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 2022-09-29T16:23:50.4366831Z [info]   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 2022-09-29T16:23:50.4368396Z [info] 

[jira] [Assigned] (SPARK-40620) Deduplication of WorkerOffer build in CoarseGrainedSchedulerBackend

2022-09-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40620:


Assignee: (was: Apache Spark)

> Deduplication of WorkerOffer build in CoarseGrainedSchedulerBackend
> ---
>
> Key: SPARK-40620
> URL: https://issues.apache.org/jira/browse/SPARK-40620
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Khalid Mammadov
>Priority: Trivial
>
> The WorkerOffer construction in CoarseGrainedSchedulerBackend is repeated in 
> two different places with exactly the same parameters. We can deduplicate it 
> and improve readability by moving that construction into a private function.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40620) Deduplication of WorkerOffer build in CoarseGrainedSchedulerBackend

2022-09-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611400#comment-17611400
 ] 

Apache Spark commented on SPARK-40620:
--

User 'khalidmammadov' has created a pull request for this issue:
https://github.com/apache/spark/pull/38058

> Deduplication of WorkerOffer build in CoarseGrainedSchedulerBackend
> ---
>
> Key: SPARK-40620
> URL: https://issues.apache.org/jira/browse/SPARK-40620
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Khalid Mammadov
>Priority: Trivial
>
> The WorkerOffer construction in CoarseGrainedSchedulerBackend is repeated in 
> two different places with exactly the same parameters. We can deduplicate it 
> and improve readability by moving that construction into a private function.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40619) HivePartitionFilteringSuites test aborted due to `java.lang.OutOfMemoryError: Metaspace`

2022-09-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40619:


Assignee: Yang Jie

> HivePartitionFilteringSuites test aborted due to 
> `java.lang.OutOfMemoryError: Metaspace`
> -
>
> Key: SPARK-40619
> URL: https://issues.apache.org/jira/browse/SPARK-40619
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> for example: 
> [https://pipelines.actions.githubusercontent.com/serviceHosts/c184045e-b556-4e78-b8ef-fb37b2eda9a3/_apis/pipelines/1/runs/46804/signedlogcontent/18?urlExpires=2022-09-30T03%3A50%3A17.2786839Z&urlSigningMethod=HMACV1&urlSignature=z5biG0fhc482vVPl3u74twzUWTssQJGny3N1xHCk43c%3D]
>  
> {code:java}
> 2022-09-29T16:23:50.4263170Z [info] 
> org.apache.spark.sql.hive.client.HivePartitionFilteringSuites *** ABORTED *** 
> (26 minutes, 32 seconds)
> 2022-09-29T16:23:50.4340944Z [info]   
> java.lang.reflect.InvocationTargetException:
> 2022-09-29T16:23:50.4341736Z [info]   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> 2022-09-29T16:23:50.4342537Z [info]   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> 2022-09-29T16:23:50.4343543Z [info]   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> 2022-09-29T16:23:50.4344319Z [info]   at 
> java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> 2022-09-29T16:23:50.4345108Z [info]   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:315)
> 2022-09-29T16:23:50.4346070Z [info]   at 
> org.apache.spark.sql.hive.client.HiveClientBuilder$.buildClient(HiveClientBuilder.scala:50)
> 2022-09-29T16:23:50.4347512Z [info]   at 
> org.apache.spark.sql.hive.client.HiveVersionSuite.buildClient(HiveVersionSuite.scala:48)
> 2022-09-29T16:23:50.4348463Z [info]   at 
> org.apache.spark.sql.hive.client.HivePartitionFilteringSuite.init(HivePartitionFilteringSuite.scala:73)
> 2022-09-29T16:23:50.4349656Z [info]   at 
> org.apache.spark.sql.hive.client.HivePartitionFilteringSuite.beforeAll(HivePartitionFilteringSuite.scala:118)
> 2022-09-29T16:23:50.4350533Z [info]   at 
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
> 2022-09-29T16:23:50.4351500Z [info]   at 
> org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
> 2022-09-29T16:23:50.4352219Z [info]   at 
> org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
> 2022-09-29T16:23:50.4353147Z [info]   at 
> org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:66)
> 2022-09-29T16:23:50.4353841Z [info]   at 
> org.scalatest.Suite.callExecuteOnSuite$1(Suite.scala:1178)
> 2022-09-29T16:23:50.4354737Z [info]   at 
> org.scalatest.Suite.$anonfun$runNestedSuites$1(Suite.scala:1225)
> 2022-09-29T16:23:50.4355475Z [info]   at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
> 2022-09-29T16:23:50.4356464Z [info]   at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
> 2022-09-29T16:23:50.4357212Z [info]   at 
> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
> 2022-09-29T16:23:50.4358108Z [info]   at 
> org.scalatest.Suite.runNestedSuites(Suite.scala:1223)
> 2022-09-29T16:23:50.4358777Z [info]   at 
> org.scalatest.Suite.runNestedSuites$(Suite.scala:1156)
> 2022-09-29T16:23:50.4359870Z [info]   at 
> org.apache.spark.sql.hive.client.HivePartitionFilteringSuites.runNestedSuites(HivePartitionFilteringSuites.scala:24)
> 2022-09-29T16:23:50.4360679Z [info]   at 
> org.scalatest.Suite.run(Suite.scala:)
> 2022-09-29T16:23:50.4361498Z [info]   at 
> org.scalatest.Suite.run$(Suite.scala:1096)
> 2022-09-29T16:23:50.4362487Z [info]   at 
> org.apache.spark.sql.hive.client.HivePartitionFilteringSuites.run(HivePartitionFilteringSuites.scala:24)
> 2022-09-29T16:23:50.4363571Z [info]   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:321)
> 2022-09-29T16:23:50.4364320Z [info]   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:516)
> 2022-09-29T16:23:50.4365208Z [info]   at 
> sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413)
> 2022-09-29T16:23:50.4365870Z [info]   at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 2022-09-29T16:23:50.4366831Z [info]   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 2022-09-29T16:23:50.4368396Z [info]   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 2022-09-29T16:23:50.4368925Z [info]   at java.

[jira] [Commented] (SPARK-40619) HivePartitionFilteringSuites test aborted due to `java.lang.OutOfMemoryError: Metaspace`

2022-09-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611403#comment-17611403
 ] 

Apache Spark commented on SPARK-40619:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38057

> HivePartitionFilteringSuites test aborted due to 
> `java.lang.OutOfMemoryError: Metaspace`
> -
>
> Key: SPARK-40619
> URL: https://issues.apache.org/jira/browse/SPARK-40619
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>
> for example: 
> [https://pipelines.actions.githubusercontent.com/serviceHosts/c184045e-b556-4e78-b8ef-fb37b2eda9a3/_apis/pipelines/1/runs/46804/signedlogcontent/18?urlExpires=2022-09-30T03%3A50%3A17.2786839Z&urlSigningMethod=HMACV1&urlSignature=z5biG0fhc482vVPl3u74twzUWTssQJGny3N1xHCk43c%3D]
>  
> {code:java}
> 2022-09-29T16:23:50.4263170Z [info] 
> org.apache.spark.sql.hive.client.HivePartitionFilteringSuites *** ABORTED *** 
> (26 minutes, 32 seconds)
> 2022-09-29T16:23:50.4340944Z [info]   
> java.lang.reflect.InvocationTargetException:
> 2022-09-29T16:23:50.4341736Z [info]   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> 2022-09-29T16:23:50.4342537Z [info]   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> 2022-09-29T16:23:50.4343543Z [info]   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> 2022-09-29T16:23:50.4344319Z [info]   at 
> java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> 2022-09-29T16:23:50.4345108Z [info]   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:315)
> 2022-09-29T16:23:50.4346070Z [info]   at 
> org.apache.spark.sql.hive.client.HiveClientBuilder$.buildClient(HiveClientBuilder.scala:50)
> 2022-09-29T16:23:50.4347512Z [info]   at 
> org.apache.spark.sql.hive.client.HiveVersionSuite.buildClient(HiveVersionSuite.scala:48)
> 2022-09-29T16:23:50.4348463Z [info]   at 
> org.apache.spark.sql.hive.client.HivePartitionFilteringSuite.init(HivePartitionFilteringSuite.scala:73)
> 2022-09-29T16:23:50.4349656Z [info]   at 
> org.apache.spark.sql.hive.client.HivePartitionFilteringSuite.beforeAll(HivePartitionFilteringSuite.scala:118)
> 2022-09-29T16:23:50.4350533Z [info]   at 
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
> 2022-09-29T16:23:50.4351500Z [info]   at 
> org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
> 2022-09-29T16:23:50.4352219Z [info]   at 
> org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
> 2022-09-29T16:23:50.4353147Z [info]   at 
> org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:66)
> 2022-09-29T16:23:50.4353841Z [info]   at 
> org.scalatest.Suite.callExecuteOnSuite$1(Suite.scala:1178)
> 2022-09-29T16:23:50.4354737Z [info]   at 
> org.scalatest.Suite.$anonfun$runNestedSuites$1(Suite.scala:1225)
> 2022-09-29T16:23:50.4355475Z [info]   at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
> 2022-09-29T16:23:50.4356464Z [info]   at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
> 2022-09-29T16:23:50.4357212Z [info]   at 
> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
> 2022-09-29T16:23:50.4358108Z [info]   at 
> org.scalatest.Suite.runNestedSuites(Suite.scala:1223)
> 2022-09-29T16:23:50.4358777Z [info]   at 
> org.scalatest.Suite.runNestedSuites$(Suite.scala:1156)
> 2022-09-29T16:23:50.4359870Z [info]   at 
> org.apache.spark.sql.hive.client.HivePartitionFilteringSuites.runNestedSuites(HivePartitionFilteringSuites.scala:24)
> 2022-09-29T16:23:50.4360679Z [info]   at 
> org.scalatest.Suite.run(Suite.scala:)
> 2022-09-29T16:23:50.4361498Z [info]   at 
> org.scalatest.Suite.run$(Suite.scala:1096)
> 2022-09-29T16:23:50.4362487Z [info]   at 
> org.apache.spark.sql.hive.client.HivePartitionFilteringSuites.run(HivePartitionFilteringSuites.scala:24)
> 2022-09-29T16:23:50.4363571Z [info]   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:321)
> 2022-09-29T16:23:50.4364320Z [info]   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:516)
> 2022-09-29T16:23:50.4365208Z [info]   at 
> sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413)
> 2022-09-29T16:23:50.4365870Z [info]   at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 2022-09-29T16:23:50.4366831Z [info]   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 2022-09-

[jira] [Commented] (SPARK-40563) Error at where clause, when sql case executes by else branch

2022-09-30 Thread Vadim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611409#comment-17611409
 ] 

Vadim commented on SPARK-40563:
---

[~Zing] 

Our respect, thanks for the help!

> Error at where clause, when sql case executes by else branch
> 
>
> Key: SPARK-40563
> URL: https://issues.apache.org/jira/browse/SPARK-40563
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Vadim
>Priority: Major
> Fix For: 3.3.1
>
> Attachments: java-code-example.txt, sql.txt, stack-trace.txt
>
>
> Hello!
> The Spark SQL phase optimization failed with an internal error. Please, fill 
> a bug report in, and provide the full stack trace.
>  - Spark version 3.3.0
>  - Scala version 2.12
>  - DatasourceV2
>  - Postgres
>  - Postgres JDBC Driver: 42+
>  - Java8
> Case:
> select
>     case
>         when (t_name = 'foo') then 'foo'
>         else 'default'
>     end as case_when
> from
>     t
> where
>     case
>         when (t_name = 'foo') then 'foo'
>         else 'default'
>     end = 'foo';  -- works as expected
> --
> select
>     case
>         when (t_name = 'foo') then 'foo'
>         else 'default'
>     end as case_when
> from
>     t
> where
>     case
>         when (t_name = 'foo') then 'foo'
>         else 'default'
>     end = 'default';  -- query throws an exception
> In the where clause, when we try to find rows via the else branch, Spark throws an exception:
> The Spark SQL phase optimization failed with an internal error. Please, fill 
> a bug report in, and provide the full stack trace.
> Caused by: java.lang.AssertionError: assertion failed
>     at scala.Predef$.assert(Predef.scala:208)
>  
> org.apache.spark.sql.execution.datasources.v2.PushablePredicate.$anonfun$unapply$1(DataSourceV2Strategy.scala:589)
> In the debugger, in def unapply in PushablePredicate:
> when the SQL case returns 'foo', unapply receives (t_name = 'foo') as an 
> instance of Predicate;
> when the SQL case returns 'default', unapply receives COALESCE(t_name = 
> 'foo', FALSE) as an instance of GeneralScalarExpression, and the assertion 
> fails with the error above.
>  
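
The failure mode described above can be sketched with simplified stand-in types 
(this is only an illustration of the extractor pattern involved, not the actual 
Spark code or the actual fix):

{code:java}
// Simplified stand-ins for Spark's DataSource V2 expression classes.
sealed trait V2Expression
case class V2Predicate(sql: String) extends V2Expression
case class GeneralScalarExpression(sql: String) extends V2Expression

// Behaviour as described in the report: the extractor asserts the translated
// expression is a predicate, so COALESCE(t_name = 'foo', FALSE) -- which
// arrives as a GeneralScalarExpression -- trips the assertion.
object PushablePredicateStrict {
  def unapply(e: V2Expression): Option[V2Predicate] = {
    assert(e.isInstanceOf[V2Predicate])
    Some(e.asInstanceOf[V2Predicate])
  }
}

// A defensive alternative (hypothetical): decline to push down anything that
// is not a predicate instead of asserting, so the query still runs.
object PushablePredicateLenient {
  def unapply(e: V2Expression): Option[V2Predicate] = e match {
    case p: V2Predicate => Some(p)
    case _              => None
  }
}
{code}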



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40621) Add `numeric_only` and `min_count` in `GroupBy.sum`

2022-09-30 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-40621:
-

 Summary: Add `numeric_only` and `min_count` in `GroupBy.sum`
 Key: SPARK-40621
 URL: https://issues.apache.org/jira/browse/SPARK-40621
 Project: Spark
  Issue Type: Sub-task
  Components: ps
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40621) Implement `numeric_only` and `min_count` in `GroupBy.sum`

2022-09-30 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-40621:
--
Summary: Implement `numeric_only` and `min_count` in `GroupBy.sum`  (was: 
Add `numeric_only` and `min_count` in `GroupBy.sum`)

> Implement `numeric_only` and `min_count` in `GroupBy.sum`
> -
>
> Key: SPARK-40621
> URL: https://issues.apache.org/jira/browse/SPARK-40621
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40621) Implement `numeric_only` and `min_count` in `GroupBy.sum`

2022-09-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40621:


Assignee: (was: Apache Spark)

> Implement `numeric_only` and `min_count` in `GroupBy.sum`
> -
>
> Key: SPARK-40621
> URL: https://issues.apache.org/jira/browse/SPARK-40621
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40621) Implement `numeric_only` and `min_count` in `GroupBy.sum`

2022-09-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40621:


Assignee: Apache Spark

> Implement `numeric_only` and `min_count` in `GroupBy.sum`
> -
>
> Key: SPARK-40621
> URL: https://issues.apache.org/jira/browse/SPARK-40621
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40621) Implement `numeric_only` and `min_count` in `GroupBy.sum`

2022-09-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611450#comment-17611450
 ] 

Apache Spark commented on SPARK-40621:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/38060

> Implement `numeric_only` and `min_count` in `GroupBy.sum`
> -
>
> Key: SPARK-40621
> URL: https://issues.apache.org/jira/browse/SPARK-40621
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40165) Update test plugins to latest versions

2022-09-30 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-40165:

Description: 
Include:
 * 1.scalacheck (from 1.16.0 to 1.17.0)
 * 2.maven-surefire-plugin (from 3.0.0-M5 to 3.0.0-M7)
 * 3.maven-dependency-plugin (from 3.1.1 to 3.3.0)

 

  was:
Include:
 * 1.scalacheck (from 1.15.4 to 1.16.0)
 * 2.maven-surefire-plugin (from 3.0.0-M5 to 3.0.0-M7)
 * 3.maven-dependency-plugin (from 3.1.1 to 3.3.0)

 


> Update test plugins to latest versions
> --
>
> Key: SPARK-40165
> URL: https://issues.apache.org/jira/browse/SPARK-40165
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Trivial
>
> Include:
>  * 1.scalacheck (from 1.16.0 to 1.17.0)
>  * 2.maven-surefire-plugin (from 3.0.0-M5 to 3.0.0-M7)
>  * 3.maven-dependency-plugin (from 3.1.1 to 3.3.0)
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40621) Implement `numeric_only` and `min_count` in `GroupBy.sum`

2022-09-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40621:


Assignee: Ruifeng Zheng

> Implement `numeric_only` and `min_count` in `GroupBy.sum`
> -
>
> Key: SPARK-40621
> URL: https://issues.apache.org/jira/browse/SPARK-40621
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40621) Implement `numeric_only` and `min_count` in `GroupBy.sum`

2022-09-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40621.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38060
[https://github.com/apache/spark/pull/38060]

> Implement `numeric_only` and `min_count` in `GroupBy.sum`
> -
>
> Key: SPARK-40621
> URL: https://issues.apache.org/jira/browse/SPARK-40621
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40448) Prototype implementation

2022-09-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611553#comment-17611553
 ] 

Apache Spark commented on SPARK-40448:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/38061

> Prototype implementation
> 
>
> Key: SPARK-40448
> URL: https://issues.apache.org/jira/browse/SPARK-40448
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.2.2
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Major
> Fix For: 3.4.0
>
>
> In [https://github.com/apache/spark/pull/37710] we created a prototype that 
> shows the end-to-end integration of Spark Connect with the rest of the system.
>  
> Since the PR is quite large, we will track follow-up items as children of 
> SPARK-39375.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40540) Migrate compilation errors onto error classes

2022-09-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611685#comment-17611685
 ] 

Apache Spark commented on SPARK-40540:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/38062

> Migrate compilation errors onto error classes
> -
>
> Key: SPARK-40540
> URL: https://issues.apache.org/jira/browse/SPARK-40540
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>
> Use temporary error classes in the compilation exceptions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40622) Result of a single task in collect() must fit in 2GB

2022-09-30 Thread Ziqi Liu (Jira)
Ziqi Liu created SPARK-40622:


 Summary: Result of a single task in collect() must fit in 2GB
 Key: SPARK-40622
 URL: https://issues.apache.org/jira/browse/SPARK-40622
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 3.3.0
Reporter: Ziqi Liu


When collecting results, the data from a single partition/task is serialized into 
a byte array or a ByteBuffer (which is backed by a byte array as well), so it is 
subject to the Java array max size limit (for a byte array, 2GB).

 

Constructing a single partition larger than 2GB and collecting it easily 
reproduces the issue:
{code:java}
val df = spark.range(0, 3000, 1, 1).selectExpr("id", s"genData(id, 100) as 
data")

withSQLConf("spark.databricks.driver.localMaxResultSize" -> "4g") {
  withSQLConf("spark.sql.useChunkedBuffer" -> "true") {
df.queryExecution.executedPlan.executeCollect()
  }
} {code}
This will get an OOM error from 
[https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/share/classes/java/io/ByteArrayOutputStream.java#L125]

 

Consider using ChunkedByteBuffer to replace the byte array in order to bypass 
this limit.
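
For context, a small self-contained sketch of the chunking idea (plain Scala, not 
Spark's actual ChunkedByteBuffer API): accumulating output in many fixed-size 
chunks avoids ever allocating one Array[Byte] near the 2GB cap.

{code:java}
import java.io.OutputStream
import scala.collection.mutable.ArrayBuffer

// Sketch: bytes are stored across many small arrays, so the total size is
// bounded by available memory rather than by the 2GB limit of a single array.
class ChunkedOutputStreamSketch(chunkSize: Int = 4 * 1024 * 1024) extends OutputStream {
  private val chunks = ArrayBuffer[Array[Byte]]()
  private var current = new Array[Byte](chunkSize)
  private var pos = 0

  override def write(b: Int): Unit = {
    if (pos == chunkSize) {        // current chunk is full: start a new one
      chunks += current
      current = new Array[Byte](chunkSize)
      pos = 0
    }
    current(pos) = b.toByte
    pos += 1
  }

  // Total bytes written; a Long, so it can legitimately exceed 2GB.
  def totalSize: Long = chunks.size.toLong * chunkSize + pos
}
{code}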



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40622) Result of a single task in collect() must fit in 2GB

2022-09-30 Thread Ziqi Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ziqi Liu updated SPARK-40622:
-
Description: 
When collecting results, the data from a single partition/task is serialized into 
a byte array or a ByteBuffer (which is backed by a byte array as well), so it is 
subject to the Java array max size limit (for a byte array, 2GB).

 

Constructing a single partition larger than 2GB and collecting it easily 
reproduces the issue:
{code:java}
// create data of size ~3GB in single partition, which exceeds the byte array 
limit
// random gen to make sure it's poorly compressed
val df = spark.range(0, 3000, 1, 1).selectExpr("id", s"genData(id, 100) as 
data")

withSQLConf("spark.databricks.driver.localMaxResultSize" -> "4g") {
  withSQLConf("spark.sql.useChunkedBuffer" -> "true") {
df.queryExecution.executedPlan.executeCollect()
  }
} {code}
This will get an OOM error from 
[https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/share/classes/java/io/ByteArrayOutputStream.java#L125]

 

Consider using ChunkedByteBuffer to replace the byte array in order to bypass 
this limit.

  was:
when collecting results, data from single partition/task is serialized through 
byte array or ByteBuffer(which is backed by byte array as well), therefore it's 
subject to java array max size limit(in terms of byte array, it's 2GB).

 

Construct a single partition larger than 2GB and collect it can easily 
reproduce the issue
{code:java}
val df = spark.range(0, 3000, 1, 1).selectExpr("id", s"genData(id, 100) as 
data")

withSQLConf("spark.databricks.driver.localMaxResultSize" -> "4g") {
  withSQLConf("spark.sql.useChunkedBuffer" -> "true") {
df.queryExecution.executedPlan.executeCollect()
  }
} {code}
 will get a OOM error from 
[https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/share/classes/java/io/ByteArrayOutputStream.java#L125]

 

Consider using ChunkedByteBuffer to replace byte array in order to bypassing 
this limit


> Result of a single task in collect() must fit in 2GB
> 
>
> Key: SPARK-40622
> URL: https://issues.apache.org/jira/browse/SPARK-40622
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: Ziqi Liu
>Priority: Major
>
> When collecting results, the data from a single partition/task is serialized 
> into a byte array or a ByteBuffer (which is backed by a byte array as well), 
> so it is subject to the Java array max size limit (for a byte array, 2GB).
>  
> Constructing a single partition larger than 2GB and collecting it easily 
> reproduces the issue:
> {code:java}
> // create data of size ~3GB in single partition, which exceeds the byte array 
> limit
> // random gen to make sure it's poorly compressed
> val df = spark.range(0, 3000, 1, 1).selectExpr("id", s"genData(id, 100) 
> as data")
> withSQLConf("spark.databricks.driver.localMaxResultSize" -> "4g") {
>   withSQLConf("spark.sql.useChunkedBuffer" -> "true") {
> df.queryExecution.executedPlan.executeCollect()
>   }
> } {code}
> This will get an OOM error from 
> [https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/share/classes/java/io/ByteArrayOutputStream.java#L125]
>  
> Consider using ChunkedByteBuffer to replace the byte array in order to bypass 
> this limit.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39725) Upgrade jetty-http from 9.4.46.v20220331 to 9.4.48.v20220622

2022-09-30 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-39725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611725#comment-17611725
 ] 

Bjørn Jørgensen commented on SPARK-39725:
-

Yes, for the release question I would recommend asking that on 
us...@spark.org or d...@spark.org,

as we do have tools like make-distribution.sh 
https://github.com/apache/spark/blob/master/dev/make-distribution.sh

On some of my PRs I have seen that others forward them to other 
repos.

> Upgrade jetty-http from 9.4.46.v20220331 to 9.4.48.v20220622
> 
>
> Key: SPARK-39725
> URL: https://issues.apache.org/jira/browse/SPARK-39725
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: jetty-io-spark.png
>
>
> [Release note |https://github.com/eclipse/jetty.project/releases] 
> [CVE-2022-2047|https://nvd.nist.gov/vuln/detail/CVE-2022-2047]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40569) Expose port for spark standalone mode

2022-09-30 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-40569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611730#comment-17611730
 ] 

Bjørn Jørgensen commented on SPARK-40569:
-

Like this on https://github.com/jupyter/docker-stacks/pull/1783 



> Expose port for spark standalone mode
> -
>
> Key: SPARK-40569
> URL: https://issues.apache.org/jira/browse/SPARK-40569
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Qian Sun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40612) On Kubernetes for long running app Spark using an invalid principal to renew the delegation token

2022-09-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-40612.
---
Fix Version/s: 3.3.2
   3.2.3
   3.4.0
   Resolution: Fixed

Issue resolved by pull request 38048
[https://github.com/apache/spark/pull/38048]

> On Kubernetes for long running app Spark using an invalid principal to renew 
> the delegation token
> -
>
> Key: SPARK-40612
> URL: https://issues.apache.org/jira/browse/SPARK-40612
> Project: Spark
>  Issue Type: Bug
>  Components: Security
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 
> 3.1.3, 3.2.1, 3.3.0, 3.2.2
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.3.2, 3.2.3, 3.4.0
>
>
> When the delegation token is fetched for the first time, the principal is the 
> current user, but subsequent token renewals use a MapReduce/YARN-specific 
> principal even on Kubernetes. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40623) Add new SQL built-in functions to help with redacting data

2022-09-30 Thread Daniel (Jira)
Daniel created SPARK-40623:
--

 Summary: Add new SQL built-in functions to help with redacting data
 Key: SPARK-40623
 URL: https://issues.apache.org/jira/browse/SPARK-40623
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.4.0
Reporter: Daniel


This issue tracks building new scalar SQL functions into Spark for the purpose of 
redacting sensitive information from fields. These can be useful for creating 
copies of tables with the sensitive information removed while retaining the same 
schema; a rough sketch of the idea with existing functions follows below.
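
A rough sketch of the kind of redaction these functions would make convenient, 
expressed with existing Spark SQL functions (the table, column names and masking 
rule here are made up for illustration; the proposed built-ins themselves are not 
yet available):

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.regexp_replace

val spark = SparkSession.builder().master("local[*]").appName("redact-sketch").getOrCreate()
import spark.implicits._

// Example table with a sensitive column.
val people = Seq(("alice", "4111-1111-1111-1111"), ("bob", "5500 0000 0000 0004"))
  .toDF("name", "card_number")

// Overwrite every digit with '*': the value becomes unreadable but the column
// keeps its name and type, so the redacted copy has the same schema.
val redacted = people.withColumn(
  "card_number", regexp_replace($"card_number", "[0-9]", "*"))
redacted.show(false)
{code}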



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40624) A DECIMAL value with division by 0 errors in DataFrame but evaluates to NULL in SparkSQL

2022-09-30 Thread xsys (Jira)
xsys created SPARK-40624:


 Summary: A DECIMAL value with division by 0 errors in DataFrame 
but evaluates to NULL in SparkSQL
 Key: SPARK-40624
 URL: https://issues.apache.org/jira/browse/SPARK-40624
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 3.2.1
Reporter: xsys


h3. Describe the bug

Storing an invalid value (e.g. {{{}BigDecimal("1.0/0"){}}}) via {{spark-shell}} 
errors out during RDD creation. However, {{1.0/0}} evaluated to {{NULL}} if the 
value is inserted into a {{DECIMAL(20,10)}} column of a table via 
{{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:

 
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following: (evaluated to {{{}NULL{}}})
{code:java}
spark-sql> create table decimal_vals(c1 DECIMAL(20,10)) stored as ORC;
spark-sql> insert into decimal_vals 1.0/0;
spark-sql> select * from ws71;
71    NULL{code}
Using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following: (errors out during RDD creation)
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("1.0/0"))))
java.lang.NumberFormatException
  at java.math.BigDecimal.<init>(BigDecimal.java:497)
  at java.math.BigDecimal.<init>(BigDecimal.java:383)
  at java.math.BigDecimal.<init>(BigDecimal.java:809)
  at scala.math.BigDecimal$.exact(BigDecimal.scala:126)
  at scala.math.BigDecimal$.apply(BigDecimal.scala:284)
  ... 49 elided{code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination 
({{{}BigDecimal{}}}/{{{}DECIMAL(20,10){}}} and {{{}1.0/0{}}}).

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40624) A DECIMAL value with division by 0 errors in DataFrame but evaluates to NULL in SparkSQL

2022-09-30 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40624:
-
Description: 
h3. Describe the bug

Storing an invalid value (e.g. {{{}BigDecimal("1.0/0"){}}}) via {{spark-shell}} 
errors out during RDD creation. However, {{1.0/0}} evaluates to {{NULL}} if the 
value is inserted into a {{DECIMAL(20,10)}} column of a table via 
{{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following: (evaluated to {{{}NULL{}}})
{code:java}
spark-sql> create table decimal_vals(c1 DECIMAL(20,10)) stored as ORC;
spark-sql> insert into decimal_vals 1.0/0;
spark-sql> select * from ws71;
71    NULL{code}
Using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following: (errors out during RDD creation)
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("1.0/0"))))
java.lang.NumberFormatException
  at java.math.BigDecimal.<init>(BigDecimal.java:497)
  at java.math.BigDecimal.<init>(BigDecimal.java:383)
  at java.math.BigDecimal.<init>(BigDecimal.java:809)
  at scala.math.BigDecimal$.exact(BigDecimal.scala:126)
  at scala.math.BigDecimal$.apply(BigDecimal.scala:284)
  ... 49 elided{code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination 
({{{}BigDecimal{}}}/{{{}DECIMAL(20,10){}}} and {{{}1.0/0{}}}).

 

  was:
h3. Describe the bug

Storing an invalid value (e.g. {{{}BigDecimal("1.0/0"){}}}) via {{spark-shell}} 
errors out during RDD creation. However, {{1.0/0}} evaluated to {{NULL}} if the 
value is inserted into a {{DECIMAL(20,10)}} column of a table via 
{{{}spark-sql{}}}.
h3. To Reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:

 
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following: (evaluated to {{{}NULL{}}})
{code:java}
spark-sql> create table decimal_vals(c1 DECIMAL(20,10)) stored as ORC;
spark-sql> insert into decimal_vals 1.0/0;
spark-sql> select * from ws71;
71    NULL{code}
Using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following: (errors out during RDD creation)
{code:java}
scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("1.0/0"
java.lang.NumberFormatException
  at java.math.BigDecimal.(BigDecimal.java:497)
  at java.math.BigDecimal.(BigDecimal.java:383)
  at java.math.BigDecimal.(BigDecimal.java:809)
  at scala.math.BigDecimal$.exact(BigDecimal.scala:126)
  at scala.math.BigDecimal$.apply(BigDecimal.scala:284)
  ... 49 elided{code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination 
({{{}BigDecimal{}}}/{{{}DECIMAL(20,10){}}} and {{{}1.0/0{}}}).

 


> A DECIMAL value with division by 0 errors in DataFrame but evaluates to NULL 
> in SparkSQL
> 
>
> Key: SPARK-40624
> URL: https://issues.apache.org/jira/browse/SPARK-40624
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> Storing an invalid value (e.g. {{{}BigDecimal("1.0/0"){}}}) via 
> {{spark-shell}} errors out during RDD creation. However, {{1.0/0}} evaluates 
> to {{NULL}} if the value is inserted into a {{DECIMAL(20,10)}} column of a 
> table via {{{}spark-sql{}}}.
> h3. To Reproduce
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-sql{code}
> Execute the following: (evaluated to {{{}NULL{}}})
> {code:java}
> spark-sql> create table decimal_vals(c1 DECIMAL(20,10)) stored as ORC;
> spark-sql> insert into decimal_vals 1.0/0;
> spark-sql> select * from ws71;
> 71    NULL{code}
> Using {{{}spark-shell{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-shell{code}
> Execute the following: (errors out during RDD creation)
> {code:java}
> scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("1.0/0"))))
> java.lang.NumberFormatException
>   at java.math.BigDecimal.<init>(BigDecimal.java:497)
>   at java.math.BigDecimal.<init>(BigDecimal.java:383)
>   at java.math.BigDecimal.<init>(BigDecimal.java:809)
>   at scala.math.BigDecimal$.exact(BigDecimal.scala:126)
>   at scala.math.BigDecimal$.apply(BigDecimal.scala:284)
>   ... 49 elided{code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type & input combination 
> ({{{}BigDecimal{}}}/{{{}DECIMAL(20,10){}}} and {{{}1.0/0{}}}).
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---

[jira] [Assigned] (SPARK-40622) Result of a single task in collect() must fit in 2GB

2022-09-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40622:


Assignee: Apache Spark

> Result of a single task in collect() must fit in 2GB
> 
>
> Key: SPARK-40622
> URL: https://issues.apache.org/jira/browse/SPARK-40622
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: Ziqi Liu
>Assignee: Apache Spark
>Priority: Major
>
> When collecting results, the data from a single partition/task is serialized 
> into a byte array or a ByteBuffer (which is backed by a byte array as well), 
> so it is subject to the Java array max size limit (for a byte array, 2GB).
>  
> Constructing a single partition larger than 2GB and collecting it easily 
> reproduces the issue:
> {code:java}
> // create data of size ~3GB in single partition, which exceeds the byte array 
> limit
> // random gen to make sure it's poorly compressed
> val df = spark.range(0, 3000, 1, 1).selectExpr("id", s"genData(id, 100) 
> as data")
> withSQLConf("spark.databricks.driver.localMaxResultSize" -> "4g") {
>   withSQLConf("spark.sql.useChunkedBuffer" -> "true") {
> df.queryExecution.executedPlan.executeCollect()
>   }
> } {code}
> This will get an OOM error from 
> [https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/share/classes/java/io/ByteArrayOutputStream.java#L125]
>  
> Consider using ChunkedByteBuffer to replace the byte array in order to bypass 
> this limit.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40622) Result of a single task in collect() must fit in 2GB

2022-09-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611800#comment-17611800
 ] 

Apache Spark commented on SPARK-40622:
--

User 'liuzqt' has created a pull request for this issue:
https://github.com/apache/spark/pull/38064

> Result of a single task in collect() must fit in 2GB
> 
>
> Key: SPARK-40622
> URL: https://issues.apache.org/jira/browse/SPARK-40622
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: Ziqi Liu
>Priority: Major
>
> When collecting results, the data from a single partition/task is serialized 
> into a byte array or a ByteBuffer (which is backed by a byte array as well), 
> so it is subject to the Java array max size limit (for a byte array, 2GB).
>  
> Constructing a single partition larger than 2GB and collecting it easily 
> reproduces the issue:
> {code:java}
> // create data of size ~3GB in single partition, which exceeds the byte array 
> limit
> // random gen to make sure it's poorly compressed
> val df = spark.range(0, 3000, 1, 1).selectExpr("id", s"genData(id, 100) 
> as data")
> withSQLConf("spark.databricks.driver.localMaxResultSize" -> "4g") {
>   withSQLConf("spark.sql.useChunkedBuffer" -> "true") {
> df.queryExecution.executedPlan.executeCollect()
>   }
> } {code}
> This will get an OOM error from 
> [https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/share/classes/java/io/ByteArrayOutputStream.java#L125]
>  
> Consider using ChunkedByteBuffer to replace the byte array in order to bypass 
> this limit.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40622) Result of a single task in collect() must fit in 2GB

2022-09-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40622:


Assignee: (was: Apache Spark)

> Result of a single task in collect() must fit in 2GB
> 
>
> Key: SPARK-40622
> URL: https://issues.apache.org/jira/browse/SPARK-40622
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: Ziqi Liu
>Priority: Major
>
> When collecting results, the data from a single partition/task is serialized 
> into a byte array or a ByteBuffer (which is backed by a byte array as well), 
> so it is subject to the Java array max size limit (for a byte array, 2GB).
>  
> Constructing a single partition larger than 2GB and collecting it easily 
> reproduces the issue:
> {code:java}
> // create data of size ~3GB in single partition, which exceeds the byte array 
> limit
> // random gen to make sure it's poorly compressed
> val df = spark.range(0, 3000, 1, 1).selectExpr("id", s"genData(id, 100) 
> as data")
> withSQLConf("spark.databricks.driver.localMaxResultSize" -> "4g") {
>   withSQLConf("spark.sql.useChunkedBuffer" -> "true") {
> df.queryExecution.executedPlan.executeCollect()
>   }
> } {code}
> This will get an OOM error from 
> [https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/share/classes/java/io/ByteArrayOutputStream.java#L125]
>  
> Consider using ChunkedByteBuffer to replace the byte array in order to bypass 
> this limit.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40625) Add MASK_CCN and TRY_MASK_CCN functions

2022-09-30 Thread Daniel (Jira)
Daniel created SPARK-40625:
--

 Summary: Add MASK_CCN and TRY_MASK_CCN functions
 Key: SPARK-40625
 URL: https://issues.apache.org/jira/browse/SPARK-40625
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Daniel






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40625) Add MASK_CCN and TRY_MASK_CCN functions

2022-09-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40625:


Assignee: Apache Spark

> Add MASK_CCN and TRY_MASK_CCN functions
> ---
>
> Key: SPARK-40625
> URL: https://issues.apache.org/jira/browse/SPARK-40625
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40625) Add MASK_CCN and TRY_MASK_CCN functions

2022-09-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611801#comment-17611801
 ] 

Apache Spark commented on SPARK-40625:
--

User 'dtenedor' has created a pull request for this issue:
https://github.com/apache/spark/pull/38065

> Add MASK_CCN and TRY_MASK_CCN functions
> ---
>
> Key: SPARK-40625
> URL: https://issues.apache.org/jira/browse/SPARK-40625
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40625) Add MASK_CCN and TRY_MASK_CCN functions

2022-09-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40625:


Assignee: (was: Apache Spark)

> Add MASK_CCN and TRY_MASK_CCN functions
> ---
>
> Key: SPARK-40625
> URL: https://issues.apache.org/jira/browse/SPARK-40625
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40624) A DECIMAL value with division by 0 errors in DataFrame but evaluates to NULL in SparkSQL

2022-09-30 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611803#comment-17611803
 ] 

Bruce Robbins commented on SPARK-40624:
---

That's not a Spark API throwing that exception. Instead, 
{{scala.math.BigDecimal#apply}} (which you call via {{BigDecimal("1.0/0")}}) is 
throwing the exception.

In a plain Scala REPL (no Spark), I can reproduce:
{noformat}
bash-3.2$ bin/scala
Welcome to Scala 2.12.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_311).
Type in expressions for evaluation. Or try :help.

scala> BigDecimal("1.0/0")
java.lang.NumberFormatException
  at java.math.BigDecimal.<init>(BigDecimal.java:497)
  at java.math.BigDecimal.<init>(BigDecimal.java:383)
  at java.math.BigDecimal.<init>(BigDecimal.java:809)
  at scala.math.BigDecimal$.exact(BigDecimal.scala:124)
  at scala.math.BigDecimal$.apply(BigDecimal.scala:282)
  ... 28 elided

scala> 
{noformat}
My guess is that {{BigDecimal#apply(x: String)}} was not written to expect an x 
representing an expression, just representing a number. Even "2/1" fails:
{noformat}
scala> BigDecimal("2/1")
java.lang.NumberFormatException
  at java.math.BigDecimal.<init>(BigDecimal.java:497)
  at java.math.BigDecimal.<init>(BigDecimal.java:383)
  at java.math.BigDecimal.<init>(BigDecimal.java:809)
  at scala.math.BigDecimal$.exact(BigDecimal.scala:124)
  at scala.math.BigDecimal$.apply(BigDecimal.scala:282)
  ... 28 elided

scala> BigDecimal("2.3")
res7: scala.math.BigDecimal = 2.3

scala> 
{noformat}
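For what it's worth, once the division is written as a Spark expression rather than a 
string handed to {{scala.math.BigDecimal}}, the two interfaces do agree: both return 
NULL under the default (non-ANSI) decimal semantics. A minimal spark-shell sketch, 
assuming {{spark.sql.ansi.enabled=false}} (the 3.2.1 default); this is not the 
reporter's original code:
{code:scala}
// Hypothetical spark-shell session, ANSI mode off.
import spark.implicits._
import org.apache.spark.sql.functions.expr

// SQL path: decimal division by zero evaluates to NULL instead of raising an error.
spark.sql("SELECT CAST(1.0 AS DECIMAL(20,10)) / 0").show()    // row contains null

// DataFrame path: the same NULL appears once the division is a Spark expression,
// not a string parsed by scala.math.BigDecimal.
Seq(BigDecimal("1.0")).toDF("c").select(expr("c / 0")).show() // row contains null
{code}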

> A DECIMAL value with division by 0 errors in DataFrame but evaluates to NULL 
> in SparkSQL
> 
>
> Key: SPARK-40624
> URL: https://issues.apache.org/jira/browse/SPARK-40624
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> Storing an invalid value (e.g. {{{}BigDecimal("1.0/0"){}}}) via 
> {{spark-shell}} errors out during RDD creation. However, {{1.0/0}} evaluates 
> to {{NULL}} if the value is inserted into a {{DECIMAL(20,10)}} column of a 
> table via {{{}spark-sql{}}}.
> h3. To Reproduce
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-sql{code}
> Execute the following: (evaluated to {{{}NULL{}}})
> {code:java}
> spark-sql> create table decimal_vals(c1 DECIMAL(20,10)) stored as ORC;
> spark-sql> insert into decimal_vals values (1.0/0);
> spark-sql> select * from decimal_vals;
> NULL{code}
> Using {{{}spark-shell{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-shell{code}
> Execute the following: (errors out during RDD creation)
> {code:java}
> scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("1.0/0"
> java.lang.NumberFormatException
>   at java.math.BigDecimal.<init>(BigDecimal.java:497)
>   at java.math.BigDecimal.<init>(BigDecimal.java:383)
>   at java.math.BigDecimal.<init>(BigDecimal.java:809)
>   at scala.math.BigDecimal$.exact(BigDecimal.scala:126)
>   at scala.math.BigDecimal$.apply(BigDecimal.scala:284)
>   ... 49 elided{code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type & input combination 
> ({{{}BigDecimal{}}}/{{{}DECIMAL(20,10){}}} and {{{}1.0/0{}}}).
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40626) Do not reorder join keys in EnsureRequirements if they are not simple expressions

2022-09-30 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-40626:
---

 Summary: Do not reorder join keys in EnsureRequirements if they 
are not simple expressions
 Key: SPARK-40626
 URL: https://issues.apache.org/jira/browse/SPARK-40626
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Yuming Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40509) Construct an example of applyInPandasWithState in examples directory

2022-09-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611826#comment-17611826
 ] 

Apache Spark commented on SPARK-40509:
--

User 'chaoqin-li1123' has created a pull request for this issue:
https://github.com/apache/spark/pull/38066

> Construct an example of applyInPandasWithState in examples directory
> 
>
> Key: SPARK-40509
> URL: https://issues.apache.org/jira/browse/SPARK-40509
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Assignee: Chaoqin Li
>Priority: Major
> Fix For: 3.4.0
>
>
> Since we are introducing a new API (applyInPandasWithState) in PySpark, it is 
> worth having a separate, full example of the API.
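For reference, a minimal sketch of what such an example could look like. Names here 
are hypothetical: {{words}} is assumed to be a streaming DataFrame with a single 
{{word}} column, and this is not the example added by the linked PR.
{code:python}
import pandas as pd
from pyspark.sql.streaming.state import GroupStateTimeout

def count_words(key, pdf_iter, state):
    # key: tuple of grouping values for this group, e.g. ("spark",)
    # pdf_iter: iterator of pandas DataFrames received for this key in the trigger
    # state: GroupState carrying a single running count across triggers
    running = state.get[0] if state.exists else 0
    for pdf in pdf_iter:
        running += len(pdf)
    state.update((running,))
    yield pd.DataFrame({"word": [key[0]], "count": [running]})

counts = words.groupBy("word").applyInPandasWithState(
    count_words,
    outputStructType="word STRING, count LONG",
    stateStructType="count LONG",
    outputMode="Update",
    timeoutConf=GroupStateTimeout.NoTimeout,
)
{code}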



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40509) Construct an example of applyInPandasWithState in examples directory

2022-09-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611825#comment-17611825
 ] 

Apache Spark commented on SPARK-40509:
--

User 'chaoqin-li1123' has created a pull request for this issue:
https://github.com/apache/spark/pull/38066

> Construct an example of applyInPandasWithState in examples directory
> 
>
> Key: SPARK-40509
> URL: https://issues.apache.org/jira/browse/SPARK-40509
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Assignee: Chaoqin Li
>Priority: Major
> Fix For: 3.4.0
>
>
> Since we are introducing a new API (applyInPandasWithState) in PySpark, it is 
> worth having a separate, full example of the API.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40626) Do not reorder join keys in EnsureRequirements if they are not simple expressions

2022-09-30 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-40626:

Description: 

{code:scala}
sql("CREATE TABLE t1 (itemid BIGINT, eventType STRING, dt STRING) USING parquet 
PARTITIONED BY (dt)")
sql("CREATE TABLE t2 (cal_dt DATE, item_id BIGINT) using parquet")

sql("set spark.sql.autoBroadcastJoinThreshold=-1")

sql(
  """
|SELECT itemid,
|   eventtype
|FROM   t1 a
|   INNER JOIN (SELECT DISTINCT cal_dt,
|   item_id
|   FROM   t2) b
|   ON a.itemid = b.item_id
|  AND To_date(a.dt, 'MMdd') = b.cal_dt
  """.stripMargin).explain()
{code}

The plan:

{noformat}
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [itemid#10L, eventtype#11]
   +- SortMergeJoin [cast(gettimestamp(dt#12, MMdd, TimestampType, 
Some(America/Los_Angeles), false) as date), itemid#10L], [cal_dt#13, 
item_id#14L], Inner
  :- Sort [cast(gettimestamp(dt#12, MMdd, TimestampType, 
Some(America/Los_Angeles), false) as date) ASC NULLS FIRST, itemid#10L ASC 
NULLS FIRST], false, 0
  :  +- Exchange hashpartitioning(cast(gettimestamp(dt#12, MMdd, 
TimestampType, Some(America/Los_Angeles), false) as date), itemid#10L, 5), 
ENSURE_REQUIREMENTS, [plan_id=48]
  : +- Filter isnotnull(itemid#10L)
  :+- FileScan parquet 
spark_catalog.default.t1[itemid#10L,eventType#11,dt#12]
  +- Sort [cal_dt#13 ASC NULLS FIRST, item_id#14L ASC NULLS FIRST], false, 0
 +- HashAggregate(keys=[cal_dt#13, item_id#14L], functions=[])
+- Exchange hashpartitioning(cal_dt#13, item_id#14L, 5), 
ENSURE_REQUIREMENTS, [plan_id=44]
   +- HashAggregate(keys=[cal_dt#13, item_id#14L], functions=[])
  +- Filter (isnotnull(item_id#14L) AND isnotnull(cal_dt#13))
 +- FileScan parquet 
spark_catalog.default.t2[cal_dt#13,item_id#14L]
{noformat}

Note that EnsureRequirements reorders the join keys here: the query joins on 
{{a.itemid = b.item_id}} first, but the physical plan partitions and sorts on the 
non-trivial {{cast(gettimestamp(dt, ...) as date)}} expression first.


> Do not reorder join keys in EnsureRequirements if they are not simple 
> expressions
> -
>
> Key: SPARK-40626
> URL: https://issues.apache.org/jira/browse/SPARK-40626
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:scala}
> sql("CREATE TABLE t1 (itemid BIGINT, eventType STRING, dt STRING) USING 
> parquet PARTITIONED BY (dt)")
> sql("CREATE TABLE t2 (cal_dt DATE, item_id BIGINT) using parquet")
> sql("set spark.sql.autoBroadcastJoinThreshold=-1")
> sql(
>   """
> |SELECT itemid,
> |   eventtype
> |FROM   t1 a
> |   INNER JOIN (SELECT DISTINCT cal_dt,
> |   item_id
> |   FROM   t2) b
> |   ON a.itemid = b.item_id
> |  AND To_date(a.dt, 'MMdd') = b.cal_dt
>   """.stripMargin).explain()
> {code}
> The plan:
> {noformat}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- Project [itemid#10L, eventtype#11]
>+- SortMergeJoin [cast(gettimestamp(dt#12, MMdd, TimestampType, 
> Some(America/Los_Angeles), false) as date), itemid#10L], [cal_dt#13, 
> item_id#14L], Inner
>   :- Sort [cast(gettimestamp(dt#12, MMdd, TimestampType, 
> Some(America/Los_Angeles), false) as date) ASC NULLS FIRST, itemid#10L ASC 
> NULLS FIRST], false, 0
>   :  +- Exchange hashpartitioning(cast(gettimestamp(dt#12, MMdd, 
> TimestampType, Some(America/Los_Angeles), false) as date), itemid#10L, 5), 
> ENSURE_REQUIREMENTS, [plan_id=48]
>   : +- Filter isnotnull(itemid#10L)
>   :+- FileScan parquet 
> spark_catalog.default.t1[itemid#10L,eventType#11,dt#12]
>   +- Sort [cal_dt#13 ASC NULLS FIRST, item_id#14L ASC NULLS FIRST], 
> false, 0
>  +- HashAggregate(keys=[cal_dt#13, item_id#14L], functions=[])
> +- Exchange hashpartitioning(cal_dt#13, item_id#14L, 5), 
> ENSURE_REQUIREMENTS, [plan_id=44]
>+- HashAggregate(keys=[cal_dt#13, item_id#14L], functions=[])
>   +- Filter (isnotnull(item_id#14L) AND isnotnull(cal_dt#13))
>  +- FileScan parquet 
> spark_catalog.default.t2[cal_dt#13,item_id#14L]
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org