[jira] [Resolved] (SPARK-44939) Support Java 21 in SparkR SystemRequirements

2023-08-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44939.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42645
[https://github.com/apache/spark/pull/42645]

> Support Java 21 in SparkR SystemRequirements
> 
>
> Key: SPARK-44939
> URL: https://issues.apache.org/jira/browse/SPARK-44939
> Project: Spark
>  Issue Type: Sub-task
>  Components: R
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44939) Support Java 21 in SparkR SystemRequirements

2023-08-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-44939:
-

Assignee: Dongjoon Hyun

> Support Java 21 in SparkR SystemRequirements
> 
>
> Key: SPARK-44939
> URL: https://issues.apache.org/jira/browse/SPARK-44939
> Project: Spark
>  Issue Type: Sub-task
>  Components: R
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44452) Ignore ArrowEncoderSuite for Java 21

2023-08-23 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-44452.
--
Resolution: Won't Fix

Arrow 13.0 has already been released, so this fix is no longer needed.

> Ignore ArrowEncoderSuite for Java 21
> 
>
> Key: SPARK-44452
> URL: https://issues.apache.org/jira/browse/SPARK-44452
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44939) Support Java 21 in SparkR SystemRequirements

2023-08-23 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17758356#comment-17758356
 ] 

Hudson commented on SPARK-44939:


User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/42645

> Support Java 21 in SparkR SystemRequirements
> 
>
> Key: SPARK-44939
> URL: https://issues.apache.org/jira/browse/SPARK-44939
> Project: Spark
>  Issue Type: Sub-task
>  Components: R
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44939) Support Java 21 in SparkR SystemRequirements

2023-08-23 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-44939:
-

 Summary: Support Java 21 in SparkR SystemRequirements
 Key: SPARK-44939
 URL: https://issues.apache.org/jira/browse/SPARK-44939
 Project: Spark
  Issue Type: Sub-task
  Components: R
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44936) Simplify the log when Spark HybridStore hits the memory limit

2023-08-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44936.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42638
[https://github.com/apache/spark/pull/42638]

> Simplify the log when Spark HybridStore hits the memory limit
> -
>
> Key: SPARK-44936
> URL: https://issues.apache.org/jira/browse/SPARK-44936
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 4.0.0
>
>
> *BEFORE*
> {code}
> 23/08/23 22:40:34 INFO FsHistoryProvider: Failed to create HybridStore for 
> spark-1692805262618-xbiqs4fjqysv62d6708nx424qb0d4-driver-job/None. Using 
> ROCKSDB.
> java.lang.RuntimeException: Not enough memory to create hybrid store for app 
> spark-1692805262618-xbiqs4fjqysv62d6708nx424qb0d4-driver-job / None.
>   at 
> org.apache.spark.deploy.history.HistoryServerMemoryManager.lease(HistoryServerMemoryManager.scala:54)
>   at 
> org.apache.spark.deploy.history.FsHistoryProvider.createHybridStore(FsHistoryProvider.scala:1256)
>   at 
> org.apache.spark.deploy.history.FsHistoryProvider.loadDiskStore(FsHistoryProvider.scala:1231)
>   at 
> org.apache.spark.deploy.history.FsHistoryProvider.getAppUI(FsHistoryProvider.scala:342)
>   at 
> org.apache.spark.deploy.history.HistoryServer.getAppUI(HistoryServer.scala:199)
>   at 
> org.apache.spark.deploy.history.ApplicationCache.$anonfun$loadApplicationEntry$2(ApplicationCache.scala:163)
>   at 
> org.apache.spark.deploy.history.ApplicationCache.time(ApplicationCache.scala:134)
>   at 
> org.apache.spark.deploy.history.ApplicationCache.org$apache$spark$deploy$history$ApplicationCache$$loadApplicationEntry(ApplicationCache.scala:161)
>   at 
> org.apache.spark.deploy.history.ApplicationCache$$anon$1.load(ApplicationCache.scala:55)
>   at 
> org.apache.spark.deploy.history.ApplicationCache$$anon$1.load(ApplicationCache.scala:51)
>   at 
> org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>   at 
> org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>   at 
> org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>   at 
> org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
>   at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000)
>   at 
> org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
>   at 
> org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>   at 
> org.apache.spark.deploy.history.ApplicationCache.get(ApplicationCache.scala:88)
>   at 
> org.apache.spark.deploy.history.ApplicationCache.withSparkUI(ApplicationCache.scala:100)
>   at 
> org.apache.spark.deploy.history.HistoryServer.org$apache$spark$deploy$history$HistoryServer$$loadAppUi(HistoryServer.scala:256)
>   at 
> org.apache.spark.deploy.history.HistoryServer$$anon$1.doGet(HistoryServer.scala:104)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:503)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:590)
>   at 
> org.sparkproject.jetty.servlet.ServletHolder.handle(ServletHolder.java:799)
>   at 
> org.sparkproject.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1656)
>   at 
> org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95)
>   at 
> org.sparkproject.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
>   at 
> org.sparkproject.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
>   at 
> org.sparkproject.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:552)
>   at 
> org.sparkproject.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
>   at 
> org.sparkproject.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
>   at 
> org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
>   at 
> org.sparkproject.jetty.servlet.ServletHandler.doScope(ServletHandler.java:505)
>   at 
> org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
>   at 
> org.sparkproject.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)
>   at 
> org.sparkproject.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.sparkproject.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:772)
>   at 
> 

[jira] [Assigned] (SPARK-44936) Simplify the log when Spark HybridStore hits the memory limit

2023-08-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-44936:
-

Assignee: Dongjoon Hyun

> Simplify the log when Spark HybridStore hits the memory limit
> -
>
> Key: SPARK-44936
> URL: https://issues.apache.org/jira/browse/SPARK-44936
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>
> *BEFORE*
> {code}
> 23/08/23 22:40:34 INFO FsHistoryProvider: Failed to create HybridStore for 
> spark-1692805262618-xbiqs4fjqysv62d6708nx424qb0d4-driver-job/None. Using 
> ROCKSDB.
> java.lang.RuntimeException: Not enough memory to create hybrid store for app 
> spark-1692805262618-xbiqs4fjqysv62d6708nx424qb0d4-driver-job / None.
>   at 
> org.apache.spark.deploy.history.HistoryServerMemoryManager.lease(HistoryServerMemoryManager.scala:54)
>   at 
> org.apache.spark.deploy.history.FsHistoryProvider.createHybridStore(FsHistoryProvider.scala:1256)
>   at 
> org.apache.spark.deploy.history.FsHistoryProvider.loadDiskStore(FsHistoryProvider.scala:1231)
>   at 
> org.apache.spark.deploy.history.FsHistoryProvider.getAppUI(FsHistoryProvider.scala:342)
>   at 
> org.apache.spark.deploy.history.HistoryServer.getAppUI(HistoryServer.scala:199)
>   at 
> org.apache.spark.deploy.history.ApplicationCache.$anonfun$loadApplicationEntry$2(ApplicationCache.scala:163)
>   at 
> org.apache.spark.deploy.history.ApplicationCache.time(ApplicationCache.scala:134)
>   at 
> org.apache.spark.deploy.history.ApplicationCache.org$apache$spark$deploy$history$ApplicationCache$$loadApplicationEntry(ApplicationCache.scala:161)
>   at 
> org.apache.spark.deploy.history.ApplicationCache$$anon$1.load(ApplicationCache.scala:55)
>   at 
> org.apache.spark.deploy.history.ApplicationCache$$anon$1.load(ApplicationCache.scala:51)
>   at 
> org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>   at 
> org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>   at 
> org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>   at 
> org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
>   at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000)
>   at 
> org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
>   at 
> org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>   at 
> org.apache.spark.deploy.history.ApplicationCache.get(ApplicationCache.scala:88)
>   at 
> org.apache.spark.deploy.history.ApplicationCache.withSparkUI(ApplicationCache.scala:100)
>   at 
> org.apache.spark.deploy.history.HistoryServer.org$apache$spark$deploy$history$HistoryServer$$loadAppUi(HistoryServer.scala:256)
>   at 
> org.apache.spark.deploy.history.HistoryServer$$anon$1.doGet(HistoryServer.scala:104)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:503)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:590)
>   at 
> org.sparkproject.jetty.servlet.ServletHolder.handle(ServletHolder.java:799)
>   at 
> org.sparkproject.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1656)
>   at 
> org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95)
>   at 
> org.sparkproject.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
>   at 
> org.sparkproject.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
>   at 
> org.sparkproject.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:552)
>   at 
> org.sparkproject.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
>   at 
> org.sparkproject.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
>   at 
> org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
>   at 
> org.sparkproject.jetty.servlet.ServletHandler.doScope(ServletHandler.java:505)
>   at 
> org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
>   at 
> org.sparkproject.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)
>   at 
> org.sparkproject.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.sparkproject.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:772)
>   at 
> org.sparkproject.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:234)
>   at 
> org.sparkproject.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
>   at 

[jira] [Commented] (SPARK-44900) Cached DataFrame keeps growing

2023-08-23 Thread Varun Nalla (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17758316#comment-17758316
 ] 

Varun Nalla commented on SPARK-44900:
-

[~yao] Thanks for the comment. However, we can't release it: the cached RDDs are 
used in every micro-batch, which is why we cannot unpersist them.

> Cached DataFrame keeps growing
> --
>
> Key: SPARK-44900
> URL: https://issues.apache.org/jira/browse/SPARK-44900
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Varun Nalla
>Priority: Blocker
>
> Scenario:
> We have a Kafka streaming application where data lookups happen by joining 
> another DataFrame which is cached, using the MEMORY_AND_DISK caching strategy.
> However, the size of the cached DataFrame keeps growing with every micro-batch 
> the streaming application processes, and that is visible under the Storage tab.
> A similar Stack Overflow thread was already raised:
> https://stackoverflow.com/questions/55601779/spark-dataframe-cache-keeps-growing



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44900) Cached DataFrame keeps growing

2023-08-23 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17758315#comment-17758315
 ] 

Kent Yao commented on SPARK-44900:
--

How about releasing the cached RDDs if you never touch them again?
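
For illustration, a minimal sketch of that suggestion, assuming an active SparkSession named 
{{spark}} and a hypothetical lookup path; this is not code from the ticket:
{code}
from pyspark import StorageLevel

# Hypothetical lookup DataFrame cached for the streaming join described in this issue.
lookup_df = spark.read.parquet("/data/lookup").persist(StorageLevel.MEMORY_AND_DISK)

# ... join lookup_df inside the streaming query ...

# Once no future micro-batch will read it again, release the cached blocks so
# they stop accumulating under the Storage tab.
lookup_df.unpersist()
{code}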

> Cached DataFrame keeps growing
> --
>
> Key: SPARK-44900
> URL: https://issues.apache.org/jira/browse/SPARK-44900
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Varun Nalla
>Priority: Blocker
>
> Scenario:
> We have a Kafka streaming application where data lookups happen by joining 
> another DataFrame which is cached, using the MEMORY_AND_DISK caching strategy.
> However, the size of the cached DataFrame keeps growing with every micro-batch 
> the streaming application processes, and that is visible under the Storage tab.
> A similar Stack Overflow thread was already raised:
> https://stackoverflow.com/questions/55601779/spark-dataframe-cache-keeps-growing



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44820) Switch languages consistently across docs for all code snippets

2023-08-23 Thread BingKun Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17758314#comment-17758314
 ] 

BingKun Pan commented on SPARK-44820:
-

Let me try to investigate it.

> Switch languages consistently across docs for all code snippets
> ---
>
> Key: SPARK-44820
> URL: https://issues.apache.org/jira/browse/SPARK-44820
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Allison Wang
>Priority: Major
>
> When a user chooses a different language for a code snippet, all code 
> snippets on that page should switch to the chosen language. This was the 
> behavior in, for example, the Spark 2.0 docs: 
> [https://spark.apache.org/docs/2.0.0/structured-streaming-programming-guide.html]
> But it is broken in later docs, for example the Spark 3.4.1 docs: 
> [https://spark.apache.org/docs/latest/quick-start.html]
> We should fix this behavior change and possibly add test cases to prevent 
> future regressions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44938) Change default value of spark.sql.maxSinglePartitionBytes to 128m

2023-08-23 Thread Cheng Pan (Jira)
Cheng Pan created SPARK-44938:
-

 Summary: Change default value of spark.sql.maxSinglePartitionBytes 
to 128m
 Key: SPARK-44938
 URL: https://issues.apache.org/jira/browse/SPARK-44938
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Cheng Pan
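
For reference, a minimal sketch of pinning this value explicitly (config name taken from the 
title), so behavior does not shift when the default changes; this is an illustration, not part 
of the ticket:
{code}
from pyspark.sql import SparkSession

# Set spark.sql.maxSinglePartitionBytes explicitly instead of relying on the default.
spark = (SparkSession.builder
         .config("spark.sql.maxSinglePartitionBytes", "128m")
         .getOrCreate())
{code}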






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44903) Refine docstring of `approx_count_distinct`

2023-08-23 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-44903.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42596
[https://github.com/apache/spark/pull/42596]
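
For context, a minimal usage sketch of the function whose docstring is being refined (assumes 
an active SparkSession named {{spark}}; the ticket changes documentation only):
{code}
from pyspark.sql import functions as F

df = spark.createDataFrame([(1,), (2,), (2,), (3,)], ["value"])
df.select(F.approx_count_distinct("value").alias("distinct_est")).show()
{code}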

> Refine docstring of `approx_count_distinct`
> ---
>
> Key: SPARK-44903
> URL: https://issues.apache.org/jira/browse/SPARK-44903
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44903) Refine docstring of `approx_count_distinct`

2023-08-23 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-44903:
-

Assignee: Yang Jie

> Refine docstring of `approx_count_distinct`
> ---
>
> Key: SPARK-44903
> URL: https://issues.apache.org/jira/browse/SPARK-44903
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42017) df["bad_key"] does not raise AnalysisException

2023-08-23 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-42017.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42608
[https://github.com/apache/spark/pull/42608]

> df["bad_key"] does not raise AnalysisException
> --
>
> Key: SPARK-42017
> URL: https://issues.apache.org/jira/browse/SPARK-42017
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Tests
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
> Fix For: 4.0.0
>
>
> e.g.)
> {code}
> 23/01/12 14:33:43 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> FAILED [  8%]
> pyspark/sql/tests/test_column.py:105 (ColumnParityTests.test_access_column)
> self =  testMethod=test_access_column>
> def test_access_column(self):
> df = self.df
> self.assertTrue(isinstance(df.key, Column))
> self.assertTrue(isinstance(df["key"], Column))
> self.assertTrue(isinstance(df[0], Column))
> self.assertRaises(IndexError, lambda: df[2])
> >   self.assertRaises(AnalysisException, lambda: df["bad_key"])
> E   AssertionError: AnalysisException not raised by 
> ../test_column.py:112: AssertionError
> {code}
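
A small sketch of the behavior the failing parity test above expects (classic PySpark shown; 
the ticket is about making Spark Connect match it):
{code}
from pyspark.errors import AnalysisException  # PySpark >= 3.4
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1).toDF("key")

try:
    df["bad_key"]  # should fail analysis instead of silently returning a Column
except AnalysisException as e:
    print("raised as expected:", e)
{code}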



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44750) SparkSession.Builder should respect the options

2023-08-23 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-44750.
---
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42548
[https://github.com/apache/spark/pull/42548]

> SparkSession.Builder should respect the options
> ---
>
> Key: SPARK-44750
> URL: https://issues.apache.org/jira/browse/SPARK-44750
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Michael Zhang
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> In the Connect session builder, we use the {{config}} method to set options.
> However, the options are actually ignored when we create a new session.
> {code}
> def create(self) -> "SparkSession":
>     has_channel_builder = self._channel_builder is not None
>     has_spark_remote = "spark.remote" in self._options
>     if has_channel_builder and has_spark_remote:
>         raise ValueError(
>             "Only one of connection string or channelBuilder "
>             "can be used to create a new SparkSession."
>         )
>     if not has_channel_builder and not has_spark_remote:
>         raise ValueError(
>             "Needs either connection string or channelBuilder to "
>             "create a new SparkSession."
>         )
>     if has_channel_builder:
>         assert self._channel_builder is not None
>         session = SparkSession(connection=self._channel_builder)
>     else:
>         spark_remote = to_str(self._options.get("spark.remote"))
>         assert spark_remote is not None
>         session = SparkSession(connection=spark_remote)
>     SparkSession._set_default_and_active_session(session)
>     return session
> {code}
> we should respect the options by invoking {{session.conf.set}} after creation.
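
A minimal sketch of the proposed direction; the helper name is hypothetical, and only the use 
of {{session.conf.set}} comes from the description:
{code}
def _apply_options(session, options):
    # Replay the options collected by config() after the session is created,
    # so they are no longer silently dropped.
    for key, value in options.items():
        if key != "spark.remote":  # the connection string itself is not a runtime conf
            session.conf.set(key, value)
{code}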



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6373) Add SSL/TLS for the Netty based BlockTransferService

2023-08-23 Thread Hasnain Lakhani (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-6373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17758289#comment-17758289
 ] 

Hasnain Lakhani commented on SPARK-6373:


FYI: I filed https://issues.apache.org/jira/browse/SPARK-44937 as a follow up 
to this ticket. I revived [~turp1twin]'s patch and am cleaning it up before 
submitting a PR.

> Add SSL/TLS for the Netty based BlockTransferService 
> -
>
> Key: SPARK-6373
> URL: https://issues.apache.org/jira/browse/SPARK-6373
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Shuffle, Spark Core
>Affects Versions: 1.2.1
>Reporter: Jeffrey Turpin
>Priority: Major
>  Labels: bulk-closed
>
> Add the ability to allow for secure communications (SSL/TLS) for the Netty 
> based BlockTransferService and the ExternalShuffleClient. This ticket will 
> hopefully start the conversation around potential designs... Below is a 
> reference to a WIP prototype which implements this functionality 
> (prototype)... I have attempted to disrupt as little code as possible and 
> tried to follow the current code structure (for the most part) in the areas I 
> modified. I also studied how Hadoop achieves encrypted shuffle 
> (http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html)
> https://github.com/turp1twin/spark/commit/024b559f27945eb63068d1badf7f82e4e7c3621c



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44937) Add SSL/TLS support for RPC and Shuffle communications

2023-08-23 Thread Hasnain Lakhani (Jira)
Hasnain Lakhani created SPARK-44937:
---

 Summary: Add SSL/TLS support for RPC and Shuffle communications
 Key: SPARK-44937
 URL: https://issues.apache.org/jira/browse/SPARK-44937
 Project: Spark
  Issue Type: New Feature
  Components: Block Manager, Security, Shuffle, Spark Core
Affects Versions: 4.0.0
Reporter: Hasnain Lakhani


Add support for SSL/TLS based communication for Spark RPCs and block transfers 
- providing an alternative to the existing encryption / authentication 
implementation documented at 
[https://spark.apache.org/docs/latest/security.html#spark-rpc-communication-protocol-between-spark-processes]



This is a superset of the functionality discussed in 
https://issues.apache.org/jira/browse/SPARK-6373
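
For comparison, the existing (non-TLS) mechanism referenced above is enabled with configuration 
along these lines (a sketch; see the linked security documentation for the authoritative option 
list):
{code}
from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.authenticate", "true")             # shared-secret authentication
        .set("spark.network.crypto.enabled", "true"))  # AES-based RPC encryption
{code}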



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44935) Fix `RELEASE` file to have the correct information in Docker images

2023-08-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-44935:
-

Assignee: Dongjoon Hyun

> Fix `RELEASE` file to have the correct information in Docker images
> ---
>
> Key: SPARK-44935
> URL: https://issues.apache.org/jira/browse/SPARK-44935
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.2.4, 3.3.2, 3.4.1, 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> {code}
> $ docker run -it --rm apache/spark:latest ls -al /opt/spark/RELEASE
> -rw-r--r-- 1 spark spark 0 Jun 25 03:13 /opt/spark/RELEASE
> $ docker run -it --rm apache/spark:v3.1.3 ls -al /opt/spark/RELEASE | tail -n1
> -rw-r--r-- 1 root root 0 Feb 21  2022 /opt/spark/RELEASE
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44935) Fix `RELEASE` file to have the correct information in Docker images

2023-08-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44935.
---
Fix Version/s: 3.3.4
   3.5.0
   4.0.0
   3.4.2
   Resolution: Fixed

Issue resolved by pull request 42636
[https://github.com/apache/spark/pull/42636]

> Fix `RELEASE` file to have the correct information in Docker images
> ---
>
> Key: SPARK-44935
> URL: https://issues.apache.org/jira/browse/SPARK-44935
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.2.4, 3.3.2, 3.4.1, 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.4, 3.5.0, 4.0.0, 3.4.2
>
>
> {code}
> $ docker run -it --rm apache/spark:latest ls -al /opt/spark/RELEASE
> -rw-r--r-- 1 spark spark 0 Jun 25 03:13 /opt/spark/RELEASE
> $ docker run -it --rm apache/spark:v3.1.3 ls -al /opt/spark/RELEASE | tail -n1
> -rw-r--r-- 1 root root 0 Feb 21  2022 /opt/spark/RELEASE
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44936) Simplify the log when Spark HybridStore hits the memory limit

2023-08-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-44936:
--
Description: 
*BEFORE*
{code}
23/08/23 22:40:34 INFO FsHistoryProvider: Failed to create HybridStore for 
spark-1692805262618-xbiqs4fjqysv62d6708nx424qb0d4-driver-job/None. Using 
ROCKSDB.
java.lang.RuntimeException: Not enough memory to create hybrid store for app 
spark-1692805262618-xbiqs4fjqysv62d6708nx424qb0d4-driver-job / None.
at 
org.apache.spark.deploy.history.HistoryServerMemoryManager.lease(HistoryServerMemoryManager.scala:54)
at 
org.apache.spark.deploy.history.FsHistoryProvider.createHybridStore(FsHistoryProvider.scala:1256)
at 
org.apache.spark.deploy.history.FsHistoryProvider.loadDiskStore(FsHistoryProvider.scala:1231)
at 
org.apache.spark.deploy.history.FsHistoryProvider.getAppUI(FsHistoryProvider.scala:342)
at 
org.apache.spark.deploy.history.HistoryServer.getAppUI(HistoryServer.scala:199)
at 
org.apache.spark.deploy.history.ApplicationCache.$anonfun$loadApplicationEntry$2(ApplicationCache.scala:163)
at 
org.apache.spark.deploy.history.ApplicationCache.time(ApplicationCache.scala:134)
at 
org.apache.spark.deploy.history.ApplicationCache.org$apache$spark$deploy$history$ApplicationCache$$loadApplicationEntry(ApplicationCache.scala:161)
at 
org.apache.spark.deploy.history.ApplicationCache$$anon$1.load(ApplicationCache.scala:55)
at 
org.apache.spark.deploy.history.ApplicationCache$$anon$1.load(ApplicationCache.scala:51)
at 
org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
at 
org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
at 
org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
at 
org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000)
at 
org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
at 
org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
at 
org.apache.spark.deploy.history.ApplicationCache.get(ApplicationCache.scala:88)
at 
org.apache.spark.deploy.history.ApplicationCache.withSparkUI(ApplicationCache.scala:100)
at 
org.apache.spark.deploy.history.HistoryServer.org$apache$spark$deploy$history$HistoryServer$$loadAppUi(HistoryServer.scala:256)
at 
org.apache.spark.deploy.history.HistoryServer$$anon$1.doGet(HistoryServer.scala:104)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:503)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:590)
at 
org.sparkproject.jetty.servlet.ServletHolder.handle(ServletHolder.java:799)
at 
org.sparkproject.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1656)
at 
org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95)
at 
org.sparkproject.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at 
org.sparkproject.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
at 
org.sparkproject.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:552)
at 
org.sparkproject.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
at 
org.sparkproject.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
at 
org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
at 
org.sparkproject.jetty.servlet.ServletHandler.doScope(ServletHandler.java:505)
at 
org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
at 
org.sparkproject.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)
at 
org.sparkproject.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at 
org.sparkproject.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:772)
at 
org.sparkproject.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:234)
at 
org.sparkproject.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.sparkproject.jetty.server.Server.handle(Server.java:516)
at 
org.sparkproject.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487)
at 
org.sparkproject.jetty.server.HttpChannel.dispatch(HttpChannel.java:732)
at 
org.sparkproject.jetty.server.HttpChannel.handle(HttpChannel.java:479)
at 
org.sparkproject.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
at 
org.sparkproject.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
at 

[jira] [Created] (SPARK-44936) Simplify the log when Spark HybridStore hits the memory limit

2023-08-23 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-44936:
-

 Summary: Simplify the log when Spark HybridStore hits the memory 
limit
 Key: SPARK-44936
 URL: https://issues.apache.org/jira/browse/SPARK-44936
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun


*BEFORE*
{code}
23/08/23 22:40:34 INFO FsHistoryProvider: Failed to create HybridStore for 
spark-1692805262618-xbiqs4fjqysv62d6708nx424qb0d4-driver-job/None. Using 
ROCKSDB.
java.lang.RuntimeException: Not enough memory to create hybrid store for app 
spark-1692805262618-xbiqs4fjqysv62d6708nx424qb0d4-driver-job / None.
at 
org.apache.spark.deploy.history.HistoryServerMemoryManager.lease(HistoryServerMemoryManager.scala:54)
at 
org.apache.spark.deploy.history.FsHistoryProvider.createHybridStore(FsHistoryProvider.scala:1256)
at 
org.apache.spark.deploy.history.FsHistoryProvider.loadDiskStore(FsHistoryProvider.scala:1231)
at 
org.apache.spark.deploy.history.FsHistoryProvider.getAppUI(FsHistoryProvider.scala:342)
at 
org.apache.spark.deploy.history.HistoryServer.getAppUI(HistoryServer.scala:199)
at 
org.apache.spark.deploy.history.ApplicationCache.$anonfun$loadApplicationEntry$2(ApplicationCache.scala:163)
at 
org.apache.spark.deploy.history.ApplicationCache.time(ApplicationCache.scala:134)
at 
org.apache.spark.deploy.history.ApplicationCache.org$apache$spark$deploy$history$ApplicationCache$$loadApplicationEntry(ApplicationCache.scala:161)
at 
org.apache.spark.deploy.history.ApplicationCache$$anon$1.load(ApplicationCache.scala:55)
at 
org.apache.spark.deploy.history.ApplicationCache$$anon$1.load(ApplicationCache.scala:51)
at 
org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
at 
org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
at 
org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
at 
org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000)
at 
org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
at 
org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
at 
org.apache.spark.deploy.history.ApplicationCache.get(ApplicationCache.scala:88)
at 
org.apache.spark.deploy.history.ApplicationCache.withSparkUI(ApplicationCache.scala:100)
at 
org.apache.spark.deploy.history.HistoryServer.org$apache$spark$deploy$history$HistoryServer$$loadAppUi(HistoryServer.scala:256)
at 
org.apache.spark.deploy.history.HistoryServer$$anon$1.doGet(HistoryServer.scala:104)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:503)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:590)
at 
org.sparkproject.jetty.servlet.ServletHolder.handle(ServletHolder.java:799)
at 
org.sparkproject.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1656)
at 
org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95)
at 
org.sparkproject.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at 
org.sparkproject.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
at 
org.sparkproject.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:552)
at 
org.sparkproject.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
at 
org.sparkproject.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
at 
org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
at 
org.sparkproject.jetty.servlet.ServletHandler.doScope(ServletHandler.java:505)
at 
org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
at 
org.sparkproject.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)
at 
org.sparkproject.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at 
org.sparkproject.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:772)
at 
org.sparkproject.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:234)
at 
org.sparkproject.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.sparkproject.jetty.server.Server.handle(Server.java:516)
at 
org.sparkproject.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487)
at 
org.sparkproject.jetty.server.HttpChannel.dispatch(HttpChannel.java:732)
at 
org.sparkproject.jetty.server.HttpChannel.handle(HttpChannel.java:479)
at 

[jira] [Created] (SPARK-44935) Fix `RELEASE` file to have the correct information in Docker images

2023-08-23 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-44935:
-

 Summary: Fix `RELEASE` file to have the correct information in 
Docker images
 Key: SPARK-44935
 URL: https://issues.apache.org/jira/browse/SPARK-44935
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.4.1, 3.3.2, 3.2.4, 3.1.3, 3.0.3, 2.4.8, 3.5.0
Reporter: Dongjoon Hyun


{code}
$ docker run -it --rm apache/spark:latest ls -al /opt/spark/RELEASE
-rw-r--r-- 1 spark spark 0 Jun 25 03:13 /opt/spark/RELEASE

$ docker run -it --rm apache/spark:v3.1.3 ls -al /opt/spark/RELEASE | tail -n1
-rw-r--r-- 1 root root 0 Feb 21  2022 /opt/spark/RELEASE
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44934) PushdownPredicatesAndPruneColumnsForCTEDef creates invalid plan when called over CTE with duplicate attributes

2023-08-23 Thread Wen Yuen Pang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wen Yuen Pang updated SPARK-44934:
--
Description: 
When running the query
{code:java}
with cte as (
 select c1, c1, c2, c3 from t where random() > 0
)
select cte.c1, cte2.c1, cte.c2, cte2.c3 from
 (select c1, c2 from cte) cte
 inner join
 (select c1, c3 from cte) cte2
 on cte.c1 = cte2.c1 {code}
 
The query fails with the error
{code:java}
org.apache.spark.scheduler.DAGScheduler: Failed to update accumulator 9523 
(Unknown class) for task 1
org.apache.spark.SparkException: attempted to access non-existent accumulator 
9523{code}
Further investigation shows that the rule 
PushdownPredicatesAndPruneColumnsForCTEDef creates an invalid plan when the 
output of a CTE contains duplicate expression IDs.

  was:
When running the query

```
with cte as (
 select c1, c1, c2, c3 from t where random() > 0
)
select cte.c1, cte2.c1, cte.c2, cte2.c3 from
 (select c1, c2 from cte) cte
 inner join
 (select c1, c3 from cte) cte2
 on cte.c1 = cte2.c1
```
 
The query fails with the error
```
org.apache.spark.scheduler.DAGScheduler: Failed to update accumulator 9523 
(Unknown class) for task 1

org.apache.spark.SparkException: attempted to access non-existent accumulator 
9523

```

Further investigation shows that the rule 
`PushdownPredicatesAndPruneColumnsForCTEDef` creates an invalid plan when the 
output of a CTE contains duplicate expression IDs.


> PushdownPredicatesAndPruneColumnsForCTEDef creates invalid plan when called 
> over CTE with duplicate attributes
> --
>
> Key: SPARK-44934
> URL: https://issues.apache.org/jira/browse/SPARK-44934
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.4.1
>Reporter: Wen Yuen Pang
>Priority: Minor
>
> When running the query
> {code:java}
> with cte as (
>  select c1, c1, c2, c3 from t where random() > 0
> )
> select cte.c1, cte2.c1, cte.c2, cte2.c3 from
>  (select c1, c2 from cte) cte
>  inner join
>  (select c1, c3 from cte) cte2
>  on cte.c1 = cte2.c1 {code}
>  
> The query fails with the error
> {code:java}
> org.apache.spark.scheduler.DAGScheduler: Failed to update accumulator 9523 
> (Unknown class) for task 1
> org.apache.spark.SparkException: attempted to access non-existent accumulator 
> 9523{code}
> Further investigation shows that the rule 
> PushdownPredicatesAndPruneColumnsForCTEDef creates an invalid plan when the 
> output of a CTE contains duplicate expression IDs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44934) PushdownPredicatesAndPruneColumnsForCTEDef creates invalid plan when called over CTE with duplicate attributes

2023-08-23 Thread Wen Yuen Pang (Jira)
Wen Yuen Pang created SPARK-44934:
-

 Summary: PushdownPredicatesAndPruneColumnsForCTEDef creates 
invalid plan when called over CTE with duplicate attributes
 Key: SPARK-44934
 URL: https://issues.apache.org/jira/browse/SPARK-44934
 Project: Spark
  Issue Type: Bug
  Components: Optimizer
Affects Versions: 3.4.1
Reporter: Wen Yuen Pang


When running the query

```
with cte as (
 select c1, c1, c2, c3 from t where random() > 0
)
select cte.c1, cte2.c1, cte.c2, cte2.c3 from
 (select c1, c2 from cte) cte
 inner join
 (select c1, c3 from cte) cte2
 on cte.c1 = cte2.c1
```
 
The query fails with the error
```
org.apache.spark.scheduler.DAGScheduler: Failed to update accumulator 9523 
(Unknown class) for task 1

org.apache.spark.SparkException: attempted to access non-existent accumulator 
9523

```

Further investigation shows that the rule 
`PushdownPredicatesAndPruneColumnsForCTEDef` creates an invalid plan when the 
output of a CTE contains duplicate expression IDs.
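
A hypothetical setup to reproduce the failure above, assuming an active SparkSession named 
{{spark}} (table and column names come from the query; the storage format and values are 
assumptions):
{code}
spark.sql("CREATE TABLE t (c1 INT, c2 INT, c3 INT) USING parquet")
spark.sql("INSERT INTO t VALUES (1, 10, 100), (2, 20, 200)")
{code}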



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44933) Spark structured streaming performance regression in latency times reading/writing to kafka since 3.0.2

2023-08-23 Thread eddie baggott (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

eddie baggott updated SPARK-44933:
--
Description: 
 During a migration from spark 2.4.4 to spark 3.4.0 I have noticed slower 
latency times in spark structured streaming when reading and writing to kafka. 
I have tested using both CONTINUOUS and MICROBATCH.

In simple read and write to kafka using CONTINUOUS mode in spark 2.4.4 I 
usually see latency times of ~5ms in our application. When moving to spark 
3.4.0 this increased to ~15ms.

I stripped it back to a very simple test where I send 2 data fields in csv 
format to a kafka topic using a simple producer. Then I have a simple consumer 
which reads from the input topic and writes to an output topic. The 2 fields 
are an ID and an amount value. I read from both topics and retrieve the kafka 
timestamp value for all rows. I then subtract the input timestamp from the 
output timestamp to get the latency. To keep things as simple as possible I am 
using 1 kafka partition and I am using local[1] as the spark master.

Version    latency (ms)    Trigger
2.4.4    3.25    CONTINUOUS
3.4.0    7.23    CONTINUOUS
2.4.4    640    MICROBATCH
3.4.0    693    MICROBATCH
I have tried all versions of spark 3.x and I believe this issue was introduced 
in 3.0.2. I also tried different versions of spark 2.4.x and I see the same 
behaviour when going from 2.4.7 to 2.4.8.

In the simple test I only use a few jars. One of these is 
spark-sql-kafka-0-10_2.12. When running on spark 3.0.2 using the 3.0.2 version 
of this jar I see the slower times. When I run again on spark 3.0.2 and use the 
3.0.1 version of this jar I see the faster times.

The same thing happens between 2.4.7 version and the 2.4.8 version. The 2.4.8 
version has the slower times.

Has anyone else observed a slowdown in latency in structured streaming when 
reading from kafka ?

Are there any settings I need to change when moving to these versions ?

  was:
During a migration from spark 2.4.4 to spark 3.4.0 I have noticed slower 
latency times in spark structured streaming when reading and writing to kafka. 
I have tested using both CONTINUOUS and MICROBATCH.

In simple read and write to kafka using CONTINUOUS mode in spark 2.4.4 I 
usually see latency times of ~5ms in our application. When moving to spark 
3.4.0 this increased to ~15ms.

I stripped it back to a very simple test where I send 2 data fields in csv 
format to a kafka topic using a simple producer. Then I have a simple consumer 
which reads from the input topic and writes to an output topic. The 2 fields 
are an ID and an amount value. I read from both topics and retrieve the kafka 
timestamp value for all rows. I then subtract the input timestamp from the 
output timestamp to get the latency. To keep things as simple as possible I am 
using 1 kafka partition and I am using local[1] as the spark master.

Version    latency (ms)    Trigger
2.4.4    3.25    CONTINUOUS
3.4.0    7.23    CONTINUOUS
2.4.4    640    MICROBATCH
3.4.0    693    MICROBATCH
I have tried all versions of spark 3.x and I believe this issue was introduced 
in 3.0.2. I also tried different versions of spark 2.4.x and I see the same 
behaviour when going from 2.4.7 to 2.4.8.

In the simple test I only use a few jars. One of these is 
spark-sql-kafka-0-10_2.12. When running on spark 3.0.2 using the 3.0.2 version 
of this jar I see the slower times. When I run again on spark 3.0.2 and use the 
3.0.1 version of this jar I see the faster times.

The same thing happens between 2.4.7 version and the 2.4.8 version. The 2.4.8 
version has the slower times.

Has anyone else observed a slowdown in latency in structured streaming when 
reading from kafka ?

Are there any settings I need to change when moving to these versions ?


> Spark structured streaming performance regression in latency times 
> reading/writing to kafka since 3.0.2
> ---
>
> Key: SPARK-44933
> URL: https://issues.apache.org/jira/browse/SPARK-44933
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.8, 3.0.2, 3.1.0, 3.2.0, 3.3.0, 3.4.0
>Reporter: eddie baggott
>Priority: Major
>
>  During a migration from spark 2.4.4 to spark 3.4.0 I have noticed slower 
> latency times in spark structured streaming when reading and writing to 
> kafka. I have tested using both CONTINUOUS and MICROBATCH.
> In simple read and write to kafka using CONTINUOUS mode in spark 2.4.4 I 
> usually see latency times of ~5ms in our application. When moving to spark 
> 3.4.0 this increased to ~15ms.
> I stripped it back to a very simple test where I send 2 data fields in csv 
> format to a kafka topic using a simple producer. Then I have a simple 
> 

[jira] [Created] (SPARK-44933) Spark structured streaming performance regression in latency times reading/writing to kafka since 3.0.2

2023-08-23 Thread eddie baggott (Jira)
eddie baggott created SPARK-44933:
-

 Summary: Spark structured streaming performance regression in 
latency times reading/writing to kafka since 3.0.2
 Key: SPARK-44933
 URL: https://issues.apache.org/jira/browse/SPARK-44933
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.4.0, 3.3.0, 3.2.0, 3.1.0, 3.0.2, 2.4.8
Reporter: eddie baggott


During a migration from spark 2.4.4 to spark 3.4.0 I have noticed slower 
latency times in spark structured streaming when reading and writing to kafka. 
I have tested using both CONTINUOUS and MICROBATCH.

In simple read and write to kafka using CONTINUOUS mode in spark 2.4.4 I 
usually see latency times of ~5ms in our application. When moving to spark 
3.4.0 this increased to ~15ms.

I stripped it back to a very simple test where I send 2 data fields in csv 
format to a kafka topic using a simple producer. Then I have a simple consumer 
which reads from the input topic and writes to an output topic. The 2 fields 
are an ID and an amount value. I read from both topics and retrieve the kafka 
timestamp value for all rows. I then subtract the input timestamp from the 
output timestamp to get the latency. To keep things as simple as possible I am 
using 1 kafka partition and I am using local[1] as the spark master.

Version    latency (ms)    Trigger
2.4.4    3.25    CONTINUOUS
3.4.0    7.23    CONTINUOUS
2.4.4    640    MICROBATCH
3.4.0    693    MICROBATCH
I have tried all versions of spark 3.x and I believe this issue was introduced 
in 3.0.2. I also tried different versions of spark 2.4.x and I see the same 
behaviour when going from 2.4.7 to 2.4.8.

In the simple test I only use a few jars. One of these is 
spark-sql-kafka-0-10_2.12. When running on spark 3.0.2 using the 3.0.2 version 
of this jar I see the slower times. When I run again on spark 3.0.2 and use the 
3.0.1 version of this jar I see the faster times.

The same thing happens between 2.4.7 version and the 2.4.8 version. The 2.4.8 
version has the slower times.

Has anyone else observed a slowdown in latency in structured streaming when 
reading from kafka ?

Are there any settings I need to change when moving to these versions ?
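
A sketch of the kind of pass-through job described above (topic names, bootstrap servers, and 
checkpoint path are placeholders, not values from the report, and the spark-sql-kafka package 
must be on the classpath):
{code}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[1]")
         .appName("kafka-latency-test")
         .getOrCreate())

src = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "input-topic")
       .load())

query = (src.selectExpr("CAST(value AS STRING) AS value")
         .writeStream.format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("topic", "output-topic")
         .option("checkpointLocation", "/tmp/kafka-latency-chk")
         .trigger(continuous="1 second")   # swap for processingTime=... to test micro-batch
         .start())

query.awaitTermination()
{code}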



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44906) Make Kubernetes[Driver|Executor]Conf.annotations substitute annotations instead of feature steps

2023-08-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-44906:
--
Affects Version/s: 4.0.0
   (was: 3.4.1)

> Make Kubernetes[Driver|Executor]Conf.annotations substitute annotations 
> instead of feature steps
> 
>
> Key: SPARK-44906
> URL: https://issues.apache.org/jira/browse/SPARK-44906
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 4.0.0
>Reporter: Binjie Yang
>Assignee: Binjie Yang
>Priority: Major
> Fix For: 4.0.0
>
>
> Move the Utils.substituteAppNExecIds logic into KubernetesConf.annotations as 
> the default behavior, so it is easy for users to reuse rather than having to 
> reimplement the same logic.
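
A rough Python sketch of the idea (the helper name is hypothetical, and the {{APP_ID}} / 
{{EXECUTOR_ID}} placeholder names are assumed from the usual Kubernetes config templating, 
not stated in this ticket):
{code}
def substituted_annotations(annotations, app_id, exec_id=None):
    # Substitute the application/executor ID placeholders when the annotations
    # map is built, instead of repeating the substitution in each feature step.
    out = {}
    for key, value in annotations.items():
        value = value.replace("{{APP_ID}}", app_id)
        if exec_id is not None:
            value = value.replace("{{EXECUTOR_ID}}", exec_id)
        out[key] = value
    return out
{code}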



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44906) Make Kubernetes[Driver|Executor]Conf.annotations substitute annotations instead of feature steps

2023-08-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-44906:
--
Summary: Make Kubernetes[Driver|Executor]Conf.annotations substitute 
annotations instead of feature steps  (was: Move substituteAppNExecIds logic 
into kubernetesConf.annotations method )

> Make Kubernetes[Driver|Executor]Conf.annotations substitute annotations 
> instead of feature steps
> 
>
> Key: SPARK-44906
> URL: https://issues.apache.org/jira/browse/SPARK-44906
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.1
>Reporter: Binjie Yang
>Assignee: Binjie Yang
>Priority: Major
> Fix For: 4.0.0
>
>
> Move the Utils.substituteAppNExecIds logic into KubernetesConf.annotations as 
> the default behavior, so it is easy for users to reuse rather than having to 
> reimplement the same logic.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44906) Make Kubernetes[Driver|Executor]Conf.annotations substitute annotations instead of feature steps

2023-08-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-44906:
--
Priority: Minor  (was: Major)

> Make Kubernetes[Driver|Executor]Conf.annotations substitute annotations 
> instead of feature steps
> 
>
> Key: SPARK-44906
> URL: https://issues.apache.org/jira/browse/SPARK-44906
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 4.0.0
>Reporter: Binjie Yang
>Assignee: Binjie Yang
>Priority: Minor
> Fix For: 4.0.0
>
>
> Move the Utils.substituteAppNExecIds logic into KubernetesConf.annotations as 
> the default behavior, so it is easy for users to reuse rather than having to 
> reimplement the same logic.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44906) Move substituteAppNExecIds logic into kubernetesConf.annotations method

2023-08-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44906.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42600
[https://github.com/apache/spark/pull/42600]

> Move substituteAppNExecIds logic into kubernetesConf.annotations method 
> 
>
> Key: SPARK-44906
> URL: https://issues.apache.org/jira/browse/SPARK-44906
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.1
>Reporter: Binjie Yang
>Assignee: Binjie Yang
>Priority: Major
> Fix For: 4.0.0
>
>
> Move the Utils.substituteAppNExecIds logic into KubernetesConf.annotations as 
> the default behavior, so it is easy for users to reuse rather than having to 
> reimplement the same logic.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44906) Move substituteAppNExecIds logic into kubernetesConf.annotations method

2023-08-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-44906:
-

Assignee: Binjie Yang

> Move substituteAppNExecIds logic into kubernetesConf.annotations method 
> 
>
> Key: SPARK-44906
> URL: https://issues.apache.org/jira/browse/SPARK-44906
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.1
>Reporter: Binjie Yang
>Assignee: Binjie Yang
>Priority: Major
>
> Move the Utils.substituteAppNExecIds logic into KubernetesConf.annotations as 
> the default behavior, so it is easy for users to reuse rather than having to 
> reimplement the same logic.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44884) Spark doesn't create SUCCESS file when external path is passed

2023-08-23 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17758167#comment-17758167
 ] 

Steve Loughran commented on SPARK-44884:


I'm not trying to replicate it; I have too many other things to do. In open 
source, sadly, everyone gets to fend for themselves, and I'm not actually a 
Spark developer. I'd suggest looking at what changed in .saveAsTable to see 
what possible changes may be to blame...

> Spark doesn't create SUCCESS file when external path is passed
> --
>
> Key: SPARK-44884
> URL: https://issues.apache.org/jira/browse/SPARK-44884
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Dipayan Dev
>Priority: Critical
> Attachments: image-2023-08-20-18-08-38-531.png, 
> image-2023-08-20-18-46-53-342.png
>
>
> The issue is not happening in Spark 2.x (I am using 2.4.0), but only in 3.3.0
> Code to reproduce the issue.
>  
> {code:java}
> scala> spark.conf.set("spark.sql.orc.char.enabled", true)
> scala> val DF = Seq(("test1", 123)).toDF("name", "num")
> scala> DF.write.option("path", 
> "gs://test_dd123/").mode(SaveMode.Overwrite).partitionBy("num").format("orc").saveAsTable("test_schema.table_name")
> 23/08/20 12:31:43 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.   {code}
> The above code succeeds and creates the external Hive table, but {*}no 
> SUCCESS file is generated{*}. The same code, when run on Spark 2.4.0, 
> generates a SUCCESS file.
> Adding the content of the bucket after table creation
>  
> !image-2023-08-20-18-08-38-531.png|width=453,height=162!
>  
> But when I don’t pass the external path, as follows, the SUCCESS file is 
> generated:
> {code:java}
> scala> 
> DF.write.mode(SaveMode.Overwrite).partitionBy("num").format("orc").saveAsTable("us_wm_supply_chain_rcv_pre_prod.test_tb1")
>  {code}
> !image-2023-08-20-18-46-53-342.png|width=465,height=166!
>  
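
As a purely diagnostic sketch (not an explanation of the regression): the _SUCCESS marker is normally written by Hadoop's FileOutputCommitter and is controlled by the mapreduce.fileoutputcommitter.marksuccessfuljobs setting, so it is worth confirming the setting is still enabled in the 3.3.0 environment before digging into the saveAsTable changes. Assuming a spark-shell session:

{code:scala}
// Check whether the committer is even allowed to write _SUCCESS markers.
// This verifies configuration only; it does not identify what changed in 3.3.0.
val hadoopConf = spark.sparkContext.hadoopConfiguration
val markerEnabled =
  hadoopConf.getBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", true)
println(s"_SUCCESS marker enabled: $markerEnabled")
{code}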



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44900) Cached DataFrame keeps growing

2023-08-23 Thread Varun Nalla (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17758161#comment-17758161
 ] 

Varun Nalla commented on SPARK-44900:
-

[~yao], is there a way I could prioritize this issue, as it's causing us 
production impact?

> Cached DataFrame keeps growing
> --
>
> Key: SPARK-44900
> URL: https://issues.apache.org/jira/browse/SPARK-44900
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Varun Nalla
>Priority: Blocker
>
> Scenario:
> We have a Kafka streaming application where data lookups are done by joining 
> another DataFrame which is cached, with the MEMORY_AND_DISK storage level.
> However, the size of the cached DataFrame keeps growing with every micro-batch 
> the streaming application processes, and this is visible under the storage tab.
> A similar stack overflow thread was already raised.
> https://stackoverflow.com/questions/55601779/spark-dataframe-cache-keeps-growing
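
A minimal, runnable sketch of the lookup pattern described above, assuming a spark-shell session (so `spark` and its implicits are in scope); the column names are placeholders and the rate source stands in for the Kafka topic in the report:

{code:scala}
import org.apache.spark.storage.StorageLevel

// Static lookup side, cached once with MEMORY_AND_DISK and reused by every micro-batch.
val lookupDF = Seq((1, "a"), (2, "b")).toDF("key", "label")
  .persist(StorageLevel.MEMORY_AND_DISK)

// Streaming side; the rate source stands in for the Kafka topic.
val streamDF = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
  .withColumn("key", ($"value" % 2 + 1).cast("int"))

// Each micro-batch joins against the same cached lookup data; if the cached size
// shown in the storage tab grows batch after batch, that reuse is what needs
// investigating.
val enriched = streamDF.join(lookupDF, Seq("key"), "left")
val query = enriched.writeStream.format("console").start()
{code}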



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44816) Cryptic error message when UDF associated class is not found

2023-08-23 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-44816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell reassigned SPARK-44816:
-

Assignee: Niranjan Jayakar

> Cryptic error message when UDF associated class is not found
> 
>
> Key: SPARK-44816
> URL: https://issues.apache.org/jira/browse/SPARK-44816
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Niranjan Jayakar
>Assignee: Niranjan Jayakar
>Priority: Major
>
> When a Dataset API is used that either requires or is modeled as a UDF, the 
> class defining the UDF/function should be uploaded to the service first using 
> the `addArtifact()` API.
> When this is not done, an error is thrown. However, this error message is 
> cryptic and is not clear about the problem.
> Improve this error message to make it clear that an expected class was not 
> found.
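
A sketch of the intended flow with the Spark Connect Scala client (the remote address, jar path and UDF here are illustrative placeholders): the class files backing the UDF have to be shipped with addArtifact() before the UDF is executed, otherwise the server hits the class-not-found condition this ticket wants reported more clearly.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().remote("sc://localhost:15002").getOrCreate()
import spark.implicits._

// Upload the jar containing the classes the UDF depends on, before using it.
spark.addArtifact("/path/to/my-udfs.jar")

val plusOne = udf((x: Long) => x + 1)
spark.range(5).select(plusOne($"id")).show()
{code}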



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44816) Cryptic error message when UDF associated class is not found

2023-08-23 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-44816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell resolved SPARK-44816.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

> Cryptic error message when UDF associated class is not found
> 
>
> Key: SPARK-44816
> URL: https://issues.apache.org/jira/browse/SPARK-44816
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Niranjan Jayakar
>Assignee: Niranjan Jayakar
>Priority: Major
> Fix For: 3.5.0
>
>
> When a Dataset API is used that either requires or is modeled as a UDF, the 
> class defining the UDF/function should be uploaded to the service first using 
> the `addArtifact()` API.
> When this is not done, an error is thrown. However, this error message is 
> cryptic and is not clear about the problem.
> Improve this error message to make it clear that an expected class was not 
> found.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44861) [CONNECT] jsonignore SparkListenerConnectOperationStarted.planRequest

2023-08-23 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-44861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell resolved SPARK-44861.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

> [CONNECT] jsonignore SparkListenerConnectOperationStarted.planRequest
> -
>
> Key: SPARK-44861
> URL: https://issues.apache.org/jira/browse/SPARK-44861
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Jean-Francois Desjeans Gauthier
>Assignee: Jean-Francois Desjeans Gauthier
>Priority: Major
> Fix For: 3.5.0
>
>
> SparkListenerConnectOperationStarted was added as part of SPARK-43923.
> SparkListenerConnectOperationStarted.planRequest cannot be serialized and 
> deserialized from JSON as it contains recursive objects.
> Add @JsonIgnoreProperties(\{ "planRequest" }) to avoid failures.
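
An illustrative shape of the proposed change (the field list is trimmed and hypothetical, not the real class definition): annotating the event so Jackson skips the recursive planRequest field keeps the listener event serializable to and from JSON.

{code:scala}
import com.fasterxml.jackson.annotation.JsonIgnoreProperties

// Jackson skips planRequest during (de)serialization, avoiding the recursion.
@JsonIgnoreProperties(Array("planRequest"))
case class SparkListenerConnectOperationStarted(
    jobTag: String,
    operationId: String,
    eventTime: Long,
    extraTags: Map[String, String])
{code}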



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44861) [CONNECT] jsonignore SparkListenerConnectOperationStarted.planRequest

2023-08-23 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-44861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell reassigned SPARK-44861:
-

Assignee: Jean-Francois Desjeans Gauthier

> [CONNECT] jsonignore SparkListenerConnectOperationStarted.planRequest
> -
>
> Key: SPARK-44861
> URL: https://issues.apache.org/jira/browse/SPARK-44861
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Jean-Francois Desjeans Gauthier
>Assignee: Jean-Francois Desjeans Gauthier
>Priority: Major
>
> SparkListenerConnectOperationStarted was added as part of SPARK-43923.
> SparkListenerConnectOperationStarted.planRequest cannot be serialized and 
> deserialized from JSON as it contains recursive objects.
> Add @JsonIgnoreProperties(\{ "planRequest" }) to avoid failures.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23887) update query progress

2023-08-23 Thread Bryan Qiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17758145#comment-17758145
 ] 

Bryan Qiang commented on SPARK-23887:
-

Hello folks, I'm wondering what the final decision on this is. Because 
{{ContinuousExecution}} never calls {{finishTrigger}}, streaming metrics are 
not updated in continuous structured streaming. Thank you very much!
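
A small way to observe the gap being asked about, assuming a spark-shell session (the rate source and console sink are illustrative): the query runs under a continuous trigger, but per the descriptions above the rate metrics never show up in the reported progress because {{finishTrigger}} is not called.

{code:scala}
import org.apache.spark.sql.streaming.Trigger

val stream = spark.readStream.format("rate").load()
val query = stream.writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second"))
  .start()

Thread.sleep(5000)
// Per the report, input/process rates are missing here in continuous mode.
println(query.lastProgress)
println(query.recentProgress.length)
{code}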

> update query progress
> -
>
> Key: SPARK-23887
> URL: https://issues.apache.org/jira/browse/SPARK-23887
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>  Labels: bulk-closed
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44932) Continuous Structured Streaming not reporting streaming metrics

2023-08-23 Thread Bryan Qiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Qiang updated SPARK-44932:

Summary: Continuous Structured Streaming not reporting streaming metrics  
(was: Continuous Structured Streaming )

> Continuous Structured Streaming not reporting streaming metrics
> ---
>
> Key: SPARK-44932
> URL: https://issues.apache.org/jira/browse/SPARK-44932
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.4.1
>Reporter: Bryan Qiang
>Priority: Major
>
> Hello, we've been running Spark continuous structured streaming on a standalone 
> cluster and are happy with the performance. However, we noticed that streaming 
> metrics like input rate and process rate are not updated by the `ProgressReporter` 
> in `ContinuousExecution`, because the 
> [`finishTrigger`|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L293]
>  function is never invoked in `ContinuousExecution`. I'm wondering why, and how 
> we can get metrics as in micro-batch structured streaming.
>  
> !https://preview.redd.it/mzh4oc0cbojb1.png?width=1901=png=webp=8d649ae515e6adb7d6ce853802e0a4134c9fa277!!https://preview.redd.it/8ou3uyuofojb1.png?width=1523=png=webp=ac6bf7fa05cb90b09cb10b9ac815f64b5e97175e!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44932) Continuous Structured Streaming

2023-08-23 Thread Bryan Qiang (Jira)
Bryan Qiang created SPARK-44932:
---

 Summary: Continuous Structured Streaming 
 Key: SPARK-44932
 URL: https://issues.apache.org/jira/browse/SPARK-44932
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.4.1
Reporter: Bryan Qiang


Hello, we've been running Spark continuous structured streaming on a standalone 
cluster and are happy with the performance. However, we noticed that streaming 
metrics like input rate and process rate are not updated by the `ProgressReporter` 
in `ContinuousExecution`, because the 
[`finishTrigger`|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L293]
 function is never invoked in `ContinuousExecution`. I'm wondering why, and how 
we can get metrics as in micro-batch structured streaming.

 

!https://preview.redd.it/mzh4oc0cbojb1.png?width=1901=png=webp=8d649ae515e6adb7d6ce853802e0a4134c9fa277!!https://preview.redd.it/8ou3uyuofojb1.png?width=1523=png=webp=ac6bf7fa05cb90b09cb10b9ac815f64b5e97175e!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44900) Cached DataFrame keeps growing

2023-08-23 Thread Varun Nalla (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Nalla updated SPARK-44900:

Priority: Blocker  (was: Critical)

> Cached DataFrame keeps growing
> --
>
> Key: SPARK-44900
> URL: https://issues.apache.org/jira/browse/SPARK-44900
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Varun Nalla
>Priority: Blocker
>
> Scenario:
> We have a Kafka streaming application where data lookups are done by joining 
> another DataFrame which is cached, with the MEMORY_AND_DISK storage level.
> However, the size of the cached DataFrame keeps growing with every micro-batch 
> the streaming application processes, and this is visible under the storage tab.
> A similar stack overflow thread was already raised.
> https://stackoverflow.com/questions/55601779/spark-dataframe-cache-keeps-growing



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44931) Fix JSON Serailization for Spark Connect Event Listener

2023-08-23 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17758107#comment-17758107
 ] 

GridGain Integration commented on SPARK-44931:
--

User 'grundprinzip' has created a pull request for this issue:
https://github.com/apache/spark/pull/42630

> Fix JSON Serailization for Spark Connect Event Listener
> ---
>
> Key: SPARK-44931
> URL: https://issues.apache.org/jira/browse/SPARK-44931
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Martin Grund
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44549) Support correlated references under window functions

2023-08-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-44549.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42383
[https://github.com/apache/spark/pull/42383]

> Support correlated references under window functions
> 
>
> Key: SPARK-44549
> URL: https://issues.apache.org/jira/browse/SPARK-44549
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Andrey Gubichev
>Assignee: Andrey Gubichev
>Priority: Major
> Fix For: 4.0.0
>
>
> We should support subqueries with correlated references under a window 
> function operator



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44549) Support correlated references under window functions

2023-08-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-44549:
---

Assignee: Andrey Gubichev

> Support correlated references under window functions
> 
>
> Key: SPARK-44549
> URL: https://issues.apache.org/jira/browse/SPARK-44549
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Andrey Gubichev
>Assignee: Andrey Gubichev
>Priority: Major
>
> We should support subqueries with correlated references under a window 
> function operator



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44914) Upgrade Apache ivy to 2.5.2

2023-08-23 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-44914:


Assignee: Bjørn Jørgensen

> Upgrade Apache ivy  to 2.5.2
> 
>
> Key: SPARK-44914
> URL: https://issues.apache.org/jira/browse/SPARK-44914
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
>
> [CVE-2022-46751|https://www.cve.org/CVERecord?id=CVE-2022-46751]
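
For downstream builds that want the patched Ivy before picking up a Spark release that already bundles it, a hypothetical sbt override (coordinates only; adapt to your own build):

{code:scala}
// build.sbt — force the patched Ivy version onto the dependency graph.
dependencyOverrides += "org.apache.ivy" % "ivy" % "2.5.2"
{code}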



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44914) Upgrade Apache ivy to 2.5.2

2023-08-23 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-44914.
--
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42613
[https://github.com/apache/spark/pull/42613]

> Upgrade Apache ivy  to 2.5.2
> 
>
> Key: SPARK-44914
> URL: https://issues.apache.org/jira/browse/SPARK-44914
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> [CVE-2022-46751|https://www.cve.org/CVERecord?id=CVE-2022-46751]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44931) Fix JSON Serailization for Spark Connect Event Listener

2023-08-23 Thread Martin Grund (Jira)
Martin Grund created SPARK-44931:


 Summary: Fix JSON Serailization for Spark Connect Event Listener
 Key: SPARK-44931
 URL: https://issues.apache.org/jira/browse/SPARK-44931
 Project: Spark
  Issue Type: Bug
  Components: Connect
Affects Versions: 3.5.0
Reporter: Martin Grund






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44908) Fix spark connect ML crossvalidator "foldCol" param

2023-08-23 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu resolved SPARK-44908.

Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42605
[https://github.com/apache/spark/pull/42605]

> Fix spark connect ML crossvalidator "foldCol" param
> ---
>
> Key: SPARK-44908
> URL: https://issues.apache.org/jira/browse/SPARK-44908
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, ML
>Affects Versions: 3.5.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Minor
> Fix For: 3.5.0, 4.0.0
>
>
> Fix spark connect ML crossvalidator "foldCol" param.
>  
> Currently it calls `df.rdd` APIs, but these are not supported in Spark Connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44846) PushFoldableIntoBranches in complex grouping expressions may cause bindReference error

2023-08-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757899#comment-17757899
 ] 

ASF GitHub Bot commented on SPARK-44846:


User 'zml1206' has created a pull request for this issue:
https://github.com/apache/spark/pull/42531

> PushFoldableIntoBranches in complex grouping expressions may cause 
> bindReference error
> --
>
> Key: SPARK-44846
> URL: https://issues.apache.org/jira/browse/SPARK-44846
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: zhuml
>Priority: Major
>
> SQL:
> {code:java}
> select c*2 as d from
> (select if(b > 1, 1, b) as c from
> (select if(a < 0, 0 ,a) as b from t group by b) t1
> group by c) t2 {code}
> ERROR:
> {code:java}
> Couldn't find _groupingexpression#15 in [if ((_groupingexpression#15 > 1)) 1 
> else _groupingexpression#15#16]
> java.lang.IllegalStateException: Couldn't find _groupingexpression#15 in [if 
> ((_groupingexpression#15 > 1)) 1 else _groupingexpression#15#16]
>     at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
>     at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:461)
>     at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:461)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:466)
>     at 
> org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren(TreeNode.scala:1241)
>     at 
> org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren$(TreeNode.scala:1240)
>     at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.mapChildren(Expression.scala:653)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:466)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:466)
>     at 
> org.apache.spark.sql.catalyst.trees.TernaryLike.mapChildren(TreeNode.scala:1272)
>     at 
> org.apache.spark.sql.catalyst.trees.TernaryLike.mapChildren$(TreeNode.scala:1271)
>     at 
> org.apache.spark.sql.catalyst.expressions.If.mapChildren(conditionalExpressions.scala:41)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:466)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:466)
>     at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1215)
>     at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1214)
>     at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:533)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:466)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:437)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:405)
>     at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73)
>     at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:94)
>     at scala.collection.immutable.List.map(List.scala:293)
>     at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReferences(BoundAttribute.scala:94)
>     at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.generateResultFunction(HashAggregateExec.scala:360)
>     at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduceWithKeys(HashAggregateExec.scala:538)
>     at 
> org.apache.spark.sql.execution.aggregate.AggregateCodegenSupport.doProduce(AggregateCodegenSupport.scala:69)
>     at 
> org.apache.spark.sql.execution.aggregate.AggregateCodegenSupport.doProduce$(AggregateCodegenSupport.scala:65)
>     at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduce(HashAggregateExec.scala:49)
>     at 
> org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:97)
>     at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:243)
>     at 
> org.apache.spark.sql.execution.CodegenSupport.produce(WholeStageCodegenExec.scala:92)
>     at 
> 

[jira] [Commented] (SPARK-44923) Some directories should be cleared when regenerating files

2023-08-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757897#comment-17757897
 ] 

ASF GitHub Bot commented on SPARK-44923:


User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/42622

> Some directories should be cleared when regenerating files
> --
>
> Key: SPARK-44923
> URL: https://issues.apache.org/jira/browse/SPARK-44923
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44923) Some directories should be cleared when regenerating files

2023-08-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757898#comment-17757898
 ] 

ASF GitHub Bot commented on SPARK-44923:


User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/42622

> Some directories should be cleared when regenerating files
> --
>
> Key: SPARK-44923
> URL: https://issues.apache.org/jira/browse/SPARK-44923
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44923) Some directories should be cleared when regenerating files

2023-08-23 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-44923:
-

Assignee: BingKun Pan

> Some directories should be cleared when regenerating files
> --
>
> Key: SPARK-44923
> URL: https://issues.apache.org/jira/browse/SPARK-44923
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44923) Some directories should be cleared when regenerating files

2023-08-23 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-44923.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42622
[https://github.com/apache/spark/pull/42622]

> Some directories should be cleared when regenerating files
> --
>
> Key: SPARK-44923
> URL: https://issues.apache.org/jira/browse/SPARK-44923
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44928) Replace the module alias 'sf' instead of 'F' in pyspark.sql import functions

2023-08-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757894#comment-17757894
 ] 

ASF GitHub Bot commented on SPARK-44928:


User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/42628

> Replace the module alias 'sf' instead of 'F' in pyspark.sql import functions
> 
>
> Key: SPARK-44928
> URL: https://issues.apache.org/jira/browse/SPARK-44928
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> from pyspark.sql import functions as F
> {code}
> isn’t very Pythonic - it does not follow PEP 8, see  Package and Module Names 
> (https://peps.python.org/pep-0008/#package-and-module-names).
> {quote}
> Modules should have short, all-lowercase names. Underscores can be used in 
> the module name if it improves
> readability. Python packages should also have short, all-lowercase names, 
> although the use of underscores
> is discouraged.
> {quote}
> Therefore, the module’s alias should follow this. In practice, the uppercase 
> is only used at the module/package
> level constants in my experience, see also Constants 
> (https://peps.python.org/pep-0008/#constants).
> See also this stackoverflow comment 
> (https://stackoverflow.com/questions/70458086/how-to-correctly-import-pyspark-sql-functions#comment129714058_70458115).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44930) Deterministic ApplyFunctionExpression should be foldable

2023-08-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757893#comment-17757893
 ] 

ASF GitHub Bot commented on SPARK-44930:


User 'ConeyLiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/42629

> Deterministic ApplyFunctionExpression should be foldable
> 
>
> Key: SPARK-44930
> URL: https://issues.apache.org/jira/browse/SPARK-44930
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Xianyang Liu
>Priority: Major
>
> Currently, ApplyFunctionExpression is unfoldable because it inherits the default 
> value from Expression. However, it should be foldable for a deterministic 
> ApplyFunctionExpression. This could help optimize V2 UDFs 
> applied to constant expressions.
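
A self-contained toy model of the proposed behaviour (simplified names, not Spark's actual classes): a function-apply expression is treated as foldable when it is deterministic and all of its inputs are foldable, which is what would let the optimizer pre-evaluate a V2 UDF called on constants.

{code:scala}
sealed trait Expr {
  def children: Seq[Expr]
  def deterministic: Boolean = children.forall(_.deterministic)
  def foldable: Boolean = false
}

case class Lit(value: Any) extends Expr {
  def children: Seq[Expr] = Nil
  override def foldable: Boolean = true
}

// Stands in for ApplyFunctionExpression: foldable when deterministic with foldable inputs.
case class ApplyFn(deterministicFn: Boolean, children: Seq[Expr]) extends Expr {
  override def deterministic: Boolean = deterministicFn && children.forall(_.deterministic)
  override def foldable: Boolean = deterministic && children.forall(_.foldable)
}

// ApplyFn(deterministicFn = true, Seq(Lit(1), Lit(2))).foldable  == true
// ApplyFn(deterministicFn = false, Seq(Lit(1))).foldable         == false
{code}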



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44930) Deterministic ApplyFunctionExpression should be foldable

2023-08-23 Thread Xianyang Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianyang Liu updated SPARK-44930:
-
Description: Currently, ApplyFunctionExpression is unfoldable because 
inherits the default value from Expression.  However, it should be foldable for 
a deterministic ApplyFunctionExpression. This could help optimize the usage for 
V2 UDF applying to constant expressions.  (was: Currently, 
ApplyFunctionExpression is unfoldable because inherits the default value from 
Expression.  However, it should be foldable for a deterministic 
ApplyFunctionExpression. This could help optimize the usage for V2 UDF applying 
on constant expression.)

> Deterministic ApplyFunctionExpression should be foldable
> 
>
> Key: SPARK-44930
> URL: https://issues.apache.org/jira/browse/SPARK-44930
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Xianyang Liu
>Priority: Major
>
> Currently, ApplyFunctionExpression is unfoldable because it inherits the default 
> value from Expression. However, it should be foldable for a deterministic 
> ApplyFunctionExpression. This could help optimize V2 UDFs 
> applied to constant expressions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44219) Add extra per-rule validation for optimization rewrites.

2023-08-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757892#comment-17757892
 ] 

ASF GitHub Bot commented on SPARK-44219:


User 'YannisSismanis' has created a pull request for this issue:
https://github.com/apache/spark/pull/41763

> Add extra per-rule validation for optimization rewrites.
> 
>
> Key: SPARK-44219
> URL: https://issues.apache.org/jira/browse/SPARK-44219
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.4.0, 3.4.1
>Reporter: Yannis Sismanis
>Priority: Major
>
> Adds per-rule validation checks for the following:
> 1. Aggregate expressions in Aggregate plans are valid.
> 2. Grouping key types in Aggregate plans cannot be of type Map.
> 3. No dangling references have been generated.
> This validation is enabled by default for all tests, or selectively using 
> the spark.sql.planChangeValidation=true flag (see the sketch below).
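
A short sketch of enabling the flag outside the test suites, assuming it can be set when the session is built (as with other SQL configs passed via --conf or the builder):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.planChangeValidation", "true")
  .getOrCreate()
{code}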



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44930) Deterministic ApplyFunctionExpression should be foldable

2023-08-23 Thread Xianyang Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianyang Liu updated SPARK-44930:
-
Description: Currently, ApplyFunctionExpression is unfoldable because 
inherits the default value from Expression.  However, it should be foldable for 
a deterministic ApplyFunctionExpression. This could help optimize the usage for 
V2 UDF applying on constant expression.  (was: Currently, 
ApplyFunctionExpression is unfoldable because inherits the default value from 
Expression.  However, it should be foldable for a deterministic 
ApplyFunctionExpression. This could help optimize the usage for V2 UDF applying 
on constant literal.)

> Deterministic ApplyFunctionExpression should be foldable
> 
>
> Key: SPARK-44930
> URL: https://issues.apache.org/jira/browse/SPARK-44930
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Xianyang Liu
>Priority: Major
>
> Currently, ApplyFunctionExpression is unfoldable because it inherits the default 
> value from Expression. However, it should be foldable for a deterministic 
> ApplyFunctionExpression. This could help optimize V2 UDFs 
> applied to constant expressions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44930) Deterministic ApplyFunctionExpression should be foldable

2023-08-23 Thread Xianyang Liu (Jira)
Xianyang Liu created SPARK-44930:


 Summary: Deterministic ApplyFunctionExpression should be foldable
 Key: SPARK-44930
 URL: https://issues.apache.org/jira/browse/SPARK-44930
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.1
Reporter: Xianyang Liu


Currently, ApplyFunctionExpression is unfoldable because it inherits the default 
value from Expression. However, it should be foldable for a deterministic 
ApplyFunctionExpression. This could help optimize V2 UDFs applied 
to constant literals.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44928) Replace the module alias 'sf' instead of 'F' in pyspark.sql import functions

2023-08-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-44928:
-
Description: 
{code}
from pyspark.sql import functions as F
{code}

isn’t very Pythonic - it does not follow PEP 8, see  Package and Module Names 
(https://peps.python.org/pep-0008/#package-and-module-names).

{quote}
Modules should have short, all-lowercase names. Underscores can be used in the 
module name if it improves
readability. Python packages should also have short, all-lowercase names, 
although the use of underscores
is discouraged.
{quote}

Therefore, the module’s alias should follow this. In practice, the uppercase is 
only used at the module/package
level constants in my experience, see also Constants 
(https://peps.python.org/pep-0008/#constants).

See also this stackoverflow comment 
(https://stackoverflow.com/questions/70458086/how-to-correctly-import-pyspark-sql-functions#comment129714058_70458115).

  was:
{code}
from pyspark.sql import functions as F
{code}

isn’t very Pythonic - it does not follow PEP 8, see  Package and Module Names 
(https://peps.python.org/pep-0008/#package-and-module-names).

Modules should have short, all-lowercase names. Underscores can be used in the 
module name if it improves
readability. Python packages should also have short, all-lowercase names, 
although the use of underscores
is discouraged.

Therefore, the module’s alias should follow this. In practice, the uppercase is 
only used at the module/package
level constants in my experience, see also Constants 
(https://peps.python.org/pep-0008/#constants).

See also this stackoverflow comment 
(https://stackoverflow.com/questions/70458086/how-to-correctly-import-pyspark-sql-functions#comment129714058_70458115).


> Replace the module alias 'sf' instead of 'F' in pyspark.sql import functions
> 
>
> Key: SPARK-44928
> URL: https://issues.apache.org/jira/browse/SPARK-44928
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> from pyspark.sql import functions as F
> {code}
> isn’t very Pythonic - it does not follow PEP 8, see  Package and Module Names 
> (https://peps.python.org/pep-0008/#package-and-module-names).
> {quote}
> Modules should have short, all-lowercase names. Underscores can be used in 
> the module name if it improves
> readability. Python packages should also have short, all-lowercase names, 
> although the use of underscores
> is discouraged.
> {quote}
> Therefore, the module’s alias should follow this. In practice, the uppercase 
> is only used at the module/package
> level constants in my experience, see also Constants 
> (https://peps.python.org/pep-0008/#constants).
> See also this stackoverflow comment 
> (https://stackoverflow.com/questions/70458086/how-to-correctly-import-pyspark-sql-functions#comment129714058_70458115).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Deleted] (SPARK-44927) Replace the module alias 'sf' instead of 'F' in pyspark.sql import functions

2023-08-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon deleted SPARK-44927:
-


> Replace the module alias 'sf' instead of 'F' in pyspark.sql import functions
> 
>
> Key: SPARK-44927
> URL: https://issues.apache.org/jira/browse/SPARK-44927
> Project: Spark
>  Issue Type: Documentation
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> from pyspark.sql import functions as F
> {code}
> isn’t very Pythonic - it does not follow PEP 8, see  Package and Module Names 
> (https://peps.python.org/pep-0008/#package-and-module-names).
> Modules should have short, all-lowercase names. Underscores can be used in 
> the module name if it improves
> readability. Python packages should also have short, all-lowercase names, 
> although the use of underscores
> is discouraged.
> Therefore, the module’s alias should follow this. In practice, the uppercase 
> is only used at the module/package
> level constants in my experience, see also Constants 
> (https://peps.python.org/pep-0008/#constants).
> See also this stackoverflow comment 
> (https://stackoverflow.com/questions/70458086/how-to-correctly-import-pyspark-sql-functions#comment129714058_70458115).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Deleted] (SPARK-44926) Replace the module alias 'sf' instead of 'F' in pyspark.sql import functions

2023-08-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon deleted SPARK-44926:
-


> Replace the module alias 'sf' instead of 'F' in pyspark.sql import functions
> 
>
> Key: SPARK-44926
> URL: https://issues.apache.org/jira/browse/SPARK-44926
> Project: Spark
>  Issue Type: Documentation
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> from pyspark.sql import functions as F
> {code}
> isn’t very Pythonic - it does not follow PEP 8, see  Package and Module Names 
> (https://peps.python.org/pep-0008/#package-and-module-names).
> Modules should have short, all-lowercase names. Underscores can be used in 
> the module name if it improves
> readability. Python packages should also have short, all-lowercase names, 
> although the use of underscores
> is discouraged.
> Therefore, the module’s alias should follow this. In practice, the uppercase 
> is only used at the module/package
> level constants in my experience, see also Constants 
> (https://peps.python.org/pep-0008/#constants).
> See also this stackoverflow comment 
> (https://stackoverflow.com/questions/70458086/how-to-correctly-import-pyspark-sql-functions#comment129714058_70458115).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44929) Truncate log output for console appender in tests

2023-08-23 Thread Kent Yao (Jira)
Kent Yao created SPARK-44929:


 Summary: Truncate log output for console appender in tests
 Key: SPARK-44929
 URL: https://issues.apache.org/jira/browse/SPARK-44929
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44871) Fix PERCENTILE_DISC behaviour

2023-08-23 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-44871:
---
Fix Version/s: 3.5.0

> Fix PERCENTILE_DISC behaviour
> -
>
> Key: SPARK-44871
> URL: https://issues.apache.org/jira/browse/SPARK-44871
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.3.1, 3.3.3, 3.3.2, 3.4.0, 3.4.1
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Critical
> Fix For: 3.4.2, 3.5.0, 4.0.0, 3.3.4
>
>
> Currently {{percentile_disc()}} returns incorrect results in some cases:
> E.g.:
> {code:java}
> SELECT
>   percentile_disc(0.0) WITHIN GROUP (ORDER BY a) as p0,
>   percentile_disc(0.1) WITHIN GROUP (ORDER BY a) as p1,
>   percentile_disc(0.2) WITHIN GROUP (ORDER BY a) as p2,
>   percentile_disc(0.3) WITHIN GROUP (ORDER BY a) as p3,
>   percentile_disc(0.4) WITHIN GROUP (ORDER BY a) as p4,
>   percentile_disc(0.5) WITHIN GROUP (ORDER BY a) as p5,
>   percentile_disc(0.6) WITHIN GROUP (ORDER BY a) as p6,
>   percentile_disc(0.7) WITHIN GROUP (ORDER BY a) as p7,
>   percentile_disc(0.8) WITHIN GROUP (ORDER BY a) as p8,
>   percentile_disc(0.9) WITHIN GROUP (ORDER BY a) as p9,
>   percentile_disc(1.0) WITHIN GROUP (ORDER BY a) as p10
> FROM VALUES (0), (1), (2), (3), (4) AS v(a)
> {code}
> returns:
> {code:java}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|2.0|3.0|3.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {code}
> but it should return:
> {noformat}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|3.0|3.0|4.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44926) Replace the module alias 'sf' instead of 'F' in pyspark.sql import functions

2023-08-23 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-44926:


 Summary: Replace the module alias 'sf' instead of 'F' in 
pyspark.sql import functions
 Key: SPARK-44926
 URL: https://issues.apache.org/jira/browse/SPARK-44926
 Project: Spark
  Issue Type: Documentation
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Hyukjin Kwon


{code}
from pyspark.sql import functions as F
{code}

isn’t very Pythonic - it does not follow PEP 8, see  Package and Module Names 
(https://peps.python.org/pep-0008/#package-and-module-names).

Modules should have short, all-lowercase names. Underscores can be used in the 
module name if it improves
readability. Python packages should also have short, all-lowercase names, 
although the use of underscores
is discouraged.

Therefore, the module’s alias should follow this. In practice, the uppercase is 
only used at the module/package
level constants in my experience, see also Constants 
(https://peps.python.org/pep-0008/#constants).

See also this stackoverflow comment 
(https://stackoverflow.com/questions/70458086/how-to-correctly-import-pyspark-sql-functions#comment129714058_70458115).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44928) Replace the module alias 'sf' instead of 'F' in pyspark.sql import functions

2023-08-23 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-44928:


 Summary: Replace the module alias 'sf' instead of 'F' in 
pyspark.sql import functions
 Key: SPARK-44928
 URL: https://issues.apache.org/jira/browse/SPARK-44928
 Project: Spark
  Issue Type: Documentation
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Hyukjin Kwon


{code}
from pyspark.sql import functions as F
{code}

isn’t very Pythonic - it does not follow PEP 8, see  Package and Module Names 
(https://peps.python.org/pep-0008/#package-and-module-names).

Modules should have short, all-lowercase names. Underscores can be used in the 
module name if it improves
readability. Python packages should also have short, all-lowercase names, 
although the use of underscores
is discouraged.

Therefore, the module’s alias should follow this. In practice, the uppercase is 
only used at the module/package
level constants in my experience, see also Constants 
(https://peps.python.org/pep-0008/#constants).

See also this stackoverflow comment 
(https://stackoverflow.com/questions/70458086/how-to-correctly-import-pyspark-sql-functions#comment129714058_70458115).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44927) Replace the module alias 'sf' instead of 'F' in pyspark.sql import functions

2023-08-23 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-44927:


 Summary: Replace the module alias 'sf' instead of 'F' in 
pyspark.sql import functions
 Key: SPARK-44927
 URL: https://issues.apache.org/jira/browse/SPARK-44927
 Project: Spark
  Issue Type: Documentation
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Hyukjin Kwon


{code}
from pyspark.sql import functions as F
{code}

isn’t very Pythonic - it does not follow PEP 8, see  Package and Module Names 
(https://peps.python.org/pep-0008/#package-and-module-names).

Modules should have short, all-lowercase names. Underscores can be used in the 
module name if it improves
readability. Python packages should also have short, all-lowercase names, 
although the use of underscores
is discouraged.

Therefore, the module’s alias should follow this. In practice, the uppercase is 
only used at the module/package
level constants in my experience, see also Constants 
(https://peps.python.org/pep-0008/#constants).

See also this stackoverflow comment 
(https://stackoverflow.com/questions/70458086/how-to-correctly-import-pyspark-sql-functions#comment129714058_70458115).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44871) Fix PERCENTILE_DISC behaviour

2023-08-23 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-44871:
-
Fix Version/s: 4.0.0

> Fix PERCENTILE_DISC behaviour
> -
>
> Key: SPARK-44871
> URL: https://issues.apache.org/jira/browse/SPARK-44871
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.3.1, 3.3.3, 3.3.2, 3.4.0, 3.4.1
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Critical
> Fix For: 3.4.2, 4.0.0, 3.3.4
>
>
> Currently {{percentile_disc()}} returns incorrect results in some cases:
> E.g.:
> {code:java}
> SELECT
>   percentile_disc(0.0) WITHIN GROUP (ORDER BY a) as p0,
>   percentile_disc(0.1) WITHIN GROUP (ORDER BY a) as p1,
>   percentile_disc(0.2) WITHIN GROUP (ORDER BY a) as p2,
>   percentile_disc(0.3) WITHIN GROUP (ORDER BY a) as p3,
>   percentile_disc(0.4) WITHIN GROUP (ORDER BY a) as p4,
>   percentile_disc(0.5) WITHIN GROUP (ORDER BY a) as p5,
>   percentile_disc(0.6) WITHIN GROUP (ORDER BY a) as p6,
>   percentile_disc(0.7) WITHIN GROUP (ORDER BY a) as p7,
>   percentile_disc(0.8) WITHIN GROUP (ORDER BY a) as p8,
>   percentile_disc(0.9) WITHIN GROUP (ORDER BY a) as p9,
>   percentile_disc(1.0) WITHIN GROUP (ORDER BY a) as p10
> FROM VALUES (0), (1), (2), (3), (4) AS v(a)
> {code}
> returns:
> {code:java}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|2.0|3.0|3.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {code}
> but it should return:
> {noformat}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|3.0|3.0|4.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44871) Fix PERCENTILE_DISC behaviour

2023-08-23 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-44871:
-
Fix Version/s: 3.3.4

> Fix PERCENTILE_DISC behaviour
> -
>
> Key: SPARK-44871
> URL: https://issues.apache.org/jira/browse/SPARK-44871
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.3.1, 3.3.3, 3.3.2, 3.4.0, 3.4.1
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Critical
> Fix For: 3.4.2, 3.3.4
>
>
> Currently {{percentile_disc()}} returns incorrect results in some cases:
> E.g.:
> {code:java}
> SELECT
>   percentile_disc(0.0) WITHIN GROUP (ORDER BY a) as p0,
>   percentile_disc(0.1) WITHIN GROUP (ORDER BY a) as p1,
>   percentile_disc(0.2) WITHIN GROUP (ORDER BY a) as p2,
>   percentile_disc(0.3) WITHIN GROUP (ORDER BY a) as p3,
>   percentile_disc(0.4) WITHIN GROUP (ORDER BY a) as p4,
>   percentile_disc(0.5) WITHIN GROUP (ORDER BY a) as p5,
>   percentile_disc(0.6) WITHIN GROUP (ORDER BY a) as p6,
>   percentile_disc(0.7) WITHIN GROUP (ORDER BY a) as p7,
>   percentile_disc(0.8) WITHIN GROUP (ORDER BY a) as p8,
>   percentile_disc(0.9) WITHIN GROUP (ORDER BY a) as p9,
>   percentile_disc(1.0) WITHIN GROUP (ORDER BY a) as p10
> FROM VALUES (0), (1), (2), (3), (4) AS v(a)
> {code}
> returns:
> {code:java}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|2.0|3.0|3.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {code}
> but it should return:
> {noformat}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|3.0|3.0|4.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-42905) pyspark.ml.stat.Correlation - Spearman Correlation method giving incorrect and inconsistent results for the same DataFrame if it has huge amount of Ties.

2023-08-23 Thread zhangzhenhao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757858#comment-17757858
 ] 

zhangzhenhao edited comment on SPARK-42905 at 8/23/23 7:35 AM:
---

Minimal reproducible example. The result is incorrect and inconsistent when 
the tied value size is > 10_000_000

 
{code:java}
import org.apache.spark.ml.linalg.{Matrix, Vectors, Vector}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

val N = 1002
val x = sc.range(0, N).map(i => if (i < N - 1) 1.0 else 2.0)
val y = sc.range(0, N).map(i => if (i < N - 1) 2.0 else 1.0)
//val s1 = Statistics.corr(x, y, "spearman")
val df = x.zip(y)
  .map{case (x, y) => Vectors.dense(x, y)}
  .map(Tuple1.apply)
  .repartition(1) 
  .toDF("features")
  
val Row(coeff1: Matrix) = Correlation.corr(df, "features", "spearman").head
val r = coeff1(0, 1)
println(s"pearson correlation in spark: $r")
// pearson correlation in spark: -9.90476024495E-8 {code}
 

 

the correct result is -1.0


was (Author: JIRAUSER301717):
minimal reproducible example, the result is incorrect and inconsistent when 
tied value size > 10_000_000

```scala
import org.apache.spark.ml.linalg.{Matrix, Vectors, Vector}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

val N = 1002
val x = sc.range(0, N).map(i => if (i < N - 1) 1.0 else 2.0)
val y = sc.range(0, N).map(i => if (i < N - 1) 2.0 else 1.0)
val df = x.zip(y)
  .map{case (x, y) => Vectors.dense(x, y)}
  .map(Tuple1.apply)
  .repartition(1)
  .toDF("features")

val Row(coeff1: Matrix) = Correlation.corr(df, "features", "spearman").head
val r = coeff1(0, 1)
println(s"pearson correlation in spark: $r")
// pearson correlation in spark: -9.90476024495E-8
```

correct result is -1.0

> pyspark.ml.stat.Correlation - Spearman Correlation method giving incorrect 
> and inconsistent results for the same DataFrame if it has huge amount of Ties.
> -
>
> Key: SPARK-42905
> URL: https://issues.apache.org/jira/browse/SPARK-42905
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.3.0
>Reporter: dronzer
>Priority: Critical
>  Labels: correctness
> Attachments: image-2023-03-23-10-51-28-420.png, 
> image-2023-03-23-10-52-11-481.png, image-2023-03-23-10-52-49-392.png, 
> image-2023-03-23-10-53-37-461.png, image-2023-03-23-10-55-26-879.png
>
>
> pyspark.ml.stat.Correlation
> The following is the scenario where the Correlation function fails to give 
> correct Spearman coefficient results.
> Tested example: a Spark DataFrame with 2 columns, A and B.
> !image-2023-03-23-10-55-26-879.png|width=562,height=162!
> Column A has 3 distinct values over a total of 108 million rows.
> Column B has 4 distinct values over a total of 108 million rows.
> If I calculate the correlation for this DataFrame with pandas DF.corr in Python, 
> it gives the correct answer, and even if I run the same code multiple times, the 
> same answer is produced. (Each column has only 3-4 distinct values.)
> !image-2023-03-23-10-53-37-461.png|width=468,height=287!
>  
> In Spark, using Spearman correlation produces *different results* 
> for the *same dataframe* on multiple runs (see below; each column in this 
> df has only 3-4 distinct values).
> !image-2023-03-23-10-52-49-392.png|width=516,height=322!
>  
> Basically, pandas DF.corr in Python gives the same results for the same dataframe 
> on multiple runs, which is the expected behaviour. However, Spark gives a different 
> result for the same data; moreover, running the same cell with the same data multiple 
> times produces different results, meaning the output is inconsistent.
> Looking at the data, the only observation I could draw is the ties in the data (only 
> 3-4 distinct values over 108M rows). This scenario is not handled by Spark's 
> Correlation method, whereas the same data produces consistent results with df.corr 
> in Python.
> The only workaround we could find to get consistent output in Spark, matching the 
> output from Python, is to use a Pandas UDF, as shown below:
> !image-2023-03-23-10-52-11-481.png|width=518,height=111!
> !image-2023-03-23-10-51-28-420.png|width=509,height=270!
>  
> We also tried the pyspark.pandas.DataFrame.corr method, and it produces incorrect 
> and inconsistent results for this case too.
> Only a Pandas UDF seems to provide consistent results.
>  
> Another point to note: if I add some random noise to the data, which in 
> turn increases the number of distinct values, it again gives consistent 
> results across runs. This makes me believe that the Python version handles 
> ties correctly and gives consistent results no matter how many ties exist. 
> However, the PySpark method is somehow not able to handle many ties in the data.

[jira] [Resolved] (SPARK-44909) Skip starting torch distributor log streaming server when it is not available

2023-08-23 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu resolved SPARK-44909.

Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42606
[https://github.com/apache/spark/pull/42606]

> Skip starting torch distributor log streaming server when it is not available
> -
>
> Key: SPARK-44909
> URL: https://issues.apache.org/jira/browse/SPARK-44909
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 0.5.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> Skip starting torch distributor log streaming server when it is not available.
>  
> In some cases, e.g., in a Databricks Connect cluster, a network 
> limitation causes the log streaming server to fail to start, but this does 
> not need to break the torch distributor training routine.
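As a rough illustration of the behaviour described above, the fix amounts to treating log-streaming-server startup as best-effort. The sketch below uses hypothetical names and is not the actual TorchDistributor code:

{code:python}
# Illustration only: hypothetical names, not the real pyspark.ml.torch API.
def start_log_streaming_server():
    # Stand-in for the real server startup; here it always fails, as it might
    # under a restrictive network setup such as a Databricks Connect cluster.
    raise OSError("cannot bind log streaming port")

log_server = None
try:
    log_server = start_log_streaming_server()
except Exception as exc:
    # Startup failure is reported and ignored instead of aborting training.
    print(f"Log streaming unavailable, continuing without it: {exc}")

# The training routine runs whether or not the log server came up.
print("running torch training loop...")
{code}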



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42905) pyspark.ml.stat.Correlation - Spearman Correlation method giving incorrect and inconsistent results for the same DataFrame if it has huge amount of Ties.

2023-08-23 Thread zhangzhenhao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757858#comment-17757858
 ] 

zhangzhenhao commented on SPARK-42905:
--

minimal reproducible example, the result is incorrect and inconsistent when 
tied value size > 10_000_000

```scala
import org.apache.spark.ml.linalg.{Matrix, Vectors, Vector}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

val N = 1002
val x = sc.range(0, N).map(i => if (i < N - 1) 1.0 else 2.0)
val y = sc.range(0, N).map(i => if (i < N - 1) 2.0 else 1.0)
val df = x.zip(y)
  .map{case (x, y) => Vectors.dense(x, y)}
  .map(Tuple1.apply)
  .repartition(1)
  .toDF("features")

val Row(coeff1: Matrix) = Correlation.corr(df, "features", "spearman").head
val r = coeff1(0, 1)
println(s"pearson correlation in spark: $r")
// pearson correlation in spark: -9.90476024495E-8
```

correct result is -1.0

> pyspark.ml.stat.Correlation - Spearman Correlation method giving incorrect 
> and inconsistent results for the same DataFrame if it has huge amount of Ties.
> -
>
> Key: SPARK-42905
> URL: https://issues.apache.org/jira/browse/SPARK-42905
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.3.0
>Reporter: dronzer
>Priority: Critical
>  Labels: correctness
> Attachments: image-2023-03-23-10-51-28-420.png, 
> image-2023-03-23-10-52-11-481.png, image-2023-03-23-10-52-49-392.png, 
> image-2023-03-23-10-53-37-461.png, image-2023-03-23-10-55-26-879.png
>
>
> pyspark.ml.stat.Correlation
> The following is the scenario where the Correlation function fails to give 
> correct Spearman coefficient results.
> Tested example: a Spark DataFrame with 2 columns, A and B.
> !image-2023-03-23-10-55-26-879.png|width=562,height=162!
> Column A has 3 distinct values over a total of 108 million rows.
> Column B has 4 distinct values over a total of 108 million rows.
> If I calculate the correlation for this DataFrame with pandas DF.corr in Python, 
> it gives the correct answer, and even if I run the same code multiple times, the 
> same answer is produced. (Each column has only 3-4 distinct values.)
> !image-2023-03-23-10-53-37-461.png|width=468,height=287!
>  
> In Spark, using Spearman correlation produces *different results* 
> for the *same dataframe* on multiple runs (see below; each column in this 
> df has only 3-4 distinct values).
> !image-2023-03-23-10-52-49-392.png|width=516,height=322!
>  
> Basically, pandas DF.corr in Python gives the same results for the same dataframe 
> on multiple runs, which is the expected behaviour. However, Spark gives a different 
> result for the same data; moreover, running the same cell with the same data multiple 
> times produces different results, meaning the output is inconsistent.
> Looking at the data, the only observation I could draw is the ties in the data (only 
> 3-4 distinct values over 108M rows). This scenario is not handled by Spark's 
> Correlation method, whereas the same data produces consistent results with df.corr 
> in Python.
> The only workaround we could find to get consistent output in Spark, matching the 
> output from Python, is to use a Pandas UDF, as shown below:
> !image-2023-03-23-10-52-11-481.png|width=518,height=111!
> !image-2023-03-23-10-51-28-420.png|width=509,height=270!
>  
> We also tried the pyspark.pandas.DataFrame.corr method, and it produces incorrect 
> and inconsistent results for this case too.
> Only a Pandas UDF seems to provide consistent results.
>  
> Another point to note: if I add some random noise to the data, which in 
> turn increases the number of distinct values, it again gives consistent 
> results across runs. This makes me believe that the Python version handles 
> ties correctly and gives consistent results no matter how many ties exist. 
> However, the PySpark method is somehow not able to handle many ties in the data.
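The Pandas UDF workaround referenced above exists only as screenshots in the original report. The following is a hedged sketch of what such a workaround could look like; the column names "A" and "B", the tiny stand-in DataFrame, and the single-group trick are assumptions for illustration, not the reporter's code:

{code:python}
# Hypothetical sketch of a Pandas-UDF-style workaround: route both columns
# through applyInPandas and let pandas compute the Spearman coefficient.
# Note: grouping everything into one group ships all rows to a single worker.
import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
# Tiny stand-in for the reporter's 108M-row DataFrame with columns A and B.
df = spark.createDataFrame(
    [(1.0, 2.0), (1.0, 2.0), (2.0, 3.0), (3.0, 2.0)], ["A", "B"]
)

def spearman_corr(pdf: pd.DataFrame) -> pd.DataFrame:
    # pandas resolves ties with average ranks, matching the textbook definition.
    return pd.DataFrame({"spearman": [pdf["A"].corr(pdf["B"], method="spearman")]})

result = (
    df.groupBy(F.lit(1))  # one group, so pandas sees every row at once
      .applyInPandas(spearman_corr, schema="spearman double")
)
result.show()
{code}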



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44909) Skip starting torch distributor log streaming server when it is not available

2023-08-23 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu reassigned SPARK-44909:
--

Assignee: Weichen Xu

> Skip starting torch distributor log streaming server when it is not available
> -
>
> Key: SPARK-44909
> URL: https://issues.apache.org/jira/browse/SPARK-44909
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 0.5.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
>
> Skip starting torch distributor log streaming server when it is not available.
>  
> In some cases, e.g., in a Databricks Connect cluster, a network 
> limitation causes the log streaming server to fail to start, but this does 
> not need to break the torch distributor training routine.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44899) Refine the docstring of `DataFrame.collect`

2023-08-23 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-44899:
-

Assignee: Allison Wang

> Refine the docstring of `DataFrame.collect`
> ---
>
> Key: SPARK-44899
> URL: https://issues.apache.org/jira/browse/SPARK-44899
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
>
> Make the docstring of DataFrame.collect() better and add more examples.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44899) Refine the docstring of `DataFrame.collect`

2023-08-23 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-44899.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42592
[https://github.com/apache/spark/pull/42592]

> Refine the docstring of `DataFrame.collect`
> ---
>
> Key: SPARK-44899
> URL: https://issues.apache.org/jira/browse/SPARK-44899
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
> Fix For: 4.0.0
>
>
> Make the docstring of DataFrame.collect() better and add more examples.
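For context, a minimal usage example of the method in question (illustrative only; not the wording added to the docstring by the pull request):

{code:python}
# DataFrame.collect() returns all rows to the driver as a list of Row objects.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
rows = df.collect()
print(rows)                        # [Row(id=1, value='a'), Row(id=2, value='b')]
print(rows[0].id, rows[0].value)   # 1 a
{code}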



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44921) Remove SqlBaseLexer.tokens from codebase

2023-08-23 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-44921.
--
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42620
[https://github.com/apache/spark/pull/42620]

> Remove SqlBaseLexer.tokens from codebase
> 
>
> Key: SPARK-44921
> URL: https://issues.apache.org/jira/browse/SPARK-44921
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44920) Use await() instead of awaitUninterruptibly() in TransportClientFactory.createClient()

2023-08-23 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757826#comment-17757826
 ] 

Hudson commented on SPARK-44920:


User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/42619

> Use await() instead of awaitUninterruptibly() in 
> TransportClientFactory.createClient() 
> ---
>
> Key: SPARK-44920
> URL: https://issues.apache.org/jira/browse/SPARK-44920
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.3, 3.4.2, 3.5.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Major
> Fix For: 3.4.2, 3.5.0, 4.0.0, 3.3.4
>
>
> This is a follow-up to SPARK-44241:
> That change added an `awaitUninterruptibly()` call, which I think should be a 
> plain `await()` instead. This will prevent issues when cancelling tasks with 
> hanging network connections.
> This issue is similar to SPARK-19529.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org