[jira] [Commented] (SPARK-45002) Avoid uncaught exception from state store maintenance task thread on error

2023-08-28 Thread Anish Shrigondekar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759829#comment-17759829
 ] 

Anish Shrigondekar commented on SPARK-45002:


PR here: [https://github.com/apache/spark/pull/42716]

 

cc - [~kabhwan] 

> Avoid uncaught exception from state store maintenance task thread on error
> --
>
> Key: SPARK-45002
> URL: https://issues.apache.org/jira/browse/SPARK-45002
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.5.1
>Reporter: Anish Shrigondekar
>Priority: Major
>
> Avoid uncaught exception from state store maintenance task thread on error
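
For illustration, a minimal sketch of the kind of guard the ticket title asks for, assuming a periodic maintenance task on a scheduled executor; the names and logging are placeholders, not the actual Spark implementation:

{code:java}
import java.util.concurrent.{Executors, TimeUnit}
import scala.util.control.NonFatal

object MaintenanceGuardSketch {
  // Placeholder for the real state store maintenance work.
  private def doMaintenance(): Unit = { /* ... */ }

  def main(args: Array[String]): Unit = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    val task: Runnable = () =>
      try doMaintenance()
      catch {
        // Catch non-fatal errors so the periodic task keeps running and no
        // uncaught exception ever escapes the maintenance thread.
        case NonFatal(e) => System.err.println(s"Maintenance task failed: $e")
      }
    scheduler.scheduleAtFixedRate(task, 0, 60, TimeUnit.SECONDS)
  }
}
{code}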






[jira] [Created] (SPARK-45003) Refine docstring of `asc/desc`

2023-08-28 Thread Yang Jie (Jira)
Yang Jie created SPARK-45003:


 Summary: Refine docstring of `asc/desc`
 Key: SPARK-45003
 URL: https://issues.apache.org/jira/browse/SPARK-45003
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: Yang Jie









[jira] [Created] (SPARK-45002) Avoid uncaught exception from state store maintenance task thread on error

2023-08-28 Thread Anish Shrigondekar (Jira)
Anish Shrigondekar created SPARK-45002:
--

 Summary: Avoid uncaught exception from state store maintenance 
task thread on error
 Key: SPARK-45002
 URL: https://issues.apache.org/jira/browse/SPARK-45002
 Project: Spark
  Issue Type: Task
  Components: Structured Streaming
Affects Versions: 3.5.1
Reporter: Anish Shrigondekar


Avoid uncaught exception from state store maintenance task thread on error






[jira] [Resolved] (SPARK-44996) VolcanoFeatureStep should not create `DefaultVolcanoClient` if not needed

2023-08-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44996.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42710
[https://github.com/apache/spark/pull/42710]

> VolcanoFeatureStep should not create `DefaultVolcanoClient` if not needed
> -
>
> Key: SPARK-44996
> URL: https://issues.apache.org/jira/browse/SPARK-44996
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 4.0.0
>
>
> Since `VolcanoFeatureStep` always creates `DefaultVolcanoClient`, the unit 
> test suite `VolcanoFeatureStepSuite` behaves like an integration test. In 
> other words, it fails when there is no backend K8s cluster.
> {code}
> $ build/sbt -Pkubernetes -Pvolcano "kubernetes/testOnly *Volcano* -- -z 
> SPARK-36061"
> ...
> [info] VolcanoFeatureStepSuite:
> [info] - SPARK-36061: Driver Pod with Volcano PodGroup *** FAILED *** (646 
> milliseconds)
> [info]   org.snakeyaml.engine.v2.exceptions.ScannerException: mapping values 
> are not allowed here
> [info]  in reader, line 1, column 94:
> [info]  ... well-known/openid-configuration": dial tcp: lookup iam.corp. 
> ...
> [info]  ^
> {code}
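
A hedged sketch of the general shape of such a fix: build the client lazily so code paths (and unit tests) that never need it do not touch any K8s backend. The class and factory names below are illustrative, not the actual Spark code:

{code:java}
// Illustrative only: defer expensive client creation until first use.
class VolcanoStepSketch(newClient: () => AutoCloseable) {
  private lazy val client: AutoCloseable = newClient() // built on first use

  def configurePod(usesPodGroup: Boolean): Unit = {
    if (usesPodGroup) {
      val c = client // the client is created only on this path
      c.close()      // placeholder for real use
    }
  }
}
{code}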






[jira] [Assigned] (SPARK-44996) VolcanoFeatureStep should not create `DefaultVolcanoClient` if not needed

2023-08-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-44996:
-

Assignee: Dongjoon Hyun

> VolcanoFeatureStep should not create `DefaultVolcanoClient` if not needed
> -
>
> Key: SPARK-44996
> URL: https://issues.apache.org/jira/browse/SPARK-44996
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>
> Since `VolcanoFeatureStep` always creates `DefaultVolcanoClient`, the unit 
> test suite `VolcanoFeatureStepSuite` behaves like an integration test. In 
> other words, it fails when there is no backend K8s cluster.
> {code}
> $ build/sbt -Pkubernetes -Pvolcano "kubernetes/testOnly *Volcano* -- -z 
> SPARK-36061"
> ...
> [info] VolcanoFeatureStepSuite:
> [info] - SPARK-36061: Driver Pod with Volcano PodGroup *** FAILED *** (646 
> milliseconds)
> [info]   org.snakeyaml.engine.v2.exceptions.ScannerException: mapping values 
> are not allowed here
> [info]  in reader, line 1, column 94:
> [info]  ... well-known/openid-configuration": dial tcp: lookup iam.corp. 
> ...
> [info]  ^
> {code}






[jira] [Updated] (SPARK-44999) Refactor ExternalSorter to reduce checks on shouldPartition when calling getPartition

2023-08-28 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-44999:
-
Summary: Refactor ExternalSorter to reduce checks on shouldPartition when 
calling getPartition  (was: Refactor ExternalSorter#getPartition to reduce 
checks on shouldPartition)

> Refactor ExternalSorter to reduce checks on shouldPartition when calling 
> getPartition
> -
>
> Key: SPARK-44999
> URL: https://issues.apache.org/jira/browse/SPARK-44999
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Minor
>
> {code:java}
>   private def getPartition(key: K): Int = {
>     if (shouldPartition) partitioner.get.getPartition(key) else 0
>   } {code}
>  
> The {{getPartition}} method checks {{shouldPartition}} every time it is 
> called. However, {{shouldPartition}} should not be able to change after the 
> {{ExternalSorter}} is instantiated. Therefore, it can be refactored to reduce 
> the checks on {{shouldPartition}}.
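
A sketch of what such a refactor can look like, following the reasoning above: since `shouldPartition` is fixed at construction time, the branch can be resolved once rather than on every call. The types are simplified here; this is not the real `ExternalSorter`:

{code:java}
class SorterSketch[K](partitioner: Option[K => Int]) {
  private val shouldPartition = partitioner.isDefined

  // Before: the branch is re-evaluated on every call.
  def getPartitionBefore(key: K): Int =
    if (shouldPartition) partitioner.get(key) else 0

  // After: the branch is taken once, at construction; callers invoke the
  // already-chosen function directly.
  val getPartitionAfter: K => Int = partitioner.getOrElse((_: K) => 0)
}
{code}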






[jira] [Commented] (SPARK-44900) Cached DataFrame keeps growing

2023-08-28 Thread Yauheni Audzeichyk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759821#comment-17759821
 ] 

Yauheni Audzeichyk commented on SPARK-44900:


[~yxzhang] Looks like it is just a disk usage tracking issue, as the disk 
space is not actually used as much.

However, it affects the effectiveness of the cached data: Spark spills it to 
disk because it believes it no longer fits in memory, so eventually it becomes 
100% stored on disk.

> Cached DataFrame keeps growing
> --
>
> Key: SPARK-44900
> URL: https://issues.apache.org/jira/browse/SPARK-44900
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Varun Nalla
>Priority: Blocker
>
> Scenario:
> We have a Kafka streaming application in which data lookups happen by joining 
> against another DataFrame that is cached, with the MEMORY_AND_DISK storage 
> level.
> However, the size of the cached DataFrame keeps growing with every 
> micro-batch the streaming application processes, and that is visible under 
> the Storage tab.
> A similar Stack Overflow thread was already raised.
> https://stackoverflow.com/questions/55601779/spark-dataframe-cache-keeps-growing
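
For illustration, a spark-shell style sketch of the reported setup, assuming a lookup table with a `key` column; the paths, topic, and broker address are placeholders:

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("cache-growth-example").getOrCreate()

// Lookup side: cached once with MEMORY_AND_DISK, as described above.
val lookup = spark.read.parquet("/path/to/lookup")
  .persist(StorageLevel.MEMORY_AND_DISK)

// Streaming side: every micro-batch joins against the cached DataFrame; the
// size shown on the UI's Storage tab is what the reporter sees growing.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(key AS STRING) AS key")

val enriched = events.join(lookup, "key")
{code}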






[jira] [Updated] (SPARK-44999) Refactor ExternalSorter#getPartition to reduce checks on shouldPartition

2023-08-28 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-44999:
-
Summary: Refactor ExternalSorter#getPartition to reduce checks on 
shouldPartition  (was: Refactor `ExternalSorter#getPartition` to reduce the 
number of `if else` judgments)

> Refactor ExternalSorter#getPartition to reduce checks on shouldPartition
> 
>
> Key: SPARK-44999
> URL: https://issues.apache.org/jira/browse/SPARK-44999
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Minor
>
> {code:java}
>   private def getPartition(key: K): Int = {
>     if (shouldPartition) partitioner.get.getPartition(key) else 0
>   } {code}
>  
> The {{getPartition}} method checks {{shouldPartition}} every time it is 
> called. However, {{shouldPartition}} should not be able to change after the 
> {{ExternalSorter}} is instantiated. Therefore, it can be refactored to reduce 
> the checks on {{shouldPartition}}.






[jira] [Commented] (SPARK-44900) Cached DataFrame keeps growing

2023-08-28 Thread Yuexin Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759818#comment-17759818
 ] 

Yuexin Zhang commented on SPARK-44900:
--

Hi [~varun2807] [~yaud], did you check the actual cached file size on disk, on 
the YARN NodeManager local filesystem? Is it really ever-growing?

> Cached DataFrame keeps growing
> --
>
> Key: SPARK-44900
> URL: https://issues.apache.org/jira/browse/SPARK-44900
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Varun Nalla
>Priority: Blocker
>
> Scenario:
> We have a Kafka streaming application in which data lookups happen by joining 
> against another DataFrame that is cached, with the MEMORY_AND_DISK storage 
> level.
> However, the size of the cached DataFrame keeps growing with every 
> micro-batch the streaming application processes, and that is visible under 
> the Storage tab.
> A similar Stack Overflow thread was already raised.
> https://stackoverflow.com/questions/55601779/spark-dataframe-cache-keeps-growing






[jira] [Created] (SPARK-45001) Implement DataFrame.foreachPartition

2023-08-28 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-45001:


 Summary: Implement DataFrame.foreachPartition
 Key: SPARK-45001
 URL: https://issues.apache.org/jira/browse/SPARK-45001
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon









[jira] [Commented] (SPARK-41279) Feature parity: DataFrame API in Spark Connect

2023-08-28 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759803#comment-17759803
 ] 

Hyukjin Kwon commented on SPARK-41279:
--

See also https://github.com/apache/spark/pull/42714

> Feature parity: DataFrame API in Spark Connect
> --
>
> Key: SPARK-41279
> URL: https://issues.apache.org/jira/browse/SPARK-41279
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Ruifeng Zheng
>Priority: Critical
>
> Implement DataFrame API in Spark Connect.






[jira] [Commented] (SPARK-45000) Implement DataFrame.foreach

2023-08-28 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759802#comment-17759802
 ] 

Snoot.io commented on SPARK-45000:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/42714

> Implement DataFrame.foreach
> ---
>
> Key: SPARK-45000
> URL: https://issues.apache.org/jira/browse/SPARK-45000
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>







[jira] [Created] (SPARK-45000) Implement DataFrame.foreach

2023-08-28 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-45000:


 Summary: Implement DataFrame.foreach
 Key: SPARK-45000
 URL: https://issues.apache.org/jira/browse/SPARK-45000
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon









[jira] [Updated] (SPARK-44999) Refactor `ExternalSorter#getPartition` to reduce the number of `if else` judgments

2023-08-28 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-44999:
-
Description: 
{code:java}
  private def getPartition(key: K): Int = {
    if (shouldPartition) partitioner.get.getPartition(key) else 0
  } {code}
 

The {{getPartition}} method checks {{shouldPartition}} every time it is called. 
However, {{shouldPartition}} should not be able to change after the 
{{ExternalSorter}} is instantiated. Therefore, it can be refactored to reduce 
the checks on {{shouldPartition}}.

  was:
{code:java}
  private def getPartition(key: K): Int = {
    if (shouldPartition) partitioner.get.getPartition(key) else 0
  } {code}
 


> Refactor `ExternalSorter#getPartition` to reduce the number of `if else` 
> judgments
> --
>
> Key: SPARK-44999
> URL: https://issues.apache.org/jira/browse/SPARK-44999
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Minor
>
> {code:java}
>   private def getPartition(key: K): Int = {
>     if (shouldPartition) partitioner.get.getPartition(key) else 0
>   } {code}
>  
> The {{getPartition}} method checks {{shouldPartition}} every time it is 
> called. However, {{shouldPartition}} should not be able to change after the 
> {{ExternalSorter}} is instantiated. Therefore, it can be refactored to reduce 
> the checks on {{shouldPartition}}.






[jira] [Commented] (SPARK-44999) Refactor `ExternalSorter#getPartition` to reduce the number of `if else` judgments

2023-08-28 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759799#comment-17759799
 ] 

Snoot.io commented on SPARK-44999:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/42713

> Refactor `ExternalSorter#getPartition` to reduce the number of `if else` 
> judgments
> --
>
> Key: SPARK-44999
> URL: https://issues.apache.org/jira/browse/SPARK-44999
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Minor
>
> {code:java}
>   private def getPartition(key: K): Int = {
>     if (shouldPartition) partitioner.get.getPartition(key) else 0
>   } {code}
>  






[jira] [Commented] (SPARK-44999) Refactor `ExternalSorter#getPartition` to reduce the number of `if else` judgments

2023-08-28 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759798#comment-17759798
 ] 

Snoot.io commented on SPARK-44999:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/42713

> Refactor `ExternalSorter#getPartition` to reduce the number of `if else` 
> judgments
> --
>
> Key: SPARK-44999
> URL: https://issues.apache.org/jira/browse/SPARK-44999
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Minor
>
> {code:java}
>   private def getPartition(key: K): Int = {
>     if (shouldPartition) partitioner.get.getPartition(key) else 0
>   } {code}
>  






[jira] [Updated] (SPARK-44999) Refactor `ExternalSorter#getPartition` to reduce the number of `if else` judgments

2023-08-28 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-44999:
-
Description: 
{code:java}
  private def getPartition(key: K): Int = {
    if (shouldPartition) partitioner.get.getPartition(key) else 0
  } {code}
 

> Refactor `ExternalSorter#getPartition` to reduce the number of `if else` 
> judgments
> --
>
> Key: SPARK-44999
> URL: https://issues.apache.org/jira/browse/SPARK-44999
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Minor
>
> {code:java}
>   private def getPartition(key: K): Int = {
>     if (shouldPartition) partitioner.get.getPartition(key) else 0
>   } {code}
>  






[jira] [Created] (SPARK-44999) Refactor `ExternalSorter#getPartition` to reduce the number of `if else` judgments

2023-08-28 Thread Yang Jie (Jira)
Yang Jie created SPARK-44999:


 Summary: Refactor `ExternalSorter#getPartition` to reduce the 
number of `if else` judgments
 Key: SPARK-44999
 URL: https://issues.apache.org/jira/browse/SPARK-44999
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Yang Jie









[jira] [Commented] (SPARK-44997) Align example order (Python -> Scala/Java -> R) in all Spark Doc Content

2023-08-28 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759794#comment-17759794
 ] 

Snoot.io commented on SPARK-44997:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/42712

> Align example order (Python -> Scala/Java -> R) in all Spark Doc Content
> 
>
> Key: SPARK-44997
> URL: https://issues.apache.org/jira/browse/SPARK-44997
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>







[jira] [Commented] (SPARK-41279) Feature parity: DataFrame API in Spark Connect

2023-08-28 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759789#comment-17759789
 ] 

Hyukjin Kwon commented on SPARK-41279:
--

You can run

{code}
def wrapped(itr):
    for pandas_df in itr:
        yield pandas_df.applymap(your_func)

df.mapInPandas(wrapped, schema=...)
{code}

> Feature parity: DataFrame API in Spark Connect
> --
>
> Key: SPARK-41279
> URL: https://issues.apache.org/jira/browse/SPARK-41279
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Ruifeng Zheng
>Priority: Critical
>
> Implement DataFrame API in Spark Connect.






[jira] [Assigned] (SPARK-43646) Make `connect` module daily test pass

2023-08-28 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-43646:


Assignee: Yang Jie

> Make `connect` module daily test pass
> -
>
> Key: SPARK-43646
> URL: https://issues.apache.org/jira/browse/SPARK-43646
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>
> run
> {code:java}
> build/mvn clean install -DskipTests
> build/mvn test -pl connector/connect/server {code}
> {code:java}
> - from_protobuf_messageClassName *** FAILED ***
>   org.apache.spark.sql.AnalysisException: [CANNOT_LOAD_PROTOBUF_CLASS] Could 
> not load Protobuf class with name 
> org.apache.spark.connect.proto.StorageLevel. 
> org.apache.spark.connect.proto.StorageLevel does not extend shaded Protobuf 
> Message class org.sparkproject.spark_protobuf.protobuf.Message. The jar with 
> Protobuf classes needs to be shaded (com.google.protobuf.* --> 
> org.sparkproject.spark_protobuf.protobuf.*).
>   at 
> org.apache.spark.sql.errors.QueryCompilationErrors$.protobufClassLoadError(QueryCompilationErrors.scala:3417)
>   at 
> org.apache.spark.sql.protobuf.utils.ProtobufUtils$.buildDescriptorFromJavaClass(ProtobufUtils.scala:193)
>   at 
> org.apache.spark.sql.protobuf.utils.ProtobufUtils$.buildDescriptor(ProtobufUtils.scala:151)
>   at 
> org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.messageDescriptor$lzycompute(ProtobufDataToCatalyst.scala:58)
>   at 
> org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.messageDescriptor(ProtobufDataToCatalyst.scala:57)
>   at 
> org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.dataType$lzycompute(ProtobufDataToCatalyst.scala:43)
>   at 
> org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.dataType(ProtobufDataToCatalyst.scala:42)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:194)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Project.$anonfun$output$1(basicLogicalOperators.scala:72)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
> - from_protobuf_messageClassName_options *** FAILED ***
>   org.apache.spark.sql.AnalysisException: [CANNOT_LOAD_PROTOBUF_CLASS] Could 
> not load Protobuf class with name 
> org.apache.spark.connect.proto.StorageLevel. 
> org.apache.spark.connect.proto.StorageLevel does not extend shaded Protobuf 
> Message class org.sparkproject.spark_protobuf.protobuf.Message. The jar with 
> Protobuf classes needs to be shaded (com.google.protobuf.* --> 
> org.sparkproject.spark_protobuf.protobuf.*).
>   at 
> org.apache.spark.sql.errors.QueryCompilationErrors$.protobufClassLoadError(QueryCompilationErrors.scala:3417)
>   at 
> org.apache.spark.sql.protobuf.utils.ProtobufUtils$.buildDescriptorFromJavaClass(ProtobufUtils.scala:193)
>   at 
> org.apache.spark.sql.protobuf.utils.ProtobufUtils$.buildDescriptor(ProtobufUtils.scala:151)
>   at 
> org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.messageDescriptor$lzycompute(ProtobufDataToCatalyst.scala:58)
>   at 
> org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.messageDescriptor(ProtobufDataToCatalyst.scala:57)
>   at 
> org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.dataType$lzycompute(ProtobufDataToCatalyst.scala:43)
>   at 
> org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.dataType(ProtobufDataToCatalyst.scala:42)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:194)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Project.$anonfun$output$1(basicLogicalOperators.scala:72)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) 
> {code}






[jira] [Resolved] (SPARK-43646) Make `connect` module daily test pass

2023-08-28 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-43646.
--
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42236
[https://github.com/apache/spark/pull/42236]

> Make `connect` module daily test pass
> -
>
> Key: SPARK-43646
> URL: https://issues.apache.org/jira/browse/SPARK-43646
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> run
> {code:java}
> build/mvn clean install -DskipTests
> build/mvn test -pl connector/connect/server {code}
> {code:java}
> - from_protobuf_messageClassName *** FAILED ***
>   org.apache.spark.sql.AnalysisException: [CANNOT_LOAD_PROTOBUF_CLASS] Could 
> not load Protobuf class with name 
> org.apache.spark.connect.proto.StorageLevel. 
> org.apache.spark.connect.proto.StorageLevel does not extend shaded Protobuf 
> Message class org.sparkproject.spark_protobuf.protobuf.Message. The jar with 
> Protobuf classes needs to be shaded (com.google.protobuf.* --> 
> org.sparkproject.spark_protobuf.protobuf.*).
>   at 
> org.apache.spark.sql.errors.QueryCompilationErrors$.protobufClassLoadError(QueryCompilationErrors.scala:3417)
>   at 
> org.apache.spark.sql.protobuf.utils.ProtobufUtils$.buildDescriptorFromJavaClass(ProtobufUtils.scala:193)
>   at 
> org.apache.spark.sql.protobuf.utils.ProtobufUtils$.buildDescriptor(ProtobufUtils.scala:151)
>   at 
> org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.messageDescriptor$lzycompute(ProtobufDataToCatalyst.scala:58)
>   at 
> org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.messageDescriptor(ProtobufDataToCatalyst.scala:57)
>   at 
> org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.dataType$lzycompute(ProtobufDataToCatalyst.scala:43)
>   at 
> org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.dataType(ProtobufDataToCatalyst.scala:42)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:194)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Project.$anonfun$output$1(basicLogicalOperators.scala:72)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
> - from_protobuf_messageClassName_options *** FAILED ***
>   org.apache.spark.sql.AnalysisException: [CANNOT_LOAD_PROTOBUF_CLASS] Could 
> not load Protobuf class with name 
> org.apache.spark.connect.proto.StorageLevel. 
> org.apache.spark.connect.proto.StorageLevel does not extend shaded Protobuf 
> Message class org.sparkproject.spark_protobuf.protobuf.Message. The jar with 
> Protobuf classes needs to be shaded (com.google.protobuf.* --> 
> org.sparkproject.spark_protobuf.protobuf.*).
>   at 
> org.apache.spark.sql.errors.QueryCompilationErrors$.protobufClassLoadError(QueryCompilationErrors.scala:3417)
>   at 
> org.apache.spark.sql.protobuf.utils.ProtobufUtils$.buildDescriptorFromJavaClass(ProtobufUtils.scala:193)
>   at 
> org.apache.spark.sql.protobuf.utils.ProtobufUtils$.buildDescriptor(ProtobufUtils.scala:151)
>   at 
> org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.messageDescriptor$lzycompute(ProtobufDataToCatalyst.scala:58)
>   at 
> org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.messageDescriptor(ProtobufDataToCatalyst.scala:57)
>   at 
> org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.dataType$lzycompute(ProtobufDataToCatalyst.scala:43)
>   at 
> org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.dataType(ProtobufDataToCatalyst.scala:42)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:194)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Project.$anonfun$output$1(basicLogicalOperators.scala:72)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) 
> {code}






[jira] [Updated] (SPARK-44998) No need to retry parsing event log path again when FileNotFoundException occurs

2023-08-28 Thread Zhen Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhen Wang updated SPARK-44998:
--
Description: 
I found a lot of records in the history server log about retrying to parse 
in-progress event logs. The application is already finished by the time it is 
parsed, so we don't need to retry parsing it when a FileNotFoundException 
occurs.

 

!image-2023-08-29-10-47-08-027.png!

!image-2023-08-29-10-47-43-567.png!

  was:
I found a lot of records in the history server log about retrying to parse 
in-progress event logs. The application is already finished by the time it is 
parsed, so we don't need to retry parsing it when a FileNotFoundException 
occurs.

 

!image-2023-08-29-10-43-21-991.png!

!image-2023-08-29-10-44-34-375.png!


> No need to retry parsing event log path again when FileNotFoundException 
> occurs
> ---
>
> Key: SPARK-44998
> URL: https://issues.apache.org/jira/browse/SPARK-44998
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: Zhen Wang
>Priority: Minor
> Attachments: image-2023-08-29-10-47-08-027.png, 
> image-2023-08-29-10-47-43-567.png
>
>
> I found a lot of records in the history server log about retrying to parse 
> in-progress event logs. The application is already finished by the time it is 
> parsed, so we don't need to retry parsing it when a FileNotFoundException 
> occurs.
>  
> !image-2023-08-29-10-47-08-027.png!
> !image-2023-08-29-10-47-43-567.png!






[jira] [Updated] (SPARK-44998) No need to retry parsing event log path again when FileNotFoundException occurs

2023-08-28 Thread Zhen Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhen Wang updated SPARK-44998:
--
Attachment: image-2023-08-29-10-47-08-027.png

> No need to retry parsing event log path again when FileNotFoundException 
> occurs
> ---
>
> Key: SPARK-44998
> URL: https://issues.apache.org/jira/browse/SPARK-44998
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: Zhen Wang
>Priority: Minor
> Attachments: image-2023-08-29-10-47-08-027.png, 
> image-2023-08-29-10-47-43-567.png
>
>
> I found a lot of records in the history server log about retrying to parse 
> in-progress event logs. The application is already finished by the time it is 
> parsed, so we don't need to retry parsing it when a FileNotFoundException 
> occurs.
>  
> !image-2023-08-29-10-43-21-991.png!
> !image-2023-08-29-10-44-34-375.png!






[jira] [Updated] (SPARK-44998) No need to retry parsing event log path again when FileNotFoundException occurs

2023-08-28 Thread Zhen Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhen Wang updated SPARK-44998:
--
Attachment: image-2023-08-29-10-47-43-567.png

> No need to retry parsing event log path again when FileNotFoundException 
> occurs
> ---
>
> Key: SPARK-44998
> URL: https://issues.apache.org/jira/browse/SPARK-44998
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: Zhen Wang
>Priority: Minor
> Attachments: image-2023-08-29-10-47-08-027.png, 
> image-2023-08-29-10-47-43-567.png
>
>
> I found a lot of records in the history server log about retrying to parse 
> in-progress event logs. The application is already finished by the time it is 
> parsed, so we don't need to retry parsing it when a FileNotFoundException 
> occurs.
>  
> !image-2023-08-29-10-43-21-991.png!
> !image-2023-08-29-10-44-34-375.png!






[jira] [Created] (SPARK-44998) No need to retry parsing event log path again when FileNotFoundException occurs

2023-08-28 Thread Zhen Wang (Jira)
Zhen Wang created SPARK-44998:
-

 Summary: No need to retry parsing event log path again when 
FileNotFoundException occurs
 Key: SPARK-44998
 URL: https://issues.apache.org/jira/browse/SPARK-44998
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.1
Reporter: Zhen Wang


I found a lot of records in the history server log about retrying to parse 
in-progress event logs. The application is already finished by the time it is 
parsed, so we don't need to retry parsing it when a FileNotFoundException 
occurs.

 

!image-2023-08-29-10-43-21-991.png!

!image-2023-08-29-10-44-34-375.png!
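
An illustrative sketch of that behavior, with assumed helper names rather than the actual history server code: a missing event log file is terminal, so no further attempt is scheduled.

{code:java}
import java.io.FileNotFoundException
import scala.util.control.NonFatal

object EventLogParseSketch {
  def parseWithRetry(path: String, attemptsLeft: Int)(parse: String => Unit): Unit =
    try parse(path)
    catch {
      case _: FileNotFoundException =>
        // The event log is gone (the finished application's log was cleaned
        // up), so another attempt can never succeed: give up immediately.
        ()
      case NonFatal(_) if attemptsLeft > 1 =>
        parseWithRetry(path, attemptsLeft - 1)(parse)
    }
}
{code}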






[jira] [Resolved] (SPARK-44860) Implement SESSION_USER function

2023-08-28 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-44860.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42549
[https://github.com/apache/spark/pull/42549]

> Implement SESSION_USER function
> ---
>
> Key: SPARK-44860
> URL: https://issues.apache.org/jira/browse/SPARK-44860
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Vitalii Li
>Assignee: Vitalii Li
>Priority: Major
> Fix For: 4.0.0
>
>
> According to the SQL standard, SESSION_USER and CURRENT_USER behave 
> differently for routines:
> - CURRENT_USER inside a routine should return the security definer of the 
> routine, i.e. the owner identity
> - SESSION_USER inside a routine should return the connected user.
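
As a rough spark-shell style illustration of the distinction, assuming a Spark version where both functions exist (`session_user` is what this ticket adds):

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// At the top level both return the connected user. The standard's distinction
// only appears inside a routine, where current_user() should switch to the
// routine's definer while session_user() keeps returning the connected user.
spark.sql("SELECT current_user() AS current_u, session_user() AS session_u").show()
{code}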






[jira] [Assigned] (SPARK-44965) Hide internal functions/variables from `pyspark.sql.functions`

2023-08-28 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-44965:
-

Assignee: Ruifeng Zheng

> Hide internal functions/variables from `pyspark.sql.functions`
> --
>
> Key: SPARK-44965
> URL: https://issues.apache.org/jira/browse/SPARK-44965
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>







[jira] [Resolved] (SPARK-44965) Hide internal functions/variables from `pyspark.sql.functions`

2023-08-28 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-44965.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42680
[https://github.com/apache/spark/pull/42680]

> Hide internal functions/variables from `pyspark.sql.functions`
> --
>
> Key: SPARK-44965
> URL: https://issues.apache.org/jira/browse/SPARK-44965
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 4.0.0
>
>







[jira] [Created] (SPARK-44997) Align example order (Python -> Scala/Java -> R) in all Spark Doc Content

2023-08-28 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-44997:
---

 Summary: Align example order (Python -> Scala/Java -> R) in all 
Spark Doc Content
 Key: SPARK-44997
 URL: https://issues.apache.org/jira/browse/SPARK-44997
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Affects Versions: 4.0.0
Reporter: BingKun Pan









[jira] [Assigned] (SPARK-44995) Promote SparkKubernetesClientFactory to DeveloperApi

2023-08-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-44995:
-

Assignee: Dongjoon Hyun

> Promote SparkKubernetesClientFactory to DeveloperApi
> 
>
> Key: SPARK-44995
> URL: https://issues.apache.org/jira/browse/SPARK-44995
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>







[jira] [Resolved] (SPARK-44995) Promote SparkKubernetesClientFactory to DeveloperApi

2023-08-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44995.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42709
[https://github.com/apache/spark/pull/42709]

> Promote SparkKubernetesClientFactory to DeveloperApi
> 
>
> Key: SPARK-44995
> URL: https://issues.apache.org/jira/browse/SPARK-44995
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-44993) Add ShuffleChecksumUtils.compareChecksums by reusing ShuffleChecksumTestHelp.compareChecksums

2023-08-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-44993:
-

Assignee: Dongjoon Hyun

> Add ShuffleChecksumUtils.compareChecksums by reusing 
> ShuffleChecksumTestHelp.compareChecksums
> -
>
> Key: SPARK-44993
> URL: https://issues.apache.org/jira/browse/SPARK-44993
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>







[jira] [Resolved] (SPARK-44993) Add ShuffleChecksumUtils.compareChecksums by reusing ShuffleChecksumTestHelp.compareChecksums

2023-08-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44993.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42707
[https://github.com/apache/spark/pull/42707]

> Add ShuffleChecksumUtils.compareChecksums by reusing 
> ShuffleChecksumTestHelp.compareChecksums
> -
>
> Key: SPARK-44993
> URL: https://issues.apache.org/jira/browse/SPARK-44993
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 4.0.0
>
>







[jira] [Updated] (SPARK-44996) VolcanoFeatureStep should not create `DefaultVolcanoClient` if not needed

2023-08-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-44996:
--
Description: 
Since `VolcanoFeatureStep` always creates `DefaultVolcanoClient`, the unit test 
suite `VolcanoFeatureStepSuite` behaves like an integration test. In other 
words, it fails when there is no backend K8s cluster.

{code}
$ build/sbt -Pkubernetes -Pvolcano "kubernetes/testOnly *Volcano* -- -z 
SPARK-36061"
...
[info] VolcanoFeatureStepSuite:
[info] - SPARK-36061: Driver Pod with Volcano PodGroup *** FAILED *** (646 
milliseconds)
[info]   org.snakeyaml.engine.v2.exceptions.ScannerException: mapping values 
are not allowed here
[info]  in reader, line 1, column 94:
[info]  ... well-known/openid-configuration": dial tcp: lookup iam.corp. ...
[info]  ^
{code}

  was:
Since `VolcanoFeatureStep` always creates `DefaultVolcanoClient`, the unit test 
suite `VolcanoFeatureStepSuite` behaves like an integration test. In other 
words, it fails when there is no network connectivity.

{code}
$ build/sbt -Pkubernetes -Pvolcano "kubernetes/testOnly *Volcano* -- -z 
SPARK-36061"
...
[info] VolcanoFeatureStepSuite:
[info] - SPARK-36061: Driver Pod with Volcano PodGroup *** FAILED *** (646 
milliseconds)
[info]   org.snakeyaml.engine.v2.exceptions.ScannerException: mapping values 
are not allowed here
[info]  in reader, line 1, column 94:
[info]  ... well-known/openid-configuration": dial tcp: lookup iam.corp. ...
[info]  ^
{code}


> VolcanoFeatureStep should not create `DefaultVolcanoClient` if not needed
> -
>
> Key: SPARK-44996
> URL: https://issues.apache.org/jira/browse/SPARK-44996
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Since `VolcanoFeatureStep` always creates `DefaultVolcanoClient`, the unit 
> test suite `VolcanoFeatureStepSuite` behaves like an integration test. In 
> other words, it fails when there is no backend K8s cluster.
> {code}
> $ build/sbt -Pkubernetes -Pvolcano "kubernetes/testOnly *Volcano* -- -z 
> SPARK-36061"
> ...
> [info] VolcanoFeatureStepSuite:
> [info] - SPARK-36061: Driver Pod with Volcano PodGroup *** FAILED *** (646 
> milliseconds)
> [info]   org.snakeyaml.engine.v2.exceptions.ScannerException: mapping values 
> are not allowed here
> [info]  in reader, line 1, column 94:
> [info]  ... well-known/openid-configuration": dial tcp: lookup iam.corp. 
> ...
> [info]  ^
> {code}






[jira] [Updated] (SPARK-44996) VolcanoFeatureStep should not create `DefaultVolcanoClient` if not needed

2023-08-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-44996:
--
Description: 
Since `VolcanoFeatureStep` always creates `DefaultVolcanoClient`, the unit test 
suite `VolcanoFeatureStepSuite` behaves like an integration test. In other 
words, it fails when there is no network connectivity.

{code}
$ build/sbt -Pkubernetes -Pvolcano "kubernetes/testOnly *Volcano* -- -z 
SPARK-36061"
...
[info] VolcanoFeatureStepSuite:
[info] - SPARK-36061: Driver Pod with Volcano PodGroup *** FAILED *** (646 
milliseconds)
[info]   org.snakeyaml.engine.v2.exceptions.ScannerException: mapping values 
are not allowed here
[info]  in reader, line 1, column 94:
[info]  ... well-known/openid-configuration": dial tcp: lookup iam.corp. ...
[info]  ^
{code}

  was:
Since `VolcanoFeatureStep` always creates `DefaultVolcanoClient`, the unit test 
suite `VolcanoFeatureStepSuite` behaves like an integration test. In other 
words, it fails when there is no network connectivity.

{code}
[info] VolcanoFeatureStepSuite:
[info] - SPARK-36061: Driver Pod with Volcano PodGroup *** FAILED *** (646 
milliseconds)
[info]   org.snakeyaml.engine.v2.exceptions.ScannerException: mapping values 
are not allowed here
[info]  in reader, line 1, column 94:
[info]  ... well-known/openid-configuration": dial tcp: lookup iam.corp. ...
[info]  ^
{code}


> VolcanoFeatureStep should not create `DefaultVolcanoClient` if not needed
> -
>
> Key: SPARK-44996
> URL: https://issues.apache.org/jira/browse/SPARK-44996
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Since `VolcanoFeatureStep` always creates `DefaultVolcanoClient`, the unit 
> test suite `VolcanoFeatureStepSuite` behaves like an integration test. In 
> other words, it fails when there is no network connectivity.
> {code}
> $ build/sbt -Pkubernetes -Pvolcano "kubernetes/testOnly *Volcano* -- -z 
> SPARK-36061"
> ...
> [info] VolcanoFeatureStepSuite:
> [info] - SPARK-36061: Driver Pod with Volcano PodGroup *** FAILED *** (646 
> milliseconds)
> [info]   org.snakeyaml.engine.v2.exceptions.ScannerException: mapping values 
> are not allowed here
> [info]  in reader, line 1, column 94:
> [info]  ... well-known/openid-configuration": dial tcp: lookup iam.corp. 
> ...
> [info]  ^
> {code}






[jira] [Created] (SPARK-44996) VolcanoFeatureStep should not create `DefaultVolcanoClient` if not needed

2023-08-28 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-44996:
-

 Summary: VolcanoFeatureStep should not create 
`DefaultVolcanoClient` if not needed
 Key: SPARK-44996
 URL: https://issues.apache.org/jira/browse/SPARK-44996
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun


Since `VolcanoFeatureStep` always creates `DefaultVolcanoClient`, the unit test 
suite `VolcanoFeatureStepSuite` behaves like an integration test. In other 
words, it fails when there is no network connectivity.

{code}
[info] VolcanoFeatureStepSuite:
[info] - SPARK-36061: Driver Pod with Volcano PodGroup *** FAILED *** (646 
milliseconds)
[info]   org.snakeyaml.engine.v2.exceptions.ScannerException: mapping values 
are not allowed here
[info]  in reader, line 1, column 94:
[info]  ... well-known/openid-configuration": dial tcp: lookup iam.corp. ...
[info]  ^
{code}






[jira] [Created] (SPARK-44995) Promote SparkKubernetesClientFactory to DeveloperApi

2023-08-28 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-44995:
-

 Summary: Promote SparkKubernetesClientFactory to DeveloperApi
 Key: SPARK-44995
 URL: https://issues.apache.org/jira/browse/SPARK-44995
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun









[jira] [Updated] (SPARK-44994) Refine docstring of `DataFrame.filter`

2023-08-28 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-44994:
-
Summary: Refine docstring of `DataFrame.filter`  (was: Refine docstring for 
`DataFrame.filter`)

> Refine docstring of `DataFrame.filter`
> --
>
> Key: SPARK-44994
> URL: https://issues.apache.org/jira/browse/SPARK-44994
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Refine the docstring and add more examples for DataFrame.filter






[jira] [Updated] (SPARK-44993) Add ShuffleChecksumUtils.compareChecksums by reusing ShuffleChecksumTestHelp.compareChecksums

2023-08-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-44993:
--
Summary: Add ShuffleChecksumUtils.compareChecksums by reusing 
ShuffleChecksumTestHelp.compareChecksums  (was: Move compareChecksums from 
ShuffleChecksumTestHelpe to ShuffleChecksumUtils)

> Add ShuffleChecksumUtils.compareChecksums by reusing 
> ShuffleChecksumTestHelp.compareChecksums
> -
>
> Key: SPARK-44993
> URL: https://issues.apache.org/jira/browse/SPARK-44993
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Created] (SPARK-44994) Refine the docstring of `DataFrame.filter`

2023-08-28 Thread Allison Wang (Jira)
Allison Wang created SPARK-44994:


 Summary: Refine the docstring of `DataFrame.filter`
 Key: SPARK-44994
 URL: https://issues.apache.org/jira/browse/SPARK-44994
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Refine the docstring and add more examples for DataFrame.filter






[jira] [Updated] (SPARK-44994) Refine docstring for `DataFrame.filter`

2023-08-28 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-44994:
-
Summary: Refine docstring for `DataFrame.filter`  (was: Refine the 
docstring of `DataFrame.filter`)

> Refine docstring for `DataFrame.filter`
> ---
>
> Key: SPARK-44994
> URL: https://issues.apache.org/jira/browse/SPARK-44994
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Refine the docstring and add more examples for DataFrame.filter






[jira] [Updated] (SPARK-44993) Move compareChecksums from ShuffleChecksumTestHelpe to ShuffleChecksumUtils

2023-08-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-44993:
--
Summary: Move compareChecksums from ShuffleChecksumTestHelpe to 
ShuffleChecksumUtils  (was: Move compareChecksums from ShuffleChecksumTestHelpe 
to ShuffleChecksumUtils and move compareChecksums)

> Move compareChecksums from ShuffleChecksumTestHelpe to ShuffleChecksumUtils
> ---
>
> Key: SPARK-44993
> URL: https://issues.apache.org/jira/browse/SPARK-44993
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Created] (SPARK-44993) Move compareChecksums from ShuffleChecksumTestHelpe to ShuffleChecksumUtils and move compareChecksums

2023-08-28 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-44993:
-

 Summary: Move compareChecksums from ShuffleChecksumTestHelpe to 
ShuffleChecksumUtils and move compareChecksums
 Key: SPARK-44993
 URL: https://issues.apache.org/jira/browse/SPARK-44993
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun









[jira] [Commented] (SPARK-41279) Feature parity: DataFrame API in Spark Connect

2023-08-28 Thread Johannes Alberti (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759732#comment-17759732
 ] 

Johannes Alberti commented on SPARK-41279:
--

[~gurwls223] thank you for your response. When using `mapInPandas`, is 
`spark.sql.execution.arrow.maxRecordsPerBatch` always the rowset boundary? Just 
for my understanding: in `foreach` we have an invocation of `func` per row, and 
in `foreachBatch` we have an invocation of `func` per partition. When moving to 
`mapInPandas` (not that this would make much sense, but just for illustration 
purposes), I would need to run with `maxRecordsPerBatch=1` to get the same 
behavior as `foreach`. Otherwise, when running with the default of 
`maxRecordsPerBatch=10_000`, likely with better performance than in the past 
(execution is still distributed in the cluster), I will have more invocations 
of `func` than the partition count if my partitions are larger than `10_000` 
rows each. Is that understanding correct? Thanks again!
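
For reference, the batch size discussed above is a session configuration whose default is 10000 rows; a hedged spark-shell style sketch of pushing it to per-row granularity:

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Setting the Arrow batch cap to 1 approximates foreach's per-row invocation,
// at a steep serialization cost. A batch never spans partitions, so partitions
// larger than the cap produce multiple invocations of the mapped function.
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "1")
{code}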


> Feature parity: DataFrame API in Spark Connect
> --
>
> Key: SPARK-41279
> URL: https://issues.apache.org/jira/browse/SPARK-41279
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Ruifeng Zheng
>Priority: Critical
>
> Implement DataFrame API in Spark Connect.






[jira] [Created] (SPARK-44992) Add support for rack information from an environment variable

2023-08-28 Thread Holden Karau (Jira)
Holden Karau created SPARK-44992:


 Summary: Add support for rack information from an environment 
variable
 Key: SPARK-44992
 URL: https://issues.apache.org/jira/browse/SPARK-44992
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Holden Karau


This would allow us to use things like EC2_AVAILABILITY_ZONE for locality on 
Kubernetes (or other clusters) that span multiple AZs.
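
A minimal sketch of the idea only (the variable name and fallback are
assumptions; the actual Spark wiring is up to the patch):

{code:python}
import os

def rack_from_env(var="EC2_AVAILABILITY_ZONE", default="unknown-rack"):
    # Treat the availability zone as the rack identifier so the scheduler
    # can prefer executors in the same zone.
    return os.environ.get(var, default)

print(rack_from_env())
{code}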



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44900) Cached DataFrame keeps growing

2023-08-28 Thread Varun Nalla (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759703#comment-17759703
 ] 

Varun Nalla commented on SPARK-44900:
-

[~yao] hope you got a chance to look into what [~yaud] mentioned.

> Cached DataFrame keeps growing
> --
>
> Key: SPARK-44900
> URL: https://issues.apache.org/jira/browse/SPARK-44900
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Varun Nalla
>Priority: Blocker
>
> Scenario:
> We have a Kafka streaming application where data lookups happen by joining 
> another DataFrame that is cached, with the MEMORY_AND_DISK caching strategy.
> However, the size of the cached DataFrame keeps growing with every micro-batch 
> the streaming application processes, and that is visible under the Storage 
> tab.
> A similar Stack Overflow thread was already raised:
> https://stackoverflow.com/questions/55601779/spark-dataframe-cache-keeps-growing
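
For context, a sketch of the reported pattern together with a common
mitigation: periodically unpersisting and re-caching the lookup DataFrame
inside foreachBatch so stale cached copies cannot accumulate. All names,
paths, and the refresh cadence are illustrative assumptions, not the
reporter's code:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
lookup_df = spark.read.parquet("/tables/lookup").persist()  # hypothetical path

def process_batch(batch_df, batch_id):
    global lookup_df
    batch_df.join(lookup_df, "key").write.mode("append").parquet("/tables/out")
    if batch_id % 100 == 0:
        # Drop and rebuild the cached lookup so its storage footprint
        # cannot grow across micro-batches.
        lookup_df.unpersist()
        lookup_df = spark.read.parquet("/tables/lookup").persist()

stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "host:9092")
          .option("subscribe", "events")
          .load()
          .selectExpr("CAST(key AS STRING) AS key"))
query = (stream.writeStream
         .option("checkpointLocation", "/tmp/ckpt")  # hypothetical
         .foreachBatch(process_batch)
         .start())
{code}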



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44991) Spark json schema inference and fromJson api having inconsistent behavior

2023-08-28 Thread nirav patel (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nirav patel updated SPARK-44991:

Summary: Spark json schema inference and fromJson api having inconsistent 
behavior  (was: Spark json datasource reader and fromJson api having 
inconsistent behavior)

> Spark json schema inference and fromJson api having inconsistent behavior
> -
>
> Key: SPARK-44991
> URL: https://issues.apache.org/jira/browse/SPARK-44991
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.2
>Reporter: nirav patel
>Priority: Major
>
> The Spark JSON reader can infer the datatype of a field. I am ingesting 
> millions of datapoints and generating a `DataFrameA`. What I notice is that 
> schema inference marks the datatype of a field holding tons of integers and 
> empty strings as Long. That is okay behavior, since I don't set 
> `primitivesAsString` because I do want primitive type inference. I store 
> `DataFrameA` into `TableA`.
> Now, this inference behavior is not respected by `fromJson` of the `from_json` 
> API when I am trying to write new data to `TableA`. That is, if I read a chunk 
> of input data using 
> `spark.read.schema(fromJson(getStruct(TableA))).json('/path/to/more/data')`, 
> the reader complains that an empty string cannot be cast to Long. 
> `getStruct(TableA)` is a pseudo-method that somehow returns the `struct` of 
> TableA's schema, and `/path/to/more/data` has an empty string as the value for 
> this field.
> I think that if the reader doesn't complain about empty strings during schema 
> inference, it shouldn't complain on reading without inference either. Maybe 
> treat empty as null, just like during schema inference, or at least offer an 
> additional option - treatEmptyAsNull - so it's more explicit for application 
> users?
> ps - I marked this as a bug, but it may be better suited as an improvement.
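
A self-contained sketch of the asymmetry being described (paths and data are
made up, and the failure mode is the reporter's claim; PERMISSIVE mode would
yield nulls rather than an error, so FAILFAST is used here to surface it):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A field mixing longs and empty strings: empty strings infer as null,
# so the merged type reportedly lands on LongType.
spark.sparkContext.parallelize(['{"a": 1}', '{"a": ""}']).saveAsTextFile("/tmp/json1")
inferred = spark.read.json("/tmp/json1")
inferred.printSchema()  # a: long

# Pinning that inferred schema and re-reading the same data is where the
# "empty string cannot be cast to Long" complaint appears.
pinned = (spark.read.schema(inferred.schema)
          .option("mode", "FAILFAST")
          .json("/tmp/json1"))
pinned.show()
{code}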



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44991) Spark json datasource reader and fromJson api having inconsistent behavior

2023-08-28 Thread nirav patel (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nirav patel updated SPARK-44991:

Description: 
The Spark JSON reader can infer the datatype of a field. I am ingesting millions 
of datapoints and generating a `DataFrameA`. What I notice is that schema 
inference marks the datatype of a field holding tons of integers and empty 
strings as Long. That is okay behavior, since I don't set `primitivesAsString` 
because I do want primitive type inference. I store `DataFrameA` into `TableA`.

Now, this inference behavior is not respected by `fromJson` of the `from_json` 
API when I am trying to write new data to `TableA`. That is, if I read a chunk 
of input data using 
`spark.read.schema(fromJson(getStruct(TableA))).json('/path/to/more/data')`, the 
reader complains that an empty string cannot be cast to Long. `getStruct(TableA)` 
is a pseudo-method that somehow returns the `struct` of TableA's schema, and 
`/path/to/more/data` has an empty string as the value for this field.

I think that if the reader doesn't complain about empty strings during schema 
inference, it shouldn't complain on reading without inference either. Maybe 
treat empty as null, just like during schema inference, or at least offer an 
additional option - treatEmptyAsNull - so it's more explicit for application 
users?

ps - I marked this as a bug, but it may be better suited as an improvement.

  was:
The Spark JSON reader can infer the datatype of a field. I am ingesting millions 
of datapoints and generating a `DataFrameA`. What I notice is that schema 
inference marks the datatype of a field holding tons of integers and empty 
strings as Long. That is okay behavior, since I don't set `primitivesAsString` 
because I do want proper primitive type inference. I store `DataFrameA` into 
`TableA`.

Now, this inference behavior is not respected by the `fromJson` API when I am 
trying to write new data to `TableA` generated using my schema inference 
approach. That is, if I read a chunk of input data using 
`spark.read.schema(fromJson(getStruct(TableA))).json('/path/to/more/data')`, the 
reader complains that an empty string cannot be cast to Long. `getStruct(TableA)` 
is a pseudo-method that somehow returns the `struct` of TableA's schema.


> Spark json datasource reader and fromJson api having inconsistent behavior
> --
>
> Key: SPARK-44991
> URL: https://issues.apache.org/jira/browse/SPARK-44991
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.2
>Reporter: nirav patel
>Priority: Major
>
> The Spark JSON reader can infer the datatype of a field. I am ingesting 
> millions of datapoints and generating a `DataFrameA`. What I notice is that 
> schema inference marks the datatype of a field holding tons of integers and 
> empty strings as Long. That is okay behavior, since I don't set 
> `primitivesAsString` because I do want primitive type inference. I store 
> `DataFrameA` into `TableA`.
> Now, this inference behavior is not respected by `fromJson` of the `from_json` 
> API when I am trying to write new data to `TableA`. That is, if I read a chunk 
> of input data using 
> `spark.read.schema(fromJson(getStruct(TableA))).json('/path/to/more/data')`, 
> the reader complains that an empty string cannot be cast to Long. 
> `getStruct(TableA)` is a pseudo-method that somehow returns the `struct` of 
> TableA's schema, and `/path/to/more/data` has an empty string as the value for 
> this field.
> I think that if the reader doesn't complain about empty strings during schema 
> inference, it shouldn't complain on reading without inference either. Maybe 
> treat empty as null, just like during schema inference, or at least offer an 
> additional option - treatEmptyAsNull - so it's more explicit for application 
> users?
> ps - I marked this as a bug, but it may be better suited as an improvement.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44991) Spark json datasource reader and fromJson api having inconsistent behavior

2023-08-28 Thread nirav patel (Jira)
nirav patel created SPARK-44991:
---

 Summary: Spark json datasource reader and fromJson api having 
inconsistent behavior
 Key: SPARK-44991
 URL: https://issues.apache.org/jira/browse/SPARK-44991
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.3.2
Reporter: nirav patel


The Spark JSON reader can infer the datatype of a field. I am ingesting millions 
of datapoints and generating a `DataFrameA`. What I notice is that schema 
inference marks the datatype of a field holding tons of integers and empty 
strings as Long. That is okay behavior, since I don't set `primitivesAsString` 
because I do want proper primitive type inference. I store `DataFrameA` into 
`TableA`.

Now, this inference behavior is not respected by the `fromJson` API when I am 
trying to write new data to `TableA` generated using my schema inference 
approach. That is, if I read a chunk of input data using 
`spark.read.schema(fromJson(getStruct(TableA))).json('/path/to/more/data')`, the 
reader complains that an empty string cannot be cast to Long. `getStruct(TableA)` 
is a pseudo-method that somehow returns the `struct` of TableA's schema.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44989) Add a directional message to promote JIRA_ACCESS_TOKEN

2023-08-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-44989:
-

Assignee: Dongjoon Hyun

> Add a directional message to promote JIRA_ACCESS_TOKEN
> --
>
> Key: SPARK-44989
> URL: https://issues.apache.org/jira/browse/SPARK-44989
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44989) Add a directional message to promote JIRA_ACCESS_TOKEN

2023-08-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44989.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42704
[https://github.com/apache/spark/pull/42704]

> Add a directional message to promote JIRA_ACCESS_TOKEN
> --
>
> Key: SPARK-44989
> URL: https://issues.apache.org/jira/browse/SPARK-44989
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44832) Fix connect client transitive classpath

2023-08-28 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-44832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell resolved SPARK-44832.
---
Fix Version/s: 3.5.0
 Assignee: Herman van Hövell
   Resolution: Fixed

> Fix connect client transitive classpath 
> 
>
> Key: SPARK-44832
> URL: https://issues.apache.org/jira/browse/SPARK-44832
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44989) Add a directional message to promote JIRA_ACCESS_TOKEN

2023-08-28 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-44989:
-

 Summary: Add a directional message to promote JIRA_ACCESS_TOKEN
 Key: SPARK-44989
 URL: https://issues.apache.org/jira/browse/SPARK-44989
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44985) Use toString instead of stacktrace for task reaper threadDump

2023-08-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44985.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42699
[https://github.com/apache/spark/pull/42699]

> Use toString instead of stacktrace for task reaper threadDump
> -
>
> Key: SPARK-44985
> URL: https://issues.apache.org/jira/browse/SPARK-44985
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44985) Use toString instead of stacktrace for task reaper threadDump

2023-08-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-44985:
-

Assignee: Kent Yao

> Use toString instead of stacktrace for task reaper threadDump
> -
>
> Key: SPARK-44985
> URL: https://issues.apache.org/jira/browse/SPARK-44985
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44972) Eagerly check if the token is valid to align with the behavior of username/password auth

2023-08-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-44972:
-

Assignee: Kent Yao

> Eagerly check if the token is valid to align with the behavior of 
> username/password auth
> 
>
> Key: SPARK-44972
> URL: https://issues.apache.org/jira/browse/SPARK-44972
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Kent Yao
>Priority: Minor
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44972) Eagerly check if the token is valid to align with the behavior of username/password auth

2023-08-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44972.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42625
[https://github.com/apache/spark/pull/42625]

> Eagerly check if the token is valid to align with the behavior of 
> username/password auth
> 
>
> Key: SPARK-44972
> URL: https://issues.apache.org/jira/browse/SPARK-44972
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44988) Parquet INT64 (TIMESTAMP(NANOS,false)) throwing Illegal Parquet type

2023-08-28 Thread Flavio Odas (Jira)
Flavio Odas created SPARK-44988:
---

 Summary: Parquet INT64 (TIMESTAMP(NANOS,false)) throwing Illegal 
Parquet type
 Key: SPARK-44988
 URL: https://issues.apache.org/jira/browse/SPARK-44988
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.1, 3.4.0
Reporter: Flavio Odas


This bug seems similar to https://issues.apache.org/jira/browse/SPARK-40819, 
except that it's a problem with INT64 (TIMESTAMP(NANOS,false)), instead of 
INT64 (TIMESTAMP(NANOS,true)).

The error happens whenever I'm trying to read:
{code:java}
org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 
(TIMESTAMP(NANOS,false)).
at 
org.apache.spark.sql.errors.QueryCompilationErrors$.illegalParquetTypeError(QueryCompilationErrors.scala:1762)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.illegalType$1(ParquetSchemaConverter.scala:206)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convertPrimitiveField$2(ParquetSchemaConverter.scala:283)
at scala.Option.getOrElse(Option.scala:189)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertPrimitiveField(ParquetSchemaConverter.scala:224)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertField(ParquetSchemaConverter.scala:187)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convertInternal$3(ParquetSchemaConverter.scala:147)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convertInternal$3$adapted(ParquetSchemaConverter.scala:117)
at 
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
at scala.collection.immutable.Range.foreach(Range.scala:158)
at scala.collection.TraversableLike.map(TraversableLike.scala:286)
at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertInternal(ParquetSchemaConverter.scala:117)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:87)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readSchemaFromFooter$2(ParquetFileFormat.scala:493)
at scala.Option.getOrElse(Option.scala:189)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.readSchemaFromFooter(ParquetFileFormat.scala:493)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$2(ParquetFileFormat.scala:473)
at scala.collection.immutable.Stream.map(Stream.scala:418)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1(ParquetFileFormat.scala:473)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1$adapted(ParquetFileFormat.scala:464)
at 
org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$2(SchemaMergeUtils.scala:79)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:853)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:853)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
at 
org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
at org.apache.spark.scheduler.Task.run(Task.scala:139)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) {code}
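
If this really is the same family of problem as SPARK-40819, one possible
workaround sketch is the legacy flag introduced there, assuming (unverified)
that it also covers the NANOS,false variant:

{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # Flag added by SPARK-40819: surface INT64 (TIMESTAMP(NANOS,*))
         # columns as plain longs instead of rejecting the logical type.
         .config("spark.sql.legacy.parquet.nanosAsLong", "true")
         .getOrCreate())

df = spark.read.parquet("/path/to/nanos/data")  # hypothetical path
df.printSchema()
{code}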



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44805) Data lost after union using spark.sql.parquet.enableNestedColumnVectorizedReader=true

2023-08-28 Thread Jakub Wozniak (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759615#comment-17759615
 ] 

Jakub Wozniak commented on SPARK-44805:
---

Hello,

Is it possible to get an ETA on this one?

Is this something that could potentially be fixed in the next version of Spark, 
or rather not?

Thanks,

Jakub

> Data lost after union using 
> spark.sql.parquet.enableNestedColumnVectorizedReader=true
> -
>
> Key: SPARK-44805
> URL: https://issues.apache.org/jira/browse/SPARK-44805
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.1
> Environment: pySpark, linux, hadoop, parquet. 
>Reporter: Jakub Wozniak
>Priority: Major
>
> When union-ing two DataFrames read from Parquet containing nested structures 
> (2 fields of array types, where one is double and the other is integer), data 
> from the second field seems to be lost (zeros are set instead). 
> This seems to be the case only if the nested vectorised reader is used 
> (spark.sql.parquet.enableNestedColumnVectorizedReader=true). 
> The following Python code reproduces the problem: 
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import *
> # PREPARING DATA
> data1 = []
> data2 = []
> for i in range(2): 
>     data1.append( (([1,2,3],[1,1,2]),i))
>     data2.append( (([1.0,2.0,3.0],[1,1]),i+10))
> schema1 = StructType([
>         StructField('value', StructType([
>              StructField('f1', ArrayType(IntegerType()), True),
>              StructField('f2', ArrayType(IntegerType()), True)             
>              ])),
>          StructField('id', IntegerType(), True)
> ])
> schema2 = StructType([
>         StructField('value', StructType([
>              StructField('f1', ArrayType(DoubleType()), True),
>              StructField('f2', ArrayType(IntegerType()), True)             
>              ])),
>          StructField('id', IntegerType(), True)
> ])
> spark = SparkSession.builder.getOrCreate()
> data_dir = "/user//"
> df1 = spark.createDataFrame(data1, schema1)
> df1.write.mode('overwrite').parquet(data_dir + "data1") 
> df2 = spark.createDataFrame(data2, schema2)
> df2.write.mode('overwrite').parquet(data_dir + "data2") 
> # READING DATA
> parquet1 = spark.read.parquet(data_dir + "data1")
> parquet2 = spark.read.parquet(data_dir + "data2")
> # UNION
> out = parquet1.union(parquet2)
> parquet1.select("value.f2").distinct().show()
> out.select("value.f2").distinct().show()
> print(parquet1.collect())
> print(out.collect()) {code}
> Output: 
> {code:java}
> +---------+
> |       f2|
> +---------+
> |[1, 1, 2]|
> +---------+
> +---------+
> |       f2|
> +---------+
> |[0, 0, 0]|
> |   [1, 1]|
> +---------+
> [
> Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=0), 
> Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=1)
> ]
> [
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=0), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=1), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=10), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=11)
> ] {code}
> Please notice that values for the field f2 are lost after the union is done. 
> This only happens when this data is read from parquet files. 
> Could you please look into this? 
> Best regards,
> Jakub
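
Until a fix lands, a mitigation sketch consistent with the report's own
observation (it reuses `data_dir` from the snippet above; expect slower reads
on nested data with the vectorized reader off):

{code:python}
# Disable the code path the report identifies as the trigger.
spark.conf.set("spark.sql.parquet.enableNestedColumnVectorizedReader", "false")

parquet1 = spark.read.parquet(data_dir + "data1")
parquet2 = spark.read.parquet(data_dir + "data2")
out = parquet1.union(parquet2)
out.select("value.f2").distinct().show()  # should keep [1, 1, 2] and [1, 1]
{code}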



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44983) Convert binary to string by to_char for the formats: hex, base64, utf-8

2023-08-28 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-44983.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42632
[https://github.com/apache/spark/pull/42632]

> Convert binary to string by to_char for the formats: hex, base64, utf-8
> ---
>
> Key: SPARK-44983
> URL: https://issues.apache.org/jira/browse/SPARK-44983
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 4.0.0
>
>
> Map the to_char() function with a binary input to one of hex(), base64(), 
> decode() to achieve feature parity with:
> - Snowflake: https://docs.snowflake.com/en/sql-reference/functions/to_char
> - SAP SQL Anywhere: 
> https://help.sap.com/docs/SAP_SQL_Anywhere/93079d4ba8e44920ae63ffb4def91f5b/81fe51196ce21014b9c6cf43b298.html
> - Oracle: 
> https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/TO_CHAR-number.html#GUID-00DA076D-2468-41AB-A3AC-CC78DBA0D9CB
> - Vertica: 
> https://www.vertica.com/docs/9.3.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Formatting/TO_CHAR.htm
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44974) Replace SparkSession/Dataset/KeyValueGroupedDataset with null during serialization

2023-08-28 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-44974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell resolved SPARK-44974.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

> Replace SparkSession/Dataset/KeyValueGroupedDataset with null during 
> serialization
> --
>
> Key: SPARK-44974
> URL: https://issues.apache.org/jira/browse/SPARK-44974
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44976) Preserve full principal user name on executor side

2023-08-28 Thread YUBI LEE (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YUBI LEE updated SPARK-44976:
-
Summary: Preserve full principal user name on executor side  (was: 
Utils.getCurrentUserName should return the full principal name)

> Preserve full principal user name on executor side
> --
>
> Key: SPARK-44976
> URL: https://issues.apache.org/jira/browse/SPARK-44976
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.3, 3.3.3, 3.4.1
>Reporter: YUBI LEE
>Priority: Major
>
> SPARK-6558 changed the behavior of {{Utils.getCurrentUserName()}} to use the 
> short name instead of the full principal name.
> Due to this, it doesn't respect the {{hadoop.security.auth_to_local}} rule on 
> the side of a non-kerberized HDFS namenode.
> For example, I use two HDFS clusters: one is kerberized, the other is not.
> I have a rule that adds a prefix to the username on the non-kerberized cluster 
> if someone accesses it from the kerberized cluster.
> {code}
>   
> hadoop.security.auth_to_local
> 
> RULE:[1:$1@$0](.*@EXAMPLE.COM)s/(.+)@.*/_ex_$1/
> RULE:[2:$1@$0](.*@EXAMPLE.COM)s/(.+)@.*/_ex_$1/
> DEFAULT
>   
> {code}
> However, if I submit a Spark job with the keytab & principal options, the 
> ownership of HDFS directories and files is not coherent.
> (I have changed some words for privacy.)
> {code}
> $ hdfs dfs -ls hdfs:///user/eub/some/path/20230510/23
> Found 52 items
> -rw-rw-rw-   3 _ex_eub hdfs  0 2023-05-11 00:16 
> hdfs:///user/eub/some/path/20230510/23/_SUCCESS
> -rw-r--r--   3 eub  hdfs  134418857 2023-05-11 00:15 
> hdfs:///user/eub/some/path/20230510/23/part-0-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz
> -rw-r--r--   3 eub  hdfs  153410049 2023-05-11 00:16 
> hdfs:///user/eub/some/path/20230510/23/part-1-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz
> -rw-r--r--   3 eub  hdfs  157260989 2023-05-11 00:16 
> hdfs:///user/eub/some/path/20230510/23/part-2-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz
> -rw-r--r--   3 eub  hdfs  156222760 2023-05-11 00:16 
> hdfs:///user/eub/some/path/20230510/23/part-3-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz
> {code}
> Another interesting point: if I submit a Spark job without the keytab and 
> principal options but with Kerberos authentication via {{kinit}}, it will not 
> follow the {{hadoop.security.auth_to_local}} rule at all.
> {code}
> $ hdfs dfs -ls  hdfs:///user/eub/output/
> Found 3 items
> -rw-rw-r--+  3 eub hdfs  0 2023-08-25 12:31 
> hdfs:///user/eub/output/_SUCCESS
> -rw-rw-r--+  3 eub hdfs512 2023-08-25 12:31 
> hdfs:///user/eub/output/part-0.gz
> -rw-rw-r--+  3 eub hdfs574 2023-08-25 12:31 
> hdfs:///user/eub/output/part-1.gz
> {code}
> I finally found that if I submit a Spark job with the {{--principal}} and 
> {{--keytab}} options, the UGI will be different.
> (refer to 
> https://github.com/apache/spark/blob/2583bd2c16a335747895c0843f438d0966f47ecd/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L905).
> Only the file ({{_SUCCESS}}) and the output directory created by the driver 
> (application master side) respect {{hadoop.security.auth_to_local}} on the 
> non-kerberized namenode, and only if the {{--principal}} and {{--keytab}} 
> options are provided.
> No matter whether HDFS files or directories are created by the executor or the 
> driver, they should respect the {{hadoop.security.auth_to_local}} rule and 
> should be the same.
> The workaround is to pass an additional argument to change {{SPARK_USER}} on 
> the executor side,
> e.g. {{--conf spark.executorEnv.SPARK_USER=_ex_eub}}.
> {{--conf spark.yarn.appMasterEnv.SPARK_USER=_ex_eub}} will cause an error: 
> there is logic that appends environment values with {{:}} (colon) as a 
> separator.
> - 
> https://github.com/apache/spark/blob/4748d858b4478ea7503b792050d4735eae83b3cd/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L893
> - 
> https://github.com/apache/spark/blob/4748d858b4478ea7503b792050d4735eae83b3cd/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtil.scala#L52



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44984) Remove _get_alias from DataFrame

2023-08-28 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-44984:
-

Assignee: Ruifeng Zheng

> Remove _get_alias from DataFrame
> 
>
> Key: SPARK-44984
> URL: https://issues.apache.org/jira/browse/SPARK-44984
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44984) Remove _get_alias from DataFrame

2023-08-28 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-44984.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42698
[https://github.com/apache/spark/pull/42698]

> Remove _get_alias from DataFrame
> 
>
> Key: SPARK-44984
> URL: https://issues.apache.org/jira/browse/SPARK-44984
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-42905) pyspark.ml.stat.Correlation - Spearman Correlation method giving incorrect and inconsistent results for the same DataFrame if it has huge amount of Ties.

2023-08-28 Thread zhangzhenhao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17757858#comment-17757858
 ] 

zhangzhenhao edited comment on SPARK-42905 at 8/28/23 11:04 AM:


Minimal reproducible example: the result is incorrect and inconsistent when the 
tied-value count is > 10_000_000.

 
{code:java}
import org.apache.spark.ml.linalg.{Matrix, Vectors, Vector}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

val N = 1002
val x = sc.range(0, N).map(i => if (i < N - 1) 1.0 else 2.0)
val y = sc.range(0, N).map(i => if (i < N - 1) 2.0 else 1.0)
//val s1 = Statistics.corr(x, y, "spearman")
val df = x.zip(y)
  .map{case (x, y) => Vectors.dense(x, y)}
  .map(Tuple1.apply)
  .repartition(1) 
  .toDF("features")
  
val Row(coeff1: Matrix) = Correlation.corr(df, "features", "spearman").head
val r = coeff1(0, 1)
println(s"spearman correlation in spark: $r")
// spearman correlation in spark: -9.90476024495E-8 {code}
 

 

The correct result is -1.0.


was (Author: JIRAUSER301717):
Minimal reproducible example: the result is incorrect and inconsistent when the 
tied-value count is > 10_000_000.

 
{code:java}
import org.apache.spark.ml.linalg.{Matrix, Vectors, Vector}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

val N = 1002
val x = sc.range(0, N).map(i => if (i < N - 1) 1.0 else 2.0)
val y = sc.range(0, N).map(i => if (i < N - 1) 2.0 else 1.0)
//val s1 = Statistics.corr(x, y, "spearman")
val df = x.zip(y)
  .map{case (x, y) => Vectors.dense(x, y)}
  .map(Tuple1.apply)
  .repartition(1) 
  .toDF("features")
  
val Row(coeff1: Matrix) = Correlation.corr(df, "features", "spearman").head
val r = coeff1(0, 1)
println(s"pearson correlation in spark: $r")
// pearson correlation in spark: -9.90476024495E-8 {code}
 

 

The correct result is -1.0.

> pyspark.ml.stat.Correlation - Spearman Correlation method giving incorrect 
> and inconsistent results for the same DataFrame if it has huge amount of Ties.
> -
>
> Key: SPARK-42905
> URL: https://issues.apache.org/jira/browse/SPARK-42905
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.3.0
>Reporter: dronzer
>Priority: Critical
>  Labels: correctness
> Attachments: image-2023-03-23-10-51-28-420.png, 
> image-2023-03-23-10-52-11-481.png, image-2023-03-23-10-52-49-392.png, 
> image-2023-03-23-10-53-37-461.png, image-2023-03-23-10-55-26-879.png
>
>
> pyspark.ml.stat.Correlation
> The following is a scenario where the Correlation function fails to give 
> correct Spearman coefficient results.
> Tested example: a Spark DataFrame has 2 columns, A and B.
> !image-2023-03-23-10-55-26-879.png|width=562,height=162!
> Column A has 3 distinct values and a total of 108 million rows.
> Column B has 4 distinct values and a total of 108 million rows.
> If I calculate the correlation for this DataFrame in Python with pandas 
> DF.corr, it gives the correct answer, and even if I run the same code multiple 
> times the same answer is produced. (Each column has only 3-4 distinct values.)
> !image-2023-03-23-10-53-37-461.png|width=468,height=287!
>  
> In Spark, using Spearman correlation produces *different results* for the 
> *same dataframe* on multiple runs (see below; each column in this df has only 
> 3-4 distinct values).
> !image-2023-03-23-10-52-49-392.png|width=516,height=322!
>  
> Basically, in Python pandas DF.corr gives the same results on the same 
> dataframe on multiple runs, which is the expected behaviour. In Spark, 
> however, the same data gives a different result; moreover, running the same 
> cell with the same data multiple times produces different results, meaning the 
> output is inconsistent.
> Coming to the data, the only observation I could draw is the ties in the data 
> (only 3-4 distinct values over 108M rows). This scenario is not handled in 
> Spark's Correlation method, as the same data produces consistent results when 
> used in Python via df.corr.
> The only workaround we could find to get consistent output in Spark, matching 
> Python's, is by using a Pandas UDF, as shown below:
> !image-2023-03-23-10-52-11-481.png|width=518,height=111!
> !image-2023-03-23-10-51-28-420.png|width=509,height=270!
>  
> We also tried the pyspark.pandas.DataFrame.corr method, and it produces 
> incorrect and inconsistent results for this case too.
> Only PandasUDF seems to provide consistent results.
>  
> Another point to note: if I add some random noise to the data, which in turn 
> increases the number of distinct values in the data, it again gives consistent 
> results across runs. This makes me believe that the Python version handles 
> ties correctly.
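
For reference, a driver-side way to reproduce the pandas answer from Spark
data (a sketch only; the report's actual Pandas UDF workaround appears solely
in the attached screenshots and is not reconstructed here):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.table("ties_table").select("A", "B")  # hypothetical table

# Collect to the driver and let pandas rank the ties; only feasible when
# the two columns fit in driver memory.
pdf = sdf.toPandas()
print(pdf["A"].corr(pdf["B"], method="spearman"))
{code}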

[jira] [Updated] (SPARK-44987) Assign name to the error class _LEGACY_ERROR_TEMP_1100

2023-08-28 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-44987:
-
Description: Assign a name and improve the error message format.

> Assign name to the error class _LEGACY_ERROR_TEMP_1100
> --
>
> Key: SPARK-44987
> URL: https://issues.apache.org/jira/browse/SPARK-44987
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Minor
>
> Assign a name and improve the error message format.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44987) Assign name to the error class _LEGACY_ERROR_TEMP_1100

2023-08-28 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-44987:
-
Reporter: Max Gekk  (was: BingKun Pan)

> Assign name to the error class _LEGACY_ERROR_TEMP_1100
> --
>
> Key: SPARK-44987
> URL: https://issues.apache.org/jira/browse/SPARK-44987
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Max Gekk
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44987) Assign name to the error class _LEGACY_ERROR_TEMP_1100

2023-08-28 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-44987:


Assignee: Max Gekk

> Assign name to the error class _LEGACY_ERROR_TEMP_1100
> --
>
> Key: SPARK-44987
> URL: https://issues.apache.org/jira/browse/SPARK-44987
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44987) Assign name to the error class _LEGACY_ERROR_TEMP_1100[1017,1073,1074,1076,1125,1126]

2023-08-28 Thread Max Gekk (Jira)
Max Gekk created SPARK-44987:


 Summary: Assign name to the error class 
_LEGACY_ERROR_TEMP_1100[1017,1073,1074,1076,1125,1126]
 Key: SPARK-44987
 URL: https://issues.apache.org/jira/browse/SPARK-44987
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44987) Assign name to the error class _LEGACY_ERROR_TEMP_1100

2023-08-28 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-44987:
-
Summary: Assign name to the error class _LEGACY_ERROR_TEMP_1100  (was: 
Assign name to the error class 
_LEGACY_ERROR_TEMP_1100[1017,1073,1074,1076,1125,1126])

> Assign name to the error class _LEGACY_ERROR_TEMP_1100
> --
>
> Key: SPARK-44987
> URL: https://issues.apache.org/jira/browse/SPARK-44987
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44983) Convert binary to string by to_char for the formats: hex, base64, utf-8

2023-08-28 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759530#comment-17759530
 ] 

Hudson commented on SPARK-44983:


User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/42632

> Convert binary to string by to_char for the formats: hex, base64, utf-8
> ---
>
> Key: SPARK-44983
> URL: https://issues.apache.org/jira/browse/SPARK-44983
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Map the to_char() function with a binary input to one of hex(), base64(), 
> decode() to achieve feature parity with:
> - Snowflake: https://docs.snowflake.com/en/sql-reference/functions/to_char
> - SAP SQL Anywhere: 
> https://help.sap.com/docs/SAP_SQL_Anywhere/93079d4ba8e44920ae63ffb4def91f5b/81fe51196ce21014b9c6cf43b298.html
> - Oracle: 
> https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/TO_CHAR-number.html#GUID-00DA076D-2468-41AB-A3AC-CC78DBA0D9CB
> - Vertica: 
> https://www.vertica.com/docs/9.3.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Formatting/TO_CHAR.htm
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44986) There should be a gap at the bottom of the HTML

2023-08-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759520#comment-17759520
 ] 

ASF GitHub Bot commented on SPARK-44986:


User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/42702

> There should be a gap at the bottom of the HTML
> ---
>
> Key: SPARK-44986
> URL: https://issues.apache.org/jira/browse/SPARK-44986
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
> Attachments: image-2023-08-28-16-46-04-705.png, 
> image-2023-08-28-16-47-11-582.png
>
>
> Before:
> !image-2023-08-28-16-47-11-582.png|width=794,height=392!
>  
> After:
> !image-2023-08-28-16-46-04-705.png|width=744,height=329!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44986) There should be a gap at the bottom of the HTML

2023-08-28 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-44986:

Attachment: image-2023-08-28-16-47-11-582.png

> There should be a gap at the bottom of the HTML
> ---
>
> Key: SPARK-44986
> URL: https://issues.apache.org/jira/browse/SPARK-44986
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
> Attachments: image-2023-08-28-16-46-04-705.png, 
> image-2023-08-28-16-47-11-582.png
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44986) There should be a gap at the bottom of the HTML

2023-08-28 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-44986:

Description: 
Before:

!image-2023-08-28-16-47-11-582.png|width=794,height=392!

 

After:

!image-2023-08-28-16-46-04-705.png|width=744,height=329!

> There should be a gap at the bottom of the HTML
> ---
>
> Key: SPARK-44986
> URL: https://issues.apache.org/jira/browse/SPARK-44986
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
> Attachments: image-2023-08-28-16-46-04-705.png, 
> image-2023-08-28-16-47-11-582.png
>
>
> Before:
> !image-2023-08-28-16-47-11-582.png|width=794,height=392!
>  
> After:
> !image-2023-08-28-16-46-04-705.png|width=744,height=329!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44986) There should be a gap at the bottom of the HTML

2023-08-28 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-44986:

Attachment: image-2023-08-28-16-46-04-705.png

> There should be a gap at the bottom of the HTML
> ---
>
> Key: SPARK-44986
> URL: https://issues.apache.org/jira/browse/SPARK-44986
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
> Attachments: image-2023-08-28-16-46-04-705.png
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44986) There should be a gap at the bottom of the HTML

2023-08-28 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-44986:
---

 Summary: There should be a gap at the bottom of the HTML
 Key: SPARK-44986
 URL: https://issues.apache.org/jira/browse/SPARK-44986
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Affects Versions: 4.0.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44982) Mark Spark Connect configurations as static configuration

2023-08-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44982.
--
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42695
[https://github.com/apache/spark/pull/42695]

> Mark Spark Connect configurations as static configuration
> -
>
> Key: SPARK-44982
> URL: https://issues.apache.org/jira/browse/SPARK-44982
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> Spark Connect server configurations are not marked either static or runtime 
> yet. We should mark them static.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44982) Mark Spark Connect configurations as static configuration

2023-08-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-44982:


Assignee: Hyukjin Kwon

> Mark Spark Connect configurations as static configuration
> -
>
> Key: SPARK-44982
> URL: https://issues.apache.org/jira/browse/SPARK-44982
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Spark Connect server configurations are not marked either static or runtime 
> yet. We should mark them static.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44981) Filter out static configurations used in local mode

2023-08-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44981.
--
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42694
[https://github.com/apache/spark/pull/42694]

> Filter out static configurations used in local mode
> ---
>
> Key: SPARK-44981
> URL: https://issues.apache.org/jira/browse/SPARK-44981
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.5.0, 4.0.0
>
>
> If you set a static configuration in `--remote local` mode, it logs a bunch 
> of errors like the one below:
> {code}
> 23/08/28 11:39:42 ERROR ErrorUtils: Spark Connect RPC error during: config. 
> UserId: hyukjin.kwon. SessionId: 424674ef-af95-4b12-b10e-86479413f9fd.
> org.apache.spark.sql.AnalysisException: Cannot modify the value of a static 
> config: spark.connect.copyFromLocalToFs.allowDestLocal.
>   at 
> org.apache.spark.sql.errors.QueryCompilationErrors$.cannotModifyValueOfStaticConfigError(QueryCompilationErrors.scala:3227)
>   at 
> org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:162)
>   at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:42)
>   at 
> org.apache.spark.sql.connect.service.SparkConnectConfigHandler.$anonfun$handleSet$1(SparkConnectConfigHandler.scala:67)
>   at 
> org.apache.spark.sql.connect.service.SparkConnectConfigHandler.$anonfun$handleSet$1$adapted(SparkConnectConfigHandler.scala:65)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at 
> org.apache.spark.sql.connect.service.SparkConnectConfigHandler.handleSet(SparkConnectConfigHandler.scala:65)
>   at 
> org.apache.spark.sql.connect.service.SparkConnectConfigHandler.handle(SparkConnectConfigHandler.scala:40)
>   at 
> org.apache.spark.sql.connect.service.SparkConnectService.config(SparkConnectService.scala:120)
>   at 
> org.apache.spark.connect.proto.SparkConnectServiceGrpc$MethodHandlers.invoke(SparkConnectServiceGrpc.java:751)
>   at 
> org.sparkproject.connect.grpc.io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
>   at 
> org.sparkproject.connect.grpc.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:346)
>   at 
> org.sparkproject.connect.grpc.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:860)
>   at 
> org.sparkproject.connect.grpc.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
>   at 
> org.sparkproject.connect.grpc.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
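
The failure is easy to trigger by hand; a sketch, assuming a Spark Connect
session against a local server:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("local").getOrCreate()

# Static configs cannot go through the runtime config API; the client
# used to forward them anyway, producing the AnalysisException above.
spark.conf.set("spark.connect.copyFromLocalToFs.allowDestLocal", "true")
{code}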



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44981) Filter out static configurations used in local mode

2023-08-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-44981:


Assignee: Hyukjin Kwon

> Filter out static configurations used in local mode
> ---
>
> Key: SPARK-44981
> URL: https://issues.apache.org/jira/browse/SPARK-44981
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> If you set a static configuration in `--remote local` mode, it logs a bunch 
> of errors like the one below:
> {code}
> 23/08/28 11:39:42 ERROR ErrorUtils: Spark Connect RPC error during: config. 
> UserId: hyukjin.kwon. SessionId: 424674ef-af95-4b12-b10e-86479413f9fd.
> org.apache.spark.sql.AnalysisException: Cannot modify the value of a static 
> config: spark.connect.copyFromLocalToFs.allowDestLocal.
>   at 
> org.apache.spark.sql.errors.QueryCompilationErrors$.cannotModifyValueOfStaticConfigError(QueryCompilationErrors.scala:3227)
>   at 
> org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:162)
>   at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:42)
>   at 
> org.apache.spark.sql.connect.service.SparkConnectConfigHandler.$anonfun$handleSet$1(SparkConnectConfigHandler.scala:67)
>   at 
> org.apache.spark.sql.connect.service.SparkConnectConfigHandler.$anonfun$handleSet$1$adapted(SparkConnectConfigHandler.scala:65)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at 
> org.apache.spark.sql.connect.service.SparkConnectConfigHandler.handleSet(SparkConnectConfigHandler.scala:65)
>   at 
> org.apache.spark.sql.connect.service.SparkConnectConfigHandler.handle(SparkConnectConfigHandler.scala:40)
>   at 
> org.apache.spark.sql.connect.service.SparkConnectService.config(SparkConnectService.scala:120)
>   at 
> org.apache.spark.connect.proto.SparkConnectServiceGrpc$MethodHandlers.invoke(SparkConnectServiceGrpc.java:751)
>   at 
> org.sparkproject.connect.grpc.io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
>   at 
> org.sparkproject.connect.grpc.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:346)
>   at 
> org.sparkproject.connect.grpc.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:860)
>   at 
> org.sparkproject.connect.grpc.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
>   at 
> org.sparkproject.connect.grpc.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44985) Use toString instead of stacktrace for task reaper threadDump

2023-08-28 Thread Kent Yao (Jira)
Kent Yao created SPARK-44985:


 Summary: Use toString instead of stacktrace for task reaper 
threadDump
 Key: SPARK-44985
 URL: https://issues.apache.org/jira/browse/SPARK-44985
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44984) Remove _get_alias from DataFrame

2023-08-28 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-44984:
-

 Summary: Remove _get_alias from DataFrame
 Key: SPARK-44984
 URL: https://issues.apache.org/jira/browse/SPARK-44984
 Project: Spark
  Issue Type: Improvement
  Components: Connect, PySpark
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-44819) Make Python the first language in all Spark code snippet

2023-08-28 Thread BingKun Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759468#comment-17759468
 ] 

BingKun Pan edited comment on SPARK-44819 at 8/28/23 7:36 AM:
--

This issue duplicates `https://issues.apache.org/jira/browse/SPARK-42642`; 
I think we can close this.


was (Author: panbingkun):
I work on it.

> Make Python the first language in all Spark code snippet
> 
>
> Key: SPARK-44819
> URL: https://issues.apache.org/jira/browse/SPARK-44819
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.5.0
>Reporter: Allison Wang
>Priority: Major
> Attachments: Screenshot 2023-08-15 at 11.59.11.png
>
>
> Currently, the first and default language for all code snippets is Scala. For 
> instance: https://spark.apache.org/docs/latest/quick-start.html
> We should make Python the first language for all the code snippets.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44983) Convert binary to string by to_char for the formats: hex, base64, utf-8

2023-08-28 Thread Max Gekk (Jira)
Max Gekk created SPARK-44983:


 Summary: Convert binary to string by to_char for the formats: hex, 
base64, utf-8
 Key: SPARK-44983
 URL: https://issues.apache.org/jira/browse/SPARK-44983
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 4.0.0
Reporter: Max Gekk
Assignee: Max Gekk


Map the to_char() function with a binary input to one of hex(), base64(), 
decode() to achieve feature parity with:
- Snowflake: https://docs.snowflake.com/en/sql-reference/functions/to_char
- SAP SQL Anywhere: 
https://help.sap.com/docs/SAP_SQL_Anywhere/93079d4ba8e44920ae63ffb4def91f5b/81fe51196ce21014b9c6cf43b298.html
- Oracle: 
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/TO_CHAR-number.html#GUID-00DA076D-2468-41AB-A3AC-CC78DBA0D9CB
- Vertica: 
https://www.vertica.com/docs/9.3.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Formatting/TO_CHAR.htm
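
To make the intended parity concrete, a sketch of the expected behaviour once
the mapping is in place (outputs are computed by hand from the named formats,
not from a run):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# x'537061726B' is the byte sequence of the string "Spark".
spark.sql("SELECT to_char(x'537061726B', 'hex')").show()     # 537061726B
spark.sql("SELECT to_char(x'537061726B', 'base64')").show()  # U3Bhcms=
spark.sql("SELECT to_char(x'537061726B', 'utf-8')").show()   # Spark
{code}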
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org