[jira] [Commented] (SPARK-43291) Match behavior for DataFrame.cov on string DataFrame

2023-06-11 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731449#comment-17731449
 ] 

Haejoon Lee commented on SPARK-43291:
-

I'm currently analyzing the usage of the pandas API. We plan to support the 
most commonly used APIs in the next release, and we expect to address all other 
pandas-related breaking changes in version 4.0. We have another meeting 
scheduled for tomorrow to discuss this further, so I will update this ticket 
with the final conclusions soon.

> Match behavior for DataFrame.cov on string DataFrame
> 
>
> Key: SPARK-43291
> URL: https://issues.apache.org/jira/browse/SPARK-43291
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Should enable the test below:
> {code:java}
> pdf = pd.DataFrame([("1", "2"), ("0", "3"), ("2", "0"), ("1", "1")], 
> columns=["a", "b"])
> psdf = ps.from_pandas(pdf)
> self.assert_eq(pdf.cov(), psdf.cov()) {code}
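
For reference, a minimal plain-pandas sketch of what the test compares against. This is an assumption based on pandas 2.x behavior, where {{DataFrame.cov}} coerces numeric-looking strings to floats instead of silently dropping non-numeric columns:

{code}
import pandas as pd

pdf = pd.DataFrame([("1", "2"), ("0", "3"), ("2", "0"), ("1", "1")],
                   columns=["a", "b"])

# pandas 2.x coerces the numeric-looking strings to floats and returns a
# 2x2 covariance matrix; "matching behavior" means pandas-on-Spark should
# return the same result instead of failing or dropping the columns.
print(pdf.cov())
#           a         b
# a  0.666667 -1.000000
# b -1.000000  1.666667
{code}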






[jira] [Updated] (SPARK-44025) CSV Table Read Error with CharType(length) column

2023-06-11 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-44025:

Target Version/s: 3.4.1

> CSV Table Read Error with CharType(length) column
> -
>
> Key: SPARK-44025
> URL: https://issues.apache.org/jira/browse/SPARK-44025
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
> Environment: {{apache/spark:v3.4.0 image}}
>Reporter: Fengyu Cao
>Priority: Major
>
> Problem:
>  # read a CSV-format table
>  # the table has a `CharType(length)` column
>  # reading the table fails with the exception `org.apache.spark.SparkException: Job 
> aborted due to stage failure: Task 0 in stage 36.0 failed 4 times, most 
> recent failure: Lost task 0.3 in stage 36.0 (TID 72) (10.113.9.208 executor 
> 11): java.lang.IllegalArgumentException: requirement failed: requiredSchema 
> (struct) should be the subset of dataSchema 
> (struct).`
>  
> Reproduce with the official image:
>  # {{docker run -it apache/spark:v3.4.0 /opt/spark/bin/spark-sql}}
>  # {{CREATE TABLE csv_bug (name STRING, age INT, job CHAR(4)) USING CSV 
> OPTIONS ('header' = 'true', 'sep' = ';') LOCATION 
> "/opt/spark/examples/src/main/resources/people.csv";}}
>  # SELECT * FROM csv_bug;
>  # ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.IllegalArgumentException: requirement failed: requiredSchema 
> (struct) should be the subset of dataSchema 
> (struct).






[jira] [Assigned] (SPARK-43529) Support general expressions as OPTIONS values

2023-06-11 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-43529:
--

Assignee: Daniel

> Support general expressions as OPTIONS values 
> --
>
> Key: SPARK-43529
> URL: https://issues.apache.org/jira/browse/SPARK-43529
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
>







[jira] [Resolved] (SPARK-43529) Support general expressions as OPTIONS values

2023-06-11 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-43529.

Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41191
[https://github.com/apache/spark/pull/41191]

> Support general expressions as OPTIONS values 
> --
>
> Key: SPARK-43529
> URL: https://issues.apache.org/jira/browse/SPARK-43529
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
> Fix For: 3.5.0
>
>
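
The ticket itself carries no description; the sketch below is a hedged illustration of what the title suggests, assuming the change lets constant-foldable expressions (not just string literals) appear as OPTIONS values:

{code}
-- hypothetical example: expression values in the OPTIONS clause
CREATE TABLE t (c INT) USING CSV
OPTIONS ('header' = true, 'sep' = concat(';', ''));
{code}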







[jira] [Commented] (SPARK-44021) Add a config to avoid generating too many partitions

2023-06-11 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731429#comment-17731429
 ] 

Snoot.io commented on SPARK-44021:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/41545

> Add a config to avoid generating too many partitions
> ---
>
> Key: SPARK-44021
> URL: https://issues.apache.org/jira/browse/SPARK-44021
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Priority: Major
>







[jira] [Commented] (SPARK-44021) Add a config to avoid generating too many partitions

2023-06-11 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731428#comment-17731428
 ] 

Snoot.io commented on SPARK-44021:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/41545

> Add a config to avoid generating too many partitions
> ---
>
> Key: SPARK-44021
> URL: https://issues.apache.org/jira/browse/SPARK-44021
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Priority: Major
>







[jira] [Created] (SPARK-44025) CSV Table Read Error with CharType(length) column

2023-06-11 Thread Fengyu Cao (Jira)
Fengyu Cao created SPARK-44025:
--

 Summary: CSV Table Read Error with CharType(length) column
 Key: SPARK-44025
 URL: https://issues.apache.org/jira/browse/SPARK-44025
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
 Environment: {{apache/spark:v3.4.0 image}}
Reporter: Fengyu Cao


Problem:
 # read a CSV-format table
 # the table has a `CharType(length)` column
 # reading the table fails with the exception `org.apache.spark.SparkException: Job 
aborted due to stage failure: Task 0 in stage 36.0 failed 4 times, most recent 
failure: Lost task 0.3 in stage 36.0 (TID 72) (10.113.9.208 executor 11): 
java.lang.IllegalArgumentException: requirement failed: requiredSchema 
(struct) should be the subset of dataSchema 
(struct).`

 

Reproduce with the official image:
 # {{docker run -it apache/spark:v3.4.0 /opt/spark/bin/spark-sql}}
 # {{CREATE TABLE csv_bug (name STRING, age INT, job CHAR(4)) USING CSV OPTIONS 
('header' = 'true', 'sep' = ';') LOCATION 
"/opt/spark/examples/src/main/resources/people.csv";}}
 # SELECT * FROM csv_bug;
 # ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalArgumentException: requirement failed: requiredSchema 
(struct) should be the subset of dataSchema 
(struct).






[jira] [Commented] (SPARK-32559) Fix the trim logic in UTF8String.toInt/toLong that didn't handle Chinese characters correctly

2023-06-11 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731425#comment-17731425
 ] 

Snoot.io commented on SPARK-32559:
--

User 'Kwafoor' has created a pull request for this issue:
https://github.com/apache/spark/pull/41535

> Fix the trim logic in UTF8String.toInt/toLong that didn't handle Chinese 
> characters correctly
> ---
>
> Key: SPARK-32559
> URL: https://issues.apache.org/jira/browse/SPARK-32559
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: EdisonWang
>Assignee: EdisonWang
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.1
>
> Attachments: error.log
>
>
> The trim logic in the Cast expression introduced in 
> [https://github.com/apache/spark/pull/26622] trims Chinese characters 
> unexpectedly.
> For example, the SQL {{select cast("1中文" as float)}} returns 1 instead of null.
>  






[jira] [Updated] (SPARK-44024) Change to use `map` where `unzip` used to extract a single element

2023-06-11 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-44024:
-
Summary: Change to use `map` where `unzip` used to extract a single element 
  (was: Change to use map where unzip used to extract a single element )

> Change to use `map` where `unzip` used to extract a single element 
> ---
>
> Key: SPARK-44024
> URL: https://issues.apache.org/jira/browse/SPARK-44024
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Minor
>







[jira] [Updated] (SPARK-44024) Change to use `map` where `unzip` used to extract a single element

2023-06-11 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-44024:
-
Description: 
For example:

Seq((1, 11), (2, 22)).unzip._1

should change to

Seq((1, 11), (2, 22)).map(_._1)

> Change to use `map` where `unzip` used to extract a single element 
> ---
>
> Key: SPARK-44024
> URL: https://issues.apache.org/jira/browse/SPARK-44024
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Minor
>
> For example:
>  
> Seq((1, 11), (2, 22)).unzip._1
>  
> should change to 
>  
> Seq((1, 11), (2, 22)).map(_._1)
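
A minimal Scala sketch of the intended refactor (illustrative values only):

{code}
val pairs = Seq((1, 11), (2, 22))

// unzip materializes BOTH element sequences, and one of them is then discarded:
val lefts1: Seq[Int] = pairs.unzip._1   // Seq(1, 2)

// map extracts only the needed element in a single pass:
val lefts2: Seq[Int] = pairs.map(_._1)  // Seq(1, 2)

assert(lefts1 == lefts2)
{code}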






[jira] [Created] (SPARK-44024) Change to use map where unzip used to extract a single element

2023-06-11 Thread Yang Jie (Jira)
Yang Jie created SPARK-44024:


 Summary: Change to use map where unzip used to extract a single 
element 
 Key: SPARK-44024
 URL: https://issues.apache.org/jira/browse/SPARK-44024
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Yang Jie









[jira] [Created] (SPARK-44023) Add System.gc at beforeEach in PruneFileSourcePartitionsSuite

2023-06-11 Thread Yang Jie (Jira)
Yang Jie created SPARK-44023:


 Summary: Add System.gc at beforeEach in 
PruneFileSourcePartitionsSuite
 Key: SPARK-44023
 URL: https://issues.apache.org/jira/browse/SPARK-44023
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 3.5.0
Reporter: Yang Jie









[jira] [Updated] (SPARK-44022) Enforce Java max bytecode version for Maven dependencies

2023-06-11 Thread Bowen Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bowen Liang updated SPARK-44022:

Description: To enforce Java's max bytecode version for Maven dependencies by 
using the `enforceBytecodeVersion` enforcer rule, preventing the introduction 
of dependencies (including transitive dependencies) that require Java 11 or 
higher.  (was: To enforce Java's max bytecode version to maven 
dependencies, by using `enforceBytecodeVersion` enforcer rule. Preventing 
introducing dependencies requiring higher Java version 11+.)

> Enforce Java max bytecode version for Maven dependencies
> ---
>
> Key: SPARK-44022
> URL: https://issues.apache.org/jira/browse/SPARK-44022
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Bowen Liang
>Priority: Major
>
> To enforce Java's max bytecode version for Maven dependencies by using the 
> `enforceBytecodeVersion` enforcer rule, preventing the introduction of 
> dependencies (including transitive dependencies) that require Java 11 or higher.






[jira] [Updated] (SPARK-44022) Enforce Java max bytecode version for Maven dependencies

2023-06-11 Thread Bowen Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bowen Liang updated SPARK-44022:

Description: 
To enforce Java's max bytecode version for Maven dependencies by using the 
`enforceBytecodeVersion` enforcer rule.

This prevents introducing dependencies that require Java 11 or higher, 
including transitive dependencies.

  was:To enforce Java's max bytecode version to maven dependencies, by using 
`enforceBytecodeVersion` enforcer rule. Preventing introducing dependencies 
requiring higher Java version 11+, including transparent depencencies.


> Enforce Java max bytecode version for Maven dependencies
> ---
>
> Key: SPARK-44022
> URL: https://issues.apache.org/jira/browse/SPARK-44022
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Bowen Liang
>Priority: Major
>
> To enforce Java's max bytecode version for Maven dependencies by using the 
> `enforceBytecodeVersion` enforcer rule.
> This prevents introducing dependencies that require Java 11 or higher, 
> including transitive dependencies.






[jira] [Created] (SPARK-44022) Enforce Java max bytecode version for Maven dependencies

2023-06-11 Thread Bowen Liang (Jira)
Bowen Liang created SPARK-44022:
---

 Summary: Enforce Java max bytecode version for Maven dependencies
 Key: SPARK-44022
 URL: https://issues.apache.org/jira/browse/SPARK-44022
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.0
Reporter: Bowen Liang


To enforce Java's max bytecode version for Maven dependencies by using the 
`enforceBytecodeVersion` enforcer rule, preventing the introduction of 
dependencies that require Java 11 or higher.
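
A hedged sketch of what such a pom.xml change could look like, assuming the {{enforceBytecodeVersion}} rule that ships in the {{org.codehaus.mojo:extra-enforcer-rules}} artifact; the version number and the Java 8 ceiling below are illustrative:

{code}
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-enforcer-plugin</artifactId>
  <dependencies>
    <dependency>
      <!-- provides the enforceBytecodeVersion rule -->
      <groupId>org.codehaus.mojo</groupId>
      <artifactId>extra-enforcer-rules</artifactId>
      <version>1.6.1</version>
    </dependency>
  </dependencies>
  <executions>
    <execution>
      <id>enforce-bytecode-version</id>
      <goals>
        <goal>enforce</goal>
      </goals>
      <configuration>
        <rules>
          <enforceBytecodeVersion>
            <!-- fail the build if any dependency (including transitive ones)
                 ships class files newer than Java 8 -->
            <maxJdkVersion>1.8</maxJdkVersion>
          </enforceBytecodeVersion>
        </rules>
      </configuration>
    </execution>
  </executions>
</plugin>
{code}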






[jira] [Assigned] (SPARK-43617) Enable pyspark.pandas.spark.functions.product in Spark Connect.

2023-06-11 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-43617:
-

Assignee: Ruifeng Zheng

> Enable pyspark.pandas.spark.functions.product in Spark Connect.
> ---
>
> Key: SPARK-43617
> URL: https://issues.apache.org/jira/browse/SPARK-43617
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Ruifeng Zheng
>Priority: Major
>
> Enable pyspark.pandas.spark.functions.product in Spark Connect.






[jira] [Assigned] (SPARK-43938) Add to_* functions to Scala and Python

2023-06-11 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-43938:
-

Assignee: BingKun Pan

> Add to_* functions to Scala and Python
> --
>
> Key: SPARK-43938
> URL: https://issues.apache.org/jira/browse/SPARK-43938
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, SQL
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: BingKun Pan
>Priority: Major
>
> Add the following functions:
> * str_to_map
> * to_binary
> * to_char
> * to_number
> * to_timestamp_ltz
> * to_timestamp_ntz
> * to_unix_timestamp
> to:
> * Scala API
> * Python API
> * Spark Connect Scala Client
> * Spark Connect Python Client






[jira] [Resolved] (SPARK-43938) Add to_* functions to Scala and Python

2023-06-11 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-43938.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41505
[https://github.com/apache/spark/pull/41505]

> Add to_* functions to Scala and Python
> --
>
> Key: SPARK-43938
> URL: https://issues.apache.org/jira/browse/SPARK-43938
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, SQL
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: BingKun Pan
>Priority: Major
> Fix For: 3.5.0
>
>
> Add the following functions:
> * str_to_map
> * to_binary
> * to_char
> * to_number
> * to_timestamp_ltz
> * to_timestamp_ntz
> * to_unix_timestamp
> to:
> * Scala API
> * Python API
> * Spark Connect Scala Client
> * Spark Connect Python Client
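
A hedged usage sketch for a couple of these, assuming they land in {{pyspark.sql.functions}} under the same names as their SQL counterparts:

{code}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2023-06-11 12:00:00",)], ["ts"])

df.select(
    F.to_timestamp_ntz("ts"),    # timestamp without time zone
    F.to_unix_timestamp("ts"),   # seconds since the epoch
).show()
{code}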






[jira] [Created] (SPARK-44021) Add a config to avoid generating too many partitions

2023-06-11 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-44021:
---

 Summary: Add a config to avoid generating too many partitions
 Key: SPARK-44021
 URL: https://issues.apache.org/jira/browse/SPARK-44021
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Yuming Wang









[jira] [Resolved] (SPARK-43179) Add option for applications to control saving of metadata in the External Shuffle Service LevelDB

2023-06-11 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-43179.
--
Resolution: Fixed

Issue resolved by pull request 41502
[https://github.com/apache/spark/pull/41502]

> Add option for applications to control saving of metadata in the External 
> Shuffle Service LevelDB
> -
>
> Key: SPARK-43179
> URL: https://issues.apache.org/jira/browse/SPARK-43179
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.4.0
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
> Fix For: 3.5.0
>
>
> Currently, the External Shuffle Service stores application metadata in 
> LevelDB. This is necessary to enable the shuffle server to resume serving 
> shuffle data for an application whose executors registered before the 
> NodeManager restarts. However, the metadata includes the application secret, 
> which is stored in LevelDB without encryption. This is a potential security 
> risk, particularly for applications with high security requirements. While 
> filesystem access control lists (ACLs) can help protect keys and 
> certificates, they may not be sufficient for some use cases. In response, we 
> have decided not to store metadata for these high-security applications in 
> LevelDB. As a result, these applications may experience more failures in the 
> event of a node restart, but we believe this trade-off is acceptable given 
> the increased security risk.






[jira] [Commented] (SPARK-43942) Add string functions to Scala and Python - part 1

2023-06-11 Thread BingKun Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731340#comment-17731340
 ] 

BingKun Pan commented on SPARK-43942:
-

I'm working on it.

> Add string functions to Scala and Python - part 1
> -
>
> Key: SPARK-43942
> URL: https://issues.apache.org/jira/browse/SPARK-43942
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, SQL
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> Add the following functions:
> * char
> * btrim
> * char_length
> * character_length
> * chr
> * contains
> * elt
> * find_in_set
> * like
> * ilike
> * lcase
> * ucase
> * len
> * left
> * right
> to:
> * Scala API
> * Python API
> * Spark Connect Scala Client
> * Spark Connect Python Client
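
A hedged usage sketch for a few of these, assuming they land in {{pyspark.sql.functions}} with the same names and argument order as their SQL counterparts:

{code}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("  Spark SQL  ",)], ["s"])

df.select(
    F.btrim("s"),                            # "Spark SQL" (spaces trimmed)
    F.ucase(F.btrim("s")),                   # "SPARK SQL"
    F.contains(F.btrim("s"), F.lit("SQL")),  # true
).show()
{code}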






[jira] [Commented] (SPARK-42290) Spark Driver hangs on OOM during Broadcast when AQE is enabled

2023-06-11 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731331#comment-17731331
 ] 

Jia Fan commented on SPARK-42290:
-

Thanks [~dongjoon] 

> Spark Driver hangs on OOM during Broadcast when AQE is enabled 
> ---
>
> Key: SPARK-42290
> URL: https://issues.apache.org/jira/browse/SPARK-42290
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Shardul Mahadik
>Assignee: Jia Fan
>Priority: Critical
> Fix For: 3.4.1, 3.5.0
>
>
> Repro steps:
> {code}
> $ spark-shell --conf spark.driver.memory=1g
> val df = spark.range(500).withColumn("str", 
> lit("abcdabcdabcdabcdabasgasdfsadfasdfasdfasfasfsadfasdfsadfasdf"))
> val df2 = spark.range(10).join(broadcast(df), Seq("id"), "left_outer")
> df2.collect
> {code}
> This will cause the driver to hang indefinitely. Here's a thread dump of the 
> {{main}} thread when it's stuck:
> {code}
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:285)
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$$Lambda$2819/629294880.apply(Unknown
>  Source)
> org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:809)
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:236)
>  => holding Monitor(java.lang.Object@1932537396})
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:381)
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:354)
> org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:4179)
> org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:3420)
> org.apache.spark.sql.Dataset$$Lambda$2390/1803372144.apply(Unknown Source)
> org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4169)
> org.apache.spark.sql.Dataset$$Lambda$2791/1357377136.apply(Unknown Source)
> org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:526)
> org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:4167)
> org.apache.spark.sql.Dataset$$Lambda$2391/1172042998.apply(Unknown Source)
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:118)
> org.apache.spark.sql.execution.SQLExecution$$$Lambda$2402/721269425.apply(Unknown
>  Source)
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195)
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103)
> org.apache.spark.sql.execution.SQLExecution$$$Lambda$2392/11632488.apply(Unknown
>  Source)
> org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:809)
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
> org.apache.spark.sql.Dataset.withAction(Dataset.scala:4167)
> org.apache.spark.sql.Dataset.collect(Dataset.scala:3420)
> {code}
> When we disable AQE, though, we get the following exception instead of a 
> driver hang:
> {code}
> Caused by: org.apache.spark.SparkException: Not enough memory to build and 
> broadcast the table to all worker nodes. As a workaround, you can either 
> disable broadcast by setting spark.sql.autoBroadcastJoinThreshold to -1 or 
> increase the spark driver memory by setting spark.driver.memory to a higher 
> value.
>   ... 7 more
> Caused by: java.lang.OutOfMemoryError: Java heap space
>   at 
> org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.grow(HashedRelation.scala:834)
>   at 
> org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.append(HashedRelation.scala:777)
>   at 
> org.apache.spark.sql.execution.joins.LongHashedRelation$.apply(HashedRelation.scala:1086)
>   at 
> org.apache.spark.sql.execution.joins.HashedRelation$.apply(HashedRelation.scala:157)
>   at 
> org.apache.spark.sql.execution.joins.HashedRelationBroadcastMode.transform(HashedRelation.scala:1163)
>   at 
> org.apache.spark.sql.execution.joins.HashedRelationBroadcastMode.transform(HashedRelation.scala:1151)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.$anonfun$relationFuture$1(BroadcastExchangeExec.scala:148)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$Lambda$2999/145945436.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCap

[jira] [Commented] (SPARK-43926) Add array_agg, array_size, cardinality, count_min_sketch,mask,named_struct,json_* to Scala and Python

2023-06-11 Thread Tengfei Huang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731326#comment-17731326
 ] 

Tengfei Huang commented on SPARK-43926:
---

I am working on this and will send a PR soon.

> Add array_agg, array_size, cardinality, 
> count_min_sketch,mask,named_struct,json_* to Scala and Python
> -
>
> Key: SPARK-43926
> URL: https://issues.apache.org/jira/browse/SPARK-43926
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, SQL
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> Add the following functions:
> * array_agg
> * array_size
> * cardinality
> * count_min_sketch
> * named_struct
> * json_array_length
> * json_object_keys
> * mask
>   to:
> * Scala API
> * Python API
> * Spark Connect Scala Client
> * Spark Connect Python Client
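
A hedged sketch of one of these, {{mask}}, assuming the SQL defaults carry over (upper-case letters become X, lower-case x, digits n, and everything else is kept):

{code}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("AbCD123-@$#",)], ["data"])

# with the default replacement characters this yields "XxXXnnn-@$#"
df.select(F.mask("data")).show()
{code}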






[jira] [Commented] (SPARK-43438) Fix mismatched column list error on INSERT

2023-06-11 Thread Max Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731321#comment-17731321
 ] 

Max Gekk commented on SPARK-43438:
--

[~erico] Would you like to work on this issue? It is related to your resolved 
ticket SPARK-43387

> Fix mismatched column list error on INSERT
> --
>
> Key: SPARK-43438
> URL: https://issues.apache.org/jira/browse/SPARK-43438
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Major
>
> This error message is pretty bad, and it is common:
> "_LEGACY_ERROR_TEMP_1038" : {
> "message" : [
> "Cannot write to table due to mismatched user specified column 
> size() and data column size()."
> ]
> },
> It can perhaps be merged with this one - after giving it an ERROR_CLASS
> "_LEGACY_ERROR_TEMP_1168" : {
> "message" : [
> " requires that the data to be inserted have the same number of 
> columns as the target table: target table has  column(s) but 
> the inserted data has  column(s), including  
> partition column(s) having constant value(s)."
> ]
> },
> Repro:
> CREATE TABLE tabtest(c1 INT, c2 INT);
> INSERT INTO tabtest SELECT 1;
> `spark_catalog`.`default`.`tabtest` requires that the data to be inserted 
> have the same number of columns as the target table: target table has 2 
> column(s) but the inserted data has 1 column(s), including 0 partition 
> column(s) having constant value(s).
> INSERT INTO tabtest(c1) SELECT 1, 2, 3;
> Cannot write to table due to mismatched user specified column size(1) and 
> data column size(3).; line 1 pos 24
>  






[jira] [Updated] (SPARK-43438) Fix mismatched column list error on INSERT

2023-06-11 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-43438:
-
Parent: SPARK-37935
Issue Type: Sub-task  (was: Improvement)

> Fix mismatched column list error on INSERT
> --
>
> Key: SPARK-43438
> URL: https://issues.apache.org/jira/browse/SPARK-43438
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Major
>
> This error message is pretty bad, and it is common:
> "_LEGACY_ERROR_TEMP_1038" : {
> "message" : [
> "Cannot write to table due to mismatched user specified column 
> size() and data column size()."
> ]
> },
> It can perhaps be merged with this one - after giving it an ERROR_CLASS
> "_LEGACY_ERROR_TEMP_1168" : {
> "message" : [
> " requires that the data to be inserted have the same number of 
> columns as the target table: target table has  column(s) but 
> the inserted data has  column(s), including  
> partition column(s) having constant value(s)."
> ]
> },
> Repro:
> CREATE TABLE tabtest(c1 INT, c2 INT);
> INSERT INTO tabtest SELECT 1;
> `spark_catalog`.`default`.`tabtest` requires that the data to be inserted 
> have the same number of columns as the target table: target table has 2 
> column(s) but the inserted data has 1 column(s), including 0 partition 
> column(s) having constant value(s).
> INSERT INTO tabtest(c1) SELECT 1, 2, 3;
> Cannot write to table due to mismatched user specified column size(1) and 
> data column size(3).; line 1 pos 24
>  






[jira] [Commented] (SPARK-42298) Assign name to _LEGACY_ERROR_TEMP_2132

2023-06-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731320#comment-17731320
 ] 

ASF GitHub Bot commented on SPARK-42298:


User 'Hisoka-X' has created a pull request for this issue:
https://github.com/apache/spark/pull/40632

> Assign name to _LEGACY_ERROR_TEMP_2132
> --
>
> Key: SPARK-42298
> URL: https://issues.apache.org/jira/browse/SPARK-42298
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>







[jira] [Commented] (SPARK-40637) Spark-shell can correctly encode BINARY type but Spark-sql cannot

2023-06-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731318#comment-17731318
 ] 

ASF GitHub Bot commented on SPARK-40637:


User 'Hisoka-X' has created a pull request for this issue:
https://github.com/apache/spark/pull/41531

> Spark-shell can correctly encode BINARY type but Spark-sql cannot
> -
>
> Key: SPARK-40637
> URL: https://issues.apache.org/jira/browse/SPARK-40637
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1, 3.4.0
>Reporter: xsys
>Priority: Minor
> Attachments: image-2022-10-18-12-15-05-576.png
>
>
> h3. Describe the bug
> When we store a BINARY value (e.g. {{BigInt("1").toByteArray}} / {{X'01'}}) 
> via either {{spark-shell}} or {{spark-sql}} and then read it from 
> {{spark-shell}}, it outputs {{[01]}}. However, it does not display correctly 
> when querying it via {{spark-sql}}.
> i.e.,
> Insert via spark-shell, read via spark-shell: displays correctly
> Insert via spark-shell, read via spark-sql: does not display correctly
> Insert via spark-sql, read via spark-sql: does not display correctly
> Insert via spark-sql, read via spark-shell: displays correctly
> h3. To Reproduce
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-shell{code}
> Execute the following:
> {code:java}
> scala> import org.apache.spark.sql.Row 
> scala> import org.apache.spark.sql.types._
> scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[356] at parallelize at :28
> scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(c1,BinaryType,true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: binary]
> scala> 
> df.write.mode("overwrite").format("orc").saveAsTable("binary_vals_shell")
> scala> spark.sql("select * from binary_vals_shell;").show(false)
> ++
> |c1  |
> ++
> |[01]|
> ++{code}
> Then we use {{spark-sql}} to (1) query what was inserted via spark-shell into 
> the binary_vals_shell table, and then (2) insert the value via spark-sql into 
> the binary_vals_sql table (we use tee to redirect the log to a file):
> {code:java}
> $SPARK_HOME/bin/spark-sql | tee sql.log{code}
>  Executing the following, we get only empty output in the terminal (but a 
> garbage character in the log file):
> {code:java}
> spark-sql> select * from binary_vals_shell; -- query what is inserted via 
> spark-shell;
> spark-sql> create table binary_vals_sql(c1 BINARY) stored as ORC; 
> spark-sql> insert into binary_vals_sql select X'01'; -- try to insert 
> directly in spark-sql;
> spark-sql> select * from binary_vals_sql;
> Time taken: 0.077 seconds, Fetched 1 row(s)
> {code}
> From the log file, we find that it shows up as a garbage character. (We never 
> encountered this garbage character in the logs of other data types.)
> !image-2022-10-18-12-15-05-576.png!
> We then return to spark-shell again and run the following:
> {code:java}
> scala> spark.sql("select * from binary_vals_sql;").show(false)
> ++                                                                        
>   
> |c1  |
> ++
> |[01]|
> ++{code}
> The binary value does not display correctly via spark-sql, but it still 
> displays correctly via spark-shell.
> h3. Expected behavior
> We expect the two Spark interfaces ({{spark-sql}} & {{spark-shell}}) 
> to behave consistently for the same data type ({{BINARY}}) and input 
> ({{BigInt("1").toByteArray}} / {{X'01'}}) combination.
>  
> h3. Additional context
> We also tried Avro and Parquet and encountered the same issue. We believe 
> this is format-independent.






[jira] [Resolved] (SPARK-43772) Move version configuration in `connect` module to parent

2023-06-11 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-43772.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41295
[https://github.com/apache/spark/pull/41295]

> Move version configuration in `connect` module to parent
> 
>
> Key: SPARK-43772
> URL: https://issues.apache.org/jira/browse/SPARK-43772
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Connect
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.5.0
>
>
> In the pom file of the submodule, there are some common version properties, 
> e.g.:
>  * guava.version
>  * guava.failureaccess.version
> that need to be moved to the parent pom for better management
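
A hedged sketch of the intended layout; the property values below are illustrative:

{code}
<!-- in the parent pom.xml: shared version properties hoisted out of the submodules -->
<properties>
  <guava.version>14.0.1</guava.version>
  <guava.failureaccess.version>1.0.1</guava.failureaccess.version>
</properties>
{code}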






[jira] [Assigned] (SPARK-43772) Move version configuration in `connect` module to parent

2023-06-11 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-43772:


Assignee: BingKun Pan

> Move version configuration in `connect` module to parent
> 
>
> Key: SPARK-43772
> URL: https://issues.apache.org/jira/browse/SPARK-43772
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Connect
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>
> In the pom file of the submodule, there are some common version properties, 
> e.g.:
>  * guava.version
>  * guava.failureaccess.version
> that need to be moved to the parent pom for better management


