[jira] [Resolved] (SPARK-47978) Decouple Spark Go Connect Library versioning from Spark versioning

2024-04-25 Thread BoYang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BoYang resolved SPARK-47978.

Resolution: Fixed

> Decouple Spark Go Connect Library versioning from Spark versioning
> --
>
> Key: SPARK-47978
> URL: https://issues.apache.org/jira/browse/SPARK-47978
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.1
>Reporter: BoYang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.1
>
>
> There was a recent discussion in the Spark community about the version 
> naming convention for the Spark Operator. People prefer versioning that is 
> independent of Spark versions. That applies to the Spark Connect Go Client 
> as well, so it is better to start from v1.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45225) XML: XSD file URL support

2024-04-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45225:
---
Labels: pull-request-available  (was: )

> XML: XSD file URL support
> -
>
> Key: SPARK-45225
> URL: https://issues.apache.org/jira/browse/SPARK-45225
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Sandip Agarwala
>Assignee: Sandip Agarwala
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47991) Arrange the test cases for window frames and window functions.

2024-04-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47991.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46226
[https://github.com/apache/spark/pull/46226]

> Arrange the test cases for window frames and window functions.
> --
>
> Key: SPARK-47991
> URL: https://issues.apache.org/jira/browse/SPARK-47991
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48000) Hash join support for strings with collation

2024-04-25 Thread Jira
Uroš Bojanić created SPARK-48000:


 Summary: Hash join support for strings with collation
 Key: SPARK-48000
 URL: https://issues.apache.org/jira/browse/SPARK-48000
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Uroš Bojanić






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22231) Support of map, filter, withField, dropFields in nested list of structures

2024-04-25 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840934#comment-17840934
 ] 

Dongjoon Hyun commented on SPARK-22231:
---

I removed the outdated target version from this issue.

> Support of map, filter, withField, dropFields in nested list of structures
> --
>
> Key: SPARK-22231
> URL: https://issues.apache.org/jira/browse/SPARK-22231
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: DB Tsai
>Priority: Major
>
> At Netflix's algorithm team, we work on ranking problems to find great 
> content that fulfills the unique tastes of our members. Before building a 
> recommendation algorithm, we need to prepare the training, testing, and 
> validation datasets in Apache Spark. Due to the nature of ranking problems, 
> we have a nested list of items to be ranked in one column, and the top level 
> is the context describing the setting where a model is to be used (e.g. 
> profiles, country, time, device, etc.). Here is a blog post describing the 
> details: [Distributed Time Travel for Feature 
> Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907].
>  
> To be more concrete, for the ranks of videos for a given profile_id in a 
> given country, our data schema looks like this:
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- title_id: integer (nullable = true)
>  |||-- scores: double (nullable = true)
> ...
> {code}
> We oftentimes need to work on the nested list of structs by applying 
> functions to it, and sometimes we drop or add new columns in the nested 
> list of structs. Currently, there is no easy solution in open-source 
> Apache Spark to perform those operations using SQL primitives; many people 
> just convert the data into an RDD to work on the nested level of data, and 
> then reconstruct the new dataframe as a workaround. This is extremely 
> inefficient because all the optimizations like predicate pushdown in SQL 
> cannot be performed, we cannot leverage the columnar format, and the 
> serialization and deserialization cost becomes huge even when we just want 
> to add a new column at the nested level.
> We built a solution internally at Netflix which we're very happy with, and 
> we plan to open-source it in Spark upstream. We would like to socialize the 
> API design to see if we missed any use-cases.
> The first API we added is *mapItems* on dataframes, which takes a function 
> from *Column* to *Column* and then applies it on the nested dataframe. Here 
> is an example:
> {code:java}
> case class Data(foo: Int, bar: Double, items: Seq[Double])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)),
>   Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4))
> ))
> val result = df.mapItems("items") {
>   item => item * 2.0
> }
> result.printSchema()
> // root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // ||-- element: double (containsNull = true)
> result.show()
> // +---+----+--------------------+
> // |foo| bar|               items|
> // +---+----+--------------------+
> // | 10|10.0|[20.2, 20.4, 20.6...|
> // | 20|20.0|[40.2, 40.4, 40.6...|
> // +---+----+--------------------+
> {code}
> Now, with the ability to apply a function on the nested dataframe, we can 
> add a new function, *withColumn* on *Column*, to add or replace an existing 
> column that has the same name in the nested list of structs. Here are two 
> examples demonstrating the API together with *mapItems*; the first one 
> replaces the existing column:
> {code:java}
> case class Item(a: Int, b: Double)
> case class Data(foo: Int, bar: Double, items: Seq[Item])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(Item(10, 10.0), Item(11, 11.0))),
>   Data(20, 20.0, Seq(Item(20, 20.0), Item(21, 21.0)))
> ))
> val result = df.mapItems("items") {
>   item => item.withColumn(item("b") + 1 as "b")
> }
> result.printSchema()
> // root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // ||-- element: struct (containsNull = true)
> // |||-- a: integer (nullable = true)
> // |||-- b: double (nullable = true)
> result.show(false)
> // +---+----+----------------------+
> // |foo|bar |items                 |
> // +---+----+----------------------+
> // |10 |10.0|[[10,11.0], [11,12.0]]|
> // |20 |20.0|[[20,21.0], [21,22.0]]|
> // +---+----+----------------------+

[jira] [Updated] (SPARK-22231) Support of map, filter, withField, dropFields in nested list of structures

2024-04-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-22231:
--
Target Version/s:   (was: 3.2.0)

> Support of map, filter, withField, dropFields in nested list of structures
> --
>
> Key: SPARK-22231
> URL: https://issues.apache.org/jira/browse/SPARK-22231
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: DB Tsai
>Priority: Major
>
> At Netflix's algorithm team, we work on ranking problems to find great 
> content that fulfills the unique tastes of our members. Before building a 
> recommendation algorithm, we need to prepare the training, testing, and 
> validation datasets in Apache Spark. Due to the nature of ranking problems, 
> we have a nested list of items to be ranked in one column, and the top level 
> is the context describing the setting where a model is to be used (e.g. 
> profiles, country, time, device, etc.). Here is a blog post describing the 
> details: [Distributed Time Travel for Feature 
> Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907].
>  
> To be more concrete, for the ranks of videos for a given profile_id in a 
> given country, our data schema looks like this:
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- title_id: integer (nullable = true)
>  |||-- scores: double (nullable = true)
> ...
> {code}
> We oftentimes need to work on the nested list of structs by applying 
> functions to it, and sometimes we drop or add new columns in the nested 
> list of structs. Currently, there is no easy solution in open-source 
> Apache Spark to perform those operations using SQL primitives; many people 
> just convert the data into an RDD to work on the nested level of data, and 
> then reconstruct the new dataframe as a workaround. This is extremely 
> inefficient because all the optimizations like predicate pushdown in SQL 
> cannot be performed, we cannot leverage the columnar format, and the 
> serialization and deserialization cost becomes huge even when we just want 
> to add a new column at the nested level.
> We built a solution internally at Netflix which we're very happy with, and 
> we plan to open-source it in Spark upstream. We would like to socialize the 
> API design to see if we missed any use-cases.
> The first API we added is *mapItems* on dataframes, which takes a function 
> from *Column* to *Column* and then applies it on the nested dataframe. Here 
> is an example:
> {code:java}
> case class Data(foo: Int, bar: Double, items: Seq[Double])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)),
>   Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4))
> ))
> val result = df.mapItems("items") {
>   item => item * 2.0
> }
> result.printSchema()
> // root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // ||-- element: double (containsNull = true)
> result.show()
> // +---+----+--------------------+
> // |foo| bar|               items|
> // +---+----+--------------------+
> // | 10|10.0|[20.2, 20.4, 20.6...|
> // | 20|20.0|[40.2, 40.4, 40.6...|
> // +---+----+--------------------+
> {code}
> Now, with the ability to apply a function on the nested dataframe, we can 
> add a new function, *withColumn* on *Column*, to add or replace an existing 
> column that has the same name in the nested list of structs. Here are two 
> examples demonstrating the API together with *mapItems*; the first one 
> replaces the existing column:
> {code:java}
> case class Item(a: Int, b: Double)
> case class Data(foo: Int, bar: Double, items: Seq[Item])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(Item(10, 10.0), Item(11, 11.0))),
>   Data(20, 20.0, Seq(Item(20, 20.0), Item(21, 21.0)))
> ))
> val result = df.mapItems("items") {
>   item => item.withColumn(item("b") + 1 as "b")
> }
> result.printSchema()
> // root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // ||-- element: struct (containsNull = true)
> // |||-- a: integer (nullable = true)
> // |||-- b: double (nullable = true)
> result.show(false)
> // +---+----+----------------------+
> // |foo|bar |items                 |
> // +---+----+----------------------+
> // |10 |10.0|[[10,11.0], [11,12.0]]|
> // |20 |20.0|[[20,21.0], [21,22.0]]|
> // +---+----+----------------------+
> {code}
> and the second 

[jira] [Updated] (SPARK-24941) Add RDDBarrier.coalesce() function

2024-04-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24941:
--
Target Version/s:   (was: 3.2.0)

> Add RDDBarrier.coalesce() function
> --
>
> Key: SPARK-24941
> URL: https://issues.apache.org/jira/browse/SPARK-24941
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Xingbo Jiang
>Priority: Major
>
> https://github.com/apache/spark/pull/21758#discussion_r204917245
> The number of partitions from the input data can be unexpectedly large, e.g. 
> if you do
> {code}
> sc.textFile(...).barrier().mapPartitions()
> {code}
> the number of input partitions is based on the HDFS input splits. We shall 
> provide a way in RDDBarrier to enable users to specify the number of tasks 
> in a barrier stage, maybe something like RDDBarrier.coalesce(numPartitions: Int).
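A minimal PySpark sketch of the idea, assuming local mode with enough task slots: coalescing the RDD before barrier() is the workaround available today, while RDDBarrier.coalesce() itself is the proposed (hypothetical) API.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").getOrCreate()
sc = spark.sparkContext

def work(it):
    # placeholder for a task that must run together with all its peers
    yield sum(1 for _ in it)

# 100 partitions stand in for many HDFS input splits from sc.textFile(...)
rdd = sc.parallelize(range(1000), 100)

# workaround today: shrink the partition count before entering the barrier stage
print(rdd.coalesce(4).barrier().mapPartitions(work).collect())  # 4 barrier tasks

# proposed (does not exist yet): rdd.barrier().coalesce(4).mapPartitions(work)
{code}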



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25383) Image data source supports sample pushdown

2024-04-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25383:
--
Target Version/s:   (was: 3.2.0)

> Image data source supports sample pushdown
> --
>
> Key: SPARK-25383
> URL: https://issues.apache.org/jira/browse/SPARK-25383
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Affects Versions: 3.1.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> After SPARK-25349, we should update the image data source to support sampling.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25752) Add trait to easily whitelist logical operators that produce named output from CleanupAliases

2024-04-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25752:
--
Target Version/s:   (was: 3.2.0)

> Add trait to easily whitelist logical operators that produce named output 
> from CleanupAliases
> -
>
> Key: SPARK-25752
> URL: https://issues.apache.org/jira/browse/SPARK-25752
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Minor
>
> The rule `CleanupAliases` cleans up aliases from logical operators that do 
> not match a whitelist. This whitelist is hardcoded inside the rule, which is 
> cumbersome. This PR cleans that up by introducing a trait, `HasNamedOutput`, 
> that will be ignored by `CleanupAliases`; operators that require aliases to 
> be preserved should extend it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28629) Capture the missing rules in HiveSessionStateBuilder

2024-04-25 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840928#comment-17840928
 ] 

Dongjoon Hyun commented on SPARK-28629:
---

I removed the outdated target version from this issue.

> Capture the missing rules in HiveSessionStateBuilder
> 
>
> Key: SPARK-28629
> URL: https://issues.apache.org/jira/browse/SPARK-28629
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xiao Li
>Priority: Major
>
> A common mistake for new contributors is to forget to add the corresponding 
> rules to extendedResolutionRules, postHocResolutionRules, or 
> extendedCheckRules in HiveSessionStateBuilder. We need to either avoid 
> missing these rules or capture them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27780) Shuffle server & client should be versioned to enable smoother upgrade

2024-04-25 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840930#comment-17840930
 ] 

Dongjoon Hyun commented on SPARK-27780:
---

I removed the outdated target version from this issue.

> Shuffle server & client should be versioned to enable smoother upgrade
> --
>
> Key: SPARK-27780
> URL: https://issues.apache.org/jira/browse/SPARK-27780
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Imran Rashid
>Priority: Major
>
> The external shuffle service is often upgraded at a different time than 
> Spark itself. However, this causes problems when the protocol changes 
> between the shuffle service and the Spark runtime -- this forces users to 
> upgrade everything simultaneously.
> We should add versioning to the shuffle client & server, so they know what 
> messages the other will support. This would allow better handling of mixed 
> versions, from better error messages to allowing some mismatched versions 
> (with reduced capabilities).
> This originally came up in a discussion here: 
> https://github.com/apache/spark/pull/24565#issuecomment-493496466
> There are a few ways we could do the versioning which we still need to 
> discuss:
> 1) Version specified by config. This allows for mixed versions across the 
> cluster and rolling upgrades. It also lets a Spark 3.0 client talk to a 2.4 
> shuffle service. However, it may be a nuisance for users to get this right.
> 2) Auto-detection during registration with the local shuffle service. This 
> makes the versioning easy for the end user, and can even handle a 2.4 
> shuffle service even though it does not support the new versioning. However, 
> it will not handle a rolling upgrade correctly -- if the local shuffle 
> service has been upgraded but other nodes in the cluster have not, it will 
> get the version wrong.
> 3) Exchange versions per-connection. When a connection is opened, the server 
> & client could first exchange messages with their versions, so they know how 
> to continue communication after that.
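For option 3, a small illustrative handshake (not Spark code) shows the shape of the idea: each side advertises the protocol versions it supports when the connection opens, and both settle on the highest version in common.

{code:python}
SUPPORTED_VERSIONS = {1, 2, 3}  # message-protocol versions this side understands

def negotiate(remote_versions):
    """Pick the highest mutually supported version, or fail loudly."""
    common = SUPPORTED_VERSIONS & set(remote_versions)
    if not common:
        raise RuntimeError("no mutually supported shuffle protocol version")
    return max(common)

# an old service advertising only version 1 still gets a usable answer,
# possibly with reduced capabilities on the newer side
assert negotiate({1}) == 1
assert negotiate({2, 3, 4}) == 3
{code}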



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28629) Capture the missing rules in HiveSessionStateBuilder

2024-04-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28629:
--
Target Version/s:   (was: 3.2.0)

> Capture the missing rules in HiveSessionStateBuilder
> 
>
> Key: SPARK-28629
> URL: https://issues.apache.org/jira/browse/SPARK-28629
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xiao Li
>Priority: Major
>
> A common mistake for new contributors is to forget to add the corresponding 
> rules to extendedResolutionRules, postHocResolutionRules, or 
> extendedCheckRules in HiveSessionStateBuilder. We need to either avoid 
> missing these rules or capture them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27780) Shuffle server & client should be versioned to enable smoother upgrade

2024-04-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27780:
--
Target Version/s:   (was: 3.2.0)

> Shuffle server & client should be versioned to enable smoother upgrade
> --
>
> Key: SPARK-27780
> URL: https://issues.apache.org/jira/browse/SPARK-27780
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Imran Rashid
>Priority: Major
>
> The external shuffle service is often upgraded at a different time than 
> Spark itself. However, this causes problems when the protocol changes 
> between the shuffle service and the Spark runtime -- this forces users to 
> upgrade everything simultaneously.
> We should add versioning to the shuffle client & server, so they know what 
> messages the other will support. This would allow better handling of mixed 
> versions, from better error messages to allowing some mismatched versions 
> (with reduced capabilities).
> This originally came up in a discussion here: 
> https://github.com/apache/spark/pull/24565#issuecomment-493496466
> There are a few ways we could do the versioning which we still need to 
> discuss:
> 1) Version specified by config. This allows for mixed versions across the 
> cluster and rolling upgrades. It also lets a Spark 3.0 client talk to a 2.4 
> shuffle service. However, it may be a nuisance for users to get this right.
> 2) Auto-detection during registration with the local shuffle service. This 
> makes the versioning easy for the end user, and can even handle a 2.4 
> shuffle service even though it does not support the new versioning. However, 
> it will not handle a rolling upgrade correctly -- if the local shuffle 
> service has been upgraded but other nodes in the cluster have not, it will 
> get the version wrong.
> 3) Exchange versions per-connection. When a connection is opened, the server 
> & client could first exchange messages with their versions, so they know how 
> to continue communication after that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30324) Simplify API for JSON access in DataFrames/SQL

2024-04-25 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840927#comment-17840927
 ] 

Dongjoon Hyun commented on SPARK-30324:
---

I removed the outdated target version from this issue.

> Simplify API for JSON access in DataFrames/SQL
> --
>
> Key: SPARK-30324
> URL: https://issues.apache.org/jira/browse/SPARK-30324
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Burak Yavuz
>Priority: Major
>
> get_json_object() is a UDF to parse JSON fields. It is verbose and hard to 
> use, e.g. I wasn't expecting the path to a field to have to start with "$.". 
> We can simplify all of this when a column is of StringType and a nested 
> field is requested. This API sugar will be rewritten in the query planner as 
> get_json_object.
> This nested access can then be extended in the future to other 
> semi-structured formats.
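A short PySpark sketch of the contrast: the first select uses the real get_json_object API with its "$."-prefixed path; the sugared form in the comment is the hypothetical simplification this issue proposes.

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('{"user": {"id": 7}}',)], ["raw"])

# today: verbose, and the path must start with "$."
df.select(F.get_json_object("raw", "$.user.id").alias("id")).show()

# proposed sugar (hypothetical): nested access on a StringType column,
# rewritten by the query planner into get_json_object
# df.select("raw.user.id")
{code}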



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30324) Simplify API for JSON access in DataFrames/SQL

2024-04-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30324:
--
Target Version/s:   (was: 3.2.0)

> Simplify API for JSON access in DataFrames/SQL
> --
>
> Key: SPARK-30324
> URL: https://issues.apache.org/jira/browse/SPARK-30324
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Burak Yavuz
>Priority: Major
>
> get_json_object() is a UDF to parse JSON fields. It is verbose and hard to 
> use, e.g. I wasn't expecting the path to a field to have to start with "$.". 
> We can simplify all of this when a column is of StringType and a nested 
> field is requested. This API sugar will be rewritten in the query planner as 
> get_json_object.
> This nested access can then be extended in the future to other 
> semi-structured formats.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30334) Add metadata around semi-structured columns to Spark

2024-04-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30334:
--
Target Version/s:   (was: 3.2.0)

> Add metadata around semi-structured columns to Spark
> 
>
> Key: SPARK-30334
> URL: https://issues.apache.org/jira/browse/SPARK-30334
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Burak Yavuz
>Priority: Major
>
> Semi-structured data is used widely in the data industry for reporting 
> events in a wide variety of formats. Click events in product analytics can 
> be stored as JSON. Some application logs can be in the form of delimited 
> key=value text. Some data may be in XML.
> The goal of this project is to be able to signal Spark that such a column 
> exists. This will then enable Spark to "auto-parse" these columns on the fly. 
> The proposal is to store this information as part of the column metadata, in 
> the fields:
>  - format: The format of the semi-structured column, e.g. json, xml, avro
>  - options: Options for parsing these columns
> Then imagine having the following data:
> {code:java}
> +------------+-------+--------------------+
> | ts         | event | raw                |
> +------------+-------+--------------------+
> | 2019-10-12 | click | {"field":"value"}  |
> +------------+-------+--------------------+ {code}
> SELECT raw.field FROM data
> will return "value"
> or the following data
> {code:java}
> ++---+--+
> | ts | event | raw  |
> ++---+--+
> | 2019-10-12 | click | field1=v1|field2=v2  |
> ++---+--+ {code}
> SELECT raw.field1 FROM data
> will return v1.
>  
> As a first step, we will introduce the function "as_json", which accomplishes 
> this for JSON columns.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30334) Add metadata around semi-structured columns to Spark

2024-04-25 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840926#comment-17840926
 ] 

Dongjoon Hyun commented on SPARK-30334:
---

I removed the outdated target version from this issue.

> Add metadata around semi-structured columns to Spark
> 
>
> Key: SPARK-30334
> URL: https://issues.apache.org/jira/browse/SPARK-30334
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Burak Yavuz
>Priority: Major
>
> Semi-structured data is used widely in the data industry for reporting 
> events in a wide variety of formats. Click events in product analytics can 
> be stored as JSON. Some application logs can be in the form of delimited 
> key=value text. Some data may be in XML.
> The goal of this project is to be able to signal Spark that such a column 
> exists. This will then enable Spark to "auto-parse" these columns on the fly. 
> The proposal is to store this information as part of the column metadata, in 
> the fields:
>  - format: The format of the semi-structured column, e.g. json, xml, avro
>  - options: Options for parsing these columns
> Then imagine having the following data:
> {code:java}
> +------------+-------+--------------------+
> | ts         | event | raw                |
> +------------+-------+--------------------+
> | 2019-10-12 | click | {"field":"value"}  |
> +------------+-------+--------------------+ {code}
> SELECT raw.field FROM data
> will return "value"
> or the following data
> {code:java}
> +------------+-------+----------------------+
> | ts         | event | raw                  |
> +------------+-------+----------------------+
> | 2019-10-12 | click | field1=v1|field2=v2  |
> +------------+-------+----------------------+ {code}
> SELECT raw.field1 FROM data
> will return v1.
>  
> As a first step, we will introduce the function "as_json", which accomplishes 
> this for JSON columns.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24942) Improve cluster resource management with jobs containing barrier stage

2024-04-25 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840913#comment-17840913
 ] 

Dongjoon Hyun commented on SPARK-24942:
---

I removed the outdated target version, `3.2.0`, from this Jira. For now, the 
Apache Spark community has no target version for this issue.

> Improve cluster resource management with jobs containing barrier stage
> --
>
> Key: SPARK-24942
> URL: https://issues.apache.org/jira/browse/SPARK-24942
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Xingbo Jiang
>Priority: Major
>
> https://github.com/apache/spark/pull/21758#discussion_r205652317
> We shall improve cluster resource management to address the following issues:
> - With dynamic resource allocation enabled, it may happen that we acquire 
> some executors (but not enough to launch all the tasks in a barrier stage) 
> and later release them when the executor idle timeout expires, and then 
> acquire executors again.
> - There can be a deadlock with two concurrent applications. Each application 
> may acquire some resources, but not enough to launch all the tasks in a 
> barrier stage. After hitting the idle timeout and releasing them, they may 
> acquire resources again, continually trading resources with each other.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47998) pandas-on-spark DataFrame.concat will not join a Pandas dataframe and raises a misleading error

2024-04-25 Thread Philip Kahn (Jira)
Philip Kahn created SPARK-47998:
---

 Summary: pandas-on-spark DataFrame.concat will not join a Pandas 
dataframe and raises a misleading error
 Key: SPARK-47998
 URL: https://issues.apache.org/jira/browse/SPARK-47998
 Project: Spark
  Issue Type: Bug
  Components: Pandas API on Spark
Affects Versions: 3.4.3
Reporter: Philip Kahn


The `concat` method has a strict type check that raises a misleading error:

!image-2024-04-25-11-33-29-208.png!
Note that the type raised is that of `objs` rather than of each `obj`, so a 
list of assorted objects produces an error saying it cannot concatenate 
objects of type list, rather than naming the failing element types.

 

Additionally, this strictly checks for pandas-on-Spark Series and DataFrames; 
since both constructors will happily convert a native pandas object, something 
like

objs = [DataFrame(x) if isinstance(x, pd.DataFrame) else Series(x) if 
isinstance(x, pd.Series) else x for x in objs]

would trivially make those cases work and prevent a different strange error 
reporting that a DataFrame wasn't valid in a DataFrame concatenation.
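A minimal reproduction of the misleading message, assuming an environment matching the affected versions; the error text is paraphrased from the screenshot above.

{code:python}
import pandas as pd
import pyspark.pandas as ps

psdf = ps.DataFrame({"a": [1, 2]})
pdf = pd.DataFrame({"a": [3, 4]})

ps.concat([psdf, pdf])
# TypeError: cannot concatenate object of type 'list'; only ps.Series
# and ps.DataFrame are valid -- the reported type is that of `objs`
# (the list), not of the offending element (the pandas DataFrame)
{code}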



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24942) Improve cluster resource management with jobs containing barrier stage

2024-04-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24942:
--
Target Version/s:   (was: 3.2.0)

> Improve cluster resource management with jobs containing barrier stage
> --
>
> Key: SPARK-24942
> URL: https://issues.apache.org/jira/browse/SPARK-24942
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Xingbo Jiang
>Priority: Major
>
> https://github.com/apache/spark/pull/21758#discussion_r205652317
> We shall improve cluster resource management to address the following issues:
> - With dynamic resource allocation enabled, it may happen that we acquire 
> some executors (but not enough to launch all the tasks in a barrier stage) 
> and later release them when the executor idle timeout expires, and then 
> acquire executors again.
> - There can be a deadlock with two concurrent applications. Each application 
> may acquire some resources, but not enough to launch all the tasks in a 
> barrier stage. After hitting the idle timeout and releasing them, they may 
> acquire resources again, continually trading resources with each other.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47997) Pandas-on-Spark incompletely implements DataFrame.drop

2024-04-25 Thread Philip Kahn (Jira)
Philip Kahn created SPARK-47997:
---

 Summary: Pandas-on-Spark incompletely implements DataFrame.drop
 Key: SPARK-47997
 URL: https://issues.apache.org/jira/browse/SPARK-47997
 Project: Spark
  Issue Type: Bug
  Components: Pandas API on Spark
Affects Versions: 3.4.3
Reporter: Philip Kahn


For Pandas v1.0+, `drop` supports the `errors` kwarg:

[https://pandas.pydata.org/pandas-docs/version/1.0/reference/api/pandas.DataFrame.drop.html]

 

Pandas-on-Spark does not implement it. This is especially glaring since the 
PySpark drop is a no-op on absent columns, behaving like `errors='ignore'`, so 
_extra_ work had to be done to implement the raise behaviour.
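A small sketch of the gap, assuming pandas 1.0+ and pyspark 3.4:

{code:python}
import pandas as pd
import pyspark.pandas as ps

pdf = pd.DataFrame({"a": [1]})
pdf.drop(columns=["missing"], errors="ignore")   # fine: pandas supports errors=

psdf = ps.DataFrame({"a": [1]})
try:
    psdf.drop(columns=["missing"], errors="ignore")
except TypeError as e:
    print(e)  # drop() got an unexpected keyword argument 'errors'
{code}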



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47996) Pandas-on-Spark incompletely implements merge methods

2024-04-25 Thread Philip Kahn (Jira)
Philip Kahn created SPARK-47996:
---

 Summary: Pandas-on-Spark incompletely implements merge methods
 Key: SPARK-47996
 URL: https://issues.apache.org/jira/browse/SPARK-47996
 Project: Spark
  Issue Type: Bug
  Components: Pandas API on Spark
Affects Versions: 3.4.3
Reporter: Philip Kahn


For pandas >= 1.2 
([https://pandas.pydata.org/pandas-docs/version/1.2/reference/api/pandas.DataFrame.merge.html])
 up to the current 2.2, `merge` supports the method "cross" for its `how` 
parameter, which is absent in Pandas-on-Spark.

This breaks API compatibility.
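A short illustration of the incompatibility, assuming pandas >= 1.2 and pyspark 3.4:

{code:python}
import pandas as pd
import pyspark.pandas as ps

left = pd.DataFrame({"a": [1, 2]})
right = pd.DataFrame({"b": [3, 4]})
print(left.merge(right, how="cross"))  # 4-row Cartesian product in pandas

try:
    ps.DataFrame(left).merge(ps.DataFrame(right), how="cross")
except Exception as e:
    print(e)  # pandas-on-Spark rejects how="cross"
{code}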



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47994) SQLServer does not support 1 and 0 as boolean values

2024-04-25 Thread Stefan Bukorovic (Jira)
Stefan Bukorovic created SPARK-47994:


 Summary: SQLServer does not support 1 and 0 as boolean values
 Key: SPARK-47994
 URL: https://issues.apache.org/jira/browse/SPARK-47994
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.3
Reporter: Stefan Bukorovic


Sometimes in Spark, when a column generated by a CASE WHEN structure is used 
in a comparison filter, the output of the optimized plan will be: CASE WHEN 
expression THEN (1 or 0)..., which is not supported by SQL Server. SQL Server 
throws an exception saying that a "non-boolean expression is given when a 
boolean was expected". For now, we should not support CASE WHEN pushdown for 
SQL Server.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44111) Prepare Apache Spark 4.0.0

2024-04-25 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840853#comment-17840853
 ] 

Dongjoon Hyun commented on SPARK-44111:
---

Yes, we will provide `4.0.0-preview` in advance, [~fbiville]. Here is the 
discussion thread on the Apache Spark dev mailing list:
 * [https://lists.apache.org/thread/nxmvz2j7kp96otzlnl3kd277knlb6qgb]

[~cloud_fan] is the release manager who is leading Apache Spark 4.0.0 release 
(including preview).

> Prepare Apache Spark 4.0.0
> --
>
> Key: SPARK-44111
> URL: https://issues.apache.org/jira/browse/SPARK-44111
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>  Labels: pull-request-available
>
> For now, this issue aims to collect ideas for planning Apache Spark 4.0.0.
> We will add more items which will be excluded from Apache Spark 3.5.0 
> (Feature Freeze: July 16th, 2023).
> {code}
> Spark 1: 2014.05 (1.0.0) ~ 2016.11 (1.6.3)
> Spark 2: 2016.07 (2.0.0) ~ 2021.05 (2.4.8)
> Spark 3: 2020.06 (3.0.0) ~ 2026.xx (3.5.x)
> Spark 4: 2024.06 (4.0.0, NEW)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47355) Use wildcard imports in CollationTypeCasts

2024-04-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47355:
---
Labels: pull-request-available  (was: )

> Use wildcard imports in CollationTypeCasts
> --
>
> Key: SPARK-47355
> URL: https://issues.apache.org/jira/browse/SPARK-47355
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47355) Use wildcard imports in CollationTypeCast

2024-04-25 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-47355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić updated SPARK-47355:
-
Summary: Use wildcard imports in CollationTypeCast  (was: TBD)

> Use wildcard imports in CollationTypeCast
> -
>
> Key: SPARK-47355
> URL: https://issues.apache.org/jira/browse/SPARK-47355
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47355) Use wildcard imports in CollationTypeCasts

2024-04-25 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-47355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić updated SPARK-47355:
-
Summary: Use wildcard imports in CollationTypeCasts  (was: Use wildcard 
imports in CollationTypeCast)

> Use wildcard imports in CollationTypeCasts
> --
>
> Key: SPARK-47355
> URL: https://issues.apache.org/jira/browse/SPARK-47355
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47987) Enable `ArrowParityTests.test_createDataFrame_empty_partition`

2024-04-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47987.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46220
[https://github.com/apache/spark/pull/46220]

> Enable `ArrowParityTests.test_createDataFrame_empty_partition`
> --
>
> Key: SPARK-47987
> URL: https://issues.apache.org/jira/browse/SPARK-47987
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47990) Upgrade `zstd-jni` to 1.5.6-3

2024-04-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47990.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46225
[https://github.com/apache/spark/pull/46225]

> Upgrade `zstd-jni` to 1.5.6-3
> -
>
> Key: SPARK-47990
> URL: https://issues.apache.org/jira/browse/SPARK-47990
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-46830) Introducing collation concept into Spark

2024-04-25 Thread Gideon P (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840792#comment-17840792
 ] 

Gideon P commented on SPARK-46830:
--

[~uros-db] what should I work on next?

> Introducing collation concept into Spark
> 
>
> Key: SPARK-46830
> URL: https://issues.apache.org/jira/browse/SPARK-46830
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Priority: Major
> Attachments: Collation Support in Spark.docx
>
>
> This feature will introduce collation support to the Spark engine. This means 
> that:
>  
>  # Every StringType will have an associated collation. Default remains UTF8 
> Binary, which will behave under the same rules as current UTF8 String 
> comparison.
>  # Collation will be respected in all collation sensitive operations - 
> comparisons, hashing, string operations (contains, startWith, endsWith etc.)
>  # Collation can be set through following ways:
>  ## COLLATE expression. e.g. strExpr COLLATE collation_name
>  ## In CREATE TABLE column definition
>  ## By setting session collation.
>  # All the Spark operators need to respect collation settings (filters, 
> joins, shuffles, aggs etc.)
>  
> This is a high level description of the feature. You can find detailed design 
> under 
> [this|https://docs.google.com/document/d/1A9RQiwq-n3R3vuh571yjOLaaIuIYRTyCx7UFr0Qg-eY/edit?usp=sharing]
>  link (doc is in attachment as well).
>  
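A minimal sketch of the COLLATE expression route listed above, assuming a Spark 4.0 build with collation support; UTF8_BINARY_LCASE is the collation name used elsewhere in these threads.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# default UTF8 binary comparison is case-sensitive
spark.sql("SELECT 'A' = 'a' AS bin").show()                              # false

# the same comparison under a case-insensitive collation
spark.sql("SELECT 'A' = 'a' COLLATE UTF8_BINARY_LCASE AS lcase").show()  # true
{code}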



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44111) Prepare Apache Spark 4.0.0

2024-04-25 Thread Florent BIVILLE (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840788#comment-17840788
 ] 

Florent BIVILLE commented on SPARK-44111:
-

Are there going to be pre-releases of Spark 4 that library authors can try?

Or shall we build from the `master` branch and report back?

> Prepare Apache Spark 4.0.0
> --
>
> Key: SPARK-44111
> URL: https://issues.apache.org/jira/browse/SPARK-44111
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>  Labels: pull-request-available
>
> For now, this issue aims to collect ideas for planning Apache Spark 4.0.0.
> We will add more items which will be excluded from Apache Spark 3.5.0 
> (Feature Freeze: July 16th, 2023).
> {code}
> Spark 1: 2014.05 (1.0.0) ~ 2016.11 (1.6.3)
> Spark 2: 2016.07 (2.0.0) ~ 2021.05 (2.4.8)
> Spark 3: 2020.06 (3.0.0) ~ 2026.xx (3.5.x)
> Spark 4: 2024.06 (4.0.0, NEW)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47985) Simplify functions with `lit`

2024-04-25 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-47985.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46219
[https://github.com/apache/spark/pull/46219]

> Simplify functions with `lit`
> -
>
> Key: SPARK-47985
> URL: https://issues.apache.org/jira/browse/SPARK-47985
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47985) Simplify functions with `lit`

2024-04-25 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-47985:
-

Assignee: Ruifeng Zheng

> Simplify functions with `lit`
> -
>
> Key: SPARK-47985
> URL: https://issues.apache.org/jira/browse/SPARK-47985
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47993) Drop Python 3.8 support

2024-04-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-47993:
-
Labels: release-notes  (was: release-note)

> Drop Python 3.8 support
> ---
>
> Key: SPARK-47993
> URL: https://issues.apache.org/jira/browse/SPARK-47993
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: release-notes
>
> Python 3.8 reaches EOL this October. Considering the release schedule, we 
> had better drop it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47963) Make the external Spark ecosystem able to use structured logging mechanisms

2024-04-25 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-47963:

Summary: Make the external Spark ecosystem able to use structured logging 
mechanisms  (was: Add an external LogKey usage case in UT)

> Make the external Spark ecosystem able to use structured logging mechanisms 
> 
>
> Key: SPARK-47963
> URL: https://issues.apache.org/jira/browse/SPARK-47963
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47992) Support recursive descent path in get_json_object function

2024-04-25 Thread Qian Sun (Jira)
Qian Sun created SPARK-47992:


 Summary: Support recursive descent path in get_json_object function
 Key: SPARK-47992
 URL: https://issues.apache.org/jira/browse/SPARK-47992
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 4.0.0
Reporter: Qian Sun


JSONPath borrows recursive descent syntax from E4X. We could use it to collect 
JSON objects from a JSON map string.
{code:java}
// json data
{"key1": {"b": {"c": "c1", "d": "d1", "e": "e1"}}}
{"key2": {"b": {"c": "c2", "d": "d2", "e": "e2"}}}

select get_json_object(data, '$..c'); -- [c1, c2]{code}
ref: https://goessner.net/articles/JsonPath/index.html#e2
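For contrast, a short PySpark snippet of the current behavior, assuming Spark's existing JSONPath subset: recursive descent is not supported yet, so the same style of path yields NULL today.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql(
    """SELECT get_json_object('{"key1": {"b": {"c": "c1"}}}', '$..c')"""
).show()
# prints NULL until recursive descent is implemented
{code}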



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47991) Arrange the test cases for window frames and window functions.

2024-04-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47991:
---
Labels: pull-request-available  (was: )

> Arrange the test cases for window frames and window functions.
> --
>
> Key: SPARK-47991
> URL: https://issues.apache.org/jira/browse/SPARK-47991
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47991) Arrange the test cases for window frames and window functions.

2024-04-25 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-47991:
--

 Summary: Arrange the test cases for window frames and window 
functions.
 Key: SPARK-47991
 URL: https://issues.apache.org/jira/browse/SPARK-47991
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jiaan Geng
Assignee: Jiaan Geng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38958) Override S3 Client in Spark Write/Read calls

2024-04-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840748#comment-17840748
 ] 

ASF GitHub Bot commented on SPARK-38958:


hadoop-yetus commented on PR #6550:
URL: https://github.com/apache/hadoop/pull/6550#issuecomment-2076861454

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|:--------|:--------:|:-------:|
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m 01s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  spotbugs  |   0m 00s |  |  spotbugs executables are not 
available.  |
   | +0 :ok: |  codespell  |   0m 01s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m 01s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m 00s |  |  The patch does not contain 
any @author tags.  |
   | -1 :x: |  test4tests  |   0m 00s |  |  The patch doesn't appear to include 
any new or modified tests. Please justify why no new tests are needed for this 
patch. Also please list what manual steps were performed to verify this patch.  
|
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  92m 11s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   5m 02s |  |  trunk passed  |
   | +1 :green_heart: |  checkstyle  |   4m 36s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   5m 03s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   4m 45s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  | 146m 50s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   2m 55s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   2m 16s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   2m 16s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m 00s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   2m 02s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   2m 28s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   2m 14s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  | 159m 37s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  asflicense  |   5m 25s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 421m 41s |  |  |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | GITHUB PR | https://github.com/apache/hadoop/pull/6550 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | MINGW64_NT-10.0-17763 691d1e3161c7 3.4.10-87d57229.x86_64 
2024-02-14 20:17 UTC x86_64 Msys |
   | Build tool | maven |
   | Personality | /c/hadoop/dev-support/bin/hadoop.sh |
   | git revision | trunk / c8168fd0bc45331bd8b55dd53b537bec4b05fba5 |
   | Default Java | Azul Systems, Inc.-1.8.0_332-b09 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch-windows-10/job/PR-6550/1/testReport/
 |
   | modules | C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch-windows-10/job/PR-6550/1/console
 |
   | versions | git=2.44.0.windows.1 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




> Override S3 Client in Spark Write/Read calls
> 
>
> Key: SPARK-38958
> URL: https://issues.apache.org/jira/browse/SPARK-38958
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: Hershal
>Priority: Major
>  Labels: pull-request-available
>
> Hello,
> I have been working to use spark to read and write data to S3. Unfortunately, 
> there are a few S3 headers that I need to add to my spark read/write calls. 
> After much looking, I have not found a way to replace the S3 client that 
> spark uses to make the read/write calls. I also have not found a 
> configuration that allows me to pass in S3 headers. Here is an example of 
> some common S3 request headers 
> ([https://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonRequestHeaders.html).]
>  Does there already exist functionality to add S3 headers to spark read/write 
> calls or pass in a custom client that would pass these headers on every 
> read/write request? Appreciate the help and feedback
>  
> Thanks,



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For 

[jira] [Updated] (SPARK-47297) TBD

2024-04-25 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-47297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić updated SPARK-47297:
-
Summary: TBD  (was: split (binary & lowercase collation only))

> TBD
> ---
>
> Key: SPARK-47297
> URL: https://issues.apache.org/jira/browse/SPARK-47297
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47408) Fix mathExpressions that use StringType

2024-04-25 Thread Mihailo Milosevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mihailo Milosevic updated SPARK-47408:
--
Summary: Fix mathExpressions that use StringType  (was: TBD)

> Fix mathExpressions that use StringType
> ---
>
> Key: SPARK-47408
> URL: https://issues.apache.org/jira/browse/SPARK-47408
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47353) Mode (all collations)

2024-04-25 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-47353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić updated SPARK-47353:
-
Description: 
Enable collation support for the *Mode* expression in Spark. First confirm the 
expected behaviour for this expression when given collated strings, then move 
on to the implementation that would enable handling strings of all collation 
types. Implement the corresponding unit tests and E2E SQL tests to reflect how 
this function should be used with collation in SparkSQL, and feel free to use 
your chosen Spark SQL Editor to experiment with the existing functions to learn 
more about how they work. In addition, look into the possible use-cases and 
implementation of similar functions within other 
open-source DBMS, such as [PostgreSQL|https://www.postgresql.org/docs/].

 

The goal for this Jira ticket is to implement the *Mode* expression so it 
supports all collation types currently supported in Spark. To understand what 
changes were introduced in order to enable full collation support for other 
existing functions in Spark, take a look at the Spark PRs and Jira tickets for 
completed tasks in this parent (for example: Contains, StartsWith, EndsWith).

Examples:

With UTF8_BINARY collation, the query
SELECT mode(col) FROM VALUES ('a'), ('a'), ('a'), ('B'), ('B'), ('b'), ('b') AS 
tab(col);
should return 'a'.

With UTF8_BINARY_LCASE collation, the query
SELECT mode(col) FROM VALUES ('a'), ('a'), ('a'), ('B'), ('B'), ('b'), ('b') AS 
tab(col);
should return either 'B' or 'b'.
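
A minimal Scala sketch of how the examples above could be reproduced
end-to-end, assuming the column-level COLLATE syntax from the ongoing collation
work (the table name is illustrative, and the DDL may still change):

    // String column carrying a case-insensitive collation; mode() must then
    // treat 'B' and 'b' as one group.
    spark.sql("CREATE TABLE letters (col STRING COLLATE UTF8_BINARY_LCASE) USING parquet")
    spark.sql("INSERT INTO letters VALUES ('a'), ('a'), ('a'), ('B'), ('B'), ('b'), ('b')")
    spark.sql("SELECT mode(col) FROM letters").show()  // expected: 'B' or 'b'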

 

Read more about ICU [Collation Concepts|http://example.com/] and 
[Collator|http://example.com/] class. Also, refer to the Unicode Technical 
Standard for 
[collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].

  was:
Enable collation support for the *Mode* expression in Spark. First confirm the 
expected behaviour for this expression when given collated strings, then move 
on to the implementation that would enable handling strings of all collation 
types. Implement the corresponding unit tests and E2E SQL tests to reflect how 
this function should be used with collation in SparkSQL, and feel free to use 
your chosen Spark SQL Editor to experiment with the existing functions to learn 
more about how they work. In addition, look into the possible use-cases and 
implementation of similar functions within other 
open-source DBMS, such as [PostgreSQL|https://www.postgresql.org/docs/].

 

The goal for this Jira ticket is to implement the *Mode* expression so it 
supports all collation types currently supported in Spark. To understand what 
changes were introduced in order to enable full collation support for other 
existing functions in Spark, take a look at the Spark PRs and Jira tickets for 
completed tasks in this parent (for example: Contains, StartsWith, EndsWith).

 

Read more about ICU [Collation Concepts|http://example.com/] and 
[Collator|http://example.com/] class. Also, refer to the Unicode Technical 
Standard for 
[collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].


> Mode (all collations)
> -
>
> Key: SPARK-47353
> URL: https://issues.apache.org/jira/browse/SPARK-47353
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>
> Enable collation support for the *Mode* expression in Spark. First confirm 
> the expected behaviour for this expression when given collated strings, then 
> move on to the implementation that would enable handling strings of all 
> collation types. Implement the corresponding unit tests and 
> E2E SQL tests to reflect how this function should be used with collation in 
> SparkSQL, and feel free to use your chosen Spark SQL Editor to experiment 
> with the existing functions to learn more about how they work. In addition, 
> look into the possible use-cases and implementation of similar functions 
> within other open-source DBMS, such as 
> [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *Mode* expression so it 
> supports all collation types currently supported in Spark. To understand what 
> changes were introduced in order to enable full collation support for other 
> existing functions in Spark, take a look at the Spark PRs and Jira tickets 
> for completed tasks in this parent (for example: Contains, StartsWith, 
> EndsWith).
> Examples:
> With UTF8_BINARY collation, the query
> SELECT mode(col) FROM VALUES ('a'), ('a'), ('a'), ('B'), ('B'), ('b'), ('b') 
> AS tab(col);
> should return 'a'.
> With UTF8_BINARY_LCASE collation, the query
> SELECT mode(col) FROM VALUES ('a'), ('a'), ('a'), ('B'), ('B'), ('b'), ('b') 
> AS tab(col);
> should return either 'B' or 'b'.

[jira] [Updated] (SPARK-47353) Mode (all collations)

2024-04-25 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-47353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić updated SPARK-47353:
-
Description: 
Enable collation support for the *Mode* expression in Spark. First confirm the 
expected behaviour for this expression when given collated strings, then move 
on to the implementation that would enable handling strings of all collation 
types. Implement the corresponding unit tests and E2E SQL tests to reflect how 
this function should be used with collation in SparkSQL, and feel free to use 
your chosen Spark SQL Editor to experiment with the existing functions to learn 
more about how they work. In addition, look into the possible use-cases and 
implementation of similar functions within other 
open-source DBMS, such as [PostgreSQL|https://www.postgresql.org/docs/].

 

The goal for this Jira ticket is to implement the *Mode* expression so it 
supports all collation types currently supported in Spark. To understand what 
changes were introduced in order to enable full collation support for other 
existing functions in Spark, take a look at the Spark PRs and Jira tickets for 
completed tasks in this parent (for example: Contains, StartsWith, EndsWith).

 

Read more about ICU [Collation Concepts|http://example.com/] and 
[Collator|http://example.com/] class. Also, refer to the Unicode Technical 
Standard for 
[collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].

> Mode (all collations)
> -
>
> Key: SPARK-47353
> URL: https://issues.apache.org/jira/browse/SPARK-47353
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>
> Enable collation support for the *Mode* expression in Spark. First confirm 
> the expected behaviour for this expression when given collated strings, then 
> move on to the implementation that would enable handling strings of all 
> collation types. Implement the corresponding unit tests and 
> E2E SQL tests to reflect how this function should be used with collation in 
> SparkSQL, and feel free to use your chosen Spark SQL Editor to experiment 
> with the existing functions to learn more about how they work. In addition, 
> look into the possible use-cases and implementation of similar functions 
> within other open-source DBMS, such as 
> [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *Mode* expression so it 
> supports all collation types currently supported in Spark. To understand what 
> changes were introduced in order to enable full collation support for other 
> existing functions in Spark, take a look at the Spark PRs and Jira tickets 
> for completed tasks in this parent (for example: Contains, StartsWith, 
> EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47353) Mode (all collations)

2024-04-25 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-47353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840728#comment-17840728
 ] 

Uroš Bojanić commented on SPARK-47353:
--

[~panbingkun] if you're looking to make some contributions to the collation 
effort, please check out this ticket and let me know if you want to claim it!

> Mode (all collations)
> -
>
> Key: SPARK-47353
> URL: https://issues.apache.org/jira/browse/SPARK-47353
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>
> Enable collation support for the *Mode* expression in Spark. First confirm 
> the expected behaviour for this expression when given collated strings, then 
> move on to the implementation that would enable handling strings of all 
> collation types. Implement the corresponding unit tests and 
> E2E SQL tests to reflect how this function should be used with collation in 
> SparkSQL, and feel free to use your chosen Spark SQL Editor to experiment 
> with the existing functions to learn more about how they work. In addition, 
> look into the possible use-cases and implementation of similar functions 
> within other open-source DBMS, such as 
> [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *Mode* expression so it 
> supports all collation types currently supported in Spark. To understand what 
> changes were introduced in order to enable full collation support for other 
> existing functions in Spark, take a look at the Spark PRs and Jira tickets 
> for completed tasks in this parent (for example: Contains, StartsWith, 
> EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47566) SubstringIndex

2024-04-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47566:
--

Assignee: Apache Spark

> SubstringIndex
> --
>
> Key: SPARK-47566
> URL: https://issues.apache.org/jira/browse/SPARK-47566
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Milan Dankovic
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> Enable collation support for the *SubstringIndex* built-in string function in 
> Spark. First confirm the expected behaviour for this function when given 
> collated strings, and then move on to implementation and testing. One way to 
> go about this is to consider using {_}StringSearch{_}, an efficient ICU 
> service for string matching. Implement the corresponding unit tests 
> (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect 
> how this function should be used with collation in SparkSQL, and feel free to 
> use your chosen Spark SQL Editor to experiment with the existing functions to 
> learn more about how they work. In addition, look into the possible use-cases 
> and implementation of similar functions within other open-source DBMS, 
> such as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *SubstringIndex* functions 
> so that they support all collation types currently supported in Spark. To 
> understand what changes were introduced in order to enable full collation 
> support for other existing functions in Spark, take a look at the Spark PRs 
> and Jira tickets for completed tasks in this parent (for example: Contains, 
> StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class, as well as _StringSearch_ using the 
> [ICU user 
> guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html]
>  and [ICU 
> docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html].
>  Also, refer to the Unicode Technical Standard for string 
> [searching|https://www.unicode.org/reports/tr10/#Searching] and 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].
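
For a feel of what {_}StringSearch{_} provides, here is a minimal,
self-contained Scala sketch of collation-aware matching with ICU4J, the kind of
building block a collation-aware substring_index could use (the locale and
strength choices are illustrative, not what the final implementation must
pick):

    import java.text.StringCharacterIterator

    import com.ibm.icu.text.{Collator, RuleBasedCollator, StringSearch}
    import com.ibm.icu.util.ULocale

    // PRIMARY strength makes the search ignore case (and accent) differences.
    val collator = Collator.getInstance(ULocale.ROOT).asInstanceOf[RuleBasedCollator]
    collator.setStrength(Collator.PRIMARY)

    val target = "www.Apache.Org"
    val search = new StringSearch("apache", new StringCharacterIterator(target), collator)

    // Index of the first collation-aware match, or StringSearch.DONE if none.
    val idx = search.first()  // 4 here: "Apache" matches "apache"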



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47566) SubstringIndex

2024-04-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47566:
--

Assignee: (was: Apache Spark)

> SubstringIndex
> --
>
> Key: SPARK-47566
> URL: https://issues.apache.org/jira/browse/SPARK-47566
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Milan Dankovic
>Priority: Major
>  Labels: pull-request-available
>
> Enable collation support for the *SubstringIndex* built-in string function in 
> Spark. First confirm the expected behaviour for this function when given 
> collated strings, and then move on to implementation and testing. One way to 
> go about this is to consider using {_}StringSearch{_}, an efficient ICU 
> service for string matching. Implement the corresponding unit tests 
> (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect 
> how this function should be used with collation in SparkSQL, and feel free to 
> use your chosen Spark SQL Editor to experiment with the existing functions to 
> learn more about how they work. In addition, look into the possible use-cases 
> and implementation of similar functions within other open-source DBMS, 
> such as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *SubstringIndex* functions 
> so that they support all collation types currently supported in Spark. To 
> understand what changes were introduced in order to enable full collation 
> support for other existing functions in Spark, take a look at the Spark PRs 
> and Jira tickets for completed tasks in this parent (for example: Contains, 
> StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class, as well as _StringSearch_ using the 
> [ICU user 
> guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html]
>  and [ICU 
> docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html].
>  Also, refer to the Unicode Technical Standard for string 
> [searching|https://www.unicode.org/reports/tr10/#Searching] and 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47623) Enable `QuietTest` in parity tests

2024-04-25 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-47623:
--
Summary: Enable `QuietTest` in parity tests  (was: Use `QuietTest` in 
parity tests)

> Enable `QuietTest` in parity tests
> --
>
> Key: SPARK-47623
> URL: https://issues.apache.org/jira/browse/SPARK-47623
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47982) Update code style plugins to latest version

2024-04-25 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-47982.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46216
[https://github.com/apache/spark/pull/46216]

> Update code style plugins to latest version
> 
>
> Key: SPARK-47982
> URL: https://issues.apache.org/jira/browse/SPARK-47982
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47986) [CONNECT][PYTHON] Unable to create a new session when the default session is closed by the server

2024-04-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47986:
---
Labels: pull-request-available  (was: )

> [CONNECT][PYTHON] Unable to create a new session when the default session is 
> closed by the server
> -
>
> Key: SPARK-47986
> URL: https://issues.apache.org/jira/browse/SPARK-47986
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Niranjan Jayakar
>Priority: Major
>  Labels: pull-request-available
>
> When the server closes a session, usually after a cluster restart, the client 
> is unaware of this until it receives an error.
> Once it does so, there is no way for the client to create a new session since 
> the stale sessions are still recorded as default and active sessions.
> The only solution currently is to restart the Python interpreter on the 
> client, or to reach into the session builder and change the active or default 
> session.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47987) Reenable `ArrowParityTests.test_createDataFrame_empty_partition`

2024-04-25 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-47987:
-

 Summary: Reenable 
`ArrowParityTests.test_createDataFrame_empty_partition`
 Key: SPARK-47987
 URL: https://issues.apache.org/jira/browse/SPARK-47987
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark, Tests
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47984) Change `MetricsAggregate/V2Aggregator`'s `serialize/deserialize` to call `SparkSerDeUtils`'s `serialize/deserialize` methods.

2024-04-25 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-47984.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46218
[https://github.com/apache/spark/pull/46218]

> Change `MetricsAggregate/V2Aggregator`'s `serialize/deserialize` to call 
> `SparkSerDeUtils`'s `serialize/deserialize` methods.
> -
>
> Key: SPARK-47984
> URL: https://issues.apache.org/jira/browse/SPARK-47984
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, SQL
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47984) Change `MetricsAggregate/V2Aggregator`'s `serialize/deserialize` to call `SparkSerDeUtils`'s `serialize/deserialize` methods.

2024-04-25 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-47984:


Assignee: Yang Jie

> Change `MetricsAggregate/V2Aggregator`'s `serialize/deserialize` to call 
> `SparkSerDeUtils`'s `serialize/deserialize` methods.
> -
>
> Key: SPARK-47984
> URL: https://issues.apache.org/jira/browse/SPARK-47984
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, SQL
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47970) Revisit skipped parity tests for PySpark Connect

2024-04-25 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-47970:
--
Summary: Revisit skipped parity tests for PySpark Connect  (was: Revisit 
skipped parity tests for PySpark)

> Revisit skipped parity tests for PySpark Connect
> 
>
> Key: SPARK-47970
> URL: https://issues.apache.org/jira/browse/SPARK-47970
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47983) Demote spark.sql.pyspark.legacy.inferArrayTypeFromFirstElement.enabled to internal

2024-04-25 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-47983.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46217
[https://github.com/apache/spark/pull/46217]

> Demote spark.sql.pyspark.legacy.inferArrayTypeFromFirstElement.enabled to 
> internal
> --
>
> Key: SPARK-47983
> URL: https://issues.apache.org/jira/browse/SPARK-47983
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47985) Simplify functions with `lit`

2024-04-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47985:
---
Labels: pull-request-available  (was: )

> Simplify functions with `lit`
> -
>
> Key: SPARK-47985
> URL: https://issues.apache.org/jira/browse/SPARK-47985
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47985) Simplify functions with `lit`

2024-04-25 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-47985:
-

 Summary: Simplify functions with `lit`
 Key: SPARK-47985
 URL: https://issues.apache.org/jira/browse/SPARK-47985
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47984) Change `MetricsAggregate/V2Aggregator`'s `serialize/deserialize` to call `SparkSerDeUtils`'s `serialize/deserialize` methods.

2024-04-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47984:
---
Labels: pull-request-available  (was: )

> Change `MetricsAggregate/V2Aggregator`'s `serialize/deserialize` to call 
> `SparkSerDeUtils`'s `serialize/deserialize` methods.
> -
>
> Key: SPARK-47984
> URL: https://issues.apache.org/jira/browse/SPARK-47984
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, SQL
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47984) Change `MetricsAggregate/V2Aggregator`'s `serialize/deserialize` to call `SparkSerDeUtils`'s `serialize/deserialize` methods.

2024-04-25 Thread Yang Jie (Jira)
Yang Jie created SPARK-47984:


 Summary: Change `MetricsAggregate/V2Aggregator`'s 
`serialize/deserialize` to call `SparkSerDeUtils`'s `serialize/deserialize` 
methods.
 Key: SPARK-47984
 URL: https://issues.apache.org/jira/browse/SPARK-47984
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, SQL
Affects Versions: 4.0.0
Reporter: Yang Jie






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47983) Demote spark.sql.pyspark.legacy.inferArrayTypeFromFirstElement.enabled to internal

2024-04-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47983:
---
Labels: pull-request-available  (was: )

> Demote spark.sql.pyspark.legacy.inferArrayTypeFromFirstElement.enabled to 
> internal
> --
>
> Key: SPARK-47983
> URL: https://issues.apache.org/jira/browse/SPARK-47983
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47983) Demote spark.sql.pyspark.legacy.inferArrayTypeFromFirstElement.enabled to internal

2024-04-25 Thread Kent Yao (Jira)
Kent Yao created SPARK-47983:


 Summary: Demote 
spark.sql.pyspark.legacy.inferArrayTypeFromFirstElement.enabled to internal
 Key: SPARK-47983
 URL: https://issues.apache.org/jira/browse/SPARK-47983
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org