[jira] [Resolved] (SPARK-48627) Perf improvement for binary to HEX_DISCRETE strings

2024-06-16 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-48627.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46984
[https://github.com/apache/spark/pull/46984]

> Perf improvement for binary to HEX_DISCRETE strings
> ---
>
> Key: SPARK-48627
> URL: https://issues.apache.org/jira/browse/SPARK-48627
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> {code:java}
> OpenJDK 64-Bit Server VM 17.0.10+0 on Mac OS X 14.5
> Apple M2 Max
> Cardinality 10:                       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
> ---------------------------------------------------------------------------------------------------------------------
> Spark                                         42210          43595        1207          0.0      422102.9       1.0X
> Java                                            238            243           2          0.4        2381.9     177.2X
> {code}
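For context on the kind of speedup measured above: a common way to close the gap between a generic formatter and hand-tuned code is a precomputed per-byte lookup table. The sketch below only illustrates that idea in Python; it is not the Spark patch, and `HEX_TABLE`, `to_hex_naive`, and `to_hex_table` are made-up names for the comparison.

{code:python}
# Illustrative sketch (not the Spark patch): encode bytes to a hex string with
# a precomputed lookup table versus per-byte string formatting.
import timeit

# Precompute the two-character hex digits for every possible byte value once.
HEX_TABLE = [format(b, "02x") for b in range(256)]

def to_hex_naive(data: bytes) -> str:
    # Formats each byte on the fly.
    return "".join(format(b, "02x") for b in data)

def to_hex_table(data: bytes) -> str:
    # Looks the digits up instead of formatting them.
    return "".join(HEX_TABLE[b] for b in data)

if __name__ == "__main__":
    payload = bytes(range(256)) * 64
    assert to_hex_naive(payload) == to_hex_table(payload) == payload.hex()
    for fn in (to_hex_naive, to_hex_table):
        print(fn.__name__, timeit.timeit(lambda: fn(payload), number=200))
{code}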






[jira] [Assigned] (SPARK-48627) Perf improvement for binary to HEX_DISCRETE strings

2024-06-16 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-48627:


Assignee: Kent Yao

> Perf improvement for binary to HEX_DISCRETE strings
> ---
>
> Key: SPARK-48627
> URL: https://issues.apache.org/jira/browse/SPARK-48627
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>
> {code:java}
> OpenJDK 64-Bit Server VM 17.0.10+0 on Mac OS X 14.5
> Apple M2 Max
> Cardinality 10:                       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
> ---------------------------------------------------------------------------------------------------------------------
> Spark                                         42210          43595        1207          0.0      422102.9       1.0X
> Java                                            238            243           2          0.4        2381.9     177.2X
> {code}






[jira] [Resolved] (SPARK-48577) Replace invalid byte sequences in UTF8Strings

2024-06-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48577.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46899
[https://github.com/apache/spark/pull/46899]

> Replace invalid byte sequences in UTF8Strings
> -
>
> Key: SPARK-48577
> URL: https://issues.apache.org/jira/browse/SPARK-48577
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Resolved] (SPARK-48633) Upgrade scalacheck to 1.18.0

2024-06-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48633.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46992
[https://github.com/apache/spark/pull/46992]

> Upgrade scalacheck to 1.18.0
> 
>
> Key: SPARK-48633
> URL: https://issues.apache.org/jira/browse/SPARK-48633
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Assignee: Wei Guo
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Resolved] (SPARK-48587) Avoid storage amplification when accessing sub-Variant

2024-06-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48587.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46941
[https://github.com/apache/spark/pull/46941]

> Avoid storage amplification when accessing sub-Variant
> --
>
> Key: SPARK-48587
> URL: https://issues.apache.org/jira/browse/SPARK-48587
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: David Cashman
>Assignee: David Cashman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> When a variant_get expression returns a Variant, or a nested type containing 
> Variant, we currently return the sub-slice of the Variant value along with the 
> full metadata, even though most of that metadata is unnecessary to represent 
> the value. This can be very inefficient if the value is then written to disk 
> (e.g. a shuffle file or Parquet). We should instead rebuild the value with 
> minimal metadata.
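To make the "rebuild with minimal metadata" idea concrete, here is a heavily simplified conceptual sketch. It does not model the real Variant binary encoding; `prune_metadata`, `iter_refs`, and `remap_refs` are hypothetical helpers that only illustrate rebuilding a dictionary so it contains just the keys a sub-value actually references.

{code:python}
# Conceptual sketch only: a value references dictionary entries by id and a
# shared metadata dictionary. Extracting a sub-value while keeping the full
# dictionary amplifies storage; rebuilding a minimal dictionary avoids that.

def iter_refs(value):
    # A "value" here is either a dictionary reference (int) or a list of values.
    if isinstance(value, int):
        yield value
    else:
        for v in value:
            yield from iter_refs(v)

def remap_refs(value, remap):
    if isinstance(value, int):
        return remap[value]
    return [remap_refs(v, remap) for v in value]

def prune_metadata(sub_value, full_dict):
    """Rebuild (value, metadata) so the metadata only contains keys the
    sub-value references; ids are remapped to the new dictionary."""
    used = sorted(set(iter_refs(sub_value)))
    remap = {old: new for new, old in enumerate(used)}
    return remap_refs(sub_value, remap), [full_dict[i] for i in used]

# Example: the sub-value uses 2 of 1000 dictionary entries.
full_dict = [f"key_{i}" for i in range(1000)]
sub_value = [3, [7, 3]]
value, metadata = prune_metadata(sub_value, full_dict)
print(value, metadata)  # [0, [1, 0]] ['key_3', 'key_7']
{code}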






[jira] [Updated] (SPARK-48640) Perf improvement for format hex from byte array

2024-06-16 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-48640:
-
Summary: Perf improvement for format hex from byte array  (was: Perf 
improvement for format hex)

> Perf improvement for format hex from byte array
> ---
>
> Key: SPARK-48640
> URL: https://issues.apache.org/jira/browse/SPARK-48640
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>







[jira] [Updated] (SPARK-48640) Perf improvement for format hex

2024-06-16 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-48640:
-
Priority: Minor  (was: Critical)

> Perf improvement for format hex
> ---
>
> Key: SPARK-48640
> URL: https://issues.apache.org/jira/browse/SPARK-48640
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>







[jira] [Updated] (SPARK-48640) Perf improvement for format hex

2024-06-16 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-48640:
-
Parent: SPARK-48624
Issue Type: Sub-task  (was: Improvement)

> Perf improvement for format hex
> ---
>
> Key: SPARK-48640
> URL: https://issues.apache.org/jira/browse/SPARK-48640
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>







[jira] [Created] (SPARK-48640) Perf improvement for format hex

2024-06-16 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-48640:
---

 Summary: Perf improvement for format hex
 Key: SPARK-48640
 URL: https://issues.apache.org/jira/browse/SPARK-48640
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: BingKun Pan









[jira] [Assigned] (SPARK-48615) Perf improvement for parsing hex string

2024-06-16 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-48615:


Assignee: Kent Yao

> Perf improvement for parsing hex string
> ---
>
> Key: SPARK-48615
> URL: https://issues.apache.org/jira/browse/SPARK-48615
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>
> {code:java}
> ================================================================================
> Hex Comparison
> ================================================================================
> OpenJDK 64-Bit Server VM 17.0.10+0 on Mac OS X 14.5
> Apple M2 Max
> Cardinality 100:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
> ---------------------------------------------------------------------------------------------------------------------
> Apache                                         5050           5100          86          0.2        5050.1       1.0X
> Spark                                          3822           3840          30          0.3        3821.6       1.3X
> Java                                           2462           2522          87          0.4        2462.1       2.1X
>
> OpenJDK 64-Bit Server VM 17.0.10+0 on Mac OS X 14.5
> Apple M2 Max
> Cardinality 200:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
> ---------------------------------------------------------------------------------------------------------------------
> Apache                                        10020          10828        1154          0.2        5010.1       1.0X
> Spark                                          6875           6966         144          0.3        3437.7       1.5X
> Java                                           4999           5092          89          0.4        2499.3       2.0X
>
> OpenJDK 64-Bit Server VM 17.0.10+0 on Mac OS X 14.5
> Apple M2 Max
> Cardinality 400:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
> ---------------------------------------------------------------------------------------------------------------------
> Apache                                        20090          20433         433          0.2        5022.5       1.0X
> Spark                                         13389          13620         229          0.3        3347.2       1.5X
> Java                                          10023          10069          42          0.4        2505.6       2.0X
>
> OpenJDK 64-Bit Server VM 17.0.10+0 on Mac OS X 14.5
> Apple M2 Max
> Cardinality 800:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
> ---------------------------------------------------------------------------------------------------------------------
> Apache                                        40277          43453        2755          0.2        5034.7       1.0X
> Spark                                         27145          27380         311          0.3        3393.1       1.5X
> Java                                          19980          21198        1473          0.4        2497.5       2.0X
> {code}
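As with the encoding benchmark for SPARK-48627, the gap here mostly comes down to how much per-character work is done. The sketch below illustrates the same precomputed-table idea for parsing hex strings, written in Python rather than the JVM code the benchmark actually measures; it is not the Spark implementation, and `DIGITS`, `unhex_naive`, and `unhex_table` are illustrative names.

{code:python}
# Illustrative sketch (not the Spark implementation): parse a hex string into
# bytes with a precomputed digit table versus calling int(str, 16) per pair.
import timeit

# Map each hex character (both cases) to its 4-bit value once, up front.
DIGITS = {c: int(c, 16) for c in "0123456789abcdefABCDEF"}

def unhex_naive(s: str) -> bytes:
    return bytes(int(s[i:i + 2], 16) for i in range(0, len(s), 2))

def unhex_table(s: str) -> bytes:
    return bytes((DIGITS[s[i]] << 4) | DIGITS[s[i + 1]]
                 for i in range(0, len(s), 2))

if __name__ == "__main__":
    sample = (bytes(range(256)) * 64).hex()
    assert unhex_naive(sample) == unhex_table(sample) == bytes.fromhex(sample)
    for fn in (unhex_naive, unhex_table):
        print(fn.__name__, timeit.timeit(lambda: fn(sample), number=200))
{code}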






[jira] [Resolved] (SPARK-48615) Perf improvement for parsing hex string

2024-06-16 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-48615.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46972
[https://github.com/apache/spark/pull/46972]

> Perf improvement for parsing hex string
> ---
>
> Key: SPARK-48615
> URL: https://issues.apache.org/jira/browse/SPARK-48615
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> {code:java}
> ================================================================================
> Hex Comparison
> ================================================================================
> OpenJDK 64-Bit Server VM 17.0.10+0 on Mac OS X 14.5
> Apple M2 Max
> Cardinality 100:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
> ---------------------------------------------------------------------------------------------------------------------
> Apache                                         5050           5100          86          0.2        5050.1       1.0X
> Spark                                          3822           3840          30          0.3        3821.6       1.3X
> Java                                           2462           2522          87          0.4        2462.1       2.1X
>
> OpenJDK 64-Bit Server VM 17.0.10+0 on Mac OS X 14.5
> Apple M2 Max
> Cardinality 200:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
> ---------------------------------------------------------------------------------------------------------------------
> Apache                                        10020          10828        1154          0.2        5010.1       1.0X
> Spark                                          6875           6966         144          0.3        3437.7       1.5X
> Java                                           4999           5092          89          0.4        2499.3       2.0X
>
> OpenJDK 64-Bit Server VM 17.0.10+0 on Mac OS X 14.5
> Apple M2 Max
> Cardinality 400:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
> ---------------------------------------------------------------------------------------------------------------------
> Apache                                        20090          20433         433          0.2        5022.5       1.0X
> Spark                                         13389          13620         229          0.3        3347.2       1.5X
> Java                                          10023          10069          42          0.4        2505.6       2.0X
>
> OpenJDK 64-Bit Server VM 17.0.10+0 on Mac OS X 14.5
> Apple M2 Max
> Cardinality 800:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
> ---------------------------------------------------------------------------------------------------------------------
> Apache                                        40277          43453        2755          0.2        5034.7       1.0X
> Spark                                         27145          27380         311          0.3        3393.1       1.5X
> Java                                          19980          21198        1473          0.4        2497.5       2.0X
> {code}






[jira] [Created] (SPARK-48639) Add Origin to RelationCommon in protobuf definition

2024-06-16 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-48639:


 Summary: Add Origin to RelationCommon in protobuf definition
 Key: SPARK-48639
 URL: https://issues.apache.org/jira/browse/SPARK-48639
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon


SPARK-48459 adds a new protobuf message for Origin. We should reuse that 
definition in `RelationCommon` as well.






[jira] [Assigned] (SPARK-48555) Support Column type for several SQL functions in scala and python

2024-06-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-48555:


Assignee: Ron Serruya

> Support Column type for several SQL functions in scala and python
> -
>
> Key: SPARK-48555
> URL: https://issues.apache.org/jira/browse/SPARK-48555
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, PySpark, Spark Core
>Affects Versions: 3.5.1
>Reporter: Ron Serruya
>Assignee: Ron Serruya
>Priority: Major
>  Labels: pull-request-available
>
> Currently, several SQL functions accept both native types and Columns, but 
> only accept native types in their Scala/Python APIs:
> * array_remove (works in SQL and Scala, not in Python)
> * array_position (works in SQL and Scala, not in Python)
> * map_contains_key (works in SQL and Scala, not in Python)
> * substring (works only in SQL)
> For example, this is possible in SQL:
> {code:python}
> spark.sql("select array_remove(col1, col2) from values(array(1,2,3), 2)")
> {code}
> But not in Python:
> {code:python}
> df.select(F.array_remove(F.col("col1"), F.col("col2")))
> {code}
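A self-contained sketch of both call paths is below. The last line assumes a Spark build that already includes this change (a Column accepted by the Python `array_remove`); on older versions only the SQL form works.

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.createDataFrame([([1, 2, 3], 2)], ["col1", "col2"])

# SQL has always accepted a column as the element to remove.
spark.sql("select array_remove(col1, col2) from values (array(1, 2, 3), 2)").show()

# With this change, the Python API accepts a Column for the element as well.
df.select(F.array_remove(F.col("col1"), F.col("col2"))).show()
{code}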






[jira] [Resolved] (SPARK-48555) Support Column type for several SQL functions in scala and python

2024-06-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48555.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46901
[https://github.com/apache/spark/pull/46901]

> Support Column type for several SQL functions in scala and python
> -
>
> Key: SPARK-48555
> URL: https://issues.apache.org/jira/browse/SPARK-48555
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, PySpark, Spark Core
>Affects Versions: 3.5.1
>Reporter: Ron Serruya
>Assignee: Ron Serruya
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently, several SQL functions accept both native types and Columns, but 
> only accept native types in their Scala/Python APIs:
> * array_remove (works in SQL and Scala, not in Python)
> * array_position (works in SQL and Scala, not in Python)
> * map_contains_key (works in SQL and Scala, not in Python)
> * substring (works only in SQL)
> For example, this is possible in SQL:
> {code:python}
> spark.sql("select array_remove(col1, col2) from values(array(1,2,3), 2)")
> {code}
> But not in Python:
> {code:python}
> df.select(F.array_remove(F.col("col1"), F.col("col2")))
> {code}






[jira] [Resolved] (SPARK-47777) Add spark connect test for python streaming data source

2024-06-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-47777.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46906
[https://github.com/apache/spark/pull/46906]

> Add spark connect test for python streaming data source
> ---
>
> Key: SPARK-47777
> URL: https://issues.apache.org/jira/browse/SPARK-47777
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, SS, Tests
>Affects Versions: 3.5.1
>Reporter: Chaoqin Li
>Assignee: Chaoqin Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Make the Python streaming data source PySpark tests also run on Spark Connect. 






[jira] [Assigned] (SPARK-47777) Add spark connect test for python streaming data source

2024-06-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-47777:


Assignee: Chaoqin Li

> Add spark connect test for python streaming data source
> ---
>
> Key: SPARK-47777
> URL: https://issues.apache.org/jira/browse/SPARK-47777
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, SS, Tests
>Affects Versions: 3.5.1
>Reporter: Chaoqin Li
>Assignee: Chaoqin Li
>Priority: Major
>  Labels: pull-request-available
>
> Make the Python streaming data source PySpark tests also run on Spark Connect. 






[jira] [Resolved] (SPARK-48597) Distinguish the streaming nodes from the text representation of logical plan

2024-06-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48597.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46953
[https://github.com/apache/spark/pull/46953]

> Distinguish the streaming nodes from the text representation of logical plan
> 
>
> Key: SPARK-48597
> URL: https://issues.apache.org/jira/browse/SPARK-48597
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> We had a hard time figuring out whether the nodes were streaming or not while 
> debugging https://issues.apache.org/jira/browse/SPARK-47305 .
> The text representation of the logical plan does not show the isStreaming 
> property, so we had to infer its value from other context. In addition, even 
> when the type of a leaf node is explicitly streaming, which lets us track down 
> isStreaming for a certain subtree, the plan can be very long and it is a 
> non-trivial effort to trace down to the leaf nodes. Also, if the leaf nodes 
> are omitted from the representation because of its size, there is no way to 
> recover the isStreaming information.
> We propose to introduce a streaming marker that will be shown in the text 
> representation of the logical plan. There is no concept of "isStreaming" in 
> the physical plan, so the change only needs to happen in the logical plan.
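For reference, this is the status quo the ticket describes, sketched in PySpark: the DataFrame itself knows whether it is streaming (`isStreaming`), but the printed plan text does not surface that property, so it has to be inferred from the leaf source nodes.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

batch_df = spark.range(10).selectExpr("id % 3 AS k", "id AS v")
stream_df = (spark.readStream.format("rate").load()
             .selectExpr("value % 3 AS k", "value AS v"))

print(batch_df.isStreaming, stream_df.isStreaming)  # False True

# Without the proposed marker, the two plan texts look structurally alike;
# the streaming property has to be inferred from the leaf source node.
batch_df.explain(extended=True)
stream_df.explain(extended=True)
{code}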






[jira] [Assigned] (SPARK-48597) Distinguish the streaming nodes from the text representation of logical plan

2024-06-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48597:
---

Assignee: Jungtaek Lim

> Distinguish the streaming nodes from the text representation of logical plan
> 
>
> Key: SPARK-48597
> URL: https://issues.apache.org/jira/browse/SPARK-48597
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
>  Labels: pull-request-available
>
> We had a hard time figuring out whether the nodes were streaming or not while 
> debugging https://issues.apache.org/jira/browse/SPARK-47305 .
> The text representation of the logical plan does not show the isStreaming 
> property, so we had to infer its value from other context. In addition, even 
> when the type of a leaf node is explicitly streaming, which lets us track down 
> isStreaming for a certain subtree, the plan can be very long and it is a 
> non-trivial effort to trace down to the leaf nodes. Also, if the leaf nodes 
> are omitted from the representation because of its size, there is no way to 
> recover the isStreaming information.
> We propose to introduce a streaming marker that will be shown in the text 
> representation of the logical plan. There is no concept of "isStreaming" in 
> the physical plan, so the change only needs to happen in the logical plan.






[jira] [Updated] (SPARK-48574) Fix support for StructTypes with collations

2024-06-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48574:
---
Labels: pull-request-available  (was: )

> Fix support for StructTypes with collations
> ---
>
> Key: SPARK-48574
> URL: https://issues.apache.org/jira/browse/SPARK-48574
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Mihailo Milosevic
>Priority: Major
>  Labels: pull-request-available
>
> While adding the expression walker it was noticed that StructType support is 
> broken. One problem is that `CollationsTypeCasts` inserts a cast into all 
> BinaryExpressions, which includes ExtractValue. Consequently, we are unable to 
> extract a value if we cast there, as ExtractValue only supports non-null 
> literals as extraction keys.






[jira] [Commented] (SPARK-48638) Native QueryExecution information for the dataframe

2024-06-16 Thread Sem Sinchenko (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855443#comment-17855443
 ] 

Sem Sinchenko commented on SPARK-48638:
---

I'm working on the implementation of that logic in PySpark Classic.

> Native QueryExecution information for the dataframe
> ---
>
> Key: SPARK-48638
> URL: https://issues.apache.org/jira/browse/SPARK-48638
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Martin Grund
>Priority: Major
>  Labels: pull-request-available
>
> Add a new property to `DataFrame` called `queryExecution` that returns a 
> class containing information about the query execution and its metrics.
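Until such a property exists, roughly the following is available today. The `_jdf` access is PySpark Classic's private py4j bridge (not available on Spark Connect) and is shown only as an illustration of the gap this ticket wants to close, not as a supported API.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.range(100).selectExpr("id % 10 AS k").groupBy("k").count()

# Public API today: plan text only, no structured metrics object.
df.explain(mode="formatted")

# PySpark Classic only: reach the JVM QueryExecution through the private
# py4j bridge (subject to change, not part of the public API).
qe = df._jdf.queryExecution()
print(qe.executedPlan().toString())
{code}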






[jira] [Updated] (SPARK-48638) Native QueryExecution information for the dataframe

2024-06-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48638:
---
Labels: pull-request-available  (was: )

> Native QueryExecution information for the dataframe
> ---
>
> Key: SPARK-48638
> URL: https://issues.apache.org/jira/browse/SPARK-48638
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Martin Grund
>Priority: Major
>  Labels: pull-request-available
>
> Add a new property to `DataFrame` called `queryExecution` that returns a 
> class containing information about the query execution and its metrics.






[jira] [Created] (SPARK-48638) Native QueryExecution information for the dataframe

2024-06-16 Thread Martin Grund (Jira)
Martin Grund created SPARK-48638:


 Summary: Native QueryExecution information for the dataframe
 Key: SPARK-48638
 URL: https://issues.apache.org/jira/browse/SPARK-48638
 Project: Spark
  Issue Type: Improvement
  Components: Connect, PySpark
Affects Versions: 4.0.0
Reporter: Martin Grund


Add a new property to `DataFrame` called `queryExecution` that returns a 
class containing information about the query execution and its metrics.






[jira] [Updated] (SPARK-48637) On-demand shuffle migration peer refresh during decommission

2024-06-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48637:
---
Labels: pull-request-available  (was: )

> On-demand shuffle migration peer refresh during decommission
> 
>
> Key: SPARK-48637
> URL: https://issues.apache.org/jira/browse/SPARK-48637
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.3, 3.2.4, 4.0.0, 3.5.1, 3.3.4, 3.4.3
>Reporter: wuyi
>Priority: Major
>  Labels: pull-request-available
>
> Currently, the shuffle migration peers are refreshed every 30s by default. It 
> would be more efficient to refresh the peers immediately once a peer is 
> aborted.
> (Strictly speaking, we only wait up to 30s for a new peer when there are no 
> queued peers (i.e., `ShuffleMigrationRunnable`) left in the 
> `shuffleMigrationPool`.)






[jira] [Created] (SPARK-48637) On-demand shuffle migration peer refresh during decommission

2024-06-16 Thread wuyi (Jira)
wuyi created SPARK-48637:


 Summary: On-demand shuffle migration peer refresh during 
decommission
 Key: SPARK-48637
 URL: https://issues.apache.org/jira/browse/SPARK-48637
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.4.3, 3.3.4, 3.5.1, 3.2.4, 3.1.3, 4.0.0
Reporter: wuyi


Currently, the shuffle migration peers are refreshed every 30s by default. It 
would be more efficient to refresh the peers immediately once a peer is 
aborted.

(Strictly speaking, we only wait up to 30s for a new peer when there are no 
queued peers (i.e., `ShuffleMigrationRunnable`) left in the `shuffleMigrationPool`.)






[jira] [Created] (SPARK-48636) Event driven block manager decommissioner

2024-06-16 Thread wuyi (Jira)
wuyi created SPARK-48636:


 Summary: Event driven block manager decommissioner
 Key: SPARK-48636
 URL: https://issues.apache.org/jira/browse/SPARK-48636
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.3.4, 3.5.1, 3.2.4, 3.1.3, 4.0.0
Reporter: wuyi


The current block manager decommissioner uses periodic threads to refresh 
blocks/peers and monitor progress, which can be inefficient. For example, in 
the worst case it takes up to 30s (by default) for an executor to exit even 
after all of its blocks have been migrated, because the migration status is 
only checked every 30s (by default).

So this ticket proposes making the block manager decommissioner event driven 
to improve its efficiency.
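As a Spark-independent illustration of the proposed change, the sketch below contrasts a fixed-interval poller (which can sit idle for up to a full interval after the work is done) with a waiter that reacts to an event as soon as it fires. All names are made up; this is not Spark code.

{code:python}
# Generic illustration (not Spark code): a poller notices completion only at
# the next periodic check, while an event-driven waiter reacts immediately.
import threading
import time

POLL_INTERVAL_S = 30.0

def polling_waiter(done_flag: list) -> None:
    while not done_flag[0]:
        time.sleep(POLL_INTERVAL_S)      # worst case: ~30s of extra wait
    print("polling waiter: all blocks migrated")

def event_driven_waiter(done_event: threading.Event) -> None:
    done_event.wait()                    # wakes up as soon as the event fires
    print("event-driven waiter: all blocks migrated")

if __name__ == "__main__":
    done_event = threading.Event()
    t = threading.Thread(target=event_driven_waiter, args=(done_event,))
    t.start()
    time.sleep(1.0)                      # pretend migration finishes here
    done_event.set()
    t.join()
{code}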






[jira] [Assigned] (SPARK-48634) Avoid statically initialize threadpool at ExecutePlanResponseReattachableIterator

2024-06-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-48634:
--

Assignee: Apache Spark

> Avoid statically initialize threadpool at 
> ExecutePlanResponseReattachableIterator
> -
>
> Key: SPARK-48634
> URL: https://issues.apache.org/jira/browse/SPARK-48634
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> Avoid having ExecutePlanResponseReattachableIterator._release_thread_pool 
> eagerly initialize a ThreadPool, which might otherwise be dragged in during 
> pickling.
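A minimal, generic Python sketch of the pattern the fix implies: create the thread pool lazily on first use instead of eagerly as object state, and drop it when pickling. The class and attribute names below are illustrative, not the actual PySpark internals.

{code:python}
# Minimal sketch, assuming nothing about the real PySpark class layout:
# a lazily created ThreadPool that is never pickled along with its owner.
from multiprocessing.pool import ThreadPool

class ResponseIterator:
    def __init__(self):
        self._pool = None                  # no pool created at construction

    @property
    def pool(self) -> ThreadPool:
        if self._pool is None:             # created only when first needed
            self._pool = ThreadPool(2)
        return self._pool

    def release_async(self, task, arg):
        return self.pool.apply_async(task, (arg,))

    def __getstate__(self):
        # Drop the unpicklable pool when the object itself is pickled.
        state = self.__dict__.copy()
        state["_pool"] = None
        return state
{code}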






[jira] [Assigned] (SPARK-48634) Avoid statically initialize threadpool at ExecutePlanResponseReattachableIterator

2024-06-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-48634:
--

Assignee: (was: Apache Spark)

> Avoid statically initialize threadpool at 
> ExecutePlanResponseReattachableIterator
> -
>
> Key: SPARK-48634
> URL: https://issues.apache.org/jira/browse/SPARK-48634
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>
> Avoid having ExecutePlanResponseReattachableIterator._release_thread_pool 
> eagerly initialize a ThreadPool, which might otherwise be dragged in during 
> pickling.


