[jira] [Commented] (SPARK-43031) Enable tests for Python streaming spark-connect

2023-04-10 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17710401#comment-17710401
 ] 

Hudson commented on SPARK-43031:


User 'WweiL' has created a pull request for this issue:
https://github.com/apache/spark/pull/40691

> Enable tests for Python streaming spark-connect
> ---
>
> Key: SPARK-43031
> URL: https://issues.apache.org/jira/browse/SPARK-43031
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Raghu Angadi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43093) Test case "Add a directory when spark.sql.legacy.addSingleFileInAddFile set to false" should use random directories for testing

2023-04-10 Thread Yang Jie (Jira)
Yang Jie created SPARK-43093:


 Summary: Test case "Add a directory when 
spark.sql.legacy.addSingleFileInAddFile set to false" should use random 
directories for testing
 Key: SPARK-43093
 URL: https://issues.apache.org/jira/browse/SPARK-43093
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 3.3.2, 3.2.3, 3.4.0, 3.5.0
Reporter: Yang Jie
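As a sketch of what the title proposes: using a fresh random directory per test run (instead of a fixed path) avoids collisions between parallel runs and leftover state from earlier runs. This is a minimal illustration with Python's `tempfile`, not the actual Scala test code; the helper name is hypothetical.

```python
import os
import shutil
import tempfile

def with_random_dir(test_body):
    """Run a test body against a freshly created random directory.

    A random directory per invocation means two test runs can never see
    each other's files, and nothing persists after cleanup.
    """
    path = tempfile.mkdtemp(prefix="spark-add-dir-test-")
    try:
        return test_body(path)
    finally:
        shutil.rmtree(path, ignore_errors=True)

# Two invocations get distinct directories, both removed afterwards.
seen = []
with_random_dir(lambda p: seen.append(p))
with_random_dir(lambda p: seen.append(p))
```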









[jira] [Created] (SPARK-43092) Clean up unsupported function `dropDuplicatesWithinWatermark` from `Dataset`

2023-04-10 Thread Yang Jie (Jira)
Yang Jie created SPARK-43092:


 Summary: Clean up unsupported function 
`dropDuplicatesWithinWatermark` from `Dataset`
 Key: SPARK-43092
 URL: https://issues.apache.org/jira/browse/SPARK-43092
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 3.5.0
Reporter: Yang Jie









[jira] [Commented] (SPARK-43088) Respect RequiresDistributionAndOrdering in CTAS/RTAS

2023-04-10 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17710386#comment-17710386
 ] 

Snoot.io commented on SPARK-43088:
--

User 'aokolnychyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/40734

> Respect RequiresDistributionAndOrdering in CTAS/RTAS
> 
>
> Key: SPARK-43088
> URL: https://issues.apache.org/jira/browse/SPARK-43088
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> We must respect {{RequiresDistributionAndOrdering}} writes constructed for 
> CTAS/RTAS.






[jira] [Commented] (SPARK-43033) Avoid task retries due to AssertNotNull checks

2023-04-10 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17710385#comment-17710385
 ] 

Snoot.io commented on SPARK-43033:
--

User 'clownxc' has created a pull request for this issue:
https://github.com/apache/spark/pull/40707

> Avoid task retries due to AssertNotNull checks
> --
>
> Key: SPARK-43033
> URL: https://issues.apache.org/jira/browse/SPARK-43033
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> As discussed 
> [here|https://github.com/apache/spark/pull/40655#discussion_r1156693696], 
> tasks that failed because of exceptions generated by {{AssertNotNull}} should 
> not be retried.
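A minimal sketch of the idea: failures from `AssertNotNull` are deterministic (the same row will fail again), so a retry loop should fail fast on them instead of burning retries. This is illustrative Python, not Spark's scheduler code; the class and function names are hypothetical.

```python
class AssertNotNullError(Exception):
    """Stand-in for the error AssertNotNull raises on a null value."""

def run_task_with_retries(task, max_retries=3,
                          non_retryable=(AssertNotNullError,)):
    """Retry transient failures, but fail fast on deterministic ones."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return task()
        except non_retryable:
            raise  # deterministic failure: retrying would waste cluster time
        except Exception:
            if attempts > max_retries:
                raise

calls = {"n": 0}

def bad_task():
    calls["n"] += 1
    raise AssertNotNullError("null value in non-nullable field")

try:
    run_task_with_retries(bad_task)
except AssertNotNullError:
    pass
```

With the classification in place the task body runs exactly once, where a naive loop would have run it four times.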






[jira] [Commented] (SPARK-43089) Redact debug string in UI

2023-04-10 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17710384#comment-17710384
 ] 

Snoot.io commented on SPARK-43089:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/40733

> Redact debug string in UI
> -
>
> Key: SPARK-43089
> URL: https://issues.apache.org/jira/browse/SPARK-43089
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 3.4.1
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.5.0
>
>
> https://github.com/apache/spark/pull/40603 exposes all data without 
> redaction. We should redact it.
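To illustrate the kind of redaction meant here: Spark's `spark.redaction.regex` setting matches sensitive-looking key names (secret, password, token) and replaces their values before display. The sketch below is a loose Python model of that behavior, not the actual implementation.

```python
import re

# Pattern loosely modeled on the spark.redaction.regex default, which
# matches key names containing "secret", "password", or "token".
REDACTION_PATTERN = re.compile(r"(?i)secret|password|token")

def redact(pairs):
    """Replace values whose keys look sensitive before showing them in a UI."""
    return [(k, "*********(redacted)" if REDACTION_PATTERN.search(k) else v)
            for k, v in pairs]

debug_info = [("spark.app.name", "demo"), ("my.api.password", "hunter2")]
redacted = redact(debug_info)
```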






[jira] [Resolved] (SPARK-43089) Redact debug string in UI

2023-04-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43089.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40733
[https://github.com/apache/spark/pull/40733]

> Redact debug string in UI
> -
>
> Key: SPARK-43089
> URL: https://issues.apache.org/jira/browse/SPARK-43089
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 3.4.1
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.5.0
>
>
> https://github.com/apache/spark/pull/40603 exposes all data without 
> redaction. We should redact it.






[jira] [Assigned] (SPARK-43089) Redact debug string in UI

2023-04-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-43089:


Assignee: Hyukjin Kwon

> Redact debug string in UI
> -
>
> Key: SPARK-43089
> URL: https://issues.apache.org/jira/browse/SPARK-43089
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 3.4.1
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> https://github.com/apache/spark/pull/40603 exposes all data without 
> redaction. We should redact it.






[jira] [Commented] (SPARK-42916) JDBCCatalog Keep Char/Varchar meta information on the read-side

2023-04-10 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17710382#comment-17710382
 ] 

Snoot.io commented on SPARK-42916:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/40543

> JDBCCatalog Keep Char/Varchar meta information on the read-side
> ---
>
> Key: SPARK-42916
> URL: https://issues.apache.org/jira/browse/SPARK-42916
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Kent Yao
>Priority: Major
>
> Fix errors like:
> string cannot be cast to varchar(20)
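To make the error concrete: a `varchar(n)` type only makes sense if its length metadata survives to the read side, where values can be checked against it. The sketch below models that check in plain Python; it is illustrative only, and `VarcharType` here is a hypothetical stand-in, not Spark's class.

```python
class VarcharType:
    """Minimal stand-in for a VARCHAR(n) type that keeps its length metadata."""

    def __init__(self, length):
        self.length = length

    def cast(self, value):
        # If the length metadata were lost (read back as plain string),
        # this check could not be performed at all.
        if len(value) > self.length:
            raise ValueError(f"string cannot be cast to varchar({self.length})")
        return value

vc = VarcharType(20)
ok = vc.cast("short enough")

err = None
try:
    vc.cast("x" * 21)
except ValueError as e:
    err = str(e)
```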






[jira] [Commented] (SPARK-43039) Support custom fields in the file source _metadata column

2023-04-10 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17710381#comment-17710381
 ] 

Snoot.io commented on SPARK-43039:
--

User 'ryan-johnson-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/40677

> Support custom fields in the file source _metadata column
> -
>
> Key: SPARK-43039
> URL: https://issues.apache.org/jira/browse/SPARK-43039
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Ryan Johnson
>Priority: Major
>
> Today, the schema of the file source _metadata column depends on the file 
> format (e.g. parquet file format supports {{{}_metadata.row_index{}}}) but 
> this is hard-wired into the {{FileFormat}} itself. Not only is this an ugly 
> design, it also prevents custom file formats from adding their own fields to 
> the {{_metadata}} column.
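The design being proposed can be sketched as each format contributing its own `_metadata` fields through an overridable hook, rather than the base class hard-wiring per-format knowledge. The Python below is a toy model of that shape, not Spark's `FileFormat` API; `CustomFileFormat` and `shard_id` are invented for illustration.

```python
class FileFormat:
    """Toy model: each format contributes its own _metadata fields."""

    BASE_METADATA = ("file_path", "file_size", "file_modification_time")

    def metadata_fields(self):
        return list(self.BASE_METADATA)

class ParquetFileFormat(FileFormat):
    def metadata_fields(self):
        # Parquet can additionally expose the row index within the file.
        return super().metadata_fields() + ["row_index"]

class CustomFileFormat(FileFormat):
    def metadata_fields(self):
        # A third-party format adds its own field the same way, with no
        # change to the base class.
        return super().metadata_fields() + ["shard_id"]

parquet_fields = ParquetFileFormat().metadata_fields()
custom_fields = CustomFileFormat().metadata_fields()
```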






[jira] [Resolved] (SPARK-43077) Improve the error message of UNRECOGNIZED_SQL_TYPE

2023-04-10 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-43077.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40718
[https://github.com/apache/spark/pull/40718]

> Improve the error message of UNRECOGNIZED_SQL_TYPE
> --
>
> Key: SPARK-43077
> URL: https://issues.apache.org/jira/browse/SPARK-43077
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.5.0
>
>
> UNRECOGNIZED_SQL_TYPE currently prints only the raw JDBC type id in the error 
> message. This makes it difficult for Spark users to understand this kind of 
> error, especially when the type id comes from a vendor extension.
> For example, 
> {code:java}
>  org.apache.spark.SparkSQLException: Unrecognized SQL type -102{code}
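A sketch of the kind of improvement meant here: resolve the JDBC type id to its `java.sql.Types` constant name where possible and include both in the message. The mapping and message format below are illustrative, not Spark's actual implementation (the ids shown are real `java.sql.Types` values).

```python
# A few entries from java.sql.Types; a real implementation would cover
# the full set and could not name vendor-extension ids like -102.
JDBC_TYPE_NAMES = {4: "INTEGER", 12: "VARCHAR", 93: "TIMESTAMP"}

def unrecognized_sql_type_message(type_id):
    """Include the type name (when known) alongside the raw id."""
    name = JDBC_TYPE_NAMES.get(type_id, "UNKNOWN")
    return f"Unrecognized SQL type - name: {name}, id: {type_id}"

vendor_msg = unrecognized_sql_type_message(-102)
known_msg = unrecognized_sql_type_message(93)
```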






[jira] [Assigned] (SPARK-43077) Improve the error message of UNRECOGNIZED_SQL_TYPE

2023-04-10 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-43077:


Assignee: Kent Yao

> Improve the error message of UNRECOGNIZED_SQL_TYPE
> --
>
> Key: SPARK-43077
> URL: https://issues.apache.org/jira/browse/SPARK-43077
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> UNRECOGNIZED_SQL_TYPE currently prints only the raw JDBC type id in the error 
> message. This makes it difficult for Spark users to understand this kind of 
> error, especially when the type id comes from a vendor extension.
> For example, 
> {code:java}
>  org.apache.spark.SparkSQLException: Unrecognized SQL type -102{code}






[jira] [Created] (SPARK-43091) Support overloading UDF

2023-04-10 Thread Hang Wu (Jira)
Hang Wu created SPARK-43091:
---

 Summary: Support overloading UDF
 Key: SPARK-43091
 URL: https://issues.apache.org/jira/browse/SPARK-43091
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.2
Reporter: Hang Wu


It seems that Spark SQL has not supported UDF overloading for a long while. If 
we register two functions with the same name, Spark warns that the function 
"replaced a previously registered function". The solution is either to enhance 
the org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry class to 
support multiple functions with the same name, or to enable users to extend and 
use their own FunctionRegistry class. Should you have any comment, please 
kindly let me know.
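The first option suggested above can be sketched as a registry keyed by (name, arity) instead of name alone, so two registrations with the same name but different argument counts coexist. This is a toy Python model of the idea, not Spark's `SimpleFunctionRegistry`; a real implementation would also need overload resolution at analysis time, including by argument types.

```python
class OverloadableFunctionRegistry:
    """Toy registry keyed by (name, arity) so same-name UDFs can coexist."""

    def __init__(self):
        self._functions = {}

    def register(self, name, arity, fn):
        # Registering the same (name, arity) twice still replaces, but a
        # different arity under the same name is a new entry, not a clash.
        self._functions[(name, arity)] = fn

    def lookup(self, name, args):
        fn = self._functions.get((name, len(args)))
        if fn is None:
            raise KeyError(f"no function {name}/{len(args)} registered")
        return fn(*args)

reg = OverloadableFunctionRegistry()
reg.register("my_concat", 2, lambda a, b: a + b)
reg.register("my_concat", 3, lambda a, b, c: a + b + c)
two = reg.lookup("my_concat", ["a", "b"])
three = reg.lookup("my_concat", ["a", "b", "c"])
```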






[jira] [Resolved] (SPARK-43090) Move withTable from RemoteSparkSession to SQLHelper

2023-04-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-43090.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40723
[https://github.com/apache/spark/pull/40723]

> Move withTable from RemoteSparkSession to SQLHelper
> ---
>
> Key: SPARK-43090
> URL: https://issues.apache.org/jira/browse/SPARK-43090
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, Tests
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.5.0
>
>







[jira] [Assigned] (SPARK-43090) Move withTable from RemoteSparkSession to SQLHelper

2023-04-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-43090:
-

Assignee: Yang Jie

> Move withTable from RemoteSparkSession to SQLHelper
> ---
>
> Key: SPARK-43090
> URL: https://issues.apache.org/jira/browse/SPARK-43090
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, Tests
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>







[jira] [Created] (SPARK-43090) Move withTable from RemoteSparkSession to SQLHelper

2023-04-10 Thread Yang Jie (Jira)
Yang Jie created SPARK-43090:


 Summary: Move withTable from RemoteSparkSession to SQLHelper
 Key: SPARK-43090
 URL: https://issues.apache.org/jira/browse/SPARK-43090
 Project: Spark
  Issue Type: Improvement
  Components: Connect, Tests
Affects Versions: 3.5.0
Reporter: Yang Jie









[jira] [Updated] (SPARK-43089) Redact debug string in UI

2023-04-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-43089:
-
Affects Version/s: 3.4.1
   (was: 3.4.0)

> Redact debug string in UI
> -
>
> Key: SPARK-43089
> URL: https://issues.apache.org/jira/browse/SPARK-43089
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 3.4.1
>Reporter: Hyukjin Kwon
>Priority: Major
>
> https://github.com/apache/spark/pull/40603 exposes all data without 
> redaction. We should redact it.






[jira] [Created] (SPARK-43089) Redact debug string in UI

2023-04-10 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-43089:


 Summary: Redact debug string in UI
 Key: SPARK-43089
 URL: https://issues.apache.org/jira/browse/SPARK-43089
 Project: Spark
  Issue Type: Improvement
  Components: Connect, PySpark
Affects Versions: 3.4.0
Reporter: Hyukjin Kwon


https://github.com/apache/spark/pull/40603 exposes all data without redaction. 
We should redact it.






[jira] [Created] (SPARK-43088) Respect RequiresDistributionAndOrdering in CTAS/RTAS

2023-04-10 Thread Anton Okolnychyi (Jira)
Anton Okolnychyi created SPARK-43088:


 Summary: Respect RequiresDistributionAndOrdering in CTAS/RTAS
 Key: SPARK-43088
 URL: https://issues.apache.org/jira/browse/SPARK-43088
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Anton Okolnychyi


We must respect {{RequiresDistributionAndOrdering}} writes constructed for 
CTAS/RTAS.






[jira] [Updated] (SPARK-43085) Fix bug in column DEFAULT assignment for target tables with multi-part names

2023-04-10 Thread Daniel (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel updated SPARK-43085:
---
Summary: Fix bug in column DEFAULT assignment for target tables with 
multi-part names  (was: Fix bug in column DEFAULT assignment for target tables 
with three-part names)

> Fix bug in column DEFAULT assignment for target tables with multi-part names
> 
>
> Key: SPARK-43085
> URL: https://issues.apache.org/jira/browse/SPARK-43085
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Priority: Major
>
> To reproduce:
> {{CREATE DATABASE If NOT EXISTS main.codydemos;}}
> {{CREATE OR REPLACE TABLE main.codydemos.test_ts (Id INT, ts timestamp);}}
> {{CREATE OR REPLACE TABLE main.codydemos.test_ts_other (ts timestamp);}}
> {{INSERT INTO main.codydemos.test_ts(ts) VALUES (current_timestamp());}}
> {{SELECT * FROM main.codydemos.test_s}}
>  






[jira] [Updated] (SPARK-43085) Fix bug in column DEFAULT assignment for target tables with multi-part names

2023-04-10 Thread Daniel (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel updated SPARK-43085:
---
Description: (was: To reproduce:

{{CREATE DATABASE If NOT EXISTS main.codydemos;}}

{{CREATE OR REPLACE TABLE main.codydemos.test_ts (Id INT, ts timestamp);}}

{{CREATE OR REPLACE TABLE main.codydemos.test_ts_other (ts timestamp);}}

{{INSERT INTO main.codydemos.test_ts(ts) VALUES (current_timestamp());}}

{{SELECT * FROM main.codydemos.test_s}}

 )

> Fix bug in column DEFAULT assignment for target tables with multi-part names
> 
>
> Key: SPARK-43085
> URL: https://issues.apache.org/jira/browse/SPARK-43085
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Priority: Major
>







[jira] [Commented] (SPARK-42382) Upgrade `cyclonedx-maven-plugin` to 2.7.6

2023-04-10 Thread Mike K (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17710355#comment-17710355
 ] 

Mike K commented on SPARK-42382:


User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/40726

> Upgrade `cyclonedx-maven-plugin` to 2.7.6
> -
>
> Key: SPARK-42382
> URL: https://issues.apache.org/jira/browse/SPARK-42382
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.5.0
>
>
> [https://github.com/CycloneDX/cyclonedx-maven-plugin/releases/tag/cyclonedx-maven-plugin-2.7.4]
> [https://github.com/CycloneDX/cyclonedx-maven-plugin/releases/tag/cyclonedx-maven-plugin-2.7.5]
>  






[jira] [Assigned] (SPARK-42951) Spark Connect: Streaming DataStreamReader API except table()

2023-04-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-42951:


Assignee: Wei Liu

> Spark Connect: Streaming DataStreamReader API except table()
> 
>
> Key: SPARK-42951
> URL: https://issues.apache.org/jira/browse/SPARK-42951
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Wei Liu
>Assignee: Wei Liu
>Priority: Major
>







[jira] [Resolved] (SPARK-42951) Spark Connect: Streaming DataStreamReader API except table()

2023-04-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-42951.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40689
[https://github.com/apache/spark/pull/40689]

> Spark Connect: Streaming DataStreamReader API except table()
> 
>
> Key: SPARK-42951
> URL: https://issues.apache.org/jira/browse/SPARK-42951
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Wei Liu
>Assignee: Wei Liu
>Priority: Major
> Fix For: 3.5.0
>
>







[jira] [Resolved] (SPARK-42382) Upgrade `cyclonedx-maven-plugin` to 2.7.6

2023-04-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-42382.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40726
[https://github.com/apache/spark/pull/40726]

> Upgrade `cyclonedx-maven-plugin` to 2.7.6
> -
>
> Key: SPARK-42382
> URL: https://issues.apache.org/jira/browse/SPARK-42382
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.5.0
>
>
> [https://github.com/CycloneDX/cyclonedx-maven-plugin/releases/tag/cyclonedx-maven-plugin-2.7.4]
> [https://github.com/CycloneDX/cyclonedx-maven-plugin/releases/tag/cyclonedx-maven-plugin-2.7.5]
>  






[jira] [Updated] (SPARK-43083) Mark `*StateStoreSuite` as `ExtendedSQLTest`

2023-04-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-43083:
--
Affects Version/s: 3.4.0
   (was: 3.5.0)

> Mark `*StateStoreSuite` as `ExtendedSQLTest`
> 
>
> Key: SPARK-43083
> URL: https://issues.apache.org/jira/browse/SPARK-43083
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.4.1
>
>







[jira] [Updated] (SPARK-43083) Mark `*StateStoreSuite` as `ExtendedSQLTest`

2023-04-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-43083:
--
Fix Version/s: 3.4.1
   (was: 3.4.0)

> Mark `*StateStoreSuite` as `ExtendedSQLTest`
> 
>
> Key: SPARK-43083
> URL: https://issues.apache.org/jira/browse/SPARK-43083
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.4.1
>
>







[jira] [Resolved] (SPARK-43083) Mark `*StateStoreSuite` as `ExtendedSQLTest`

2023-04-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-43083.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 40727
[https://github.com/apache/spark/pull/40727]

> Mark `*StateStoreSuite` as `ExtendedSQLTest`
> 
>
> Key: SPARK-43083
> URL: https://issues.apache.org/jira/browse/SPARK-43083
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Assigned] (SPARK-43083) Mark `*StateStoreSuite` as `ExtendedSQLTest`

2023-04-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-43083:
-

Assignee: Dongjoon Hyun

> Mark `*StateStoreSuite` as `ExtendedSQLTest`
> 
>
> Key: SPARK-43083
> URL: https://issues.apache.org/jira/browse/SPARK-43083
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>







[jira] [Created] (SPARK-43087) Support coalesce buckets in join in AQE

2023-04-10 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-43087:
---

 Summary: Support coalesce buckets in join in AQE
 Key: SPARK-43087
 URL: https://issues.apache.org/jira/browse/SPARK-43087
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Yuming Wang









[jira] [Resolved] (SPARK-43071) Support SELECT DEFAULT with ORDER BY, LIMIT, OFFSET for INSERT source relation

2023-04-10 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-43071.

Fix Version/s: 3.4.1
   Resolution: Fixed

Issue resolved by pull request 40710
[https://github.com/apache/spark/pull/40710]

> Support SELECT DEFAULT with ORDER BY, LIMIT, OFFSET for INSERT source relation
> --
>
> Key: SPARK-43071
> URL: https://issues.apache.org/jira/browse/SPARK-43071
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
> Fix For: 3.4.1
>
>







[jira] [Assigned] (SPARK-43071) Support SELECT DEFAULT with ORDER BY, LIMIT, OFFSET for INSERT source relation

2023-04-10 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-43071:
--

Assignee: Daniel

> Support SELECT DEFAULT with ORDER BY, LIMIT, OFFSET for INSERT source relation
> --
>
> Key: SPARK-43071
> URL: https://issues.apache.org/jira/browse/SPARK-43071
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
>







[jira] [Created] (SPARK-43086) Support bin pack task scheduling on executors

2023-04-10 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-43086:


 Summary: Support bin pack task scheduling on executors 
 Key: SPARK-43086
 URL: https://issues.apache.org/jira/browse/SPARK-43086
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.3.2
Reporter: Zhongwei Zhu


Dynamic allocation can only remove or decommission idle executors, and the 
default task scheduler assigns tasks to executors round-robin.

For example, suppose we have 4 tasks to run and 4 executors, each with 4 CPU 
cores. Default scheduling assigns 1 task per executor. With bin packing, one 
executor could be assigned all 4 tasks, and dynamic allocation could then 
remove the other 3 executors to reduce resource waste.
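The example above can be sketched as two assignment policies over the same workload. This is a simplified Python model (ignoring locality, task sizes, and multi-core slots per wave), not Spark's scheduler; the function and executor names are illustrative.

```python
def round_robin(num_tasks, executors):
    """Spread tasks across executors one at a time (default behaviour)."""
    assignment = {e: 0 for e in executors}
    for i in range(num_tasks):
        assignment[executors[i % len(executors)]] += 1
    return assignment

def bin_pack(num_tasks, executors, cores_per_executor):
    """Fill one executor to capacity before moving to the next."""
    assignment = {e: 0 for e in executors}
    remaining = num_tasks
    for e in executors:
        take = min(remaining, cores_per_executor)
        assignment[e] = take
        remaining -= take
    return assignment

executors = ["exec-1", "exec-2", "exec-3", "exec-4"]
rr = round_robin(4, executors)
bp = bin_pack(4, executors, cores_per_executor=4)

# Executors left idle under bin packing are candidates for removal by
# dynamic allocation.
idle = [e for e, n in bp.items() if n == 0]
```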






[jira] [Updated] (SPARK-43085) Fix bug in column DEFAULT assignment for target tables with three-part names

2023-04-10 Thread Daniel (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel updated SPARK-43085:
---
Description: 
To reproduce:

{{CREATE DATABASE If NOT EXISTS main.codydemos;}}

{{CREATE OR REPLACE TABLE main.codydemos.test_ts (Id INT, ts timestamp);}}

{{CREATE OR REPLACE TABLE main.codydemos.test_ts_other (ts timestamp);}}

{{INSERT INTO main.codydemos.test_ts(ts) VALUES (current_timestamp());}}

{{SELECT * FROM main.codydemos.test_s}}

 

  was:
To reproduce:

 

{{CREATE DATABASE If NOT EXISTS main.codydemos;}}

{{CREATE OR REPLACE TABLE main.codydemos.test_ts (Id INT, ts timestamp);}}

{{CREATE OR REPLACE TABLE main.codydemos.test_ts_other (ts timestamp);}}

{{INSERT INTO main.codydemos.test_ts(ts) VALUES (current_timestamp());}}

{{SELECT * FROM main.codydemos.test_s}}

 


> Fix bug in column DEFAULT assignment for target tables with three-part names
> 
>
> Key: SPARK-43085
> URL: https://issues.apache.org/jira/browse/SPARK-43085
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Priority: Major
>
> To reproduce:
> {{CREATE DATABASE If NOT EXISTS main.codydemos;}}
> {{CREATE OR REPLACE TABLE main.codydemos.test_ts (Id INT, ts timestamp);}}
> {{CREATE OR REPLACE TABLE main.codydemos.test_ts_other (ts timestamp);}}
> {{INSERT INTO main.codydemos.test_ts(ts) VALUES (current_timestamp());}}
> {{SELECT * FROM main.codydemos.test_s}}
>  






[jira] [Created] (SPARK-43085) Fix bug in column DEFAULT assignment for target tables with three-part names

2023-04-10 Thread Daniel (Jira)
Daniel created SPARK-43085:
--

 Summary: Fix bug in column DEFAULT assignment for target tables 
with three-part names
 Key: SPARK-43085
 URL: https://issues.apache.org/jira/browse/SPARK-43085
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Daniel


To reproduce:

```

{{CREATE DATABASE If NOT EXISTS main.codydemos;}}

{{CREATE OR REPLACE TABLE main.codydemos.test_ts (Id INT, ts timestamp);}}

{{CREATE OR REPLACE TABLE main.codydemos.test_ts_other (ts timestamp);}}

{{INSERT INTO main.codydemos.test_ts(ts) VALUES (current_timestamp());}}

{{SELECT * FROM main.codydemos.test_s}}

```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43085) Fix bug in column DEFAULT assignment for target tables with three-part names

2023-04-10 Thread Daniel (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel updated SPARK-43085:
---
Description: 
To reproduce:

 

{{CREATE DATABASE If NOT EXISTS main.codydemos;}}

{{CREATE OR REPLACE TABLE main.codydemos.test_ts (Id INT, ts timestamp);}}

{{CREATE OR REPLACE TABLE main.codydemos.test_ts_other (ts timestamp);}}

{{INSERT INTO main.codydemos.test_ts(ts) VALUES (current_timestamp());}}

{{SELECT * FROM main.codydemos.test_s}}

 

  was:
To reproduce:

```

{{CREATE DATABASE If NOT EXISTS main.codydemos;}}

{{CREATE OR REPLACE TABLE main.codydemos.test_ts (Id INT, ts timestamp);}}

{{CREATE OR REPLACE TABLE main.codydemos.test_ts_other (ts timestamp);}}

{{INSERT INTO main.codydemos.test_ts(ts) VALUES (current_timestamp());}}

{{SELECT * FROM main.codydemos.test_s}}

```


> Fix bug in column DEFAULT assignment for target tables with three-part names
> 
>
> Key: SPARK-43085
> URL: https://issues.apache.org/jira/browse/SPARK-43085
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Priority: Major
>
> To reproduce:
>  
> {{CREATE DATABASE If NOT EXISTS main.codydemos;}}
> {{CREATE OR REPLACE TABLE main.codydemos.test_ts (Id INT, ts timestamp);}}
> {{CREATE OR REPLACE TABLE main.codydemos.test_ts_other (ts timestamp);}}
> {{INSERT INTO main.codydemos.test_ts(ts) VALUES (current_timestamp());}}
> {{SELECT * FROM main.codydemos.test_s}}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43084) Add Python state API (applyInPandasWithState) and verify UDFs

2023-04-10 Thread Raghu Angadi (Jira)
Raghu Angadi created SPARK-43084:


 Summary: Add Python state API (applyInPandasWithState) and verify 
UDFs
 Key: SPARK-43084
 URL: https://issues.apache.org/jira/browse/SPARK-43084
 Project: Spark
  Issue Type: Task
  Components: Connect, Structured Streaming
Affects Versions: 3.5.0
 Environment: * Add Python state API (applyInPandasWithState) to 
streaming Spark-connect.
 * verify the UDFs work (it may not need any code changes).
Reporter: Raghu Angadi






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43083) Mark `*StateStoreSuite` as `ExtendedSQLTest`

2023-04-10 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-43083:
-

 Summary: Mark `*StateStoreSuite` as `ExtendedSQLTest`
 Key: SPARK-43083
 URL: https://issues.apache.org/jira/browse/SPARK-43083
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 3.5.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42382) Upgrade `cyclonedx-maven-plugin` to 2.7.6

2023-04-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-42382:
--
Summary: Upgrade `cyclonedx-maven-plugin` to 2.7.6  (was: Upgrade 
`cyclonedx-maven-plugin` to 2.7.5)

> Upgrade `cyclonedx-maven-plugin` to 2.7.6
> -
>
> Key: SPARK-42382
> URL: https://issues.apache.org/jira/browse/SPARK-42382
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> [https://github.com/CycloneDX/cyclonedx-maven-plugin/releases/tag/cyclonedx-maven-plugin-2.7.4]
> [https://github.com/CycloneDX/cyclonedx-maven-plugin/releases/tag/cyclonedx-maven-plugin-2.7.5]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42382) Upgrade `cyclonedx-maven-plugin` to 2.7.5

2023-04-10 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17710273#comment-17710273
 ] 

Dongjoon Hyun commented on SPARK-42382:
---

Since [~LuciferYang] has been investigating this, I made a PR with him as the 
main author.

- https://github.com/apache/spark/pull/40726

> Upgrade `cyclonedx-maven-plugin` to 2.7.5
> -
>
> Key: SPARK-42382
> URL: https://issues.apache.org/jira/browse/SPARK-42382
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> [https://github.com/CycloneDX/cyclonedx-maven-plugin/releases/tag/cyclonedx-maven-plugin-2.7.4]
> [https://github.com/CycloneDX/cyclonedx-maven-plugin/releases/tag/cyclonedx-maven-plugin-2.7.5]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-42382) Upgrade `cyclonedx-maven-plugin` to 2.7.5

2023-04-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-42382:
---
  Assignee: Yang Jie

> Upgrade `cyclonedx-maven-plugin` to 2.7.5
> -
>
> Key: SPARK-42382
> URL: https://issues.apache.org/jira/browse/SPARK-42382
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> [https://github.com/CycloneDX/cyclonedx-maven-plugin/releases/tag/cyclonedx-maven-plugin-2.7.4]
> [https://github.com/CycloneDX/cyclonedx-maven-plugin/releases/tag/cyclonedx-maven-plugin-2.7.5]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42382) Upgrade `cyclonedx-maven-plugin` to 2.7.5

2023-04-10 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17710270#comment-17710270
 ] 

Dongjoon Hyun commented on SPARK-42382:
---

Shall we reopen this, since 2.7.6 was released a week ago?

> Upgrade `cyclonedx-maven-plugin` to 2.7.5
> -
>
> Key: SPARK-42382
> URL: https://issues.apache.org/jira/browse/SPARK-42382
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Minor
>
> [https://github.com/CycloneDX/cyclonedx-maven-plugin/releases/tag/cyclonedx-maven-plugin-2.7.4]
> [https://github.com/CycloneDX/cyclonedx-maven-plugin/releases/tag/cyclonedx-maven-plugin-2.7.5]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43082) Arrow-optimized Python UDFs in Spark Connect

2023-04-10 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-43082:


 Summary: Arrow-optimized Python UDFs in Spark Connect
 Key: SPARK-43082
 URL: https://issues.apache.org/jira/browse/SPARK-43082
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Xinrong Meng


Implement Arrow-optimized Python UDFs in Spark Connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43061) Introduce PartitionEvaluator for SQL operator execution

2023-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-43061:

Summary: Introduce PartitionEvaluator for SQL operator execution  (was: 
Introduce TaskEvaluator for SQL operator execution)

> Introduce PartitionEvaluator for SQL operator execution
> ---
>
> Key: SPARK-43061
> URL: https://issues.apache.org/jira/browse/SPARK-43061
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43061) Introduce TaskEvaluator for SQL operator execution

2023-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-43061.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40697
[https://github.com/apache/spark/pull/40697]

> Introduce TaskEvaluator for SQL operator execution
> --
>
> Key: SPARK-43061
> URL: https://issues.apache.org/jira/browse/SPARK-43061
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43061) Introduce TaskEvaluator for SQL operator execution

2023-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-43061:
---

Assignee: Wenchen Fan

> Introduce TaskEvaluator for SQL operator execution
> --
>
> Key: SPARK-43061
> URL: https://issues.apache.org/jira/browse/SPARK-43061
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43033) Avoid task retries due to AssertNotNull checks

2023-04-10 Thread xiaochen zhou (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17710174#comment-17710174
 ] 

xiaochen zhou commented on SPARK-43033:
---

Thanks for your reply. I have opened the PR; please help review it: 
[https://github.com/apache/spark/pull/40707]

> Avoid task retries due to AssertNotNull checks
> --
>
> Key: SPARK-43033
> URL: https://issues.apache.org/jira/browse/SPARK-43033
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> As discussed 
> [here|https://github.com/apache/spark/pull/40655#discussion_r1156693696], 
> tasks that failed because of exceptions generated by {{AssertNotNull}} should 
> not be retried.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43063) `df.show` handle null should print NULL instead of null

2023-04-10 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17710133#comment-17710133
 ] 

GridGain Integration commented on SPARK-43063:
--

User 'Yikf' has created a pull request for this issue:
https://github.com/apache/spark/pull/40699

> `df.show` handle null should print NULL instead of null
> ---
>
> Key: SPARK-43063
> URL: https://issues.apache.org/jira/browse/SPARK-43063
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: yikaifei
>Priority: Trivial
>
> `df.show` should print NULL instead of null when handling null values, for 
> consistent behavior;
> {code:java}
> Like as the following behavior is currently inconsistent:
> ``` shell
> scala> spark.sql("select decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 
> 'New Jersey', 4, 'Seattle') as result").show(false)
> +--+
> |result|
> +--+
> |null  |
> +--+
> ```
> ``` shell
> spark-sql> DESC FUNCTION EXTENDED decode;
> function_desc
> Function: decode
> Class: org.apache.spark.sql.catalyst.expressions.Decode
> Usage:
> decode(bin, charset) - Decodes the first argument using the second 
> argument character set.
> decode(expr, search, result [, search, result ] ... [, default]) - 
> Compares expr
>   to each search value in order. If expr is equal to a search value, 
> decode returns
>   the corresponding result. If no match is found, then it returns 
> default. If default
>   is omitted, it returns null.
> Extended Usage:
> Examples:
>   > SELECT decode(encode('abc', 'utf-8'), 'utf-8');
>abc
>   > SELECT decode(2, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 
> 4, 'Seattle', 'Non domestic');
>San Francisco
>   > SELECT decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 
> 4, 'Seattle', 'Non domestic');
>Non domestic
>   > SELECT decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 
> 4, 'Seattle');
>NULL
> Since: 3.2.0
> Time taken: 0.074 seconds, Fetched 4 row(s)
> ```
> ``` shell
> spark-sql> select decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New 
> Jersey', 4, 'Seattle');
> NULL
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43081) Add torch distributor data loader that loads data from spark partition data

2023-04-10 Thread Ignite TC Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17710127#comment-17710127
 ] 

Ignite TC Bot commented on SPARK-43081:
---

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/40724

> Add torch distributor data loader that loads data from spark partition data
> ---
>
> Key: SPARK-43081
> URL: https://issues.apache.org/jira/browse/SPARK-43081
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, ML, PySpark
>Affects Versions: 3.5.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
>
> Add torch distributor data loader that loads data from spark partition data.
>  
> We can add 2 APIs like:
> Adds a `TorchDistributor` method API :
> {code:java}
>      def train_on_dataframe(self, train_function, spark_dataframe, *args, 
> **kwargs):
>         """
>         Runs distributed training using provided spark DataFrame as input 
> data.
>         You should ensure the input spark DataFrame has evenly divided 
> partitions;
>         this method starts a barrier spark job in which each spark task
>         processes one partition of the input spark DataFrame.
>         Parameters
>         --
>         train_function :
>             Either a PyTorch function, PyTorch Lightning function that 
> launches distributed
>             training. Note that inside the function, you can call
>             `pyspark.ml.torch.distributor.get_spark_partition_data_loader` 
> API to get a torch
>             data loader, the data loader loads data from the corresponding 
> partition of the
>             input spark DataFrame.
>         spark_dataframe :
>             An input spark DataFrame that can be used in PyTorch 
> `train_function` function.
>             See `train_function` argument doc for details.
>         args :
>             `args` need to be the input parameters to `train_function` 
> function. It would look like
>             >>> model = distributor.run(train, 1e-3, 64)
>             where train is a function and 1e-3 and 64 are regular numeric 
> inputs to the function.
>         kwargs :
>             `kwargs` need to be the keyword input parameters to 
> `train_function` function.
>             It would look like
>             >>> model = distributor.run(train, tol=1e-3, max_iter=64)
>             where train is a function that has 2 arguments `tol` and 
> `max_iter`.
>         Returns
>         ---
>             Returns the output of `train_function` called with args inside 
> spark rank 0 task.
>         """{code}
>  
> Adds a loader API:
>  
> {code:java}
>  def get_spark_partition_data_loader(num_samples, batch_size, prefetch=2):
>     """
>     This function must be called inside the `train_function` where 
> `train_function`
>     is the input argument of `TorchDistributor.train_on_dataframe`.
>     The function returns a pytorch data loader that loads data from
>     the corresponding spark partition data.
>     Parameters
>     --
>     num_samples :
>         Number of samples to generate per epoch. If `num_samples` is less 
> than the number of
>         rows in the spark partition, it generates the first `num_samples` rows of
>         the spark partition; if `num_samples` is greater than the number of
>         rows in the spark partition, then after the iterator has loaded all rows
>         from the partition, it wraps around back to the first row.
>     batch_size:
>         How many samples per batch to load.
>     prefetch:
>         Number of batches loaded in advance.
>     """{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43081) Add torch distributor data loader that loads data from spark partition data

2023-04-10 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu reassigned SPARK-43081:
--

Assignee: Weichen Xu

> Add torch distributor data loader that loads data from spark partition data
> ---
>
> Key: SPARK-43081
> URL: https://issues.apache.org/jira/browse/SPARK-43081
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, ML, PySpark
>Affects Versions: 3.5.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
>
> Add torch distributor data loader that loads data from spark partition data.
>  
> We can add 2 APIs like:
> Adds a `TorchDistributor` method API :
> {code:java}
>      def train_on_dataframe(self, train_function, spark_dataframe, *args, 
> **kwargs):
>         """
>         Runs distributed training using provided spark DataFrame as input 
> data.
>         You should ensure the input spark DataFrame has evenly divided 
> partitions;
>         this method starts a barrier spark job in which each spark task
>         processes one partition of the input spark DataFrame.
>         Parameters
>         --
>         train_function :
>             Either a PyTorch function, PyTorch Lightning function that 
> launches distributed
>             training. Note that inside the function, you can call
>             `pyspark.ml.torch.distributor.get_spark_partition_data_loader` 
> API to get a torch
>             data loader, the data loader loads data from the corresponding 
> partition of the
>             input spark DataFrame.
>         spark_dataframe :
>             An input spark DataFrame that can be used in PyTorch 
> `train_function` function.
>             See `train_function` argument doc for details.
>         args :
>             `args` need to be the input parameters to `train_function` 
> function. It would look like
>             >>> model = distributor.run(train, 1e-3, 64)
>             where train is a function and 1e-3 and 64 are regular numeric 
> inputs to the function.
>         kwargs :
>             `kwargs` need to be the keyword input parameters to 
> `train_function` function.
>             It would look like
>             >>> model = distributor.run(train, tol=1e-3, max_iter=64)
>             where train is a function that has 2 arguments `tol` and 
> `max_iter`.
>         Returns
>         ---
>             Returns the output of `train_function` called with args inside 
> spark rank 0 task.
>         """{code}
>  
> Adds a loader API:
>  
> {code:java}
>  def get_spark_partition_data_loader(num_samples, batch_size, prefetch=2):
>     """
>     This function must be called inside the `train_function` where 
> `train_function`
>     is the input argument of `TorchDistributor.train_on_dataframe`.
>     The function returns a pytorch data loader that loads data from
>     the corresponding spark partition data.
>     Parameters
>     --
>     num_samples :
>         Number of samples to generate per epoch. If `num_samples` is less 
> than the number of
>         rows in the spark partition, it generates the first `num_samples` rows of
>         the spark partition; if `num_samples` is greater than the number of
>         rows in the spark partition, then after the iterator has loaded all rows
>         from the partition, it wraps around back to the first row.
>     batch_size:
>         How many samples per batch to load.
>     prefetch:
>         Number of batches loaded in advance.
>     """{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43081) Add torch distributor data loader that loads data from spark partition data

2023-04-10 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-43081:
---
Description: 
Add torch distributor data loader that loads data from spark partition data.

 

We can add 2 APIs like:

 

Adds a `TorchDistributor` method API :

```
    def train_on_dataframe(self, train_function, spark_dataframe, *args, 
**kwargs):
        """
        Runs distributed training using provided spark DataFrame as input data.
        You should ensure the input spark DataFrame has evenly divided partitions;
        this method starts a barrier spark job in which each spark task
        processes one partition of the input spark DataFrame.

        Parameters
        --
        train_function :
            Either a PyTorch function, PyTorch Lightning function that launches 
distributed
            training. Note that inside the function, you can call
            `pyspark.ml.torch.distributor.get_spark_partition_data_loader` API 
to get a torch
            data loader, the data loader loads data from the corresponding 
partition of the
            input spark DataFrame.
        spark_dataframe :
            An input spark DataFrame that can be used in PyTorch 
`train_function` function.
            See `train_function` argument doc for details.
        args :
            `args` need to be the input parameters to `train_function` 
function. It would look like

            >>> model = distributor.run(train, 1e-3, 64)

            where train is a function and 1e-3 and 64 are regular numeric 
inputs to the function.
        kwargs :
            `kwargs` need to be the keyword input parameters to 
`train_function` function.
            It would look like

            >>> model = distributor.run(train, tol=1e-3, max_iter=64)

            where train is a function that has 2 arguments `tol` and `max_iter`.

        Returns
        ---
            Returns the output of `train_function` called with args inside 
spark rank 0 task.
        """
```
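The args/kwargs forwarding contract described in the docstring above can be illustrated with a minimal single-process mock. `MockDistributor` and `train` are assumed names used only for illustration; the real API would run a barrier Spark job with one task per DataFrame partition and return the rank-0 task's result:

```python
# Minimal single-process mock of the args/kwargs forwarding contract.
# MockDistributor and train are assumed names -- not the real TorchDistributor,
# which would launch a barrier Spark job with one task per partition.
class MockDistributor:
    def train_on_dataframe(self, train_function, partitions, *args, **kwargs):
        # Each "task" processes one partition; the first ("rank 0") result
        # is what the real API is described as returning.
        results = [train_function(p, *args, **kwargs) for p in partitions]
        return results[0]

def train(partition, lr, max_iter=10):
    # A stand-in training function that just records its inputs.
    return {"rows": len(partition), "lr": lr, "max_iter": max_iter}

dist = MockDistributor()
out = dist.train_on_dataframe(train, [[1, 2, 3], [4, 5]], 1e-3, max_iter=64)
print(out)  # {'rows': 3, 'lr': 0.001, 'max_iter': 64}
```

This mirrors the documented `distributor.run(train, 1e-3, 64)` / `distributor.run(train, tol=1e-3, max_iter=64)` call shapes: positional and keyword arguments pass straight through to `train_function`.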

 

Adds a loader API:

```
def get_spark_partition_data_loader(num_samples, batch_size, prefetch=2):
    """
    This function must be called inside the `train_function` where 
`train_function`
    is the input argument of `TorchDistributor.train_on_dataframe`.
    The function returns a pytorch data loader that loads data from
    the corresponding spark partition data.

    Parameters
    --
    num_samples :
        Number of samples to generate per epoch. If `num_samples` is less than 
the number of
        rows in the spark partition, it generates the first `num_samples` rows of
        the spark partition; if `num_samples` is greater than the number of
        rows in the spark partition, then after the iterator has loaded all rows
        from the partition, it wraps around back to the first row.
    batch_size:
        How many samples per batch to load.
    prefetch:
        Number of batches loaded in advance.
    """
```

  was:Add torch distributor data loader that loads data from spark partition 
data.


> Add torch distributor data loader that loads data from spark partition data
> ---
>
> Key: SPARK-43081
> URL: https://issues.apache.org/jira/browse/SPARK-43081
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, ML, PySpark
>Affects Versions: 3.5.0
>Reporter: Weichen Xu
>Priority: Major
>
> Add torch distributor data loader that loads data from spark partition data.
>  
> We can add 2 APIs like:
>  
> Adds a `TorchDistributor` method API :
> ```
>     def train_on_dataframe(self, train_function, spark_dataframe, *args, 
> **kwargs):
>         """
>         Runs distributed training using provided spark DataFrame as input 
> data.
>         You should ensure the input spark DataFrame has evenly divided 
> partitions;
>         this method starts a barrier spark job in which each spark task
>         processes one partition of the input spark DataFrame.
>         Parameters
>         --
>         train_function :
>             Either a PyTorch function, PyTorch Lightning function that 
> launches distributed
>             training. Note that inside the function, you can call
>             `pyspark.ml.torch.distributor.get_spark_partition_data_loader` 
> API to get a torch
>             data loader, the data loader loads data from the corresponding 
> partition of the
>             input spark DataFrame.
>         spark_dataframe :
>             An input spark DataFrame that can be used in PyTorch 
> `train_function` function.
>             See `train_function` argument doc for details.
>         args :
>             `args` need to be the input parameters to `train_function` 
> function. It would look like
>             >>> model = 

[jira] [Updated] (SPARK-43081) Add torch distributor data loader that loads data from spark partition data

2023-04-10 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-43081:
---
Description: 
Add torch distributor data loader that loads data from spark partition data.

 

We can add 2 APIs like:

Adds a `TorchDistributor` method API :
{code:java}
     def train_on_dataframe(self, train_function, spark_dataframe, *args, 
**kwargs):
        """
        Runs distributed training using provided spark DataFrame as input data.
        You should ensure the input spark DataFrame has evenly divided partitions;
        this method starts a barrier spark job in which each spark task
        processes one partition of the input spark DataFrame.
        Parameters
        --
        train_function :
            Either a PyTorch function, PyTorch Lightning function that launches 
distributed
            training. Note that inside the function, you can call
            `pyspark.ml.torch.distributor.get_spark_partition_data_loader` API 
to get a torch
            data loader, the data loader loads data from the corresponding 
partition of the
            input spark DataFrame.
        spark_dataframe :
            An input spark DataFrame that can be used in PyTorch 
`train_function` function.
            See `train_function` argument doc for details.
        args :
            `args` need to be the input parameters to `train_function` 
function. It would look like
            >>> model = distributor.run(train, 1e-3, 64)
            where train is a function and 1e-3 and 64 are regular numeric 
inputs to the function.
        kwargs :
            `kwargs` need to be the keyword input parameters to 
`train_function` function.
            It would look like
            >>> model = distributor.run(train, tol=1e-3, max_iter=64)
            where train is a function that has 2 arguments `tol` and `max_iter`.
        Returns
        ---
            Returns the output of `train_function` called with args inside 
spark rank 0 task.
        """{code}
 

Adds a loader API:

 
{code:java}
 def get_spark_partition_data_loader(num_samples, batch_size, prefetch=2):
    """
    This function must be called inside the `train_function` where 
`train_function`
    is the input argument of `TorchDistributor.train_on_dataframe`.
    The function returns a pytorch data loader that loads data from
    the corresponding spark partition data.
    Parameters
    --
    num_samples :
        Number of samples to generate per epoch. If `num_samples` is less than 
the number of
        rows in the spark partition, it generates the first `num_samples` rows of
        the spark partition; if `num_samples` is greater than the number of
        rows in the spark partition, then after the iterator has loaded all rows
        from the partition, it wraps around back to the first row.
    batch_size:
        How many samples per batch to load.
    prefetch:
        Number of batches loaded in advance.
    """{code}

  was:
Add torch distributor data loader that loads data from spark partition data.

 

We can add 2 APIs like:

 

Adds a `TorchDistributor` method API :

```
    def train_on_dataframe(self, train_function, spark_dataframe, *args, 
**kwargs):
        """
        Runs distributed training using provided spark DataFrame as input data.
        You should ensure the input spark DataFrame has evenly divided partitions;
        this method starts a barrier spark job in which each spark task
        processes one partition of the input spark DataFrame.

        Parameters
        --
        train_function :
            Either a PyTorch function, PyTorch Lightning function that launches 
distributed
            training. Note that inside the function, you can call
            `pyspark.ml.torch.distributor.get_spark_partition_data_loader` API 
to get a torch
            data loader, the data loader loads data from the corresponding 
partition of the
            input spark DataFrame.
        spark_dataframe :
            An input spark DataFrame that can be used in PyTorch 
`train_function` function.
            See `train_function` argument doc for details.
        args :
            `args` need to be the input parameters to `train_function` 
function. It would look like

            >>> model = distributor.run(train, 1e-3, 64)

            where train is a function and 1e-3 and 64 are regular numeric 
inputs to the function.
        kwargs :
            `kwargs` need to be the keyword input parameters to the
            `train_function` function.
            It would look like

            >>> model = distributor.run(train, tol=1e-3, max_iter=64)

            where train is a function that has 2 arguments `tol` and `max_iter`.

        Returns
        -------
            Returns the output of `train_function` called with args inside 
spark rank 0 task.
        """
```
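The `args`/`kwargs` forwarding described in the docstring above amounts to standard Python argument splatting; a minimal sketch of just that dispatch (omitting the barrier Spark job and distributed setup the real method performs):

```python
def run(train_function, *args, **kwargs):
    """Toy dispatcher: forward whatever arguments follow `train_function`
    to it unchanged, and return the result of the (rank 0) invocation."""
    return train_function(*args, **kwargs)

def train(lr, max_iter=10):
    # Stand-in for a real training loop; just echoes its inputs.
    return f"trained with lr={lr}, max_iter={max_iter}"

print(run(train, 1e-3, 64))              # positional form, like distributor.run(train, 1e-3, 64)
print(run(train, lr=1e-3, max_iter=64))  # keyword form; same result
```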

 

Adds a loader API:


[jira] [Created] (SPARK-43081) Add torch distributor data loader that loads data from spark partition data

2023-04-10 Thread Weichen Xu (Jira)
Weichen Xu created SPARK-43081:
--

 Summary: Add torch distributor data loader that loads data from 
spark partition data
 Key: SPARK-43081
 URL: https://issues.apache.org/jira/browse/SPARK-43081
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, ML, PySpark
Affects Versions: 3.5.0
Reporter: Weichen Xu


Add torch distributor data loader that loads data from spark partition data.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43077) Improve the error message of UNRECOGNIZED_SQL_TYPE

2023-04-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17710093#comment-17710093
 ] 

ASF GitHub Bot commented on SPARK-43077:


User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/40718

> Improve the error message of UNRECOGNIZED_SQL_TYPE
> --
>
> Key: SPARK-43077
> URL: https://issues.apache.org/jira/browse/SPARK-43077
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Kent Yao
>Priority: Major
>
> UNRECOGNIZED_SQL_TYPE currently prints the raw JDBC type id in the error
> message. This makes it difficult for Spark users to understand the meaning of
> this kind of error, especially when the type id comes from a vendor extension.
> For example, 
> {code:java}
>  org.apache.spark.SparkSQLException: Unrecognized SQL type -102{code}
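As a sketch of the requested improvement, the numeric id can be resolved to a `java.sql.Types` constant name before formatting, falling back to an explicit "non-standard" note for vendor ids such as -102. The mapping below is a small illustrative subset of `java.sql.Types`, and the function name is hypothetical (the real fix lives in Spark's JDBC code path):

```python
# Illustrative subset of java.sql.Types constant values.
JDBC_TYPE_NAMES = {
    -7: "BIT", -6: "TINYINT", -5: "BIGINT", 4: "INTEGER",
    12: "VARCHAR", 93: "TIMESTAMP", 2014: "TIMESTAMP_WITH_TIMEZONE",
}

def unrecognized_sql_type_message(type_id: int) -> str:
    """Build an error message that names the JDBC type when it is standard,
    and says explicitly when the id is a vendor extension."""
    name = JDBC_TYPE_NAMES.get(type_id)
    if name is None:
        return f"Unrecognized SQL type: non-standard/vendor type id {type_id}"
    return f"Unrecognized SQL type: {name} (id {type_id})"

print(unrecognized_sql_type_message(12))    # Unrecognized SQL type: VARCHAR (id 12)
print(unrecognized_sql_type_message(-102))  # Unrecognized SQL type: non-standard/vendor type id -102
```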






[jira] [Commented] (SPARK-43080) Upgrade zstd-jni to 1.5.5-1

2023-04-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17710091#comment-17710091
 ] 

ASF GitHub Bot commented on SPARK-43080:


User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/40721

> Upgrade zstd-jni to 1.5.5-1
> ---
>
> Key: SPARK-43080
> URL: https://issues.apache.org/jira/browse/SPARK-43080
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Major
>
> * [luben/zstd-jni@v1.5.4-2...v1.5.5-1|https://github.com/luben/zstd-jni/compare/v1.5.4-2...v1.5.5-1]






[jira] [Created] (SPARK-43080) Upgrade zstd-jni to 1.5.5-1

2023-04-10 Thread Yang Jie (Jira)
Yang Jie created SPARK-43080:


 Summary: Upgrade zstd-jni to 1.5.5-1
 Key: SPARK-43080
 URL: https://issues.apache.org/jira/browse/SPARK-43080
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.0
Reporter: Yang Jie


* [luben/zstd-jni@v1.5.4-2...v1.5.5-1|https://github.com/luben/zstd-jni/compare/v1.5.4-2...v1.5.5-1]






[jira] [Commented] (SPARK-43076) Removing the dependency on `grpcio` when remote session is not used.

2023-04-10 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17710083#comment-17710083
 ] 

GridGain Integration commented on SPARK-43076:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/40722

> Removing the dependency on `grpcio` when remote session is not used.
> 
>
> Key: SPARK-43076
> URL: https://issues.apache.org/jira/browse/SPARK-43076
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should not force installing `grpcio` when the remote session is not used
> for pandas API on Spark.






[jira] [Commented] (SPARK-40609) Casts types according to bucket info for Equality expression

2023-04-10 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17710080#comment-17710080
 ] 

Yuming Wang commented on SPARK-40609:
-


{code:scala}
import org.apache.spark.benchmark.Benchmark

val numRows = 1024 * 1024 * 40
spark.sql(s"CREATE TABLE t USING parquet AS SELECT id AS a, CAST(id AS decimal(18, 0)) AS b FROM range(${numRows}L)")
val benchmark = new Benchmark("Benchmark equal with cast", numRows, minNumIters = 2)

benchmark.addCase("default") { _ =>
  spark.sql("SELECT * FROM t t1 JOIN t t2 ON t1.a = t2.b").write.format("noop").mode("Overwrite").save()
}

benchmark.addCase("cast to bigint") { _ =>
  spark.sql("SELECT * FROM t t1 JOIN t t2 ON CAST(t1.a AS bigint) = CAST(t2.b AS bigint)").write.format("noop").mode("Overwrite").save()
}

benchmark.addCase("cast to decimal") { _ =>
  spark.sql("SELECT * FROM t t1 JOIN t t2 ON CAST(t1.a AS decimal(18, 0)) = CAST(t2.b AS decimal(18, 0))").write.format("noop").mode("Overwrite").save()
}

benchmark.run()
{code}



{noformat}
OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Mac OS X 13.2.1
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Benchmark equal with cast:   Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
default                              34594          35381        1113        1.2         824.8       1.0X
cast to bigint                       29056          29367         440        1.4         692.7       1.2X
cast to decimal                      32528          33081         783        1.3         775.5       1.1X
{noformat}




> Casts types according to bucket info for Equality expression
> 
>
> Key: SPARK-40609
> URL: https://issues.apache.org/jira/browse/SPARK-40609
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>







[jira] [Created] (SPARK-43079) Add bloom filter details in spark history server plans/SVGs

2023-04-10 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created SPARK-43079:


 Summary: Add bloom filter details in spark history server 
plans/SVGs
 Key: SPARK-43079
 URL: https://issues.apache.org/jira/browse/SPARK-43079
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.3.2
Reporter: Rajesh Balamohan


Spark's bloom filter can be enabled via 
"spark.sql.optimizer.runtimeFilter.semiJoinReduction.enabled=true" and 
"spark.sql.optimizer.runtime.bloomFilter.enabled=true".

Spark history server's SVGs don't render the bloom filter details; it would be 
good to include this detail in the plan (as of now, it shows up only in the 
explain plan's text output).






[jira] [Created] (SPARK-43078) Separate test into `pyspark-connect-pandas` and `pyspark-connect-pandas-slow`

2023-04-10 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-43078:
---

 Summary: Separate test into `pyspark-connect-pandas` and 
`pyspark-connect-pandas-slow`
 Key: SPARK-43078
 URL: https://issues.apache.org/jira/browse/SPARK-43078
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee


The test `pyspark-connect` takes 2~3 hours due to recently added pandas API on 
Spark tests, so we'd better separate the pandas API on Spark tests into a 
different test module to reduce the overhead on `pyspark-connect`.






[jira] [Assigned] (SPARK-43065) Set job description for tpcds queries

2023-04-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-43065:


Assignee: caican

> Set job description for tpcds queries
> -
>
> Key: SPARK-43065
> URL: https://issues.apache.org/jira/browse/SPARK-43065
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0, 3.3.0
>Reporter: caican
>Assignee: caican
>Priority: Major
>
>  
> When using Spark's TPCDSQueryBenchmark to run TPC-DS, the Spark UI does not 
> display the SQL information
> !https://user-images.githubusercontent.com/94670132/230567550-9bb2842c-aecc-41a5-acb6-0ff8ea765df1.png|width=1694,height=523!






[jira] [Resolved] (SPARK-43065) Set job description for tpcds queries

2023-04-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43065.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40700
[https://github.com/apache/spark/pull/40700]

> Set job description for tpcds queries
> -
>
> Key: SPARK-43065
> URL: https://issues.apache.org/jira/browse/SPARK-43065
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0, 3.3.0
>Reporter: caican
>Assignee: caican
>Priority: Major
> Fix For: 3.5.0
>
>
>  
> When using Spark's TPCDSQueryBenchmark to run TPC-DS, the Spark UI does not 
> display the SQL information
> !https://user-images.githubusercontent.com/94670132/230567550-9bb2842c-aecc-41a5-acb6-0ff8ea765df1.png|width=1694,height=523!






[jira] [Resolved] (SPARK-43057) Migrate Spark Connect Column errors into error class

2023-04-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43057.
--
  Assignee: Haejoon Lee
Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/40694

> Migrate Spark Connect Column errors into error class
> 
>
> Key: SPARK-43057
> URL: https://issues.apache.org/jira/browse/SPARK-43057
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> Migrate Spark Connect Column errors into error class






[jira] [Resolved] (SPARK-43059) Migrate TypeError from DataFrame(Reader|Writer) into error class

2023-04-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43059.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40706
[https://github.com/apache/spark/pull/40706]

> Migrate TypeError from DataFrame(Reader|Writer) into error class
> 
>
> Key: SPARK-43059
> URL: https://issues.apache.org/jira/browse/SPARK-43059
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.5.0
>
>
> Migrate TypeError from DataFrame(Reader|Writer) into error class






[jira] [Assigned] (SPARK-43059) Migrate TypeError from DataFrame(Reader|Writer) into error class

2023-04-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-43059:


Assignee: Haejoon Lee

> Migrate TypeError from DataFrame(Reader|Writer) into error class
> 
>
> Key: SPARK-43059
> URL: https://issues.apache.org/jira/browse/SPARK-43059
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> Migrate TypeError from DataFrame(Reader|Writer) into error class


