[jira] [Commented] (SPARK-42755) Factor literal value conversion out to connect-common

2023-03-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699184#comment-17699184
 ] 

Apache Spark commented on SPARK-42755:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/40375

> Factor literal value conversion out to connect-common
> -
>
> Key: SPARK-42755
> URL: https://issues.apache.org/jira/browse/SPARK-42755
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42755) Factor literal value conversion out to connect-common

2023-03-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42755:


Assignee: Apache Spark

> Factor literal value conversion out to connect-common
> -
>
> Key: SPARK-42755
> URL: https://issues.apache.org/jira/browse/SPARK-42755
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42755) Factor literal value conversion out to connect-common

2023-03-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699183#comment-17699183
 ] 

Apache Spark commented on SPARK-42755:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/40375

> Factor literal value conversion out to connect-common
> -
>
> Key: SPARK-42755
> URL: https://issues.apache.org/jira/browse/SPARK-42755
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42755) Factor literal value conversion out to connect-common

2023-03-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42755:


Assignee: (was: Apache Spark)

> Factor literal value conversion out to connect-common
> -
>
> Key: SPARK-42755
> URL: https://issues.apache.org/jira/browse/SPARK-42755
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42755) Factor literal value conversion out to connect-common

2023-03-10 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-42755:
--
Summary: Factor literal value conversion out to connect-common  (was: Move 
literal value conversion to connect-common)

> Factor literal value conversion out to connect-common
> -
>
> Key: SPARK-42755
> URL: https://issues.apache.org/jira/browse/SPARK-42755
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42755) Move literal value conversion to connect-common

2023-03-10 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-42755:
-

 Summary: Move literal value conversion to connect-common
 Key: SPARK-42755
 URL: https://issues.apache.org/jira/browse/SPARK-42755
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42721) Add an Interceptor to log RPCs in connect-server

2023-03-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699153#comment-17699153
 ] 

Apache Spark commented on SPARK-42721:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/40374

> Add an Interceptor to log RPCs in connect-server
> 
>
> Key: SPARK-42721
> URL: https://issues.apache.org/jira/browse/SPARK-42721
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Raghu Angadi
>Assignee: Raghu Angadi
>Priority: Major
> Fix For: 3.4.1
>
>
> It would be useful to be able to log RPCs to the connect server during 
> development. It makes it simpler to see the flow of messages.
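
For context, a minimal sketch of such a logging interceptor written against the 
plain gRPC Java API; the class name and log format are illustrative assumptions, 
not Spark's actual implementation:
{code:scala}
import io.grpc.{Metadata, ServerCall, ServerCallHandler, ServerInterceptor}

// Hypothetical example: log the full method name of every incoming RPC,
// then delegate to the next handler in the chain.
class LoggingInterceptor extends ServerInterceptor {
  override def interceptCall[ReqT, RespT](
      call: ServerCall[ReqT, RespT],
      headers: Metadata,
      next: ServerCallHandler[ReqT, RespT]): ServerCall.Listener[ReqT] = {
    println(s"RPC received: ${call.getMethodDescriptor.getFullMethodName}")
    next.startCall(call, headers)
  }
}
{code}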



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42754) Spark 3.4 history server's SQL tab incorrectly groups SQL executions when replaying event logs from Spark 3.3 and earlier

2023-03-10 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-42754:
---
Description: 
In Spark 3.4.0 RC4, the Spark History Server's SQL tab incorrectly groups SQL 
executions when replaying event logs generated by older Spark versions.

 

{*}Reproduction{*}:

In {{./bin/spark-shell --conf spark.eventLog.enabled=true --conf 
spark.eventLog.dir=eventlogs}}, run three non-nested SQL queries:
{code:java}
sql("select * from range(10)").collect()
sql("select * from range(20)").collect()
sql("select * from range(30)").collect(){code}
Exit the shell and use the Spark History Server to replay this application's UI.

In the SQL tab I expect to see three separate queries, but Spark 3.4's history 
server incorrectly groups the second and third queries as nested queries of the 
first (see attached screenshot).

 

{*}Root cause{*}: 

[https://github.com/apache/spark/pull/39268] / SPARK-41752 added a new 
*non-optional* {{rootExecutionId: Long}} field to the 
SparkListenerSQLExecutionStart case class.

When JsonProtocol deserializes this event it uses the "ignore missing 
properties" Jackson deserialization option, causing the {{rootExecutionId}} 
field to be initialized with a default value of {{0}}.

The value {{0}} is a legitimate execution ID, so in the deserialized event we 
have no ability to distinguish between the absence of a value and a case where 
all queries have the first query as the root.

{*}Proposed fix{*}:

I think we should change this field to be of type {{Option[Long]}}. I believe 
this is a release blocker for Spark 3.4.0 because we cannot change the type of 
this new field in a future release without breaking binary compatibility.

  was:
In Spark 3.4.0 RC4, the Spark History Server's SQL tab incorrectly groups SQL 
executions when replaying event logs generated by older Spark versions.

 

{*}Reproduction{*}:

{{In ./bin/spark-shell --conf spark.eventLog.enabled=true --conf 
spark.eventLog.dir=eventlogs, run three non-nested SQL queries:}}
{code:java}
sql("select * from range(10)").collect()
sql("select * from range(20)").collect()
sql("select * from range(30)").collect(){code}
Exit the shell and use the Spark History Server to replay this UI.

In the SQL tab I expect to see three separate queries, but Spark 3.4's history 
server incorrectly groups the second and third queries as nested queries of the 
first (see attached screenshot).

 

{*}Root cause{*}: 

[https://github.com/apache/spark/pull/39268] / SPARK-41752 added a new 
*non-optional* {{rootExecutionId: Long}} field to the 
SparkListenerSQLExecutionStart case class.

When JsonProtocol deserializes this event it uses the "ignore missing 
properties" Jackson deserialization option, causing the {{rootExecutionField}} 
to be initialized with a default value of {{{}0{}}}.

The value {{0}} is a legitimate execution ID, so in the deserialized event we 
have no ability to distinguish between the absence of a value and a case where 
all queries have the first query as the root.

*Proposed* {*}fix{*}:

I think we should change this field to be of type {{Option[Long]}} . I believe 
this is a release blocker for Spark 3.4.0 because we cannot change the type of 
this new field in a future release without breaking binary compatibility.


> Spark 3.4 history server's SQL tab incorrectly groups SQL executions when 
> replaying event logs from Spark 3.3 and earlier
> -
>
> Key: SPARK-42754
> URL: https://issues.apache.org/jira/browse/SPARK-42754
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Josh Rosen
>Priority: Blocker
> Attachments: example.png
>
>
> In Spark 3.4.0 RC4, the Spark History Server's SQL tab incorrectly groups SQL 
> executions when replaying event logs generated by older Spark versions.
>  
> {*}Reproduction{*}:
> In {{./bin/spark-shell --conf spark.eventLog.enabled=true --conf 
> spark.eventLog.dir=eventlogs}}, run three non-nested SQL queries:
> {code:java}
> sql("select * from range(10)").collect()
> sql("select * from range(20)").collect()
> sql("select * from range(30)").collect(){code}
> Exit the shell and use the Spark History Server to replay this application's 
> UI.
> In the SQL tab I expect to see three separate queries, but Spark 3.4's 
> history server incorrectly groups the second and third queries as nested 
> queries of the first (see attached screenshot).
>  
> {*}Root cause{*}: 
> [https://github.com/apache/spark/pull/39268] / SPARK-41752 added a new 
> *non-optional* {{rootExecutionId: Long}} field to the 
> SparkListenerSQLExecutionStart case class.
> When JsonProtocol deserializes this event it uses the "ignore missing 
> pr

[jira] [Created] (SPARK-42754) Spark 3.4 history server's SQL tab incorrectly groups SQL executions when replaying event logs from Spark 3.3 and earlier

2023-03-10 Thread Josh Rosen (Jira)
Josh Rosen created SPARK-42754:
--

 Summary: Spark 3.4 history server's SQL tab incorrectly groups SQL 
executions when replaying event logs from Spark 3.3 and earlier
 Key: SPARK-42754
 URL: https://issues.apache.org/jira/browse/SPARK-42754
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Josh Rosen
 Attachments: example.png

In Spark 3.4.0 RC4, the Spark History Server's SQL tab incorrectly groups SQL 
executions when replaying event logs generated by older Spark versions.

 

{*}Reproduction{*}:

In {{./bin/spark-shell --conf spark.eventLog.enabled=true --conf 
spark.eventLog.dir=eventlogs}}, run three non-nested SQL queries:
{code:java}
sql("select * from range(10)").collect()
sql("select * from range(20)").collect()
sql("select * from range(30)").collect(){code}
Exit the shell and use the Spark History Server to replay this UI.

In the SQL tab I expect to see three separate queries, but Spark 3.4's history 
server incorrectly groups the second and third queries as nested queries of the 
first (see attached screenshot).

 

{*}Root cause{*}: 

[https://github.com/apache/spark/pull/39268] / SPARK-41752 added a new 
*non-optional* {{rootExecutionId: Long}} field to the 
SparkListenerSQLExecutionStart case class.

When JsonProtocol deserializes this event it uses the "ignore missing 
properties" Jackson deserialization option, causing the {{rootExecutionId}} 
field to be initialized with a default value of {{0}}.

The value {{0}} is a legitimate execution ID, so in the deserialized event we 
have no ability to distinguish between the absence of a value and a case where 
all queries have the first query as the root.

{*}Proposed fix{*}:

I think we should change this field to be of type {{Option[Long]}}. I believe 
this is a release blocker for Spark 3.4.0 because we cannot change the type of 
this new field in a future release without breaking binary compatibility.
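
To illustrate the shape of the proposed fix, a hedged Scala sketch of an event 
whose root execution ID is optional; the case class below is a simplified 
stand-in, not the real SparkListenerSQLExecutionStart definition:
{code:scala}
// Illustrative only: an Option[Long] field defaulting to None lets a replayed
// old event be distinguished from one whose root execution ID is really 0.
case class SQLExecutionStartEvent(
    executionId: Long,
    description: String,
    rootExecutionId: Option[Long] = None)

val replayedFromOldLog = SQLExecutionStartEvent(2L, "select * from range(20)")
assert(replayedFromOldLog.rootExecutionId.isEmpty)  // absent, not misread as 0
{code}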



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42754) Spark 3.4 history server's SQL tab incorrectly groups SQL executions when replaying event logs from Spark 3.3 and earlier

2023-03-10 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-42754:
---
Attachment: example.png

> Spark 3.4 history server's SQL tab incorrectly groups SQL executions when 
> replaying event logs from Spark 3.3 and earlier
> -
>
> Key: SPARK-42754
> URL: https://issues.apache.org/jira/browse/SPARK-42754
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Josh Rosen
>Priority: Blocker
> Attachments: example.png
>
>
> In Spark 3.4.0 RC4, the Spark History Server's SQL tab incorrectly groups SQL 
> executions when replaying event logs generated by older Spark versions.
>  
> {*}Reproduction{*}:
> In {{./bin/spark-shell --conf spark.eventLog.enabled=true --conf 
> spark.eventLog.dir=eventlogs}}, run three non-nested SQL queries:
> {code:java}
> sql("select * from range(10)").collect()
> sql("select * from range(20)").collect()
> sql("select * from range(30)").collect(){code}
> Exit the shell and use the Spark History Server to replay this UI.
> In the SQL tab I expect to see three separate queries, but Spark 3.4's 
> history server incorrectly groups the second and third queries as nested 
> queries of the first (see attached screenshot).
>  
> {*}Root cause{*}: 
> [https://github.com/apache/spark/pull/39268] / SPARK-41752 added a new 
> *non-optional* {{rootExecutionId: Long}} field to the 
> SparkListenerSQLExecutionStart case class.
> When JsonProtocol deserializes this event it uses the "ignore missing 
> properties" Jackson deserialization option, causing the {{rootExecutionId}} 
> field to be initialized with a default value of {{0}}.
> The value {{0}} is a legitimate execution ID, so in the deserialized event we 
> have no ability to distinguish between the absence of a value and a case 
> where all queries have the first query as the root.
> {*}Proposed fix{*}:
> I think we should change this field to be of type {{Option[Long]}}. I believe 
> this is a release blocker for Spark 3.4.0 because we cannot change the type of 
> this new field in a future release without breaking binary compatibility.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42753) ReusedExchange refers to non-existent node

2023-03-10 Thread zzzzming95 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699148#comment-17699148
 ] 

zzzzming95 commented on SPARK-42753:


[~steven.chen] 

Can you provide the reproduction code?

> ReusedExchange refers to non-existent node
> --
>
> Key: SPARK-42753
> URL: https://issues.apache.org/jira/browse/SPARK-42753
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 3.4.0
>Reporter: Steven Chen
>Priority: Major
>
> There is an AQE "issue" where, during AQE planning, the Exchange that is being 
> reused could be replaced in the plan tree. So, when we print the query plan, 
> the ReusedExchange will refer to an "unknown" Exchange. An example below:
>  
> {code:java}
> (2775) ReusedExchange [Reuses operator id: unknown]
>  Output [3]: [sr_customer_sk#271, sr_store_sk#275, sum#377L]{code}
>  
>  
> Below is an example to demonstrate the root cause:
>  
> {code:java}
> AdaptiveSparkPlan
>   |-- SomeNode X (subquery xxx)
>       |-- Exchange A
>           |-- SomeNode Y
>               |-- Exchange B
> Subquery:Hosting operator = SomeNode Hosting Expression = xxx 
> dynamicpruning#388
> AdaptiveSparkPlan
>   |-- SomeNode M
>       |-- Exchange C
>           |-- SomeNode N
>               |-- Exchange D
> {code}
>  
>  
> Step 1: Exchange B is materialized and the QueryStage is added to stage cache
> Step 2: Exchange D reuses Exchange B
> Step 3: Exchange C is materialized and the QueryStage is added to stage cache
> Step 4: Exchange A reuses Exchange C
>  
> Then the final plan looks like:
>  
> {code:java}
> AdaptiveSparkPlan
>   |-- SomeNode X (subquery xxx)
>       |-- Exchange A -> ReusedExchange (reuses Exchange C)
> Subquery:Hosting operator = SomeNode Hosting Expression = xxx 
> dynamicpruning#388
> AdaptiveSparkPlan
>   |-- SomeNode M
>       |-- Exchange C -> PhotonShuffleMapStage 
>           |-- SomeNode N
>               |-- Exchange D -> ReusedExchange (reuses Exchange B)
> {code}
>  
>  
> As a result, the ReusedExchange (reuses Exchange B) will refer to a 
> non-existent node. This *DOES NOT* affect query execution but will cause the 
> query visualization to malfunction in the following ways:
>  # The ReusedExchange child subtree will still appear in the Spark UI graph 
> but will contain no node IDs.
>  # The ReusedExchange node details in the Explain plan will refer to an 
> UNKNOWN node. Example below.
> {code:java}
> (2775) ReusedExchange [Reuses operator id: unknown]{code}
>  # The child exchange and its subtree may be missing from the Explain text 
> completely. No node details or tree string shown.
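
As background, a small spark-shell sketch that surfaces exchange reuse in an 
explain plan; it assumes the default AQE and exchange-reuse settings and does 
not reproduce the subquery interleaving described above:
{code:scala}
// The union of two identical aggregations produces two identical shuffles;
// after execution, the repeated shuffle typically appears in the final plan
// as a reused exchange / reused query stage.
val agg = spark.range(0, 1000).withColumn("k", $"id" % 10).groupBy("k").count()
val q = agg.union(agg)
q.collect()   // run the query so AQE finalizes the plan
q.explain()   // inspect the final physical plan for the reused shuffle
{code}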



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42721) Add an Interceptor to log RPCs in connect-server

2023-03-10 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-42721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell resolved SPARK-42721.
---
Fix Version/s: 3.4.1
   (was: 3.5.0)
   Resolution: Fixed

> Add an Interceptor to log RPCs in connect-server
> 
>
> Key: SPARK-42721
> URL: https://issues.apache.org/jira/browse/SPARK-42721
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Raghu Angadi
>Priority: Major
> Fix For: 3.4.1
>
>
> It would be useful to be able to log RPCs to the connect server during 
> development. It makes it simpler to see the flow of messages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42721) Add an Interceptor to log RPCs in connect-server

2023-03-10 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-42721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell reassigned SPARK-42721:
-

Assignee: Raghu Angadi

> Add an Interceptor to log RPCs in connect-server
> 
>
> Key: SPARK-42721
> URL: https://issues.apache.org/jira/browse/SPARK-42721
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Raghu Angadi
>Assignee: Raghu Angadi
>Priority: Major
> Fix For: 3.4.1
>
>
> It would be useful to be able to log RPCs to the connect server during 
> development. It makes it simpler to see the flow of messages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42749) CAST(x as int) does not generate error with overflow

2023-03-10 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699137#comment-17699137
 ] 

Yuming Wang commented on SPARK-42749:
-

Please enable ansi:
{code:sql}
spark-sql (default)> set spark.sql.ansi.enabled=true;
spark.sql.ansi.enabled  true
Time taken: 0.088 seconds, Fetched 1 row(s)
spark-sql (default)> select cast(7.415246799222789E19 as int);
[CAST_OVERFLOW] The value 7.415246799222789E19D of the type "DOUBLE" cannot be 
cast to "INT" due to an overflow. Use `try_cast` to tolerate overflow and 
return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to 
bypass this error.
org.apache.spark.SparkArithmeticException: [CAST_OVERFLOW] The value 
7.415246799222789E19D of the type "DOUBLE" cannot be cast to "INT" due to an 
overflow. Use `try_cast` to tolerate overflow and return NULL instead. If 
necessary set "spark.sql.ansi.enabled" to "false" to bypass this error.
{code}

> CAST(x as int) does not generate error with overflow
> 
>
> Key: SPARK-42749
> URL: https://issues.apache.org/jira/browse/SPARK-42749
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1, 3.3.0, 3.3.1, 3.3.2
> Environment: It was tested on a DataBricks environment with DBR 10.4 
> and above, running Spark v3.2.1 and above.
>Reporter: Tjomme Vergauwen
>Priority: Major
> Attachments: Spark-42749.PNG
>
>
> Hi,
> When performing the following code:
> {{select cast(7.415246799222789E19 as int)}}
> according to the documentation, an error is expected as 
> {{7.415246799222789E19}} is an overflow value for datatype INT.
> However, the value 2147483647 is returned. 
> The behaviour of the following is correct as it returns NULL:
> {{select try_cast(7.415246799222789E19 as int)}}
> This results in unexpected behaviour and data corruption.
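
For reference, a spark-shell sketch of the behaviours discussed in this thread; 
the exact error text may vary by version:
{code:scala}
// With ANSI mode off (the default), the overflowing cast silently saturates.
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("select cast(7.415246799222789E19 as int)").show()      // 2147483647

// With ANSI mode on, the same cast fails with CAST_OVERFLOW, while try_cast
// tolerates the overflow and returns NULL.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("select try_cast(7.415246799222789E19 as int)").show()  // NULL
spark.sql("select cast(7.415246799222789E19 as int)").show()      // throws
{code}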



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42753) ReusedExchange refers to non-existent node

2023-03-10 Thread Steven Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Chen updated SPARK-42753:

Summary: ReusedExchange refers to non-existent node  (was: ReusedExchange 
refers to non-existen node)

> ReusedExchange refers to non-existent node
> --
>
> Key: SPARK-42753
> URL: https://issues.apache.org/jira/browse/SPARK-42753
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 3.4.0
>Reporter: Steven Chen
>Priority: Major
>
> There is an AQE "issue" where, during AQE planning, the Exchange that is being 
> reused could be replaced in the plan tree. So, when we print the query plan, 
> the ReusedExchange will refer to an "unknown" Exchange. An example below:
>  
> {code:java}
> (2775) ReusedExchange [Reuses operator id: unknown]
>  Output [3]: [sr_customer_sk#271, sr_store_sk#275, sum#377L]{code}
>  
>  
> Below is an example to demonstrate the root cause:
>  
> {code:java}
> AdaptiveSparkPlan
>   |-- SomeNode X (subquery xxx)
>       |-- Exchange A
>           |-- SomeNode Y
>               |-- Exchange B
> Subquery:Hosting operator = SomeNode Hosting Expression = xxx 
> dynamicpruning#388
> AdaptiveSparkPlan
>   |-- SomeNode M
>       |-- Exchange C
>           |-- SomeNode N
>               |-- Exchange D
> {code}
>  
>  
> Step 1: Exchange B is materialized and the QueryStage is added to stage cache
> Step 2: Exchange D reuses Exchange B
> Step 3: Exchange C is materialized and the QueryStage is added to stage cache
> Step 4: Exchange A reuses Exchange C
>  
> Then the final plan looks like:
>  
> {code:java}
> AdaptiveSparkPlan
>   |-- SomeNode X (subquery xxx)
>       |-- Exchange A -> ReusedExchange (reuses Exchange C)
> Subquery:Hosting operator = SomeNode Hosting Expression = xxx 
> dynamicpruning#388
> AdaptiveSparkPlan
>   |-- SomeNode M
>       |-- Exchange C -> PhotonShuffleMapStage 
>           |-- SomeNode N
>               |-- Exchange D -> ReusedExchange (reuses Exchange B)
> {code}
>  
>  
> As a result, the ReusedExchange (reuses Exchange B) will refer to a 
> non-existent node. This *DOES NOT* affect query execution but will cause the 
> query visualization to malfunction in the following ways:
>  # The ReusedExchange child subtree will still appear in the Spark UI graph 
> but will contain no node IDs.
>  # The ReusedExchange node details in the Explain plan will refer to an 
> UNKNOWN node. Example below.
> {code:java}
> (2775) ReusedExchange [Reuses operator id: unknown]{code}
>  # The child exchange and its subtree may be missing from the Explain text 
> completely. No node details or tree string shown.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42753) ReusedExchange refers to non-existen node

2023-03-10 Thread Steven Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Chen updated SPARK-42753:

Description: 
There is an AQE "issue" where, during AQE planning, the Exchange that is being 
reused could be replaced in the plan tree. So, when we print the query plan, 
the ReusedExchange will refer to an "unknown" Exchange. An example below:

 
{code:java}
(2775) ReusedExchange [Reuses operator id: unknown]
 Output [3]: [sr_customer_sk#271, sr_store_sk#275, sum#377L]{code}
 

 

Below is an example to demonstrate the root cause:

 
{code:java}
AdaptiveSparkPlan
  |-- SomeNode X (subquery xxx)
      |-- Exchange A
          |-- SomeNode Y
              |-- Exchange B
Subquery:Hosting operator = SomeNode Hosting Expression = xxx dynamicpruning#388
AdaptiveSparkPlan
  |-- SomeNode M
      |-- Exchange C
          |-- SomeNode N
              |-- Exchange D
{code}
 

 

Step 1: Exchange B is materialized and the QueryStage is added to stage cache

Step 2: Exchange D reuses Exchange B

Step 3: Exchange C is materialized and the QueryStage is added to stage cache

Step 4: Exchange A reuses Exchange C

 

Then the final plan looks like:

 
{code:java}
AdaptiveSparkPlan
  |-- SomeNode X (subquery xxx)
      |-- Exchange A -> ReusedExchange (reuses Exchange C)

Subquery:Hosting operator = SomeNode Hosting Expression = xxx dynamicpruning#388
AdaptiveSparkPlan
  |-- SomeNode M
      |-- Exchange C -> PhotonShuffleMapStage 
          |-- SomeNode N
              |-- Exchange D -> ReusedExchange (reuses Exchange B)
{code}
 

 

As a result, the ReusedExchange (reuses Exchange B) will refer to a 
non-existent node. This *DOES NOT* affect query execution but will cause the 
query visualization to malfunction in the following ways:
 # The ReusedExchange child subtree will still appear in the Spark UI graph but 
will contain no node IDs.
 # The ReusedExchange node details in the Explain plan will refer to an UNKNOWN 
node. Example below.

{code:java}
(2775) ReusedExchange [Reuses operator id: unknown]{code}
 # The child exchange and its subtree may be missing from the Explain text 
completely. No node details or tree string shown.

  was:
There is an AQE “issue“ where during AQE planning, the Exchange "that's being" 
reused could be replaced in the plan tree. So, when we print the query plan, 
the ReusedExchange will refer to an “unknown“ Exchange. An example 
below:{{{}{}}}
{code:java}

{code}
{{ (2775) ReusedExchange [Reuses operator id: unknown]
 Output [3]: [sr_customer_sk#271, sr_store_sk#275, sum#377L]}}

{{ }}

 

Below is an example to demonstrate the root cause:

{{}}
{code:java}

{code}
{{AdaptiveSparkPlan
  |-- SomeNode X (subquery xxx)
  |-- Exchange A
  |-- SomeNode Y
  |-- Exchange B

Subquery:Hosting operator = SomeNode Hosting Expression = xxx dynamicpruning#388
AdaptiveSparkPlan
  |-- SomeNode M
  |-- Exchange C
  |-- SomeNode N
  |-- Exchange D}}

{{ }}

 

Step 1: Exchange B is materialized and the QueryStage is added to stage cache

Step 2: Exchange D reuses Exchange B

Step 3: Exchange C is materialized and the QueryStage is added to stage cache

Step 4: Exchange A reuses Exchange C

 

Then the final plan looks like:

{{}}
{code:java}

{code}
{{AdaptiveSparkPlan
  |-- SomeNode X (subquery xxx)
  |-- Exchange A -> ReusedExchange (reuses Exchange C)


Subquery:Hosting operator = SomeNode Hosting Expression = xxx dynamicpruning#388
AdaptiveSparkPlan
  |-- SomeNode M
  |-- Exchange C -> PhotonShuffleMapStage 
  |-- SomeNode N
  |-- Exchange D -> ReusedExchange (reuses Exchange B)}}

{{ }}

 

As a result, the ReusedExchange (reuses Exchange B) will refer to a non-exist 
node. This *DOES NOT* affect query execution but will cause the query 
visualization malfunction in the following ways:
 # The ReusedExchange child subtree will still appear in the Spark UI graph but 
will contain no node IDs.
 # The ReusedExchange node details in the Explain plan will refer to a UNKNOWN 
node. Example below.

{code:java}
(2775) ReusedExchange [Reuses operator id: unknown]{code}

 # The child exchange and its subtree may be missing from the Explain text 
completely. No node details or tree string shown.


> ReusedExchange refers to non-existen node
> -
>
> Key: SPARK-42753
> URL: https://issues.apache.org/jira/browse/SPARK-42753
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 3.4.0
>Reporter: Steven Chen
>Priority: Major
>
> There is an AQE "issue" where, during AQE planning, the Exchange that is being 
> reused could be replaced in the plan tree. So, when we print the query plan, 
> the ReusedExchange will refer to an "unknown" Exchange. An example 
> below:
>  
> {code:java}
> (2

[jira] [Created] (SPARK-42753) ReusedExchange refers to non-existen node

2023-03-10 Thread Steven Chen (Jira)
Steven Chen created SPARK-42753:
---

 Summary: ReusedExchange refers to non-existen node
 Key: SPARK-42753
 URL: https://issues.apache.org/jira/browse/SPARK-42753
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Web UI
Affects Versions: 3.4.0
Reporter: Steven Chen


There is an AQE “issue“ where during AQE planning, the Exchange "that's being" 
reused could be replaced in the plan tree. So, when we print the query plan, 
the ReusedExchange will refer to an “unknown“ Exchange. An example 
below:{{{}{}}}
{code:java}

{code}
{{ (2775) ReusedExchange [Reuses operator id: unknown]
 Output [3]: [sr_customer_sk#271, sr_store_sk#275, sum#377L]}}

{{ }}

 

Below is an example to demonstrate the root cause:

{{}}
{code:java}

{code}
{{AdaptiveSparkPlan
  |-- SomeNode X (subquery xxx)
  |-- Exchange A
  |-- SomeNode Y
  |-- Exchange B

Subquery:Hosting operator = SomeNode Hosting Expression = xxx dynamicpruning#388
AdaptiveSparkPlan
  |-- SomeNode M
  |-- Exchange C
  |-- SomeNode N
  |-- Exchange D}}

{{ }}

 

Step 1: Exchange B is materialized and the QueryStage is added to stage cache

Step 2: Exchange D reuses Exchange B

Step 3: Exchange C is materialized and the QueryStage is added to stage cache

Step 4: Exchange A reuses Exchange C

 

Then the final plan looks like:

{{}}
{code:java}

{code}
{{AdaptiveSparkPlan
  |-- SomeNode X (subquery xxx)
  |-- Exchange A -> ReusedExchange (reuses Exchange C)


Subquery:Hosting operator = SomeNode Hosting Expression = xxx dynamicpruning#388
AdaptiveSparkPlan
  |-- SomeNode M
  |-- Exchange C -> PhotonShuffleMapStage 
  |-- SomeNode N
  |-- Exchange D -> ReusedExchange (reuses Exchange B)}}

{{ }}

 

As a result, the ReusedExchange (reuses Exchange B) will refer to a non-exist 
node. This *DOES NOT* affect query execution but will cause the query 
visualization malfunction in the following ways:
 # The ReusedExchange child subtree will still appear in the Spark UI graph but 
will contain no node IDs.
 # The ReusedExchange node details in the Explain plan will refer to a UNKNOWN 
node. Example below.

{code:java}
(2775) ReusedExchange [Reuses operator id: unknown]{code}

 # The child exchange and its subtree may be missing from the Explain text 
completely. No node details or tree string shown.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42101) Wrap InMemoryTableScanExec with QueryStage

2023-03-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-42101:
--
Affects Version/s: 3.5.0
   (was: 3.4.0)

> Wrap InMemoryTableScanExec with QueryStage
> --
>
> Key: SPARK-42101
> URL: https://issues.apache.org/jira/browse/SPARK-42101
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Priority: Major
>
> The first access to a cached plan with AQE enabled is tricky. Currently, we 
> cannot preserve its output partitioning and ordering.
> The whole query plan also misses many optimizations in the AQE framework. 
> Wrapping InMemoryTableScanExec in a query stage can resolve all these issues.
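
For illustration, a spark-shell sketch of the kind of scenario the description 
refers to; whether the join below avoids an extra shuffle depends on the cached 
aggregate's output partitioning being preserved, which is what this issue is 
about:
{code:scala}
// A cached, hash-partitioned aggregate that is scanned again under AQE.
spark.conf.set("spark.sql.adaptive.enabled", "true")
val cached = spark.range(0, 100000).withColumn("k", $"id" % 100)
  .groupBy("k").count().cache()
cached.count()   // materialize the cache
cached.join(cached.withColumnRenamed("count", "c2"), "k").explain()
{code}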



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42718) Upgrade rocksdbjni to 7.10.2

2023-03-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-42718.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40337
[https://github.com/apache/spark/pull/40337]

> Upgrade rocksdbjni to 7.10.2
> 
>
> Key: SPARK-42718
> URL: https://issues.apache.org/jira/browse/SPARK-42718
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.5.0
>
>
> https://github.com/facebook/rocksdb/releases/tag/v7.10.2



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42718) Upgrade rocksdbjni to 7.10.2

2023-03-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-42718:
-

Assignee: Yang Jie

> Upgrade rocksdbjni to 7.10.2
> 
>
> Key: SPARK-42718
> URL: https://issues.apache.org/jira/browse/SPARK-42718
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>
> https://github.com/facebook/rocksdb/releases/tag/v7.10.2



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42752) Unprintable IllegalArgumentException with Hive catalog enabled in "Hadoop Free" distribution

2023-03-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42752:


Assignee: Apache Spark

> Unprintable IllegalArgumentException with Hive catalog enabled in "Hadoop 
> Free" distibution
> ---
>
> Key: SPARK-42752
> URL: https://issues.apache.org/jira/browse/SPARK-42752
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.1.3, 3.2.4, 3.3.3, 3.4.1, 3.5.0
> Environment: local
>Reporter: Gera Shegalov
>Assignee: Apache Spark
>Priority: Major
>
> Reproduction steps:
> 1. download a standard "Hadoop Free" build
> 2. Start pyspark REPL with Hive support
> {code:java}
> SPARK_DIST_CLASSPATH=$(~/dist/hadoop-3.4.0-SNAPSHOT/bin/hadoop classpath) 
> ~/dist/spark-3.2.3-bin-without-hadoop/bin/pyspark --conf 
> spark.sql.catalogImplementation=hive
> {code}
> 3. Execute any simple dataframe operation
> {code:java}
> >>> spark.range(100).show()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/session.py",
>  line 416, in range
> jdf = self._jsparkSession.range(0, int(start), int(step), 
> int(numPartitions))
>   File 
> "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py",
>  line 1321, in __call__
>   File 
> "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/utils.py", 
> line 117, in deco
> raise converted from None
> pyspark.sql.utils.IllegalArgumentException: 
> {code}
> 4. In fact you can just call spark.conf to trigger this issue
> {code:java}
> >>> spark.conf
> Traceback (most recent call last):
>   File "", line 1, in 
> ...
> {code}
> There are probably two issues here:
> 1) Hive support should be gracefully disabled if the dependency is not on the 
> classpath, as claimed by 
> https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html
> 2) but at the very least the user should be able to see the exception to 
> understand the issue and take action
>  
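
Regarding point 1) above, a hedged sketch of the kind of classpath probe such 
graceful degradation could use; the class name checked here is an illustrative 
assumption:
{code:scala}
// Illustrative only: fall back to the in-memory catalog when Hive classes are
// not on the classpath instead of failing with an unprintable exception.
def hiveClassesArePresent: Boolean =
  try {
    Class.forName("org.apache.hadoop.hive.conf.HiveConf")
    true
  } catch {
    case _: ClassNotFoundException | _: NoClassDefFoundError => false
  }

val catalogImpl = if (hiveClassesArePresent) "hive" else "in-memory"
// e.g. pass --conf spark.sql.catalogImplementation=<catalogImpl> at launch
{code}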



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42752) Unprintable IllegalArgumentException with Hive catalog enabled in "Hadoop Free" distribution

2023-03-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699108#comment-17699108
 ] 

Apache Spark commented on SPARK-42752:
--

User 'gerashegalov' has created a pull request for this issue:
https://github.com/apache/spark/pull/40372

> Unprintable IllegalArgumentException with Hive catalog enabled in "Hadoop 
> Free" distibution
> ---
>
> Key: SPARK-42752
> URL: https://issues.apache.org/jira/browse/SPARK-42752
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.1.3, 3.2.4, 3.3.3, 3.4.1, 3.5.0
> Environment: local
>Reporter: Gera Shegalov
>Priority: Major
>
> Reproduction steps:
> 1. download a standard "Hadoop Free" build
> 2. Start pyspark REPL with Hive support
> {code:java}
> SPARK_DIST_CLASSPATH=$(~/dist/hadoop-3.4.0-SNAPSHOT/bin/hadoop classpath) 
> ~/dist/spark-3.2.3-bin-without-hadoop/bin/pyspark --conf 
> spark.sql.catalogImplementation=hive
> {code}
> 3. Execute any simple dataframe operation
> {code:java}
> >>> spark.range(100).show()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/session.py",
>  line 416, in range
> jdf = self._jsparkSession.range(0, int(start), int(step), 
> int(numPartitions))
>   File 
> "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py",
>  line 1321, in __call__
>   File 
> "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/utils.py", 
> line 117, in deco
> raise converted from None
> pyspark.sql.utils.IllegalArgumentException: 
> {code}
> 4. In fact you can just call spark.conf to trigger this issue
> {code:java}
> >>> spark.conf
> Traceback (most recent call last):
>   File "", line 1, in 
> ...
> {code}
> There are probably two issues here:
> 1) Hive support should be gracefully disabled if the dependency is not on the 
> classpath, as claimed by 
> https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html
> 2) but at the very least the user should be able to see the exception to 
> understand the issue and take action
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42752) Unprintable IllegalArgumentException with Hive catalog enabled in "Hadoop Free" distribution

2023-03-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42752:


Assignee: (was: Apache Spark)

> Unprintable IllegalArgumentException with Hive catalog enabled in "Hadoop 
> Free" distibution
> ---
>
> Key: SPARK-42752
> URL: https://issues.apache.org/jira/browse/SPARK-42752
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.1.3, 3.2.4, 3.3.3, 3.4.1, 3.5.0
> Environment: local
>Reporter: Gera Shegalov
>Priority: Major
>
> Reproduction steps:
> 1. download a standard "Hadoop Free" build
> 2. Start pyspark REPL with Hive support
> {code:java}
> SPARK_DIST_CLASSPATH=$(~/dist/hadoop-3.4.0-SNAPSHOT/bin/hadoop classpath) 
> ~/dist/spark-3.2.3-bin-without-hadoop/bin/pyspark --conf 
> spark.sql.catalogImplementation=hive
> {code}
> 3. Execute any simple dataframe operation
> {code:java}
> >>> spark.range(100).show()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/session.py",
>  line 416, in range
> jdf = self._jsparkSession.range(0, int(start), int(step), 
> int(numPartitions))
>   File 
> "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py",
>  line 1321, in __call__
>   File 
> "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/utils.py", 
> line 117, in deco
> raise converted from None
> pyspark.sql.utils.IllegalArgumentException: 
> {code}
> 4. In fact you can just call spark.conf to trigger this issue
> {code:java}
> >>> spark.conf
> Traceback (most recent call last):
>   File "", line 1, in 
> ...
> {code}
> There are probably two issues here:
> 1) Hive support should be gracefully disabled if the dependency is not on the 
> classpath, as claimed by 
> https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html
> 2) but at the very least the user should be able to see the exception to 
> understand the issue and take action
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42752) Unprintable IllegalArgumentException with Hive catalog enabled in "Hadoop Free" distribution

2023-03-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699107#comment-17699107
 ] 

Apache Spark commented on SPARK-42752:
--

User 'gerashegalov' has created a pull request for this issue:
https://github.com/apache/spark/pull/40372

> Unprintable IllegalArgumentException with Hive catalog enabled in "Hadoop 
> Free" distibution
> ---
>
> Key: SPARK-42752
> URL: https://issues.apache.org/jira/browse/SPARK-42752
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.1.3, 3.2.4, 3.3.3, 3.4.1, 3.5.0
> Environment: local
>Reporter: Gera Shegalov
>Priority: Major
>
> Reproduction steps:
> 1. download a standard "Hadoop Free" build
> 2. Start pyspark REPL with Hive support
> {code:java}
> SPARK_DIST_CLASSPATH=$(~/dist/hadoop-3.4.0-SNAPSHOT/bin/hadoop classpath) 
> ~/dist/spark-3.2.3-bin-without-hadoop/bin/pyspark --conf 
> spark.sql.catalogImplementation=hive
> {code}
> 3. Execute any simple dataframe operation
> {code:java}
> >>> spark.range(100).show()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/session.py",
>  line 416, in range
> jdf = self._jsparkSession.range(0, int(start), int(step), 
> int(numPartitions))
>   File 
> "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py",
>  line 1321, in __call__
>   File 
> "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/utils.py", 
> line 117, in deco
> raise converted from None
> pyspark.sql.utils.IllegalArgumentException: 
> {code}
> 4. In fact you can just call spark.conf to trigger this issue
> {code:java}
> >>> spark.conf
> Traceback (most recent call last):
>   File "", line 1, in 
> ...
> {code}
> There are probably two issues here:
> 1) Hive support should be gracefully disabled if the dependency is not on the 
> classpath, as claimed by 
> https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html
> 2) but at the very least the user should be able to see the exception to 
> understand the issue and take action
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42752) Unprintable IllegalArgumentException with Hive catalog enabled in "Hadoop Free" distribution

2023-03-10 Thread Gera Shegalov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated SPARK-42752:
--
Description: 
Reproduction steps:
1. download a standard "Hadoop Free" build
2. Start pyspark REPL with Hive support
{code:java}
SPARK_DIST_CLASSPATH=$(~/dist/hadoop-3.4.0-SNAPSHOT/bin/hadoop classpath) 
~/dist/spark-3.2.3-bin-without-hadoop/bin/pyspark --conf 
spark.sql.catalogImplementation=hive
{code}
3. Execute any simple dataframe operation
{code:java}
>>> spark.range(100).show()
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/session.py", 
line 416, in range
jdf = self._jsparkSession.range(0, int(start), int(step), 
int(numPartitions))
  File 
"/home/user/dist/spark-3.2.3-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py",
 line 1321, in __call__
  File 
"/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/utils.py", 
line 117, in deco
raise converted from None
pyspark.sql.utils.IllegalArgumentException: 
{code}
4. In fact you can just call spark.conf to trigger this issue
{code:java}
>>> spark.conf
Traceback (most recent call last):
  File "", line 1, in 
...
{code}

There are probably two issues here:
1) Hive support should be gracefully disabled if the dependency is not on the 
classpath, as claimed by 
https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html
2) but at the very least the user should be able to see the exception to 
understand the issue and take action

 

  was:
Reproduction steps:
1. download a standard "Hadoop Free" build
2. Start pyspark REPL with Hive support
{code:java}
SPARK_DIST_CLASSPATH=$(~/dist/hadoop-3.4.0-SNAPSHOT/bin/hadoop classpath) 
~/dist/spark-3.2.3-bin-without-hadoop/bin/pyspark --conf 
spark.sql.catalogImplementation=hive
{code}
3. Execute any simple dataframe operation
{code:java}
>>> spark.range(100).show()
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/session.py", 
line 416, in range
jdf = self._jsparkSession.range(0, int(start), int(step), 
int(numPartitions))
  File 
"/home/user/dist/spark-3.2.3-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py",
 line 1321, in __call__
  File 
"/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/utils.py", 
line 117, in deco
raise converted from None
pyspark.sql.utils.IllegalArgumentException: 
>>> spark.conf
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/session.py", 
line 347, in conf
self._conf = RuntimeConfig(self._jsparkSession.conf())
  File 
"/home/user/dist/spark-3.2.3-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py",
 line 1321, in __call__
  File 
"/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/utils.py", 
line 117, in deco
raise converted from None
pyspark.sql.utils.IllegalArgumentException: 
{code}
4. In fact you can just call spark.conf to trigger this issue
{code:java}
>>> spark.conf
Traceback (most recent call last):
  File "", line 1, in 
...
{code}

There are probably two issues here:
1) that Hive support should be gracefully disabled if it the dependency not on 
the classpath as claimed by 
https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html
2) but at the very least the user should be able to see the exception to 
understand the issue, and take an action

 


> Unprintable IllegalArgumentException with Hive catalog enabled in "Hadoop 
> Free" distibution
> ---
>
> Key: SPARK-42752
> URL: https://issues.apache.org/jira/browse/SPARK-42752
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.1.3, 3.2.4, 3.3.3, 3.4.1, 3.5.0
> Environment: local
>Reporter: Gera Shegalov
>Priority: Major
>
> Reproduction steps:
> 1. download a standard "Hadoop Free" build
> 2. Start pyspark REPL with Hive support
> {code:java}
> SPARK_DIST_CLASSPATH=$(~/dist/hadoop-3.4.0-SNAPSHOT/bin/hadoop classpath) 
> ~/dist/spark-3.2.3-bin-without-hadoop/bin/pyspark --conf 
> spark.sql.catalogImplementation=hive
> {code}
> 3. Execute any simple dataframe operation
> {code:java}
> >>> spark.range(100).show()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/session.py",
>  line 416, in range
> jdf = self._jsparkSession.range(0, int(start), int(step), 
> int(numPartitions))
>   File 
> "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py",
>  line 1321, in

[jira] [Created] (SPARK-42752) Unprintable IllegalArgumentException with Hive catalog enabled in "Hadoop Free" distribution

2023-03-10 Thread Gera Shegalov (Jira)
Gera Shegalov created SPARK-42752:
-

 Summary: Unprintable IllegalArgumentException with Hive catalog 
enabled in "Hadoop Free" distribution
 Key: SPARK-42752
 URL: https://issues.apache.org/jira/browse/SPARK-42752
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 3.1.3, 3.2.4, 3.3.3, 3.4.1, 3.5.0
 Environment: local
Reporter: Gera Shegalov


Reproduction steps:
1. download a standard "Hadoop Free" build
2. Start pyspark REPL with Hive support
{code:java}
SPARK_DIST_CLASSPATH=$(~/dist/hadoop-3.4.0-SNAPSHOT/bin/hadoop classpath) 
~/dist/spark-3.2.3-bin-without-hadoop/bin/pyspark --conf 
spark.sql.catalogImplementation=hive
{code}
3. Execute any simple dataframe operation
{code:java}
>>> spark.range(100).show()
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/session.py", 
line 416, in range
jdf = self._jsparkSession.range(0, int(start), int(step), 
int(numPartitions))
  File 
"/home/user/dist/spark-3.2.3-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py",
 line 1321, in __call__
  File 
"/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/utils.py", 
line 117, in deco
raise converted from None
pyspark.sql.utils.IllegalArgumentException: 
>>> spark.conf
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/session.py", 
line 347, in conf
self._conf = RuntimeConfig(self._jsparkSession.conf())
  File 
"/home/user/dist/spark-3.2.3-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py",
 line 1321, in __call__
  File 
"/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/utils.py", 
line 117, in deco
raise converted from None
pyspark.sql.utils.IllegalArgumentException: 
{code}
4. In fact you can just call spark.conf to trigger this issue
{code:java}
>>> spark.conf
Traceback (most recent call last):
  File "", line 1, in 
...
{code}

There are probably two issues here:
1) Hive support should be gracefully disabled if the dependency is not on the 
classpath, as claimed by 
https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html
2) but at the very least the user should be able to see the exception to 
understand the issue and take action

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41498) Union does not propagate Metadata output

2023-03-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41498:


Assignee: Apache Spark

> Union does not propagate Metadata output
> 
>
> Key: SPARK-41498
> URL: https://issues.apache.org/jira/browse/SPARK-41498
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0, 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.3.1
>Reporter: Fredrik Klauß
>Assignee: Apache Spark
>Priority: Major
>
> Currently, the Union operator does not propagate any metadata output. This 
> makes it impossible to access any metadata if a Union operator is used, even 
> though the children have the exact same metadata output.
> Example:
>  
> {code:java}
> val df1 = spark.read.load(path1)
> val df2 = spark.read.load(path2)
> df1.union(df2).select("_metadata.file_path"). // <-- fails{code}
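
As a point of comparison, an untested workaround sketch: select the metadata 
field on each child before the union so it travels as an ordinary column; 
{{path1}}/{{path2}} below are placeholders standing in for the paths in the 
example above:
{code:scala}
import org.apache.spark.sql.functions.col

val path1 = "/tmp/data1"   // placeholder paths
val path2 = "/tmp/data2"
// Materializing _metadata.file_path on each child first means the union no
// longer has to propagate metadata output.
val df1 = spark.read.load(path1).select(col("*"), col("_metadata.file_path"))
val df2 = spark.read.load(path2).select(col("*"), col("_metadata.file_path"))
df1.union(df2).select("file_path").show()
{code}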



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41498) Union does not propagate Metadata output

2023-03-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41498:


Assignee: (was: Apache Spark)

> Union does not propagate Metadata output
> 
>
> Key: SPARK-41498
> URL: https://issues.apache.org/jira/browse/SPARK-41498
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0, 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.3.1
>Reporter: Fredrik Klauß
>Priority: Major
>
> Currently, the Union operator does not propagate any metadata output. This 
> makes it impossible to access any metadata if a Union operator is used, even 
> though the children have the exact same metadata output.
> Example:
>  
> {code:java}
> val df1 = spark.read.load(path1)
> val df2 = spark.read.load(path2)
> df1.union(df2).select("_metadata.file_path"). // <-- fails{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41498) Union does not propagate Metadata output

2023-03-10 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699076#comment-17699076
 ] 

Dongjoon Hyun commented on SPARK-41498:
---

This is reverted via 
https://github.com/apache/spark/commit/164db5ba3c39614017f5ef6428194a442d79b425

> Union does not propagate Metadata output
> 
>
> Key: SPARK-41498
> URL: https://issues.apache.org/jira/browse/SPARK-41498
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0, 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.3.1
>Reporter: Fredrik Klauß
>Priority: Major
>
> Currently, the Union operator does not propagate any metadata output. This 
> makes it impossible to access any metadata if a Union operator is used, even 
> though the children have the exact same metadata output.
> Example:
>  
> {code:java}
> val df1 = spark.read.load(path1)
> val df2 = spark.read.load(path2)
> df1.union(df2).select("_metadata.file_path"). // <-- fails{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41498) Union does not propagate Metadata output

2023-03-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-41498:
--
Fix Version/s: (was: 3.4.0)

> Union does not propagate Metadata output
> 
>
> Key: SPARK-41498
> URL: https://issues.apache.org/jira/browse/SPARK-41498
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0, 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.3.1
>Reporter: Fredrik Klauß
>Priority: Major
>
> Currently, the Union operator does not propagate any metadata output. This 
> makes it impossible to access any metadata if a Union operator is used, even 
> though the children have the exact same metadata output.
> Example:
>  
> {code:java}
> val df1 = spark.read.load(path1)
> val df2 = spark.read.load(path2)
> df1.union(df2).select("_metadata.file_path"). // <-- fails{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-41498) Union does not propagate Metadata output

2023-03-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-41498:
---
  Assignee: (was: Fredrik Klauß)

> Union does not propagate Metadata output
> 
>
> Key: SPARK-41498
> URL: https://issues.apache.org/jira/browse/SPARK-41498
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0, 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.3.1
>Reporter: Fredrik Klauß
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, the Union operator does not propagate any metadata output. This 
> makes it impossible to access any metadata if a Union operator is used, even 
> though the children have the exact same metadata output.
> Example:
>  
> {code:java}
> val df1 = spark.read.load(path1)
> val df2 = spark.read.load(path2)
> df1.union(df2).select("_metadata.file_path"). // <-- fails{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42685) optimize byteToString routines

2023-03-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-42685:
--
Affects Version/s: 3.5.0
   (was: 3.3.2)

> optimize byteToString routines
> --
>
> Key: SPARK-42685
> URL: https://issues.apache.org/jira/browse/SPARK-42685
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Alkis Evlogimenos
>Priority: Minor
> Fix For: 3.5.0
>
>
> {{Utils.byteToString routines are slow because they use BigInt and 
> BigDecimal. This is causing visible CPU usage (1-2% in scan benchmarks).}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42751) Pyspark.pandas.series.str.findall can't handle tuples that are returned by regex

2023-03-10 Thread IonK (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

IonK updated SPARK-42751:
-
Description: 
When you use the str.findall accessor method on a ps.series and you're passing 
a regex pattern that will return match groups, it will return a pyarrow data 
error.

In pandas the result is this:
{code:java}
df.to_pandas()[col].str.findall(regex_pattern, flags=re.IGNORECASE)

returns 

 [("value", , , , )],
 [("value", , , , )],
 [(, , ,"value", )]{code}
 

In pyspark.pandas the result is:
{code:java}
org.apache.spark.api.python.PythonException: 'pyarrow.lib.ArrowTypeError: 
Expected bytes, got a 'tuple' object'.{code}
 

My temporary workaround is using 
{code:java}
df.apply(lambda x: re.findall(regex_pattern, x, flags=re.IGNORECASE)[0]){code}

  was:
When you use the str.findall accessor method on a ps.series and you're passing 
a regex pattern that will return match groups, it will return a pyarrow data 
error.

In pandas the result is this:
{code:java}
df.to_pandas()[col].str.findall(regex_pattern, flags=re.IGNORECASE)

returns 

 [("value", , , , )],
 [("value", , , , )],
 [(, , ,value , )]{code}
 

In pyspark.pandas the result is:
{code:java}
org.apache.spark.api.python.PythonException: 'pyarrow.lib.ArrowTypeError: 
Expected bytes, got a 'tuple' object'.{code}
 

My temporary workaround is using 
{code:java}
df.apply(lambda x: re.findall(regex_pattern, x, flags=re.IGNORECASE)[0]){code}


> Pyspark.pandas.series.str.findall can't handle tuples that are returned by 
> regex
> 
>
> Key: SPARK-42751
> URL: https://issues.apache.org/jira/browse/SPARK-42751
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 3.3.2
>Reporter: IonK
>Priority: Major
>
> When you use the str.findall accessor method on a ps.series and you're 
> passing a regex pattern that will return match groups, it will return a 
> pyarrow data error.
> In pandas the result is this:
> {code:java}
> df.to_pandas()[col].str.findall(regex_pattern, flags=re.IGNORECASE)
> returns 
>  [("value", , , , )],
>  [("value", , , , )],
>  [(, , ,"value", )]{code}
>  
> In pyspark.pandas the result is:
> {code:java}
> org.apache.spark.api.python.PythonException: 'pyarrow.lib.ArrowTypeError: 
> Expected bytes, got a 'tuple' object'.{code}
>  
> My temporary workaround is using 
> {code:java}
> df.apply(lambda x: re.findall(regex_pattern, x, flags=re.IGNORECASE)[0]){code}
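For reference, a self-contained sketch of that apply-based idea (the pattern and data below are invented for illustration; the point is simply to return plain strings rather than tuples so Arrow can serialize the result, which works around the symptom rather than the underlying bug):

{code:python}
# Hypothetical reproduction of the workaround idea: keep the UDF's return
# value a plain string instead of the tuples produced by re.findall when the
# pattern contains groups. Pattern and data are made up for illustration.
import re

import pyspark.pandas as ps

regex_pattern = r"(foo)|(bar)"                 # made-up pattern with two groups
psser = ps.Series(["foo baz", "baz bar", "foo bar"])

def first_group(text):
    matches = re.findall(regex_pattern, text, flags=re.IGNORECASE)
    # Each match is a tuple of groups; keep the first non-empty group as a string.
    return next((group for group in matches[0] if group), "") if matches else ""

print(psser.apply(first_group).to_pandas())
{code}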



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42751) Pyspark.pandas.series.str.findall can't handle tuples that are returned by regex

2023-03-10 Thread IonK (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

IonK updated SPARK-42751:
-
Description: 
When you use the str.findall accessor method on a ps.series and you're passing 
a regex pattern that will return match groups, it will return a pyarrow data 
error.

In pandas the result is this:
{code:java}
df.to_pandas()[col].str.findall(regex_pattern, flags=re.IGNORECASE)

returns 

 [("value", , , , )],
 [("value", , , , )],
 [(, , ,value , )]{code}
 

In pyspark.pandas the result is:
{code:java}
org.apache.spark.api.python.PythonException: 'pyarrow.lib.ArrowTypeError: 
Expected bytes, got a 'tuple' object'.{code}
 

My temporary workaround is using 
{code:java}
df.apply(lambda x: re.findall(regex_pattern, x, flags=re.IGNORECASE)[0]){code}

  was:
When you use the str.findall accessor method on a ps.series and you're passing 
a regex pattern that will return match groups, it will return a pyarrow data 
error.

In pandas the result is this:
{code:java}
df.to_pandas()[col].str.findall(regex_pattern, flags=re.IGNORECASE)

returns 

 [("value", , , , )],
 [("value", , , , )],
 [(, , ,value , )]{code}
 

In pyspark.pandas the result is:
{code:java}
org.apache.spark.api.python.PythonException: 'pyarrow.lib.ArrowTypeError: 
Expected bytes, got a 'tuple' object'.{code}
 


> Pyspark.pandas.series.str.findall can't handle tuples that are returned by 
> regex
> 
>
> Key: SPARK-42751
> URL: https://issues.apache.org/jira/browse/SPARK-42751
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 3.3.2
>Reporter: IonK
>Priority: Major
>
> When you use the str.findall accessor method on a ps.series and you're 
> passing a regex pattern that will return match groups, it will return a 
> pyarrow data error.
> In pandas the result is this:
> {code:java}
> df.to_pandas()[col].str.findall(regex_pattern, flags=re.IGNORECASE)
> returns 
>  [("value", , , , )],
>  [("value", , , , )],
>  [(, , ,value , )]{code}
>  
> In pyspark.pandas the result is:
> {code:java}
> org.apache.spark.api.python.PythonException: 'pyarrow.lib.ArrowTypeError: 
> Expected bytes, got a 'tuple' object'.{code}
>  
> My temporary workaround is using 
> {code:java}
> df.apply(lambda x: re.findall(regex_pattern, x, flags=re.IGNORECASE)[0]){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42751) Pyspark.pandas.series.str.findall can't handle tuples that are returned by regex

2023-03-10 Thread IonK (Jira)
IonK created SPARK-42751:


 Summary: Pyspark.pandas.series.str.findall can't handle tuples 
that are returned by regex
 Key: SPARK-42751
 URL: https://issues.apache.org/jira/browse/SPARK-42751
 Project: Spark
  Issue Type: Bug
  Components: Pandas API on Spark
Affects Versions: 3.3.2
Reporter: IonK


When you use the str.findall accessor method on a ps.series and you're passing 
a regex pattern that will return match groups, it will return a pyarrow data 
error.

In pandas the result is this:

 
{code:java}
df.to_pandas()[col].str.findall(regex_pattern, flags=re.IGNORECASE)

returns 

 [("value", , , , )],
 [("value", , , , )],
 [(, , ,value , )]{code}
In pyspark.pandas the result is:

org.apache.spark.api.python.PythonException: 'pyarrow.lib.ArrowTypeError: 
Expected bytes, got a 'tuple' object'.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42751) Pyspark.pandas.series.str.findall can't handle tuples that are returned by regex

2023-03-10 Thread IonK (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

IonK updated SPARK-42751:
-
Description: 
When you use the str.findall accessor method on a ps.series and you're passing 
a regex pattern that will return match groups, it will return a pyarrow data 
error.

In pandas the result is this:
{code:java}
df.to_pandas()[col].str.findall(regex_pattern, flags=re.IGNORECASE)

returns 

 [("value", , , , )],
 [("value", , , , )],
 [(, , ,value , )]{code}
 

In pyspark.pandas the result is:
{code:java}
org.apache.spark.api.python.PythonException: 'pyarrow.lib.ArrowTypeError: 
Expected bytes, got a 'tuple' object'.{code}
 

  was:
When you use the str.findall accessor method on a ps.series and you're passing 
a regex pattern that will return match groups, it will return a pyarrow data 
error.

In pandas the result is this:

 
{code:java}
df.to_pandas()[col].str.findall(regex_pattern, flags=re.IGNORECASE)

returns 

 [("value", , , , )],
 [("value", , , , )],
 [(, , ,value , )]{code}
In pyspark.pandas the result is:
{code:java}
org.apache.spark.api.python.PythonException: 'pyarrow.lib.ArrowTypeError: 
Expected bytes, got a 'tuple' object'.{code}
 


> Pyspark.pandas.series.str.findall can't handle tuples that are returned by 
> regex
> 
>
> Key: SPARK-42751
> URL: https://issues.apache.org/jira/browse/SPARK-42751
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 3.3.2
>Reporter: IonK
>Priority: Major
>
> When you use the str.findall accessor method on a ps.series and you're 
> passing a regex pattern that will return match groups, it will return a 
> pyarrow data error.
> In pandas the result is this:
> {code:java}
> df.to_pandas()[col].str.findall(regex_pattern, flags=re.IGNORECASE)
> returns 
>  [("value", , , , )],
>  [("value", , , , )],
>  [(, , ,value , )]{code}
>  
> In pyspark.pandas the result is:
> {code:java}
> org.apache.spark.api.python.PythonException: 'pyarrow.lib.ArrowTypeError: 
> Expected bytes, got a 'tuple' object'.{code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42751) Pyspark.pandas.series.str.findall can't handle tuples that are returned by regex

2023-03-10 Thread IonK (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

IonK updated SPARK-42751:
-
Description: 
When you use the str.findall accessor method on a ps.series and you're passing 
a regex pattern that will return match groups, it will return a pyarrow data 
error.

In pandas the result is this:

 
{code:java}
df.to_pandas()[col].str.findall(regex_pattern, flags=re.IGNORECASE)

returns 

 [("value", , , , )],
 [("value", , , , )],
 [(, , ,value , )]{code}
In pyspark.pandas the result is:
{code:java}
org.apache.spark.api.python.PythonException: 'pyarrow.lib.ArrowTypeError: 
Expected bytes, got a 'tuple' object'.{code}
 

  was:
When you use the str.findall accessor method on a ps.series and you're passing 
a regex pattern that will return match groups, it will return a pyarrow data 
error.

In pandas the result is this:

 
{code:java}
df.to_pandas()[col].str.findall(regex_pattern, flags=re.IGNORECASE)

returns 

 [("value", , , , )],
 [("value", , , , )],
 [(, , ,value , )]{code}
In pyspark.pandas the result is:

org.apache.spark.api.python.PythonException: 'pyarrow.lib.ArrowTypeError: 
Expected bytes, got a 'tuple' object'.

 


> Pyspark.pandas.series.str.findall can't handle tuples that are returned by 
> regex
> 
>
> Key: SPARK-42751
> URL: https://issues.apache.org/jira/browse/SPARK-42751
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 3.3.2
>Reporter: IonK
>Priority: Major
>
> When you use the str.findall accessor method on a ps.series and you're 
> passing a regex pattern that will return match groups, it will return a 
> pyarrow data error.
> In pandas the result is this:
>  
> {code:java}
> df.to_pandas()[col].str.findall(regex_pattern, flags=re.IGNORECASE)
> returns 
>  [("value", , , , )],
>  [("value", , , , )],
>  [(, , ,value , )]{code}
> In pyspark.pandas the result is:
> {code:java}
> org.apache.spark.api.python.PythonException: 'pyarrow.lib.ArrowTypeError: 
> Expected bytes, got a 'tuple' object'.{code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42743) Support analyze TimestampNTZ columns

2023-03-10 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-42743:
---
Fix Version/s: 3.4.0
   (was: 3.5.0)

> Support analyze TimestampNTZ columns
> 
>
> Key: SPARK-42743
> URL: https://issues.apache.org/jira/browse/SPARK-42743
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42661) CSV Reader - multiline without quoted fields

2023-03-10 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699056#comment-17699056
 ] 

Sean R. Owen commented on SPARK-42661:
--

I don't know that multi-line makes sense without quoting in your case, as you 
have values broken across lines. You should quote those fields.
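For illustration, a minimal sketch of reading the quoted variant (the path is a placeholder; the key point is that multiline records only parse reliably when the embedded newlines sit inside quoted fields):

{code:python}
# Sketch: multiline CSV parsing relies on quoted fields to delimit records
# that span several lines. The path below is a placeholder.
df = (
    spark.read
    .option("multiline", "true")
    .option("header", "true")
    .option("quote", '"')   # the default quote character, shown for clarity
    .csv("/path/to/correct_multiline_with_quote.csv")
)
df.show(truncate=False)
{code}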

> CSV Reader - multiline without quoted fields
> 
>
> Key: SPARK-42661
> URL: https://issues.apache.org/jira/browse/SPARK-42661
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.1
> Environment: unquoted data
> {code}
> NAME,Address,CITY
> Atlassian,Level 6 341 George Street
> Sydney NSW 2000 Australia,Sydney
> Github,88 Colin P Kelly Junior Street
> San Francisco CA 94107 USA,San Francisco
> {code}
> quoted data : 
> {code}
> "NAME","Address","CITY"
> "Atlassian","Level 6 341 George Street
> Sydney NSW 2000 Australia","Sydney"
> "Github","88 Colin P Kelly Junior Street
> San Francisco CA 94107 USA","San Francisco"
> {code}
>Reporter: Florian FERREIRA
>Priority: Minor
> Attachments: Capture d’écran 2023-03-03 à 12.18.07.png
>
>
> Hello,
> We are facing an issue with the CSV format.
> When we try to read a multiline file without quoted fields, the result is not 
> what we expect.
> With quoted fields, all is OK (cf. the screenshot).
> You can reproduce it easily with this code (just replace the file path):
> {code:java}
> spark.read.options(Map(
> "multiline" -> "true",
> "quote" -> "",
> "header" -> "true",
>   )).csv("/Users/fferreira/correct_multiline.csv").show(false)
> spark.read.options(Map(
> "multiline" -> "true",
> "header" -> "true",  
> )).csv("/Users/fferreira/correct_multiline_with_quote.csv").show(false)
> {code}
> We continue to investigate on our side.
> Thank you.
> !image-2023-03-03-12-11-21-258.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42685) optimize byteToString routines

2023-03-10 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-42685:
-
Priority: Minor  (was: Major)

> optimize byteToString routines
> --
>
> Key: SPARK-42685
> URL: https://issues.apache.org/jira/browse/SPARK-42685
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.2
>Reporter: Alkis Evlogimenos
>Priority: Minor
>
> {{Utils.byteToString routines are slow because they use BigInt and 
> BigDecimal. This is causing visible CPU usage (1-2% in scan benchmarks).}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42685) optimize byteToString routines

2023-03-10 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-42685.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/40301

> optimize byteToString routines
> --
>
> Key: SPARK-42685
> URL: https://issues.apache.org/jira/browse/SPARK-42685
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.2
>Reporter: Alkis Evlogimenos
>Priority: Minor
> Fix For: 3.5.0
>
>
> {{Utils.byteToString routines are slow because they use BigInt and 
> BigDecimal. This is causing visible CPU usage (1-2% in scan benchmarks).}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42036) Kryo ClassCastException getting task result when JDK versions mismatch

2023-03-10 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-42036.
--
Resolution: Not A Problem

Mismatching Java versions would never be supported per se.

> Kryo ClassCastException getting task result when JDK versions mismatch
> --
>
> Key: SPARK-42036
> URL: https://issues.apache.org/jira/browse/SPARK-42036
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: John Zhuge
>Priority: Major
>
> {noformat}
> 22/12/21 01:27:12 ERROR TaskResultGetter: Exception while getting task result
> com.esotericsoftware.kryo.KryoException: java.lang.ClassCastException: 
> java.lang.Integer cannot be cast to java.nio.ByteBuffer
> Serialization trace:
> lowerBounds (org.apache.iceberg.GenericDataFile)
> taskFiles (org.apache.iceberg.spark.source.SparkWrite$TaskCommit)
> writerCommitMessage 
> (org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144)
> {noformat}
> Iceberg 1.1 `BaseFile.lowerBounds` is defined as
> {code:java}
> Map<Integer, ByteBuffer> {code}
> Driver JDK version: 1.8.0_352 (Azul Systems, Inc.)
> Executor JDK version: openjdk version "17.0.5" 2022-10-18 LTS
> Kryo version: 4.0.2
>  
> Same Spark job works when both driver and executors run the same JDK 8 or JDK 
> 17.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42198) spark.read fails to read filenames with accented characters

2023-03-10 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699044#comment-17699044
 ] 

Sean R. Owen commented on SPARK-42198:
--

You would not add /dbfs on Databricks in this case; that is neither relevant nor 
the issue.
What if you escape the path as if it were in a URL?
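An untested sketch of that suggestion, using one of the file names from the error above (it assumes the spark-xml package is available for the {{xml}} format; whether percent-encoding resolves the lookup is exactly what would need verifying):

{code:python}
# Sketch only: percent-encode the accented characters in the path before
# handing it to spark.read, as suggested above.
from urllib.parse import quote

raw_path = (
    "/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass/ccda/"
    "Amalia471_Magaña874_3912696a-0aef-492e-83ef-468262b82966.xml"
)
escaped_path = quote(raw_path, safe="/")   # encodes 'ñ', keeps the slashes

df = (
    spark.read.format("xml")
    .option("rowTag", "ClinicalDocument")
    .load(escaped_path)
)
{code}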

> spark.read fails to read filenames with accented characters
> ---
>
> Key: SPARK-42198
> URL: https://issues.apache.org/jira/browse/SPARK-42198
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Tarique Anwer
>Priority: Major
>
> Unable to read filenames with accented characters in the filename.
> *Sample error:*
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 43 in 
> stage 1.0 failed 4 times, most recent failure: Lost task 43.3 in stage 1.0 
> (TID 105) (10.139.64.5 executor 0): java.io.FileNotFoundException: 
> /4842022074360943/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass/ccda/Amalia471_Magaña874_3912696a-0aef-492e-83ef-468262b82966.xml{code}
>  
> *{{Steps to reproduce error:}}*
> {code:java}
> %sh
> mkdir -p /dbfs/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass
> wget  
> https://synthetichealth.github.io/synthea-sample-data/downloads/synthea_sample_data_ccda_sep2019.zip
>  -O ./synthea_sample_data_ccda_sep2019.zip 
> unzip ./synthea_sample_data_ccda_sep2019.zip -d 
> /dbfs/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass/
> {code}
>  
> {code:java}
> spark.conf.set("spark.sql.caseSensitive", "true")
> df = (
>   spark.read.format('xml')
>    .option("rowTag", "ClinicalDocument")
>   .load('/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass/ccda/')
> ){code}
> Is there a way to deal with this situation where I don't have control over 
> the file names for some reason?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42127) Spark 3.3.0, Error with java.io.IOException: Mkdirs failed to create file

2023-03-10 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-42127.
--
Resolution: Not A Problem

No detail here; it's not obvious that this isn't just a permissions issue.

> Spark 3.3.0, Error with java.io.IOException: Mkdirs failed to create file
> -
>
> Key: SPARK-42127
> URL: https://issues.apache.org/jira/browse/SPARK-42127
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: shamim
>Priority: Major
>
> 23/01/18 20:23:24 WARN TaskSetManager: Lost task 4.0 in stage 0.0 (TID 4) 
> (10.64.109.72 executor 0): java.io.IOException: Mkdirs failed to create 
> file:/var/backup/_temporary/0/_temporary/attempt_202301182023173234741341853025716_0005_m_04_0
>  (exists=false, cwd=file:/opt/spark-3.3.0/work/app-20230118202317-0001/0)
>         at 
> org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:515)
>         at 
> org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:500)
>         at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1195)
>         at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1081)
>         at 
> org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:113)
>         at 
> org.apache.spark.internal.io.HadoopMapRedWriteConfigUtil.initWriter(SparkHadoopWriter.scala:238)
>         at 
> org.apache.spark.internal.io.SparkHadoopWriter$.executeTask(SparkHadoopWriter.scala:126)
>         at 
> org.apache.spark.internal.io.SparkHadoopWriter$.$anonfun$write$1(SparkHadoopWriter.scala:88)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>         at org.apache.spark.scheduler.Task.run(Task.scala:136)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:750) 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42425) spark-hadoop-cloud is not provided in the default Spark distribution

2023-03-10 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699042#comment-17699042
 ] 

Sean R. Owen commented on SPARK-42425:
--

The docs don't say it's part of the Spark distro; in fact, they tell you to 
bundle it in your app. It is not bundled on purpose.

> spark-hadoop-cloud is not provided in the default Spark distribution
> 
>
> Key: SPARK-42425
> URL: https://issues.apache.org/jira/browse/SPARK-42425
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.3.1
>Reporter: Arseniy Tashoyan
>Priority: Major
>
> The library spark-hadoop-cloud is absent in the default Spark distribution 
> (as well as its dependencies like hadoop-aws). Therefore the dependency 
> management section described in [Integration with Cloud 
> Infrastructures|https://spark.apache.org/docs/3.3.1/cloud-integration.html#installation]
>  is invalid. Actually the libraries for cloud integration are not provided.
> A naive workaround would be to add the spark-hadoop-cloud library as a 
> compile-scope dependency. However, this does not work due to Spark classpath 
> hierarchy. Spark system classloader does not see classes loaded by the 
> application classloader.
> Therefore a proper fix would be to enable the hadoop-cloud build profile by 
> default: -Phadoop-cloud



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42425) spark-hadoop-cloud is not provided in the default Spark distribution

2023-03-10 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-42425.
--
Resolution: Not A Problem

> spark-hadoop-cloud is not provided in the default Spark distribution
> 
>
> Key: SPARK-42425
> URL: https://issues.apache.org/jira/browse/SPARK-42425
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.3.1
>Reporter: Arseniy Tashoyan
>Priority: Major
>
> The library spark-hadoop-cloud is absent in the default Spark distribution 
> (as well as its dependencies like hadoop-aws). Therefore the dependency 
> management section described in [Integration with Cloud 
> Infrastructures|https://spark.apache.org/docs/3.3.1/cloud-integration.html#installation]
>  is invalid. Actually the libraries for cloud integration are not provided.
> A naive workaround would be to add the spark-hadoop-cloud library as a 
> compile-scope dependency. However, this does not work due to Spark classpath 
> hierarchy. Spark system classloader does not see classes loaded by the 
> application classloader.
> Therefore a proper fix would be to enable the hadoop-cloud build profile by 
> default: -Phadoop-cloud



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42479) Log4j2 doesn't work with Spark 3.3.0

2023-03-10 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699041#comment-17699041
 ] 

Sean R. Owen commented on SPARK-42479:
--

This is because you're pointing at some local path not visible on the workers. 
It's quite expected.

> Log4j2 doesn't work with Spark 3.3.0
> -
>
> Key: SPARK-42479
> URL: https://issues.apache.org/jira/browse/SPARK-42479
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Submit
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Pratik Malani
>Priority: Major
>
> Hi All,
> I was trying to run a Spark application in cluster mode using Log4j2 and 
> Spark 3.3.0.
> When I run the below spark-submit command, only one worker (out of 3) starts 
> executing the job.
> {code:java}
> // code placeholder
> spark-submit --master spark://spark-master-svc:7077 \
> --conf spark.cores.max=4 \
> --conf spark.sql.broadcastTimeout=3600 \
> --conf spark.executor.cores=1 \
> --jars /opt/spark/work-dir/.jar \
> --deploy-mode cluster \
> --class  \
> --properties-file /opt/spark/conf/spark-defaults.conf \
> --conf 
> spark.driver.extraJavaOptions="-Dcom.amazonaws.sdk.disableCertChecking=true 
> -Dlog4j.configurationFile=file:/opt/spark/work-dir/.properties" \
> --conf 
> spark.executor.extraJavaOptions="-Dcom.amazonaws.sdk.disableCertChecking=true 
> -Dlog4j.configurationFile=file:/opt/spark/work-dir/.properties" \
> --files "/opt/spark/work-dir/.properties" \
> /opt/spark/work-dir/.jar 
> /opt/spark/work-dir/application.properties >> /var/log/containers/hourly.log 
> 2>&1 {code}
> That is, the driver logs appear on only one worker, while the other workers 
> are idle and no app or executor logs are created on them.
> Below is the log4j2.properties file being used.
> {code:java}
> // code placeholder
> rootLogger.level = INFO
> rootLogger.appenderRef.rolling.ref = loggerId
> appender.rolling.type = RollingFile
> appender.rolling.name = loggerId
> appender.rolling.fileName=/var/log/containers/hourly.log
> appender.rolling.filePattern=hourly-.%d{MMdd}.log.gz
> appender.rolling.layout.type = PatternLayout
> appender.rolling.layout.pattern=%d [%t] %-5p (%F:%L) - %m%n
> appender.rolling.policies.type = Policies
> appender.rolling.policies.size.type = TimeBasedTriggeringPolicy
> appender.rolling.strategy.type = DefaultRolloverStrategy
> appender.rolling.strategy.max = 5
> logger.spark.name = org.apache.spark
> logger.spark.level = WARN
> logger.spark.additivity = false
> logger.spark.repl.SparkIMain$exprTyper.level = INFO
> logger.spark.repl.SparkILoop$SparkILoopInterpreter.level = INFO
> # Settings to quiet third party logs that are too verbose
> logger.jetty.name = org.eclipse.jetty
> logger.jetty.level = WARN
> logger.jetty.util.component.AbstractLifeCycle.level = ERROR
> logger.parquet.name = org.apache.parquet
> logger.parquet.level = ERROR
> logger.kafka.name = org.apache.kafka
> logger.kafka.level = WARN
> logger.kafka.clients.consumer.internals.Fetcher.level=WARN {code}
> All log4j2 jars are included in the Spark home classpath under the jars 
> directory.
>  * log4j-1.2-api-2.17.2.jar
>  * log4j-api-2.17.2.jar
>  * log4j-api-scala_2.12-12.0.jar
>  * log4j-core-2.17.2.jar
>  * log4j-slf4j-impl-2.17.2.jar
> Can you please check and let me know whether I need to add or update anything 
> to start the job in cluster mode with Log4j2?
> Note: things work fine with Log4j 1.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42479) Log4j2 doesn't work with Spark 3.3.0

2023-03-10 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-42479.
--
Resolution: Not A Problem

> Log4j2 doesn't work with Spark 3.3.0
> -
>
> Key: SPARK-42479
> URL: https://issues.apache.org/jira/browse/SPARK-42479
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Submit
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Pratik Malani
>Priority: Major
>
> Hi All,
> I was trying to run a Spark application in cluster mode using Log4j2 and 
> Spark 3.3.0.
> When I run the below spark-submit command, only one worker (out of 3) starts 
> executing the job.
> {code:java}
> // code placeholder
> spark-submit --master spark://spark-master-svc:7077 \
> --conf spark.cores.max=4 \
> --conf spark.sql.broadcastTimeout=3600 \
> --conf spark.executor.cores=1 \
> --jars /opt/spark/work-dir/.jar \
> --deploy-mode cluster \
> --class  \
> --properties-file /opt/spark/conf/spark-defaults.conf \
> --conf 
> spark.driver.extraJavaOptions="-Dcom.amazonaws.sdk.disableCertChecking=true 
> -Dlog4j.configurationFile=file:/opt/spark/work-dir/.properties" \
> --conf 
> spark.executor.extraJavaOptions="-Dcom.amazonaws.sdk.disableCertChecking=true 
> -Dlog4j.configurationFile=file:/opt/spark/work-dir/.properties" \
> --files "/opt/spark/work-dir/.properties" \
> /opt/spark/work-dir/.jar 
> /opt/spark/work-dir/application.properties >> /var/log/containers/hourly.log 
> 2>&1 {code}
> That is, the driver logs appear on only one worker, while the other workers 
> are idle and no app or executor logs are created on them.
> Below is the log4j2.properties file being used.
> {code:java}
> // code placeholder
> rootLogger.level = INFO
> rootLogger.appenderRef.rolling.ref = loggerId
> appender.rolling.type = RollingFile
> appender.rolling.name = loggerId
> appender.rolling.fileName=/var/log/containers/hourly.log
> appender.rolling.filePattern=hourly-.%d{MMdd}.log.gz
> appender.rolling.layout.type = PatternLayout
> appender.rolling.layout.pattern=%d [%t] %-5p (%F:%L) - %m%n
> appender.rolling.policies.type = Policies
> appender.rolling.policies.size.type = TimeBasedTriggeringPolicy
> appender.rolling.strategy.type = DefaultRolloverStrategy
> appender.rolling.strategy.max = 5
> logger.spark.name = org.apache.spark
> logger.spark.level = WARN
> logger.spark.additivity = false
> logger.spark.repl.SparkIMain$exprTyper.level = INFO
> logger.spark.repl.SparkILoop$SparkILoopInterpreter.level = INFO
> # Settings to quiet third party logs that are too verbose
> logger.jetty.name = org.eclipse.jetty
> logger.jetty.level = WARN
> logger.jetty.util.component.AbstractLifeCycle.level = ERROR
> logger.parquet.name = org.apache.parquet
> logger.parquet.level = ERROR
> logger.kafka.name = org.apache.kafka
> logger.kafka.level = WARN
> logger.kafka.clients.consumer.internals.Fetcher.level=WARN {code}
> All log4j2 jars are included in the Spark home classpath under the jars 
> directory.
>  * log4j-1.2-api-2.17.2.jar
>  * log4j-api-2.17.2.jar
>  * log4j-api-scala_2.12-12.0.jar
>  * log4j-core-2.17.2.jar
>  * log4j-slf4j-impl-2.17.2.jar
> Can you please check and let me know whether I need to add or update anything 
> to start the job in cluster mode with Log4j2?
> Note: things work fine with Log4j 1.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42607) [MESOS] OMP_NUM_THREADS not set to number of executor cores by default

2023-03-10 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-42607.
--
Resolution: Not A Problem

> [MESOS] OMP_NUM_THREADS not set to number of executor cores by default
> --
>
> Key: SPARK-42607
> URL: https://issues.apache.org/jira/browse/SPARK-42607
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 3.3.2
>Reporter: John Zhuge
>Priority: Major
>
> We could have similar issue to SPARK-42596 (YARN) in Mesos.
> Could someone verify? Unfortunately I am not able to, due to lack of infra. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42607) [MESOS] OMP_NUM_THREADS not set to number of executor cores by default

2023-03-10 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699040#comment-17699040
 ] 

Sean R. Owen commented on SPARK-42607:
--

I don't think we're touching Mesos at this point; it's all but deprecated.

> [MESOS] OMP_NUM_THREADS not set to number of executor cores by default
> --
>
> Key: SPARK-42607
> URL: https://issues.apache.org/jira/browse/SPARK-42607
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 3.3.2
>Reporter: John Zhuge
>Priority: Major
>
> We could have similar issue to SPARK-42596 (YARN) in Mesos.
> Could someone verify? Unfortunately I am not able to, due to lack of infra. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42750) Support INSERT INTO by name

2023-03-10 Thread Jose Torres (Jira)
Jose Torres created SPARK-42750:
---

 Summary: Support INSERT INTO by name
 Key: SPARK-42750
 URL: https://issues.apache.org/jira/browse/SPARK-42750
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Jose Torres


In some use cases, users have incoming dataframes with fixed column names which 
might differ from the canonical order. Currently there's no way to handle this 
easily through the INSERT INTO API; the user has to make sure the columns are in 
the right order, as they would when inserting a tuple. We should add an optional 
BY NAME clause, such that:

INSERT INTO tgt BY NAME <source query>

takes each column of the source query and inserts it into the column in `tgt` 
which has the same name according to the configured `resolver` logic.
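Until such a clause exists, a rough workaround sketch (where {{tgt}} and {{df}} stand in for the real target table and incoming dataframe) is to reorder the dataframe's columns to the target schema before a positional insert:

{code:python}
# Sketch: emulate insert-by-name by aligning the dataframe's columns with the
# target table's column order, then doing the usual positional insert.
# "tgt" and df are placeholders for the real table name and dataframe.
target_columns = [field.name for field in spark.table("tgt").schema.fields]
df.select(*target_columns).write.insertInto("tgt")
{code}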



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42627) Spark: Getting SQLException: Unsupported type -102 reading from Oracle

2023-03-10 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699039#comment-17699039
 ] 

Sean R. Owen commented on SPARK-42627:
--

Not enough info here. What type is not supported?

> Spark: Getting SQLException: Unsupported type -102 reading from Oracle
> --
>
> Key: SPARK-42627
> URL: https://issues.apache.org/jira/browse/SPARK-42627
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2
>Reporter: melin
>Priority: Major
>
>  
> {code:java}
> Exception in thread "main" org.apache.spark.SparkSQLException: Unrecognized 
> SQL type -102
>     at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.unrecognizedSqlTypeError(QueryExecutionErrors.scala:832)
>     at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getCatalystType(JdbcUtils.scala:225)
>     at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getSchema$1(JdbcUtils.scala:308)
>     at scala.Option.getOrElse(Option.scala:189)
>     at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:308)
>     at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.getQueryOutputSchema(JDBCRDD.scala:70)
>     at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:58)
>     at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:242)
>     at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:37)
>     at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
>     at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
>     at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
>     at scala.Option.getOrElse(Option.scala:189)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:171)
>  
> {code}
> oracle driver
> {code:java}
> <dependency>
>     <groupId>com.oracle.database.jdbc</groupId>
>     <artifactId>ojdbc8</artifactId>
>     <version>21.9.0.0</version>
> </dependency>
> {code}
>  
> oracle sql:
>  
> {code:java}
> CREATE TABLE "ORDERS" 
>    (    "ORDER_ID" NUMBER(9,0) NOT NULL ENABLE, 
>     "ORDER_DATE" TIMESTAMP (3) WITH LOCAL TIME ZONE NOT NULL ENABLE, 
>     "CUSTOMER_NAME" VARCHAR2(255) NOT NULL ENABLE, 
>     "PRICE" NUMBER(10,5) NOT NULL ENABLE, 
>     "PRODUCT_ID" NUMBER(9,0) NOT NULL ENABLE, 
>     "ORDER_STATUS" NUMBER(1,0) NOT NULL ENABLE, 
>      PRIMARY KEY ("ORDER_ID")
>   USING INDEX PCTFREE 10 INITRANS 2 MAXTRANS 255 COMPUTE STATISTICS 
>   STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
>   PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT FLASH_CACHE 
> DEFAULT CELL_FLASH_CACHE DEFAULT)
>   TABLESPACE "LOGMINER_TBS"  ENABLE, 
>      SUPPLEMENTAL LOG DATA (ALL) COLUMNS
>    ) SEGMENT CREATION IMMEDIATE 
>   PCTFREE 10 PCTUSED 40 INITRANS 1 MAXTRANS 255 NOCOMPRESS LOGGING
>   STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
>   PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT FLASH_CACHE 
> DEFAULT CELL_FLASH_CACHE DEFAULT)
>   TABLESPACE "LOGMINER_TBS"
>  
> {code}
> [~beliefer] 
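For context, JDBC type -102 appears to be Oracle's TIMESTAMP WITH LOCAL TIME ZONE, i.e. the ORDER_DATE column above. A possible workaround sketch, assuming it is acceptable to cast that column to a plain TIMESTAMP in the query pushed down to Oracle (connection options are placeholders):

{code:python}
# Sketch of a workaround: cast the TIMESTAMP WITH LOCAL TIME ZONE column to a
# plain TIMESTAMP inside the pushed-down query, so the JDBC reader never sees
# the unsupported type. URL and credentials are placeholders.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//host:1521/service")   # placeholder
    .option("user", "user")                                   # placeholder
    .option("password", "password")                           # placeholder
    .option(
        "query",
        """
        SELECT ORDER_ID,
               CAST(ORDER_DATE AS TIMESTAMP) AS ORDER_DATE,
               CUSTOMER_NAME, PRICE, PRODUCT_ID, ORDER_STATUS
        FROM ORDERS
        """,
    )
    .load()
)
{code}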



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42714) Sparksql temporary file conflict

2023-03-10 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699038#comment-17699038
 ] 

Sean R. Owen commented on SPARK-42714:
--

Not really enough info here. How does it happen?

> Sparksql temporary file conflict
> 
>
> Key: SPARK-42714
> URL: https://issues.apache.org/jira/browse/SPARK-42714
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2
>Reporter: hao
>Priority: Major
>
> When Spark SQL performs an insert overwrite, the name of the intermediate 
> temporary file is not unique. As a result, when multiple applications write 
> different partition data to the same partitioned table, they can delete each 
> other's temporary files, causing task failures.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42749) CAST(x as int) does not generate error with overflow

2023-03-10 Thread Tjomme Vergauwen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tjomme Vergauwen updated SPARK-42749:
-
Attachment: Spark-42749.PNG

> CAST(x as int) does not generate error with overflow
> 
>
> Key: SPARK-42749
> URL: https://issues.apache.org/jira/browse/SPARK-42749
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1, 3.3.0, 3.3.1, 3.3.2
> Environment: It was tested on a DataBricks environment with DBR 10.4 
> and above, running Spark v3.2.1 and above.
>Reporter: Tjomme Vergauwen
>Priority: Major
> Attachments: Spark-42749.PNG
>
>
> Hi,
> When performing the following code:
> {{select cast(7.415246799222789E19 as int)}}
> according to the documentation, an error is expected as 
> {{7.415246799222789E19}} is an overflow value for datatype INT.
> However, the value 2147483647 is returned. 
> The behaviour of the following is correct as it returns NULL:
> {{select try_cast(7.415246799222789E19 as int) }}
> This results in unexpected behaviour and data corruption.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42749) CAST(x as int) does not generate error with overflow

2023-03-10 Thread Tjomme Vergauwen (Jira)
Tjomme Vergauwen created SPARK-42749:


 Summary: CAST(x as int) does not generate error with overflow
 Key: SPARK-42749
 URL: https://issues.apache.org/jira/browse/SPARK-42749
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.2, 3.3.1, 3.3.0, 3.2.1
 Environment: It was tested on a DataBricks environment with DBR 10.4 
and above, running Spark v3.2.1 and above.
Reporter: Tjomme Vergauwen


Hi,

When performing the following code:

{{select cast(7.415246799222789E19 as int)}}

according to the documentation, an error is expected as {{7.415246799222789E19}} 
is an overflow value for datatype INT.

However, the value 2147483647 is returned. 

The behaviour of the following is correct as it returns NULL:

{{select try_cast(7.415246799222789E19 as int) }}

This results in unexpected behaviour and data corruption.
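For what it's worth, the documented overflow error appears to be tied to ANSI mode, which is off by default in these versions; a small sketch comparing the two settings (spark is an active session):

{code:python}
# Sketch: compare CAST overflow behaviour with ANSI mode off (the default,
# where the value is clamped to Int.MaxValue) and on (where an overflow error
# is expected to be raised).
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST(7.415246799222789E19 AS INT)").show()   # 2147483647

spark.conf.set("spark.sql.ansi.enabled", "true")
try:
    spark.sql("SELECT CAST(7.415246799222789E19 AS INT)").show()
except Exception as error:   # expect a cast-overflow style error here
    print(type(error).__name__, error)
{code}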



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41498) Union does not propagate Metadata output

2023-03-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699010#comment-17699010
 ] 

Apache Spark commented on SPARK-41498:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/40371

> Union does not propagate Metadata output
> 
>
> Key: SPARK-41498
> URL: https://issues.apache.org/jira/browse/SPARK-41498
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0, 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.3.1
>Reporter: Fredrik Klauß
>Assignee: Fredrik Klauß
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, the Union operator does not propagate any metadata output. This 
> makes it impossible to access any metadata if a Union operator is used, even 
> though the children have the exact same metadata output.
> Example:
>  
> {code:java}
> val df1 = spark.read.load(path1)
> val df2 = spark.read.load(path2)
> df1.union(df2).select("_metadata.file_path"). // <-- fails{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42620) Add `inclusive` parameter for (DataFrame|Series).between_time

2023-03-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42620:


Assignee: (was: Apache Spark)

> Add `inclusive` parameter for (DataFrame|Series).between_time
> -
>
> Key: SPARK-42620
> URL: https://issues.apache.org/jira/browse/SPARK-42620
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
>
> See https://github.com/pandas-dev/pandas/pull/43248



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42620) Add `inclusive` parameter for (DataFrame|Series).between_time

2023-03-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42620:


Assignee: Apache Spark

> Add `inclusive` parameter for (DataFrame|Series).between_time
> -
>
> Key: SPARK-42620
> URL: https://issues.apache.org/jira/browse/SPARK-42620
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> See https://github.com/pandas-dev/pandas/pull/43248



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42620) Add `inclusive` parameter for (DataFrame|Series).between_time

2023-03-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17698999#comment-17698999
 ] 

Apache Spark commented on SPARK-42620:
--

User 'dzhigimont' has created a pull request for this issue:
https://github.com/apache/spark/pull/40370

> Add `inclusive` parameter for (DataFrame|Series).between_time
> -
>
> Key: SPARK-42620
> URL: https://issues.apache.org/jira/browse/SPARK-42620
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
>
> See https://github.com/pandas-dev/pandas/pull/43248



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42398) refine default column value framework

2023-03-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17698982#comment-17698982
 ] 

Apache Spark commented on SPARK-42398:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/40369

> refine default column value framework
> -
>
> Key: SPARK-42398
> URL: https://issues.apache.org/jira/browse/SPARK-42398
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42748) Server-side Artifact Management

2023-03-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17698952#comment-17698952
 ] 

Apache Spark commented on SPARK-42748:
--

User 'vicennial' has created a pull request for this issue:
https://github.com/apache/spark/pull/40368

> Server-side Artifact Management
> ---
>
> Key: SPARK-42748
> URL: https://issues.apache.org/jira/browse/SPARK-42748
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Venkata Sai Akhil Gudesa
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-42653 implements the client-side 
> transfer of artifacts to the server but currently, the server does not 
> process these requests.
>  
> We need to implement a server-side management mechanism to handle storage of 
> these artifacts on the driver as well as perform further processing (such as 
> adding jars and moving class files to the right directories)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42748) Server-side Artifact Management

2023-03-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42748:


Assignee: Apache Spark

> Server-side Artifact Management
> ---
>
> Key: SPARK-42748
> URL: https://issues.apache.org/jira/browse/SPARK-42748
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Venkata Sai Akhil Gudesa
>Assignee: Apache Spark
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-42653 implements the client-side 
> transfer of artifacts to the server but currently, the server does not 
> process these requests.
>  
> We need to implement a server-side management mechanism to handle storage of 
> these artifacts on the driver as well as perform further processing (such as 
> adding jars and moving class files to the right directories)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42748) Server-side Artifact Management

2023-03-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42748:


Assignee: (was: Apache Spark)

> Server-side Artifact Management
> ---
>
> Key: SPARK-42748
> URL: https://issues.apache.org/jira/browse/SPARK-42748
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Venkata Sai Akhil Gudesa
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-42653 implements the client-side 
> transfer of artifacts to the server but currently, the server does not 
> process these requests.
>  
> We need to implement a server-side management mechanism to handle storage of 
> these artifacts on the driver as well as perform further processing (such as 
> adding jars and moving class files to the right directories)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42748) Server-side Artifact Management

2023-03-10 Thread Venkata Sai Akhil Gudesa (Jira)
Venkata Sai Akhil Gudesa created SPARK-42748:


 Summary: Server-side Artifact Management
 Key: SPARK-42748
 URL: https://issues.apache.org/jira/browse/SPARK-42748
 Project: Spark
  Issue Type: New Feature
  Components: Connect
Affects Versions: 3.4.0
Reporter: Venkata Sai Akhil Gudesa


https://issues.apache.org/jira/browse/SPARK-42653 implements the client-side 
transfer of artifacts to the server but currently, the server does not process 
these requests.

 

We need to implement a server-side management mechanism to handle storage of 
these artifacts on the driver as well as perform further processing (such as 
adding jars and moving class files to the right directories)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42747) Fix incorrect internal status of LoR and AFT

2023-03-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17698943#comment-17698943
 ] 

Apache Spark commented on SPARK-42747:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/40367

> Fix incorrect internal status of LoR and AFT
> 
>
> Key: SPARK-42747
> URL: https://issues.apache.org/jira/browse/SPARK-42747
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 3.1.0, 3.2.0, 3.3.0, 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> LoR and AFT apply an internal status to optimize prediction/transform, but the 
> status is not correctly updated in some cases:
> {code:java}
> from pyspark.sql import Row
> from pyspark.ml.classification import *
> from pyspark.ml.linalg import Vectors
> df = spark.createDataFrame(
> [
> (1.0, 1.0, Vectors.dense(0.0, 5.0)),
> (0.0, 2.0, Vectors.dense(1.0, 2.0)),
> (1.0, 3.0, Vectors.dense(2.0, 1.0)),
> (0.0, 4.0, Vectors.dense(3.0, 3.0)),
> ],
> ["label", "weight", "features"],
> )
> lor = LogisticRegression(weightCol="weight")
> model = lor.fit(df)
> # status changes 1
> for t in [0.0, 0.1, 0.2, 0.5, 1.0]:
> model.setThreshold(t).transform(df)
> # status changes 2
> [model.setThreshold(t).predict(Vectors.dense(0.0, 5.0)) for t in [0.0, 0.1, 
> 0.2, 0.5, 1.0]]
> for t in [0.0, 0.1, 0.2, 0.5, 1.0]:
> print(t)
> model.setThreshold(t).transform(df).show()
> #  <- error results
> {code}
> results:
> {code:java}
> 0.0
> +-+--+-+++--+
> |label|weight| features|   rawPrediction| probability|prediction|
> +-+--+-+++--+
> |  1.0|   1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...|   0.0|
> |  0.0|   2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...|   0.0|
> |  1.0|   3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...|   0.0|
> |  0.0|   4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...|   0.0|
> +-+--+-+++--+
> 0.1
> +-+--+-+++--+
> |label|weight| features|   rawPrediction| probability|prediction|
> +-+--+-+++--+
> |  1.0|   1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...|   0.0|
> |  0.0|   2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...|   0.0|
> |  1.0|   3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...|   0.0|
> |  0.0|   4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...|   0.0|
> +-+--+-+++--+
> 0.2
> +-+--+-+++--+
> |label|weight| features|   rawPrediction| probability|prediction|
> +-+--+-+++--+
> |  1.0|   1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...|   0.0|
> |  0.0|   2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...|   0.0|
> |  1.0|   3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...|   0.0|
> |  0.0|   4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...|   0.0|
> +-+--+-+++--+
> 0.5
> +-+--+-+++--+
> |label|weight| features|   rawPrediction| probability|prediction|
> +-+--+-+++--+
> |  1.0|   1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...|   0.0|
> |  0.0|   2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...|   0.0|
> |  1.0|   3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...|   0.0|
> |  0.0|   4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...|   0.0|
> +-+--+-+++--+
> 1.0
> +-+--+-+++--+
> |label|weight| features|   rawPrediction| probability|prediction|
> +-+--+-+++--+
> |  1.0|   1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...|   0.0|
> |  0.0|   2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...|   0.0|
> |  1.0|   3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...|   0.0|
> |  0.0|   4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...|   0.0|
> +-+--+-++---

[jira] [Assigned] (SPARK-42747) Fix incorrect internal status of LoR and AFT

2023-03-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42747:


Assignee: Apache Spark

> Fix incorrect internal status of LoR and AFT
> 
>
> Key: SPARK-42747
> URL: https://issues.apache.org/jira/browse/SPARK-42747
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 3.1.0, 3.2.0, 3.3.0, 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>
> LoR and AFT apply an internal status to optimize prediction/transform, but the 
> status is not correctly updated in some cases:
> {code:java}
> from pyspark.sql import Row
> from pyspark.ml.classification import *
> from pyspark.ml.linalg import Vectors
> df = spark.createDataFrame(
> [
> (1.0, 1.0, Vectors.dense(0.0, 5.0)),
> (0.0, 2.0, Vectors.dense(1.0, 2.0)),
> (1.0, 3.0, Vectors.dense(2.0, 1.0)),
> (0.0, 4.0, Vectors.dense(3.0, 3.0)),
> ],
> ["label", "weight", "features"],
> )
> lor = LogisticRegression(weightCol="weight")
> model = lor.fit(df)
> # status changes 1
> for t in [0.0, 0.1, 0.2, 0.5, 1.0]:
> model.setThreshold(t).transform(df)
> # status changes 2
> [model.setThreshold(t).predict(Vectors.dense(0.0, 5.0)) for t in [0.0, 0.1, 
> 0.2, 0.5, 1.0]]
> for t in [0.0, 0.1, 0.2, 0.5, 1.0]:
> print(t)
> model.setThreshold(t).transform(df).show()
> #  <- error results
> {code}
> results:
> {code:java}
> 0.0
> +-+--+-+++--+
> |label|weight| features|   rawPrediction| probability|prediction|
> +-+--+-+++--+
> |  1.0|   1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...|   0.0|
> |  0.0|   2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...|   0.0|
> |  1.0|   3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...|   0.0|
> |  0.0|   4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...|   0.0|
> +-+--+-+++--+
> 0.1
> +-+--+-+++--+
> |label|weight| features|   rawPrediction| probability|prediction|
> +-+--+-+++--+
> |  1.0|   1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...|   0.0|
> |  0.0|   2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...|   0.0|
> |  1.0|   3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...|   0.0|
> |  0.0|   4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...|   0.0|
> +-+--+-+++--+
> 0.2
> +-+--+-+++--+
> |label|weight| features|   rawPrediction| probability|prediction|
> +-+--+-+++--+
> |  1.0|   1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...|   0.0|
> |  0.0|   2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...|   0.0|
> |  1.0|   3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...|   0.0|
> |  0.0|   4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...|   0.0|
> +-+--+-+++--+
> 0.5
> +-+--+-+++--+
> |label|weight| features|   rawPrediction| probability|prediction|
> +-+--+-+++--+
> |  1.0|   1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...|   0.0|
> |  0.0|   2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...|   0.0|
> |  1.0|   3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...|   0.0|
> |  0.0|   4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...|   0.0|
> +-+--+-+++--+
> 1.0
> +-+--+-+++--+
> |label|weight| features|   rawPrediction| probability|prediction|
> +-+--+-+++--+
> |  1.0|   1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...|   0.0|
> |  0.0|   2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...|   0.0|
> |  1.0|   3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...|   0.0|
> |  0.0|   4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...|   0.0|
> +-+--+-+++--+
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820

[jira] [Assigned] (SPARK-42747) Fix incorrect internal status of LoR and AFT

2023-03-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42747:


Assignee: (was: Apache Spark)

> Fix incorrect internal status of LoR and AFT
> 
>
> Key: SPARK-42747
> URL: https://issues.apache.org/jira/browse/SPARK-42747
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 3.1.0, 3.2.0, 3.3.0, 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> LoR and AFT apply an internal status to optimize prediction/transform, but the 
> status is not correctly updated in some cases:
> {code:java}
> from pyspark.sql import Row
> from pyspark.ml.classification import *
> from pyspark.ml.linalg import Vectors
> df = spark.createDataFrame(
> [
> (1.0, 1.0, Vectors.dense(0.0, 5.0)),
> (0.0, 2.0, Vectors.dense(1.0, 2.0)),
> (1.0, 3.0, Vectors.dense(2.0, 1.0)),
> (0.0, 4.0, Vectors.dense(3.0, 3.0)),
> ],
> ["label", "weight", "features"],
> )
> lor = LogisticRegression(weightCol="weight")
> model = lor.fit(df)
> # status changes 1
> for t in [0.0, 0.1, 0.2, 0.5, 1.0]:
> model.setThreshold(t).transform(df)
> # status changes 2
> [model.setThreshold(t).predict(Vectors.dense(0.0, 5.0)) for t in [0.0, 0.1, 
> 0.2, 0.5, 1.0]]
> for t in [0.0, 0.1, 0.2, 0.5, 1.0]:
> print(t)
> model.setThreshold(t).transform(df).show()
> #  <- error results
> {code}
> results:
> {code:java}
> 0.0
> +-+--+-+++--+
> |label|weight| features|   rawPrediction| probability|prediction|
> +-+--+-+++--+
> |  1.0|   1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...|   0.0|
> |  0.0|   2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...|   0.0|
> |  1.0|   3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...|   0.0|
> |  0.0|   4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...|   0.0|
> +-+--+-+++--+
> 0.1
> +-+--+-+++--+
> |label|weight| features|   rawPrediction| probability|prediction|
> +-+--+-+++--+
> |  1.0|   1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...|   0.0|
> |  0.0|   2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...|   0.0|
> |  1.0|   3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...|   0.0|
> |  0.0|   4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...|   0.0|
> +-+--+-+++--+
> 0.2
> +-+--+-+++--+
> |label|weight| features|   rawPrediction| probability|prediction|
> +-+--+-+++--+
> |  1.0|   1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...|   0.0|
> |  0.0|   2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...|   0.0|
> |  1.0|   3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...|   0.0|
> |  0.0|   4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...|   0.0|
> +-+--+-+++--+
> 0.5
> +-+--+-+++--+
> |label|weight| features|   rawPrediction| probability|prediction|
> +-+--+-+++--+
> |  1.0|   1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...|   0.0|
> |  0.0|   2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...|   0.0|
> |  1.0|   3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...|   0.0|
> |  0.0|   4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...|   0.0|
> +-+--+-+++--+
> 1.0
> +-+--+-+++--+
> |label|weight| features|   rawPrediction| probability|prediction|
> +-+--+-+++--+
> |  1.0|   1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...|   0.0|
> |  0.0|   2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...|   0.0|
> |  1.0|   3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...|   0.0|
> |  0.0|   4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...|   0.0|
> +-+--+-+++--+
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---

[jira] (SPARK-42691) Implement Dataset.semanticHash

2023-03-10 Thread jiaan.geng (Jira)


[ https://issues.apache.org/jira/browse/SPARK-42691 ]


jiaan.geng deleted comment on SPARK-42691:


was (Author: beliefer):
I will take a look!

> Implement Dataset.semanticHash
> --
>
> Key: SPARK-42691
> URL: https://issues.apache.org/jira/browse/SPARK-42691
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Priority: Major
>
> Implement Dataset.semanticHash:
> {code:java}
> /**
> * Returns a `hashCode` of the logical query plan against this [[Dataset]].
> *
> * @note Unlike the standard `hashCode`, the hash is calculated against the 
> query plan
> * simplified by tolerating the cosmetic differences such as attribute names.
> * @since 3.4.0
> */
> @DeveloperApi
> def semanticHash(): Int{code}
> This has to be computed on the Spark Connect server. Please extend the 
> AnalyzePlanRequest and AnalyzePlanResponse messages for this.
> Also make sure this works in PySpark.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42691) Implement Dataset.semanticHash

2023-03-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42691:


Assignee: (was: Apache Spark)

> Implement Dataset.semanticHash
> --
>
> Key: SPARK-42691
> URL: https://issues.apache.org/jira/browse/SPARK-42691
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Priority: Major
>
> Implement Dataset.semanticHash:
> {code:java}
> /**
> * Returns a `hashCode` of the logical query plan against this [[Dataset]].
> *
> * @note Unlike the standard `hashCode`, the hash is calculated against the 
> query plan
> * simplified by tolerating the cosmetic differences such as attribute names.
> * @since 3.4.0
> */
> @DeveloperApi
> def semanticHash(): Int{code}
> This has to be computed on the Spark Connect server. Please extend the 
> AnalyzePlanRequest and AnalyzePlanResponse messages for this.
> Also make sure this works in PySpark.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42691) Implement Dataset.semanticHash

2023-03-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42691:


Assignee: Apache Spark

> Implement Dataset.semanticHash
> --
>
> Key: SPARK-42691
> URL: https://issues.apache.org/jira/browse/SPARK-42691
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Assignee: Apache Spark
>Priority: Major
>
> Implement Dataset.semanticHash:
> {code:java}
> /**
> * Returns a `hashCode` of the logical query plan against this [[Dataset]].
> *
> * @note Unlike the standard `hashCode`, the hash is calculated against the 
> query plan
> * simplified by tolerating the cosmetic differences such as attribute names.
> * @since 3.4.0
> */
> @DeveloperApi
> def semanticHash(): Int{code}
> This has to be computed on the Spark Connect server. Please extend the 
> AnalyzePlanRequest and AnalyzePlanResponse messages for this.
> Also make sure this works in PySpark.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42691) Implement Dataset.semanticHash

2023-03-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17698935#comment-17698935
 ] 

Apache Spark commented on SPARK-42691:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/40366

> Implement Dataset.semanticHash
> --
>
> Key: SPARK-42691
> URL: https://issues.apache.org/jira/browse/SPARK-42691
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Priority: Major
>
> Implement Dataset.semanticHash:
> {code:java}
> /**
> * Returns a `hashCode` of the logical query plan against this [[Dataset]].
> *
> * @note Unlike the standard `hashCode`, the hash is calculated against the 
> query plan
> * simplified by tolerating the cosmetic differences such as attribute names.
> * @since 3.4.0
> */
> @DeveloperApi
> def semanticHash(): Int{code}
> This has to be computed on the Spark Connect server. Please extend the 
> AnalyzePlanRequest and AnalyzePlanResponse messages for this.
> Also make sure this works in PySpark.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42745) Improved AliasAwareOutputExpression works with DSv2

2023-03-10 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-42745:
---
Description: 
After SPARK-40086 / SPARK-42049, the following simple query containing a subselect 
expression:
{noformat}
select (select sum(id) from t1)
{noformat}
fails with:

{noformat}
09:48:57.645 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in 
stage 3.0 (TID 3)
java.lang.NullPointerException
at 
org.apache.spark.sql.execution.datasources.v2.BatchScanExec.batch$lzycompute(BatchScanExec.scala:47)
at 
org.apache.spark.sql.execution.datasources.v2.BatchScanExec.batch(BatchScanExec.scala:47)
at 
org.apache.spark.sql.execution.datasources.v2.BatchScanExec.hashCode(BatchScanExec.scala:60)
at scala.runtime.Statics.anyHash(Statics.java:122)
...
at 
org.apache.spark.sql.catalyst.trees.TreeNode.hashCode(TreeNode.scala:249)
at scala.runtime.Statics.anyHash(Statics.java:122)
at 
scala.collection.mutable.HashTable$HashUtils.elemHashCode(HashTable.scala:416)
at 
scala.collection.mutable.HashTable$HashUtils.elemHashCode$(HashTable.scala:416)
at scala.collection.mutable.HashMap.elemHashCode(HashMap.scala:44)
at scala.collection.mutable.HashTable.addEntry(HashTable.scala:149)
at scala.collection.mutable.HashTable.addEntry$(HashTable.scala:148)
at scala.collection.mutable.HashMap.addEntry(HashMap.scala:44)
at scala.collection.mutable.HashTable.init(HashTable.scala:110)
at scala.collection.mutable.HashTable.init$(HashTable.scala:89)
at scala.collection.mutable.HashMap.init(HashMap.scala:44)
at scala.collection.mutable.HashMap.readObject(HashMap.scala:195)
...
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:87)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:129)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:85)
at 
org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
at org.apache.spark.scheduler.Task.run(Task.scala:139)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1520)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
{noformat}
when DSv2 is enabled.
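
A minimal, self-contained Scala sketch (toy classes, not BatchScanExec) of the general 
failure mode the stack trace suggests: a hash code backed by a @transient lazy val is 
recomputed while a mutable.HashMap is being deserialized, at a point where the transient 
input is still null. This is an illustration of the pattern only, not the Spark code path.

{code:scala}
import java.io._
import scala.collection.mutable

// Toy reproduction of the pattern, not Spark code.
class Key(@transient val source: AnyRef) extends Serializable {
  // Transient cache over a transient input: not available after deserialization.
  @transient lazy val derived: String = source.toString
  override def hashCode(): Int = derived.hashCode
}

object NpeSketch {
  def main(args: Array[String]): Unit = {
    val map = mutable.HashMap[Key, Int](new Key("dsv2") -> 1)

    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    oos.writeObject(map)
    oos.close()

    // Deserializing the map re-hashes its keys, but the transient state behind
    // hashCode is gone, so this line throws a NullPointerException from inside
    // readObject -- compare the stack trace above.
    val ois = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
    ois.readObject()
  }
}
{code}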

> Improved AliasAwareOutputExpression works with DSv2
> ---
>
> Key: SPARK-42745
> URL: https://issues.apache.org/jira/browse/SPARK-42745
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Major
> Fix For: 3.4.0
>
>
> After SPARK-40086 / SPARK-42049, the following simple query containing a subselect 
> expression:
> {noformat}
> select (select sum(id) from t1)
> {noformat}
> fails with:
> {noformat}
> 09:48:57.645 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 
> in stage 3.0 (TID 3)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.datasources.v2.BatchScanExec.batch$lzycompute(BatchScanExec.scala:47)
>   at 
> org.apache.spark.sql.execution.datasources.v2.BatchScanExec.batch(BatchScanExec.scala:47)
>   at 
> org.apache.spark.sql.execution.datasources.v2.BatchScanExec.hashCode(BatchScanExec.scala:60)
>   at scala.runtime.Statics.anyHash(Statics.java:122)
> ...
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.hashCode(TreeNode.scala:249)
>   at scala.runtime.Statics.anyHash(Statics.java:122)
>   at 
> scala.collection.mutable.HashTable$HashUtils.elemHashCode(HashTable.scala:416)
>   at 
> scala.collection.mutable.HashTable$HashUtils.elemHashCode$(HashTable.scala:416)
>   at scala.collection.mutable.HashMap.elemHashCode(HashMap.scala:44)
>   at scala.collection.mutable.HashTable.addEntry(HashTable.scala:149)
>   at scala.collection.mutable.HashTable.addEntry$(HashTable.scala:148)
>   at scala.collection.mutable.HashMap.addEntry(HashMap.scala:44)
>   at scala.collection.mutable.HashTable.init(HashTable.scala:110)
>   at scala.collection.mutable.HashTable.init$(HashTable.scala:89)
>   at scala.collection.mutable.HashMap.init(HashMap.scala:44)
>   at scala.collection.mutable.HashMap.readObject(HashMap.scala:195)
> ...
>   at java.io.ObjectInp

[jira] [Assigned] (SPARK-42745) Improved AliasAwareOutputExpression works with DSv2

2023-03-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-42745:
---

Assignee: Peter Toth

> Improved AliasAwareOutputExpression works with DSv2
> ---
>
> Key: SPARK-42745
> URL: https://issues.apache.org/jira/browse/SPARK-42745
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42745) Improved AliasAwareOutputExpression works with DSv2

2023-03-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-42745.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 40364
[https://github.com/apache/spark/pull/40364]

> Improved AliasAwareOutputExpression works with DSv2
> ---
>
> Key: SPARK-42745
> URL: https://issues.apache.org/jira/browse/SPARK-42745
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42747) Fix incorrect internal status of LoR and AFT

2023-03-10 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-42747:
-

 Summary: Fix incorrect internal status of LoR and AFT
 Key: SPARK-42747
 URL: https://issues.apache.org/jira/browse/SPARK-42747
 Project: Spark
  Issue Type: Bug
  Components: ML, PySpark
Affects Versions: 3.3.0, 3.2.0, 3.1.0, 3.4.0
Reporter: Ruifeng Zheng


LoR and AFT apply an internal status to optimize prediction/transform, but the 
status is not correctly updated in some cases:


{code:java}
from pyspark.sql import Row
from pyspark.ml.classification import *
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame(
[
(1.0, 1.0, Vectors.dense(0.0, 5.0)),
(0.0, 2.0, Vectors.dense(1.0, 2.0)),
(1.0, 3.0, Vectors.dense(2.0, 1.0)),
(0.0, 4.0, Vectors.dense(3.0, 3.0)),
],
["label", "weight", "features"],
)

lor = LogisticRegression(weightCol="weight")
model = lor.fit(df)

# status changes 1
for t in [0.0, 0.1, 0.2, 0.5, 1.0]:
model.setThreshold(t).transform(df)

# status changes 2
[model.setThreshold(t).predict(Vectors.dense(0.0, 5.0)) for t in [0.0, 0.1, 
0.2, 0.5, 1.0]]

for t in [0.0, 0.1, 0.2, 0.5, 1.0]:
print(t)
model.setThreshold(t).transform(df).show()  
  #  <- error results
{code}


results:

{code:java}
0.0
+-+--+-+++--+
|label|weight| features|   rawPrediction| probability|prediction|
+-+--+-+++--+
|  1.0|   1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...|   0.0|
|  0.0|   2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...|   0.0|
|  1.0|   3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...|   0.0|
|  0.0|   4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...|   0.0|
+-+--+-+++--+

0.1
+-+--+-+++--+
|label|weight| features|   rawPrediction| probability|prediction|
+-+--+-+++--+
|  1.0|   1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...|   0.0|
|  0.0|   2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...|   0.0|
|  1.0|   3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...|   0.0|
|  0.0|   4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...|   0.0|
+-+--+-+++--+

0.2
+-+--+-+++--+
|label|weight| features|   rawPrediction| probability|prediction|
+-+--+-+++--+
|  1.0|   1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...|   0.0|
|  0.0|   2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...|   0.0|
|  1.0|   3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...|   0.0|
|  0.0|   4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...|   0.0|
+-+--+-+++--+

0.5
+-+--+-+++--+
|label|weight| features|   rawPrediction| probability|prediction|
+-+--+-+++--+
|  1.0|   1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...|   0.0|
|  0.0|   2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...|   0.0|
|  1.0|   3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...|   0.0|
|  0.0|   4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...|   0.0|
+-+--+-+++--+

1.0
+-+--+-+++--+
|label|weight| features|   rawPrediction| probability|prediction|
+-+--+-+++--+
|  1.0|   1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...|   0.0|
|  0.0|   2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...|   0.0|
|  1.0|   3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...|   0.0|
|  0.0|   4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...|   0.0|
+-+--+-+++--+

{code}
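
A minimal Scala sketch of the caching pattern the report describes and of an 
invalidate-on-set fix: a threshold-dependent value is cached for speed, so any setter 
that changes its input must drop the cache, otherwise later transforms keep using a 
stale value. The class and member names (ThresholdModel, cachedCutoff) are hypothetical; 
this is not the Spark ML implementation.

{code:scala}
// Hypothetical sketch, not the Spark ML code.
class ThresholdModel {
  private var threshold: Double = 0.5
  // Cached, threshold-dependent cut-off on the raw margin (the "internal status").
  private var cachedCutoff: Option[Double] = None

  def setThreshold(t: Double): this.type = {
    threshold = t
    cachedCutoff = None // the fix: invalidate the cache whenever its input changes
    this
  }

  private def cutoff: Double = cachedCutoff match {
    case Some(c) => c
    case None =>
      val c = math.log(threshold / (1.0 - threshold)) // margin cut-off for this threshold
      cachedCutoff = Some(c)
      c
  }

  def predict(rawMargin: Double): Double = if (rawMargin > cutoff) 1.0 else 0.0
}

// Without the `cachedCutoff = None` line, a sequence such as
//   model.setThreshold(0.0).predict(...); model.setThreshold(1.0).predict(...)
// keeps the cut-off computed for the first threshold, which is the kind of
// stale-state behaviour shown in the output above (predictions never change).
{code}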




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42743) Support analyze TimestampNTZ columns

2023-03-10 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-42743.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40362
[https://github.com/apache/spark/pull/40362]

> Support analyze TimestampNTZ columns
> 
>
> Key: SPARK-42743
> URL: https://issues.apache.org/jira/browse/SPARK-42743
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42743) Support analyze TimestampNTZ columns

2023-03-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42743:


Assignee: Gengliang Wang  (was: Apache Spark)

> Support analyze TimestampNTZ columns
> 
>
> Key: SPARK-42743
> URL: https://issues.apache.org/jira/browse/SPARK-42743
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42743) Support analyze TimestampNTZ columns

2023-03-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42743:


Assignee: Apache Spark  (was: Gengliang Wang)

> Support analyze TimestampNTZ columns
> 
>
> Key: SPARK-42743
> URL: https://issues.apache.org/jira/browse/SPARK-42743
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42743) Support analyze TimestampNTZ columns

2023-03-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17698876#comment-17698876
 ] 

Apache Spark commented on SPARK-42743:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/40362

> Support analyze TimestampNTZ columns
> 
>
> Key: SPARK-42743
> URL: https://issues.apache.org/jira/browse/SPARK-42743
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42691) Implement Dataset.semanticHash

2023-03-10 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17698872#comment-17698872
 ] 

jiaan.geng commented on SPARK-42691:


I will take a look!

> Implement Dataset.semanticHash
> --
>
> Key: SPARK-42691
> URL: https://issues.apache.org/jira/browse/SPARK-42691
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Priority: Major
>
> Implement Dataset.semanticHash:
> {code:java}
> /**
> * Returns a `hashCode` of the logical query plan against this [[Dataset]].
> *
> * @note Unlike the standard `hashCode`, the hash is calculated against the 
> query plan
> * simplified by tolerating the cosmetic differences such as attribute names.
> * @since 3.4.0
> */
> @DeveloperApi
> def semanticHash(): Int{code}
> This has to be computed on the Spark Connect server. Please extend the 
> AnalyzePlanRequest and AnalyzePlanResponse messages for this.
> Also make sure this works in PySpark.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42746) Add the LISTAGG() aggregate function

2023-03-10 Thread Max Gekk (Jira)
Max Gekk created SPARK-42746:


 Summary: Add the LISTAGG() aggregate function
 Key: SPARK-42746
 URL: https://issues.apache.org/jira/browse/SPARK-42746
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.4.0
Reporter: Max Gekk


{{listagg()}} is a common and useful aggregation function that concatenates string 
values in a column, optionally in a certain order. The systems below already 
support such a function:
 *  Oracle: 
https://docs.oracle.com/cd/E11882_01/server.112/e41084/functions089.htm#SQLRF30030
 * Snowflake: https://docs.snowflake.com/en/sql-reference/functions/listagg
 * Amazon Redshift: 
https://docs.aws.amazon.com/redshift/latest/dg/r_LISTAGG.html
 * Google BigQuery: 
https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#string_agg

Need to introduce this new aggregate in Spark, both as a regular aggregate and 
as a window function.

Proposed syntax:

{code:sql}
LISTAGG( [ DISTINCT ] <expr> [, <delimiter> ] ) [ WITHIN GROUP ( <order_by_clause> ) ]
{code}
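
A hedged Scala/Spark sketch of the behaviour the proposed syntax targets, approximated 
with aggregates that already exist (collect_list, sort_array, array_join). LISTAGG itself 
is not available in Spark yet; the column names, delimiter and data are illustrative only.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array_join, collect_list, sort_array}

// Illustration only: approximates what `LISTAGG(v, ',') WITHIN GROUP (ORDER BY v)`
// per group would be expected to return, using functions Spark already has.
object ListAggSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("listagg-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", "3"), ("a", "1"), ("b", "2")).toDF("k", "v")

    df.groupBy($"k")
      .agg(array_join(sort_array(collect_list($"v")), ",").as("vs"))
      .show() // k=a -> "1,3", k=b -> "2"

    spark.stop()
  }
}
{code}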




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42746) Add the LISTAGG() aggregate function

2023-03-10 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-42746:
-
Description: 
{{listagg()}} is a common and useful aggregation function that concatenates string 
values in a column, optionally in a certain order. The systems below already 
support such a function:
 * Oracle: 
[https://docs.oracle.com/cd/E11882_01/server.112/e41084/functions089.htm#SQLRF30030]
 * Snowflake: [https://docs.snowflake.com/en/sql-reference/functions/listagg]
 * Amazon Redshift: 
[https://docs.aws.amazon.com/redshift/latest/dg/r_LISTAGG.html]
 * Google BigQuery: 
[https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#string_agg]

Need to introduce this new aggregate in Spark, both as a regular aggregate and 
as a window function.

Proposed syntax:
{code:sql}
LISTAGG( [ DISTINCT ] <expr> [, <delimiter> ] ) [ WITHIN GROUP ( <order_by_clause> ) ]
{code}

  was:
{{listagg()}} is a common and useful aggregation function to concatenate string 
values in a column, optionally by a certain order. The systems below have 
supported such function already:
 *  Oracle: 
https://docs.oracle.com/cd/E11882_01/server.112/e41084/functions089.htm#SQLRF30030
 * Snowflake: https://docs.snowflake.com/en/sql-reference/functions/listagg
 * Amazon Redshift: 
https://docs.aws.amazon.com/redshift/latest/dg/r_LISTAGG.html
 * Google BigQuery: 
https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#string_agg

Need to introduce this new aggregate in Spark, both as a regular aggregate and 
as a window function.

Proposed syntax:

{code:sql}
LISTAGG( [ DISTINCT ] <expr> [, <delimiter> ] ) [ WITHIN GROUP ( <order_by_clause> ) ]
{code}



> Add the LISTAGG() aggregate function
> 
>
> Key: SPARK-42746
> URL: https://issues.apache.org/jira/browse/SPARK-42746
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> {{listagg()}} is a common and useful aggregation function that concatenates 
> string values in a column, optionally in a certain order. The systems below 
> already support such a function:
>  * Oracle: 
> [https://docs.oracle.com/cd/E11882_01/server.112/e41084/functions089.htm#SQLRF30030]
>  * Snowflake: [https://docs.snowflake.com/en/sql-reference/functions/listagg]
>  * Amazon Redshift: 
> [https://docs.aws.amazon.com/redshift/latest/dg/r_LISTAGG.html]
>  * Google BigQuery: 
> [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#string_agg]
> Need to introduce this new aggregate in Spark, both as a regular aggregate 
> and as a window function.
> Proposed syntax:
> {code:sql}
> LISTAGG( [ DISTINCT ] <expr> [, <delimiter> ] ) [ WITHIN GROUP ( <order_by_clause> ) ]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42737) Shuffle files lost with graceful decommission fallback storage enabled

2023-03-10 Thread Yeachan Park (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yeachan Park updated SPARK-42737:
-
Description: 
During testing of graceful decommissioning, the driver logs indicate that 
shuffle files were lost - `DAGScheduler: Shuffle files lost for executor`:

{code:bash}
23/03/09 15:22:42 INFO KubernetesClusterSchedulerBackend: Decommission 
executors: 3
23/03/09 15:22:42 INFO BlockManagerMasterEndpoint: Mark BlockManagers 
(BlockManagerId(3, 100.96.5.11, 44707, None)) as being decommissioning.
23/03/09 15:22:42 WARN 
KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Received executor 1 
decommissioned message
23/03/09 15:22:42 INFO KubernetesClusterSchedulerBackend: Decommission 
executors: 1
23/03/09 15:22:42 INFO BlockManagerMasterEndpoint: Mark BlockManagers 
(BlockManagerId(1, 100.96.5.9, 44491, None)) as being decommissioning.
23/03/09 15:22:42 WARN 
KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Received executor 2 
decommissioned message
23/03/09 15:22:42 INFO KubernetesClusterSchedulerBackend: Decommission 
executors: 2
23/03/09 15:22:42 INFO BlockManagerMasterEndpoint: Mark BlockManagers 
(BlockManagerId(2, 100.96.5.10, 39011, None)) as being decommissioning.
23/03/09 15:22:44 ERROR TaskSchedulerImpl: Lost executor 3 on 100.96.5.11: 
Executor decommission.
23/03/09 15:22:44 INFO ExecutorMonitor: Executor 3 is removed. Remove reason 
statistics: (gracefully decommissioned: 1, decommision unfinished: 0, driver 
killed: 0, unexpectedly exited: 0).
23/03/09 15:22:44 INFO DAGScheduler: Executor lost: 3 (epoch 0)
23/03/09 15:22:44 ERROR TaskSchedulerImpl: Lost executor 1 on 100.96.5.9: 
Executor decommission.
23/03/09 15:22:44 INFO ExecutorMonitor: Executor 1 is removed. Remove reason 
statistics: (gracefully decommissioned: 2, decommision unfinished: 0, driver 
killed: 0, unexpectedly exited: 0).
23/03/09 15:22:44 ERROR TaskSchedulerImpl: Lost executor 2 on 100.96.5.10: 
Executor decommission.
23/03/09 15:22:44 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 
from BlockManagerMaster.
23/03/09 15:22:44 INFO ExecutorMonitor: Executor 2 is removed. Remove reason 
statistics: (gracefully decommissioned: 3, decommision unfinished: 0, driver 
killed: 0, unexpectedly exited: 0).
23/03/09 15:22:44 INFO BlockManagerMasterEndpoint: Removing block manager 
BlockManagerId(3, 100.96.5.11, 44707, None)
23/03/09 15:22:44 INFO BlockManagerMaster: Removed 3 successfully in 
removeExecutor
23/03/09 15:22:44 INFO DAGScheduler: Shuffle files lost for executor: 3 (epoch 
0)
23/03/09 15:22:44 INFO DAGScheduler: Executor lost: 1 (epoch 1)
23/03/09 15:22:45 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 
from BlockManagerMaster.
23/03/09 15:22:45 INFO BlockManagerMasterEndpoint: Removing block manager 
BlockManagerId(1, 100.96.5.9, 44491, None)
23/03/09 15:22:45 INFO BlockManagerMaster: Removed 1 successfully in 
removeExecutor
23/03/09 15:22:45 INFO DAGScheduler: Shuffle files lost for executor: 1 (epoch 
1)
23/03/09 15:22:45 INFO DAGScheduler: Executor lost: 2 (epoch 2)
23/03/09 15:22:45 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 
from BlockManagerMaster.
23/03/09 15:22:45 INFO BlockManagerMasterEndpoint: Removing block manager 
BlockManagerId(2, 100.96.5.10, 39011, None)
23/03/09 15:22:45 INFO BlockManagerMaster: Removed 2 successfully in 
removeExecutor
23/03/09 15:22:45 INFO DAGScheduler: Shuffle files lost for executor: 2 (epoch 
2)
23/03/09 15:22:52 INFO BlockManagerMaster: Removal of executor 1 requested
23/03/09 15:22:52 INFO 
KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove 
non-existent executor 1
{code}

The decommission logs from the executor also seem to indicate that no shuffle 
data needed to be migrated:

{code:java}
23/03/09 15:22:42 INFO CoarseGrainedExecutorBackend: Decommission executor 1.
23/03/09 15:22:42 INFO CoarseGrainedExecutorBackend: Will exit when finished 
decommissioning
23/03/09 15:22:42 INFO BlockManager: Starting block manager decommissioning 
process...
23/03/09 15:22:43 INFO CoarseGrainedExecutorBackend: Checking to see if we can 
shutdown.
23/03/09 15:22:43 INFO CoarseGrainedExecutorBackend: No running tasks, checking 
migrations
23/03/09 15:22:43 INFO CoarseGrainedExecutorBackend: All blocks not yet 
migrated.
23/03/09 15:22:43 INFO BlockManagerDecommissioner: Starting block migration
23/03/09 15:22:43 INFO BlockManagerDecommissioner: Attempting to migrate all 
RDD blocks
23/03/09 15:22:43 INFO BlockManagerDecommissioner: Attempting to migrate all 
shuffle blocks
23/03/09 15:22:43 INFO BlockManagerDecommissioner: Start refreshing migratable 
shuffle blocks
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Attempting to migrate all 
cached RDD blocks
23/03/09 15:22:44 INFO BlockManagerDecommissioner: 0 of 0 local shuffles are 
added. In total, 0 shuffles are remained.
23/03/09 15:22:4

[jira] [Updated] (SPARK-42737) Shuffle files lost with graceful decommission fallback storage enabled

2023-03-10 Thread Yeachan Park (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yeachan Park updated SPARK-42737:
-
Description: 
During testing of graceful decommissioning, the driver logs indicate that 
shuffle files were lost - `DAGScheduler: Shuffle files lost for executor`:

{code:bash}
23/03/09 15:22:42 INFO KubernetesClusterSchedulerBackend: Decommission 
executors: 3
23/03/09 15:22:42 INFO BlockManagerMasterEndpoint: Mark BlockManagers 
(BlockManagerId(3, 100.96.5.11, 44707, None)) as being decommissioning.
23/03/09 15:22:42 WARN 
KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Received executor 1 
decommissioned message
23/03/09 15:22:42 INFO KubernetesClusterSchedulerBackend: Decommission 
executors: 1
23/03/09 15:22:42 INFO BlockManagerMasterEndpoint: Mark BlockManagers 
(BlockManagerId(1, 100.96.5.9, 44491, None)) as being decommissioning.
23/03/09 15:22:42 WARN 
KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Received executor 2 
decommissioned message
23/03/09 15:22:42 INFO KubernetesClusterSchedulerBackend: Decommission 
executors: 2
23/03/09 15:22:42 INFO BlockManagerMasterEndpoint: Mark BlockManagers 
(BlockManagerId(2, 100.96.5.10, 39011, None)) as being decommissioning.
23/03/09 15:22:44 ERROR TaskSchedulerImpl: Lost executor 3 on 100.96.5.11: 
Executor decommission.
23/03/09 15:22:44 INFO ExecutorMonitor: Executor 3 is removed. Remove reason 
statistics: (gracefully decommissioned: 1, decommision unfinished: 0, driver 
killed: 0, unexpectedly exited: 0).
23/03/09 15:22:44 INFO DAGScheduler: Executor lost: 3 (epoch 0)
23/03/09 15:22:44 ERROR TaskSchedulerImpl: Lost executor 1 on 100.96.5.9: 
Executor decommission.
23/03/09 15:22:44 INFO ExecutorMonitor: Executor 1 is removed. Remove reason 
statistics: (gracefully decommissioned: 2, decommision unfinished: 0, driver 
killed: 0, unexpectedly exited: 0).
23/03/09 15:22:44 ERROR TaskSchedulerImpl: Lost executor 2 on 100.96.5.10: 
Executor decommission.
23/03/09 15:22:44 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 
from BlockManagerMaster.
23/03/09 15:22:44 INFO ExecutorMonitor: Executor 2 is removed. Remove reason 
statistics: (gracefully decommissioned: 3, decommision unfinished: 0, driver 
killed: 0, unexpectedly exited: 0).
23/03/09 15:22:44 INFO BlockManagerMasterEndpoint: Removing block manager 
BlockManagerId(3, 100.96.5.11, 44707, None)
23/03/09 15:22:44 INFO BlockManagerMaster: Removed 3 successfully in 
removeExecutor
23/03/09 15:22:44 INFO DAGScheduler: Shuffle files lost for executor: 3 (epoch 
0)
23/03/09 15:22:44 INFO DAGScheduler: Executor lost: 1 (epoch 1)
23/03/09 15:22:45 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 
from BlockManagerMaster.
23/03/09 15:22:45 INFO BlockManagerMasterEndpoint: Removing block manager 
BlockManagerId(1, 100.96.5.9, 44491, None)
23/03/09 15:22:45 INFO BlockManagerMaster: Removed 1 successfully in 
removeExecutor
23/03/09 15:22:45 INFO DAGScheduler: Shuffle files lost for executor: 1 (epoch 
1)
23/03/09 15:22:45 INFO DAGScheduler: Executor lost: 2 (epoch 2)
23/03/09 15:22:45 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 
from BlockManagerMaster.
23/03/09 15:22:45 INFO BlockManagerMasterEndpoint: Removing block manager 
BlockManagerId(2, 100.96.5.10, 39011, None)
23/03/09 15:22:45 INFO BlockManagerMaster: Removed 2 successfully in 
removeExecutor
23/03/09 15:22:45 INFO DAGScheduler: Shuffle files lost for executor: 2 (epoch 
2)
23/03/09 15:22:52 INFO BlockManagerMaster: Removal of executor 1 requested
23/03/09 15:22:52 INFO 
KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove 
non-existent executor 1
{code}

The decommission logs from the executor also seem to indicate that no shuffle 
data needed to be migrated:

{code:java}
23/03/09 15:22:42 INFO CoarseGrainedExecutorBackend: Decommission executor 1.
23/03/09 15:22:42 INFO CoarseGrainedExecutorBackend: Will exit when finished 
decommissioning
23/03/09 15:22:42 INFO BlockManager: Starting block manager decommissioning 
process...
23/03/09 15:22:43 INFO CoarseGrainedExecutorBackend: Checking to see if we can 
shutdown.
23/03/09 15:22:43 INFO CoarseGrainedExecutorBackend: No running tasks, checking 
migrations
23/03/09 15:22:43 INFO CoarseGrainedExecutorBackend: All blocks not yet 
migrated.
23/03/09 15:22:43 INFO BlockManagerDecommissioner: Starting block migration
23/03/09 15:22:43 INFO BlockManagerDecommissioner: Attempting to migrate all 
RDD blocks
23/03/09 15:22:43 INFO BlockManagerDecommissioner: Attempting to migrate all 
shuffle blocks
23/03/09 15:22:43 INFO BlockManagerDecommissioner: Start refreshing migratable 
shuffle blocks
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Attempting to migrate all 
cached RDD blocks
23/03/09 15:22:44 INFO BlockManagerDecommissioner: 0 of 0 local shuffles are 
added. In total, 0 shuffles are remained.
23/03/09 15:22:4

[jira] [Assigned] (SPARK-42745) Improved AliasAwareOutputExpression works with DSv2

2023-03-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42745:


Assignee: (was: Apache Spark)

> Improved AliasAwareOutputExpression works with DSv2
> ---
>
> Key: SPARK-42745
> URL: https://issues.apache.org/jira/browse/SPARK-42745
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Peter Toth
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42745) Improved AliasAwareOutputExpression works with DSv2

2023-03-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17698851#comment-17698851
 ] 

Apache Spark commented on SPARK-42745:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/40364

> Improved AliasAwareOutputExpression works with DSv2
> ---
>
> Key: SPARK-42745
> URL: https://issues.apache.org/jira/browse/SPARK-42745
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Peter Toth
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42745) Improved AliasAwareOutputExpression works with DSv2

2023-03-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42745:


Assignee: Apache Spark

> Improved AliasAwareOutputExpression works with DSv2
> ---
>
> Key: SPARK-42745
> URL: https://issues.apache.org/jira/browse/SPARK-42745
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Peter Toth
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42744) delete uploaded file when job finish

2023-03-10 Thread Hu Ziqian (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17698833#comment-17698833
 ] 

Hu Ziqian commented on SPARK-42744:
---

add pr https://github.com/apache/spark/pull/40363

> delete uploaded file when job finish
> 
>
> Key: SPARK-42744
> URL: https://issues.apache.org/jira/browse/SPARK-42744
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.2
>Reporter: Hu Ziqian
>Priority: Major
>
> On Kubernetes, Spark uses spark.kubernetes.file.upload.path to upload local 
> files to a Hadoop-compatible file system, but Spark does not delete those 
> files at all.
> In this issue, we add a configuration, 
> spark.kubernetes.uploaded.file.delete.on.termination, so that the driver 
> deletes those files when the job finishes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42745) Improved AliasAwareOutputExpression works with DSv2

2023-03-10 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-42745:
---
Summary: Improved AliasAwareOutputExpression works with DSv2  (was: Fix NPE 
after recent AliasAwareOutputExpression changes)

> Improved AliasAwareOutputExpression works with DSv2
> ---
>
> Key: SPARK-42745
> URL: https://issues.apache.org/jira/browse/SPARK-42745
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Peter Toth
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42745) Fix NPE after recent AliasAwareOutputExpression changes

2023-03-10 Thread Peter Toth (Jira)
Peter Toth created SPARK-42745:
--

 Summary: Fix NPE after recent AliasAwareOutputExpression changes
 Key: SPARK-42745
 URL: https://issues.apache.org/jira/browse/SPARK-42745
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0, 3.5.0
Reporter: Peter Toth






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42744) delete uploaded file when job finish

2023-03-10 Thread Hu Ziqian (Jira)
Hu Ziqian created SPARK-42744:
-

 Summary: delete uploaded file when job finish
 Key: SPARK-42744
 URL: https://issues.apache.org/jira/browse/SPARK-42744
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.1.2
Reporter: Hu Ziqian


On Kubernetes, Spark uses spark.kubernetes.file.upload.path to upload local files 
to a Hadoop-compatible file system, but Spark does not delete those files at all.

In this issue, we add a configuration, 
spark.kubernetes.uploaded.file.delete.on.termination, so that the driver deletes 
those files when the job finishes.
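
A minimal Scala sketch of the kind of cleanup being proposed: a driver shutdown hook 
that removes the per-application upload directory. The configuration key is the one 
quoted in the issue text; everything else (object and method names) is hypothetical 
and not the actual patch.

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical sketch, not the proposed implementation.
object UploadCleanupSketch {
  /** Register a driver shutdown hook that deletes the upload directory recursively. */
  def registerCleanup(uploadDir: String, deleteOnTermination: Boolean): Unit = {
    if (deleteOnTermination) { // e.g. spark.kubernetes.uploaded.file.delete.on.termination=true
      sys.addShutdownHook {
        val path = new Path(uploadDir)
        val fs: FileSystem = path.getFileSystem(new Configuration())
        if (fs.exists(path)) fs.delete(path, true) // recursive delete of the staged uploads
      }
    }
  }
}
{code}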



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42743) Support analyze TimestampNTZ columns

2023-03-10 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-42743:
--

 Summary: Support analyze TimestampNTZ columns
 Key: SPARK-42743
 URL: https://issues.apache.org/jira/browse/SPARK-42743
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org