[jira] [Created] (SPARK-43104) Set `shadeTestJar` of protobuf module to false

2023-04-12 Thread Yang Jie (Jira)
Yang Jie created SPARK-43104:


 Summary: Set `shadeTestJar` of protobuf module to false
 Key: SPARK-43104
 URL: https://issues.apache.org/jira/browse/SPARK-43104
 Project: Spark
  Issue Type: Improvement
  Components: Protobuf
Affects Versions: 3.5.0
Reporter: Yang Jie






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42994) Torch Distributor support Local Mode

2023-04-12 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-42994.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40695
[https://github.com/apache/spark/pull/40695]

> Torch Distributor support Local Mode
> 
>
> Key: SPARK-42994
> URL: https://issues.apache.org/jira/browse/SPARK-42994
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, ML
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42994) Torch Distributor support Local Mode

2023-04-12 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-42994:
-

Assignee: Ruifeng Zheng

> Torch Distributor support Local Mode
> 
>
> Key: SPARK-42994
> URL: https://issues.apache.org/jira/browse/SPARK-42994
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, ML
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43105) Abbreviate Bytes in proto message's debug string

2023-04-12 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-43105:
-

 Summary: Abbreviate Bytes in proto message's debug string
 Key: SPARK-43105
 URL: https://issues.apache.org/jira/browse/SPARK-43105
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 3.5.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43106) Data lost from the table if the INSERT OVERWRITE query fails

2023-04-12 Thread vaibhav beriwala (Jira)
vaibhav beriwala created SPARK-43106:


 Summary: Data lost from the table if the INSERT OVERWRITE query 
fails
 Key: SPARK-43106
 URL: https://issues.apache.org/jira/browse/SPARK-43106
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.2
Reporter: vaibhav beriwala


When we run an INSERT OVERWRITE query for an unpartitioned table on Spark-3, 
Spark has the following behavior:

1) It will first clean up all the data from the actual table path.

2) It will then launch a job that performs the actual insert.

 

There are 2 major issues with this approach:

1) If the insert job launched in step 2 above fails for any reason, the data 
from the original table is lost. 

2) If the insert job in step 2 above takes a long time to complete, then table 
data is unavailable to other readers for the entire duration of the job.

 

This behavior is the same for partitioned tables when using static 
partitioning. For dynamic partitioning, we do not delete the table data before 
the job launch.

 

Is there a reason why we perform this delete before the job launch rather than 
as part of the job commit operation? This issue does not occur with Hive, where 
the data is presumably cleaned up as part of the job commit operation.
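
For illustration, here is a minimal sketch of the sequence that can lose data 
(assuming an active SparkSession named spark; table names such as target_tbl 
and staging_tbl are hypothetical, not from a real workload):
{code:python}
# Hypothetical unpartitioned Hive-format table.
spark.sql("CREATE TABLE target_tbl (id INT, amount DOUBLE) USING hive")
spark.sql("INSERT INTO target_tbl VALUES (1, 10.0), (2, 20.0)")

# INSERT OVERWRITE first clears the table location (step 1 above), then
# launches the insert job (step 2 above). If that job fails for any reason,
# the two rows inserted above are already gone, and other readers see an
# empty table for as long as the job runs.
spark.sql("INSERT OVERWRITE TABLE target_tbl SELECT id, amount FROM staging_tbl")
{code}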



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43106) Data lost from the table if the INSERT OVERWRITE query fails

2023-04-12 Thread vaibhav beriwala (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vaibhav beriwala updated SPARK-43106:
-
Issue Type: Bug  (was: Improvement)

> Data lost from the table if the INSERT OVERWRITE query fails
> 
>
> Key: SPARK-43106
> URL: https://issues.apache.org/jira/browse/SPARK-43106
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2
>Reporter: vaibhav beriwala
>Priority: Major
>
> When we run an INSERT OVERWRITE query for an unpartitioned table on Spark-3, 
> Spark has the following behavior:
> 1) It will first clean up all the data from the actual table path.
> 2) It will then launch a job that performs the actual insert.
>  
> There are 2 major issues with this approach:
> 1) If the insert job launched in step 2 above fails for any reason, the data 
> from the original table is lost. 
> 2) If the insert job in step 2 above takes a long time to complete, then 
> table data is unavailable to other readers for the entire duration of the 
> job.
>  
> This behavior is the same for partitioned tables when using static 
> partitioning. For dynamic partitioning, we do not delete the table data 
> before the job launch.
>  
> Is there a reason why we perform this delete before the job launch rather 
> than as part of the job commit operation? This issue does not occur with 
> Hive, where the data is presumably cleaned up as part of the job commit 
> operation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43106) Data lost from the table if the INSERT OVERWRITE query fails

2023-04-12 Thread vaibhav beriwala (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vaibhav beriwala updated SPARK-43106:
-
Description: 
When we run an INSERT OVERWRITE query for an unpartitioned table on Spark-3, 
Spark has the following behavior:

1) It will first clean up all the data from the actual table path.

2) It will then launch a job that performs the actual insert.

 

There are 2 major issues with this approach:

1) If the insert job launched in step 2 above fails for any reason, the data 
from the original table is lost. 

2) If the insert job in step 2 above takes a long time to complete, then table 
data is unavailable to other readers for the entire duration of the job.

 

This behavior is the same for partitioned tables when using static 
partitioning. For dynamic partitioning, we do not delete the table data before 
the job launch.

 

Is there a reason why we perform this delete before the job launch rather than 
as part of the job commit operation? This issue does not occur with Hive, where 
the data is presumably cleaned up as part of the job commit operation.

 

As part of SPARK-19183, we did add a new job commit hook for this exact 
purpose, but it seems its default behavior is still to delete the table data 
before the job launch.

  was:
When we run an INSERT OVERWRITE query for an unpartitioned table on Spark-3, 
Spark has the following behavior:

1) It will first clean up all the data from the actual table path.

2) It will then launch a job that performs the actual insert.

 

There are 2 major issues with this approach:

1) If the insert job launched in step 2 above fails for any reason, the data 
from the original table is lost. 

2) If the insert job in step 2 above takes a long time to complete, then table 
data is unavailable to other readers for the entire duration of the job.

 

This behavior is the same for partitioned tables when using static 
partitioning. For dynamic partitioning, we do not delete the table data before 
the job launch.

 

Is there a reason why we perform this delete before the job launch rather than 
as part of the job commit operation? This issue does not occur with Hive, where 
the data is presumably cleaned up as part of the job commit operation.


> Data lost from the table if the INSERT OVERWRITE query fails
> 
>
> Key: SPARK-43106
> URL: https://issues.apache.org/jira/browse/SPARK-43106
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2
>Reporter: vaibhav beriwala
>Priority: Major
>
> When we run an INSERT OVERWRITE query for an unpartitioned table on Spark-3, 
> Spark has the following behavior:
> 1) It will first clean up all the data from the actual table path.
> 2) It will then launch a job that performs the actual insert.
>  
> There are 2 major issues with this approach:
> 1) If the insert job launched in step 2 above fails for any reason, the data 
> from the original table is lost. 
> 2) If the insert job in step 2 above takes a long time to complete, then 
> table data is unavailable to other readers for the entire duration of the 
> job.
>  
> This behavior is the same for partitioned tables when using static 
> partitioning. For dynamic partitioning, we do not delete the table data 
> before the job launch.
>  
> Is there a reason why we perform this delete before the job launch rather 
> than as part of the job commit operation? This issue does not occur with 
> Hive, where the data is presumably cleaned up as part of the job commit 
> operation.
>  
> As part of SPARK-19183, we did add a new job commit hook for this exact 
> purpose, but it seems its default behavior is still to delete the table 
> data before the job launch.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43106) Data lost from the table if the INSERT OVERWRITE query fails

2023-04-12 Thread vaibhav beriwala (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vaibhav beriwala updated SPARK-43106:
-
Description: 
When we run an INSERT OVERWRITE query for an unpartitioned table on Spark-3, 
Spark has the following behavior:

1) It will first clean up all the data from the actual table path.

2) It will then launch a job that performs the actual insert.

 

There are 2 major issues with this approach:

1) If the insert job launched in step 2 above fails for any reason, the data 
from the original table is lost. 

2) If the insert job in step 2 above takes a long time to complete, then table 
data is unavailable to other readers for the entire duration of the job.

This behavior is the same for partitioned tables when using static 
partitioning. For dynamic partitioning, we do not delete the table data before 
the job launch.

 

Is there a reason why we perform this delete before the job launch rather than 
as part of the job commit operation? This issue does not occur with Hive, where 
the data is presumably cleaned up as part of the job commit operation. As 
part of SPARK-19183, we did add a new hook in the commit protocol for this 
exact purpose, but it seems its default behavior is still to delete the table 
data before the job launch.

  was:
When we run an INSERT OVERWRITE query for an unpartitioned table on Spark-3, 
Spark has the following behavior:

1) It will first clean up all the data from the actual table path.

2) It will then launch a job that performs the actual insert.

 

There are 2 major issues with this approach:

1) If the insert job launched in step 2 above fails for any reason, the data 
from the original table is lost. 

2) If the insert job in step 2 above takes a long time to complete, then table 
data is unavailable to other readers for the entire duration of the job.

 

This behavior is the same for partitioned tables when using static 
partitioning. For dynamic partitioning, we do not delete the table data before 
the job launch.

 

Is there a reason why we perform this delete before the job launch rather than 
as part of the job commit operation? This issue does not occur with Hive, where 
the data is presumably cleaned up as part of the job commit operation.

 

As part of SPARK-19183, we did add a new job commit hook for this exact 
purpose, but it seems its default behavior is still to delete the table data 
before the job launch.


> Data lost from the table if the INSERT OVERWRITE query fails
> 
>
> Key: SPARK-43106
> URL: https://issues.apache.org/jira/browse/SPARK-43106
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2
>Reporter: vaibhav beriwala
>Priority: Major
>
> When we run an INSERT OVERWRITE query for an unpartitioned table on Spark-3, 
> Spark has the following behavior:
> 1) It will first clean up all the data from the actual table path.
> 2) It will then launch a job that performs the actual insert.
>  
> There are 2 major issues with this approach:
> 1) If the insert job launched in step 2 above fails for any reason, the data 
> from the original table is lost. 
> 2) If the insert job in step 2 above takes a long time to complete, then 
> table data is unavailable to other readers for the entire duration of the 
> job.
> This behavior is the same for partitioned tables when using static 
> partitioning. For dynamic partitioning, we do not delete the table data 
> before the job launch.
>  
> Is there a reason why we perform this delete before the job launch rather 
> than as part of the job commit operation? This issue does not occur with 
> Hive, where the data is presumably cleaned up as part of the job commit 
> operation. As part of SPARK-19183, we did add a new hook in the commit 
> protocol for this exact purpose, but it seems its default behavior is still 
> to delete the table data before the job launch.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43105) Abbreviate Bytes in proto message's debug string

2023-04-12 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711311#comment-17711311
 ] 

GridGain Integration commented on SPARK-43105:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/40750

> Abbreviate Bytes in proto message's debug string
> 
>
> Key: SPARK-43105
> URL: https://issues.apache.org/jira/browse/SPARK-43105
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43103) Move Integral to PhysicalDataType

2023-04-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-43103.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40752
[https://github.com/apache/spark/pull/40752]

> Move Integral to PhysicalDataType
> -
>
> Key: SPARK-43103
> URL: https://issues.apache.org/jira/browse/SPARK-43103
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43107) Coalesce applied on broadcast join stream side

2023-04-12 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-43107:
---

 Summary: Coalesce applied on broadcast join stream side
 Key: SPARK-43107
 URL: https://issues.apache.org/jira/browse/SPARK-43107
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Yuming Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43038) Support the CBC mode by aes_encrypt()/aes_decrypt()

2023-04-12 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-43038.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40704
[https://github.com/apache/spark/pull/40704]

> Support the CBC mode by aes_encrypt()/aes_decrypt()
> ---
>
> Key: SPARK-43038
> URL: https://issues.apache.org/jira/browse/SPARK-43038
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.5.0
>
>
> Implement the CBC mode in aes_encrypt() and aes_decrypt(). Use 
> *AES/CBC/PKCS5Padding* of the Java Cryptographic Extension (JCE) framework. 
> Output should be compatible with OpenSSL.
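
For illustration only (not from the ticket), a round-trip sketch of the new 
mode, assuming an active SparkSession named spark, Spark 3.5 with this change, 
and a 16-byte key:
{code:python}
spark.sql("""
    SELECT cast(
             aes_decrypt(aes_encrypt('Apache Spark', '0000111122223333', 'CBC'),
                         '0000111122223333', 'CBC')
           AS STRING) AS roundtrip
""").show()
# Expected to round-trip the original string 'Apache Spark'.
{code}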



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43108) org.apache.spark.storage.StorageStatus NotSerializableException when try to access StorageStatus in a MapPartitionsFunction

2023-04-12 Thread surender godara (Jira)
surender godara created SPARK-43108:
---

 Summary: org.apache.spark.storage.StorageStatus 
NotSerializableException when try to access StorageStatus in a 
MapPartitionsFunction
 Key: SPARK-43108
 URL: https://issues.apache.org/jira/browse/SPARK-43108
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.2.1
Reporter: surender godara


When you try to access the *storage status 
(org.apache.spark.storage.StorageStatus)* inside a MapPartitionsFunction, the 
getStorageStatus method throws a NotSerializableException. This exception is 
thrown because the StorageStatus object is not serializable.

Here is an example code snippet that demonstrates how to access the storage 
status inside a MapPartitionsFunction in Spark:
{code:java}
StorageStatus[] storageStatus = 
SparkEnv.get().blockManager().master().getStorageStatus();{code}

*Error stacktrace --*
{code:java}
Caused by: java.io.NotSerializableException: 
org.apache.spark.storage.StorageStatus
Serialization stack:
    - object not serializable (class: org.apache.spark.storage.StorageStatus, 
value: org.apache.spark.storage.StorageStatus@715b4e82)
    - element of array (index: 0)
    - array (class [Lorg.apache.spark.storage.StorageStatus;, size 2)
    at 
org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
    at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
    at 
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
    at org.apache.spark.rpc.netty.NettyRpcEnv.serialize(NettyRpcEnv.scala:286)
    at 
org.apache.spark.rpc.netty.RemoteNettyRpcCallContext.send(NettyRpcCallContext.scala:64)
    at 
org.apache.spark.rpc.netty.NettyRpcCallContext.reply(NettyRpcCallContext.scala:32)
    at 
org.apache.spark.storage.BlockManagerMasterEndpoint$$anonfun$receiveAndReply$1.applyOrElse(BlockManagerMasterEndpoint.scala:156)
    at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:103)
    at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
    at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
    at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
    at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
    at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264){code}

*Steps to reproduce*


Step 1: Create a Spark session in standalone mode.

Step 2: Create a Dataset using the SparkSession object and load data from a 
file or from a database.

Step 3: Define the MapPartitionsFunction and fetch the storage status inside it.

Here is the code snippet of MapPartitionsFunction


 
{code:java}
df = df.mapPartitions(new MapPartitionsFunction<Row, Row>() {
            @Override
            public Iterator<Row> call(Iterator<Row> input) throws Exception {
                // Fetching the storage status here is what triggers the
                // NotSerializableException shown above.
                StorageStatus[] storageStatus = 
SparkEnv.get().blockManager().master().getStorageStatus();
                return input;
            }
        }, RowEncoder.apply(df.schema()));
{code}
 


Step 4: Submit the Spark job.

 

*Solution -*

Implement the Serializable interface for org.apache.spark.storage.StorageStatus.

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43108) org.apache.spark.storage.StorageStatus NotSerializableException when try to access StorageStatus in a MapPartitionsFunction

2023-04-12 Thread surender godara (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

surender godara updated SPARK-43108:

Description: 
When you try to access the *storage status 
(org.apache.spark.storage.StorageStatus)* inside a MapPartitionsFunction, the 
getStorageStatus method throws a NotSerializableException. This exception is 
thrown because the StorageStatus object is not serializable.

Here is an example code snippet that demonstrates how to access the storage 
status inside a MapPartitionsFunction in Spark:
{code:java}
StorageStatus[] storageStatus = 
SparkEnv.get().blockManager().master().getStorageStatus();{code}
*Error stacktrace --*
{code:java}
Caused by: java.io.NotSerializableException: 
org.apache.spark.storage.StorageStatus
Serialization stack:
    - object not serializable (class: org.apache.spark.storage.StorageStatus, 
value: org.apache.spark.storage.StorageStatus@715b4e82)
    - element of array (index: 0)
    - array (class [Lorg.apache.spark.storage.StorageStatus;, size 2)
    at 
org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
    at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
    at 
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
    at org.apache.spark.rpc.netty.NettyRpcEnv.serialize(NettyRpcEnv.scala:286)
    at 
org.apache.spark.rpc.netty.RemoteNettyRpcCallContext.send(NettyRpcCallContext.scala:64)
    at 
org.apache.spark.rpc.netty.NettyRpcCallContext.reply(NettyRpcCallContext.scala:32)
    at 
org.apache.spark.storage.BlockManagerMasterEndpoint$$anonfun$receiveAndReply$1.applyOrElse(BlockManagerMasterEndpoint.scala:156)
    at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:103)
    at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
    at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
    at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
    at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
    at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264){code}
*Steps to reproduce*

Step 1: Initialize a Spark session in standalone mode.

Step 2: Create a Dataset using the SparkSession and load data.

Step 3: Define the MapPartitionsFunction on the Dataset and fetch the storage 
status inside it.

Here is the code snippet of MapPartitionsFunction

 
{code:java}
df = df.mapPartitions(new MapPartitionsFunction<Row, Row>() {
            @Override
            public Iterator<Row> call(Iterator<Row> input) throws Exception {
                // Fetching the storage status here is what triggers the
                // NotSerializableException shown above.
                StorageStatus[] storageStatus = 
SparkEnv.get().blockManager().master().getStorageStatus();
                return input;
            }
        }, RowEncoder.apply(df.schema()));
{code}
 

Step 4: Submit the Spark job.

 

*Solution -*

Implement the Serializable interface for org.apache.spark.storage.StorageStatus.

 

 

  was:
When you try to access the *storage status 
(org.apache.spark.storage.StorageStatus)* inside a MapPartitionsFunction, the 
getStorageStatus method throws a NotSerializableException. This exception is 
thrown because the StorageStatus object is not serializable.

Here is an example code snippet that demonstrates how to access the storage 
status inside a MapPartitionsFunction in Spark:
{code:java}
StorageStatus[] storageStatus = 
SparkEnv.get().blockManager().master().getStorageStatus();{code}

*Error stacktrace --*
{code:java}
Caused by: java.io.NotSerializableException: 
org.apache.spark.storage.StorageStatus
Serialization stack:
    - object not serializable (class: org.apache.spark.storage.StorageStatus, 
value: org.apache.spark.storage.StorageStatus@715b4e82)
    - element of array (index: 0)
    - array (class [Lorg.apache.spark.storage.StorageStatus;, size 2)
    at 
org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
    at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
    at 
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
    at org.apache.spark.rpc.netty.NettyRpcEnv.serialize(NettyRpcEnv.scala:286)
    at 
org.apache.spark.rpc.netty.RemoteNettyRpcCallContext.send(NettyRpcCallContext.scala:64)
    at 
org.apache.spark.rpc.netty.NettyRpcCallContext.reply(NettyRpcCallContext.scala:32)
    at 
org.apache.spark.storage.BlockManagerMasterEndpoint$$anonfun$receiveAndReply$1.applyOrElse(BlockManagerMasterEndpoint.scala:156)
    at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:103)
    at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
    at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
    at 
org.apache.spark.rpc.netty

[jira] [Created] (SPARK-43109) JavaRDD.saveAsTextFile Directory Creation issue using Spark 3.3.2, with hadoop3

2023-04-12 Thread shamim (Jira)
shamim created SPARK-43109:
--

 Summary:  JavaRDD.saveAsTextFile Directory Creation issue using 
Spark 3.3.2, with hadoop3
 Key: SPARK-43109
 URL: https://issues.apache.org/jira/browse/SPARK-43109
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.3.2
Reporter: shamim


We are using Spark 3.3.2 with Hadoop 3 in a Java application.

While saving a file using JavaRDD.saveAsTextFile(DIR_NAME/Sample.txt), Spark 
creates DIR_NAME owned by the same user the application is running as, e.g. vs.

However, we have provided another user that should create DIR_NAME, as shown 
below in our Spark code:

UserGroupInformation theSparkUser= 
UserGroupInformation.createRemoteUser("spark");
UserGroupInformation.setLoginUser(theSparkUser);

 

Below is the directory structure created by the application:

drwxrwxrwx+ 2  vs vs 6 Apr 12 17:09  main

 

We want the directory to be created by the user that we have provided, like below:

drwxrwxrwx+ 2  spark spark  6 Apr 12 17:09  main

 

When we verify the user by printing it, it shows the same user that we passed 
to UserGroupInformation; both calls below print the user "spark":

 

UserGroupInformation.getLoginUser().getUserName();    --> prints "spark"
UserGroupInformation.getCurrentUser().getUserName();  --> prints "spark"

 

 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16484) Incremental Cardinality estimation operations with Hyperloglog

2023-04-12 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-16484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711481#comment-17711481
 ] 

Hudson commented on SPARK-16484:


User 'RyanBerti' has created a pull request for this issue:
https://github.com/apache/spark/pull/40615

> Incremental Cardinality estimation operations with Hyperloglog
> --
>
> Key: SPARK-16484
> URL: https://issues.apache.org/jira/browse/SPARK-16484
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yongjia Wang
>Priority: Major
>  Labels: bulk-closed
>
> Efficient cardinality estimation is very important, and SparkSQL has had 
> approxCountDistinct based on Hyperloglog for quite some time. However, there 
> isn't a way to do incremental estimation. For example, if we want to get 
> updated distinct counts of the last 90 days, we need to do the aggregation 
> for the entire window over and over again. The more efficient way involves 
> serializing the counter for smaller time windows (such as hourly) so the 
> counts can be efficiently updated in an incremental fashion for any time 
> window.
> With the support of custom UDAF, Binary DataType and the HyperloglogPlusPlus 
> implementation in the current Spark version, it's easy enough to extend the 
> functionality to include incremental counting, and even other general set 
> operations such as intersection and set difference. Spark API is already as 
> elegant as it can be, but it still takes quite some effort to do a custom 
> implementation of the aforementioned operations which are supposed to be in 
> high demand. I have been searching but failed to find an usable existing 
> solution nor any ongoing effort for this. The closest I got is the following 
> but it does not work with Spark 1.6 due to API changes. 
> https://github.com/collectivemedia/spark-hyperloglog/blob/master/src/main/scala/org/apache/spark/sql/hyperloglog/aggregates.scala
> I wonder if it worth to integrate such operations into SparkSQL. The only 
> problem I see is it depends on serialization of a specific HLL implementation 
> and introduce compatibility issues. But as long as the user is aware of such 
> issue, it should be fine.
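
The incremental pattern described above can be sketched roughly as follows. 
This is a hedged illustration, not code from the ticket: it assumes an active 
SparkSession named spark, a hypothetical events table with event_time and 
user_id columns, and the Datasketches-based HLL functions proposed in the 
linked PR (hll_sketch_agg, hll_union_agg, hll_sketch_estimate).
{code:python}
# Persist one small sketch per hour instead of re-scanning the raw events.
hourly = spark.sql("""
    SELECT date_trunc('hour', event_time) AS hour,
           hll_sketch_agg(user_id)        AS sketch
    FROM events
    GROUP BY date_trunc('hour', event_time)
""")
hourly.write.mode("overwrite").saveAsTable("hourly_sketches")

# Any time window (here the last 90 days) is then estimated by merging the
# stored sketches, i.e. incrementally, without touching the raw events again.
spark.sql("""
    SELECT hll_sketch_estimate(hll_union_agg(sketch)) AS approx_distinct_users
    FROM hourly_sketches
    WHERE hour >= date_sub(current_date(), 90)
""").show()
{code}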



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43110) Move asIntegral to PhysicalDataType

2023-04-12 Thread Rui Wang (Jira)
Rui Wang created SPARK-43110:


 Summary: Move asIntegral to PhysicalDataType
 Key: SPARK-43110
 URL: https://issues.apache.org/jira/browse/SPARK-43110
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Rui Wang
Assignee: Rui Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43084) Add Python state API (applyInPandasWithState) and verify UDFs

2023-04-12 Thread Peng Zhong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711505#comment-17711505
 ] 

Peng Zhong commented on SPARK-43084:


Verified that UDFs are working in streaming Spark Connect:

 
{code:java}
>>> from pyspark.sql.functions import col, udf
>>> def negative(value):
...   return -value
...
>>> negativeUDF = udf(lambda z: negative(z))
>>>
>>> query = (
...  spark
...  .readStream
...  .format("rate")
...  .option("numPartitions", "1")
...  .load()
...  .select(col("timestamp"), negativeUDF(col("value")).alias("negValue"))
...  .writeStream
...  .format("memory")
...  .queryName("rate_table")
...  .trigger(processingTime="10 seconds")
...  .start()
... )>>>
>>> query.status
{'message': 'Waiting for next trigger', 'isDataAvailable': True, 
'isTriggerActive': False}
>>>
>>> spark.sql("select * from rate_table").show()
+--------------------+--------+
|           timestamp|negValue|
+--------------------+--------+
|2023-04-12 11:04:...|       0|
|2023-04-12 11:04:...|      -1|
|2023-04-12 11:04:...|      -2|
|2023-04-12 11:04:...|      -3|
|2023-04-12 11:04:...|      -4|
|2023-04-12 11:04:...|      -5|
|2023-04-12 11:04:...|      -6|
|2023-04-12 11:04:...|      -7|
|2023-04-12 11:04:...|      -8|
|2023-04-12 11:05:...|      -9|
|2023-04-12 11:05:...|     -10|
|2023-04-12 11:05:...|     -11|
|2023-04-12 11:05:...|     -12|
|2023-04-12 11:05:...|     -13|
|2023-04-12 11:05:...|     -14|
|2023-04-12 11:05:...|     -15|
|2023-04-12 11:05:...|     -16|
|2023-04-12 11:05:...|     -17|
+--------------------+--------+
>>>
{code}

> Add Python state API (applyInPandasWithState) and verify UDFs
> -
>
> Key: SPARK-43084
> URL: https://issues.apache.org/jira/browse/SPARK-43084
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.5.0
> Environment: * Add Python state API (applyInPandasWithState) to 
> streaming Spark-connect.
>  * verify the UDFs work (it may not need any code changes).
>Reporter: Raghu Angadi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43084) Add Python state API (applyInPandasWithState) and verify UDFs

2023-04-12 Thread Peng Zhong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711507#comment-17711507
 ] 

Peng Zhong commented on SPARK-43084:


applyInPandasWithState spark connect is added in this PR: 
https://github.com/apache/spark/pull/40736

> Add Python state API (applyInPandasWithState) and verify UDFs
> -
>
> Key: SPARK-43084
> URL: https://issues.apache.org/jira/browse/SPARK-43084
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.5.0
> Environment: * Add Python state API (applyInPandasWithState) to 
> streaming Spark-connect.
>  * verify the UDFs work (it may not need any code changes).
>Reporter: Raghu Angadi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42437) Pyspark catalog.cacheTable allow to specify storage level Connect add support Storagelevel

2023-04-12 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-42437.
---
Fix Version/s: 3.5.0
 Assignee: Khalid Mammadov
   Resolution: Fixed

Issue resolved by pull request 40015
https://github.com/apache/spark/pull/40015

> Pyspark catalog.cacheTable allow to specify storage level Connect add support 
> Storagelevel
> --
>
> Key: SPARK-42437
> URL: https://issues.apache.org/jira/browse/SPARK-42437
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Khalid Mammadov
>Assignee: Khalid Mammadov
>Priority: Major
> Fix For: 3.5.0
>
>
> Currently the PySpark version of the catalog.cacheTable function does not 
> support specifying a storage level. This is to add that.
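
A minimal usage sketch of the added parameter (illustrative only; assumes an 
active SparkSession named spark and the optional storage-level argument added 
by the linked pull request):
{code:python}
from pyspark import StorageLevel

spark.range(10).write.saveAsTable("demo_tbl")  # hypothetical table

# Cache the table with an explicit storage level instead of the default.
spark.catalog.cacheTable("demo_tbl", StorageLevel.DISK_ONLY)
{code}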



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43111) Merge nested if statements into single if statements

2023-04-12 Thread Jira
Bjørn Jørgensen created SPARK-43111:
---

 Summary: Merge nested if statements into single if statements
 Key: SPARK-43111
 URL: https://issues.apache.org/jira/browse/SPARK-43111
 Project: Spark
  Issue Type: Improvement
  Components: Pandas API on Spark, PySpark
Affects Versions: 3.5.0
Reporter: Bjørn Jørgensen


Simplify the code by merging nested if statements into single if statements 
using the and operator. The changes do not affect the functionality of the 
code, but they improve readability and maintainability.
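
For illustration only (a hypothetical snippet, not taken from the actual 
change), the kind of rewrite meant here:
{code:python}
# Before: nested if statements
if isinstance(value, int):
    if value > 0:
        process(value)

# After: a single if statement using `and`
if isinstance(value, int) and value > 0:
    process(value)
{code}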



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43111) Merge nested if statements into single if statements

2023-04-12 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-43111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711554#comment-17711554
 ] 

Bjørn Jørgensen commented on SPARK-43111:
-

https://github.com/apache/spark/pull/40759

> Merge nested if statements into single if statements
> 
>
> Key: SPARK-43111
> URL: https://issues.apache.org/jira/browse/SPARK-43111
> Project: Spark
>  Issue Type: Improvement
>  Components: Pandas API on Spark, PySpark
>Affects Versions: 3.5.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> Simplify the code by merging nested if statements into single if statements 
> using the and operator. The changes do not affect the functionality of the 
> code, but they improve readability and maintainability.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43112) Spark may use a column other than the actual specified partitioning column for partitioning, for Hive format tables

2023-04-12 Thread Asif (Jira)
Asif created SPARK-43112:


 Summary: Spark may  use a column other than the actual specified 
partitioning column for partitioning, for Hive format tables
 Key: SPARK-43112
 URL: https://issues.apache.org/jira/browse/SPARK-43112
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.1
Reporter: Asif


The class org.apache.spark.sql.catalyst.catalog.HiveTableRelation has its 
output method implemented as 
  // The partition column should always appear after data columns.
  override def output: Seq[AttributeReference] = dataCols ++ partitionCols

But the DataWriting commands of spark like InsertIntoHiveDirCommand, expect 
that the out from HiveTableRelation is in the order in which the columns are 
actually defined in the DDL.

As a result, multiple mistmatch scenarios can happen like:
1) data type casting exception being thrown , even though the data frame being 
inserted has schema which is identical to what is used for creating ddl.
  OR
2) Wrong column being used for partitioning , if the datatypes are same or 
castable, like datetype and long

will be creating a PR with the bug test



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43112) Spark may use a column other than the actual specified partitioning column for partitioning, for Hive format tables

2023-04-12 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-43112:
-
Description: 
The class org.apache.spark.sql.catalyst.catalog.HiveTableRelation has its 
output method implemented as 
  // The partition column should always appear after data columns.
  override def output: Seq[AttributeReference] = dataCols ++ partitionCols

But the DataWriting commands of spark like InsertIntoHiveDirCommand, expect 
that the output from HiveTableRelation is in the order in which the columns are 
actually defined in the DDL.

As a result, multiple mismatch scenarios can happen like:
1) data type casting exception being thrown , even though the data frame being 
inserted has schema which is identical to what is used for creating ddl.
  OR
2) Wrong column being used for partitioning , if the datatypes are same or 
cast-able, like date type and long

will be creating a PR with the bug test

  was:
The class org.apache.spark.sql.catalyst.catalog.HiveTableRelation has its 
output method implemented as 
  // The partition column should always appear after data columns.
  override def output: Seq[AttributeReference] = dataCols ++ partitionCols

But the DataWriting commands of spark like InsertIntoHiveDirCommand, expect 
that the out from HiveTableRelation is in the order in which the columns are 
actually defined in the DDL.

As a result, multiple mistmatch scenarios can happen like:
1) data type casting exception being thrown , even though the data frame being 
inserted has schema which is identical to what is used for creating ddl.
  OR
2) Wrong column being used for partitioning , if the datatypes are same or 
castable, like datetype and long

will be creating a PR with the bug test


> Spark may  use a column other than the actual specified partitioning column 
> for partitioning, for Hive format tables
> 
>
> Key: SPARK-43112
> URL: https://issues.apache.org/jira/browse/SPARK-43112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Asif
>Priority: Critical
>
> The class org.apache.spark.sql.catalyst.catalog.HiveTableRelation has its 
> output method implemented as 
>   // The partition column should always appear after data columns.
>   override def output: Seq[AttributeReference] = dataCols ++ partitionCols
> But the DataWriting commands of spark like InsertIntoHiveDirCommand, expect 
> that the output from HiveTableRelation is in the order in which the columns 
> are actually defined in the DDL.
> As a result, multiple mismatch scenarios can happen like:
> 1) data type casting exception being thrown , even though the data frame 
> being inserted has schema which is identical to what is used for creating ddl.
>   OR
> 2) Wrong column being used for partitioning , if the datatypes are same or 
> cast-able, like date type and long
> will be creating a PR with the bug test



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43113) Codegen error when full outer join's bound condition has multiple references to the same stream-side column

2023-04-12 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-43113:
-

 Summary: Codegen error when full outer join's bound condition has 
multiple references to the same stream-side column
 Key: SPARK-43113
 URL: https://issues.apache.org/jira/browse/SPARK-43113
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.2, 3.4.0, 3.5.0
Reporter: Bruce Robbins


Example # 1 (sort merge join):
{noformat}
create or replace temp view v1 as
select * from values
(1, 1),
(2, 2),
(3, 1)
as v1(key, value);

create or replace temp view v2 as
select * from values
(1, 22, 22),
(3, -1, -1),
(7, null, null)
as v2(a, b, c);

select *
from v1
full outer join v2
on key = a
and value > b
and value > c;
{noformat}
The join's generated code causes the following compilation error:
{noformat}
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
277, Column 9: Redefinition of local variable "smj_isNull_7"
{noformat}
Example #2 (shuffle hash join):
{noformat}
select /*+ SHUFFLE_HASH(v2) */ *
from v1
full outer join v2
on key = a
and value > b
and value > c;
{noformat}
The shuffle hash join's generated code causes the following compilation error:
{noformat}
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
174, Column 5: Redefinition of local variable "shj_value_1" 
{noformat}
With default configuration, both queries end up succeeding, since Spark falls 
back to running each query with whole-stage codegen disabled.

The issue happens only when the join's bound condition refers to the same 
stream-side column more than once.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43114) Add interval types to TypeCoercionSuite

2023-04-12 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-43114:
---

 Summary: Add interval types to TypeCoercionSuite
 Key: SPARK-43114
 URL: https://issues.apache.org/jira/browse/SPARK-43114
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Yuming Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43115) Split pyspark-pandas-connect from pyspark-connect module.

2023-04-12 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-43115:
-

 Summary: Split pyspark-pandas-connect from pyspark-connect module.
 Key: SPARK-43115
 URL: https://issues.apache.org/jira/browse/SPARK-43115
 Project: Spark
  Issue Type: Test
  Components: PySpark, Tests
Affects Versions: 3.5.0
Reporter: Takuya Ueshin






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43107) Coalesce applied on broadcast join stream side

2023-04-12 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711605#comment-17711605
 ] 

Yuming Wang commented on SPARK-43107:
-

https://github.com/apache/spark/pull/40756

> Coalesce applied on broadcast join stream side
> --
>
> Key: SPARK-43107
> URL: https://issues.apache.org/jira/browse/SPARK-43107
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43114) Add interval types to TypeCoercionSuite

2023-04-12 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711604#comment-17711604
 ] 

Yuming Wang commented on SPARK-43114:
-

https://github.com/apache/spark/pull/40763

> Add interval types to TypeCoercionSuite
> ---
>
> Key: SPARK-43114
> URL: https://issues.apache.org/jira/browse/SPARK-43114
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43107) Coalesce buckets in join applied on broadcast join stream side

2023-04-12 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-43107:

Summary: Coalesce buckets in join applied on broadcast join stream side  
(was: Coalesce applied on broadcast join stream side)

> Coalesce buckets in join applied on broadcast join stream side
> --
>
> Key: SPARK-43107
> URL: https://issues.apache.org/jira/browse/SPARK-43107
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43112) Spark may use a column other than the actual specified partitioning column for partitioning, for Hive format tables

2023-04-12 Thread Asif (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711607#comment-17711607
 ] 

Asif commented on SPARK-43112:
--

Opened a WIP PR [SPARK-43112|https://github.com/apache/spark/pull/40765/] which 
has the bug tests as of now.

> Spark may  use a column other than the actual specified partitioning column 
> for partitioning, for Hive format tables
> 
>
> Key: SPARK-43112
> URL: https://issues.apache.org/jira/browse/SPARK-43112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Asif
>Priority: Critical
>
> The class org.apache.spark.sql.catalyst.catalog.HiveTableRelation has its 
> output method implemented as 
>   // The partition column should always appear after data columns.
>   override def output: Seq[AttributeReference] = dataCols ++ partitionCols
> But the DataWriting commands of spark like InsertIntoHiveDirCommand, expect 
> that the output from HiveTableRelation is in the order in which the columns 
> are actually defined in the DDL.
> As a result, multiple mismatch scenarios can happen like:
> 1) data type casting exception being thrown , even though the data frame 
> being inserted has schema which is identical to what is used for creating ddl.
>   OR
> 2) Wrong column being used for partitioning , if the datatypes are same or 
> cast-able, like date type and long
> will be creating a PR with the bug test



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43031) Enable tests for Python streaming spark-connect

2023-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-43031:


Assignee: Wei Liu

> Enable tests for Python streaming spark-connect
> ---
>
> Key: SPARK-43031
> URL: https://issues.apache.org/jira/browse/SPARK-43031
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Raghu Angadi
>Assignee: Wei Liu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43031) Enable tests for Python streaming spark-connect

2023-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43031.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40691
[https://github.com/apache/spark/pull/40691]

> Enable tests for Python streaming spark-connect
> ---
>
> Key: SPARK-43031
> URL: https://issues.apache.org/jira/browse/SPARK-43031
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Raghu Angadi
>Assignee: Wei Liu
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43113) Codegen error when full outer join's bound condition has multiple references to the same stream-side column

2023-04-12 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711614#comment-17711614
 ] 

Bruce Robbins commented on SPARK-43113:
---

PR here: https://github.com/apache/spark/pull/40766/files

> Codegen error when full outer join's bound condition has multiple references 
> to the same stream-side column
> ---
>
> Key: SPARK-43113
> URL: https://issues.apache.org/jira/browse/SPARK-43113
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.0, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> Example # 1 (sort merge join):
> {noformat}
> create or replace temp view v1 as
> select * from values
> (1, 1),
> (2, 2),
> (3, 1)
> as v1(key, value);
> create or replace temp view v2 as
> select * from values
> (1, 22, 22),
> (3, -1, -1),
> (7, null, null)
> as v2(a, b, c);
> select *
> from v1
> full outer join v2
> on key = a
> and value > b
> and value > c;
> {noformat}
> The join's generated code causes the following compilation error:
> {noformat}
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 277, Column 9: Redefinition of local variable "smj_isNull_7"
> {noformat}
> Example #2 (shuffle hash join):
> {noformat}
> select /*+ SHUFFLE_HASH(v2) */ *
> from v1
> full outer join v2
> on key = a
> and value > b
> and value > c;
> {noformat}
> The shuffle hash join's generated code causes the following compilation error:
> {noformat}
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 174, Column 5: Redefinition of local variable "shj_value_1" 
> {noformat}
> With default configuration, both queries end up succeeding, since Spark falls 
> back to running each query with whole-stage codegen disabled.
> The issue happens only when the join's bound condition refers to the same 
> stream-side column more than once.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43101) Add CREATE/DROP catalog

2023-04-12 Thread melin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

melin updated SPARK-43101:
--
Description: 
Convenient registration of the catalog, in sts

ref: [https://github.com/trinodb/trino/issues/12709]

  was:
Convenient registration of the catalog, in sts

ref: [https://github.com/trinodb/trino/issues/12709]

 


> Add CREATE/DROP catalog 
> 
>
> Key: SPARK-43101
> URL: https://issues.apache.org/jira/browse/SPARK-43101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: melin
>Priority: Major
>
> Convenient registration of the catalog, in sts
> ref: [https://github.com/trinodb/trino/issues/12709]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43101) Dynamic Catalogs

2023-04-12 Thread melin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

melin updated SPARK-43101:
--
Summary: Dynamic Catalogs  (was: Add CREATE/DROP catalog )

> Dynamic Catalogs
> 
>
> Key: SPARK-43101
> URL: https://issues.apache.org/jira/browse/SPARK-43101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: melin
>Priority: Major
>
> Convenient registration of the catalog, in sts
> ref: [https://github.com/trinodb/trino/issues/12709]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43101) Add CREATE/DROP catalog

2023-04-12 Thread melin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

melin updated SPARK-43101:
--
Description: 
Convenient registration of the catalog, in sts

ref: [https://github.com/trinodb/trino/issues/12709]

 

  was:
Convenient registration of the catalog, in sts

ref: https://github.com/trinodb/trino/pull/13931


> Add CREATE/DROP catalog 
> 
>
> Key: SPARK-43101
> URL: https://issues.apache.org/jira/browse/SPARK-43101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: melin
>Priority: Major
>
> Convenient registration of the catalog, in sts
> ref: [https://github.com/trinodb/trino/issues/12709]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43110) Move asIntegral to PhysicalDataType

2023-04-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-43110.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40758
[https://github.com/apache/spark/pull/40758]

> Move asIntegral to PhysicalDataType
> ---
>
> Key: SPARK-43110
> URL: https://issues.apache.org/jira/browse/SPARK-43110
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43063) `df.show` handle null should print NULL instead of null

2023-04-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-43063.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40699
[https://github.com/apache/spark/pull/40699]

> `df.show` handle null should print NULL instead of null
> ---
>
> Key: SPARK-43063
> URL: https://issues.apache.org/jira/browse/SPARK-43063
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: yikaifei
>Assignee: yikaifei
>Priority: Trivial
> Fix For: 3.5.0
>
>
> `df.show` should print NULL instead of null when handling null values, for 
> consistent behavior;
> {code:java}
> The following behavior is currently inconsistent:
> ``` shell
> scala> spark.sql("select decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 
> 'New Jersey', 4, 'Seattle') as result").show(false)
> +--+
> |result|
> +--+
> |null  |
> +--+
> ```
> ``` shell
> spark-sql> DESC FUNCTION EXTENDED decode;
> function_desc
> Function: decode
> Class: org.apache.spark.sql.catalyst.expressions.Decode
> Usage:
> decode(bin, charset) - Decodes the first argument using the second 
> argument character set.
> decode(expr, search, result [, search, result ] ... [, default]) - 
> Compares expr
>   to each search value in order. If expr is equal to a search value, 
> decode returns
>   the corresponding result. If no match is found, then it returns 
> default. If default
>   is omitted, it returns null.
> Extended Usage:
> Examples:
>   > SELECT decode(encode('abc', 'utf-8'), 'utf-8');
>abc
>   > SELECT decode(2, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 
> 4, 'Seattle', 'Non domestic');
>San Francisco
>   > SELECT decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 
> 4, 'Seattle', 'Non domestic');
>Non domestic
>   > SELECT decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 
> 4, 'Seattle');
>NULL
> Since: 3.2.0
> Time taken: 0.074 seconds, Fetched 4 row(s)
> ```
> ``` shell
> spark-sql> select decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New 
> Jersey', 4, 'Seattle');
> NULL
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43063) `df.show` handle null should print NULL instead of null

2023-04-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-43063:
---

Assignee: yikaifei

> `df.show` handle null should print NULL instead of null
> ---
>
> Key: SPARK-43063
> URL: https://issues.apache.org/jira/browse/SPARK-43063
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: yikaifei
>Assignee: yikaifei
>Priority: Trivial
>
> `df.show` should print NULL instead of null when handling null values, for 
> consistent behavior;
> {code:java}
> The following behavior is currently inconsistent:
> ``` shell
> scala> spark.sql("select decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 
> 'New Jersey', 4, 'Seattle') as result").show(false)
> +--+
> |result|
> +--+
> |null  |
> +--+
> ```
> ``` shell
> spark-sql> DESC FUNCTION EXTENDED decode;
> function_desc
> Function: decode
> Class: org.apache.spark.sql.catalyst.expressions.Decode
> Usage:
> decode(bin, charset) - Decodes the first argument using the second 
> argument character set.
> decode(expr, search, result [, search, result ] ... [, default]) - 
> Compares expr
>   to each search value in order. If expr is equal to a search value, 
> decode returns
>   the corresponding result. If no match is found, then it returns 
> default. If default
>   is omitted, it returns null.
> Extended Usage:
> Examples:
>   > SELECT decode(encode('abc', 'utf-8'), 'utf-8');
>abc
>   > SELECT decode(2, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 
> 4, 'Seattle', 'Non domestic');
>San Francisco
>   > SELECT decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 
> 4, 'Seattle', 'Non domestic');
>Non domestic
>   > SELECT decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 
> 4, 'Seattle');
>NULL
> Since: 3.2.0
> Time taken: 0.074 seconds, Fetched 4 row(s)
> ```
> ``` shell
> spark-sql> select decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New 
> Jersey', 4, 'Seattle');
> NULL
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43116) Fix Cast.forceNullable

2023-04-12 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-43116:
---

 Summary: Fix Cast.forceNullable
 Key: SPARK-43116
 URL: https://issues.apache.org/jira/browse/SPARK-43116
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: Yuming Wang


It should include TimestampNTZType and AnsiIntervalType.
forceNullable for casts from ArrayType/MapType/StructType to other data types should, it seems, also be true.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43116) Fix Cast.forceNullable

2023-04-12 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-43116:

Description: 
https://github.com/apache/spark/blob/ac105ccebf5f144f2d506cbe102c362a195afa9a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L381-L399

It should include TimestampNTZType and AnsiIntervalType.
forceNullable for casts from ArrayType/MapType/StructType to other data types should, it seems, also be true.

  was:
It should include TimestampNTZType and AnsiIntervalType.
forceNullable for casts from ArrayType/MapType/StructType to other data types should, it seems, also be true.


> Fix Cast.forceNullable
> --
>
> Key: SPARK-43116
> URL: https://issues.apache.org/jira/browse/SPARK-43116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Priority: Major
>
> https://github.com/apache/spark/blob/ac105ccebf5f144f2d506cbe102c362a195afa9a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L381-L399
> It should include TimestampNTZType and AnsiIntervalType.
> forceNullable for casts from ArrayType/MapType/StructType to other data types should, it seems, also be true.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43116) Fix Cast.forceNullable

2023-04-12 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-43116:

Description: 
https://github.com/apache/spark/blob/ac105ccebf5f144f2d506cbe102c362a195afa9a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L381-L399

1. It should include TimestampNTZType and AnsiIntervalType.
2. forceNullable for casts from ArrayType/MapType/StructType to other data types should, it seems, also be true.

  was:
https://github.com/apache/spark/blob/ac105ccebf5f144f2d506cbe102c362a195afa9a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L381-L399

It should include TimestampNTZType and AnsiIntervalType.
forceNullable for casts from ArrayType/MapType/StructType to other data types should, it seems, also be true.


> Fix Cast.forceNullable
> --
>
> Key: SPARK-43116
> URL: https://issues.apache.org/jira/browse/SPARK-43116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Priority: Major
>
> https://github.com/apache/spark/blob/ac105ccebf5f144f2d506cbe102c362a195afa9a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L381-L399
> 1. It should include TimestampNTZType and AnsiIntervalType.
> 2. forceNullable for casts from ArrayType/MapType/StructType to other data types should, it seems, also be true.
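For context, forceNullable decides whether the result of a cast must be marked nullable even when the input is non-null. A small, runnable illustration of that effect for a case that is already handled (string to int, with the default non-ANSI cast behavior):
{code:java}
// 'abc' cannot be parsed as an int, so the cast may yield null and the column is nullable.
spark.sql("SELECT CAST('abc' AS INT) AS i").printSchema()
// root
//  |-- i: integer (nullable = true)
{code}
The ticket argues that the same treatment should extend to TimestampNTZType, the ANSI interval types, and casts out of ArrayType/MapType/StructType.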



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43115) Split pyspark-pandas-connect from pyspark-connect module.

2023-04-12 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-43115:
-

Assignee: Takuya Ueshin

> Split pyspark-pandas-connect from pyspark-connect module.
> -
>
> Key: SPARK-43115
> URL: https://issues.apache.org/jira/browse/SPARK-43115
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Affects Versions: 3.5.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43115) Split pyspark-pandas-connect from pyspark-connect module.

2023-04-12 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-43115.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40764
[https://github.com/apache/spark/pull/40764]

> Split pyspark-pandas-connect from pyspark-connect module.
> -
>
> Key: SPARK-43115
> URL: https://issues.apache.org/jira/browse/SPARK-43115
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Affects Versions: 3.5.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43117) proto message abbreviation should support repeated and map fields

2023-04-12 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-43117:
-

 Summary: proto message abbreviation should support repeated and 
map fields
 Key: SPARK-43117
 URL: https://issues.apache.org/jira/browse/SPARK-43117
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 3.5.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43118) Remove unnecessary assert for UninterruptibleThread in KafkaMicroBatchStream

2023-04-12 Thread Boyang Jerry Peng (Jira)
Boyang Jerry Peng created SPARK-43118:
-

 Summary: Remove unnecessary assert for UninterruptibleThread in 
KafkaMicroBatchStream
 Key: SPARK-43118
 URL: https://issues.apache.org/jira/browse/SPARK-43118
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.3.2
Reporter: Boyang Jerry Peng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43118) Remove unnecessary assert for UninterruptibleThread in KafkaMicroBatchStream

2023-04-12 Thread Boyang Jerry Peng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Jerry Peng updated SPARK-43118:
--
Description: 
The assert 

 
{code:java}
 assert(Thread.currentThread().isInstanceOf[UninterruptibleThread]) {code}
 

found 
[https://github.com/apache/spark/blob/master/connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchStream.scala#L239]
 

 

is not needed, for the following reasons:

 
 # This assert was added because of issues with the old, deprecated 
KafkaOffsetReaderConsumer. The default offset reader implementation has since 
been changed to KafkaOffsetReaderAdmin, which no longer requires running on an 
UninterruptibleThread.
 # Even if the deprecated KafkaOffsetReaderConsumer is used, that implementation 
already asserts that it runs on an UninterruptibleThread, e.g. 
[https://github.com/apache/spark/blob/master/connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReaderConsumer.scala#L130], 
so the assert in KafkaMicroBatchStream is redundant.

> Remove unnecessary assert for UninterruptibleThread in KafkaMicroBatchStream
> 
>
> Key: SPARK-43118
> URL: https://issues.apache.org/jira/browse/SPARK-43118
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.2
>Reporter: Boyang Jerry Peng
>Priority: Minor
>
> The assert
> {code:java}
>  assert(Thread.currentThread().isInstanceOf[UninterruptibleThread]) {code}
> found at
> [https://github.com/apache/spark/blob/master/connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchStream.scala#L239]
> is not needed, for the following reasons:
>  # This assert was added because of issues with the old, deprecated 
> KafkaOffsetReaderConsumer. The default offset reader implementation has since 
> been changed to KafkaOffsetReaderAdmin, which no longer requires running on an 
> UninterruptibleThread.
>  # Even if the deprecated KafkaOffsetReaderConsumer is used, that implementation 
> already asserts that it runs on an UninterruptibleThread, e.g. 
> [https://github.com/apache/spark/blob/master/connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReaderConsumer.scala#L130], 
> so the assert in KafkaMicroBatchStream is redundant.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37099) Introduce a rank-based filter to optimize top-k computation

2023-04-12 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711665#comment-17711665
 ] 

Snoot.io commented on SPARK-37099:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/40754

> Introduce a rank-based filter to optimize top-k computation
> ---
>
> Key: SPARK-37099
> URL: https://issues.apache.org/jira/browse/SPARK-37099
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.5.0
>
> Attachments: q67.png, q67_optimized.png, skewed_window.png
>
>
> In JD, we found that more than 90% of window function usage follows this 
> pattern:
> {code:java}
>  select (... (row_number|rank|dense_rank) () over( [partition by ...] order 
> by ... ) as rn)
> where rn (==|<|<=) k and other conditions{code}
>  
> However, the existing physical plan is not optimal:
>  
> 1. We should select the local top-k records within each partition, and then 
> compute the global top-k; this helps reduce the amount of shuffled data.
>  
> For these three rank functions (row_number|rank|dense_rank), the rank of a 
> key computed on a partial dataset is always <= its final rank computed on 
> the whole dataset, so we can safely discard rows whose partial rank > k, 
> anywhere.
>  
> 2. Skewed window: some partitions are skewed and take a long time to finish 
> computation.
>  
> A real-world skewed-window case in our system is attached.
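A concrete instance of the pattern described above (the table and column names are made up for illustration; any table with a grouping column and an ordering column works):
{code:java}
// Top 3 salaries per department. Rows with rn > 3 can be discarded within each
// partition before the shuffle, which is the optimization this ticket introduces.
spark.sql("""
  SELECT *
  FROM (
    SELECT *,
           row_number() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn
    FROM employees
  ) t
  WHERE rn <= 3
""").show()
{code}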
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43118) Remove unnecessary assert for UninterruptibleThread in KafkaMicroBatchStream

2023-04-12 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711667#comment-17711667
 ] 

Snoot.io commented on SPARK-43118:
--

User 'jerrypeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/40767

> Remove unnecessary assert for UninterruptibleThread in KafkaMicroBatchStream
> 
>
> Key: SPARK-43118
> URL: https://issues.apache.org/jira/browse/SPARK-43118
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.2
>Reporter: Boyang Jerry Peng
>Priority: Minor
>
> The assert
> {code:java}
>  assert(Thread.currentThread().isInstanceOf[UninterruptibleThread]) {code}
> found at
> [https://github.com/apache/spark/blob/master/connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchStream.scala#L239]
> is not needed, for the following reasons:
>  # This assert was added because of issues with the old, deprecated 
> KafkaOffsetReaderConsumer. The default offset reader implementation has since 
> been changed to KafkaOffsetReaderAdmin, which no longer requires running on an 
> UninterruptibleThread.
>  # Even if the deprecated KafkaOffsetReaderConsumer is used, that implementation 
> already asserts that it runs on an UninterruptibleThread, e.g. 
> [https://github.com/apache/spark/blob/master/connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReaderConsumer.scala#L130], 
> so the assert in KafkaMicroBatchStream is redundant.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43118) Remove unnecessary assert for UninterruptibleThread in KafkaMicroBatchStream

2023-04-12 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711668#comment-17711668
 ] 

Snoot.io commented on SPARK-43118:
--

User 'jerrypeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/40767

> Remove unnecessary assert for UninterruptibleThread in KafkaMicroBatchStream
> 
>
> Key: SPARK-43118
> URL: https://issues.apache.org/jira/browse/SPARK-43118
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.2
>Reporter: Boyang Jerry Peng
>Priority: Minor
>
> The assert
> {code:java}
>  assert(Thread.currentThread().isInstanceOf[UninterruptibleThread]) {code}
> found at
> [https://github.com/apache/spark/blob/master/connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchStream.scala#L239]
> is not needed, for the following reasons:
>  # This assert was added because of issues with the old, deprecated 
> KafkaOffsetReaderConsumer. The default offset reader implementation has since 
> been changed to KafkaOffsetReaderAdmin, which no longer requires running on an 
> UninterruptibleThread.
>  # Even if the deprecated KafkaOffsetReaderConsumer is used, that implementation 
> already asserts that it runs on an UninterruptibleThread, e.g. 
> [https://github.com/apache/spark/blob/master/connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReaderConsumer.scala#L130], 
> so the assert in KafkaMicroBatchStream is redundant.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43021) Shuffle happens when Coalesce Buckets should occur

2023-04-12 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711666#comment-17711666
 ] 

Snoot.io commented on SPARK-43021:
--

User 'ming95' has created a pull request for this issue:
https://github.com/apache/spark/pull/40688

> Shuffle happens when Coalesce Buckets should occur
> --
>
> Key: SPARK-43021
> URL: https://issues.apache.org/jira/browse/SPARK-43021
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Nikita Eshkeev
>Priority: Minor
>
> h1. What I did
> I define the following code:
> {{from pyspark.sql import SparkSession}}
> {{spark = (}}
> {{  SparkSession}}
> {{    .builder}}
> {{    .appName("Bucketing")}}
> {{    .master("local[4]")}}
> {{    .config("spark.sql.bucketing.coalesceBucketsInJoin.enabled", True)}}
> {{    .config("spark.sql.autoBroadcastJoinThreshold", "-1")}}
> {{    .getOrCreate()}}
> {{)}}
> {{df1 = spark.range(0, 100)}}
> {{df2 = spark.range(0, 100, 2)}}
> {{df1.write.bucketBy(4, "id").mode("overwrite").saveAsTable("t1")}}
> {{df2.write.bucketBy(2, "id").mode("overwrite").saveAsTable("t2")}}
> {{t1 = spark.table("t1")}}
> {{t2 = spark.table("t2")}}
> {{t2.join(t1, "id").explain()}}
> h1. What happened
> There is an Exchange node in the join plan
> h1. What is expected
> The plan should not contain any Exchange/Shuffle nodes, because {{t1}} has 4 
> buckets and {{t2}} has 2 buckets, so their ratio is 2, which is less than 4 
> ({{spark.sql.bucketing.coalesceBucketsInJoin.maxBucketRatio}}), and 
> [CoalesceBucketsInJoin|https://github.com/apache/spark/blob/c9878a212958bc54be529ef99f5e5d1ddf513ec8/sql/core/src/main/scala/org/apache/spark/sql/execution/bucketing/CoalesceBucketsInJoin.scala]
>  should therefore be applied



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43119) Support Get SQL Keywords Dynamically

2023-04-12 Thread Kent Yao (Jira)
Kent Yao created SPARK-43119:


 Summary: Support Get SQL Keywords Dynamically
 Key: SPARK-43119
 URL: https://issues.apache.org/jira/browse/SPARK-43119
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: Kent Yao


Implements the JDBC standard API and an auxiliary function
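The JDBC standard API referred to here is presumably java.sql.DatabaseMetaData#getSQLKeywords(), which clients call to discover a dialect's non-standard keywords. A hedged sketch of how a JDBC client would consume it against a Spark Thrift Server (the connection URL is illustrative, and the Hive JDBC driver must be on the classpath):
{code:java}
import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default")
try {
  // getSQLKeywords returns a comma-separated list of non-standard SQL keywords.
  val keywords = conn.getMetaData.getSQLKeywords.split(",").map(_.trim)
  keywords.sorted.foreach(println)
} finally {
  conn.close()
}
{code}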



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43119) Support Get SQL Keywords Dynamically

2023-04-12 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711675#comment-17711675
 ] 

Snoot.io commented on SPARK-43119:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/40768

> Support Get SQL Keywords Dynamically
> 
>
> Key: SPARK-43119
> URL: https://issues.apache.org/jira/browse/SPARK-43119
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Kent Yao
>Priority: Major
>
> Implements the JDBC standard API and an auxiliary function



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43119) Support Get SQL Keywords Dynamically

2023-04-12 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711676#comment-17711676
 ] 

Snoot.io commented on SPARK-43119:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/40768

> Support Get SQL Keywords Dynamically
> 
>
> Key: SPARK-43119
> URL: https://issues.apache.org/jira/browse/SPARK-43119
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Kent Yao
>Priority: Major
>
> Implements the JDBC standard API and an auxiliary function



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42916) JDBCCatalog Keep Char/Varchar meta information on the read-side

2023-04-12 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-42916:


Assignee: Kent Yao

> JDBCCatalog Keep Char/Varchar meta information on the read-side
> ---
>
> Key: SPARK-42916
> URL: https://issues.apache.org/jira/browse/SPARK-42916
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> Fix error like:
> string cannot be cast to varchar(20)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42916) JDBCCatalog Keep Char/Varchar meta information on the read-side

2023-04-12 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-42916.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40543
[https://github.com/apache/spark/pull/40543]

> JDBCCatalog Keep Char/Varchar meta information on the read-side
> ---
>
> Key: SPARK-42916
> URL: https://issues.apache.org/jira/browse/SPARK-42916
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.5.0
>
>
> Fix error like:
> string cannot be cast to varchar(20)
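A hedged sketch of the kind of setup involved (the catalog name, JDBC URL, and table are placeholders; the exact reproduction in the ticket may differ):
{code:java}
// Register a JDBC catalog and read a table that declares a VARCHAR(20) column.
// Before the fix, the char/varchar metadata could be lost on the read side,
// leading to errors like "string cannot be cast to varchar(20)".
spark.conf.set("spark.sql.catalog.pg", "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
spark.conf.set("spark.sql.catalog.pg.url", "jdbc:postgresql://localhost:5432/testdb")
spark.conf.set("spark.sql.catalog.pg.driver", "org.postgresql.Driver")

spark.sql("SELECT * FROM pg.public.t_varchar").printSchema()
{code}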



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42975) Cast result type to timestamp type for string +/- interval

2023-04-12 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-42975.
-
Resolution: Won't Fix

Disable it in ANSI mode.

> Cast result type to timestamp type for string +/- interval
> --
>
> Key: SPARK-42975
> URL: https://issues.apache.org/jira/browse/SPARK-42975
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Priority: Major
>
> select '2023-01-01' >= '2023-01-08' - interval '7' day;
> should be true.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43120) Add tracking for block cache pinned usage for RocksDB state store

2023-04-12 Thread Anish Shrigondekar (Jira)
Anish Shrigondekar created SPARK-43120:
--

 Summary: Add tracking for block cache pinned usage for RocksDB 
state store
 Key: SPARK-43120
 URL: https://issues.apache.org/jira/browse/SPARK-43120
 Project: Spark
  Issue Type: Task
  Components: Structured Streaming
Affects Versions: 3.4.0
Reporter: Anish Shrigondekar


Add tracking for block cache pinned usage for RocksDB state store
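Presumably this refers to RocksDB's "rocksdb.block-cache-pinned-usage" property, i.e. the bytes held by entries pinned in the block cache (such as index and filter blocks). A minimal sketch of reading it through the RocksDB Java API, independent of Spark (the path and options are illustrative):
{code:java}
import org.rocksdb.{Options, RocksDB}

RocksDB.loadLibrary()
val options = new Options().setCreateIfMissing(true)
val db = RocksDB.open(options, "/tmp/rocksdb-pinned-usage-demo") // illustrative path

// Both properties report byte counts as strings.
println("block-cache-usage        = " + db.getProperty("rocksdb.block-cache-usage"))
println("block-cache-pinned-usage = " + db.getProperty("rocksdb.block-cache-pinned-usage"))

db.close()
options.close()
{code}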



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43120) Add tracking for block cache pinned usage for RocksDB state store

2023-04-12 Thread Anish Shrigondekar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711711#comment-17711711
 ] 

Anish Shrigondekar commented on SPARK-43120:


cc - [~kabhwan] - will send the PR soon. Thx

> Add tracking for block cache pinned usage for RocksDB state store
> -
>
> Key: SPARK-43120
> URL: https://issues.apache.org/jira/browse/SPARK-43120
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Anish Shrigondekar
>Priority: Major
>
> Add tracking for block cache pinned usage for RocksDB state store



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org