[jira] [Created] (SPARK-43104) Set `shadeTestJar` of protobuf module to false
Yang Jie created SPARK-43104: Summary: Set `shadeTestJar` of protobuf module to false Key: SPARK-43104 URL: https://issues.apache.org/jira/browse/SPARK-43104 Project: Spark Issue Type: Improvement Components: Protobuf Affects Versions: 3.5.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42994) Torch Distributor support Local Mode
[ https://issues.apache.org/jira/browse/SPARK-42994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-42994. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40695 [https://github.com/apache/spark/pull/40695] > Torch Distributor support Local Mode > > > Key: SPARK-42994 > URL: https://issues.apache.org/jira/browse/SPARK-42994 > Project: Spark > Issue Type: Sub-task > Components: Connect, ML >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.5.0 > >
[jira] [Assigned] (SPARK-42994) Torch Distributor support Local Mode
[ https://issues.apache.org/jira/browse/SPARK-42994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-42994: - Assignee: Ruifeng Zheng > Torch Distributor support Local Mode > > > Key: SPARK-42994 > URL: https://issues.apache.org/jira/browse/SPARK-42994 > Project: Spark > Issue Type: Sub-task > Components: Connect, ML >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major >
[jira] [Created] (SPARK-43105) Abbreviate Bytes in proto message's debug string
Ruifeng Zheng created SPARK-43105: - Summary: Abbreviate Bytes in proto message's debug string Key: SPARK-43105 URL: https://issues.apache.org/jira/browse/SPARK-43105 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 3.5.0 Reporter: Ruifeng Zheng
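SPARK-43105 proposes truncating long bytes fields when rendering a proto message's debug string, so large binary payloads don't flood logs. A minimal sketch of the idea in Python (the helper name and output format here are illustrative assumptions, not Spark's actual implementation):

```python
def abbreviate_bytes(value: bytes, max_len: int = 8) -> str:
    """Render a bytes value as hex, truncating anything longer than max_len."""
    if len(value) <= max_len:
        return value.hex()
    # Show only a prefix plus a count of what was elided.
    return f"{value[:max_len].hex()}... ({len(value) - max_len} more bytes)"

# A 1 KiB payload collapses to one short, readable line.
print(abbreviate_bytes(b"\x00" * 1024))
```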
[jira] [Created] (SPARK-43106) Data lost from the table if the INSERT OVERWRITE query fails
vaibhav beriwala created SPARK-43106: Summary: Data lost from the table if the INSERT OVERWRITE query fails Key: SPARK-43106 URL: https://issues.apache.org/jira/browse/SPARK-43106 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.2 Reporter: vaibhav beriwala When we run an INSERT OVERWRITE query for an unpartitioned table on Spark 3, Spark has the following behavior: 1) It first cleans up all the data from the actual table path. 2) It then launches a job that performs the actual insert. There are two major issues with this approach: 1) If the insert job launched in step 2 above fails for any reason, the data from the original table is lost. 2) If the insert job in step 2 above takes a long time to complete, the table data is unavailable to other readers for the entire duration of the job. The behavior is the same for partitioned tables when using static partitioning. For dynamic partitioning, we do not delete the table data before the job launch. Is there a reason why we perform this delete before the job launch and not as part of the job commit operation? This issue is not present in Hive, where the data is apparently cleaned up as part of the job commit operation.
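The commit-time alternative the report asks about can be illustrated with plain file operations: write the new output to a staging directory, and only delete and replace the old data once the write has succeeded. This is a sketch of the general idea under simplified assumptions (local filesystem, single writer); it is not Spark's actual commit protocol:

```python
import os
import shutil
import tempfile

def write_job(out_dir, rows):
    """Toy 'insert job': write all rows to a single part file."""
    with open(os.path.join(out_dir, "part-00000"), "w") as f:
        f.writelines(r + "\n" for r in rows)

def overwrite_at_commit(table_dir, rows, write_fn=write_job):
    """INSERT OVERWRITE that defers the destructive delete to commit time."""
    staging = tempfile.mkdtemp(prefix="staging-")
    try:
        write_fn(staging, rows)       # may fail; old data is untouched
    except Exception:
        shutil.rmtree(staging)        # abort: drop only the staging output
        raise                         # table_dir still holds the old data
    shutil.rmtree(table_dir)          # commit: delete the old data only now
    shutil.move(staging, table_dir)   # and swap in the new output
```

Because the old data survives until commit, readers also keep seeing it for the whole duration of the job, which addresses the second issue in the report.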
[jira] [Updated] (SPARK-43106) Data lost from the table if the INSERT OVERWRITE query fails
[ https://issues.apache.org/jira/browse/SPARK-43106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] vaibhav beriwala updated SPARK-43106: - Issue Type: Bug (was: Improvement) > Data lost from the table if the INSERT OVERWRITE query fails > > > Key: SPARK-43106 > URL: https://issues.apache.org/jira/browse/SPARK-43106 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2 >Reporter: vaibhav beriwala >Priority: Major >
[jira] [Updated] (SPARK-43106) Data lost from the table if the INSERT OVERWRITE query fails
[ https://issues.apache.org/jira/browse/SPARK-43106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] vaibhav beriwala updated SPARK-43106: - Description: When we run an INSERT OVERWRITE query for an unpartitioned table on Spark 3, Spark has the following behavior: 1) It first cleans up all the data from the actual table path. 2) It then launches a job that performs the actual insert. There are two major issues with this approach: 1) If the insert job launched in step 2 above fails for any reason, the data from the original table is lost. 2) If the insert job in step 2 above takes a long time to complete, the table data is unavailable to other readers for the entire duration of the job. The behavior is the same for partitioned tables when using static partitioning. For dynamic partitioning, we do not delete the table data before the job launch. Is there a reason why we perform this delete before the job launch and not as part of the job commit operation? This issue is not present in Hive, where the data is apparently cleaned up as part of the job commit operation. As part of SPARK-19183, we did add a new job commit hook for this exact purpose, but it seems its default behavior is still to delete the table data before the job launch. was: When we run an INSERT OVERWRITE query for an unpartitioned table on Spark 3, Spark has the following behavior: 1) It first cleans up all the data from the actual table path. 2) It then launches a job that performs the actual insert. There are two major issues with this approach: 1) If the insert job launched in step 2 above fails for any reason, the data from the original table is lost. 2) If the insert job in step 2 above takes a long time to complete, the table data is unavailable to other readers for the entire duration of the job. The behavior is the same for partitioned tables when using static partitioning. For dynamic partitioning, we do not delete the table data before the job launch. Is there a reason why we perform this delete before the job launch and not as part of the job commit operation? This issue is not present in Hive, where the data is apparently cleaned up as part of the job commit operation.
[jira] [Updated] (SPARK-43106) Data lost from the table if the INSERT OVERWRITE query fails
[ https://issues.apache.org/jira/browse/SPARK-43106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] vaibhav beriwala updated SPARK-43106: - Description: When we run an INSERT OVERWRITE query for an unpartitioned table on Spark 3, Spark has the following behavior: 1) It first cleans up all the data from the actual table path. 2) It then launches a job that performs the actual insert. There are two major issues with this approach: 1) If the insert job launched in step 2 above fails for any reason, the data from the original table is lost. 2) If the insert job in step 2 above takes a long time to complete, the table data is unavailable to other readers for the entire duration of the job. The behavior is the same for partitioned tables when using static partitioning. For dynamic partitioning, we do not delete the table data before the job launch. Is there a reason why we perform this delete before the job launch and not as part of the job commit operation? This issue is not present in Hive, where the data is apparently cleaned up as part of the job commit operation. As part of SPARK-19183, we did add a new hook in the commit protocol for this exact purpose, but it seems its default behavior is still to delete the table data before the job launch. was: When we run an INSERT OVERWRITE query for an unpartitioned table on Spark 3, Spark has the following behavior: 1) It first cleans up all the data from the actual table path. 2) It then launches a job that performs the actual insert. There are two major issues with this approach: 1) If the insert job launched in step 2 above fails for any reason, the data from the original table is lost. 2) If the insert job in step 2 above takes a long time to complete, the table data is unavailable to other readers for the entire duration of the job. The behavior is the same for partitioned tables when using static partitioning. For dynamic partitioning, we do not delete the table data before the job launch. Is there a reason why we perform this delete before the job launch and not as part of the job commit operation? This issue is not present in Hive, where the data is apparently cleaned up as part of the job commit operation. As part of SPARK-19183, we did add a new job commit hook for this exact purpose, but it seems its default behavior is still to delete the table data before the job launch.
[jira] [Commented] (SPARK-43105) Abbreviate Bytes in proto message's debug string
[ https://issues.apache.org/jira/browse/SPARK-43105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711311#comment-17711311 ] GridGain Integration commented on SPARK-43105: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/40750 > Abbreviate Bytes in proto message's debug string > > > Key: SPARK-43105 > URL: https://issues.apache.org/jira/browse/SPARK-43105 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Priority: Major >
[jira] [Resolved] (SPARK-43103) Move Integral to PhysicalDataType
[ https://issues.apache.org/jira/browse/SPARK-43103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-43103. - Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40752 [https://github.com/apache/spark/pull/40752] > Move Integral to PhysicalDataType > - > > Key: SPARK-43103 > URL: https://issues.apache.org/jira/browse/SPARK-43103 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Fix For: 3.5.0 > >
[jira] [Created] (SPARK-43107) Coalesce applied on broadcast join stream side
Yuming Wang created SPARK-43107: --- Summary: Coalesce applied on broadcast join stream side Key: SPARK-43107 URL: https://issues.apache.org/jira/browse/SPARK-43107 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: Yuming Wang
[jira] [Resolved] (SPARK-43038) Support the CBC mode by aes_encrypt()/aes_decrypt()
[ https://issues.apache.org/jira/browse/SPARK-43038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-43038. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40704 [https://github.com/apache/spark/pull/40704] > Support the CBC mode by aes_encrypt()/aes_decrypt() > --- > > Key: SPARK-43038 > URL: https://issues.apache.org/jira/browse/SPARK-43038 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.5.0 > > > Implement the CBC mode in aes_encrypt() and aes_decrypt(). Use > *AES/CBC/PKCS5Padding* of the Java Cryptographic Extension (JCE) framework. > Output should be compatible with OpenSSL.
[jira] [Created] (SPARK-43108) org.apache.spark.storage.StorageStatus NotSerializableException when try to access StorageStatus in a MapPartitionsFunction
surender godara created SPARK-43108: --- Summary: org.apache.spark.storage.StorageStatus NotSerializableException when try to access StorageStatus in a MapPartitionsFunction Key: SPARK-43108 URL: https://issues.apache.org/jira/browse/SPARK-43108 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.2.1 Reporter: surender godara When you try to access the *storage status (org.apache.spark.storage.StorageStatus)* inside a MapPartitionsFunction, the getStorageStatus call throws a NotSerializableException, because the StorageStatus object is not serializable. Here is an example code snippet that demonstrates how to access the storage status inside a MapPartitionsFunction in Spark:

{code:java}
StorageStatus[] storageStatus = SparkEnv.get().blockManager().master().getStorageStatus();
{code}

*Error stacktrace:*

{code:java}
Caused by: java.io.NotSerializableException: org.apache.spark.storage.StorageStatus
Serialization stack:
	- object not serializable (class: org.apache.spark.storage.StorageStatus, value: org.apache.spark.storage.StorageStatus@715b4e82)
	- element of array (index: 0)
	- array (class [Lorg.apache.spark.storage.StorageStatus;, size 2)
	at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
	at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
	at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
	at org.apache.spark.rpc.netty.NettyRpcEnv.serialize(NettyRpcEnv.scala:286)
	at org.apache.spark.rpc.netty.RemoteNettyRpcCallContext.send(NettyRpcCallContext.scala:64)
	at org.apache.spark.rpc.netty.NettyRpcCallContext.reply(NettyRpcCallContext.scala:32)
	at org.apache.spark.storage.BlockManagerMasterEndpoint$$anonfun$receiveAndReply$1.applyOrElse(BlockManagerMasterEndpoint.scala:156)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:103)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
{code}

*Steps to reproduce*
Step 1: Create a Spark session in standalone mode.
Step 2: Create a Dataset using the SparkSession and load data from a file or a database.
Step 3: Define a MapPartitionsFunction that fetches the storage status inside it. Here is the code snippet of the MapPartitionsFunction:

{code:java}
df = df.mapPartitions(new MapPartitionsFunction<Row, Row>() {
    @Override
    public Iterator<Row> call(Iterator<Row> input) throws Exception {
        StorageStatus[] storageStatus = SparkEnv.get().blockManager().master().getStorageStatus();
        return input;
    }
}, RowEncoder.apply(df.schema()));
{code}

Step 4: Submit the Spark job.

*Proposed solution:* Implement the Serializable interface for org.apache.spark.storage.StorageStatus.
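The failure above is the generic "live handle captured in a serialized message" problem. StorageStatus itself is JVM-side, so the following is only a plain-Python analogy of the same failure mode, together with the usual workaround of copying the needed fields into a plain serializable value before it crosses a process boundary (the class and field names here are made up for illustration):

```python
import pickle

class StorageStatusLike:
    """Stand-in for a non-serializable object such as StorageStatus."""
    def __init__(self, block_manager_id, max_mem):
        self.block_manager_id = block_manager_id
        self.max_mem = max_mem

    def __reduce__(self):
        # Mimic Java's NotSerializableException on any pickling attempt.
        raise TypeError("StorageStatusLike is not serializable")

def to_plain(status):
    """Copy only plain fields; this dict pickles without trouble."""
    return {"block_manager_id": status.block_manager_id,
            "max_mem": status.max_mem}

status = StorageStatusLike("driver", 512)
try:
    pickle.dumps(status)                   # fails, like the reported exception
except TypeError as exc:
    print("serialization failed:", exc)

payload = pickle.dumps(to_plain(status))   # the extracted fields round-trip
```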
[jira] [Updated] (SPARK-43108) org.apache.spark.storage.StorageStatus NotSerializableException when try to access StorageStatus in a MapPartitionsFunction
[ https://issues.apache.org/jira/browse/SPARK-43108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] surender godara updated SPARK-43108: Description: When you try to access the *storage status (org.apache.spark.storage.StorageStatus)* inside a MapPartitionsFunction, the getStorageStatus call throws a NotSerializableException, because the StorageStatus object is not serializable. Here is an example code snippet that demonstrates how to access the storage status inside a MapPartitionsFunction in Spark:

{code:java}
StorageStatus[] storageStatus = SparkEnv.get().blockManager().master().getStorageStatus();
{code}

*Error stacktrace:*

{code:java}
Caused by: java.io.NotSerializableException: org.apache.spark.storage.StorageStatus
Serialization stack:
	- object not serializable (class: org.apache.spark.storage.StorageStatus, value: org.apache.spark.storage.StorageStatus@715b4e82)
	- element of array (index: 0)
	- array (class [Lorg.apache.spark.storage.StorageStatus;, size 2)
	at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
	at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
	at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
	at org.apache.spark.rpc.netty.NettyRpcEnv.serialize(NettyRpcEnv.scala:286)
	at org.apache.spark.rpc.netty.RemoteNettyRpcCallContext.send(NettyRpcCallContext.scala:64)
	at org.apache.spark.rpc.netty.NettyRpcCallContext.reply(NettyRpcCallContext.scala:32)
	at org.apache.spark.storage.BlockManagerMasterEndpoint$$anonfun$receiveAndReply$1.applyOrElse(BlockManagerMasterEndpoint.scala:156)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:103)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
{code}

*Steps to reproduce*
Step 1: Initialize a Spark session in standalone mode.
Step 2: Create a Dataset using the SparkSession and load data.
Step 3: Define a MapPartitionsFunction on the Dataset and get the storage status inside it. Here is the code snippet of the MapPartitionsFunction:

{code:java}
df = df.mapPartitions(new MapPartitionsFunction<Row, Row>() {
    @Override
    public Iterator<Row> call(Iterator<Row> input) throws Exception {
        StorageStatus[] storageStatus = SparkEnv.get().blockManager().master().getStorageStatus();
        return input;
    }
}, RowEncoder.apply(df.schema()));
{code}

Step 4: Submit the Spark job.

*Proposed solution:* Implement the Serializable interface for org.apache.spark.storage.StorageStatus.
[jira] [Created] (SPARK-43109) JavaRDD.saveAsTextFile Directory Creation issue using Spark 3.3.2, with hadoop3
shamim created SPARK-43109: -- Summary: JavaRDD.saveAsTextFile Directory Creation issue using Spark 3.3.2, with hadoop3 Key: SPARK-43109 URL: https://issues.apache.org/jira/browse/SPARK-43109 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.3.2 Reporter: shamim We are using Spark 3.3.2 with Hadoop 3 in a Java application. While saving a file using JavaRDD.saveAsTextFile("DIR_NAME/Sample.txt"), Spark creates DIR_NAME owned by the same OS user the application runs as (e.g. "vs"). However, we have provided another user that should create DIR_NAME, as below in our Spark code:

UserGroupInformation theSparkUser = UserGroupInformation.createRemoteUser("spark");
UserGroupInformation.setLoginUser(theSparkUser);

Below is the directory structure created by the application:

drwxrwxrwx+ 2 vs vs 6 Apr 12 17:09 main

We want the directory to be created by the user we have provided, like below:

drwxrwxrwx+ 2 spark spark 6 Apr 12 17:09 main

When we verify the user by printing it, both of the following print the same user, "spark", that we passed to UserGroupInformation:

UserGroupInformation.getLoginUser().getUserName(); --> prints user as "spark"
UserGroupInformation.getCurrentUser().getUserName(); --> prints user as "spark"
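On a plain (non-secured) local filesystem this outcome is expected: ownership of a newly created directory comes from the OS-level uid of the writing process, not from an application-level identity such as the UGI login user. A small Python illustration of that rule, with a "claimed" identity standing in for the UserGroupInformation login user (no Spark or Hadoop involved; POSIX assumed):

```python
import os
import tempfile

# The application can claim any identity it likes...
claimed_user = "spark"   # analogous to UserGroupInformation.createRemoteUser

# ...but the filesystem records the effective uid of the creating process.
d = tempfile.mkdtemp(prefix="main-")
owner_uid = os.stat(d).st_uid

print("claimed identity:", claimed_user)
print("actual owner uid:", owner_uid, "== process uid:", os.geteuid())
```

On HDFS the situation differs: there the NameNode assigns ownership from the identity it authenticates (the asserted username under simple auth, or the Kerberos principal on a secured cluster), which is why the UGI trick appears to work in some deployments but not when writing to a local path.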
[jira] [Commented] (SPARK-16484) Incremental Cardinality estimation operations with Hyperloglog
[ https://issues.apache.org/jira/browse/SPARK-16484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711481#comment-17711481 ] Hudson commented on SPARK-16484: User 'RyanBerti' has created a pull request for this issue: https://github.com/apache/spark/pull/40615 > Incremental Cardinality estimation operations with Hyperloglog > -- > > Key: SPARK-16484 > URL: https://issues.apache.org/jira/browse/SPARK-16484 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yongjia Wang >Priority: Major > Labels: bulk-closed > > Efficient cardinality estimation is very important, and SparkSQL has had > approxCountDistinct based on Hyperloglog for quite some time. However, there > isn't a way to do incremental estimation. For example, if we want to get > updated distinct counts of the last 90 days, we need to do the aggregation > for the entire window over and over again. The more efficient way involves > serializing the counter for smaller time windows (such as hourly) so the > counts can be efficiently updated in an incremental fashion for any time > window. > With the support of custom UDAFs, the Binary DataType and the HyperloglogPlusPlus > implementation in the current Spark version, it's easy enough to extend the > functionality to include incremental counting, and even other general set > operations such as intersection and set difference. The Spark API is already as > elegant as it can be, but it still takes quite some effort to do a custom > implementation of the aforementioned operations, which are supposed to be in > high demand. I have been searching but failed to find a usable existing > solution or any ongoing effort for this. The closest I got is the following, > but it does not work with Spark 1.6 due to API changes. > https://github.com/collectivemedia/spark-hyperloglog/blob/master/src/main/scala/org/apache/spark/sql/hyperloglog/aggregates.scala > I wonder if it is worth integrating such operations into SparkSQL. The only > problem I see is that it depends on the serialization of a specific HLL > implementation and introduces compatibility issues. But as long as the user > is aware of such issues, it should be fine.
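The merge property the issue relies on is what makes incremental estimation work: a HyperLogLog register array kept per small window can be combined with another by an element-wise max, yielding the cardinality of the union without rescanning the data. A toy sketch of the idea (plain Python; not Spark's HyperLogLogPlusPlus, and with only the basic linear-counting correction for small cardinalities):

```python
import hashlib
import math

M = 1024                           # number of registers (2^10)
B = 10                             # bits used to pick a register
ALPHA = 0.7213 / (1 + 1.079 / M)   # standard bias-correction constant

def _hash(item) -> int:
    """Deterministic 64-bit hash of an item."""
    return int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")

def new_sketch():
    return [0] * M

def add(reg, item):
    h = _hash(item)
    idx = h >> (64 - B)                      # first B bits pick a register
    rest = h & ((1 << (64 - B)) - 1)
    rho = (64 - B) - rest.bit_length() + 1   # position of the leftmost 1-bit
    reg[idx] = max(reg[idx], rho)

def merge(a, b):
    """Union of two sketches: element-wise max of their registers."""
    return [max(x, y) for x, y in zip(a, b)]

def estimate(reg):
    raw = ALPHA * M * M / sum(2.0 ** -r for r in reg)
    zeros = reg.count(0)
    if raw <= 2.5 * M and zeros:             # small-range: linear counting
        return M * math.log(M / zeros)
    return raw
```

Keeping, say, one serialized register array per hour then lets a 90-day distinct count be answered by folding `merge` over the stored sketches, which is the incremental scheme the issue describes.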
[jira] [Created] (SPARK-43110) Move asIntegral to PhysicalDataType
Rui Wang created SPARK-43110: Summary: Move asIntegral to PhysicalDataType Key: SPARK-43110 URL: https://issues.apache.org/jira/browse/SPARK-43110 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Rui Wang Assignee: Rui Wang
[jira] [Commented] (SPARK-43084) Add Python state API (applyInPandasWithState) and verify UDFs
[ https://issues.apache.org/jira/browse/SPARK-43084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711505#comment-17711505 ] Peng Zhong commented on SPARK-43084: Verified that UDFs work in streaming Spark Connect:

{code:python}
>>> from pyspark.sql.functions import col, udf
>>> def negative(value):
...     return -value
...
>>> negativeUDF = udf(lambda z: negative(z))
>>>
>>> query = (
...     spark
...     .readStream
...     .format("rate")
...     .option("numPartitions", "1")
...     .load()
...     .select(col("timestamp"), negativeUDF(col("value")).alias("negValue"))
...     .writeStream
...     .format("memory")
...     .queryName("rate_table")
...     .trigger(processingTime="10 seconds")
...     .start()
... )
>>>
>>> query.status
{'message': 'Waiting for next trigger', 'isDataAvailable': True, 'isTriggerActive': False}
>>>
>>> spark.sql("select * from rate_table").show()
+--------------------+--------+
|           timestamp|negValue|
+--------------------+--------+
|2023-04-12 11:04:...|       0|
|2023-04-12 11:04:...|      -1|
|2023-04-12 11:04:...|      -2|
|2023-04-12 11:04:...|      -3|
|2023-04-12 11:04:...|      -4|
|2023-04-12 11:04:...|      -5|
|2023-04-12 11:04:...|      -6|
|2023-04-12 11:04:...|      -7|
|2023-04-12 11:04:...|      -8|
|2023-04-12 11:05:...|      -9|
|2023-04-12 11:05:...|     -10|
|2023-04-12 11:05:...|     -11|
|2023-04-12 11:05:...|     -12|
|2023-04-12 11:05:...|     -13|
|2023-04-12 11:05:...|     -14|
|2023-04-12 11:05:...|     -15|
|2023-04-12 11:05:...|     -16|
|2023-04-12 11:05:...|     -17|
+--------------------+--------+
{code}

> Add Python state API (applyInPandasWithState) and verify UDFs > - > > Key: SPARK-43084 > URL: https://issues.apache.org/jira/browse/SPARK-43084 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 > Environment: * Add Python state API (applyInPandasWithState) to > streaming Spark-connect. > * verify the UDFs work (it may not need any code changes). >Reporter: Raghu Angadi >Priority: Major >
[jira] [Commented] (SPARK-43084) Add Python state API (applyInPandasWithState) and verify UDFs
[ https://issues.apache.org/jira/browse/SPARK-43084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711507#comment-17711507 ] Peng Zhong commented on SPARK-43084: applyInPandasWithState support for Spark Connect was added in this PR: https://github.com/apache/spark/pull/40736 > Add Python state API (applyInPandasWithState) and verify UDFs > - > > Key: SPARK-43084 > URL: https://issues.apache.org/jira/browse/SPARK-43084 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 > Environment: * Add Python state API (applyInPandasWithState) to > streaming Spark-connect. > * verify the UDFs work (it may not need any code changes). >Reporter: Raghu Angadi >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42437) Pyspark catalog.cacheTable allow to specify storage level Connect add support Storagelevel
[ https://issues.apache.org/jira/browse/SPARK-42437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-42437. --- Fix Version/s: 3.5.0 Assignee: Khalid Mammadov Resolution: Fixed Issue resolved by pull request 40015 https://github.com/apache/spark/pull/40015 > Pyspark catalog.cacheTable allow to specify storage level Connect add support > Storagelevel > -- > > Key: SPARK-42437 > URL: https://issues.apache.org/jira/browse/SPARK-42437 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Khalid Mammadov >Assignee: Khalid Mammadov >Priority: Major > Fix For: 3.5.0 > > > Currently the PySpark version of the catalog.cacheTable function does not support > specifying a storage level. This is to add that support. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43111) Merge nested if statements into single if statements
Bjørn Jørgensen created SPARK-43111: --- Summary: Merge nested if statements into single if statements Key: SPARK-43111 URL: https://issues.apache.org/jira/browse/SPARK-43111 Project: Spark Issue Type: Improvement Components: Pandas API on Spark, PySpark Affects Versions: 3.5.0 Reporter: Bjørn Jørgensen Simplify the code by merging nested if statements into single if statements using the and operator. The changes do not affect the functionality of the code, but they improve readability and maintainability. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
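The refactoring the ticket describes can be sketched as follows. This is an illustrative example only (the function names are hypothetical, not taken from the Spark PR): a nested if collapsed into a single if joined with `and`, with no change in behavior.

```python
# Illustrative sketch of SPARK-43111's refactoring: merge nested `if`
# statements into one condition joined with `and`.

def keep_positive_even_nested(n):
    # Before: two nested if statements
    if n > 0:
        if n % 2 == 0:
            return True
    return False

def keep_positive_even_merged(n):
    # After: a single merged condition -- same behavior, flatter code
    if n > 0 and n % 2 == 0:
        return True
    return False

# The two versions agree on every input, so the change is purely cosmetic
assert all(keep_positive_even_nested(n) == keep_positive_even_merged(n)
           for n in range(-10, 11))
```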
[jira] [Commented] (SPARK-43111) Merge nested if statements into single if statements
[ https://issues.apache.org/jira/browse/SPARK-43111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711554#comment-17711554 ] Bjørn Jørgensen commented on SPARK-43111: - https://github.com/apache/spark/pull/40759 > Merge nested if statements into single if statements > > > Key: SPARK-43111 > URL: https://issues.apache.org/jira/browse/SPARK-43111 > Project: Spark > Issue Type: Improvement > Components: Pandas API on Spark, PySpark >Affects Versions: 3.5.0 >Reporter: Bjørn Jørgensen >Priority: Major > > Simplify the code by merging nested if statements into single if statements > using the and operator. The changes do not affect the functionality of the > code, but they improve readability and maintainability. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43112) Spark may use a column other than the actual specified partitioning column for partitioning, for Hive format tables
Asif created SPARK-43112: Summary: Spark may use a column other than the actual specified partitioning column for partitioning, for Hive format tables Key: SPARK-43112 URL: https://issues.apache.org/jira/browse/SPARK-43112 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.1 Reporter: Asif The class org.apache.spark.sql.catalyst.catalog.HiveTableRelation has its output method implemented as // The partition column should always appear after data columns. override def output: Seq[AttributeReference] = dataCols ++ partitionCols But the DataWriting commands of Spark, like InsertIntoHiveDirCommand, expect that the output from HiveTableRelation is in the order in which the columns are actually defined in the DDL. As a result, multiple mismatch scenarios can happen, like: 1) a data type casting exception being thrown, even though the data frame being inserted has a schema identical to the one used for creating the DDL. OR 2) the wrong column being used for partitioning, if the data types are the same or cast-able, like date type and long. Will be creating a PR with the bug test -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
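The ordering mismatch the ticket describes can be pictured with a small self-contained sketch. All names below are hypothetical and this is not Spark code; it is only a toy model of what happens when row values laid out in DDL order are bound against a column list that has moved partition columns to the end.

```python
# Toy model of SPARK-43112: the relation reports columns as
# dataCols ++ partitionCols, while a writer expects DDL declaration order.

ddl_order = ["id", "event_date", "value"]    # order the columns were declared in
data_cols = ["id", "value"]                  # non-partition columns
partition_cols = ["event_date"]              # partition column

# HiveTableRelation-style output: partition columns appended after data columns
relation_output = data_cols + partition_cols  # ['id', 'value', 'event_date']

# A writer that zips row values (in DDL order) against relation_output
# silently binds values to the wrong columns:
row_in_ddl_order = [42, "2023-04-13", 7]
misbound = dict(zip(relation_output, row_in_ddl_order))
assert misbound["value"] == "2023-04-13"   # data column silently got the date
assert misbound["event_date"] == 7         # partition key silently got the long
```

Depending on whether the misplaced types are cast-able, this surfaces either as a cast exception or as partitioning on the wrong column, which matches the two failure scenarios in the ticket.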
[jira] [Updated] (SPARK-43112) Spark may use a column other than the actual specified partitioning column for partitioning, for Hive format tables
[ https://issues.apache.org/jira/browse/SPARK-43112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-43112: - Description: The class org.apache.spark.sql.catalyst.catalog.HiveTableRelation has its output method implemented as // The partition column should always appear after data columns. override def output: Seq[AttributeReference] = dataCols ++ partitionCols But the DataWriting commands of spark like InsertIntoHiveDirCommand, expect that the output from HiveTableRelation is in the order in which the columns are actually defined in the DDL. As a result, multiple mismatch scenarios can happen like: 1) data type casting exception being thrown , even though the data frame being inserted has schema which is identical to what is used for creating ddl. OR 2) Wrong column being used for partitioning , if the datatypes are same or cast-able, like date type and long will be creating a PR with the bug test was: The class org.apache.spark.sql.catalyst.catalog.HiveTableRelation has its output method implemented as // The partition column should always appear after data columns. override def output: Seq[AttributeReference] = dataCols ++ partitionCols But the DataWriting commands of spark like InsertIntoHiveDirCommand, expect that the out from HiveTableRelation is in the order in which the columns are actually defined in the DDL. As a result, multiple mistmatch scenarios can happen like: 1) data type casting exception being thrown , even though the data frame being inserted has schema which is identical to what is used for creating ddl. 
OR 2) Wrong column being used for partitioning , if the datatypes are same or castable, like datetype and long will be creating a PR with the bug test > Spark may use a column other than the actual specified partitioning column > for partitioning, for Hive format tables > > > Key: SPARK-43112 > URL: https://issues.apache.org/jira/browse/SPARK-43112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.1 >Reporter: Asif >Priority: Critical > > The class org.apache.spark.sql.catalyst.catalog.HiveTableRelation has its > output method implemented as > // The partition column should always appear after data columns. > override def output: Seq[AttributeReference] = dataCols ++ partitionCols > But the DataWriting commands of spark like InsertIntoHiveDirCommand, expect > that the output from HiveTableRelation is in the order in which the columns > are actually defined in the DDL. > As a result, multiple mismatch scenarios can happen like: > 1) data type casting exception being thrown , even though the data frame > being inserted has schema which is identical to what is used for creating ddl. > OR > 2) Wrong column being used for partitioning , if the datatypes are same or > cast-able, like date type and long > will be creating a PR with the bug test -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43113) Codegen error when full outer join's bound condition has multiple references to the same stream-side column
Bruce Robbins created SPARK-43113: - Summary: Codegen error when full outer join's bound condition has multiple references to the same stream-side column Key: SPARK-43113 URL: https://issues.apache.org/jira/browse/SPARK-43113 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.2, 3.4.0, 3.5.0 Reporter: Bruce Robbins Example # 1 (sort merge join): {noformat} create or replace temp view v1 as select * from values (1, 1), (2, 2), (3, 1) as v1(key, value); create or replace temp view v2 as select * from values (1, 22, 22), (3, -1, -1), (7, null, null) as v2(a, b, c); select * from v1 full outer join v2 on key = a and value > b and value > c; {noformat} The join's generated code causes the following compilation error: {noformat} org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 277, Column 9: Redefinition of local variable "smj_isNull_7" {noformat} Example #2 (shuffle hash join): {noformat} select /*+ SHUFFLE_HASH(v2) */ * from v1 full outer join v2 on key = a and value > b and value > c; {noformat} The shuffle hash join's generated code causes the following compilation error: {noformat} org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 174, Column 5: Redefinition of local variable "shj_value_1" {noformat} With default configuration, both queries end up succeeding, since Spark falls back to running each query with whole-stage codegen disabled. The issue happens only when the join's bound condition refers to the same stream-side column more than once. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
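One plausible way to picture the failure mode above (this is a hypothetical toy generator, not Spark's whole-stage codegen): if the generator emits a local-variable declaration for every column reference without tracking names it has already declared, a bound condition that references the same stream-side column twice (`value > b AND value > c`) produces the same declaration twice, which is exactly the kind of code Janino rejects with "Redefinition of local variable".

```python
# Toy sketch of the duplicate-declaration pattern behind the compile error.

def gen_naive(condition_refs):
    # Emits one declaration per column reference, with no deduplication
    return [f"boolean smj_isNull_{name} = row.isNullAt({name});"
            for name in condition_refs]

def gen_dedup(condition_refs):
    # Tracks names already declared, emitting each declaration only once
    seen, lines = set(), []
    for name in condition_refs:
        if name not in seen:
            seen.add(name)
            lines.append(f"boolean smj_isNull_{name} = row.isNullAt({name});")
    return lines

refs = ["7", "7"]                 # same stream-side column referenced twice
naive = gen_naive(refs)
assert naive[0] == naive[1]       # duplicate declaration -> Janino compile error
assert len(gen_dedup(refs)) == 1  # deduplicated codegen declares it once
```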
[jira] [Created] (SPARK-43114) Add interval types to TypeCoercionSuite
Yuming Wang created SPARK-43114: --- Summary: Add interval types to TypeCoercionSuite Key: SPARK-43114 URL: https://issues.apache.org/jira/browse/SPARK-43114 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43115) Split pyspark-pandas-connect from pyspark-connect module.
Takuya Ueshin created SPARK-43115: - Summary: Split pyspark-pandas-connect from pyspark-connect module. Key: SPARK-43115 URL: https://issues.apache.org/jira/browse/SPARK-43115 Project: Spark Issue Type: Test Components: PySpark, Tests Affects Versions: 3.5.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43107) Coalesce applied on broadcast join stream side
[ https://issues.apache.org/jira/browse/SPARK-43107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711605#comment-17711605 ] Yuming Wang commented on SPARK-43107: - https://github.com/apache/spark/pull/40756 > Coalesce applied on broadcast join stream side > -- > > Key: SPARK-43107 > URL: https://issues.apache.org/jira/browse/SPARK-43107 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43114) Add interval types to TypeCoercionSuite
[ https://issues.apache.org/jira/browse/SPARK-43114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711604#comment-17711604 ] Yuming Wang commented on SPARK-43114: - https://github.com/apache/spark/pull/40763 > Add interval types to TypeCoercionSuite > --- > > Key: SPARK-43114 > URL: https://issues.apache.org/jira/browse/SPARK-43114 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43107) Coalesce buckets in join applied on broadcast join stream side
[ https://issues.apache.org/jira/browse/SPARK-43107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-43107: Summary: Coalesce buckets in join applied on broadcast join stream side (was: Coalesce applied on broadcast join stream side) > Coalesce buckets in join applied on broadcast join stream side > -- > > Key: SPARK-43107 > URL: https://issues.apache.org/jira/browse/SPARK-43107 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43112) Spark may use a column other than the actual specified partitioning column for partitioning, for Hive format tables
[ https://issues.apache.org/jira/browse/SPARK-43112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711607#comment-17711607 ] Asif commented on SPARK-43112: -- Opened a WIP PR [SPARK-43112|https://github.com/apache/spark/pull/40765/], which has the bug tests as of now > Spark may use a column other than the actual specified partitioning column > for partitioning, for Hive format tables > > > Key: SPARK-43112 > URL: https://issues.apache.org/jira/browse/SPARK-43112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.1 >Reporter: Asif >Priority: Critical > > The class org.apache.spark.sql.catalyst.catalog.HiveTableRelation has its > output method implemented as > // The partition column should always appear after data columns. > override def output: Seq[AttributeReference] = dataCols ++ partitionCols > But the DataWriting commands of spark like InsertIntoHiveDirCommand, expect > that the output from HiveTableRelation is in the order in which the columns > are actually defined in the DDL. > As a result, multiple mismatch scenarios can happen like: > 1) data type casting exception being thrown , even though the data frame > being inserted has schema which is identical to what is used for creating ddl. > OR > 2) Wrong column being used for partitioning , if the datatypes are same or > cast-able, like date type and long > will be creating a PR with the bug test -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43031) Enable tests for Python streaming spark-connect
[ https://issues.apache.org/jira/browse/SPARK-43031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-43031: Assignee: Wei Liu > Enable tests for Python streaming spark-connect > --- > > Key: SPARK-43031 > URL: https://issues.apache.org/jira/browse/SPARK-43031 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Raghu Angadi >Assignee: Wei Liu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43031) Enable tests for Python streaming spark-connect
[ https://issues.apache.org/jira/browse/SPARK-43031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-43031. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40691 [https://github.com/apache/spark/pull/40691] > Enable tests for Python streaming spark-connect > --- > > Key: SPARK-43031 > URL: https://issues.apache.org/jira/browse/SPARK-43031 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Raghu Angadi >Assignee: Wei Liu >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43113) Codegen error when full outer join's bound condition has multiple references to the same stream-side column
[ https://issues.apache.org/jira/browse/SPARK-43113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711614#comment-17711614 ] Bruce Robbins commented on SPARK-43113: --- PR here: https://github.com/apache/spark/pull/40766/files > Codegen error when full outer join's bound condition has multiple references > to the same stream-side column > --- > > Key: SPARK-43113 > URL: https://issues.apache.org/jira/browse/SPARK-43113 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2, 3.4.0, 3.5.0 >Reporter: Bruce Robbins >Priority: Major > > Example # 1 (sort merge join): > {noformat} > create or replace temp view v1 as > select * from values > (1, 1), > (2, 2), > (3, 1) > as v1(key, value); > create or replace temp view v2 as > select * from values > (1, 22, 22), > (3, -1, -1), > (7, null, null) > as v2(a, b, c); > select * > from v1 > full outer join v2 > on key = a > and value > b > and value > c; > {noformat} > The join's generated code causes the following compilation error: > {noformat} > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 277, Column 9: Redefinition of local variable "smj_isNull_7" > {noformat} > Example #2 (shuffle hash join): > {noformat} > select /*+ SHUFFLE_HASH(v2) */ * > from v1 > full outer join v2 > on key = a > and value > b > and value > c; > {noformat} > The shuffle hash join's generated code causes the following compilation error: > {noformat} > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 174, Column 5: Redefinition of local variable "shj_value_1" > {noformat} > With default configuration, both queries end up succeeding, since Spark falls > back to running each query with whole-stage codegen disabled. > The issue happens only when the join's bound condition refers to the same > stream-side column more than once. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43101) Add CREATE/DROP catalog
[ https://issues.apache.org/jira/browse/SPARK-43101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] melin updated SPARK-43101: -- Description: Convenient registration of the catalog, in sts ref: [https://github.com/trinodb/trino/issues/12709] was: Convenient registration of the catalog, in sts ref: [https://github.com/trinodb/trino/issues/12709] > Add CREATE/DROP catalog > > > Key: SPARK-43101 > URL: https://issues.apache.org/jira/browse/SPARK-43101 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: melin >Priority: Major > > Convenient registration of the catalog, in sts > ref: [https://github.com/trinodb/trino/issues/12709] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43101) Dynamic Catalogs
[ https://issues.apache.org/jira/browse/SPARK-43101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] melin updated SPARK-43101: -- Summary: Dynamic Catalogs (was: Add CREATE/DROP catalog ) > Dynamic Catalogs > > > Key: SPARK-43101 > URL: https://issues.apache.org/jira/browse/SPARK-43101 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: melin >Priority: Major > > Convenient registration of the catalog, in sts > ref: [https://github.com/trinodb/trino/issues/12709] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43101) Add CREATE/DROP catalog
[ https://issues.apache.org/jira/browse/SPARK-43101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] melin updated SPARK-43101: -- Description: Convenient registration of the catalog, in sts ref: [https://github.com/trinodb/trino/issues/12709] was: Convenient registration of the catalog, in sts ref: https://github.com/trinodb/trino/pull/13931 > Add CREATE/DROP catalog > > > Key: SPARK-43101 > URL: https://issues.apache.org/jira/browse/SPARK-43101 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: melin >Priority: Major > > Convenient registration of the catalog, in sts > ref: [https://github.com/trinodb/trino/issues/12709] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43110) Move asIntegral to PhysicalDataType
[ https://issues.apache.org/jira/browse/SPARK-43110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-43110. - Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40758 [https://github.com/apache/spark/pull/40758] > Move asIntegral to PhysicalDataType > --- > > Key: SPARK-43110 > URL: https://issues.apache.org/jira/browse/SPARK-43110 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43063) `df.show` handle null should print NULL instead of null
[ https://issues.apache.org/jira/browse/SPARK-43063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-43063. - Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40699 [https://github.com/apache/spark/pull/40699] > `df.show` handle null should print NULL instead of null > --- > > Key: SPARK-43063 > URL: https://issues.apache.org/jira/browse/SPARK-43063 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: yikaifei >Assignee: yikaifei >Priority: Trivial > Fix For: 3.5.0 > > > `df.show` handle null should print NULL instead of null for consistent > behavior; > {code:java} > The following behavior is currently inconsistent: > ``` shell > scala> spark.sql("select decode(6, 1, 'Southlake', 2, 'San Francisco', 3, > 'New Jersey', 4, 'Seattle') as result").show(false) > +------+ > |result| > +------+ > |null  | > +------+ > ``` > ``` shell > spark-sql> DESC FUNCTION EXTENDED decode; > function_desc > Function: decode > Class: org.apache.spark.sql.catalyst.expressions.Decode > Usage: > decode(bin, charset) - Decodes the first argument using the second > argument character set. > decode(expr, search, result [, search, result ] ... [, default]) - > Compares expr > to each search value in order. If expr is equal to a search value, > decode returns > the corresponding result. If no match is found, then it returns > default. If default > is omitted, it returns null. 
> Extended Usage: > Examples: > > SELECT decode(encode('abc', 'utf-8'), 'utf-8'); >abc > > SELECT decode(2, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', > 4, 'Seattle', 'Non domestic'); >San Francisco > > SELECT decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', > 4, 'Seattle', 'Non domestic'); >Non domestic > > SELECT decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', > 4, 'Seattle'); >NULL > Since: 3.2.0 > Time taken: 0.074 seconds, Fetched 4 row(s) > ``` > ``` shell > spark-sql> select decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New > Jersey', 4, 'Seattle'); > NULL > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
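The requested change boils down to a one-line display rule: render SQL NULL (Python `None`) as the literal string `NULL` in show() output, matching what spark-sql already prints. The helper below is a hypothetical sketch, not Spark's actual implementation.

```python
# Sketch of the display rule SPARK-43063 asks for: SQL NULLs should render
# as "NULL" (not "null") when a DataFrame cell is formatted for df.show().

def format_cell(value):
    # None stands in for SQL NULL; everything else keeps its string form
    return "NULL" if value is None else str(value)

assert format_cell(None) == "NULL"   # previously rendered as "null"
assert format_cell("abc") == "abc"
assert format_cell(7) == "7"
```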
[jira] [Assigned] (SPARK-43063) `df.show` handle null should print NULL instead of null
[ https://issues.apache.org/jira/browse/SPARK-43063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-43063: --- Assignee: yikaifei > `df.show` handle null should print NULL instead of null > --- > > Key: SPARK-43063 > URL: https://issues.apache.org/jira/browse/SPARK-43063 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: yikaifei >Assignee: yikaifei >Priority: Trivial > > `df.show` handle null should print NULL instead of null to consistent > behavior; > {code:java} > Like as the following behavior is currently inconsistent: > ``` shell > scala> spark.sql("select decode(6, 1, 'Southlake', 2, 'San Francisco', 3, > 'New Jersey', 4, 'Seattle') as result").show(false) > +--+ > |result| > +--+ > |null | > +--+ > ``` > ``` shell > spark-sql> DESC FUNCTION EXTENDED decode; > function_desc > Function: decode > Class: org.apache.spark.sql.catalyst.expressions.Decode > Usage: > decode(bin, charset) - Decodes the first argument using the second > argument character set. > decode(expr, search, result [, search, result ] ... [, default]) - > Compares expr > to each search value in order. If expr is equal to a search value, > decode returns > the corresponding result. If no match is found, then it returns > default. If default > is omitted, it returns null. 
> Extended Usage: > Examples: > > SELECT decode(encode('abc', 'utf-8'), 'utf-8'); >abc > > SELECT decode(2, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', > 4, 'Seattle', 'Non domestic'); >San Francisco > > SELECT decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', > 4, 'Seattle', 'Non domestic'); >Non domestic > > SELECT decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', > 4, 'Seattle'); >NULL > Since: 3.2.0 > Time taken: 0.074 seconds, Fetched 4 row(s) > ``` > ``` shell > spark-sql> select decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New > Jersey', 4, 'Seattle'); > NULL > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43116) Fix Cast.forceNullable
Yuming Wang created SPARK-43116: --- Summary: Fix Cast.forceNullable Key: SPARK-43116 URL: https://issues.apache.org/jira/browse/SPARK-43116 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Reporter: Yuming Wang It should include TimestampNTZType and AnsiIntervalType. Casting ArrayType/MapType/StructType to other data types seems like it should return true. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43116) Fix Cast.forceNullable
[ https://issues.apache.org/jira/browse/SPARK-43116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-43116: Description: https://github.com/apache/spark/blob/ac105ccebf5f144f2d506cbe102c362a195afa9a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L381-L399 It should include TimestampNTZType,AnsiIntervalType. ArrayType/MapType/StructType to other data type seems should be true. was: It should include TimestampNTZType,AnsiIntervalType. ArrayType/MapType/StructType to other data type seems should be true. > Fix Cast.forceNullable > -- > > Key: SPARK-43116 > URL: https://issues.apache.org/jira/browse/SPARK-43116 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > > https://github.com/apache/spark/blob/ac105ccebf5f144f2d506cbe102c362a195afa9a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L381-L399 > It should include TimestampNTZType,AnsiIntervalType. > ArrayType/MapType/StructType to other data type seems should be true. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43116) Fix Cast.forceNullable
[ https://issues.apache.org/jira/browse/SPARK-43116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-43116: Description: https://github.com/apache/spark/blob/ac105ccebf5f144f2d506cbe102c362a195afa9a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L381-L399 1. It should include TimestampNTZType,AnsiIntervalType. 2. ArrayType/MapType/StructType to other data type seems should be true. was: https://github.com/apache/spark/blob/ac105ccebf5f144f2d506cbe102c362a195afa9a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L381-L399 It should include TimestampNTZType,AnsiIntervalType. ArrayType/MapType/StructType to other data type seems should be true. > Fix Cast.forceNullable > -- > > Key: SPARK-43116 > URL: https://issues.apache.org/jira/browse/SPARK-43116 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > > https://github.com/apache/spark/blob/ac105ccebf5f144f2d506cbe102c362a195afa9a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L381-L399 > 1. It should include TimestampNTZType,AnsiIntervalType. > 2. ArrayType/MapType/StructType to other data type seems should be true. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43115) Split pyspark-pandas-connect from pyspark-connect module.
[ https://issues.apache.org/jira/browse/SPARK-43115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-43115: - Assignee: Takuya Ueshin > Split pyspark-pandas-connect from pyspark-connect module. > - > > Key: SPARK-43115 > URL: https://issues.apache.org/jira/browse/SPARK-43115 > Project: Spark > Issue Type: Test > Components: PySpark, Tests >Affects Versions: 3.5.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43115) Split pyspark-pandas-connect from pyspark-connect module.
[ https://issues.apache.org/jira/browse/SPARK-43115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-43115. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40764 [https://github.com/apache/spark/pull/40764] > Split pyspark-pandas-connect from pyspark-connect module. > - > > Key: SPARK-43115 > URL: https://issues.apache.org/jira/browse/SPARK-43115 > Project: Spark > Issue Type: Test > Components: PySpark, Tests >Affects Versions: 3.5.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43117) proto message abbreviation should support repeated and map fields
Ruifeng Zheng created SPARK-43117: - Summary: proto message abbreviation should support repeated and map fields Key: SPARK-43117 URL: https://issues.apache.org/jira/browse/SPARK-43117 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 3.5.0 Reporter: Ruifeng Zheng
[jira] [Created] (SPARK-43118) Remove unnecessary assert for UninterruptibleThread in KafkaMicroBatchStream
Boyang Jerry Peng created SPARK-43118: - Summary: Remove unnecessary assert for UninterruptibleThread in KafkaMicroBatchStream Key: SPARK-43118 URL: https://issues.apache.org/jira/browse/SPARK-43118 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.3.2 Reporter: Boyang Jerry Peng
[jira] [Updated] (SPARK-43118) Remove unnecessary assert for UninterruptibleThread in KafkaMicroBatchStream
[ https://issues.apache.org/jira/browse/SPARK-43118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Boyang Jerry Peng updated SPARK-43118: -- Description: The assert {code:java} assert(Thread.currentThread().isInstanceOf[UninterruptibleThread]) {code} found at [https://github.com/apache/spark/blob/master/connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchStream.scala#L239] is not needed, for the following reasons: # This assert was added because of issues with the old, deprecated KafkaOffsetReaderConsumer. The default offset reader implementation has since been changed to KafkaOffsetReaderAdmin, which no longer requires running on an UninterruptibleThread. # Even if the deprecated KafkaOffsetReaderConsumer is used, that implementation already asserts that it is running on an UninterruptibleThread, e.g. [https://github.com/apache/spark/blob/master/connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReaderConsumer.scala#L130], so the assert in KafkaMicroBatchStream is redundant. > Remove unnecessary assert for UninterruptibleThread in KafkaMicroBatchStream > > > Key: SPARK-43118 > URL: https://issues.apache.org/jira/browse/SPARK-43118 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.3.2 >Reporter: Boyang Jerry Peng >Priority: Minor > > The assert > {code:java} > assert(Thread.currentThread().isInstanceOf[UninterruptibleThread]) {code} > found at > [https://github.com/apache/spark/blob/master/connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchStream.scala#L239] > is not needed, for the following reasons: > # This assert was added because of issues with the old, deprecated > KafkaOffsetReaderConsumer. The default offset reader implementation > has since been changed to KafkaOffsetReaderAdmin, which no longer requires > running on an UninterruptibleThread.
> # Even if the deprecated KafkaOffsetReaderConsumer is used, that > implementation already asserts that it is running on an > UninterruptibleThread, e.g. > [https://github.com/apache/spark/blob/master/connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReaderConsumer.scala#L130], > so the assert in KafkaMicroBatchStream is redundant.
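As a language-neutral illustration of why the caller-side assert is redundant when the callee already guards itself, here is a hypothetical Python sketch (`UninterruptibleLike` is a stand-in class for this example, not a Spark API):

```python
import threading

class UninterruptibleLike(threading.Thread):
    # Stand-in for Spark's UninterruptibleThread; illustrative only.
    def run(self):
        # The callee checks its own thread type, so an identical
        # assert at the call site would merely duplicate this guard.
        assert isinstance(threading.current_thread(), UninterruptibleLike)
        self.ran_ok = True

t = UninterruptibleLike()
t.start()
t.join()
```

Since the guard inside `run` fires on any misuse, an extra `isinstance` assert before `t.start()` adds no safety, which mirrors the argument for removing the assert in KafkaMicroBatchStream.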
[jira] [Commented] (SPARK-37099) Introduce a rank-based filter to optimize top-k computation
[ https://issues.apache.org/jira/browse/SPARK-37099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711665#comment-17711665 ] Snoot.io commented on SPARK-37099: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/40754 > Introduce a rank-based filter to optimize top-k computation > --- > > Key: SPARK-37099 > URL: https://issues.apache.org/jira/browse/SPARK-37099 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.5.0 > > Attachments: q67.png, q67_optimized.png, skewed_window.png > > > In JD, we found that more than 90% of window function usage follows this > pattern: > {code:java} > select (... (row_number|rank|dense_rank) () over( [partition by ...] order > by ... ) as rn) > where rn (==|<|<=) k and other conditions{code} > However, the existing physical plan is not optimal: > 1. We should select the local top-k records within each partition, and then > compute the global top-k; this helps reduce the amount of shuffled data. > For these three rank functions (row_number|rank|dense_rank), the rank of a > key computed on a partial dataset is always <= its final rank computed on > the whole dataset, so we can safely discard rows with partial rank > k > anywhere. > 2. Skewed window: some partitions are skewed and take a long time to finish > computation. > A real-world skewed-window case in our system is attached.
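The local-then-global pruning argument quoted above can be checked outside Spark. This is an illustrative plain-Python sketch (the helper names are hypothetical, not Spark APIs): because a row's rank within one partition can only grow once all partitions are merged, pruning each partition to its local top-k never discards a row from the global top-k.

```python
import heapq

def local_topk(rows, k, key):
    # Keep only the k best rows of one partition; any row whose
    # partial rank exceeds k cannot reach a final rank <= k.
    return heapq.nsmallest(k, rows, key=key)

def global_topk(partitions, k, key):
    # Merge the locally pruned partitions, then take the final top-k.
    merged = [r for part in partitions for r in local_topk(part, k, key)]
    return heapq.nsmallest(k, merged, key=key)

partitions = [[9, 1, 7], [3, 8, 2], [6, 4, 5]]
k = 3
pruned = global_topk(partitions, k, key=lambda x: x)
full = sorted(x for part in partitions for x in part)[:k]
assert pruned == full  # local pruning does not change the global answer
```

The shuffle savings come from `local_topk`: each partition ships at most k rows instead of all of them.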
[jira] [Commented] (SPARK-43118) Remove unnecessary assert for UninterruptibleThread in KafkaMicroBatchStream
[ https://issues.apache.org/jira/browse/SPARK-43118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711667#comment-17711667 ] Snoot.io commented on SPARK-43118: -- User 'jerrypeng' has created a pull request for this issue: https://github.com/apache/spark/pull/40767 > Remove unnecessary assert for UninterruptibleThread in KafkaMicroBatchStream > > > Key: SPARK-43118 > URL: https://issues.apache.org/jira/browse/SPARK-43118 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.3.2 >Reporter: Boyang Jerry Peng >Priority: Minor > > The assert > {code:java} > assert(Thread.currentThread().isInstanceOf[UninterruptibleThread]) {code} > found at > [https://github.com/apache/spark/blob/master/connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchStream.scala#L239] > is not needed, for the following reasons: > # This assert was added because of issues with the old, deprecated > KafkaOffsetReaderConsumer. The default offset reader implementation has > since been changed to KafkaOffsetReaderAdmin, which no longer requires > running on an UninterruptibleThread. > # Even if the deprecated KafkaOffsetReaderConsumer is used, that > implementation already asserts that it is running on an > UninterruptibleThread, e.g. > [https://github.com/apache/spark/blob/master/connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReaderConsumer.scala#L130], > so the assert in KafkaMicroBatchStream is redundant.
[jira] [Commented] (SPARK-43021) Shuffle happens when Coalesce Buckets should occur
[ https://issues.apache.org/jira/browse/SPARK-43021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711666#comment-17711666 ] Snoot.io commented on SPARK-43021: -- User 'ming95' has created a pull request for this issue: https://github.com/apache/spark/pull/40688 > Shuffle happens when Coalesce Buckets should occur > -- > > Key: SPARK-43021 > URL: https://issues.apache.org/jira/browse/SPARK-43021 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.1 >Reporter: Nikita Eshkeev >Priority: Minor > > h1. What I did > I define the following code: > {{from pyspark.sql import SparkSession}} > {{spark = (}} > {{ SparkSession}} > {{ .builder}} > {{ .appName("Bucketing")}} > {{ .master("local[4]")}} > {{ .config("spark.sql.bucketing.coalesceBucketsInJoin.enabled", True)}} > {{ .config("spark.sql.autoBroadcastJoinThreshold", "-1")}} > {{ .getOrCreate()}} > {{)}} > {{df1 = spark.range(0, 100)}} > {{df2 = spark.range(0, 100, 2)}} > {{df1.write.bucketBy(4, "id").mode("overwrite").saveAsTable("t1")}} > {{df2.write.bucketBy(2, "id").mode("overwrite").saveAsTable("t2")}} > {{t1 = spark.table("t1")}} > {{t2 = spark.table("t2")}} > {{t2.join(t1, "id").explain()}} > h1. What happened > There is an Exchange node in the join plan > h1. What is expected > The plan should not contain any Exchange/Shuffle nodes, because {{t1}}'s > number of buckets is 4 and {{t2}}'s number of buckets is 2, and their ratio > is 2 which is less than 4 > ({{spark.sql.bucketing.coalesceBucketsInJoin.maxBucketRatio}}) and > [CoalesceBucketsInJoin|https://github.com/apache/spark/blob/c9878a212958bc54be529ef99f5e5d1ddf513ec8/sql/core/src/main/scala/org/apache/spark/sql/execution/bucketing/CoalesceBucketsInJoin.scala] > should be applied
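The bucket-ratio arithmetic in the report above can be restated as a small check. This is a hypothetical simplification of CoalesceBucketsInJoin's applicability condition (the real rule also depends on join type and other factors): the larger bucket count must be a multiple of the smaller one, and their ratio must not exceed `spark.sql.bucketing.coalesceBucketsInJoin.maxBucketRatio` (default 4).

```python
def can_coalesce(num_left: int, num_right: int, max_ratio: int = 4) -> bool:
    # Simplified sketch of the coalescing precondition, not the full rule:
    # bucket counts must divide evenly and stay within the max ratio.
    big, small = max(num_left, num_right), min(num_left, num_right)
    return big % small == 0 and big // small <= max_ratio

assert can_coalesce(4, 2)       # the SPARK-43021 case: ratio 2 <= 4
assert not can_coalesce(16, 2)  # ratio 8 exceeds the default max of 4
```

By this check the reported t1 (4 buckets) join t2 (2 buckets) qualifies, which is why the Exchange node in the plan is unexpected.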
[jira] [Created] (SPARK-43119) Support Get SQL Keywords Dynamically
Kent Yao created SPARK-43119: Summary: Support Get SQL Keywords Dynamically Key: SPARK-43119 URL: https://issues.apache.org/jira/browse/SPARK-43119 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Reporter: Kent Yao Implements the JDBC standard API and an auxiliary function
[jira] [Commented] (SPARK-43119) Support Get SQL Keywords Dynamically
[ https://issues.apache.org/jira/browse/SPARK-43119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711675#comment-17711675 ] Snoot.io commented on SPARK-43119: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/40768 > Support Get SQL Keywords Dynamically > > > Key: SPARK-43119 > URL: https://issues.apache.org/jira/browse/SPARK-43119 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Kent Yao >Priority: Major > > Implements the JDBC standard API and an auxiliary function
[jira] [Assigned] (SPARK-42916) JDBCCatalog Keep Char/Varchar meta information on the read-side
[ https://issues.apache.org/jira/browse/SPARK-42916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-42916: Assignee: Kent Yao > JDBCCatalog Keep Char/Varchar meta information on the read-side > --- > > Key: SPARK-42916 > URL: https://issues.apache.org/jira/browse/SPARK-42916 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > > Fix error like: > string cannot be cast to varchar(20)
[jira] [Resolved] (SPARK-42916) JDBCCatalog Keep Char/Varchar meta information on the read-side
[ https://issues.apache.org/jira/browse/SPARK-42916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-42916. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40543 [https://github.com/apache/spark/pull/40543] > JDBCCatalog Keep Char/Varchar meta information on the read-side > --- > > Key: SPARK-42916 > URL: https://issues.apache.org/jira/browse/SPARK-42916 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.5.0 > > > Fix error like: > string cannot be cast to varchar(20)
[jira] [Resolved] (SPARK-42975) Cast result type to timestamp type for string +/- interval
[ https://issues.apache.org/jira/browse/SPARK-42975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-42975. - Resolution: Won't Fix Disable it in ANSI mode. > Cast result type to timestamp type for string +/- interval > -- > > Key: SPARK-42975 > URL: https://issues.apache.org/jira/browse/SPARK-42975 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > > select '2023-01-01' >= '2023-01-08' - interval '7' day; > should be true.
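The expected semantics in the issue above can be verified with plain Python's datetime arithmetic (a sketch of the intended behavior, not of Spark's cast rules): if both sides are treated as timestamps, '2023-01-08' minus an interval of 7 days is 2023-01-01, so the `>=` comparison holds.

```python
from datetime import datetime, timedelta

# Interpreting both string literals as timestamps, as the issue proposes.
lhs = datetime.fromisoformat("2023-01-01")
rhs = datetime.fromisoformat("2023-01-08") - timedelta(days=7)
assert lhs >= rhs  # holds: both sides are 2023-01-01 00:00:00
```

Under pure string comparison, by contrast, subtracting an interval from a string has no natural meaning, which is the ambiguity the issue was about.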
[jira] [Created] (SPARK-43120) Add tracking for block cache pinned usage for RocksDB state store
Anish Shrigondekar created SPARK-43120: -- Summary: Add tracking for block cache pinned usage for RocksDB state store Key: SPARK-43120 URL: https://issues.apache.org/jira/browse/SPARK-43120 Project: Spark Issue Type: Task Components: Structured Streaming Affects Versions: 3.4.0 Reporter: Anish Shrigondekar Add tracking for block cache pinned usage for RocksDB state store
[jira] [Commented] (SPARK-43120) Add tracking for block cache pinned usage for RocksDB state store
[ https://issues.apache.org/jira/browse/SPARK-43120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711711#comment-17711711 ] Anish Shrigondekar commented on SPARK-43120: cc - [~kabhwan] - will send the PR soon. Thx > Add tracking for block cache pinned usage for RocksDB state store > - > > Key: SPARK-43120 > URL: https://issues.apache.org/jira/browse/SPARK-43120 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Anish Shrigondekar >Priority: Major > > Add tracking for block cache pinned usage for RocksDB state store