[jira] [Comment Edited] (SPARK-44581) ShutdownHookManager get wrong hadoop user group information

2023-08-08 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752270#comment-17752270
 ] 

Kent Yao edited comment on SPARK-44581 at 8/9/23 5:58 AM:
--

Issue resolved by  [https://github.com/apache/spark/pull/42295]


was (Author: yao):
Issue resolved by 

> ShutdownHookManager get wrong hadoop user group information
> ---
>
> Key: SPARK-44581
> URL: https://issues.apache.org/jira/browse/SPARK-44581
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, YARN
>Affects Versions: 3.2.1, 3.3.2, 3.4.1
>Reporter: liang yu
>Assignee: liang yu
>Priority: Minor
> Fix For: 3.4.2, 3.5.0, 3.3.4
>
>
> I use Spark 3.2.1 to run a job on YARN in cluster mode.
> When the job finishes, the following exception is thrown:
> {code:java}
> 2023-07-28 10:57:16,324 ERROR yarn.ApplicationMaster: Failed to cleanup 
> staging dir 
> hdfs://dmp/user/ubd_dmp_test/.sparkStaging/application_1689318995305_0290 
> org.apache.hadoop.security.AccessControlException: Permission denied: 
> user=yarn, access=WRITE, 
> inode="/user/ubd_dmp_test/.sparkStaging":ubd_dmp_test:ubd_dmp_test:drwxr-xr-x 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:506)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:349)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermissionWithContext(FSPermissionChecker.java:370)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:240)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1943)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirDeleteOp.delete(FSDirDeleteOp.java:105)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:3266)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:1128)
>  at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:725)
>  at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:604)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:572)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:556)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093) at 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1043) at 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:971) at 
> java.security.AccessController.doPrivileged(Native Method) at 
> javax.security.auth.Subject.doAs(Subject.java:422) at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2976) at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
>  at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
>  at org.apache.hadoop.hdfs.DFSClient.delete(DFSClient.java:1656) at 
> org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:991)
>  at 
> org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:988)
>  at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>  at 
> org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:998)
>  at 
> org.apache.spark.deploy.yarn.ApplicationMaster.cleanupStagingDir(ApplicationMaster.scala:686)
>  at 
> org.apache.spark.deploy.yarn.ApplicationMaster.$anonfun$run$3(ApplicationMaster.scala:268)
>  at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214) at 
> org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019) at 
> org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
>  at 
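
The root cause is that the cleanup shutdown hook runs with the NodeManager's "yarn" user instead of the submitting user's Hadoop UGI. Below is a minimal sketch of the doAs pattern that avoids this; the class, path and method names are illustrative only and are not the actual patch in the PR above.

{code:scala}
import java.security.PrivilegedExceptionAction

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.security.UserGroupInformation

// Hedged sketch: capture the submitting user's UGI eagerly and run the staging-dir
// delete inside doAs, so a shutdown hook does not pick up the NodeManager's "yarn" user.
class StagingCleaner(hadoopConf: Configuration, stagingDir: Path) {
  private val submitterUgi = UserGroupInformation.getCurrentUser

  def cleanupStagingDir(): Unit = {
    submitterUgi.doAs(new PrivilegedExceptionAction[Unit] {
      override def run(): Unit = {
        val fs = stagingDir.getFileSystem(hadoopConf)
        fs.delete(stagingDir, true)
      }
    })
  }
}
{code}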

[jira] [Resolved] (SPARK-44726) Improve HeartbeatReceiver config validation error message

2023-08-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44726.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42403
[https://github.com/apache/spark/pull/42403]

> Improve HeartbeatReceiver config validation error message
> -
>
> Key: SPARK-44726
> URL: https://issues.apache.org/jira/browse/SPARK-44726
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 4.0.0
>
>
> {code}
> $ bin/spark-shell -c spark.network.timeout=30s
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 23/08/08 14:38:18 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 23/08/08 14:38:19 ERROR SparkContext: Error initializing SparkContext.
> java.lang.IllegalArgumentException: requirement failed: 
> spark.network.timeoutInterval should be less than or equal to 
> spark.storage.blockManagerHeartbeatTimeoutMs.
> {code}
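
For context, the check fails here because spark.network.timeoutInterval keeps its default (assumed 60s) while spark.network.timeout is lowered to 30s, which also lowers the derived heartbeat timeout. A hedged sketch of a configuration that satisfies the requirement:

{code:scala}
import org.apache.spark.sql.SparkSession

// Hedged sketch: when shrinking spark.network.timeout below the default interval,
// shrink spark.network.timeoutInterval as well so that interval <= heartbeat timeout.
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.network.timeout", "30s")
  .config("spark.network.timeoutInterval", "30s")
  .getOrCreate()
{code}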



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44726) Improve HeartbeatReceiver config validation error message

2023-08-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-44726:
-

Assignee: Dongjoon Hyun

> Improve HeartbeatReceiver config validation error message
> -
>
> Key: SPARK-44726
> URL: https://issues.apache.org/jira/browse/SPARK-44726
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>
> {code}
> $ bin/spark-shell -c spark.network.timeout=30s
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 23/08/08 14:38:18 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 23/08/08 14:38:19 ERROR SparkContext: Error initializing SparkContext.
> java.lang.IllegalArgumentException: requirement failed: 
> spark.network.timeoutInterval should be less than or equal to 
> spark.storage.blockManagerHeartbeatTimeoutMs.
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44581) ShutdownHookManager get wrong hadoop user group information

2023-08-08 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-44581.
--
Fix Version/s: 3.4.2
   3.5.0
   3.3.4
   Resolution: Fixed

Issue resolved by 

> ShutdownHookManager get wrong hadoop user group information
> ---
>
> Key: SPARK-44581
> URL: https://issues.apache.org/jira/browse/SPARK-44581
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, YARN
>Affects Versions: 3.2.1, 3.3.2, 3.4.1
>Reporter: liang yu
>Assignee: liang yu
>Priority: Minor
> Fix For: 3.4.2, 3.5.0, 3.3.4
>
>
> I use Spark 3.2.1 to run a job on YARN in cluster mode.
> When the job finishes, the following exception is thrown:
> {code:java}
> 2023-07-28 10:57:16,324 ERROR yarn.ApplicationMaster: Failed to cleanup 
> staging dir 
> hdfs://dmp/user/ubd_dmp_test/.sparkStaging/application_1689318995305_0290 
> org.apache.hadoop.security.AccessControlException: Permission denied: 
> user=yarn, access=WRITE, 
> inode="/user/ubd_dmp_test/.sparkStaging":ubd_dmp_test:ubd_dmp_test:drwxr-xr-x 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:506)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:349)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermissionWithContext(FSPermissionChecker.java:370)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:240)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1943)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirDeleteOp.delete(FSDirDeleteOp.java:105)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:3266)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:1128)
>  at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:725)
>  at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:604)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:572)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:556)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093) at 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1043) at 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:971) at 
> java.security.AccessController.doPrivileged(Native Method) at 
> javax.security.auth.Subject.doAs(Subject.java:422) at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2976) at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
>  at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
>  at org.apache.hadoop.hdfs.DFSClient.delete(DFSClient.java:1656) at 
> org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:991)
>  at 
> org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:988)
>  at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>  at 
> org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:998)
>  at 
> org.apache.spark.deploy.yarn.ApplicationMaster.cleanupStagingDir(ApplicationMaster.scala:686)
>  at 
> org.apache.spark.deploy.yarn.ApplicationMaster.$anonfun$run$3(ApplicationMaster.scala:268)
>  at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214) at 
> org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019) at 
> org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
> scala.util.Try$.apply(Try.scala:213) at 
> 

[jira] [Assigned] (SPARK-44581) ShutdownHookManager get wrong hadoop user group information

2023-08-08 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-44581:


Assignee: liang yu

> ShutdownHookManager get wrong hadoop user group information
> ---
>
> Key: SPARK-44581
> URL: https://issues.apache.org/jira/browse/SPARK-44581
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, YARN
>Affects Versions: 3.2.1, 3.3.2, 3.4.1
>Reporter: liang yu
>Assignee: liang yu
>Priority: Minor
>
> I use Spark 3.2.1 to run a job on YARN in cluster mode.
> When the job finishes, the following exception is thrown:
> {code:java}
> 2023-07-28 10:57:16,324 ERROR yarn.ApplicationMaster: Failed to cleanup 
> staging dir 
> hdfs://dmp/user/ubd_dmp_test/.sparkStaging/application_1689318995305_0290 
> org.apache.hadoop.security.AccessControlException: Permission denied: 
> user=yarn, access=WRITE, 
> inode="/user/ubd_dmp_test/.sparkStaging":ubd_dmp_test:ubd_dmp_test:drwxr-xr-x 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:506)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:349)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermissionWithContext(FSPermissionChecker.java:370)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:240)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1943)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirDeleteOp.delete(FSDirDeleteOp.java:105)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:3266)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:1128)
>  at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:725)
>  at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:604)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:572)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:556)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093) at 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1043) at 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:971) at 
> java.security.AccessController.doPrivileged(Native Method) at 
> javax.security.auth.Subject.doAs(Subject.java:422) at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2976) at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
>  at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
>  at org.apache.hadoop.hdfs.DFSClient.delete(DFSClient.java:1656) at 
> org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:991)
>  at 
> org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:988)
>  at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>  at 
> org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:998)
>  at 
> org.apache.spark.deploy.yarn.ApplicationMaster.cleanupStagingDir(ApplicationMaster.scala:686)
>  at 
> org.apache.spark.deploy.yarn.ApplicationMaster.$anonfun$run$3(ApplicationMaster.scala:268)
>  at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214) at 
> org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019) at 
> org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
> scala.util.Try$.apply(Try.scala:213) at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
>  at 
> 

[jira] [Resolved] (SPARK-43907) Add SQL functions into Scala, Python and R API

2023-08-08 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-43907.
---
Resolution: Resolved

> Add SQL functions into Scala, Python and R API
> --
>
> Key: SPARK-43907
> URL: https://issues.apache.org/jira/browse/SPARK-43907
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SparkR, SQL
>Affects Versions: 3.5.0
>Reporter: Hyukjin Kwon
>Assignee: Ruifeng Zheng
>Priority: Major
>
> See the discussion in dev mailing list 
> (https://lists.apache.org/thread/0tdcfyzxzcv8w46qbgwys2rormhdgyqg).
> This is an umbrella JIRA to implement all SQL functions in Scala, Python and R



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43709) Enable NamespaceTests.test_date_range for pandas 2.0.0.

2023-08-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43709.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42389
[https://github.com/apache/spark/pull/42389]

> Enable NamespaceTests.test_date_range for pandas 2.0.0.
> ---
>
> Key: SPARK-43709
> URL: https://issues.apache.org/jira/browse/SPARK-43709
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43709) Enable NamespaceTests.test_date_range for pandas 2.0.0.

2023-08-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-43709:


Assignee: Haejoon Lee

> Enable NamespaceTests.test_date_range for pandas 2.0.0.
> ---
>
> Key: SPARK-43709
> URL: https://issues.apache.org/jira/browse/SPARK-43709
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44725) Document spark.network.timeoutInterval

2023-08-08 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752260#comment-17752260
 ] 

Snoot.io commented on SPARK-44725:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/42402

> Document spark.network.timeoutInterval
> --
>
> Key: SPARK-44725
> URL: https://issues.apache.org/jira/browse/SPARK-44725
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.3.2, 3.4.1, 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Trivial
> Fix For: 3.4.2, 4.0.0, 3.5.1, 3.3.4
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44725) Document spark.network.timeoutInterval

2023-08-08 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752259#comment-17752259
 ] 

Snoot.io commented on SPARK-44725:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/42402

> Document spark.network.timeoutInterval
> --
>
> Key: SPARK-44725
> URL: https://issues.apache.org/jira/browse/SPARK-44725
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.3.2, 3.4.1, 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Trivial
> Fix For: 3.4.2, 4.0.0, 3.5.1, 3.3.4
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44726) Improve HeartbeatReceiver config validation error message

2023-08-08 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752255#comment-17752255
 ] 

Snoot.io commented on SPARK-44726:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/42403

> Improve HeartbeatReceiver config validation error message
> -
>
> Key: SPARK-44726
> URL: https://issues.apache.org/jira/browse/SPARK-44726
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> {code}
> $ bin/spark-shell -c spark.network.timeout=30s
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 23/08/08 14:38:18 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 23/08/08 14:38:19 ERROR SparkContext: Error initializing SparkContext.
> java.lang.IllegalArgumentException: requirement failed: 
> spark.network.timeoutInterval should be less than or equal to 
> spark.storage.blockManagerHeartbeatTimeoutMs.
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44726) Improve HeartbeatReceiver config validation error message

2023-08-08 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752254#comment-17752254
 ] 

Snoot.io commented on SPARK-44726:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/42403

> Improve HeartbeatReceiver config validation error message
> -
>
> Key: SPARK-44726
> URL: https://issues.apache.org/jira/browse/SPARK-44726
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> {code}
> $ bin/spark-shell -c spark.network.timeout=30s
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 23/08/08 14:38:18 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 23/08/08 14:38:19 ERROR SparkContext: Error initializing SparkContext.
> java.lang.IllegalArgumentException: requirement failed: 
> spark.network.timeoutInterval should be less than or equal to 
> spark.storage.blockManagerHeartbeatTimeoutMs.
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44737) Should not display json format errors on SQL page for non-SparkThrowables

2023-08-08 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752251#comment-17752251
 ] 

Snoot.io commented on SPARK-44737:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/42407

> Should not display json format errors on SQL page for non-SparkThrowables
> -
>
> Key: SPARK-44737
> URL: https://issues.apache.org/jira/browse/SPARK-44737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 3.5.0
>Reporter: Kent Yao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44737) Should not display json format errors on SQL page for non-SparkThrowables

2023-08-08 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752248#comment-17752248
 ] 

Snoot.io commented on SPARK-44737:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/42407

> Should not display json format errors on SQL page for non-SparkThrowables
> -
>
> Key: SPARK-44737
> URL: https://issues.apache.org/jira/browse/SPARK-44737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 3.5.0
>Reporter: Kent Yao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42746) Add the LISTAGG() aggregate function

2023-08-08 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752249#comment-17752249
 ] 

Snoot.io commented on SPARK-42746:
--

User 'Hisoka-X' has created a pull request for this issue:
https://github.com/apache/spark/pull/42398

> Add the LISTAGG() aggregate function
> 
>
> Key: SPARK-42746
> URL: https://issues.apache.org/jira/browse/SPARK-42746
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> {{listagg()}} is a common and useful aggregate function that concatenates 
> string values in a column, optionally in a given order. The systems below 
> already support this function:
>  * Oracle: 
> [https://docs.oracle.com/cd/E11882_01/server.112/e41084/functions089.htm#SQLRF30030]
>  * Snowflake: [https://docs.snowflake.com/en/sql-reference/functions/listagg]
>  * Amazon Redshift: 
> [https://docs.aws.amazon.com/redshift/latest/dg/r_LISTAGG.html]
>  * Google BigQuery: 
> [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#string_agg]
> Need to introduce this new aggregate in Spark, both as a regular aggregate 
> and as a window function.
> Proposed syntax:
> {code:sql}
> LISTAGG( [ DISTINCT ] <expr> [, <delimiter> ] ) [ WITHIN GROUP ( <orderby_clause> ) ]
> {code}
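
Until a built-in LISTAGG exists, roughly equivalent results can be assembled from existing aggregates. A hedged sketch using collect_list, array_sort and concat_ws; it ignores the DISTINCT and NULL-handling details of the proposal and is illustrative only:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array_sort, collect_list, concat_ws}

// Hedged workaround sketch: emulate LISTAGG(v, ',') WITHIN GROUP (ORDER BY v) per key.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("a", "x"), ("a", "z"), ("a", "y"), ("b", "w")).toDF("k", "v")
df.groupBy($"k")
  .agg(concat_ws(",", array_sort(collect_list($"v"))).as("listagg_v"))
  .show()
{code}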



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44718) High On-heap memory usage is detected while doing parquet-file reading with Off-Heap memory mode enabled on spark

2023-08-08 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752247#comment-17752247
 ] 

Snoot.io commented on SPARK-44718:
--

User 'majdyz' has created a pull request for this issue:
https://github.com/apache/spark/pull/42394

> High On-heap memory usage is detected while doing parquet-file reading with 
> Off-Heap memory mode enabled on spark
> -
>
> Key: SPARK-44718
> URL: https://issues.apache.org/jira/browse/SPARK-44718
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.4.1
>Reporter: Zamil Majdy
>Priority: Major
>
> I see high on-heap memory usage during Parquet file reading when the off-heap 
> memory mode is enabled. This is caused by the memory mode for the vectorized 
> reader's column vectors being configured by a different flag, whose default is 
> always on-heap.
> Conf to reproduce the issue:
> {{spark.memory.offHeap.size 100}}
> {{spark.memory.offHeap.enabled true}}
> Enabling these configurations alone will not switch the memory mode used by the 
> vectorized Parquet reader to off-heap.
>  
> Proposed PR: https://github.com/apache/spark/pull/42394
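
For reference, a hedged sketch of the configuration combination this report is about: enabling off-heap execution memory alone does not move the vectorized Parquet reader's column vectors off-heap, since that is controlled by a separate internal flag, assumed here to be spark.sql.columnVector.offheap.enabled.

{code:scala}
import org.apache.spark.sql.SparkSession

// Hedged sketch: off-heap execution memory plus the separate column-vector flag
// (spark.sql.columnVector.offheap.enabled is internal; treat the name as an assumption).
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "2g")
  .config("spark.sql.columnVector.offheap.enabled", "true")
  .getOrCreate()

spark.read.parquet("/path/to/data").count() // illustrative Parquet scan
{code}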



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44737) Should not display json format errors on SQL page for non-SparkThrowables

2023-08-08 Thread Kent Yao (Jira)
Kent Yao created SPARK-44737:


 Summary: Should not display json format errors on SQL page for 
non-SparkThrowables
 Key: SPARK-44737
 URL: https://issues.apache.org/jira/browse/SPARK-44737
 Project: Spark
  Issue Type: Bug
  Components: SQL, Web UI
Affects Versions: 3.5.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44736) Implement Dataset.explode

2023-08-08 Thread Herman van Hövell (Jira)
Herman van Hövell created SPARK-44736:
-

 Summary: Implement Dataset.explode
 Key: SPARK-44736
 URL: https://issues.apache.org/jira/browse/SPARK-44736
 Project: Spark
  Issue Type: New Feature
  Components: Connect
Affects Versions: 3.5.0
Reporter: Herman van Hövell
Assignee: Herman van Hövell






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43429) Add default/active SparkSession APIs

2023-08-08 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752241#comment-17752241
 ] 

Snoot.io commented on SPARK-43429:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/42406

> Add default/active SparkSession APIs
> 
>
> Key: SPARK-43429
> URL: https://issues.apache.org/jira/browse/SPARK-43429
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44735) Log a warning when inserting columns with the same name by row that don't match up

2023-08-08 Thread Holden Karau (Jira)
Holden Karau created SPARK-44735:


 Summary: Log a warning when inserting columns with the same name 
by row that don't match up
 Key: SPARK-44735
 URL: https://issues.apache.org/jira/browse/SPARK-44735
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.2, 3.5.0, 4.0.0
Reporter: Holden Karau


With SPARK-42750 people can now insert by name, but sometimes they forget to. 
We should log a warning when it *looks like* someone forgot it (e.g. an insert by 
column position where all the column names match the target's *but* do not line up 
positionally).
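
A hedged sketch of the confusing case; table and column names are illustrative, and the BY NAME syntax is the one added by SPARK-42750:

{code:scala}
import org.apache.spark.sql.SparkSession

// Hedged sketch of the scenario above: identical column names, different order.
val spark = SparkSession.builder().master("local[*]").getOrCreate()

spark.sql("CREATE TABLE target(first_name STRING, last_name STRING) USING parquet")
spark.sql("CREATE TABLE source(last_name STRING, first_name STRING) USING parquet")
spark.sql("INSERT INTO source VALUES ('Doe', 'Jane')")

// Positional insert: 'Doe' silently lands in target.first_name despite the name mismatch.
spark.sql("INSERT INTO target SELECT * FROM source")

// Insert by name (SPARK-42750) matches on column names instead of position.
spark.sql("INSERT INTO target BY NAME SELECT * FROM source")
{code}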



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44690) Downgrade Scala to 2.13.8

2023-08-08 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752240#comment-17752240
 ] 

Snoot.io commented on SPARK-44690:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/42362

> Downgrade Scala to 2.13.8
> -
>
> Key: SPARK-44690
> URL: https://issues.apache.org/jira/browse/SPARK-44690
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> Downgrade Scala from 2.13.11 to 2.13.8 to fix the Maven compile issue described 
> in SPARK-44376.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43711) Support `pyspark.ml.feature.Bucketizer` and `pyspark.mllib.stat.KernelDensity` to work with Spark Connect.

2023-08-08 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-43711:

Affects Version/s: 4.0.0
   (was: 3.5.0)
  Description: 
Repro: run `DataFramePlotParityTests.test_compute_hist_multi_columns` or 
`SeriesPlotMatplotlibParityTests.test_kde_plot`

  was:Repro: run `DataFramePlotParityTests.test_compute_hist_multi_columns`


> Support `pyspark.ml.feature.Bucketizer` and 
> `pyspark.mllib.stat.KernelDensity` to work with Spark Connect.
> --
>
> Key: SPARK-43711
> URL: https://issues.apache.org/jira/browse/SPARK-43711
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, ML, MLlib
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Repro: run `DataFramePlotParityTests.test_compute_hist_multi_columns` or `
> SeriesPlotMatplotlibParityTests.test_kde_plot`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24087) Avoid shuffle when join keys are a super-set of bucket keys

2023-08-08 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752239#comment-17752239
 ] 

Yuming Wang commented on SPARK-24087:
-

Fixed by SPARK-35703.

> Avoid shuffle when join keys are a super-set of bucket keys
> ---
>
> Key: SPARK-24087
> URL: https://issues.apache.org/jira/browse/SPARK-24087
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: yucai
>Priority: Major
>  Labels: bulk-closed
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43711) Support `pyspark.ml.feature.Bucketizer` and `pyspark.mllib.stat.KernelDensity` to work with Spark Connect.

2023-08-08 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-43711:

Summary: Support `pyspark.ml.feature.Bucketizer` and 
`pyspark.mllib.stat.KernelDensity` to work with Spark Connect.  (was: Support 
`pyspark.ml.feature.Bucketizer` to work with Spark Connect.)

> Support `pyspark.ml.feature.Bucketizer` and 
> `pyspark.mllib.stat.KernelDensity` to work with Spark Connect.
> --
>
> Key: SPARK-43711
> URL: https://issues.apache.org/jira/browse/SPARK-43711
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, ML, Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Repro: run `DataFramePlotParityTests.test_compute_hist_multi_columns`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43711) Support `pyspark.ml.feature.Bucketizer` and `pyspark.mllib.stat.KernelDensity` to work with Spark Connect.

2023-08-08 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-43711:

Component/s: MLlib
 (was: Pandas API on Spark)

> Support `pyspark.ml.feature.Bucketizer` and 
> `pyspark.mllib.stat.KernelDensity` to work with Spark Connect.
> --
>
> Key: SPARK-43711
> URL: https://issues.apache.org/jira/browse/SPARK-43711
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, ML, MLlib
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Repro: run `DataFramePlotParityTests.test_compute_hist_multi_columns`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43711) Support `pyspark.ml.feature.Bucketizer` to work with Spark Connect.

2023-08-08 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-43711:

Summary: Support `pyspark.ml.feature.Bucketizer` to work with Spark 
Connect.  (was: Fix Transformer.transform to work with Spark Connect.)

> Support `pyspark.ml.feature.Bucketizer` to work with Spark Connect.
> ---
>
> Key: SPARK-43711
> URL: https://issues.apache.org/jira/browse/SPARK-43711
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, ML, Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Repro: run `DataFramePlotParityTests.test_compute_hist_multi_columns`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44581) ShutdownHookManager get wrong hadoop user group information

2023-08-08 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-44581:
-
Affects Version/s: 3.4.1
   3.3.2

> ShutdownHookManager get wrong hadoop user group information
> ---
>
> Key: SPARK-44581
> URL: https://issues.apache.org/jira/browse/SPARK-44581
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, YARN
>Affects Versions: 3.2.1, 3.3.2, 3.4.1
>Reporter: liang yu
>Priority: Minor
>
> I use Spark 3.2.1 to run a job on YARN in cluster mode.
> When the job finishes, the following exception is thrown:
> {code:java}
> 2023-07-28 10:57:16,324 ERROR yarn.ApplicationMaster: Failed to cleanup 
> staging dir 
> hdfs://dmp/user/ubd_dmp_test/.sparkStaging/application_1689318995305_0290 
> org.apache.hadoop.security.AccessControlException: Permission denied: 
> user=yarn, access=WRITE, 
> inode="/user/ubd_dmp_test/.sparkStaging":ubd_dmp_test:ubd_dmp_test:drwxr-xr-x 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:506)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:349)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermissionWithContext(FSPermissionChecker.java:370)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:240)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1943)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirDeleteOp.delete(FSDirDeleteOp.java:105)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:3266)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:1128)
>  at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:725)
>  at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:604)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:572)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:556)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093) at 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1043) at 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:971) at 
> java.security.AccessController.doPrivileged(Native Method) at 
> javax.security.auth.Subject.doAs(Subject.java:422) at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2976) at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
>  at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
>  at org.apache.hadoop.hdfs.DFSClient.delete(DFSClient.java:1656) at 
> org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:991)
>  at 
> org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:988)
>  at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>  at 
> org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:998)
>  at 
> org.apache.spark.deploy.yarn.ApplicationMaster.cleanupStagingDir(ApplicationMaster.scala:686)
>  at 
> org.apache.spark.deploy.yarn.ApplicationMaster.$anonfun$run$3(ApplicationMaster.scala:268)
>  at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214) at 
> org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019) at 
> org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
> scala.util.Try$.apply(Try.scala:213) at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
>  at 
> 

[jira] [Updated] (SPARK-44581) ShutdownHookManager get wrong hadoop user group information

2023-08-08 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-44581:
-
Priority: Minor  (was: Major)

> ShutdownHookManager get wrong hadoop user group information
> ---
>
> Key: SPARK-44581
> URL: https://issues.apache.org/jira/browse/SPARK-44581
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, YARN
>Affects Versions: 3.2.1
>Reporter: liang yu
>Priority: Minor
>
> I use Spark 3.2.1 to run a job on YARN in cluster mode.
> When the job finishes, the following exception is thrown:
> {code:java}
> 2023-07-28 10:57:16,324 ERROR yarn.ApplicationMaster: Failed to cleanup 
> staging dir 
> hdfs://dmp/user/ubd_dmp_test/.sparkStaging/application_1689318995305_0290 
> org.apache.hadoop.security.AccessControlException: Permission denied: 
> user=yarn, access=WRITE, 
> inode="/user/ubd_dmp_test/.sparkStaging":ubd_dmp_test:ubd_dmp_test:drwxr-xr-x 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:506)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:349)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermissionWithContext(FSPermissionChecker.java:370)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:240)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1943)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirDeleteOp.delete(FSDirDeleteOp.java:105)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:3266)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:1128)
>  at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:725)
>  at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:604)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:572)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:556)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093) at 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1043) at 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:971) at 
> java.security.AccessController.doPrivileged(Native Method) at 
> javax.security.auth.Subject.doAs(Subject.java:422) at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2976) at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
>  at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
>  at org.apache.hadoop.hdfs.DFSClient.delete(DFSClient.java:1656) at 
> org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:991)
>  at 
> org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:988)
>  at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>  at 
> org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:998)
>  at 
> org.apache.spark.deploy.yarn.ApplicationMaster.cleanupStagingDir(ApplicationMaster.scala:686)
>  at 
> org.apache.spark.deploy.yarn.ApplicationMaster.$anonfun$run$3(ApplicationMaster.scala:268)
>  at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214) at 
> org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019) at 
> org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
> scala.util.Try$.apply(Try.scala:213) at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
>  at 
> 

[jira] [Updated] (SPARK-44734) Add documentation for type casting rules in Python UDFs/UDTFs

2023-08-08 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-44734:
-
Description: 
In addition to type mappings between Spark data types and Python data types 
(SPARK-44733), we should add the type casting rules for regular and 
arrow-optimized Python UDFs/UDTFs. 

We currently have this table in code:
 * Arrow: 
[https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/functions.py#L311-L329]
 * Python UDF: 
[https://github.com/apache/spark/blob/master/python/pyspark/sql/udf.py#L101-L116]

We should add a proper documentation page for the type casting rules. 

  was:
In addition to type mappings between Spark data types and Python data types, we 
should add the type casting rules for regular and arrow-optimized Python 
UDFs/UDTFs. 

We currently have this table in code:
 * Arrow: 
[https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/functions.py#L311-L329]
 * Python UDF: 
[https://github.com/apache/spark/blob/master/python/pyspark/sql/udf.py#L101-L116]

We should add a proper documentation page for the type casting rules. 


> Add documentation for type casting rules in Python UDFs/UDTFs
> -
>
> Key: SPARK-44734
> URL: https://issues.apache.org/jira/browse/SPARK-44734
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> In addition to type mappings between Spark data types and Python data types 
> (SPARK-44733), we should add the type casting rules for regular and 
> arrow-optimized Python UDFs/UDTFs. 
> We currently have this table in code:
>  * Arrow: 
> [https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/functions.py#L311-L329]
>  * Python UDF: 
> [https://github.com/apache/spark/blob/master/python/pyspark/sql/udf.py#L101-L116]
> We should add a proper documentation page for the type casting rules. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44734) Add documentation for type casting rules in Python UDFs/UDTFs

2023-08-08 Thread Allison Wang (Jira)
Allison Wang created SPARK-44734:


 Summary: Add documentation for type casting rules in Python 
UDFs/UDTFs
 Key: SPARK-44734
 URL: https://issues.apache.org/jira/browse/SPARK-44734
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


In addition to type mappings between Spark data types and Python data types, we 
should add the type casting rules for regular and arrow-optimized Python 
UDFs/UDTFs. 

We currently have this table in code:
 * Arrow: 
[https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/functions.py#L311-L329]
 * Python UDF: 
[https://github.com/apache/spark/blob/master/python/pyspark/sql/udf.py#L101-L116]

We should add a proper documentation page for the type casting rules. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44733) Add documentation for type mappings between Spark and Python data types

2023-08-08 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-44733:
-
Summary: Add documentation for type mappings between Spark and Python data 
types  (was: Add type mappings between Spark data types and Python types)

> Add documentation for type mappings between Spark and Python data types
> ---
>
> Key: SPARK-44733
> URL: https://issues.apache.org/jira/browse/SPARK-44733
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Currently, the PySpark documentation does not cover the data type mapping 
> between Spark data types and Python data types. This mapping can be useful 
> for users of Python UDFs/UDTFs.
> There's a document detailing type mapping in the Spark documentation: 
> [https://spark.apache.org/docs/3.4.1/sql-ref-datatypes.html]
> We should create a documentation page dedicated to type mappings.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44733) Add type mappings between Spark data types and Python types

2023-08-08 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-44733:
-
Summary: Add type mappings between Spark data types and Python types  (was: 
Add type mappings between Spark data type and Python type)

> Add type mappings between Spark data types and Python types
> ---
>
> Key: SPARK-44733
> URL: https://issues.apache.org/jira/browse/SPARK-44733
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Currently, the PySpark documentation does not cover the data type mapping 
> between Spark data types and Python data types. This mapping can be useful 
> for users of Python UDFs/UDTFs.
> There's a document detailing type mapping in the Spark documentation: 
> [https://spark.apache.org/docs/3.4.1/sql-ref-datatypes.html]
> We should create a documentation page dedicated to type mappings.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44733) Add type mappings between Spark data type and Python type

2023-08-08 Thread Allison Wang (Jira)
Allison Wang created SPARK-44733:


 Summary: Add type mappings between Spark data type and Python type
 Key: SPARK-44733
 URL: https://issues.apache.org/jira/browse/SPARK-44733
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Currently, the PySpark documentation does not cover the data type mapping 
between Spark data types and Python data types. This mapping can be useful for 
users of Python UDFs/UDTFs.

There's a document detailing type mapping in the Spark documentation: 
[https://spark.apache.org/docs/3.4.1/sql-ref-datatypes.html]

We should create a documentation page dedicated to type mappings.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44717) "pyspark.pandas.resample" is incorrect when DST is overlapped and setting "spark.sql.timestampType" to TIMESTAMP_NTZ does not help

2023-08-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44717.
--
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42392
[https://github.com/apache/spark/pull/42392]

> "pyspark.pandas.resample" is incorrect when DST is overlapped and setting 
> "spark.sql.timestampType" to TIMESTAMP_NTZ does not help
> --
>
> Key: SPARK-44717
> URL: https://issues.apache.org/jira/browse/SPARK-44717
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0, 3.4.1, 4.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> Use one of the existing tests:
> - "11H" case of test_dataframe_resample 
> (pyspark.pandas.tests.test_resample.ResampleTests) 
> - "1001H" case of test_series_resample 
> (pyspark.pandas.tests.test_resample.ResampleTests) 
> After setting the TZ, for example to New York (e.g. by running the following 
> Python code in a "setUpClass"):
> {noformat}
> os.environ["TZ"] = 'America/New_York'
> {noformat}
> you will get the following error for the latter test:
> {noformat}
> ==
> FAIL [4.219s]: test_series_resample 
> (pyspark.pandas.tests.test_resample.ResampleTests)
> --
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 
> 276, in test_series_resample
> self._test_resample(self.pdf3.A, self.psdf3.A, ["1001H"], "right", 
> "right", "sum")
>   File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 
> 259, in _test_resample
> self.assert_eq(
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 457, in 
> assert_eq
> _assert_pandas_almost_equal(lobj, robj)
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 228, in 
> _assert_pandas_almost_equal
> raise PySparkAssertionError(
> pyspark.errors.exceptions.base.PySparkAssertionError: 
> [DIFFERENT_PANDAS_SERIES] Series are not almost equal:
> Left:
> Freq: 1001H
> float64
> Right:
> float64
> {noformat}
> The problem is that the pyspark resample produces more resampled rows in the 
> result. The DST change causes those extra rows, because the computed 
> __tmp_resample_bin_col__ looks something like:
> {noformat}
> | __index_level_0__  | __tmp_resample_bin_col__ | A
> .
> |2011-03-08 00:00:00|2011-03-26 11:00:00 |0.3980551570183919  |
> |2011-03-09 00:00:00|2011-03-26 11:00:00 |0.6511376673995046  |
> |2011-03-10 00:00:00|2011-03-26 11:00:00 |0.6141085426890365  |
> |2011-03-11 00:00:00|2011-03-26 11:00:00 |0.11557638066163867 |
> |2011-03-12 00:00:00|2011-03-26 11:00:00 |0.4517788243490799  |
> |2011-03-13 00:00:00|2011-03-26 11:00:00 |0.8637060550157284  |
> |2011-03-14 00:00:00|2011-03-26 10:00:00 |0.8169499149450166  |
> |2011-03-15 00:00:00|2011-03-26 10:00:00 |0.4585916249356583  |
> |2011-03-16 00:00:00|2011-03-26 10:00:00 |0.8362472880832088  |
> |2011-03-17 00:00:00|2011-03-26 10:00:00 |0.026716901748386812|
> |2011-03-18 00:00:00|2011-03-26 10:00:00 |0.9086816462089563  |
> {noformat}
> You can see the extra rows around the point where DST kicked in on 2011-03-13 
> in New York.
> Even setting the conf "spark.sql.timestampType" to "TIMESTAMP_NTZ" does not 
> help.
> You can see my tests here:
> https://github.com/attilapiros/spark/pull/5
> Pandas timestamps are TZ-naive:
> {noformat}
> import pandas as pd
> a = pd.Timestamp(year=2011, month=3, day=13, hour=1)
> b = pd.Timedelta(hours=1)
> >> a 
> Timestamp('2011-03-13 01:00:00')
> >>> a+b
> Timestamp('2011-03-13 02:00:00')
> >>> a+b+b
> Timestamp('2011-03-13 03:00:00')
> {noformat}
> But pyspark TimestampType uses TZ and DST:
> {noformat}
> >>> sql("select  TIMESTAMP '2011-03-13 01:00:00'").show()
> +---+
> |TIMESTAMP '2011-03-13 01:00:00'|
> +---+
> |2011-03-13 01:00:00|
> +---+
> >>> sql("select  TIMESTAMP '2011-03-13 01:00:00' + 
> >>> make_interval(0,0,0,0,1,0,0)").show()
> ++
> |TIMESTAMP '2011-03-13 01:00:00' + make_interval(0, 0, 0, 0, 1, 0, 0)|
> ++
> | 2011-03-13 03:00:00|
> ++
> {noformat}
> The current resample code uses the above interval based calculation.

[jira] [Assigned] (SPARK-44717) "pyspark.pandas.resample" is incorrect when DST is overlapped and setting "spark.sql.timestampType" to TIMESTAMP_NTZ does not help

2023-08-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-44717:


Assignee: Hyukjin Kwon

> "pyspark.pandas.resample" is incorrect when DST is overlapped and setting 
> "spark.sql.timestampType" to TIMESTAMP_NTZ does not help
> --
>
> Key: SPARK-44717
> URL: https://issues.apache.org/jira/browse/SPARK-44717
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0, 3.4.1, 4.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Use one of the existing tests:
> - "11H" case of test_dataframe_resample 
> (pyspark.pandas.tests.test_resample.ResampleTests) 
> - "1001H" case of test_series_resample 
> (pyspark.pandas.tests.test_resample.ResampleTests) 
> After setting the TZ, for example to New York (e.g. by running the following 
> Python code in a "setUpClass"):
> {noformat}
> os.environ["TZ"] = 'America/New_York'
> {noformat}
> you will get the following error for the latter test:
> {noformat}
> ==
> FAIL [4.219s]: test_series_resample 
> (pyspark.pandas.tests.test_resample.ResampleTests)
> --
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 
> 276, in test_series_resample
> self._test_resample(self.pdf3.A, self.psdf3.A, ["1001H"], "right", 
> "right", "sum")
>   File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 
> 259, in _test_resample
> self.assert_eq(
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 457, in 
> assert_eq
> _assert_pandas_almost_equal(lobj, robj)
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 228, in 
> _assert_pandas_almost_equal
> raise PySparkAssertionError(
> pyspark.errors.exceptions.base.PySparkAssertionError: 
> [DIFFERENT_PANDAS_SERIES] Series are not almost equal:
> Left:
> Freq: 1001H
> float64
> Right:
> float64
> {noformat}
> The problem is that the pyspark resample produces more resampled rows in the 
> result. The DST change causes those extra rows, because the computed 
> __tmp_resample_bin_col__ looks something like:
> {noformat}
> | __index_level_0__  | __tmp_resample_bin_col__ | A
> .
> |2011-03-08 00:00:00|2011-03-26 11:00:00 |0.3980551570183919  |
> |2011-03-09 00:00:00|2011-03-26 11:00:00 |0.6511376673995046  |
> |2011-03-10 00:00:00|2011-03-26 11:00:00 |0.6141085426890365  |
> |2011-03-11 00:00:00|2011-03-26 11:00:00 |0.11557638066163867 |
> |2011-03-12 00:00:00|2011-03-26 11:00:00 |0.4517788243490799  |
> |2011-03-13 00:00:00|2011-03-26 11:00:00 |0.8637060550157284  |
> |2011-03-14 00:00:00|2011-03-26 10:00:00 |0.8169499149450166  |
> |2011-03-15 00:00:00|2011-03-26 10:00:00 |0.4585916249356583  |
> |2011-03-16 00:00:00|2011-03-26 10:00:00 |0.8362472880832088  |
> |2011-03-17 00:00:00|2011-03-26 10:00:00 |0.026716901748386812|
> |2011-03-18 00:00:00|2011-03-26 10:00:00 |0.9086816462089563  |
> {noformat}
> You can see the extra rows around the point where DST kicked in on 2011-03-13 
> in New York.
> Even setting the conf "spark.sql.timestampType" to "TIMESTAMP_NTZ" does not 
> help.
> You can see my tests here:
> https://github.com/attilapiros/spark/pull/5
> Pandas timestamps are TZ-naive:
> {noformat}
> import pandas as pd
> a = pd.Timestamp(year=2011, month=3, day=13, hour=1)
> b = pd.Timedelta(hours=1)
> >> a 
> Timestamp('2011-03-13 01:00:00')
> >>> a+b
> Timestamp('2011-03-13 02:00:00')
> >>> a+b+b
> Timestamp('2011-03-13 03:00:00')
> {noformat}
> But pyspark TimestampType uses TZ and DST:
> {noformat}
> >>> sql("select  TIMESTAMP '2011-03-13 01:00:00'").show()
> +---+
> |TIMESTAMP '2011-03-13 01:00:00'|
> +---+
> |2011-03-13 01:00:00|
> +---+
> >>> sql("select  TIMESTAMP '2011-03-13 01:00:00' + 
> >>> make_interval(0,0,0,0,1,0,0)").show()
> ++
> |TIMESTAMP '2011-03-13 01:00:00' + make_interval(0, 0, 0, 0, 1, 0, 0)|
> ++
> | 2011-03-13 03:00:00|
> ++
> {noformat}
> The current resample code uses the above interval based calculation.
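As a Spark-independent illustration of the effect described above (a minimal sketch using only the Python standard library), a wall-clock day that spans the 2011-03-13 DST switch in New York is only 23 real hours long, which is why local-time interval arithmetic shifts the resample bins:

{code:python}
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

ny = ZoneInfo("America/New_York")
before = datetime(2011, 3, 13, 0, 0, tzinfo=ny)  # DST starts at 02:00 local time
after = datetime(2011, 3, 14, 0, 0, tzinfo=ny)   # same wall-clock time one day later

elapsed = after.astimezone(timezone.utc) - before.astimezone(timezone.utc)
print(elapsed)  # 23:00:00 -- only 23 real hours elapsed across the DST switch
{code}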



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (SPARK-43633) Enable CategoricalIndexTests.test_remove_categories for pandas 2.0.0.

2023-08-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43633.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42273
[https://github.com/apache/spark/pull/42273]

> Enable CategoricalIndexTests.test_remove_categories for pandas 2.0.0.
> -
>
> Key: SPARK-43633
> URL: https://issues.apache.org/jira/browse/SPARK-43633
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 4.0.0
>
>
> Enable CategoricalIndexTests.test_remove_categories for pandas 2.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43568) Enable CategoricalIndexTests.test_categories_setter for pandas 2.0.0.

2023-08-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-43568:


Assignee: Haejoon Lee

> Enable CategoricalIndexTests.test_categories_setter for pandas 2.0.0.
> -
>
> Key: SPARK-43568
> URL: https://issues.apache.org/jira/browse/SPARK-43568
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> Enable CategoricalIndexTests.test_categories_setter for pandas 2.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43568) Enable CategoricalIndexTests.test_categories_setter for pandas 2.0.0.

2023-08-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43568.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42273
[https://github.com/apache/spark/pull/42273]

> Enable CategoricalIndexTests.test_categories_setter for pandas 2.0.0.
> -
>
> Key: SPARK-43568
> URL: https://issues.apache.org/jira/browse/SPARK-43568
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 4.0.0
>
>
> Enable CategoricalIndexTests.test_categories_setter for pandas 2.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43633) Enable CategoricalIndexTests.test_remove_categories for pandas 2.0.0.

2023-08-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-43633:


Assignee: Haejoon Lee

> Enable CategoricalIndexTests.test_remove_categories for pandas 2.0.0.
> -
>
> Key: SPARK-43633
> URL: https://issues.apache.org/jira/browse/SPARK-43633
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> Enable CategoricalIndexTests.test_remove_categories for pandas 2.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44695) Improve error message for `DataFrame.toDF`.

2023-08-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-44695:


Assignee: Haejoon Lee

> Improve error message for `DataFrame.toDF`.
> ---
>
> Key: SPARK-44695
> URL: https://issues.apache.org/jira/browse/SPARK-44695
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> Improve error message



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44695) Improve error message for `DataFrame.toDF`.

2023-08-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44695.
--
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42369
[https://github.com/apache/spark/pull/42369]

> Improve error message for `DataFrame.toDF`.
> ---
>
> Key: SPARK-44695
> URL: https://issues.apache.org/jira/browse/SPARK-44695
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> Improve error message



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44732) Port the initial implementation of Spark XML data source

2023-08-08 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752216#comment-17752216
 ] 

Hyukjin Kwon commented on SPARK-44732:
--

https://github.com/apache/spark/pull/41832


> Port the initial implementation of Spark XML data source
> 
>
> Key: SPARK-44732
> URL: https://issues.apache.org/jira/browse/SPARK-44732
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-44265) Built-in XML data source support

2023-08-08 Thread Hyukjin Kwon (Jira)


[ https://issues.apache.org/jira/browse/SPARK-44265 ]


Hyukjin Kwon deleted comment on SPARK-44265:
--

was (Author: snoot):
User 'sandip-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/41832

> Built-in XML data source support
> 
>
> Key: SPARK-44265
> URL: https://issues.apache.org/jira/browse/SPARK-44265
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Sandip Agarwala
>Priority: Critical
>
> XML is a widely used data format. An external spark-xml package 
> ([https://github.com/databricks/spark-xml]) is available to read and write 
> XML data in Spark. Making spark-xml built-in will provide a better user 
> experience for Spark SQL and Structured Streaming. The proposal is to inline 
> the code from the spark-xml package.
>  
> Here is the link to 
> [SPIP|https://docs.google.com/document/d/1ZaOBT4-YFtN58UCx2cdFhlsKbie1ugAn-Fgz_Dddz-Q/edit]
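For context, a minimal PySpark sketch of how the external package is used today (the file path and rowTag are hypothetical, and the spark-xml jar is assumed to be on the classpath); the proposal would make this work out of the box:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read XML records whose row element is <book>; today this requires the
# external com.databricks:spark-xml package.
df = (spark.read.format("xml")
      .option("rowTag", "book")
      .load("/tmp/books.xml"))
df.printSchema()
{code}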



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44265) Built-in XML data source support

2023-08-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-44265:
-
Affects Version/s: 4.0.0
   (was: 3.5.0)

> Built-in XML data source support
> 
>
> Key: SPARK-44265
> URL: https://issues.apache.org/jira/browse/SPARK-44265
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Sandip Agarwala
>Priority: Critical
>
> XML is a widely used data format. An external spark-xml package 
> ([https://github.com/databricks/spark-xml]) is available to read and write 
> XML data in Spark. Making spark-xml built-in will provide a better user 
> experience for Spark SQL and Structured Streaming. The proposal is to inline 
> the code from the spark-xml package.
>  
> Here is the link to 
> [SPIP|https://docs.google.com/document/d/1ZaOBT4-YFtN58UCx2cdFhlsKbie1ugAn-Fgz_Dddz-Q/edit]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44732) Port the initial implementation of Spark XML data source

2023-08-08 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-44732:


 Summary: Port the initial implementation of Spark XML data source
 Key: SPARK-44732
 URL: https://issues.apache.org/jira/browse/SPARK-44732
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44265) Built-in XML data source support

2023-08-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-44265:
-
Issue Type: Umbrella  (was: New Feature)

> Built-in XML data source support
> 
>
> Key: SPARK-44265
> URL: https://issues.apache.org/jira/browse/SPARK-44265
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Sandip Agarwala
>Priority: Critical
>
> XML is a widely used data format. An external spark-xml package 
> ([https://github.com/databricks/spark-xml]) is available to read and write 
> XML data in Spark. Making spark-xml built-in will provide a better user 
> experience for Spark SQL and Structured Streaming. The proposal is to inline 
> the code from the spark-xml package.
>  
> Here is the link to 
> [SPIP|https://docs.google.com/document/d/1ZaOBT4-YFtN58UCx2cdFhlsKbie1ugAn-Fgz_Dddz-Q/edit]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44723) Upgrade `gcs-connector` to 2.2.16

2023-08-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44723.
--
Fix Version/s: 4.0.0
 Assignee: Dongjoon Hyun
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/42403

> Upgrade `gcs-connector` to 2.2.16
> -
>
> Key: SPARK-44723
> URL: https://issues.apache.org/jira/browse/SPARK-44723
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44665) Add support for pandas DataFrame assertDataFrameEqual

2023-08-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44665.
--
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42332
[https://github.com/apache/spark/pull/42332]

> Add support for pandas DataFrame assertDataFrameEqual
> -
>
> Key: SPARK-44665
> URL: https://issues.apache.org/jira/browse/SPARK-44665
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Assignee: Amanda Liu
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44665) Add support for pandas DataFrame assertDataFrameEqual

2023-08-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-44665:


Assignee: Amanda Liu

> Add support for pandas DataFrame assertDataFrameEqual
> -
>
> Key: SPARK-44665
> URL: https://issues.apache.org/jira/browse/SPARK-44665
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Assignee: Amanda Liu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44722) reattach.py: AttributeError: 'NoneType' object has no attribute 'message'

2023-08-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44722.
--
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42397
[https://github.com/apache/spark/pull/42397]

> reattach.py: AttributeError: 'NoneType' object has no attribute 'message'
> -
>
> Key: SPARK-44722
> URL: https://issues.apache.org/jira/browse/SPARK-44722
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44722) reattach.py: AttributeError: 'NoneType' object has no attribute 'message'

2023-08-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-44722:


Assignee: Juliusz Sompolski

> reattach.py: AttributeError: 'NoneType' object has no attribute 'message'
> -
>
> Key: SPARK-44722
> URL: https://issues.apache.org/jira/browse/SPARK-44722
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44731) Support 'spark.sql.timestampType' in Python Spark Connect client

2023-08-08 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-44731:


 Summary: Support 'spark.sql.timestampType' in Python Spark Connect 
client
 Key: SPARK-44731
 URL: https://issues.apache.org/jira/browse/SPARK-44731
 Project: Spark
  Issue Type: Task
  Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Hyukjin Kwon


If the Spark session has 'spark.sql.timestampType' set to TIMESTAMP_NTZ, datetime values 
should be inferred as TimestampNTZ type. However, this isn't implemented yet on the Python 
client side.
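A minimal sketch of the expected behavior (the Connect endpoint and literal values below are placeholders, not part of this ticket):

{code:python}
import datetime
from pyspark.sql import SparkSession

# Spark Connect client; the endpoint is a placeholder.
spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
spark.conf.set("spark.sql.timestampType", "TIMESTAMP_NTZ")

df = spark.createDataFrame([(datetime.datetime(2023, 8, 8, 12, 0),)], ["ts"])
df.printSchema()
# Expected once this is supported: "ts" is inferred as timestamp_ntz.
# Currently the Python Connect client does not honor the conf during inference.
{code}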



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44730) Spark Connect: Cleaner thread not stopped when SparkSession stops

2023-08-08 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-44730:
-

 Summary: Spark Connect: Cleaner thread not stopped when 
SparkSession stops
 Key: SPARK-44730
 URL: https://issues.apache.org/jira/browse/SPARK-44730
 Project: Spark
  Issue Type: Bug
  Components: Connect
Affects Versions: 3.5.0, 4.0.0
Reporter: Juliusz Sompolski


The Spark Connect Scala client SparkSession has a cleaner, which starts a daemon 
thread to clean up Closeable objects after GC. This daemon thread is never 
stopped, and every SparkSession creates a new one.

Cleaner implements a stop() function, but no one ever calls it, possibly 
because even after SparkSession.stop() the cleaner may still be needed when 
remaining references are GCed. For this reason it seems that the Cleaner 
should be a global singleton rather than per-session.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44725) Document spark.network.timeoutInterval

2023-08-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-44725:
--
Fix Version/s: 3.3.4
   3.5.1
   (was: 3.5.0)
   (was: 3.3.3)

> Document spark.network.timeoutInterval
> --
>
> Key: SPARK-44725
> URL: https://issues.apache.org/jira/browse/SPARK-44725
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.3.2, 3.4.1, 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Trivial
> Fix For: 3.4.2, 4.0.0, 3.5.1, 3.3.4
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44725) Document spark.network.timeoutInterval

2023-08-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-44725:
-

Assignee: Dongjoon Hyun

> Document spark.network.timeoutInterval
> --
>
> Key: SPARK-44725
> URL: https://issues.apache.org/jira/browse/SPARK-44725
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.3.2, 3.4.1, 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44725) Document spark.network.timeoutInterval

2023-08-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44725.
---
Fix Version/s: 3.3.3
   3.5.0
   4.0.0
   3.4.2
   Resolution: Fixed

Issue resolved by pull request 42402
[https://github.com/apache/spark/pull/42402]

> Document spark.network.timeoutInterval
> --
>
> Key: SPARK-44725
> URL: https://issues.apache.org/jira/browse/SPARK-44725
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.3.2, 3.4.1, 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Trivial
> Fix For: 3.3.3, 3.5.0, 4.0.0, 3.4.2
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44729) Add canonical links to the PySpark docs page

2023-08-08 Thread Allison Wang (Jira)
Allison Wang created SPARK-44729:


 Summary: Add canonical links to the PySpark docs page
 Key: SPARK-44729
 URL: https://issues.apache.org/jira/browse/SPARK-44729
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


We should add the canonical link to the PySpark docs page 
[https://spark.apache.org/docs/latest/api/python/index.html] so that search 
engines can return the latest PySpark docs.
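One hedged sketch of how this could be wired up, assuming the PySpark docs keep using Sphinx (the conf.py path is an assumption and the final mechanism may differ): Sphinx emits a canonical link tag on every page when html_baseurl is set.

{code:python}
# python/docs/source/conf.py (path is an assumption)
# With html_baseurl set, Sphinx adds a canonical link pointing at the "latest"
# docs, so search engines prefer it over older versioned copies.
html_baseurl = "https://spark.apache.org/docs/latest/api/python/"
{code}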



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44728) Improve PySpark documentations

2023-08-08 Thread Allison Wang (Jira)
Allison Wang created SPARK-44728:


 Summary: Improve PySpark documentations
 Key: SPARK-44728
 URL: https://issues.apache.org/jira/browse/SPARK-44728
 Project: Spark
  Issue Type: Umbrella
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


An umbrella Jira ticket to improve the PySpark documentation.
 
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44725) Document spark.network.timeoutInterval

2023-08-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-44725:
--
Affects Version/s: 3.4.1
   3.3.2

> Document spark.network.timeoutInterval
> --
>
> Key: SPARK-44725
> URL: https://issues.apache.org/jira/browse/SPARK-44725
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.3.2, 3.4.1, 3.5.0
>Reporter: Dongjoon Hyun
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44727) Improve the error message for dynamic allocation conditions

2023-08-08 Thread Cheng Pan (Jira)
Cheng Pan created SPARK-44727:
-

 Summary: Improve the error message for dynamic allocation 
conditions
 Key: SPARK-44727
 URL: https://issues.apache.org/jira/browse/SPARK-44727
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.5.0
Reporter: Cheng Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44726) Improve HeartbeatReceiver config validation error message

2023-08-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-44726:
--
Description: 
{code}
$ bin/spark-shell -c spark.network.timeout=30s
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
23/08/08 14:38:18 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
23/08/08 14:38:19 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: requirement failed: 
spark.network.timeoutInterval should be less than or equal to 
spark.storage.blockManagerHeartbeatTimeoutMs.
{code}

> Improve HeartbeatReceiver config validation error message
> -
>
> Key: SPARK-44726
> URL: https://issues.apache.org/jira/browse/SPARK-44726
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> {code}
> $ bin/spark-shell -c spark.network.timeout=30s
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 23/08/08 14:38:18 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 23/08/08 14:38:19 ERROR SparkContext: Error initializing SparkContext.
> java.lang.IllegalArgumentException: requirement failed: 
> spark.network.timeoutInterval should be less than or equal to 
> spark.storage.blockManagerHeartbeatTimeoutMs.
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44726) Improve HeartbeatReceiver config validation error message

2023-08-08 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-44726:
-

 Summary: Improve HeartbeatReceiver config validation error message
 Key: SPARK-44726
 URL: https://issues.apache.org/jira/browse/SPARK-44726
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44725) Document spark.network.timeoutInterval

2023-08-08 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-44725:
-

 Summary: Document spark.network.timeoutInterval
 Key: SPARK-44725
 URL: https://issues.apache.org/jira/browse/SPARK-44725
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 3.5.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44724) INSET hash hset set to None when plan exported into JSON

2023-08-08 Thread Matteo Interlandi (Jira)
Matteo Interlandi created SPARK-44724:
-

 Summary: INSET hash hset set to None when plan exported into JSON
 Key: SPARK-44724
 URL: https://issues.apache.org/jira/browse/SPARK-44724
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.4.1
Reporter: Matteo Interlandi


I am exporting optimized plans using 
`_jdf.queryExecution().optimizedPlan().toJSON()`. I noticed that when the plan 
contains an `INSET` operator, the `hset` attribute is None (instead of containing 
the set elements).

When printing `_jdf.queryExecution().optimizedPlan()` directly, the `INSET` 
operator shows all the elements, so I guess the problem is with the `toJSON` 
method.
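A minimal reproduction sketch (assuming default optimizer settings, where isin() with more than spark.sql.optimizer.inSetConversionThreshold = 10 elements is rewritten into an InSet/INSET expression):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# 20 elements exceed the default InSet conversion threshold of 10
df = spark.range(100).filter(col("id").isin(list(range(20))))

plan = df._jdf.queryExecution().optimizedPlan()
print(plan)           # the printed plan shows the INSET element set
print(plan.toJSON())  # reported problem: the "hset" attribute comes out as null
{code}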



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44723) Upgrade `gcs-connector` to 2.2.16

2023-08-08 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-44723:
-

 Summary: Upgrade `gcs-connector` to 2.2.16
 Key: SPARK-44723
 URL: https://issues.apache.org/jira/browse/SPARK-44723
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44699) Add logging for complete write events to file in EventLogFileWriter.closeWriter

2023-08-08 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752133#comment-17752133
 ] 

Hudson commented on SPARK-44699:


User 'shuyouZZ' has created a pull request for this issue:
https://github.com/apache/spark/pull/42372

> Add logging for complete write events to file in 
> EventLogFileWriter.closeWriter
> ---
>
> Key: SPARK-44699
> URL: https://issues.apache.org/jira/browse/SPARK-44699
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: shuyouZZ
>Priority: Major
> Fix For: 3.5.0
>
>
> Sometimes we want to know when logging the events to the eventLog file has 
> finished; we need to add a log message to make this clearer.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44691) Move Subclasses of Analysis to sql/api

2023-08-08 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752134#comment-17752134
 ] 

Hudson commented on SPARK-44691:


User 'heyihong' has created a pull request for this issue:
https://github.com/apache/spark/pull/42363

> Move Subclasses of Analysis to sql/api
> --
>
> Key: SPARK-44691
> URL: https://issues.apache.org/jira/browse/SPARK-44691
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Yihong He
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43754) Spark Connect Session & Query lifecycle

2023-08-08 Thread Juliusz Sompolski (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752130#comment-17752130
 ] 

Juliusz Sompolski commented on SPARK-43754:
---

Not in the epic, but a nice-to-have refactoring: 
https://issues.apache.org/jira/browse/SPARK-43756

> Spark Connect Session & Query lifecycle
> ---
>
> Key: SPARK-43754
> URL: https://issues.apache.org/jira/browse/SPARK-43754
> Project: Spark
>  Issue Type: Epic
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> Currently, queries in Spark Connect are executed within the RPC handler.
> We want to detach the RPC interface from actual sessions and execution, so 
> that we can make the interface more flexible:
>  * maintain long-running sessions, independent of an unbroken GRPC channel
>  * be able to cancel queries
>  * have interfaces to query results other than a push from the server



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43756) Spark Connect - prefer to pass around SessionHolder / ExecuteHolder more

2023-08-08 Thread Juliusz Sompolski (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juliusz Sompolski updated SPARK-43756:
--
Epic Link:   (was: SPARK-43754)

> Spark Connect - prefer to pass around SessionHolder / ExecuteHolder more
> 
>
> Key: SPARK-43756
> URL: https://issues.apache.org/jira/browse/SPARK-43756
> Project: Spark
>  Issue Type: Task
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> Right now, we pass around individual things like sessionId, userId etc. in 
> multiple places. This leads to the need for additional threading of parameters 
> quite often when something new is added.
> It would be better to pass around SessionHolder and ExecutePlanHolder, where 
> accessor and utility functions can be added.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44709) Fix flow control in ExecuteGrpcResponseSender

2023-08-08 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-44709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell resolved SPARK-44709.
---
Fix Version/s: 3.5.0
 Assignee: Juliusz Sompolski
   Resolution: Fixed

> Fix flow control in ExecuteGrpcResponseSender
> -
>
> Key: SPARK-44709
> URL: https://issues.apache.org/jira/browse/SPARK-44709
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44719) NoClassDefFoundError when using Hive UDF

2023-08-08 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752119#comment-17752119
 ] 

Dongjoon Hyun commented on SPARK-44719:
---

No, there is no Apache Hive 2.3.10 release yet.

Given that there is no Apache Hive 2.3.10 yet, +1 for reverting SPARK-43225 
(which is also created by [~yumwang]).

> NoClassDefFoundError when using Hive UDF
> 
>
> Key: SPARK-44719
> URL: https://issues.apache.org/jira/browse/SPARK-44719
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: HiveUDFs-1.0-SNAPSHOT.jar
>
>
> How to reproduce:
> {noformat}
> spark-sql (default)> add jar 
> /Users/yumwang/Downloads/HiveUDFs-1.0-SNAPSHOT.jar;
> Time taken: 0.413 seconds
> spark-sql (default)> CREATE TEMPORARY FUNCTION long_to_ip as 
> 'net.petrabarus.hiveudfs.LongToIP';
> Time taken: 0.038 seconds
> spark-sql (default)> SELECT long_to_ip(2130706433L) FROM range(10);
> 23/08/08 20:17:58 ERROR SparkSQLDriver: Failed in [SELECT 
> long_to_ip(2130706433L) FROM range(10)]
> java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory
>   at org.apache.hadoop.hive.ql.udf.UDFJson.(UDFJson.java:64)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44722) reattach.py: AttributeError: 'NoneType' object has no attribute 'message'

2023-08-08 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-44722:
-

 Summary: reattach.py: AttributeError: 'NoneType' object has no 
attribute 'message'
 Key: SPARK-44722
 URL: https://issues.apache.org/jira/browse/SPARK-44722
 Project: Spark
  Issue Type: Bug
  Components: Connect
Affects Versions: 3.5.0, 4.0.0
Reporter: Juliusz Sompolski






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44721) Retry Policy Revamp

2023-08-08 Thread Alice Sayutina (Jira)
Alice Sayutina created SPARK-44721:
--

 Summary: Retry Policy Revamp
 Key: SPARK-44721
 URL: https://issues.apache.org/jira/browse/SPARK-44721
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 3.5.0
Reporter: Alice Sayutina


Change the retry logic. With the existing retry logic, the maximum tolerated wait time 
can, with small probability, be extremely low. Revamp the logic to guarantee a 
certain minimum wait time.
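The ticket does not spell out the new policy; purely as an illustration of the kind of guarantee described (all names and numbers below are hypothetical), a jittered backoff whose per-attempt sleep is bounded from below cannot collapse to a near-zero total wait:

{code:python}
import random
import time

def call_with_retries(fn, max_retries=15, initial_backoff=0.05,
                      max_backoff=60.0, multiplier=4.0, min_fraction=0.5):
    """Hypothetical retry helper: exponential backoff whose jitter is sampled
    from [min_fraction * backoff, backoff] instead of [0, backoff], so the
    total wait before giving up has a guaranteed minimum."""
    backoff = initial_backoff
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(random.uniform(min_fraction * backoff, backoff))
            backoff = min(backoff * multiplier, max_backoff)
{code}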



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44720) Make Dataset use Encoder instead of AgnosticEncoder

2023-08-08 Thread Jira
Herman van Hövell created SPARK-44720:
-

 Summary: Make Dataset use Encoder instead of AgnosticEncoder
 Key: SPARK-44720
 URL: https://issues.apache.org/jira/browse/SPARK-44720
 Project: Spark
  Issue Type: New Feature
  Components: Connect
Affects Versions: 3.5.0
Reporter: Herman van Hövell
Assignee: Herman van Hövell






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44715) Add missing udf and callUdf functions

2023-08-08 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-44715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell resolved SPARK-44715.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

> Add missing udf and callUdf functions
> -
>
> Key: SPARK-44715
> URL: https://issues.apache.org/jira/browse/SPARK-44715
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44719) NoClassDefFoundError when using Hive UDF

2023-08-08 Thread Manu Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752049#comment-17752049
 ] 

Manu Zhang commented on SPARK-44719:


Is there a 2.3.10 release?

> NoClassDefFoundError when using Hive UDF
> 
>
> Key: SPARK-44719
> URL: https://issues.apache.org/jira/browse/SPARK-44719
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: HiveUDFs-1.0-SNAPSHOT.jar
>
>
> How to reproduce:
> {noformat}
> spark-sql (default)> add jar 
> /Users/yumwang/Downloads/HiveUDFs-1.0-SNAPSHOT.jar;
> Time taken: 0.413 seconds
> spark-sql (default)> CREATE TEMPORARY FUNCTION long_to_ip as 
> 'net.petrabarus.hiveudfs.LongToIP';
> Time taken: 0.038 seconds
> spark-sql (default)> SELECT long_to_ip(2130706433L) FROM range(10);
> 23/08/08 20:17:58 ERROR SparkSQLDriver: Failed in [SELECT 
> long_to_ip(2130706433L) FROM range(10)]
> java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory
>   at org.apache.hadoop.hive.ql.udf.UDFJson.(UDFJson.java:64)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44710) Support Dataset.dropDuplicatesWithinWatermark in Scala Client

2023-08-08 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-44710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell resolved SPARK-44710.
---
Fix Version/s: 3.5.0
 Assignee: Herman van Hövell
   Resolution: Fixed

> Support Dataset.dropDuplicatesWithinWatermark in Scala Client
> -
>
> Key: SPARK-44710
> URL: https://issues.apache.org/jira/browse/SPARK-44710
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44713) Deduplicate files between sql/core and Spark Connect Scala Client

2023-08-08 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-44713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell resolved SPARK-44713.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

> Deduplicate files between sql/core and Spark Connect Scala Client
> -
>
> Key: SPARK-44713
> URL: https://issues.apache.org/jira/browse/SPARK-44713
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, SQL
>Affects Versions: 3.5.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44719) NoClassDefFoundError when using Hive UDF

2023-08-08 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752023#comment-17752023
 ] 

Yuming Wang commented on SPARK-44719:
-

There are two ways to fix it:
1. Upgrade the built-in Hive to 2.3.10 with the following patches:
2. Revert SPARK-43225.

https://github.com/apache/hive/pull/4562
https://github.com/apache/hive/pull/4563
https://github.com/apache/hive/pull/4564

> NoClassDefFoundError when using Hive UDF
> 
>
> Key: SPARK-44719
> URL: https://issues.apache.org/jira/browse/SPARK-44719
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: HiveUDFs-1.0-SNAPSHOT.jar
>
>
> How to reproduce:
> {noformat}
> spark-sql (default)> add jar 
> /Users/yumwang/Downloads/HiveUDFs-1.0-SNAPSHOT.jar;
> Time taken: 0.413 seconds
> spark-sql (default)> CREATE TEMPORARY FUNCTION long_to_ip as 
> 'net.petrabarus.hiveudfs.LongToIP';
> Time taken: 0.038 seconds
> spark-sql (default)> SELECT long_to_ip(2130706433L) FROM range(10);
> 23/08/08 20:17:58 ERROR SparkSQLDriver: Failed in [SELECT 
> long_to_ip(2130706433L) FROM range(10)]
> java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory
>   at org.apache.hadoop.hive.ql.udf.UDFJson.(UDFJson.java:64)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44719) NoClassDefFoundError when using Hive UDF

2023-08-08 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-44719:

Description: 
How to reproduce:
{noformat}
spark-sql (default)> add jar /Users/yumwang/Downloads/HiveUDFs-1.0-SNAPSHOT.jar;
Time taken: 0.413 seconds
spark-sql (default)> CREATE TEMPORARY FUNCTION long_to_ip as 
'net.petrabarus.hiveudfs.LongToIP';
Time taken: 0.038 seconds
spark-sql (default)> SELECT long_to_ip(2130706433L) FROM range(10);
23/08/08 20:17:58 ERROR SparkSQLDriver: Failed in [SELECT 
long_to_ip(2130706433L) FROM range(10)]
java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory
at org.apache.hadoop.hive.ql.udf.UDFJson.(UDFJson.java:64)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
...
{noformat}


  was:
How to reproduce:
```
spark-sql (default)> add jar /Users/yumwang/Downloads/HiveUDFs-1.0-SNAPSHOT.jar;
Time taken: 0.413 seconds
spark-sql (default)> CREATE TEMPORARY FUNCTION long_to_ip as 
'net.petrabarus.hiveudfs.LongToIP';
Time taken: 0.038 seconds
spark-sql (default)> SELECT long_to_ip(2130706433L) FROM range(10);
23/08/08 20:17:58 ERROR SparkSQLDriver: Failed in [SELECT 
long_to_ip(2130706433L) FROM range(10)]
java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory
at org.apache.hadoop.hive.ql.udf.UDFJson.(UDFJson.java:64)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
...
```


> NoClassDefFoundError when using Hive UDF
> 
>
> Key: SPARK-44719
> URL: https://issues.apache.org/jira/browse/SPARK-44719
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: HiveUDFs-1.0-SNAPSHOT.jar
>
>
> How to reproduce:
> {noformat}
> spark-sql (default)> add jar 
> /Users/yumwang/Downloads/HiveUDFs-1.0-SNAPSHOT.jar;
> Time taken: 0.413 seconds
> spark-sql (default)> CREATE TEMPORARY FUNCTION long_to_ip as 
> 'net.petrabarus.hiveudfs.LongToIP';
> Time taken: 0.038 seconds
> spark-sql (default)> SELECT long_to_ip(2130706433L) FROM range(10);
> 23/08/08 20:17:58 ERROR SparkSQLDriver: Failed in [SELECT 
> long_to_ip(2130706433L) FROM range(10)]
> java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory
>   at org.apache.hadoop.hive.ql.udf.UDFJson.(UDFJson.java:64)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44719) NoClassDefFoundError when using Hive UDF

2023-08-08 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-44719:

Attachment: HiveUDFs-1.0-SNAPSHOT.jar

> NoClassDefFoundError when using Hive UDF
> 
>
> Key: SPARK-44719
> URL: https://issues.apache.org/jira/browse/SPARK-44719
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: HiveUDFs-1.0-SNAPSHOT.jar
>
>
> How to reproduce:
> ```
> spark-sql (default)> add jar 
> /Users/yumwang/Downloads/HiveUDFs-1.0-SNAPSHOT.jar;
> Time taken: 0.413 seconds
> spark-sql (default)> CREATE TEMPORARY FUNCTION long_to_ip as 
> 'net.petrabarus.hiveudfs.LongToIP';
> Time taken: 0.038 seconds
> spark-sql (default)> SELECT long_to_ip(2130706433L) FROM range(10);
> 23/08/08 20:17:58 ERROR SparkSQLDriver: Failed in [SELECT 
> long_to_ip(2130706433L) FROM range(10)]
> java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory
>   at org.apache.hadoop.hive.ql.udf.UDFJson.(UDFJson.java:64)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
> ...
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44719) NoClassDefFoundError when using Hive UDF

2023-08-08 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-44719:
---

 Summary: NoClassDefFoundError when using Hive UDF
 Key: SPARK-44719
 URL: https://issues.apache.org/jira/browse/SPARK-44719
 Project: Spark
  Issue Type: Bug
  Components: Build, SQL
Affects Versions: 3.5.0
Reporter: Yuming Wang
 Attachments: HiveUDFs-1.0-SNAPSHOT.jar

How to reproduce:
```
spark-sql (default)> add jar /Users/yumwang/Downloads/HiveUDFs-1.0-SNAPSHOT.jar;
Time taken: 0.413 seconds
spark-sql (default)> CREATE TEMPORARY FUNCTION long_to_ip as 
'net.petrabarus.hiveudfs.LongToIP';
Time taken: 0.038 seconds
spark-sql (default)> SELECT long_to_ip(2130706433L) FROM range(10);
23/08/08 20:17:58 ERROR SparkSQLDriver: Failed in [SELECT 
long_to_ip(2130706433L) FROM range(10)]
java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory
at org.apache.hadoop.hive.ql.udf.UDFJson.(UDFJson.java:64)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
...
```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44718) High On-heap memory usage is detected while doing parquet-file reading with Off-Heap memory mode enabled on spark

2023-08-08 Thread Zamil Majdy (Jira)
Zamil Majdy created SPARK-44718:
---

 Summary: High On-heap memory usage is detected while doing 
parquet-file reading with Off-Heap memory mode enabled on spark
 Key: SPARK-44718
 URL: https://issues.apache.org/jira/browse/SPARK-44718
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 3.4.1
Reporter: Zamil Majdy


I see high on-heap memory usage while doing parquet file reading when the off-heap 
memory mode is enabled. This is caused by the memory mode for the vectorized 
reader's column vectors being configured by a different flag, whose default value 
is always On-Heap.

Conf to reproduce the issue:

{{spark.memory.offHeap.size 100}}
{{spark.memory.offHeap.enabled true}}

Enabling only these configurations will not change the memory mode used by the 
vectorized reader for parquet reading to Off-Heap.
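A sketch of the point being made (the extra flag name below reflects my reading of Spark's internal configs, so treat it as an assumption rather than the confirmed fix; the path is hypothetical):

{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.memory.offHeap.enabled", "true")
         .config("spark.memory.offHeap.size", "1g")
         # Without this separate flag, the vectorized Parquet reader still
         # allocates its column vectors on-heap.
         .config("spark.sql.columnVector.offheap.enabled", "true")
         .getOrCreate())

spark.read.parquet("/tmp/some.parquet").count()  # hypothetical path
{code}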

 

Proposed PR: https://github.com/apache/spark/pull/42394



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44564) Refine the documents with LLM

2023-08-08 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-44564:
--
Description: 
Let's first focus on the documentation of *PySpark DataFrame APIs*.

*1*, Choose a subset of DF APIs
Since the review bandwidth is limited, we recommend that each PR cover at least 
5 APIs;

*2*, For each API, copy-paste the function (including the function signature and 
docstring) into an LLM, and ask it to refine the docstring with a prompt (e.g. 
the attached prompt); you can of course use or design your own prompt.
For prompt engineering, you can refer to these [Best 
practices|https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api]

It is highly recommended to use *GPT-4* instead of GPT-3.5, since the former 
generates better results. 

*3*, Note that the LLM is not 100% reliable; the generated docstring may still 
contain some mistakes, e.g.
* The example code cannot run
* The example results are incorrect
* The example code doesn't reflect the example title
* The description uses a wrong version, or adds a 'Raises' section for a 
non-existent exception
* The lint check can be broken
* ...

We need to fix them before sending a PR.

We can try different prompts, choose the good parts, and combine them into the 
new docstring.

  was:
Let's first focus on the documentation of *PySpark DataFrame APIs*.

*1*, Choose a subset of DF APIs
Since the review bandwidth is limited, we recommend that each PR cover at least 
5 APIs;

*2*, For each API, copy-paste the function (including the function signature and 
docstring) into an LLM, and ask it to refine the document with prompts like:
* please improve the docstring of the 'unionByName' function
* please refine the comments of the 'unionByName' function
* please refine the documents of the 'unionByName' function, and add more 
examples
* please provide more examples for function 'unionByName'
* ...

It is highly recommended to use *GPT-4* instead of GPT-3.5, since the former 
generates better results. 

*3*, Note that the LLM is not 100% reliable; the generated docstring may 
contain some mistakes, e.g.
* The example code cannot run
* The example results are incorrect
* The example code doesn't reflect the example title
* The description uses a wrong version, or adds a 'Raises' section for a 
non-existent exception
* The lint check can be broken
* ...

We need to fix them before sending a PR.

We can try different prompts, choose the good parts, and combine them into the 
new docstring.


> Refine the documents with LLM
> -
>
> Key: SPARK-44564
> URL: https://issues.apache.org/jira/browse/SPARK-44564
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
> Attachments: docstr_prompt.py
>
>
> Let's first focus on the documentation of *PySpark DataFrame APIs*.
> *1*, Choose a subset of DF APIs
> Since the review bandwidth is limited, we recommend that each PR cover at 
> least 5 APIs;
> *2*, For each API, copy-paste the function (including the function signature 
> and docstring) into an LLM, and ask it to refine the docstring with a prompt 
> (e.g. the attached prompt); you can of course use or design your own prompt.
> For prompt engineering, you can refer to these [Best 
> practices|https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api]
>  
> It is highly recommended to use *GPT-4* instead of GPT-3.5, since the former 
> generates better results. 
> *3*, Note that the LLM is not 100% reliable; the generated docstring may 
> still contain some mistakes, e.g.
> * The example code cannot run
> * The example results are incorrect
> * The example code doesn't reflect the example title
> * The description uses a wrong version, or adds a 'Raises' section for a 
> non-existent exception
> * The lint check can be broken
> * ...
> We need to fix them before sending a PR.
> We can try different prompts, choose the good parts, and combine them into 
> the new docstring.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44564) Refine the documents with LLM

2023-08-08 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-44564:
--
Attachment: docstr_prompt.py

> Refine the documents with LLM
> -
>
> Key: SPARK-44564
> URL: https://issues.apache.org/jira/browse/SPARK-44564
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
> Attachments: docstr_prompt.py
>
>
> Let's first focus on the documentation of *PySpark DataFrame APIs*.
> *1*, Choose a subset of DF APIs
> Since the review bandwidth is limited, we recommend that each PR cover at 
> least 5 APIs;
> *2*, For each API, copy-paste the function (including the function signature 
> and docstring) into an LLM, and ask it to refine the docstring with a prompt 
> (e.g. the attached prompt); you can of course use or design your own prompt.
> For prompt engineering, you can refer to these [Best 
> practices|https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api]
>  
> It is highly recommended to use *GPT-4* instead of GPT-3.5, since the former 
> generates better results. 
> *3*, Note that the LLM is not 100% reliable; the generated docstring may 
> still contain some mistakes, e.g.
> * The example code cannot run
> * The example results are incorrect
> * The example code doesn't reflect the example title
> * The description uses a wrong version, or adds a 'Raises' section for a 
> non-existent exception
> * The lint check can be broken
> * ...
> We need to fix them before sending a PR.
> We can try different prompts, choose the good parts, and combine them into 
> the new docstring.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44717) "pyspark.pandas.resample" is incorrect when DST is overlapped and setting "spark.sql.timestampType" to TIMESTAMP_NTZ does not help

2023-08-08 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752006#comment-17752006
 ] 

Hyukjin Kwon commented on SPARK-44717:
--

Made a quick fix: https://github.com/apache/spark/pull/42392. I believe there 
are a lot of other corner cases like this, but the PR fixes this one alone for 
now.

> "pyspark.pandas.resample" is incorrect when DST is overlapped and setting 
> "spark.sql.timestampType" to TIMESTAMP_NTZ does not help
> --
>
> Key: SPARK-44717
> URL: https://issues.apache.org/jira/browse/SPARK-44717
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0, 3.4.1, 4.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Use one of the existing tests:
> - "11H" case of test_dataframe_resample 
> (pyspark.pandas.tests.test_resample.ResampleTests) 
> - "1001H" case of test_series_resample 
> (pyspark.pandas.tests.test_resample.ResampleTests) 
> After setting the TZ to, for example, New York (e.g. by using the following 
> python code in a "setUpClass"):
> {noformat}
> os.environ["TZ"] = 'America/New_York'
> {noformat}
> you will get the following error for the latter test:
> {noformat}
> ==
> FAIL [4.219s]: test_series_resample 
> (pyspark.pandas.tests.test_resample.ResampleTests)
> --
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 
> 276, in test_series_resample
> self._test_resample(self.pdf3.A, self.psdf3.A, ["1001H"], "right", 
> "right", "sum")
>   File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 
> 259, in _test_resample
> self.assert_eq(
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 457, in 
> assert_eq
> _assert_pandas_almost_equal(lobj, robj)
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 228, in 
> _assert_pandas_almost_equal
> raise PySparkAssertionError(
> pyspark.errors.exceptions.base.PySparkAssertionError: 
> [DIFFERENT_PANDAS_SERIES] Series are not almost equal:
> Left:
> Freq: 1001H
> float64
> Right:
> float64
> {noformat}
> The problem is that the pyspark resample produces more resampled rows 
> in the result. The DST change causes those extra rows, as the computed 
> __tmp_resample_bin_col__ looks something like:
> {noformat}
> | __index_level_0__  | __tmp_resample_bin_col__ | A
> .
> |2011-03-08 00:00:00|2011-03-26 11:00:00 |0.3980551570183919  |
> |2011-03-09 00:00:00|2011-03-26 11:00:00 |0.6511376673995046  |
> |2011-03-10 00:00:00|2011-03-26 11:00:00 |0.6141085426890365  |
> |2011-03-11 00:00:00|2011-03-26 11:00:00 |0.11557638066163867 |
> |2011-03-12 00:00:00|2011-03-26 11:00:00 |0.4517788243490799  |
> |2011-03-13 00:00:00|2011-03-26 11:00:00 |0.8637060550157284  |
> |2011-03-14 00:00:00|2011-03-26 10:00:00 |0.8169499149450166  |
> |2011-03-15 00:00:00|2011-03-26 10:00:00 |0.4585916249356583  |
> |2011-03-16 00:00:00|2011-03-26 10:00:00 |0.8362472880832088  |
> |2011-03-17 00:00:00|2011-03-26 10:00:00 |0.026716901748386812|
> |2011-03-18 00:00:00|2011-03-26 10:00:00 |0.9086816462089563  |
> {noformat}
> You can see the extra rows around where DST kicked in on 2011-03-13 in 
> New York.
> Even setting the conf "spark.sql.timestampType" to "TIMESTAMP_NTZ" does not 
> help.
> You can see my tests here:
> https://github.com/attilapiros/spark/pull/5
> Pandas timestamps are TZ-less:
> {noformat}
> import pandas as pd
> a = pd.Timestamp(year=2011, month=3, day=13, hour=1)
> b = pd.Timedelta(hours=1)
> >> a 
> Timestamp('2011-03-13 01:00:00')
> >>> a+b
> Timestamp('2011-03-13 02:00:00')
> >>> a+b+b
> Timestamp('2011-03-13 03:00:00')
> {noformat}
> But pyspark TimestampType uses TZ and DST:
> {noformat}
> >>> sql("select  TIMESTAMP '2011-03-13 01:00:00'").show()
> +---+
> |TIMESTAMP '2011-03-13 01:00:00'|
> +---+
> |2011-03-13 01:00:00|
> +---+
> >>> sql("select  TIMESTAMP '2011-03-13 01:00:00' + 
> >>> make_interval(0,0,0,0,1,0,0)").show()
> ++
> |TIMESTAMP '2011-03-13 01:00:00' + make_interval(0, 0, 0, 0, 1, 0, 0)|
> ++
> | 2011-03-13 03:00:00|
> ++
> {noformat}
> The current resample code uses the above interval based calculation.

[jira] [Assigned] (SPARK-44236) Even `spark.sql.codegen.factoryMode` is NO_CODEGEN, the WholeStageCodegen also will be generated.

2023-08-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-44236:
---

Assignee: Jia Fan

> Even `spark.sql.codegen.factoryMode` is NO_CODEGEN, the WholeStageCodegen 
> also will be generated.
> -
>
> Key: SPARK-44236
> URL: https://issues.apache.org/jira/browse/SPARK-44236
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Jia Fan
>Assignee: Jia Fan
>Priority: Major
>
> Even when `spark.sql.codegen.factoryMode` is NO_CODEGEN, Spark still generates 
> a WholeStageCodegen plan when `spark.sql.codegen.wholeStage` is set to `true`.
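
A minimal way to observe the behavior described above (a sketch only; 
`spark.sql.codegen.factoryMode` is an internal conf, so the exact names and 
behavior may vary across versions):

{code:scala}
// Sketch: even with the codegen factory mode forced to NO_CODEGEN, the physical
// plan still contains WholeStageCodegen nodes while whole-stage codegen is on.
spark.conf.set("spark.sql.codegen.factoryMode", "NO_CODEGEN") // internal conf
spark.conf.set("spark.sql.codegen.wholeStage", "true")

spark.range(10).selectExpr("id * 2 AS doubled").explain()
// Per this report, the printed plan still shows a WholeStageCodegen node.
{code}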



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44236) Even `spark.sql.codegen.factoryMode` is NO_CODEGEN, the WholeStageCodegen also will be generated.

2023-08-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-44236.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41779
[https://github.com/apache/spark/pull/41779]

> Even `spark.sql.codegen.factoryMode` is NO_CODEGEN, the WholeStageCodegen 
> also will be generated.
> -
>
> Key: SPARK-44236
> URL: https://issues.apache.org/jira/browse/SPARK-44236
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Jia Fan
>Assignee: Jia Fan
>Priority: Major
> Fix For: 3.5.0
>
>
> Even when `spark.sql.codegen.factoryMode` is NO_CODEGEN, Spark still generates 
> a WholeStageCodegen plan when `spark.sql.codegen.wholeStage` is set to `true`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44714) Ease restriction of LCA resolution regarding queries with having

2023-08-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-44714.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 42276
[https://github.com/apache/spark/pull/42276]

> Ease restriction of LCA resolution regarding queries with having
> 
>
> Key: SPARK-44714
> URL: https://issues.apache.org/jira/browse/SPARK-44714
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.2, 3.5.0
>Reporter: Xinyi Yu
>Assignee: Xinyi Yu
>Priority: Major
> Fix For: 3.5.0
>
>
> Current LCA resolution has a limitation: it can't resolve a query 
> when it satisfies all of the following criteria:
>  # the main (outer) query has a HAVING clause
>  # there is a window expression in the query
>  # in the same SELECT list as the window expression in 2), there is an LCA
> This is because LCA won't rewrite the plan until UNRESOLVED_HAVING is resolved; 
> window expressions won't be extracted until LCAs in the same SELECT lists are 
> rewritten; however, UNRESOLVED_HAVING depends on the child being resolved, 
> which could include the Window. It becomes a deadlock.
> *We should ease some limitations on the LCA resolution regarding HAVING, to 
> break the deadlock for most cases.*
> For example, for the following query:
> {code:java}
> create table t (col boolean) using orc;
> with w AS (
>   select min(col) over () as min_alias,
>   min_alias as col_alias
>   FROM t
> )
> select col_alias
> from w
> having count > 0;
> {code}
>  
> It now throws a confusing error message:
> {code:java}
> [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
> `col_alias` 
> cannot be resolved. Did you mean one of the following? [`col_alias`, 
> `min_alias`].{code}
> The LCA and window are in a CTE that is completely unrelated to the HAVING. 
> The LCA should be resolved in this case.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44714) Ease restriction of LCA resolution regarding queries with having

2023-08-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-44714:
---

Assignee: Xinyi Yu

> Ease restriction of LCA resolution regarding queries with having
> 
>
> Key: SPARK-44714
> URL: https://issues.apache.org/jira/browse/SPARK-44714
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.2, 3.5.0
>Reporter: Xinyi Yu
>Assignee: Xinyi Yu
>Priority: Major
>
> Current LCA resolution has a limitation: it can't resolve a query 
> when it satisfies all of the following criteria:
>  # the main (outer) query has a HAVING clause
>  # there is a window expression in the query
>  # in the same SELECT list as the window expression in 2), there is an LCA
> This is because LCA won't rewrite the plan until UNRESOLVED_HAVING is resolved; 
> window expressions won't be extracted until LCAs in the same SELECT lists are 
> rewritten; however, UNRESOLVED_HAVING depends on the child being resolved, 
> which could include the Window. It becomes a deadlock.
> *We should ease some limitations on the LCA resolution regarding HAVING, to 
> break the deadlock for most cases.*
> For example, for the following query:
> {code:java}
> create table t (col boolean) using orc;
> with w AS (
>   select min(col) over () as min_alias,
>   min_alias as col_alias
>   FROM t
> )
> select col_alias
> from w
> having count > 0;
> {code}
>  
> It now throws a confusing error message:
> {code:java}
> [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
> `col_alias` 
> cannot be resolved. Did you mean one of the following? [`col_alias`, 
> `min_alias`].{code}
> The LCA and window are in a CTE that is completely unrelated to the HAVING. 
> The LCA should be resolved in this case.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44714) Ease restriction of LCA resolution regarding queries with having

2023-08-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17751966#comment-17751966
 ] 

ASF GitHub Bot commented on SPARK-44714:


User 'anchovYu' has created a pull request for this issue:
https://github.com/apache/spark/pull/42276

> Ease restriction of LCA resolution regarding queries with having
> 
>
> Key: SPARK-44714
> URL: https://issues.apache.org/jira/browse/SPARK-44714
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.2, 3.5.0
>Reporter: Xinyi Yu
>Priority: Major
>
> Current LCA resolution has a limitation: it can't resolve a query 
> when it satisfies all of the following criteria:
>  # the main (outer) query has a HAVING clause
>  # there is a window expression in the query
>  # in the same SELECT list as the window expression in 2), there is an LCA
> This is because LCA won't rewrite the plan until UNRESOLVED_HAVING is resolved; 
> window expressions won't be extracted until LCAs in the same SELECT lists are 
> rewritten; however, UNRESOLVED_HAVING depends on the child being resolved, 
> which could include the Window. It becomes a deadlock.
> *We should ease some limitations on the LCA resolution regarding HAVING, to 
> break the deadlock for most cases.*
> For example, for the following query:
> {code:java}
> create table t (col boolean) using orc;
> with w AS (
>   select min(col) over () as min_alias,
>   min_alias as col_alias
>   FROM t
> )
> select col_alias
> from w
> having count > 0;
> {code}
>  
> It now throws a confusing error message:
> {code:java}
> [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
> `col_alias` 
> cannot be resolved. Did you mean one of the following? [`col_alias`, 
> `min_alias`].{code}
> The LCA and window are in a CTE that is completely unrelated to the HAVING. 
> The LCA should be resolved in this case.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-44717) "pyspark.pandas.resample" is incorrect when DST is overlapped and setting "spark.sql.timestampType" to TIMESTAMP_NTZ does not help

2023-08-08 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17751962#comment-17751962
 ] 

Attila Zsolt Piros edited comment on SPARK-44717 at 8/8/23 8:56 AM:


The TIMESTAMP_NTZ would work for sure.

Here is the test:

{noformat}
$ export TZ="America/New_York"
$ ./bin/pyspark

>>> sql("select  TIMESTAMP_NTZ '2011-03-13 01:00:00' + 
>>> make_interval(0,0,0,0,1,0,0)").show()

++
|TIMESTAMP_NTZ '2011-03-13 01:00:00' + make_interval(0, 0, 0, 0, 1, 0, 0)|
++
| 2011-03-13 02:00:00|
++

>>> sql("select  TIMESTAMP '2011-03-13 01:00:00' + 
>>> make_interval(0,0,0,0,1,0,0)").show()
++
|TIMESTAMP '2011-03-13 01:00:00' + make_interval(0, 0, 0, 0, 1, 0, 0)|
++
| 2011-03-13 03:00:00|
++
{noformat}




was (Author: attilapiros):
The TIMESTAMP_NTZ would work for sure.

Here is the test:

{noformat}
$ TZ="America/New_York"
$ ./bin/pyspark

>>> sql("select  TIMESTAMP_NTZ '2011-03-13 01:00:00' + 
>>> make_interval(0,0,0,0,1,0,0)").show()

++
|TIMESTAMP_NTZ '2011-03-13 01:00:00' + make_interval(0, 0, 0, 0, 1, 0, 0)|
++
| 2011-03-13 02:00:00|
++

>>> sql("select  TIMESTAMP '2011-03-13 01:00:00' + 
>>> make_interval(0,0,0,0,1,0,0)").show()
++
|TIMESTAMP '2011-03-13 01:00:00' + make_interval(0, 0, 0, 0, 1, 0, 0)|
++
| 2011-03-13 03:00:00|
++
{noformat}



> "pyspark.pandas.resample" is incorrect when DST is overlapped and setting 
> "spark.sql.timestampType" to TIMESTAMP_NTZ does not help
> --
>
> Key: SPARK-44717
> URL: https://issues.apache.org/jira/browse/SPARK-44717
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0, 3.4.1, 4.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Use one of the existing tests:
> - "11H" case of test_dataframe_resample 
> (pyspark.pandas.tests.test_resample.ResampleTests) 
> - "1001H" case of test_series_resample 
> (pyspark.pandas.tests.test_resample.ResampleTests) 
> After setting the TZ to, for example, New York (e.g. by using the following 
> python code in a "setUpClass"):
> {noformat}
> os.environ["TZ"] = 'America/New_York'
> {noformat}
> you will get the following error for the latter test:
> {noformat}
> ==
> FAIL [4.219s]: test_series_resample 
> (pyspark.pandas.tests.test_resample.ResampleTests)
> --
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 
> 276, in test_series_resample
> self._test_resample(self.pdf3.A, self.psdf3.A, ["1001H"], "right", 
> "right", "sum")
>   File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 
> 259, in _test_resample
> self.assert_eq(
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 457, in 
> assert_eq
> _assert_pandas_almost_equal(lobj, robj)
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 228, in 
> _assert_pandas_almost_equal
> raise PySparkAssertionError(
> pyspark.errors.exceptions.base.PySparkAssertionError: 
> [DIFFERENT_PANDAS_SERIES] Series are not almost equal:
> Left:
> Freq: 1001H
> float64
> Right:
> float64
> {noformat}
> The problem is that the pyspark resample produces more resampled rows 
> in the result. The DST change causes those extra rows, as the computed 
> __tmp_resample_bin_col__ looks something like:
> {noformat}
> | __index_level_0__  | __tmp_resample_bin_col__ | A
> .
> |2011-03-08 00:00:00|2011-03-26 11:00:00 |0.3980551570183919  |
> |2011-03-09 00:00:00|2011-03-26 11:00:00 |0.6511376673995046  |
> 

[jira] [Commented] (SPARK-44717) "pyspark.pandas.resample" is incorrect when DST is overlapped and setting "spark.sql.timestampType" to TIMESTAMP_NTZ does not help

2023-08-08 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17751962#comment-17751962
 ] 

Attila Zsolt Piros commented on SPARK-44717:


The TIMESTAMP_NTZ would work for sure.

Here is the test:

{noformat}
$ TZ="America/New_York"
$ ./bin/pyspark

>>> sql("select  TIMESTAMP_NTZ '2011-03-13 01:00:00' + 
>>> make_interval(0,0,0,0,1,0,0)").show()

++
|TIMESTAMP_NTZ '2011-03-13 01:00:00' + make_interval(0, 0, 0, 0, 1, 0, 0)|
++
| 2011-03-13 02:00:00|
++

>>> sql("select  TIMESTAMP '2011-03-13 01:00:00' + 
>>> make_interval(0,0,0,0,1,0,0)").show()
++
|TIMESTAMP '2011-03-13 01:00:00' + make_interval(0, 0, 0, 0, 1, 0, 0)|
++
| 2011-03-13 03:00:00|
++
{noformat}



> "pyspark.pandas.resample" is incorrect when DST is overlapped and setting 
> "spark.sql.timestampType" to TIMESTAMP_NTZ does not help
> --
>
> Key: SPARK-44717
> URL: https://issues.apache.org/jira/browse/SPARK-44717
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0, 3.4.1, 4.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Use one of the existing tests:
> - "11H" case of test_dataframe_resample 
> (pyspark.pandas.tests.test_resample.ResampleTests) 
> - "1001H" case of test_series_resample 
> (pyspark.pandas.tests.test_resample.ResampleTests) 
> After setting the TZ to, for example, New York (e.g. by using the following 
> python code in a "setUpClass"):
> {noformat}
> os.environ["TZ"] = 'America/New_York'
> {noformat}
> you will get the following error for the latter test:
> {noformat}
> ==
> FAIL [4.219s]: test_series_resample 
> (pyspark.pandas.tests.test_resample.ResampleTests)
> --
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 
> 276, in test_series_resample
> self._test_resample(self.pdf3.A, self.psdf3.A, ["1001H"], "right", 
> "right", "sum")
>   File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 
> 259, in _test_resample
> self.assert_eq(
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 457, in 
> assert_eq
> _assert_pandas_almost_equal(lobj, robj)
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 228, in 
> _assert_pandas_almost_equal
> raise PySparkAssertionError(
> pyspark.errors.exceptions.base.PySparkAssertionError: 
> [DIFFERENT_PANDAS_SERIES] Series are not almost equal:
> Left:
> Freq: 1001H
> float64
> Right:
> float64
> {noformat}
> The problem is that the pyspark resample produces more resampled rows 
> in the result. The DST change causes those extra rows, as the computed 
> __tmp_resample_bin_col__ looks something like:
> {noformat}
> | __index_level_0__  | __tmp_resample_bin_col__ | A
> .
> |2011-03-08 00:00:00|2011-03-26 11:00:00 |0.3980551570183919  |
> |2011-03-09 00:00:00|2011-03-26 11:00:00 |0.6511376673995046  |
> |2011-03-10 00:00:00|2011-03-26 11:00:00 |0.6141085426890365  |
> |2011-03-11 00:00:00|2011-03-26 11:00:00 |0.11557638066163867 |
> |2011-03-12 00:00:00|2011-03-26 11:00:00 |0.4517788243490799  |
> |2011-03-13 00:00:00|2011-03-26 11:00:00 |0.8637060550157284  |
> |2011-03-14 00:00:00|2011-03-26 10:00:00 |0.8169499149450166  |
> |2011-03-15 00:00:00|2011-03-26 10:00:00 |0.4585916249356583  |
> |2011-03-16 00:00:00|2011-03-26 10:00:00 |0.8362472880832088  |
> |2011-03-17 00:00:00|2011-03-26 10:00:00 |0.026716901748386812|
> |2011-03-18 00:00:00|2011-03-26 10:00:00 |0.9086816462089563  |
> {noformat}
> You can see the extra rows around where DST kicked in on 2011-03-13 in 
> New York.
> Even setting the conf "spark.sql.timestampType" to "TIMESTAMP_NTZ" does not 
> help.
> You can see my tests here:
> https://github.com/attilapiros/spark/pull/5
> Pandas timestamps are TZ-less:
> {noformat}
> import pandas as pd
> a = pd.Timestamp(year=2011, month=3, day=13, hour=1)
> b = pd.Timedelta(hours=1)
> >> a 
> Timestamp('2011-03-13 01:00:00')
> >>> a+b
> Timestamp('2011-03-13 02:00:00')
> 

[jira] [Resolved] (SPARK-44657) Incorrect limit handling and config parsing in Arrow collect

2023-08-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44657.
--
Fix Version/s: 3.5.0
   4.0.0
   3.4.2
   Resolution: Fixed

Issue resolved by pull request 42321
[https://github.com/apache/spark/pull/42321]

> Incorrect limit handling and config parsing in Arrow collect
> 
>
> Key: SPARK-44657
> URL: https://issues.apache.org/jira/browse/SPARK-44657
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.4.2, 3.4.0, 3.4.1, 3.5.0
>Reporter: Venkata Sai Akhil Gudesa
>Assignee: Venkata Sai Akhil Gudesa
>Priority: Major
> Fix For: 3.5.0, 4.0.0, 3.4.2
>
>
> In the arrow writer 
> [code|https://github.com/apache/spark/blob/6161bf44f40f8146ea4c115c788fd4eaeb128769/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala#L154-L163]
>  , the conditions don't seem to match what the documentation says regarding 
> "{_}maxBatchSize and maxRecordsPerBatch, respect whatever smaller{_}", since 
> the code actually respects whichever conf is "larger" (i.e. less 
> restrictive) due to the _||_ operator.
>  
> Further, when the `{_}CONNECT_GRPC_ARROW_MAX_BATCH_SIZE{_}` conf is read, the 
> value is not converted to bytes from MiB 
> ([example|https://github.com/apache/spark/blob/3e5203c64c06cc8a8560dfa0fb6f52e74589b583/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/SparkConnectPlanExecution.scala#L103]).
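
A rough sketch of the two fixes described above, using hypothetical names rather 
than the actual ArrowConverters code:

{code:scala}
// Sketch only (hypothetical names): respect whichever limit is smaller by
// continuing a batch only while BOTH limits still have headroom (&& rather
// than ||), and convert the MiB-valued conf to bytes before comparing sizes.
def shouldContinueBatch(
    currentBatchBytes: Long,
    currentBatchRows: Long,
    maxBatchSizeMiB: Long,
    maxRecordsPerBatch: Long): Boolean = {
  val maxBatchSizeBytes = maxBatchSizeMiB * 1024L * 1024L // MiB -> bytes
  currentBatchBytes < maxBatchSizeBytes && currentBatchRows < maxRecordsPerBatch
}
{code}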



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44657) Incorrect limit handling and config parsing in Arrow collect

2023-08-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-44657:


Assignee: Venkata Sai Akhil Gudesa

> Incorrect limit handling and config parsing in Arrow collect
> 
>
> Key: SPARK-44657
> URL: https://issues.apache.org/jira/browse/SPARK-44657
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.4.2, 3.4.0, 3.4.1, 3.5.0
>Reporter: Venkata Sai Akhil Gudesa
>Assignee: Venkata Sai Akhil Gudesa
>Priority: Major
>
> In the arrow writer 
> [code|https://github.com/apache/spark/blob/6161bf44f40f8146ea4c115c788fd4eaeb128769/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala#L154-L163]
>  , the conditions don't seem to match what the documentation says regarding 
> "{_}maxBatchSize and maxRecordsPerBatch, respect whatever smaller{_}", since 
> the code actually respects whichever conf is "larger" (i.e. less 
> restrictive) due to the _||_ operator.
>  
> Further, when the `{_}CONNECT_GRPC_ARROW_MAX_BATCH_SIZE{_}` conf is read, the 
> value is not converted to bytes from MiB 
> ([example|https://github.com/apache/spark/blob/3e5203c64c06cc8a8560dfa0fb6f52e74589b583/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/SparkConnectPlanExecution.scala#L103]).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44680) parameter markers are not blocked from DEFAULT (and other places)

2023-08-08 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-44680.
--
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42365
[https://github.com/apache/spark/pull/42365]

> parameter markers are not blocked from DEFAULT (and other places)
> -
>
> Key: SPARK-44680
> URL: https://issues.apache.org/jira/browse/SPARK-44680
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> scala> spark.sql("CREATE TABLE t11(c1 int default :parm)", args = Map("parm" 
> -> 5)).show()
> -> success
> scala> spark.sql("describe t11");
> [INVALID_DEFAULT_VALUE.UNRESOLVED_EXPRESSION] Failed to execute 
> EXISTS_DEFAULT command because the destination table column `c1` has a 
> DEFAULT value :parm, which fails to resolve as a valid expression.
> This likely extends to other DDL-y places.
> I can only find protection against placement in the body of a CREATE VIEW.
> I see two ways out of this:
> * Raise an error (as we do for CREATE VIEW v1(c1) AS SELECT ? )
>  * Improve the way we persist queries/expressions to substitute the 
> at-DDL-time bound parameter value (it's not a bug, it's a feature)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44680) parameter markers are not blocked from DEFAULT (and other places)

2023-08-08 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-44680:


Assignee: Max Gekk

> parameter markers are not blocked from DEFAULT (and other places)
> -
>
> Key: SPARK-44680
> URL: https://issues.apache.org/jira/browse/SPARK-44680
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Assignee: Max Gekk
>Priority: Major
>
> scala> spark.sql("CREATE TABLE t11(c1 int default :parm)", args = Map("parm" 
> -> 5)).show()
> -> success
> scala> spark.sql("describe t11");
> [INVALID_DEFAULT_VALUE.UNRESOLVED_EXPRESSION] Failed to execute 
> EXISTS_DEFAULT command because the destination table column `c1` has a 
> DEFAULT value :parm, which fails to resolve as a valid expression.
> This likely extends to other DDL-y places.
> I can only find protection against placement in the body of a CREATE VIEW.
> I see two ways out of this:
> * Raise an error (as we do for CREATE VIEW v1(c1) AS SELECT ? )
>  * Improve the way we persist queries/expressions to substitute the 
> at-DDL-time bound parameter value (it's not a bug, it's a feature)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44717) "pyspark.pandas.resample" is incorrect when DST is overlapped and setting "spark.sql.timestampType" to TIMESTAMP_NTZ does not help

2023-08-08 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17751928#comment-17751928
 ] 

Hyukjin Kwon commented on SPARK-44717:
--

[~attilapiros] which time zone are you in? Would you mind trying this one below:

{code}
sql("select  TIMESTAMP_NTZ '2011-03-13 01:00:00' + 
make_interval(0,0,0,0,1,0,0)").show()
{code}

> "pyspark.pandas.resample" is incorrect when DST is overlapped and setting 
> "spark.sql.timestampType" to TIMESTAMP_NTZ does not help
> --
>
> Key: SPARK-44717
> URL: https://issues.apache.org/jira/browse/SPARK-44717
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0, 3.4.1, 4.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Use one of the existing tests:
> - "11H" case of test_dataframe_resample 
> (pyspark.pandas.tests.test_resample.ResampleTests) 
> - "1001H" case of test_series_resample 
> (pyspark.pandas.tests.test_resample.ResampleTests) 
> After setting the TZ to, for example, New York (e.g. by using the following 
> python code in a "setUpClass"):
> {noformat}
> os.environ["TZ"] = 'America/New_York'
> {noformat}
> you will get the following error for the latter test:
> {noformat}
> ==
> FAIL [4.219s]: test_series_resample 
> (pyspark.pandas.tests.test_resample.ResampleTests)
> --
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 
> 276, in test_series_resample
> self._test_resample(self.pdf3.A, self.psdf3.A, ["1001H"], "right", 
> "right", "sum")
>   File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 
> 259, in _test_resample
> self.assert_eq(
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 457, in 
> assert_eq
> _assert_pandas_almost_equal(lobj, robj)
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 228, in 
> _assert_pandas_almost_equal
> raise PySparkAssertionError(
> pyspark.errors.exceptions.base.PySparkAssertionError: 
> [DIFFERENT_PANDAS_SERIES] Series are not almost equal:
> Left:
> Freq: 1001H
> float64
> Right:
> float64
> {noformat}
> The problem is that the pyspark resample produces more resampled rows 
> in the result. The DST change causes those extra rows, as the computed 
> __tmp_resample_bin_col__ looks something like:
> {noformat}
> | __index_level_0__  | __tmp_resample_bin_col__ | A
> .
> |2011-03-08 00:00:00|2011-03-26 11:00:00 |0.3980551570183919  |
> |2011-03-09 00:00:00|2011-03-26 11:00:00 |0.6511376673995046  |
> |2011-03-10 00:00:00|2011-03-26 11:00:00 |0.6141085426890365  |
> |2011-03-11 00:00:00|2011-03-26 11:00:00 |0.11557638066163867 |
> |2011-03-12 00:00:00|2011-03-26 11:00:00 |0.4517788243490799  |
> |2011-03-13 00:00:00|2011-03-26 11:00:00 |0.8637060550157284  |
> |2011-03-14 00:00:00|2011-03-26 10:00:00 |0.8169499149450166  |
> |2011-03-15 00:00:00|2011-03-26 10:00:00 |0.4585916249356583  |
> |2011-03-16 00:00:00|2011-03-26 10:00:00 |0.8362472880832088  |
> |2011-03-17 00:00:00|2011-03-26 10:00:00 |0.026716901748386812|
> |2011-03-18 00:00:00|2011-03-26 10:00:00 |0.9086816462089563  |
> {noformat}
> You can see the extra rows around where DST kicked in on 2011-03-13 in 
> New York.
> Even setting the conf "spark.sql.timestampType" to "TIMESTAMP_NTZ" does not 
> help.
> You can see my tests here:
> https://github.com/attilapiros/spark/pull/5
> Pandas timestamps are TZ-less:
> {noformat}
> import pandas as pd
> a = pd.Timestamp(year=2011, month=3, day=13, hour=1)
> b = pd.Timedelta(hours=1)
> >> a 
> Timestamp('2011-03-13 01:00:00')
> >>> a+b
> Timestamp('2011-03-13 02:00:00')
> >>> a+b+b
> Timestamp('2011-03-13 03:00:00')
> {noformat}
> But pyspark TimestampType uses TZ and DST:
> {noformat}
> >>> sql("select  TIMESTAMP '2011-03-13 01:00:00'").show()
> +---+
> |TIMESTAMP '2011-03-13 01:00:00'|
> +---+
> |2011-03-13 01:00:00|
> +---+
> >>> sql("select  TIMESTAMP '2011-03-13 01:00:00' + 
> >>> make_interval(0,0,0,0,1,0,0)").show()
> ++
> |TIMESTAMP '2011-03-13 01:00:00' + make_interval(0, 0, 0, 0, 1, 0, 0)|
> ++
> | 2011-03-13 03:00:00|
> ++
> {noformat}
> The current resample code uses the above 

[jira] [Resolved] (SPARK-43567) Enable CategoricalIndexTests.test_factorize for pandas 2.0.0.

2023-08-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43567.
--
Fix Version/s: 4.0.0
 Assignee: Haejoon Lee
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/42270

> Enable CategoricalIndexTests.test_factorize for pandas 2.0.0.
> -
>
> Key: SPARK-43567
> URL: https://issues.apache.org/jira/browse/SPARK-43567
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 4.0.0
>
>
> Enable CategoricalIndexTests.test_factorize for pandas 2.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org