[jira] [Commented] (SPARK-33075) Only disable auto bucketed scan for cached query

2020-10-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219527#comment-17219527
 ] 

Apache Spark commented on SPARK-33075:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/30138

> Only disable auto bucketed scan for cached query
> 
>
> Key: SPARK-33075
> URL: https://issues.apache.org/jira/browse/SPARK-33075
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Trivial
>
> As a follow-up to the discussion in 
> [https://github.com/apache/spark/pull/29804#discussion_r500033528], auto 
> bucketed scan is disabled by default due to a regression for cached queries. 
> As suggested by [~cloud_fan], we can enable auto bucketed scan globally, with 
> special handling for cached queries, similar to adaptive execution.
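For illustration, a minimal spark-shell sketch of the kind of per-query handling this implies. The config name follows the 3.1 auto bucketed scan work and the table name is made up; treat both as assumptions rather than anything confirmed by this ticket:

{code:scala}
// Assumed config name and table name -- not confirmed by this ticket.
// Keep auto bucketed scan on for regular queries...
spark.conf.set("spark.sql.sources.bucketing.autoBucketedScan.enabled", "true")
val df = spark.table("bucketed_table").where("id > 0")
df.explain()  // the plan may drop the bucketed scan when buckets bring no benefit

// ...but turn it off around caching so the cached plan keeps the bucketed scan
// (the regression described above), until the special handling lands.
spark.conf.set("spark.sql.sources.bucketing.autoBucketedScan.enabled", "false")
df.cache()
{code}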






[jira] [Assigned] (SPARK-33075) Only disable auto bucketed scan for cached query

2020-10-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33075:


Assignee: Apache Spark

> Only disable auto bucketed scan for cached query
> 
>
> Key: SPARK-33075
> URL: https://issues.apache.org/jira/browse/SPARK-33075
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Apache Spark
>Priority: Trivial
>
> As a follow-up to the discussion in 
> [https://github.com/apache/spark/pull/29804#discussion_r500033528], auto 
> bucketed scan is disabled by default due to a regression for cached queries. 
> As suggested by [~cloud_fan], we can enable auto bucketed scan globally, with 
> special handling for cached queries, similar to adaptive execution.






[jira] [Assigned] (SPARK-33075) Only disable auto bucketed scan for cached query

2020-10-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33075:


Assignee: (was: Apache Spark)

> Only disable auto bucketed scan for cached query
> 
>
> Key: SPARK-33075
> URL: https://issues.apache.org/jira/browse/SPARK-33075
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Trivial
>
> As a follow-up to the discussion in 
> [https://github.com/apache/spark/pull/29804#discussion_r500033528], auto 
> bucketed scan is disabled by default due to a regression for cached queries. 
> As suggested by [~cloud_fan], we can enable auto bucketed scan globally, with 
> special handling for cached queries, similar to adaptive execution.






[jira] [Commented] (SPARK-33075) Only disable auto bucketed scan for cached query

2020-10-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219529#comment-17219529
 ] 

Apache Spark commented on SPARK-33075:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/30138

> Only disable auto bucketed scan for cached query
> 
>
> Key: SPARK-33075
> URL: https://issues.apache.org/jira/browse/SPARK-33075
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Trivial
>
> As a follow-up to the discussion in 
> [https://github.com/apache/spark/pull/29804#discussion_r500033528], auto 
> bucketed scan is disabled by default due to a regression for cached queries. 
> As suggested by [~cloud_fan], we can enable auto bucketed scan globally, with 
> special handling for cached queries, similar to adaptive execution.






[jira] [Commented] (SPARK-31069) high cpu caused by chunksBeingTransferred in external shuffle service

2020-10-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219535#comment-17219535
 ] 

Apache Spark commented on SPARK-31069:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/30139

> high cpu caused by chunksBeingTransferred in external shuffle service
> -
>
> Key: SPARK-31069
> URL: https://issues.apache.org/jira/browse/SPARK-31069
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Xiaoju Wu
>Priority: Major
>
> "shuffle-chunk-fetch-handler-2-40" #250 daemon prio=5 os_prio=0 
> tid=0x02ac nid=0xb9b3 runnable [0x7ff20a1af000]
>java.lang.Thread.State: RUNNABLE
> at 
> java.util.concurrent.ConcurrentHashMap$Traverser.advance(ConcurrentHashMap.java:3339)
> at 
> java.util.concurrent.ConcurrentHashMap$ValueIterator.next(ConcurrentHashMap.java:3439)
> at 
> org.apache.spark.network.server.OneForOneStreamManager.chunksBeingTransferred(OneForOneStreamManager.java:184)
> at 
> org.apache.spark.network.server.ChunkFetchRequestHandler.channelRead0(ChunkFetchRequestHandler.java:85)
> at 
> org.apache.spark.network.server.ChunkFetchRequestHandler.channelRead0(ChunkFetchRequestHandler.java:51)
> at 
> org.spark_project.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:38)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:353)
> at 
> org.spark_project.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> at 
> org.spark_project.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> at 
> org.spark_project.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
> at 
> org.spark_project.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> org.spark_project.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> at java.lang.Thread.run(Thread.java:748)
>  
>  
>  
> "shuffle-chunk-fetch-handler-2-48" #235 daemon prio=5 os_prio=0 
> tid=0x7ff2302ec800 nid=0xb9ad runnable [0x7ff20a7b4000]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.spark.network.server.OneForOneStreamManager.chunksBeingTransferred(OneForOneStreamManager.java:186)
> at 
> org.apache.spark.network.server.ChunkFetchRequestHandler.channelRead0(ChunkFetchRequestHandler.java:85)
> at 
> org.apache.spark.network.server.ChunkFetchRequestHandler.channelRead0(ChunkFetchRequestHandler.java:51)
> at 
> org.spark_project.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:38)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:353)
> at 
> org.spark_project.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> at 
> org.spark_project.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> at 
> org.spark_project.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
> at 
> org.spark_project.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> org.spark_project.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> at java.lang.Thread.run(Thread.java:748)
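Both hot stacks sit inside OneForOneStreamManager.chunksBeingTransferred, which walks every registered stream (a ConcurrentHashMap traversal) on each incoming chunk-fetch request. A minimal sketch of one way to make that check O(1) -- keeping a running counter instead of recomputing the total -- shown only to illustrate the idea; the linked PR may take a different approach:

{code:scala}
import java.util.concurrent.atomic.AtomicLong

// Illustrative only: maintain the in-flight chunk count incrementally so the
// fetch-handler hot path never has to iterate all registered streams.
class ChunkTransferCounter {
  private val inFlight = new AtomicLong(0L)
  def chunkStarted(): Unit = inFlight.incrementAndGet()   // when a chunk begins transferring
  def chunkFinished(): Unit = inFlight.decrementAndGet()  // when the transfer completes or fails
  def chunksBeingTransferred: Long = inFlight.get()       // O(1) read instead of a full traversal
}
{code}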






[jira] [Commented] (SPARK-31069) high cpu caused by chunksBeingTransferred in external shuffle service

2020-10-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219534#comment-17219534
 ] 

Apache Spark commented on SPARK-31069:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/30139

> high cpu caused by chunksBeingTransferred in external shuffle service
> -
>
> Key: SPARK-31069
> URL: https://issues.apache.org/jira/browse/SPARK-31069
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Xiaoju Wu
>Priority: Major
>
> "shuffle-chunk-fetch-handler-2-40" #250 daemon prio=5 os_prio=0 
> tid=0x02ac nid=0xb9b3 runnable [0x7ff20a1af000]
>java.lang.Thread.State: RUNNABLE
> at 
> java.util.concurrent.ConcurrentHashMap$Traverser.advance(ConcurrentHashMap.java:3339)
> at 
> java.util.concurrent.ConcurrentHashMap$ValueIterator.next(ConcurrentHashMap.java:3439)
> at 
> org.apache.spark.network.server.OneForOneStreamManager.chunksBeingTransferred(OneForOneStreamManager.java:184)
> at 
> org.apache.spark.network.server.ChunkFetchRequestHandler.channelRead0(ChunkFetchRequestHandler.java:85)
> at 
> org.apache.spark.network.server.ChunkFetchRequestHandler.channelRead0(ChunkFetchRequestHandler.java:51)
> at 
> org.spark_project.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:38)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:353)
> at 
> org.spark_project.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> at 
> org.spark_project.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> at 
> org.spark_project.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
> at 
> org.spark_project.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> org.spark_project.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> at java.lang.Thread.run(Thread.java:748)
>  
>  
>  
> "shuffle-chunk-fetch-handler-2-48" #235 daemon prio=5 os_prio=0 
> tid=0x7ff2302ec800 nid=0xb9ad runnable [0x7ff20a7b4000]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.spark.network.server.OneForOneStreamManager.chunksBeingTransferred(OneForOneStreamManager.java:186)
> at 
> org.apache.spark.network.server.ChunkFetchRequestHandler.channelRead0(ChunkFetchRequestHandler.java:85)
> at 
> org.apache.spark.network.server.ChunkFetchRequestHandler.channelRead0(ChunkFetchRequestHandler.java:51)
> at 
> org.spark_project.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:38)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:353)
> at 
> org.spark_project.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> at 
> org.spark_project.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> at 
> org.spark_project.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
> at 
> org.spark_project.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> org.spark_project.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> at java.lang.Thread.run(Thread.java:748)






[jira] [Resolved] (SPARK-33145) In the Execution web page, when `Succeeded Job` has many child URL elements, they will extend over the edge of the page.

2020-10-23 Thread akiyamaneko (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

akiyamaneko resolved SPARK-33145.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

> In the Execution web page, when `Succeeded Job` has many child URL elements, 
> they will extend over the edge of the page. 
> 
>
> Key: SPARK-33145
> URL: https://issues.apache.org/jira/browse/SPARK-33145
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0, 3.0.1
>Reporter: akiyamaneko
>Assignee: akiyamaneko
>Priority: Minor
> Fix For: 3.1.0
>
> Attachments: Screenshot.png, Screenshot1.png
>
>
> Spark version: 3.0.1
> Problem: In the Execution web page, when *{color:#de350b}Succeeded Jobs (or 
> Failed Jobs){color}* has many child URL elements, they extend over the edge of 
> the page, as the attachment shows.






[jira] [Assigned] (SPARK-33104) Fix `YarnClusterSuite.yarn-cluster should respect conf overrides in SparkHadoopUtil`

2020-10-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33104:


Assignee: Hyukjin Kwon

> Fix `YarnClusterSuite.yarn-cluster should respect conf overrides in 
> SparkHadoopUtil`
> 
>
> Key: SPARK-33104
> URL: https://issues.apache.org/jira/browse/SPARK-33104
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Hyukjin Kwon
>Priority: Blocker
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7-hive-2.3/1377/testReport/org.apache.spark.deploy.yarn/YarnClusterSuite/yarn_cluster_should_respect_conf_overrides_in_SparkHadoopUtil__SPARK_16414__SPARK_23630_/
> {code}
> 20/10/09 05:18:13.211 ContainersLauncher #0 WARN DefaultContainerExecutor: 
> Exit code from container container_1602245728426_0006_02_01 is : 15
> 20/10/09 05:18:13.211 ContainersLauncher #0 WARN DefaultContainerExecutor: 
> Exception from container-launch with container ID: 
> container_1602245728426_0006_02_01 and exit code: 15
> ExitCodeException exitCode=15: 
>   at org.apache.hadoop.util.Shell.runCommand(Shell.java:585)
>   at org.apache.hadoop.util.Shell.run(Shell.java:482)
>   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20/10/09 05:18:13.211 ContainersLauncher #0 WARN ContainerLaunch: Container 
> exited with a non-zero exit code 15
> 20/10/09 05:18:13.237 AsyncDispatcher event handler WARN NMAuditLogger: 
> USER=jenkins  OPERATION=Container Finished - Failed   TARGET=ContainerImpl
> RESULT=FAILURE  DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE  
>   APPID=application_1602245728426_0006
> CONTAINERID=container_1602245728426_0006_02_01
> 20/10/09 05:18:13.244 Socket Reader #1 for port 37112 INFO Server: Auth 
> successful for appattempt_1602245728426_0006_02 (auth:SIMPLE)
> 20/10/09 05:18:13.326 IPC Parameter Sending Thread #0 DEBUG Client: IPC 
> Client (1123559518) connection to 
> amp-jenkins-worker-04.amp/192.168.10.24:43090 from jenkins sending #37
> 20/10/09 05:18:13.327 IPC Client (1123559518) connection to 
> amp-jenkins-worker-04.amp/192.168.10.24:43090 from jenkins DEBUG Client: IPC 
> Client (1123559518) connection to 
> amp-jenkins-worker-04.amp/192.168.10.24:43090 from jenkins got value #37
> 20/10/09 05:18:13.328 main DEBUG ProtobufRpcEngine: Call: 
> getApplicationReport took 2ms
> 20/10/09 05:18:13.328 main INFO Client: Application report for 
> application_1602245728426_0006 (state: FINISHED)
> 20/10/09 05:18:13.328 main DEBUG Client: 
>client token: N/A
>diagnostics: User class threw exception: 
> org.scalatest.exceptions.TestFailedException: null was not equal to 
> "testvalue"
>   at 
> org.scalatest.matchers.MatchersHelper$.indicateFailure(MatchersHelper.scala:344)
>   at 
> org.scalatest.matchers.should.Matchers$ShouldMethodHelperClass.shouldMatcher(Matchers.scala:6778)
>   at 
> org.scalatest.matchers.should.Matchers$AnyShouldWrapper.should(Matchers.scala:6822)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterDriverUseSparkHadoopUtilConf$.$anonfun$main$2(YarnClusterSuite.scala:383)
>   at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>   at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterDriverUseSparkHadoopUtilConf$.main(YarnClusterSuite.scala:382)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterDriverUseSparkHadoopUtilConf.main(YarnClusterSuite.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Metho

[jira] [Resolved] (SPARK-33104) Fix `YarnClusterSuite.yarn-cluster should respect conf overrides in SparkHadoopUtil`

2020-10-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33104.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30133
[https://github.com/apache/spark/pull/30133]

> Fix `YarnClusterSuite.yarn-cluster should respect conf overrides in 
> SparkHadoopUtil`
> 
>
> Key: SPARK-33104
> URL: https://issues.apache.org/jira/browse/SPARK-33104
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Hyukjin Kwon
>Priority: Blocker
> Fix For: 3.1.0
>
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7-hive-2.3/1377/testReport/org.apache.spark.deploy.yarn/YarnClusterSuite/yarn_cluster_should_respect_conf_overrides_in_SparkHadoopUtil__SPARK_16414__SPARK_23630_/
> {code}
> 20/10/09 05:18:13.211 ContainersLauncher #0 WARN DefaultContainerExecutor: 
> Exit code from container container_1602245728426_0006_02_01 is : 15
> 20/10/09 05:18:13.211 ContainersLauncher #0 WARN DefaultContainerExecutor: 
> Exception from container-launch with container ID: 
> container_1602245728426_0006_02_01 and exit code: 15
> ExitCodeException exitCode=15: 
>   at org.apache.hadoop.util.Shell.runCommand(Shell.java:585)
>   at org.apache.hadoop.util.Shell.run(Shell.java:482)
>   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20/10/09 05:18:13.211 ContainersLauncher #0 WARN ContainerLaunch: Container 
> exited with a non-zero exit code 15
> 20/10/09 05:18:13.237 AsyncDispatcher event handler WARN NMAuditLogger: 
> USER=jenkins  OPERATION=Container Finished - Failed   TARGET=ContainerImpl
> RESULT=FAILURE  DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE  
>   APPID=application_1602245728426_0006
> CONTAINERID=container_1602245728426_0006_02_01
> 20/10/09 05:18:13.244 Socket Reader #1 for port 37112 INFO Server: Auth 
> successful for appattempt_1602245728426_0006_02 (auth:SIMPLE)
> 20/10/09 05:18:13.326 IPC Parameter Sending Thread #0 DEBUG Client: IPC 
> Client (1123559518) connection to 
> amp-jenkins-worker-04.amp/192.168.10.24:43090 from jenkins sending #37
> 20/10/09 05:18:13.327 IPC Client (1123559518) connection to 
> amp-jenkins-worker-04.amp/192.168.10.24:43090 from jenkins DEBUG Client: IPC 
> Client (1123559518) connection to 
> amp-jenkins-worker-04.amp/192.168.10.24:43090 from jenkins got value #37
> 20/10/09 05:18:13.328 main DEBUG ProtobufRpcEngine: Call: 
> getApplicationReport took 2ms
> 20/10/09 05:18:13.328 main INFO Client: Application report for 
> application_1602245728426_0006 (state: FINISHED)
> 20/10/09 05:18:13.328 main DEBUG Client: 
>client token: N/A
>diagnostics: User class threw exception: 
> org.scalatest.exceptions.TestFailedException: null was not equal to 
> "testvalue"
>   at 
> org.scalatest.matchers.MatchersHelper$.indicateFailure(MatchersHelper.scala:344)
>   at 
> org.scalatest.matchers.should.Matchers$ShouldMethodHelperClass.shouldMatcher(Matchers.scala:6778)
>   at 
> org.scalatest.matchers.should.Matchers$AnyShouldWrapper.should(Matchers.scala:6822)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterDriverUseSparkHadoopUtilConf$.$anonfun$main$2(YarnClusterSuite.scala:383)
>   at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>   at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterDriverUseSparkHadoopUtilConf$.main(YarnClusterSuite.scala:382)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterDriverUseSparkHadoopUtilConf.main(YarnClusterSuite.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> su

[jira] [Created] (SPARK-33227) Add Jar with Azure SAS token fails with URL encoded characters

2020-10-23 Thread James McShane (Jira)
James McShane created SPARK-33227:
-

 Summary: Add Jar with Azure SAS token fails with URL encoded 
characters
 Key: SPARK-33227
 URL: https://issues.apache.org/jira/browse/SPARK-33227
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.4.3
Reporter: James McShane


I am running spark-submit using an Azure SAS token to access the jar file. When 
the sig of the SAS token contains URL-encoded characters before the end, I get 
a 403 error trying to download the jar. It appears to be related to the URL 
encoding change that occurs within DependencyUtils: 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L137].

Error message:

+ exec /usr/local/bin/tini -s -- /opt/spark/bin/spark-submit --conf 
spark.driver.bindAddress=10.0.0.44 --deploy-mode client --properties-file 
/opt/spark/conf/spark.properties --class MyClass 
'https://storageaccount.blob.core.windows.net/blob/my-jar.jar?sv=2019-12-12&ss=b&srt=sco&sp=r&se=*&st=***&spr=https&sig=sigwith%2Band%2Fending%3D'

java.io.IOException: Server returned HTTP response code: 403 for URL: 
https://storageaccount.blob.core.windows.net/blob/ivm-0.2.40-Spark-2.2.jar?sv=2019-12-12&ss=b&srt=sco&sp=r&se=**&st=*&spr=https&sig=sigwith+and/ending=
 at 
sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1900)
 at 
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498)
 at 
sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:268)
 at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:713) at 
org.apache.spark.deploy.DependencyUtils$.downloadFile(DependencyUtils.scala:137)
 at 
org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:367)
 at 
org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:367)
 at scala.Option.map(Option.scala:146) at 
org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:366)
 at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143) at 
org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at 
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924) at 
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933) at 
org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


It may not be clear in the example above, but when I submit the SAS token URL, 
the signature looks like:

sig=sigwith%2Band%2Fending%3D

The 403 error in the stack trace shows:

sig=sigwith+and/ending=

Is there something I can do to ensure that these characters do not get 
URL-decoded in this way?
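The %2B / %2F / %3D turning into + / / / = is exactly what percent-decoding of the query string produces. As a small illustration of where such decoding can happen (this only demonstrates general java.net.URI behaviour, not the exact code path inside DependencyUtils; the host and blob names are made up):

{code:scala}
import java.net.URI

val u = new URI(
  "https://example.blob.core.windows.net/container/my.jar?sig=sigwith%2Band%2Fending%3D")
println(u.getRawQuery)  // sig=sigwith%2Band%2Fending%3D  (encoding preserved)
println(u.getQuery)     // sig=sigwith+and/ending=        (decoded form, as seen in the 403)
{code}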






[jira] [Created] (SPARK-33228) Don't uncache data when replacing an existing view having the same plan

2020-10-23 Thread Takeshi Yamamuro (Jira)
Takeshi Yamamuro created SPARK-33228:


 Summary: Don't uncache data when replacing an existing view having 
the same plan
 Key: SPARK-33228
 URL: https://issues.apache.org/jira/browse/SPARK-33228
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.8, 3.0.2, 3.1.0
Reporter: Takeshi Yamamuro


SPARK-30494 updated the `CreateViewCommand` code to implicitly drop the cache 
when replacing an existing view. But this change drops the cache even when 
replacing a view that has the same logical plan. A sequence of queries to 
reproduce this is as follows:
{code}
scala> val df = spark.range(1).selectExpr("id a", "id b")
scala> df.cache()
scala> df.explain()
== Physical Plan ==
*(1) ColumnarToRow
+- InMemoryTableScan [a#2L, b#3L]
 +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 
replicas)
 +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
 +- *(1) Range (0, 1, step=1, splits=4)


scala> df.createOrReplaceTempView("t")
scala> sql("select * from t").explain()
== Physical Plan ==
*(1) ColumnarToRow
+- InMemoryTableScan [a#2L, b#3L]
 +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 
replicas)
 +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
 +- *(1) Range (0, 1, step=1, splits=4)


// If one re-runs the same query `df.createOrReplaceTempView("t")`, the cache's 
swept away
scala> df.createOrReplaceTempView("t")
scala> sql("select * from t").explain()
== Physical Plan ==
*(1) Project [id#0L AS a#2L, id#0L AS b#3L]
+- *(1) Range (0, 1, step=1, splits=4)
{code}
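A minimal sketch of the intended behaviour (not the actual patch): before dropping the cache, compare the analyzed plan of the replacement view with the one already registered, and skip the uncache when they are equivalent.

{code:scala}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Sketch only: CreateViewCommand would consult something like this before
// uncaching. sameResult returns true when two plans produce the same result.
def shouldUncache(existingPlan: LogicalPlan, replacementPlan: LogicalPlan): Boolean =
  !existingPlan.sameResult(replacementPlan)
{code}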






[jira] [Created] (SPARK-33229) UnsupportedOperationException when group by with cube

2020-10-23 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-33229:
---

 Summary: UnsupportedOperationException when group by with cube
 Key: SPARK-33229
 URL: https://issues.apache.org/jira/browse/SPARK-33229
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1, 3.0.0, 3.1.0
Reporter: Yuming Wang


How to reproduce this issue:

{code:sql}
create table test_cube using parquet as select id as a, id as b, id as c from 
range(10);
select a, b, c, count(*) from test_cube group by 1, cube(2, 3);
{code}



{noformat}
spark-sql> select a, b, c, count(*) from test_cube group by 1, cube(2, 3);
20/10/23 06:31:51 ERROR SparkSQLDriver: Failed in [select a, b, c, count(*) 
from test_cube group by 1, cube(2, 3)]
java.lang.UnsupportedOperationException
at 
org.apache.spark.sql.catalyst.expressions.GroupingSet.dataType(grouping.scala:35)
at 
org.apache.spark.sql.catalyst.expressions.GroupingSet.dataType$(grouping.scala:35)
at 
org.apache.spark.sql.catalyst.expressions.Cube.dataType(grouping.scala:60)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidGroupingExprs$1(CheckAnalysis.scala:268)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$12(CheckAnalysis.scala:284)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$12$adapted(CheckAnalysis.scala:284)
at 
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:284)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:92)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:177)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:92)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:89)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:130)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:156)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:153)
at 
org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:68)
at 
org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
at 
org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:133)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
at 
org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:133)
at 
org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:68)
at 
org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:66)
at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:58)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
{noformat}







[jira] [Created] (SPARK-33230) re-instate "spark.sql.sources.writeJobUUID" as unique ID in FileOutputWriter jobs

2020-10-23 Thread Steve Loughran (Jira)
Steve Loughran created SPARK-33230:
--

 Summary: re-instate "spark.sql.sources.writeJobUUID" as unique ID 
in FileOutputWriter jobs
 Key: SPARK-33230
 URL: https://issues.apache.org/jira/browse/SPARK-33230
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.1, 2.4.7
Reporter: Steve Loughran


The Hadoop S3A staging committer has problems when more than one Spark SQL query 
is launched simultaneously, as it uses the job ID for its path in the cluster 
filesystem to pass the commit information from tasks to the job committer.

If two queries are launched in the same second, they conflict: the output of 
job 1 includes all job 2 files written so far, and job 2 fails with a 
FileNotFoundException (FNFE).

Proposed: have the job conf set {{"spark.sql.sources.writeJobUUID"}} to the value 
of {{WriteJobDescription.uuid}}.

That was the property name which used to serve this purpose, so any committers 
already written against this property will pick it up without needing any 
changes.
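For illustration, a sketch of how a committer could consume the property once it is set; the property name comes from the proposal above, while the staging-directory layout is made up:

{code:scala}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.JobContext

// Prefer the unique write UUID when Spark provides it; fall back to the
// (second-granularity, hence collision-prone) job ID otherwise.
def stagingDir(context: JobContext): Path = {
  val conf = context.getConfiguration
  val unique = Option(conf.get("spark.sql.sources.writeJobUUID"))
    .getOrElse(context.getJobID.toString)
  new Path("/tmp/staging", unique)  // hypothetical layout
}
{code}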






[jira] [Commented] (SPARK-33228) Don't uncache data when replacing an existing view having the same plan

2020-10-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219682#comment-17219682
 ] 

Apache Spark commented on SPARK-33228:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/30140

> Don't uncache data when replacing an existing view having the same plan
> ---
>
> Key: SPARK-33228
> URL: https://issues.apache.org/jira/browse/SPARK-33228
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> SPARK-30494 updated the `CreateViewCommand` code to implicitly drop the cache 
> when replacing an existing view. But this change drops the cache even when 
> replacing a view that has the same logical plan. A sequence of queries to 
> reproduce this is as follows:
> {code}
> scala> val df = spark.range(1).selectExpr("id a", "id b")
> scala> df.cache()
> scala> df.explain()
> == Physical Plan ==
> *(1) ColumnarToRow
> +- InMemoryTableScan [a#2L, b#3L]
>  +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>  +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
>  +- *(1) Range (0, 1, step=1, splits=4)
> scala> df.createOrReplaceTempView("t")
> scala> sql("select * from t").explain()
> == Physical Plan ==
> *(1) ColumnarToRow
> +- InMemoryTableScan [a#2L, b#3L]
>  +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>  +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
>  +- *(1) Range (0, 1, step=1, splits=4)
> // If one re-runs the same query `df.createOrReplaceTempView("t")`, the 
> cache's swept away
> scala> df.createOrReplaceTempView("t")
> scala> sql("select * from t").explain()
> == Physical Plan ==
> *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
> +- *(1) Range (0, 1, step=1, splits=4)
> {code}






[jira] [Assigned] (SPARK-33228) Don't uncache data when replacing an existing view having the same plan

2020-10-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33228:


Assignee: Apache Spark

> Don't uncache data when replacing an existing view having the same plan
> ---
>
> Key: SPARK-33228
> URL: https://issues.apache.org/jira/browse/SPARK-33228
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Major
>
> SPARK-30494 updated the `CreateViewCommand` code to implicitly drop the cache 
> when replacing an existing view. But this change drops the cache even when 
> replacing a view that has the same logical plan. A sequence of queries to 
> reproduce this is as follows:
> {code}
> scala> val df = spark.range(1).selectExpr("id a", "id b")
> scala> df.cache()
> scala> df.explain()
> == Physical Plan ==
> *(1) ColumnarToRow
> +- InMemoryTableScan [a#2L, b#3L]
>  +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>  +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
>  +- *(1) Range (0, 1, step=1, splits=4)
> scala> df.createOrReplaceTempView("t")
> scala> sql("select * from t").explain()
> == Physical Plan ==
> *(1) ColumnarToRow
> +- InMemoryTableScan [a#2L, b#3L]
>  +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>  +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
>  +- *(1) Range (0, 1, step=1, splits=4)
> // If one re-runs the same query `df.createOrReplaceTempView("t")`, the 
> cache's swept away
> scala> df.createOrReplaceTempView("t")
> scala> sql("select * from t").explain()
> == Physical Plan ==
> *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
> +- *(1) Range (0, 1, step=1, splits=4)
> {code}






[jira] [Assigned] (SPARK-33228) Don't uncache data when replacing an existing view having the same plan

2020-10-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33228:


Assignee: (was: Apache Spark)

> Don't uncache data when replacing an existing view having the same plan
> ---
>
> Key: SPARK-33228
> URL: https://issues.apache.org/jira/browse/SPARK-33228
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> SPARK-30494 updated the `CreateViewCommand` code to implicitly drop the cache 
> when replacing an existing view. But this change drops the cache even when 
> replacing a view that has the same logical plan. A sequence of queries to 
> reproduce this is as follows:
> {code}
> scala> val df = spark.range(1).selectExpr("id a", "id b")
> scala> df.cache()
> scala> df.explain()
> == Physical Plan ==
> *(1) ColumnarToRow
> +- InMemoryTableScan [a#2L, b#3L]
>  +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>  +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
>  +- *(1) Range (0, 1, step=1, splits=4)
> scala> df.createOrReplaceTempView("t")
> scala> sql("select * from t").explain()
> == Physical Plan ==
> *(1) ColumnarToRow
> +- InMemoryTableScan [a#2L, b#3L]
>  +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>  +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
>  +- *(1) Range (0, 1, step=1, splits=4)
> // If one re-runs the same query `df.createOrReplaceTempView("t")`, the 
> cache's swept away
> scala> df.createOrReplaceTempView("t")
> scala> sql("select * from t").explain()
> == Physical Plan ==
> *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
> +- *(1) Range (0, 1, step=1, splits=4)
> {code}






[jira] [Updated] (SPARK-33230) FileOutputWriter to set jobConf "spark.sql.sources.writeJobUUID" to description.uuid

2020-10-23 Thread Steve Loughran (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-33230:
---
Summary: FileOutputWriter to set jobConf "spark.sql.sources.writeJobUUID" 
to description.uuid  (was: re-instate "spark.sql.sources.writeJobUUID" as 
unique ID in FileOutputWriter jobs)

> FileOutputWriter to set jobConf "spark.sql.sources.writeJobUUID" to 
> description.uuid
> 
>
> Key: SPARK-33230
> URL: https://issues.apache.org/jira/browse/SPARK-33230
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Steve Loughran
>Priority: Minor
>
> The Hadoop S3A staging committer has problems with >1 spark sql query being 
> launched simultaneously, as it uses the jobID for its path in the clusterFS 
> to pass the commit information from tasks to job committer. 
> If two queries are launched in the same second, they conflict and the output 
> of job 1 includes that of all job2 files written so far; job 2 will fail with 
> FNFE.
> Proposed:
> job conf to set {{"spark.sql.sources.writeJobUUID"}} to the value of 
> {{WriteJobDescription.uuid}}
> That was the property name which used to serve this purpose; any committers 
> already written which use this property will pick it up without needing any 
> changes.






[jira] [Assigned] (SPARK-33230) FileOutputWriter to set jobConf "spark.sql.sources.writeJobUUID" to description.uuid

2020-10-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33230:


Assignee: (was: Apache Spark)

> FileOutputWriter to set jobConf "spark.sql.sources.writeJobUUID" to 
> description.uuid
> 
>
> Key: SPARK-33230
> URL: https://issues.apache.org/jira/browse/SPARK-33230
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Steve Loughran
>Priority: Minor
>
> The Hadoop S3A staging committer has problems with >1 spark sql query being 
> launched simultaneously, as it uses the jobID for its path in the clusterFS 
> to pass the commit information from tasks to job committer. 
> If two queries are launched in the same second, they conflict and the output 
> of job 1 includes that of all job2 files written so far; job 2 will fail with 
> FNFE.
> Proposed:
> job conf to set {{"spark.sql.sources.writeJobUUID"}} to the value of 
> {{WriteJobDescription.uuid}}
> That was the property name which used to serve this purpose; any committers 
> already written which use this property will pick it up without needing any 
> changes.






[jira] [Assigned] (SPARK-33230) FileOutputWriter to set jobConf "spark.sql.sources.writeJobUUID" to description.uuid

2020-10-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33230:


Assignee: Apache Spark

> FileOutputWriter to set jobConf "spark.sql.sources.writeJobUUID" to 
> description.uuid
> 
>
> Key: SPARK-33230
> URL: https://issues.apache.org/jira/browse/SPARK-33230
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Steve Loughran
>Assignee: Apache Spark
>Priority: Minor
>
> The Hadoop S3A staging committer has problems with >1 spark sql query being 
> launched simultaneously, as it uses the jobID for its path in the clusterFS 
> to pass the commit information from tasks to job committer. 
> If two queries are launched in the same second, they conflict and the output 
> of job 1 includes that of all job2 files written so far; job 2 will fail with 
> FNFE.
> Proposed:
> job conf to set {{"spark.sql.sources.writeJobUUID"}} to the value of 
> {{WriteJobDescription.uuid}}
> That was the property name which used to serve this purpose; any committers 
> already written which use this property will pick it up without needing any 
> changes.






[jira] [Commented] (SPARK-33230) FileOutputWriter to set jobConf "spark.sql.sources.writeJobUUID" to description.uuid

2020-10-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219705#comment-17219705
 ] 

Apache Spark commented on SPARK-33230:
--

User 'steveloughran' has created a pull request for this issue:
https://github.com/apache/spark/pull/30141

> FileOutputWriter to set jobConf "spark.sql.sources.writeJobUUID" to 
> description.uuid
> 
>
> Key: SPARK-33230
> URL: https://issues.apache.org/jira/browse/SPARK-33230
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Steve Loughran
>Priority: Minor
>
> The Hadoop S3A staging committer has problems with >1 spark sql query being 
> launched simultaneously, as it uses the jobID for its path in the clusterFS 
> to pass the commit information from tasks to job committer. 
> If two queries are launched in the same second, they conflict and the output 
> of job 1 includes that of all job2 files written so far; job 2 will fail with 
> FNFE.
> Proposed:
> job conf to set {{"spark.sql.sources.writeJobUUID"}} to the value of 
> {{WriteJobDescription.uuid}}
> That was the property name which used to serve this purpose; any committers 
> already written which use this property will pick it up without needing any 
> changes.






[jira] [Commented] (SPARK-33230) FileOutputWriter to set jobConf "spark.sql.sources.writeJobUUID" to description.uuid

2020-10-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219707#comment-17219707
 ] 

Apache Spark commented on SPARK-33230:
--

User 'steveloughran' has created a pull request for this issue:
https://github.com/apache/spark/pull/30141

> FileOutputWriter to set jobConf "spark.sql.sources.writeJobUUID" to 
> description.uuid
> 
>
> Key: SPARK-33230
> URL: https://issues.apache.org/jira/browse/SPARK-33230
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Steve Loughran
>Priority: Minor
>
> The Hadoop S3A staging committer has problems with >1 spark sql query being 
> launched simultaneously, as it uses the jobID for its path in the clusterFS 
> to pass the commit information from tasks to job committer. 
> If two queries are launched in the same second, they conflict and the output 
> of job 1 includes that of all job2 files written so far; job 2 will fail with 
> FNFE.
> Proposed:
> job conf to set {{"spark.sql.sources.writeJobUUID"}} to the value of 
> {{WriteJobDescription.uuid}}
> That was the property name which used to serve this purpose; any committers 
> already written which use this property will pick it up without needing any 
> changes.






[jira] [Commented] (SPARK-33220) Change `scheduleAtFixedRate` to `scheduleWithFixedDelay` to avoid repeated unnecessary scheduling within a short time

2020-10-23 Thread Rohit Mishra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219727#comment-17219727
 ] 

Rohit Mishra commented on SPARK-33220:
--

[~angerszhuuu], Can you please add the description?

> Change `scheduleAtFixedRate` to `scheduleWithFixedDelay` to avoid repeated 
> unnecessary scheduling within a short time
> --
>
> Key: SPARK-33220
> URL: https://issues.apache.org/jira/browse/SPARK-33220
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>
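For context, the behavioural difference between the two ScheduledExecutorService methods named in the title, shown with a self-contained sketch (the task and timings are made up):

{code:scala}
import java.util.concurrent.{Executors, TimeUnit}

val scheduler = Executors.newSingleThreadScheduledExecutor()
val task = new Runnable {
  // A run that takes longer than the 1-second period.
  def run(): Unit = { println(s"tick ${System.currentTimeMillis()}"); Thread.sleep(1500) }
}

// scheduleAtFixedRate targets a fixed period between *start* times, so after an
// overrunning task the next executions fire back-to-back to catch up -- the
// "repeated unnecessary scheduling within a short time" from the title.
scheduler.scheduleAtFixedRate(task, 0, 1, TimeUnit.SECONDS)

// scheduleWithFixedDelay instead waits the full delay after each run *finishes*,
// so executions never bunch up:
// scheduler.scheduleWithFixedDelay(task, 0, 1, TimeUnit.SECONDS)
{code}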







[jira] [Commented] (SPARK-33095) Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (MySQL dialect)

2020-10-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219740#comment-17219740
 ] 

Apache Spark commented on SPARK-33095:
--

User 'ScrapCodes' has created a pull request for this issue:
https://github.com/apache/spark/pull/30142

> Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and 
> nullability of columns (MySQL dialect)
> -
>
> Key: SPARK-33095
> URL: https://issues.apache.org/jira/browse/SPARK-33095
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>Priority: Major
> Fix For: 3.1.0
>
>
> Override the default SQL strings for:
> ALTER TABLE UPDATE COLUMN TYPE
> ALTER TABLE UPDATE COLUMN NULLABILITY
> in the MySQL JDBC dialect according to the official documentation.
> Write MySQL integration tests for JDBC.
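A hedged sketch of the MySQL-specific SQL strings involved (the method names below mirror the JdbcDialect extension points but should be treated as assumptions; only the generated MySQL syntax is the point):

{code:scala}
// Sketch only. MySQL uses MODIFY COLUMN and requires restating the column type
// when changing nullability, unlike the ANSI-style ALTER COLUMN ... TYPE default.
object MySQLDialectSketch {
  def getUpdateColumnTypeQuery(table: String, col: String, newType: String): String =
    s"ALTER TABLE $table MODIFY COLUMN $col $newType"

  def getUpdateColumnNullabilityQuery(
      table: String, col: String, colType: String, nullable: Boolean): String = {
    val nullability = if (nullable) "NULL" else "NOT NULL"
    s"ALTER TABLE $table MODIFY COLUMN $col $colType $nullability"
  }
}
{code}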






[jira] [Commented] (SPARK-33095) Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (MySQL dialect)

2020-10-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219742#comment-17219742
 ] 

Apache Spark commented on SPARK-33095:
--

User 'ScrapCodes' has created a pull request for this issue:
https://github.com/apache/spark/pull/30142

> Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and 
> nullability of columns (MySQL dialect)
> -
>
> Key: SPARK-33095
> URL: https://issues.apache.org/jira/browse/SPARK-33095
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>Priority: Major
> Fix For: 3.1.0
>
>
> Override the default SQL strings for:
> ALTER TABLE UPDATE COLUMN TYPE
> ALTER TABLE UPDATE COLUMN NULLABILITY
> in the MySQL JDBC dialect according to the official documentation.
> Write MySQL integration tests for JDBC.






[jira] [Updated] (SPARK-33230) FileOutputWriter to set jobConf "spark.sql.sources.writeJobUUID" to description.uuid

2020-10-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33230:
--
Issue Type: Bug  (was: Improvement)

> FileOutputWriter to set jobConf "spark.sql.sources.writeJobUUID" to 
> description.uuid
> 
>
> Key: SPARK-33230
> URL: https://issues.apache.org/jira/browse/SPARK-33230
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Steve Loughran
>Priority: Minor
>
> The Hadoop S3A staging committer has problems with >1 spark sql query being 
> launched simultaneously, as it uses the jobID for its path in the clusterFS 
> to pass the commit information from tasks to job committer. 
> If two queries are launched in the same second, they conflict and the output 
> of job 1 includes that of all job2 files written so far; job 2 will fail with 
> FNFE.
> Proposed:
> job conf to set {{"spark.sql.sources.writeJobUUID"}} to the value of 
> {{WriteJobDescription.uuid}}
> That was the property name which used to serve this purpose; any committers 
> already written which use this property will pick it up without needing any 
> changes.






[jira] [Updated] (SPARK-33230) FileOutputWriter to set jobConf "spark.sql.sources.writeJobUUID" to description.uuid

2020-10-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33230:
--
Priority: Major  (was: Minor)

> FileOutputWriter to set jobConf "spark.sql.sources.writeJobUUID" to 
> description.uuid
> 
>
> Key: SPARK-33230
> URL: https://issues.apache.org/jira/browse/SPARK-33230
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Steve Loughran
>Priority: Major
>
> The Hadoop S3A staging committer has problems with >1 spark sql query being 
> launched simultaneously, as it uses the jobID for its path in the clusterFS 
> to pass the commit information from tasks to job committer. 
> If two queries are launched in the same second, they conflict and the output 
> of job 1 includes that of all job2 files written so far; job 2 will fail with 
> FNFE.
> Proposed:
> job conf to set {{"spark.sql.sources.writeJobUUID"}} to the value of 
> {{WriteJobDescription.uuid}}
> That was the property name which used to serve this purpose; any committers 
> already written which use this property will pick it up without needing any 
> changes.






[jira] [Updated] (SPARK-33230) FileOutputWriter jobs have duplicate JobIDs if launched in same second

2020-10-23 Thread Steve Loughran (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-33230:
---
Summary: FileOutputWriter jobs have duplicate JobIDs if launched in same 
second  (was: FileOutputWriter to set jobConf "spark.sql.sources.writeJobUUID" 
to description.uuid)

> FileOutputWriter jobs have duplicate JobIDs if launched in same second
> --
>
> Key: SPARK-33230
> URL: https://issues.apache.org/jira/browse/SPARK-33230
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Steve Loughran
>Priority: Major
>
> The Hadoop S3A staging committer has problems with >1 spark sql query being 
> launched simultaneously, as it uses the jobID for its path in the clusterFS 
> to pass the commit information from tasks to job committer. 
> If two queries are launched in the same second, they conflict and the output 
> of job 1 includes that of all job2 files written so far; job 2 will fail with 
> FNFE.
> Proposed:
> job conf to set {{"spark.sql.sources.writeJobUUID"}} to the value of 
> {{WriteJobDescription.uuid}}
> That was the property name which used to serve this purpose; any committers 
> already written which use this property will pick it up without needing any 
> changes.






[jira] [Commented] (SPARK-33213) Upgrade Apache Arrow to 2.0.0

2020-10-23 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219840#comment-17219840
 ] 

Bryan Cutler commented on SPARK-33213:
--

Just a couple of notes:

The library and format versions are now split; the format version is still at 
1.0.0, so it remains binary compatible. See here for more info: 
[https://arrow.apache.org/blog/2020/10/22/2.0.0-release/]

I don't think there are any relevant changes in Arrow Java between 1.0.1 and 
2.0.0, and PySpark currently works with pyarrow 2.0.0 once the env var 
{{PYARROW_IGNORE_TIMEZONE=1}} is added.
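
For reference, a minimal sketch of how a session might set that variable on both the 
driver and the executors (assumes pyarrow 2.0.0 is installed everywhere; the app name 
and the Arrow config shown are only illustrative):

{code:python}
import os
from pyspark.sql import SparkSession

# Driver-side Python process: pyarrow reads this env var at conversion time.
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"

spark = (
    SparkSession.builder
    .appName("arrow-2.0.0-check")
    # Propagate the same variable to the executor Python workers.
    .config("spark.executorEnv.PYARROW_IGNORE_TIMEZONE", "1")
    # Illustrative: enable Arrow-based pandas conversion.
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)
{code}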

> Upgrade Apache Arrow to 2.0.0
> -
>
> Key: SPARK-33213
> URL: https://issues.apache.org/jira/browse/SPARK-33213
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Priority: Minor
>
> Apache Arrow 2.0.0 has [just been 
> released|https://cwiki.apache.org/confluence/display/ARROW/Arrow+2.0.0+Release].
>  This proposes to upgrade Spark's Arrow dependency to use 2.0.0, from the 
> current 1.0.1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33231) Make podCreationTimeout configurable

2020-10-23 Thread Holden Karau (Jira)
Holden Karau created SPARK-33231:


 Summary: Make podCreationTimeout configurable
 Key: SPARK-33231
 URL: https://issues.apache.org/jira/browse/SPARK-33231
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.0.1, 3.0.0, 3.1.0
Reporter: Holden Karau


The Executor Monitor & Pod Allocator have differing views of the world, which 
can lead to pod thrashing.

The executor monitor can be notified of an executor coming up before a snapshot 
is delivered to the PodAllocator. This can cause the executor monitor to 
believe it needs to delete a pod, and the pod allocator to believe that it 
needs to create a new pod. This happens if the podCreationTimeout is too low 
for the cluster. Currently podCreationTimeout can only be raised by increasing 
the batch delay, but that has additional consequences, leading to slower 
executor spin-up.
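
As a rough sketch of the only workaround available today (the master URL and container 
image below are placeholders; the batch-delay setting is the point), raising the 
allocation batch delay stretches podCreationTimeout indirectly, at the cost of the 
slower spin-up described above:

{code:python}
from pyspark import SparkConf

conf = (
    SparkConf()
    .setMaster("k8s://https://kubernetes.example.com:443")            # placeholder API server
    .set("spark.kubernetes.container.image", "example/spark:3.0.1")   # placeholder image
    # The only knob today: a larger batch delay (default 1s) also stretches the
    # derived pod-creation timeout, but it slows down executor spin-up.
    .set("spark.kubernetes.allocation.batch.delay", "10s")
)
{code}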



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32863) Full outer stream-stream join

2020-10-23 Thread Zhongwei Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219872#comment-17219872
 ] 

Zhongwei Zhu commented on SPARK-32863:
--

[~chengsu] Have you already started working on this? I'd like to help with this PR.

> Full outer stream-stream join
> -
>
> Key: SPARK-32863
> URL: https://issues.apache.org/jira/browse/SPARK-32863
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Major
>
> Current stream-stream join supports inner, left outer and right outer join 
> ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L166]
>  ). With current design of stream-stream join (which marks whether the row is 
> matched or not in state store), it would be very easy to support full outer 
> join as well.
>  
> Full outer stream-stream join will work as follows:
> (1) For each left-side input row, check whether there is a match in the 
> right-side state store. If there is, output all matched rows. Put the row in 
> the left-side state store.
> (2) For each right-side input row, check whether there is a match in the 
> left-side state store. If there is, output all matched rows and set the 
> "matched" field of the matched left-side state rows to true. Put the 
> right-side row in the right-side state store.
> (3) When a left-side row needs to be evicted from the state store, output it 
> if its "matched" field is false.
> (4) When a right-side row needs to be evicted from the state store, output it 
> if its "matched" field is false.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32863) Full outer stream-stream join

2020-10-23 Thread Cheng Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219907#comment-17219907
 ] 

Cheng Su commented on SPARK-32863:
--

[~warrenzhu25] - Thanks for your interest. Yes, I already have a draft for this and 
will publish it next week, after the left semi stream-stream join is merged - 
[https://github.com/apache/spark/pull/30076]. Thanks.

> Full outer stream-stream join
> -
>
> Key: SPARK-32863
> URL: https://issues.apache.org/jira/browse/SPARK-32863
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Major
>
> Current stream-stream join supports inner, left outer and right outer join 
> ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L166]
>  ). With current design of stream-stream join (which marks whether the row is 
> matched or not in state store), it would be very easy to support full outer 
> join as well.
>  
> Full outer stream-stream join will work as follows:
> (1) For each left-side input row, check whether there is a match in the 
> right-side state store. If there is, output all matched rows. Put the row in 
> the left-side state store.
> (2) For each right-side input row, check whether there is a match in the 
> left-side state store. If there is, output all matched rows and set the 
> "matched" field of the matched left-side state rows to true. Put the 
> right-side row in the right-side state store.
> (3) When a left-side row needs to be evicted from the state store, output it 
> if its "matched" field is false.
> (4) When a right-side row needs to be evicted from the state store, output it 
> if its "matched" field is false.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33193) Hive ThriftServer JDBC Database MetaData API Behavior Auditing

2020-10-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33193.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30101
[https://github.com/apache/spark/pull/30101]

> Hive ThriftServer JDBC Database MetaData API  Behavior Auditing
> ---
>
> Key: SPARK-33193
> URL: https://issues.apache.org/jira/browse/SPARK-33193
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.1.0
>
>
> Add a test case that audits all JDBC metadata behaviors, to catch and prevent 
> silent API changes coming from either the upstream hive-jdbc module or the 
> Spark Thrift Server side. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33193) Hive ThriftServer JDBC Database MetaData API Behavior Auditing

2020-10-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-33193:
-

Assignee: Kent Yao

> Hive ThriftServer JDBC Database MetaData API  Behavior Auditing
> ---
>
> Key: SPARK-33193
> URL: https://issues.apache.org/jira/browse/SPARK-33193
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> Add a test case that audits all JDBC metadata behaviors, to catch and prevent 
> silent API changes coming from either the upstream hive-jdbc module or the 
> Spark Thrift Server side. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33232) ConcurrentAppendException while updating delta lake table

2020-10-23 Thread Khang Pham (Jira)
Khang Pham created SPARK-33232:
--

 Summary: ConcurrentAppendException while updating delta lake table
 Key: SPARK-33232
 URL: https://issues.apache.org/jira/browse/SPARK-33232
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Khang Pham


I have two Spark Streaming jobs running concurrently:
 * Stream join job: joins a Kafka stream with another stream from Amazon SQS. 
The result is appended to Delta Lake table A.
 * Upsert job: reads from Delta Lake table B and updates table A when there 
are matching IDs. 

Environment:
 * Databricks Runtime 7.2, Spark 3.0.0

 

The stream join job works fine, but the upsert job keeps failing. 

 

Stack trace:

com.databricks.sql.transaction.tahoe.ConcurrentAppendException: Files were 
added to partition [dt=2020-xx-yy, request_hour=2020-xx-yy 23:00:00] by a 
concurrent update.

Please try the operation again. Conflicting commit: 
\{"timestamp":1603477588946,"userId":"x","operation":"OPTIMIZE","operationParameters":{"predicate":[],"zOrderBy":[],"batchId":0,"auto":true},"job":\{"jobId":"x","jobName":"Streaming
 join 
xxx","runId":"xxx","jobOwnerId":"","triggerType":"manual"},"notebook":\{"notebookId":""},"clusterId":"xx","readVersion":22,"isolationLevel":"SnapshotIsolation","isBlindAppend":false,"operationMetrics":\{"numRemovedFiles":"44","numRemovedBytes":"64455537","p25FileSize":"63341646","minFileSize":"63341646","numAddedFiles":"1","maxFileSize":"63341646","p75FileSize":"63341646","p50FileSize":"63341646","numAddedBytes":"63341646"}}

 

Table A has these settings: 

- 'delta.isolationLevel' = 'WriteSerializable'

- spark.databricks.delta.optimizeWrite.enable = True

- spark.databricks.delta.autoCompact.enabled = True

 

Other settings:

spark.databricks.io.cache.compression.enabled true

stateStore = rocksdb

spark.sql.adaptive.enabled true

spark.sql.adaptive.skewJoin.enabled true

 

I already set the isolation level to WriteSerializable to handle 
ConcurrentAppendException, as described in 
[https://docs.databricks.com/delta/optimizations/isolation-level.html]

 

However, the error says "SnapshotIsolation". 

 

What did I miss? 
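
Not a definitive fix, but one mitigation the Delta concurrency documentation suggests 
for ConcurrentAppendException is to make the partition columns explicit in the upsert 
condition, so the two jobs touch disjoint sets of files. A minimal sketch with the 
delta-spark Python API, using placeholder table and column names (and assuming a 
Delta-enabled Spark session):

{code:python}
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()      # assumes Delta Lake is already configured

target = DeltaTable.forName(spark, "table_a")   # the table both jobs write to
updates = spark.table("table_b")                # the source of the upsert job

(target.alias("t")
    .merge(
        updates.alias("s"),
        # Equality on the id plus explicit partition predicates (dt, request_hour)
        # narrows the scan to the partitions this job actually owns.
        "t.id = s.id AND t.dt = s.dt AND t.request_hour = s.request_hour")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
{code}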

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33232) ConcurrentAppendException while updating delta lake table

2020-10-23 Thread Khang Pham (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Khang Pham updated SPARK-33232:
---
Description: 
I have two Spark Streaming job run concurrently. 
 * Stream join Job: join in Kafka Stream with another stream from Amazon SQS. 
The result will be appended into delta lake table A
 * Upsert job: read from one Delta lake table B and update table A when there 
are matching IDs. 

Environment:
 * Databricks cloud run time 7.2, Spark 3.0.0

 

The stream join job works fine but the Upsert job kept failing. 

 

Stack trace:

com.databricks.sql.transaction.tahoe.ConcurrentAppendException: Files were 
added to partition [dt=2020-xx-yy, request_hour=2020-xx-yy 23:00:00] by a 
concurrent update.

Please try the operation again. Conflicting commit: 
{"timestamp":1603477588946,"userId":"x","operation":"OPTIMIZE","operationParameters":

{"predicate":[],"zOrderBy":[],"batchId":0,"auto":true}

,"job":\{"jobId":"x","jobName":"Streaming join 
xxx","runId":"xxx","jobOwnerId":"","triggerType":"manual"},"notebook":\{"notebookId":""},"clusterId":"xx","readVersion":22,"isolationLevel":"SnapshotIsolation","isBlindAppend":false,"operationMetrics":\{"numRemovedFiles":"44","numRemovedBytes":"64455537","p25FileSize":"63341646","minFileSize":"63341646","numAddedFiles":"1","maxFileSize":"63341646","p75FileSize":"63341646","p50FileSize":"63341646","numAddedBytes":"63341646"}}

 

Table A have these setting: 

- 'delta.isolationLevel' = 'WriteSerializable'

- spark.databricks.delta.optimizeWrite.enable = True

- spark.databricks.delta.autoCompact.enabled = True

 

Other settings:

spark.databricks.io.cache.compression.enabled true

stateStore = rocksdb

spark.sql.adaptive.enabled true

spark.sql.adaptive.skewJoin.enabled true

 

I already set IsolationLevel to WriteSerializable to handle 
ConcurrentAppendingException as described in 
[https://docs.databricks.com/delta/optimizations/isolation-level.html]  

 

However the error says "SnapshotIsolation". 

 

What did I miss? 

 

 

  was:
I have two Spark Streaming job run concurrently. 
 * Stream join Job: join in Kafka Stream with another stream from Amazon SQS. 
The result will be appended into delta lake table A
 * Upsert job: read from one Delta lake table B and update table A when there 
are matching IDs. 

Environment:
 * Databricks cloud run time 7.2, Spark 3.0.0

 

The stream join job works fine but the Upsert job kept failing. 

 

Stack trace:

com.databricks.sql.transaction.tahoe.ConcurrentAppendException: Files were 
added to partition [dt=2020-xx-yy, request_hour=2020-xx-yy 23:00:00] by a 
concurrent update.

Please try the operation again. Conflicting commit: 
{"timestamp":1603477588946,"userId":"x","operation":"OPTIMIZE","operationParameters":

{"predicate":[],"zOrderBy":[],"batchId":0,"auto":true}

,"job":\{"jobId":"x","jobName":"Streaming join 
xxx","runId":"xxx","jobOwnerId":"","triggerType":"manual"},"notebook":\{"notebookId":""},"clusterId":"xx","readVersion":22,"isolationLevel":"SnapshotIsolation","isBlindAppend":false,"operationMetrics":\{"numRemovedFiles":"44","numRemovedBytes":"64455537","p25FileSize":"63341646","minFileSize":"63341646","numAddedFiles":"1","maxFileSize":"63341646","p75FileSize":"63341646","p50FileSize":"63341646","numAddedBytes":"63341646"}}

 

Table A these setting: 

- 'delta.isolationLevel' = 'WriteSerializable'

- spark.databricks.delta.optimizeWrite.enable = True

- spark.databricks.delta.autoCompact.enabled = True

 

Other settings:

spark.databricks.io.cache.compression.enabled true

stateStore = rocksdb

spark.sql.adaptive.enabled true

spark.sql.adaptive.skewJoin.enabled true

 

I already set IsolationLevel to WriteSerializable to handle 
ConcurrentAppendingException as described in 
[https://docs.databricks.com/delta/optimizations/isolation-level.html]  

 

However the error says "SnapshotIsolation". 

 

What did I miss? 

 

 


> ConcurrentAppendException while updating delta lake table
> -
>
> Key: SPARK-33232
> URL: https://issues.apache.org/jira/browse/SPARK-33232
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Khang Pham
>Priority: Major
>
> I have two Spark Streaming job run concurrently. 
>  * Stream join Job: join in Kafka Stream with another stream from Amazon SQS. 
> The result will be appended into delta lake table A
>  * Upsert job: read from one Delta lake table B and update table A when there 
> are matching IDs. 
> Environment:
>  * Databricks cloud run time 7.2, Spark 3.0.0
>  
> The stream join job works fine but the Upsert job kept failing. 
>  
> Stack trace:
> com.databricks.sql.transaction.tahoe.ConcurrentAppendException: Files were

[jira] [Updated] (SPARK-33232) ConcurrentAppendException while updating delta lake table

2020-10-23 Thread Khang Pham (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Khang Pham updated SPARK-33232:
---
Description: 
I have two Spark Streaming job run concurrently. 
 * Stream join Job: join in Kafka Stream with another stream from Amazon SQS. 
The result will be appended into delta lake table A
 * Upsert job: read from one Delta lake table B and update table A when there 
are matching IDs. 

Environment:
 * Databricks cloud run time 7.2, Spark 3.0.0

 

The stream join job works fine but the Upsert job kept failing. 

 

Stack trace:

com.databricks.sql.transaction.tahoe.ConcurrentAppendException: Files were 
added to partition [dt=2020-xx-yy, request_hour=2020-xx-yy 23:00:00] by a 
concurrent update.

Please try the operation again. Conflicting commit: 
{"timestamp":1603477588946,"userId":"x","operation":"OPTIMIZE","operationParameters":

{"predicate":[],"zOrderBy":[],"batchId":0,"auto":true}

,"job":\{"jobId":"x","jobName":"Streaming join 
xxx","runId":"xxx","jobOwnerId":"","triggerType":"manual"},"notebook":\{"notebookId":""},"clusterId":"xx","readVersion":22,"isolationLevel":"SnapshotIsolation","isBlindAppend":false,"operationMetrics":\{"numRemovedFiles":"44","numRemovedBytes":"64455537","p25FileSize":"63341646","minFileSize":"63341646","numAddedFiles":"1","maxFileSize":"63341646","p75FileSize":"63341646","p50FileSize":"63341646","numAddedBytes":"63341646"}}

 

Table A these setting: 

- 'delta.isolationLevel' = 'WriteSerializable'

- spark.databricks.delta.optimizeWrite.enable = True

- spark.databricks.delta.autoCompact.enabled = True

 

Other settings:

spark.databricks.io.cache.compression.enabled true

stateStore = rocksdb

spark.sql.adaptive.enabled true

spark.sql.adaptive.skewJoin.enabled true

 

I already set IsolationLevel to WriteSerializable to handle 
ConcurrentAppendingException as described in 
[https://docs.databricks.com/delta/optimizations/isolation-level.html]  

 

However the error says "SnapshotIsolation". 

 

What did I miss? 

 

 

  was:
I have two Spark Streaming job run concurrently. 
 * Stream join Job: join in Kafka Stream with another stream from Amazon SQS. 
The result will be appended into delta lake table A
 * Upsert job: read from one Delta lake table B and update table A when there 
are matching ID. 

Environment:
 * Databricks cloud run time 7.2, Spark 3.0.0

 

The stream join job works fine but the Upsert job kept failing. 

 

Stack trace:

com.databricks.sql.transaction.tahoe.ConcurrentAppendException: Files were 
added to partition [dt=2020-xx-yy, request_hour=2020-xx-yy 23:00:00] by a 
concurrent update.

Please try the operation again. Conflicting commit: 
\{"timestamp":1603477588946,"userId":"x","operation":"OPTIMIZE","operationParameters":{"predicate":[],"zOrderBy":[],"batchId":0,"auto":true},"job":\{"jobId":"x","jobName":"Streaming
 join 
xxx","runId":"xxx","jobOwnerId":"","triggerType":"manual"},"notebook":\{"notebookId":""},"clusterId":"xx","readVersion":22,"isolationLevel":"SnapshotIsolation","isBlindAppend":false,"operationMetrics":\{"numRemovedFiles":"44","numRemovedBytes":"64455537","p25FileSize":"63341646","minFileSize":"63341646","numAddedFiles":"1","maxFileSize":"63341646","p75FileSize":"63341646","p50FileSize":"63341646","numAddedBytes":"63341646"}}

 

Table A these setting: 

- 'delta.isolationLevel' = 'WriteSerializable'

- spark.databricks.delta.optimizeWrite.enable = True

- spark.databricks.delta.autoCompact.enabled = True

 

Other settings:

spark.databricks.io.cache.compression.enabled true

stateStore = rocksdb

spark.sql.adaptive.enabled true

spark.sql.adaptive.skewJoin.enabled true

 

I already set IsolationLevel to WriteSerializable to handle 
ConcurrentAppendingException as described in 
[https://docs.databricks.com/delta/optimizations/isolation-level.html]  

 

However the error says "SnapshotIsolation". 

 

What did I miss? 

 

 


> ConcurrentAppendException while updating delta lake table
> -
>
> Key: SPARK-33232
> URL: https://issues.apache.org/jira/browse/SPARK-33232
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Khang Pham
>Priority: Major
>
> I have two Spark Streaming job run concurrently. 
>  * Stream join Job: join in Kafka Stream with another stream from Amazon SQS. 
> The result will be appended into delta lake table A
>  * Upsert job: read from one Delta lake table B and update table A when there 
> are matching IDs. 
> Environment:
>  * Databricks cloud run time 7.2, Spark 3.0.0
>  
> The stream join job works fine but the Upsert job kept failing. 
>  
> Stack trace:
> com.databricks.sql.transaction.tahoe.ConcurrentAppendException: Files were 
> adde

[jira] [Updated] (SPARK-33232) ConcurrentAppendException while updating delta lake table

2020-10-23 Thread Khang Pham (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Khang Pham updated SPARK-33232:
---
Description: 
I have two Spark Streaming job run concurrently. 
 * Stream join Job: join in Kafka Stream with another stream from Amazon SQS. 
The result will be appended into delta lake table A
 * Upsert job: read from one Delta lake table B and update table A when there 
are matching IDs. 

Environment:
 * Databricks cloud run time 7.2, Spark 3.0.0

 

The stream join job works fine but the Upsert job kept failing. 

 

Stack trace:

com.databricks.sql.transaction.tahoe.ConcurrentAppendException: Files were 
added to partition [dt=2020-xx-yy, request_hour=2020-xx-yy 23:00:00] by a 
concurrent update.

Please try the operation again. Conflicting commit: 
{"timestamp":1603477588946,"userId":"x","operation":"OPTIMIZE","operationParameters":

{"predicate":[],"zOrderBy":[],"batchId":0,"auto":true}

,"job":\{"jobId":"x","jobName":"Streaming join 
xxx","runId":"xxx","jobOwnerId":"","triggerType":"manual"},"notebook":\{"notebookId":""},"clusterId":"xx","readVersion":22,"isolationLevel":"SnapshotIsolation","isBlindAppend":false,"operationMetrics":\{"numRemovedFiles":"44","numRemovedBytes":"64455537","p25FileSize":"63341646","minFileSize":"63341646","numAddedFiles":"1","maxFileSize":"63341646","p75FileSize":"63341646","p50FileSize":"63341646","numAddedBytes":"63341646"}}

 

Table A have these setting: 

- 'delta.isolationLevel' = 'WriteSerializable'

- spark.databricks.delta.optimizeWrite.enable = True

- spark.databricks.delta.autoCompact.enabled = True

 

Other settings:

spark.databricks.io.cache.compression.enabled true

stateStore = rocksdb

spark.sql.adaptive.enabled true

spark.sql.adaptive.skewJoin.enabled true

 

I already set IsolationLevel to WriteSerializable to handle 
ConcurrentAppendingException as described in 
[https://docs.databricks.com/delta/optimizations/isolation-level.html]  and 
[https://docs.databricks.com/delta/concurrency-control.html] 

 

However the error says "SnapshotIsolation". 

 

What did I miss? 

 

 

  was:
I have two Spark Streaming job run concurrently. 
 * Stream join Job: join in Kafka Stream with another stream from Amazon SQS. 
The result will be appended into delta lake table A
 * Upsert job: read from one Delta lake table B and update table A when there 
are matching IDs. 

Environment:
 * Databricks cloud run time 7.2, Spark 3.0.0

 

The stream join job works fine but the Upsert job kept failing. 

 

Stack trace:

com.databricks.sql.transaction.tahoe.ConcurrentAppendException: Files were 
added to partition [dt=2020-xx-yy, request_hour=2020-xx-yy 23:00:00] by a 
concurrent update.

Please try the operation again. Conflicting commit: 
{"timestamp":1603477588946,"userId":"x","operation":"OPTIMIZE","operationParameters":

{"predicate":[],"zOrderBy":[],"batchId":0,"auto":true}

,"job":\{"jobId":"x","jobName":"Streaming join 
xxx","runId":"xxx","jobOwnerId":"","triggerType":"manual"},"notebook":\{"notebookId":""},"clusterId":"xx","readVersion":22,"isolationLevel":"SnapshotIsolation","isBlindAppend":false,"operationMetrics":\{"numRemovedFiles":"44","numRemovedBytes":"64455537","p25FileSize":"63341646","minFileSize":"63341646","numAddedFiles":"1","maxFileSize":"63341646","p75FileSize":"63341646","p50FileSize":"63341646","numAddedBytes":"63341646"}}

 

Table A have these setting: 

- 'delta.isolationLevel' = 'WriteSerializable'

- spark.databricks.delta.optimizeWrite.enable = True

- spark.databricks.delta.autoCompact.enabled = True

 

Other settings:

spark.databricks.io.cache.compression.enabled true

stateStore = rocksdb

spark.sql.adaptive.enabled true

spark.sql.adaptive.skewJoin.enabled true

 

I already set IsolationLevel to WriteSerializable to handle 
ConcurrentAppendingException as described in 
[https://docs.databricks.com/delta/optimizations/isolation-level.html]  

 

However the error says "SnapshotIsolation". 

 

What did I miss? 

 

 


> ConcurrentAppendException while updating delta lake table
> -
>
> Key: SPARK-33232
> URL: https://issues.apache.org/jira/browse/SPARK-33232
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Khang Pham
>Priority: Major
>
> I have two Spark Streaming job run concurrently. 
>  * Stream join Job: join in Kafka Stream with another stream from Amazon SQS. 
> The result will be appended into delta lake table A
>  * Upsert job: read from one Delta lake table B and update table A when there 
> are matching IDs. 
> Environment:
>  * Databricks cloud run time 7.2, Spark 3.0.0
>  
> The stream join job works fine but the Upsert job kept failing. 
>  
> Stack trace:
> com

[jira] [Updated] (SPARK-33232) ConcurrentAppendException while updating delta lake table

2020-10-23 Thread Khang Pham (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Khang Pham updated SPARK-33232:
---
Description: 
I have two Spark Streaming job run concurrently. 
 * Stream join Job: join in Kafka Stream with another stream from Amazon SQS. 
The result will be appended into delta lake table A
 * Upsert job: read from one Delta lake table B and update table A when there 
are matching IDs. 

Environment:
 * Databricks cloud run time 7.2, Spark 3.0.0

 

The stream join job works fine but the Upsert job kept failing. 

 

Stack trace:

com.databricks.sql.transaction.tahoe.ConcurrentAppendException: Files were 
added to partition [dt=2020-xx-yy, request_hour=2020-xx-yy 23:00:00] by a 
concurrent update.

Please try the operation again. Conflicting commit: 
{"timestamp":1603477588946,"userId":"x","operation":"OPTIMIZE","operationParameters":

{"predicate":[],"zOrderBy":[],"batchId":0,"auto":true}

,"job":\{"jobId":"x","jobName":"Streaming join 
xxx","runId":"xxx","jobOwnerId":"","triggerType":"manual"},"notebook":\{"notebookId":""},"clusterId":"xx","readVersion":22,"isolationLevel":"SnapshotIsolation","isBlindAppend":false,"operationMetrics":\{"numRemovedFiles":"44","numRemovedBytes":"64455537","p25FileSize":"63341646","minFileSize":"63341646","numAddedFiles":"1","maxFileSize":"63341646","p75FileSize":"63341646","p50FileSize":"63341646","numAddedBytes":"63341646"}}

 

Table A have these setting: 

- 'delta.isolationLevel' = 'WriteSerializable'

- spark.databricks.delta.optimizeWrite.enable = True

- spark.databricks.delta.autoCompact.enabled = True

 

Other settings:

spark.databricks.io.cache.compression.enabled true

stateStore = rocksdb

spark.sql.adaptive.enabled true

spark.sql.adaptive.skewJoin.enabled true

 

I already set IsolationLevel to WriteSerializable to handle 
ConcurrentAppendingException as described in 
[https://docs.databricks.com/delta/optimizations/isolation-level.html]  and 
[https://docs.databricks.com/delta/concurrency-control.html] 

 

However the error says "SnapshotIsolation". I didn't have any Optimize 
operation running on the target table. 

 

What did I miss? 

 

 

  was:
I have two Spark Streaming job run concurrently. 
 * Stream join Job: join in Kafka Stream with another stream from Amazon SQS. 
The result will be appended into delta lake table A
 * Upsert job: read from one Delta lake table B and update table A when there 
are matching IDs. 

Environment:
 * Databricks cloud run time 7.2, Spark 3.0.0

 

The stream join job works fine but the Upsert job kept failing. 

 

Stack trace:

com.databricks.sql.transaction.tahoe.ConcurrentAppendException: Files were 
added to partition [dt=2020-xx-yy, request_hour=2020-xx-yy 23:00:00] by a 
concurrent update.

Please try the operation again. Conflicting commit: 
{"timestamp":1603477588946,"userId":"x","operation":"OPTIMIZE","operationParameters":

{"predicate":[],"zOrderBy":[],"batchId":0,"auto":true}

,"job":\{"jobId":"x","jobName":"Streaming join 
xxx","runId":"xxx","jobOwnerId":"","triggerType":"manual"},"notebook":\{"notebookId":""},"clusterId":"xx","readVersion":22,"isolationLevel":"SnapshotIsolation","isBlindAppend":false,"operationMetrics":\{"numRemovedFiles":"44","numRemovedBytes":"64455537","p25FileSize":"63341646","minFileSize":"63341646","numAddedFiles":"1","maxFileSize":"63341646","p75FileSize":"63341646","p50FileSize":"63341646","numAddedBytes":"63341646"}}

 

Table A have these setting: 

- 'delta.isolationLevel' = 'WriteSerializable'

- spark.databricks.delta.optimizeWrite.enable = True

- spark.databricks.delta.autoCompact.enabled = True

 

Other settings:

spark.databricks.io.cache.compression.enabled true

stateStore = rocksdb

spark.sql.adaptive.enabled true

spark.sql.adaptive.skewJoin.enabled true

 

I already set IsolationLevel to WriteSerializable to handle 
ConcurrentAppendingException as described in 
[https://docs.databricks.com/delta/optimizations/isolation-level.html]  and 
[https://docs.databricks.com/delta/concurrency-control.html] 

 

However the error says "SnapshotIsolation". 

 

What did I miss? 

 

 


> ConcurrentAppendException while updating delta lake table
> -
>
> Key: SPARK-33232
> URL: https://issues.apache.org/jira/browse/SPARK-33232
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Khang Pham
>Priority: Major
>
> I have two Spark Streaming job run concurrently. 
>  * Stream join Job: join in Kafka Stream with another stream from Amazon SQS. 
> The result will be appended into delta lake table A
>  * Upsert job: read from one Delta lake table B and update table A when there 
> are matching IDs. 
> Environment:
>  * Datab

[jira] [Commented] (SPARK-33044) Add a Jenkins build and test job for Scala 2.13

2020-10-23 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219952#comment-17219952
 ] 

Dongjoon Hyun commented on SPARK-33044:
---

Thank you so much, [~shaneknapp]! :)

> Add a Jenkins build and test job for Scala 2.13
> ---
>
> Key: SPARK-33044
> URL: https://issues.apache.org/jira/browse/SPARK-33044
> Project: Spark
>  Issue Type: Sub-task
>  Components: jenkins
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Priority: Major
>
> The {{master}} branch seems to be almost ready for Scala 2.13 now; we need a 
> Jenkins test job to verify the current work and CI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32084) Replace dictionary-based function definitions to proper functions in functions.py

2020-10-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32084:


Assignee: Apache Spark

> Replace dictionary-based function definitions to proper functions in 
> functions.py
> -
>
> Key: SPARK-32084
> URL: https://issues.apache.org/jira/browse/SPARK-32084
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> Currently some functions in {{functions.py}} are defined via a dictionary, 
> which programmatically adds the functions to the module; however, this 
> prevents some IDEs, such as PyCharm, from detecting them.
> Also, it makes it hard to add proper examples to the docstrings.
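
A simplified, self-contained illustration of the two styles (not the actual 
functions.py internals; the dictionary, helper and return values are stand-ins): the 
generated function is invisible to static analysis, while the explicit def is 
discoverable and can carry a doctest-style example.

{code:python}
# Dictionary-based style being replaced: functions materialize only at import time.
_FUNCTIONS = {"sqrt": "Computes the square root of the given column."}

def _create_function(name, doc):
    def _(col):
        return ("call", name, col)   # stand-in for the real Column-returning logic
    _.__name__ = name
    _.__doc__ = doc
    return _

for _name, _doc in _FUNCTIONS.items():
    globals()[_name] = _create_function(_name, _doc)   # IDEs cannot see 'sqrt' here

# Proper-function style proposed by this ticket: detectable and documentable.
def sqrt(col):
    """Computes the square root of the given column.

    >>> sqrt("age")
    ('call', 'sqrt', 'age')
    """
    return ("call", "sqrt", col)
{code}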



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32084) Replace dictionary-based function definitions to proper functions in functions.py

2020-10-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219973#comment-17219973
 ] 

Apache Spark commented on SPARK-32084:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/30143

> Replace dictionary-based function definitions to proper functions in 
> functions.py
> -
>
> Key: SPARK-32084
> URL: https://issues.apache.org/jira/browse/SPARK-32084
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently some functions in {{functions.py}} are defined via a dictionary, 
> which programmatically adds the functions to the module; however, this 
> prevents some IDEs, such as PyCharm, from detecting them.
> Also, it makes it hard to add proper examples to the docstrings.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32084) Replace dictionary-based function definitions to proper functions in functions.py

2020-10-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32084:


Assignee: (was: Apache Spark)

> Replace dictionary-based function definitions to proper functions in 
> functions.py
> -
>
> Key: SPARK-32084
> URL: https://issues.apache.org/jira/browse/SPARK-32084
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently some functions in {{functions.py}} are defined via a dictionary, 
> which programmatically adds the functions to the module; however, this 
> prevents some IDEs, such as PyCharm, from detecting them.
> Also, it makes it hard to add proper examples to the docstrings.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32084) Replace dictionary-based function definitions to proper functions in functions.py

2020-10-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219974#comment-17219974
 ] 

Apache Spark commented on SPARK-32084:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/30143

> Replace dictionary-based function definitions to proper functions in 
> functions.py
> -
>
> Key: SPARK-32084
> URL: https://issues.apache.org/jira/browse/SPARK-32084
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently some functions in {{functions.py}} are defined via a dictionary, 
> which programmatically adds the functions to the module; however, this 
> prevents some IDEs, such as PyCharm, from detecting them.
> Also, it makes it hard to add proper examples to the docstrings.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33003) Add type hints guideliness to the documentation

2020-10-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33003.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30094
[https://github.com/apache/spark/pull/30094]

> Add type hints guideliness to the documentation
> ---
>
> Key: SPARK-33003
> URL: https://issues.apache.org/jira/browse/SPARK-33003
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33003) Add type hints guideliness to the documentation

2020-10-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33003:


Assignee: Maciej Szymkiewicz

> Add type hints guideliness to the documentation
> ---
>
> Key: SPARK-33003
> URL: https://issues.apache.org/jira/browse/SPARK-33003
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33232) ConcurrentAppendException while updating delta lake table

2020-10-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219990#comment-17219990
 ] 

Hyukjin Kwon commented on SPARK-33232:
--

This sounds specific to Databricks rather than to Apache Spark. I think you 
should contact Databricks about this issue. Does this happen in Apache Spark too?

> ConcurrentAppendException while updating delta lake table
> -
>
> Key: SPARK-33232
> URL: https://issues.apache.org/jira/browse/SPARK-33232
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Khang Pham
>Priority: Major
>
> I have two Spark Streaming job run concurrently. 
>  * Stream join Job: join in Kafka Stream with another stream from Amazon SQS. 
> The result will be appended into delta lake table A
>  * Upsert job: read from one Delta lake table B and update table A when there 
> are matching IDs. 
> Environment:
>  * Databricks cloud run time 7.2, Spark 3.0.0
>  
> The stream join job works fine but the Upsert job kept failing. 
>  
> Stack trace:
> com.databricks.sql.transaction.tahoe.ConcurrentAppendException: Files were 
> added to partition [dt=2020-xx-yy, request_hour=2020-xx-yy 23:00:00] by a 
> concurrent update.
> Please try the operation again. Conflicting commit: 
> {"timestamp":1603477588946,"userId":"x","operation":"OPTIMIZE","operationParameters":
> {"predicate":[],"zOrderBy":[],"batchId":0,"auto":true}
> ,"job":\{"jobId":"x","jobName":"Streaming join 
> xxx","runId":"xxx","jobOwnerId":"","triggerType":"manual"},"notebook":\{"notebookId":""},"clusterId":"xx","readVersion":22,"isolationLevel":"SnapshotIsolation","isBlindAppend":false,"operationMetrics":\{"numRemovedFiles":"44","numRemovedBytes":"64455537","p25FileSize":"63341646","minFileSize":"63341646","numAddedFiles":"1","maxFileSize":"63341646","p75FileSize":"63341646","p50FileSize":"63341646","numAddedBytes":"63341646"}}
>  
> Table A have these setting: 
> - 'delta.isolationLevel' = 'WriteSerializable'
> - spark.databricks.delta.optimizeWrite.enable = True
> - spark.databricks.delta.autoCompact.enabled = True
>  
> Other settings:
> spark.databricks.io.cache.compression.enabled true
> stateStore = rocksdb
> spark.sql.adaptive.enabled true
> spark.sql.adaptive.skewJoin.enabled true
>  
> I already set IsolationLevel to WriteSerializable to handle 
> ConcurrentAppendingException as described in 
> [https://docs.databricks.com/delta/optimizations/isolation-level.html]  and 
> [https://docs.databricks.com/delta/concurrency-control.html] 
>  
> However the error says "SnapshotIsolation". I didn't have any Optimize 
> operation running on the target table. 
>  
> What did I miss? 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32137) AttributeError: Can only use .dt accessor with datetimelike values

2020-10-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219992#comment-17219992
 ] 

Hyukjin Kwon commented on SPARK-32137:
--

It was fixed by the Arrow dependency upgrade in Spark 3.0.0, which cannot be 
backported. Users would have to upgrade their Spark version.

> AttributeError: Can only use .dt accessor with datetimelike values
> --
>
> Key: SPARK-32137
> URL: https://issues.apache.org/jira/browse/SPARK-32137
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.5
>Reporter: David Lacalle Castillo
>Priority: Major
>
> I was using a pandas UDF with a dataframe containing a date object. I was 
> using the latest version of pyarrow, 0.17.0.
> I set up this variable on the Zeppelin Spark interpreter:
> ARROW_PRE_0_15_IPC_FORMAT=1
>  
> However, I was getting the following error:
> Job aborted due to stage failure: Task 0 in stage 19.0 failed 4 times, most 
> recent failure: Lost task 0.3 in stage 19.0 (TID 1619, 10.20.0.5, executor 
> 1): org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 377, in main
>  process()
>  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 372, in 
> process
>  serializer.dump_stream(func(split_index, iterator), outfile)
>  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 290, 
> in dump_stream
>  for series in iterator:
>  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 311, 
> in load_stream
>  yield [self.arrow_to_pandas(c) for c in 
> pa.Table.from_batches([batch]).itercolumns()]
>  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 311, 
> in 
>  yield [self.arrow_to_pandas(c) for c in 
> pa.Table.from_batches([batch]).itercolumns()]
>  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 278, 
> in arrow_to_pandas
>  s = _check_series_convert_date(s, from_arrow_type(arrow_column.type))
>  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1692, in 
> _check_series_convert_date
>  return series.dt.date
>  File "/usr/local/lib/python3.7/dist-packages/pandas/core/generic.py", line 
> 5270, in getattr
>  return object.getattribute(self, name)
>  File "/usr/local/lib/python3.7/dist-packages/pandas/core/accessor.py", line 
> 187, in get
>  accessor_obj = self._accessor(obj)
>  File 
> "/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/accessors.py", 
> line 338, in new
>  raise AttributeError("Can only use .dt accessor with datetimelike values")
> AttributeError: Can only use .dt accessor with datetimelike values
> at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
>  at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:172)
>  at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
>  at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
>  at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
>  Source)
>  at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>  at org.apache.spark.scheduler.Task.run(Task.scala:123)
>  at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> 

[jira] [Resolved] (SPARK-32137) AttributeError: Can only use .dt accessor with datetimelike values

2020-10-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32137.
--
Resolution: Cannot Reproduce

> AttributeError: Can only use .dt accessor with datetimelike values
> --
>
> Key: SPARK-32137
> URL: https://issues.apache.org/jira/browse/SPARK-32137
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.5
>Reporter: David Lacalle Castillo
>Priority: Major
>
> I was using a pandas UDF with a dataframe containing a date object. I was 
> using the latest version of pyarrow, 0.17.0.
> I set up this variable on the Zeppelin Spark interpreter:
> ARROW_PRE_0_15_IPC_FORMAT=1
>  
> However, I was getting the following error:
> Job aborted due to stage failure: Task 0 in stage 19.0 failed 4 times, most 
> recent failure: Lost task 0.3 in stage 19.0 (TID 1619, 10.20.0.5, executor 
> 1): org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 377, in main
>  process()
>  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 372, in 
> process
>  serializer.dump_stream(func(split_index, iterator), outfile)
>  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 290, 
> in dump_stream
>  for series in iterator:
>  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 311, 
> in load_stream
>  yield [self.arrow_to_pandas(c) for c in 
> pa.Table.from_batches([batch]).itercolumns()]
>  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 311, 
> in 
>  yield [self.arrow_to_pandas(c) for c in 
> pa.Table.from_batches([batch]).itercolumns()]
>  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 278, 
> in arrow_to_pandas
>  s = _check_series_convert_date(s, from_arrow_type(arrow_column.type))
>  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1692, in 
> _check_series_convert_date
>  return series.dt.date
>  File "/usr/local/lib/python3.7/dist-packages/pandas/core/generic.py", line 
> 5270, in getattr
>  return object.getattribute(self, name)
>  File "/usr/local/lib/python3.7/dist-packages/pandas/core/accessor.py", line 
> 187, in get
>  accessor_obj = self._accessor(obj)
>  File 
> "/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/accessors.py", 
> line 338, in new
>  raise AttributeError("Can only use .dt accessor with datetimelike values")
> AttributeError: Can only use .dt accessor with datetimelike values
> at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
>  at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:172)
>  at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
>  at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
>  at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
>  Source)
>  at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>  at org.apache.spark.scheduler.Task.run(Task.scala:123)
>  at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor

[jira] [Resolved] (SPARK-18180) pyspark.sql.Row does not serialize well to json

2020-10-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-18180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-18180.
--
Resolution: Not A Bug

> pyspark.sql.Row does not serialize well to json
> ---
>
> Key: SPARK-18180
> URL: https://issues.apache.org/jira/browse/SPARK-18180
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1
> Environment: HDP 2.3.4, Spark 2.0.1, 
>Reporter: Miguel Cabrera
>Priority: Major
>
> {{Row}} does not serialize well automatically. Although it is dict-like in 
> Python, the json module does not seem to be able to serialize it.
> {noformat}
> from  pyspark.sql import Row
> import json
> r = Row(field1='hello', field2='world')
> json.dumps(r)
> {noformat}
> Results:
> {noformat}
> '["hello", "world"]'
> {noformat}
> Expected:
> {noformat}
> {"field1": "hello", "field2": "world"}
> {noformat}
> The workaround is to call the {{asDict()}} method of Row. However, this 
> makes custom serialization of nested objects really painful, as the person 
> has to be aware that they are serializing a Row object. In particular, with 
> SPARK-17695 you cannot serialize DataFrames easily if you have some empty or 
> null fields, so you have to customize the serialization process. 
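
The workaround mentioned above, as a short runnable sketch (field names are taken from 
the example; the recursive flag handles nested Rows):

{code:python}
import json
from pyspark.sql import Row

r = Row(field1="hello", field2="world")
print(json.dumps(r.asDict()))                      # {"field1": "hello", "field2": "world"}

nested = Row(field1="hello", field2=Row(sub_field="world"))
print(json.dumps(nested.asDict(recursive=True)))   # nested Rows are converted too
{code}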



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24915) Calling SparkSession.createDataFrame with schema can throw exception

2020-10-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24915.
--
Resolution: Cannot Reproduce

This is fixed as of Spark 3.0 and cannot be reproduced.

> Calling SparkSession.createDataFrame with schema can throw exception
> 
>
> Key: SPARK-24915
> URL: https://issues.apache.org/jira/browse/SPARK-24915
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: Python 3.6.3
> PySpark 2.3.1 (installed via pip)
> OSX 10.12.6
>Reporter: Stephen Spencer
>Priority: Major
>
> There seems to be a bug in PySpark when using the PySparkSQL session to 
> create a dataframe with a pre-defined schema.
> Code to reproduce the error:
> {code:java}
> from pyspark import SparkConf, SparkContext
> from pyspark.sql import SparkSession
> from pyspark.sql.types import StructType, StructField, StringType, Row
> conf = SparkConf().setMaster("local").setAppName("repro") 
> context = SparkContext(conf=conf) 
> session = SparkSession(context)
> # Construct schema (the order of fields is important)
> schema = StructType([
> StructField('field2', StructType([StructField('sub_field', StringType(), 
> False)]), False),
> StructField('field1', StringType(), False),
> ])
> # Create data to populate data frame
> data = [
> Row(field1="Hello", field2=Row(sub_field='world'))
> ]
> # Attempt to create the data frame supplying the schema
> # this will throw a ValueError
> df = session.createDataFrame(data, schema=schema)
> df.show(){code}
> Running this throws a ValueError
> {noformat}
> Traceback (most recent call last):
> File "schema_bug.py", line 18, in 
> df = session.createDataFrame(data, schema=schema)
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py",
>  line 691, in createDataFrame
> rdd, schema = self._createFromLocal(map(prepare, data), schema)
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py",
>  line 423, in _createFromLocal
> data = [schema.toInternal(row) for row in data]
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py",
>  line 423, in 
> data = [schema.toInternal(row) for row in data]
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py",
>  line 601, in toInternal
> for f, v, c in zip(self.fields, obj, self._needConversion))
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py",
>  line 601, in 
> for f, v, c in zip(self.fields, obj, self._needConversion))
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py",
>  line 439, in toInternal
> return self.dataType.toInternal(obj)
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py",
>  line 619, in toInternal
> raise ValueError("Unexpected tuple %r with StructType" % obj)
> ValueError: Unexpected tuple 'Hello' with StructType{noformat}
> The problem seems to be here:
> https://github.com/apache/spark/blob/3d5c61e5fd24f07302e39b5d61294da79aa0c2f9/python/pyspark/sql/types.py#L603
> specifically the bit
> {code:java}
> zip(self.fields, obj, self._needConversion)
> {code}
> This zip statement seems to assume that obj and self.fields are ordered in 
> the same way, so that the elements of obj will correspond to the right fields 
> in the schema. However, this is not true: a Row orders its elements 
> alphabetically, but the fields in the schema are in whatever order they are 
> specified. In this example, field2 is being initialised with the field1 
> element 'Hello'. If you re-order the fields in the schema to (field1, 
> field2), the given example works without error.
> The schema in the repro is specifically designed to elicit the problem: the 
> fields are out of alphabetical order and one field is a StructType, making 
> schema._needSerializeAnyField == True. However, we encountered this in real 
> use.
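
The re-ordering workaround described in the last paragraph, as a runnable sketch (same 
field names as the repro; only the order of the schema fields changes):

{code:python}
from pyspark.sql import Row, SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.master("local[1]").appName("schema-order").getOrCreate()

# Declare the fields in the same (alphabetical) order that Row sorts its kwargs,
# so the internal zip over (fields, row) lines up correctly.
schema = StructType([
    StructField("field1", StringType(), False),
    StructField("field2",
                StructType([StructField("sub_field", StringType(), False)]),
                False),
])

data = [Row(field1="Hello", field2=Row(sub_field="world"))]
spark.createDataFrame(data, schema=schema).show()
{code}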



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33232) ConcurrentAppendException while updating delta lake table

2020-10-23 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219997#comment-17219997
 ] 

Jungtaek Lim commented on SPARK-33232:
--

The issue doesn't appear to be specific to Apache Spark. Please consult the 
appropriate channel.

> ConcurrentAppendException while updating delta lake table
> -
>
> Key: SPARK-33232
> URL: https://issues.apache.org/jira/browse/SPARK-33232
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Khang Pham
>Priority: Major
>
> I have two Spark Streaming jobs running concurrently:
>  * Stream join job: joins a Kafka stream with another stream from Amazon SQS. 
> The result is appended to Delta Lake table A.
>  * Upsert job: reads from Delta Lake table B and updates table A when there 
> are matching IDs.
> Environment:
>  * Databricks Runtime 7.2, Spark 3.0.0
>  
> The stream join job works fine, but the upsert job keeps failing. 
>  
> Stack trace:
> com.databricks.sql.transaction.tahoe.ConcurrentAppendException: Files were 
> added to partition [dt=2020-xx-yy, request_hour=2020-xx-yy 23:00:00] by a 
> concurrent update.
> Please try the operation again. Conflicting commit: 
> {"timestamp":1603477588946,"userId":"x","operation":"OPTIMIZE","operationParameters":
> {"predicate":[],"zOrderBy":[],"batchId":0,"auto":true}
> ,"job":\{"jobId":"x","jobName":"Streaming join 
> xxx","runId":"xxx","jobOwnerId":"","triggerType":"manual"},"notebook":\{"notebookId":""},"clusterId":"xx","readVersion":22,"isolationLevel":"SnapshotIsolation","isBlindAppend":false,"operationMetrics":\{"numRemovedFiles":"44","numRemovedBytes":"64455537","p25FileSize":"63341646","minFileSize":"63341646","numAddedFiles":"1","maxFileSize":"63341646","p75FileSize":"63341646","p50FileSize":"63341646","numAddedBytes":"63341646"}}
>  
> Table A has these settings: 
> - 'delta.isolationLevel' = 'WriteSerializable'
> - spark.databricks.delta.optimizeWrite.enable = True
> - spark.databricks.delta.autoCompact.enabled = True
>  
> Other settings:
> spark.databricks.io.cache.compression.enabled true
> stateStore = rocksdb
> spark.sql.adaptive.enabled true
> spark.sql.adaptive.skewJoin.enabled true
>  
> I already set the isolation level to WriteSerializable to handle 
> ConcurrentAppendException, as described in 
> [https://docs.databricks.com/delta/optimizations/isolation-level.html] and 
> [https://docs.databricks.com/delta/concurrency-control.html].
>  
> However, the error says "SnapshotIsolation", and I don't have any OPTIMIZE 
> operation running on the target table. 
>  
> What did I miss? 
>  
>  
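> For reference, the upsert is roughly the following (simplified; the table, 
> column and variable names here are placeholders, not the real job). Per the 
> concurrency-control doc above, making the partition separation explicit in 
> the merge condition is supposed to narrow what counts as a conflict:
> {code:python}
> from delta.tables import DeltaTable
> 
> # `spark` is the active SparkSession and `updates` the DataFrame read from
> # table B; both are placeholders for the real job's inputs.
> target = DeltaTable.forName(spark, "table_a")
> # Restrict the merge to the partitions actually being touched (per the
> # concurrency-control doc), so concurrent appends to other partitions are
> # less likely to be flagged as conflicts.
> (target.alias("t")
>     .merge(updates.alias("s"),
>            "t.id = s.id AND t.dt = '2020-10-23' "
>            "AND t.request_hour = '2020-10-23 23:00:00'")
>     .whenMatchedUpdateAll()
>     .execute())
> {code}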



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33232) ConcurrentAppendException while updating delta lake table

2020-10-23 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-33232.
--
Resolution: Invalid

> ConcurrentAppendException while updating delta lake table
> -
>
> Key: SPARK-33232
> URL: https://issues.apache.org/jira/browse/SPARK-33232
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Khang Pham
>Priority: Major
>
> I have two Spark Streaming jobs running concurrently:
>  * Stream join job: joins a Kafka stream with another stream from Amazon SQS. 
> The result is appended to Delta Lake table A.
>  * Upsert job: reads from Delta Lake table B and updates table A when there 
> are matching IDs.
> Environment:
>  * Databricks Runtime 7.2, Spark 3.0.0
>  
> The stream join job works fine, but the upsert job keeps failing. 
>  
> Stack trace:
> com.databricks.sql.transaction.tahoe.ConcurrentAppendException: Files were 
> added to partition [dt=2020-xx-yy, request_hour=2020-xx-yy 23:00:00] by a 
> concurrent update.
> Please try the operation again. Conflicting commit: 
> {"timestamp":1603477588946,"userId":"x","operation":"OPTIMIZE","operationParameters":
> {"predicate":[],"zOrderBy":[],"batchId":0,"auto":true}
> ,"job":\{"jobId":"x","jobName":"Streaming join 
> xxx","runId":"xxx","jobOwnerId":"","triggerType":"manual"},"notebook":\{"notebookId":""},"clusterId":"xx","readVersion":22,"isolationLevel":"SnapshotIsolation","isBlindAppend":false,"operationMetrics":\{"numRemovedFiles":"44","numRemovedBytes":"64455537","p25FileSize":"63341646","minFileSize":"63341646","numAddedFiles":"1","maxFileSize":"63341646","p75FileSize":"63341646","p50FileSize":"63341646","numAddedBytes":"63341646"}}
>  
> Table A has these settings: 
> - 'delta.isolationLevel' = 'WriteSerializable'
> - spark.databricks.delta.optimizeWrite.enable = True
> - spark.databricks.delta.autoCompact.enabled = True
>  
> Other settings:
> spark.databricks.io.cache.compression.enabled true
> stateStore = rocksdb
> spark.sql.adaptive.enabled true
> spark.sql.adaptive.skewJoin.enabled true
>  
> I already set the isolation level to WriteSerializable to handle 
> ConcurrentAppendException, as described in 
> [https://docs.databricks.com/delta/optimizations/isolation-level.html] and 
> [https://docs.databricks.com/delta/concurrency-control.html].
>  
> However, the error says "SnapshotIsolation", and I don't have any OPTIMIZE 
> operation running on the target table. 
>  
> What did I miss? 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33220) `scheduleAtFixedRate` change to `scheduleWithFixedDelay` to avoid repeated unnecessary scheduling within a short time

2020-10-23 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-33220:
--
Description: 
For some scheduled behaviors, we use `scheduleAtFixedRate` to schedule tasks 
or heartbeat RPCs, etc.
If the thread is delayed, for example blocked by a long full GC, the task 
will then be scheduled repeatedly within a short time. For some behaviors 
this is not necessary, and we can use `scheduleWithFixedDelay` to avoid it.
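A rough illustration of the difference (plain Python rather than the actual 
`java.util.concurrent` API, just to show the scheduling arithmetic; the 10s 
period and the 35s stall are made-up numbers):
{code:python}
period = 10        # seconds between runs
stall_end = 35     # the scheduler thread becomes runnable again at t = 35s

# scheduleAtFixedRate: run k is due at t = k * period. Every boundary missed
# while the thread was stalled is still owed an execution, so the runs due at
# 10s, 20s and 30s fire back-to-back as soon as the stall ends.
missed_fixed_rate = [k * period for k in range(1, stall_end // period + 1)]
print(missed_fixed_rate)    # [10, 20, 30] -> three immediate executions

# scheduleWithFixedDelay: the next run is scheduled `period` seconds after the
# previous run finishes. The overdue run executes once at ~35s, the next one
# at ~45s; there is no burst of catch-up executions.
print(stall_end + period)   # 45
{code}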

> `scheduleAtFixedRate` change to `scheduleWithFixedDelay` to avoid repeated 
> unnecessary scheduling within a short time
> --
>
> Key: SPARK-33220
> URL: https://issues.apache.org/jira/browse/SPARK-33220
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>
> For some scheduled behaviors, we use `scheduleAtFixedRate` to schedule tasks 
> or heartbeat RPCs, etc.
> If the thread is delayed, for example blocked by a long full GC, the task 
> will then be scheduled repeatedly within a short time. For some behaviors 
> this is not necessary, and we can use `scheduleWithFixedDelay` to avoid it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33229) UnsupportedOperationException when group by with cube

2020-10-23 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220002#comment-17220002
 ] 

angerszhu commented on SPARK-33229:
---

Working on this
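
Until the fix lands, a possible workaround sketch (assuming the failure is 
limited to ordinal references inside CUBE, per SPARK-33233) is to spell out 
the grouping columns by name, e.g. from PySpark with an active `spark` 
session and the `test_cube` table from the repro:
{code:python}
# Referencing the grouping columns by name instead of by ordinal avoids the
# unresolved-ordinal-inside-CUBE path that triggers the exception.
spark.sql("""
    SELECT a, b, c, count(*)
    FROM test_cube
    GROUP BY a, CUBE(b, c)
""").show()
{code}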

> UnsupportedOperationException when group by with cube
> -
>
> Key: SPARK-33229
> URL: https://issues.apache.org/jira/browse/SPARK-33229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce this issue:
> {code:sql}
> create table test_cube using parquet as select id as a, id as b, id as c from 
> range(10);
> select a, b, c, count(*) from test_cube group by 1, cube(2, 3);
> {code}
> {noformat}
> spark-sql> select a, b, c, count(*) from test_cube group by 1, cube(2, 3);
> 20/10/23 06:31:51 ERROR SparkSQLDriver: Failed in [select a, b, c, count(*) 
> from test_cube group by 1, cube(2, 3)]
> java.lang.UnsupportedOperationException
>   at 
> org.apache.spark.sql.catalyst.expressions.GroupingSet.dataType(grouping.scala:35)
>   at 
> org.apache.spark.sql.catalyst.expressions.GroupingSet.dataType$(grouping.scala:35)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cube.dataType(grouping.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidGroupingExprs$1(CheckAnalysis.scala:268)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$12(CheckAnalysis.scala:284)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$12$adapted(CheckAnalysis.scala:284)
>   at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:284)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:92)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:177)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:92)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:89)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:130)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:156)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:153)
>   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:68)
>   at 
> org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
>   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:133)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:133)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:68)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:66)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:58)
>   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33229) UnsupportedOperationException when group by with cube

2020-10-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33229:


Assignee: (was: Apache Spark)

> UnsupportedOperationException when group by with cube
> -
>
> Key: SPARK-33229
> URL: https://issues.apache.org/jira/browse/SPARK-33229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce this issue:
> {code:sql}
> create table test_cube using parquet as select id as a, id as b, id as c from 
> range(10);
> select a, b, c, count(*) from test_cube group by 1, cube(2, 3);
> {code}
> {noformat}
> spark-sql> select a, b, c, count(*) from test_cube group by 1, cube(2, 3);
> 20/10/23 06:31:51 ERROR SparkSQLDriver: Failed in [select a, b, c, count(*) 
> from test_cube group by 1, cube(2, 3)]
> java.lang.UnsupportedOperationException
>   at 
> org.apache.spark.sql.catalyst.expressions.GroupingSet.dataType(grouping.scala:35)
>   at 
> org.apache.spark.sql.catalyst.expressions.GroupingSet.dataType$(grouping.scala:35)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cube.dataType(grouping.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidGroupingExprs$1(CheckAnalysis.scala:268)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$12(CheckAnalysis.scala:284)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$12$adapted(CheckAnalysis.scala:284)
>   at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:284)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:92)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:177)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:92)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:89)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:130)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:156)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:153)
>   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:68)
>   at 
> org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
>   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:133)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:133)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:68)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:66)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:58)
>   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33229) UnsupportedOperationException when group by with cube

2020-10-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33229:


Assignee: Apache Spark

> UnsupportedOperationException when group by with cube
> -
>
> Key: SPARK-33229
> URL: https://issues.apache.org/jira/browse/SPARK-33229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> How to reproduce this issue:
> {code:sql}
> create table test_cube using parquet as select id as a, id as b, id as c from 
> range(10);
> select a, b, c, count(*) from test_cube group by 1, cube(2, 3);
> {code}
> {noformat}
> spark-sql> select a, b, c, count(*) from test_cube group by 1, cube(2, 3);
> 20/10/23 06:31:51 ERROR SparkSQLDriver: Failed in [select a, b, c, count(*) 
> from test_cube group by 1, cube(2, 3)]
> java.lang.UnsupportedOperationException
>   at 
> org.apache.spark.sql.catalyst.expressions.GroupingSet.dataType(grouping.scala:35)
>   at 
> org.apache.spark.sql.catalyst.expressions.GroupingSet.dataType$(grouping.scala:35)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cube.dataType(grouping.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidGroupingExprs$1(CheckAnalysis.scala:268)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$12(CheckAnalysis.scala:284)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$12$adapted(CheckAnalysis.scala:284)
>   at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:284)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:92)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:177)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:92)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:89)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:130)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:156)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:153)
>   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:68)
>   at 
> org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
>   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:133)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:133)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:68)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:66)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:58)
>   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33229) UnsupportedOperationException when group by with cube

2020-10-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220007#comment-17220007
 ] 

Apache Spark commented on SPARK-33229:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/30144

> UnsupportedOperationException when group by with cube
> -
>
> Key: SPARK-33229
> URL: https://issues.apache.org/jira/browse/SPARK-33229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce this issue:
> {code:sql}
> create table test_cube using parquet as select id as a, id as b, id as c from 
> range(10);
> select a, b, c, count(*) from test_cube group by 1, cube(2, 3);
> {code}
> {noformat}
> spark-sql> select a, b, c, count(*) from test_cube group by 1, cube(2, 3);
> 20/10/23 06:31:51 ERROR SparkSQLDriver: Failed in [select a, b, c, count(*) 
> from test_cube group by 1, cube(2, 3)]
> java.lang.UnsupportedOperationException
>   at 
> org.apache.spark.sql.catalyst.expressions.GroupingSet.dataType(grouping.scala:35)
>   at 
> org.apache.spark.sql.catalyst.expressions.GroupingSet.dataType$(grouping.scala:35)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cube.dataType(grouping.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidGroupingExprs$1(CheckAnalysis.scala:268)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$12(CheckAnalysis.scala:284)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$12$adapted(CheckAnalysis.scala:284)
>   at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:284)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:92)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:177)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:92)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:89)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:130)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:156)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:153)
>   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:68)
>   at 
> org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
>   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:133)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:133)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:68)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:66)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:58)
>   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33233) CUBE/ROLLUP can't support UnresolvedOrdinal

2020-10-23 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220009#comment-17220009
 ] 

angerszhu commented on SPARK-33233:
---

Will raise a PR soon.

> CUBE/ROLLUP can't support UnresolvedOrdinal
> ---
>
> Key: SPARK-33233
> URL: https://issues.apache.org/jira/browse/SPARK-33233
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33233) CUBE/ROLLUP can't support UnresolvedOrdinal

2020-10-23 Thread angerszhu (Jira)
angerszhu created SPARK-33233:
-

 Summary: CUBE/ROLLUP can't support UnresolvedOrdinal
 Key: SPARK-33233
 URL: https://issues.apache.org/jira/browse/SPARK-33233
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0
Reporter: angerszhu






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33233) CUBE/ROLLUP can't support UnresolvedOrdinal

2020-10-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33233:


Assignee: Apache Spark

> CUBE/ROLLUP can't support UnresolvedOrdinal
> ---
>
> Key: SPARK-33233
> URL: https://issues.apache.org/jira/browse/SPARK-33233
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33233) CUBE/ROLLUP can't support UnresolvedOrdinal

2020-10-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33233:


Assignee: (was: Apache Spark)

> CUBE/ROLLUP can't support UnresolvedOrdinal
> ---
>
> Key: SPARK-33233
> URL: https://issues.apache.org/jira/browse/SPARK-33233
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33233) CUBE/ROLLUP can't support UnresolvedOrdinal

2020-10-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220014#comment-17220014
 ] 

Apache Spark commented on SPARK-33233:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/30145

> CUBE/ROLLUP can't support UnresolvedOrdinal
> ---
>
> Key: SPARK-33233
> URL: https://issues.apache.org/jira/browse/SPARK-33233
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33233) CUBE/ROLLUP can't support UnresolvedOrdinal

2020-10-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220015#comment-17220015
 ] 

Apache Spark commented on SPARK-33233:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/30145

> CUBE/ROLLUP can't support UnresolvedOrdinal
> ---
>
> Key: SPARK-33233
> URL: https://issues.apache.org/jira/browse/SPARK-33233
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org