[jira] [Commented] (SPARK-32803) Catch InterruptedException when resolve rack in SparkRackResolver

2020-09-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190963#comment-17190963
 ] 

Apache Spark commented on SPARK-32803:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/29652

> Catch InterruptedException when resolve rack in SparkRackResolver
> -
>
> Key: SPARK-32803
> URL: https://issues.apache.org/jira/browse/SPARK-32803
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Major
>
> In YARN mode, inserting into a Hive table can leave dirty data behind when the
> Spark application is killed. The error message is:
> ```
> java.io.IOException: java.lang.InterruptedException
>   at org.apache.hadoop.util.Shell.runCommand(Shell.java:607)
>   at org.apache.hadoop.util.Shell.run(Shell.java:507)
>   at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:789)
>   at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.runResolveCommand(ScriptBasedMapping.java:251)
>   at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.resolve(ScriptBasedMapping.java:188)
>   at org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:119)
>   at org.apache.hadoop.yarn.util.RackResolver.coreResolve(RackResolver.java:101)
>   at org.apache.hadoop.yarn.util.RackResolver.resolve(RackResolver.java:81)
>   at org.apache.spark.scheduler.cluster.YarnScheduler.getRackForHost(YarnScheduler.scala:37)
>   at org.apache.spark.scheduler.TaskSetManager$$anonfun$addPendingTask$1.apply(TaskSetManager.scala:235)
>   at org.apache.spark.scheduler.TaskSetManager$$anonfun$addPendingTask$1.apply(TaskSetManager.scala:216)
>   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at org.apache.spark.scheduler.TaskSetManager.addPendingTask(TaskSetManager.scala:216)
>   at org.apache.spark.scheduler.TaskSetManager$$anonfun$1.apply$mcVI$sp(TaskSetManager.scala:188)
>   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
>   at org.apache.spark.scheduler.TaskSetManager.<init>(TaskSetManager.scala:187)
>   at org.apache.spark.scheduler.TaskSchedulerImpl.createTaskSetManager(TaskSchedulerImpl.scala:250)
>   at org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:208)
>   at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1215)
>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:1071)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:1074)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:1073)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:1073)
>   at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1014)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2069)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
> ```
> The reason is:
> 1. `CachedDNSToSwitchMapping` may execute a shell command to resolve a hostname;
> when the Spark application is killed, that command is interrupted and we get an
> `InterruptedException`.
> 2. `DAGSchedulerEventProcessLoop` ignores `InterruptedException`, so the
> `FileFormatWriter` action never sees the failure.
> 3. `FileFormatWriter` therefore does not abort the commit and leaves dirty data behind.
>  
> So we should catch the `InterruptedException` and throw a `SparkException` instead.
>  
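To make the proposal concrete, here is a minimal Scala sketch of the catch-and-rethrow idea, assuming a helper that wraps the Hadoop rack-resolution call. The names (`RackResolveSketch`, `hadoopResolve`, `resolve`) are illustrative stand-ins, not the actual `SparkRackResolver` code; only `org.apache.spark.SparkException` is a real class.

```scala
import org.apache.spark.SparkException

object RackResolveSketch {
  // Stand-in for the Hadoop topology-script call that can be interrupted
  // (see the stack trace above); not the real resolver.
  private def hadoopResolve(host: String): String =
    throw new java.io.IOException(new InterruptedException("application killed"))

  def resolve(host: String): String =
    try {
      hadoopResolve(host)
    } catch {
      case e: Throwable
          if e.isInstanceOf[InterruptedException] ||
            e.getCause.isInstanceOf[InterruptedException] =>
        // Surface the interruption as a SparkException so the caller sees a
        // real failure instead of the event loop silently dropping it.
        throw new SparkException(s"Failed to resolve rack for host $host", e)
    }
}
```

The key point is that the interruption reaches the caller as a `SparkException`, so `FileFormatWriter` can abort its commit instead of leaving dirty files.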






[jira] [Assigned] (SPARK-32803) Catch InterruptedException when resolve rack in SparkRackResolver

2020-09-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32803:


Assignee: Apache Spark

> Catch InterruptedException when resolve rack in SparkRackResolver
> -
>
> Key: SPARK-32803
> URL: https://issues.apache.org/jira/browse/SPARK-32803
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Assignee: Apache Spark
>Priority: Major
>



[jira] [Assigned] (SPARK-32803) Catch InterruptedException when resolve rack in SparkRackResolver

2020-09-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32803:


Assignee: (was: Apache Spark)

> Catch InterruptedException when resolve rack in SparkRackResolver
> -
>
> Key: SPARK-32803
> URL: https://issues.apache.org/jira/browse/SPARK-32803
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Major
>



[jira] [Commented] (SPARK-32803) Catch InterruptedException when resolve rack in SparkRackResolver

2020-09-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190962#comment-17190962
 ] 

Apache Spark commented on SPARK-32803:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/29652

> Catch InterruptedException when resolve rack in SparkRackResolver
> -
>
> Key: SPARK-32803
> URL: https://issues.apache.org/jira/browse/SPARK-32803
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Major
>



[jira] [Created] (SPARK-32803) Catch InterruptedException when resolve rack in SparkRackResolver

2020-09-04 Thread ulysses you (Jira)
ulysses you created SPARK-32803:
---

 Summary: Catch InterruptedException when resolve rack in 
SparkRackResolver
 Key: SPARK-32803
 URL: https://issues.apache.org/jira/browse/SPARK-32803
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: ulysses you





[jira] [Created] (SPARK-32802) Avoid using SpecificInternalRow in RunLengthEncoding#Encoder

2020-09-04 Thread Chao Sun (Jira)
Chao Sun created SPARK-32802:


 Summary: Avoid using SpecificInternalRow in 
RunLengthEncoding#Encoder
 Key: SPARK-32802
 URL: https://issues.apache.org/jira/browse/SPARK-32802
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Chao Sun


{{RunLengthEncoding#Encoder}} currently uses {{SpecificInternalRow}} which is 
more expensive than using the native types.
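As a rough illustration of that direction (a sketch only, with made-up names, not Spark's actual {{RunLengthEncoding#Encoder}} code), keeping the running state in native-typed fields avoids wrapping every value in a row:

```scala
// Sketch only: a run-length encoder for Int values that keeps its running
// state in native-typed fields instead of a row wrapper.
class IntRunLengthEncoderSketch {
  private var currentValue: Int = 0
  private var runLength: Int = 0
  private val runs = scala.collection.mutable.ArrayBuffer.empty[(Int, Int)]

  // Record one value, extending the current run or starting a new one.
  def gatherStats(value: Int): Unit = {
    if (runLength > 0 && value == currentValue) {
      runLength += 1
    } else {
      if (runLength > 0) runs += ((currentValue, runLength))
      currentValue = value
      runLength = 1
    }
  }

  // Return the (value, length) runs seen so far, including the open run.
  def result(): Seq[(Int, Int)] =
    if (runLength > 0) (runs :+ ((currentValue, runLength))).toSeq else runs.toSeq
}
```

Feeding it the values 1, 1, 2 yields the runs (1, 2) and (2, 1).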






[jira] [Commented] (SPARK-32778) Accidental Data Deletion on calling saveAsTable

2020-09-04 Thread Aman Rastogi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190920#comment-17190920
 ] 

Aman Rastogi commented on SPARK-32778:
--

[~maropu], I am on v2.3; I have not tried a higher version.

> Accidental Data Deletion on calling saveAsTable
> ---
>
> Key: SPARK-32778
> URL: https://issues.apache.org/jira/browse/SPARK-32778
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Aman Rastogi
>Priority: Major
>
> {code:java}
> df.write.option("path", "/already/existing/path")
>   .mode(SaveMode.Append).format("json").saveAsTable("db.table")
> {code}
> The code above deleted the data already present at the path
> "/already/existing/path". This happened because the table did not yet exist in
> the Hive metastore, while the given path did contain data. If the table is not
> present in the Hive metastore, the SaveMode is internally changed to
> SaveMode.Overwrite regardless of what the user provided, which leads to data
> deletion. This behavior was introduced as part of
> https://issues.apache.org/jira/browse/SPARK-19583.
> Now suppose the user is not using an external Hive metastore (the metastore is
> tied to a cluster) and the cluster goes down, or for some other reason the user
> has to migrate to a new cluster. When the user runs the code above on the new
> cluster, it first deletes the data at the path. That could be production data,
> and the user is completely unaware of the deletion because they specified
> SaveMode.Append or ErrorIfExists. This is accidental data deletion.
>  
> Repro steps:
>  
>  # Save data through a Hive table using the code above.
>  # Create another cluster and save data into a new table on that cluster, using the same path.
>  
> Proposed Fix:
> Instead of modifying SaveMode to Overwrite, we should modify it to 
> ErrorIfExists in class CreateDataSourceTableAsSelectCommand.
> Change (line 154)
>  
> {code:java}
> val result = saveDataIntoTable(
>   sparkSession, table, tableLocation, child, SaveMode.Overwrite, tableExists = false)
>  
> {code}
> to
>  
> {code:java}
> val result = saveDataIntoTable(
>   sparkSession, table, tableLocation, child, SaveMode.ErrorIfExists,
>   tableExists = false)
> {code}
> This should not break CTAS. Even in the CTAS case, the user may not want
> pre-existing data to be deleted, since such deletion could be accidental.
>  






[jira] [Commented] (SPARK-32017) Make Pyspark Hadoop 3.2+ Variant available in PyPI

2020-09-04 Thread George Pongracz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190908#comment-17190908
 ] 

George Pongracz commented on SPARK-32017:
-

Thank you, this is valuable to me and your effort is appreciated.

> Make Pyspark Hadoop 3.2+ Variant available in PyPI
> --
>
> Key: SPARK-32017
> URL: https://issues.apache.org/jira/browse/SPARK-32017
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: George Pongracz
>Priority: Major
>
> The version of PySpark 3.0.0 currently available in PyPI uses Hadoop 2.7.4.
> Could a variant (or the default) have its Hadoop version aligned to 3.2.0, as
> per the downloadable Spark binaries?
> This would enable the PyPI version to be compatible with session-token
> authorisation and assist in accessing data residing in object stores with
> stronger encryption methods.
> If not in PyPI, then at least as a tar file in the Apache download archives, please.






[jira] [Commented] (SPARK-32530) SPIP: Kotlin support for Apache Spark

2020-09-04 Thread Pasha Finkeshteyn (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190879#comment-17190879
 ] 

Pasha Finkeshteyn commented on SPARK-32530:
---

[~DennisJaheruddin] thank you for your idea. We've published the announcement of
version
[1.0.0-preview1|https://blog.jetbrains.com/kotlin/2020/08/introducing-kotlin-for-apache-spark-preview/].
We're going to implement 2.4 compatibility in the next release (I believe it
will be the preview2 version).

[~kabhwan] We're interested in, and committed to, supporting at least our part
and possibly other parts of Apache Spark itself. In general, we want to give end
users of the API the best possible experience, which would be much easier if the
API were part of the Spark project itself (for example, compatibility with every
Spark version could be tested automatically during the build, making it much
easier to maintain).

So if you consider JetBrains such a vendor, then yes, the vendor is interested
in this.

> SPIP: Kotlin support for Apache Spark
> -
>
> Key: SPARK-32530
> URL: https://issues.apache.org/jira/browse/SPARK-32530
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Pasha Finkeshteyn
>Priority: Major
>
> h2. Background and motivation
> Kotlin is a cross-platform, statically typed, general-purpose JVM language. 
> In the last year more than 5 million developers have used Kotlin in mobile, 
> backend, frontend and scientific development. The number of Kotlin developers 
> grows rapidly every year. 
>  * [According to 
> redmonk|https://redmonk.com/sogrady/2020/02/28/language-rankings-1-20/]: 
> "Kotlin, the second fastest growing language we’ve seen outside of Swift, 
> made a big splash a year ago at this time when it vaulted eight full spots up 
> the list."
>  * [According to snyk.io|https://snyk.io/wp-content/uploads/jvm_2020.pdf], 
> Kotlin is the second most popular language on the JVM
>  * [According to 
> StackOverflow|https://insights.stackoverflow.com/survey/2020] Kotlin’s share 
> increased by 7.8% in 2020.
> We notice the increasing usage of Kotlin in data analysis ([6% of users in 
> 2020|https://www.jetbrains.com/lp/devecosystem-2020/kotlin/], as opposed to 
> 2% in 2019) and machine learning (3% of users in 2020, as opposed to 0% in 
> 2019), and we expect these numbers to continue to grow. 
> We, authors of this SPIP, strongly believe that making Kotlin API officially 
> available to developers can bring new users to Apache Spark and help some of 
> the existing users.
> h2. Goals
> The goal of this project is to bring first-class support for Kotlin language 
> into the Apache Spark project. We’re going to achieve this by adding one more 
> module to the current Apache Spark distribution.
> h2. Non-goals
> There is no goal to replace any existing language support or to change any 
> existing Apache Spark API.
> At this time, there is no goal to support non-core APIs of Apache Spark like 
> Spark ML and Spark structured streaming. This may change in the future based 
> on community feedback.
> There is no goal to provide CLI for Kotlin for Apache Spark, this will be a 
> separate SPIP.
> There is no goal to provide support for Apache Spark < 3.0.0.
> h2. Current implementation
> A working prototype is available at 
> [https://github.com/JetBrains/kotlin-spark-api]. It has been tested inside 
> JetBrains and by early adopters.
> h2. What are the risks?
> There is always a risk that this product won’t get enough popularity and will 
> bring more costs than benefits. It can be mitigated by the fact that we don't 
> need to change any existing API and support can be potentially dropped at any 
> time.
> We also believe that existing API is rather low maintenance. It does not 
> bring anything more complex than already exists in the Spark codebase. 
> Furthermore, the implementation is compact - less than 2000 lines of code.
> We are committed to maintaining, improving and evolving the API based on 
> feedback from both Spark and Kotlin communities. As the Kotlin data community 
> continues to grow, we see Kotlin API for Apache Spark as an important part in 
> the evolving Kotlin ecosystem, and intend to fully support it. 
> h2. How long will it take?
> A working implementation is already available, and if the community proposes
> changes to improve it, these can be implemented quickly, in weeks if not days.






[jira] [Assigned] (SPARK-32794) Rare corner case error in micro-batch engine with some stateful queries + no-data-batches + V1 streaming sources

2020-09-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32794:


Assignee: Apache Spark  (was: Tathagata Das)

> Rare corner case error in micro-batch engine with some stateful queries + 
> no-data-batches + V1 streaming sources 
> -
>
> Key: SPARK-32794
> URL: https://issues.apache.org/jira/browse/SPARK-32794
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.4, 2.4.6, 3.0.0, 3.0.1
>Reporter: Tathagata Das
>Assignee: Apache Spark
>Priority: Major
>
> The Structured Streaming micro-batch engine has a contract with V1 data sources
> that, after a restart, it will call `source.getBatch()` on the last batch
> attempted before the restart. However, a very rare combination of events
> violates this contract. It occurs only when
> - The streaming query has specific types of stateful operations with
> watermarks (e.g., aggregation in append mode, mapGroupsWithState with timeouts).
> - Such queries can execute a batch even without new data, when the previous
> batch updates the watermark and the stateful ops are such that the new
> watermark can cause new output/cleanup. These batches are called
> no-data-batches.
> - The last batch before termination was an incomplete no-data-batch. Upon
> restart, the micro-batch engine fails to call `source.getBatch` when
> attempting to re-execute the incomplete no-data-batch.
> This occurs because no-data-batches have the same start and end offsets, and
> when a batch is executed, if the start and end offsets are the same, the call
> to `source.getBatch` is skipped because the generated plan is assumed to be
> empty. This only affects V1 data sources, which rely on this invariant to
> initialize differently depending on whether the query is being started from
> scratch or restarted. How a source misbehaves is very source-specific.
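To make the skipped call concrete, here is a small self-contained Scala sketch of the offset check described above. The names (`NoDataBatchSketch`, `V1Source`, `Offset`, `runBatch`) are made up for illustration and are not the actual `MicroBatchExecution` code:

```scala
object NoDataBatchSketch {
  final case class Offset(value: Long)

  trait V1Source {
    // V1 contract: after a restart the engine re-calls getBatch on the last
    // attempted batch, even if that batch moved no data.
    def getBatch(start: Option[Offset], end: Offset): Unit
  }

  def runBatch(source: V1Source, start: Option[Offset], end: Offset): Unit = {
    // The shortcut described above: a no-data-batch has start == end, so
    // getBatch is skipped, and on restart a V1 source never gets the callback
    // it relies on to tell a fresh start from a restart.
    if (!start.contains(end)) {
      source.getBatch(start, end)
    }
  }
}
```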






[jira] [Assigned] (SPARK-32794) Rare corner case error in micro-batch engine with some stateful queries + no-data-batches + V1 streaming sources

2020-09-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32794:


Assignee: Tathagata Das  (was: Apache Spark)

> Rare corner case error in micro-batch engine with some stateful queries + 
> no-data-batches + V1 streaming sources 
> -
>
> Key: SPARK-32794
> URL: https://issues.apache.org/jira/browse/SPARK-32794
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.4, 2.4.6, 3.0.0, 3.0.1
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
>



[jira] [Commented] (SPARK-32794) Rare corner case error in micro-batch engine with some stateful queries + no-data-batches + V1 streaming sources

2020-09-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190866#comment-17190866
 ] 

Apache Spark commented on SPARK-32794:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/29651

> Rare corner case error in micro-batch engine with some stateful queries + 
> no-data-batches + V1 streaming sources 
> -
>
> Key: SPARK-32794
> URL: https://issues.apache.org/jira/browse/SPARK-32794
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.4, 2.4.6, 3.0.0, 3.0.1
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
>



[jira] [Commented] (SPARK-32794) Rare corner case error in micro-batch engine with some stateful queries + no-data-batches + V1 streaming sources

2020-09-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190867#comment-17190867
 ] 

Apache Spark commented on SPARK-32794:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/29651

> Rare corner case error in micro-batch engine with some stateful queries + 
> no-data-batches + V1 streaming sources 
> -
>
> Key: SPARK-32794
> URL: https://issues.apache.org/jira/browse/SPARK-32794
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.4, 2.4.6, 3.0.0, 3.0.1
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
>



[jira] [Updated] (SPARK-32799) Make unionByName optionally fill missing columns with nulls in SparkR

2020-09-04 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-32799:

Labels: beginner  (was: )

> Make unionByName optionally fill missing columns with nulls in SparkR
> -
>
> Key: SPARK-32799
> URL: https://issues.apache.org/jira/browse/SPARK-32799
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: beginner
>
> It would be nicer to expose the {{unionByName}} parameter in the R API as well.
> Currently this is only exposed on the Scala/Java side (SPARK-29358).
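For reference, a minimal example of the Scala-side behavior this ticket asks to expose in SparkR, assuming the Spark 3.1 signature from SPARK-29358 where `unionByName` takes an `allowMissingColumns` flag:

```scala
import org.apache.spark.sql.SparkSession

object UnionByNameExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("unionByName-example").getOrCreate()
    import spark.implicits._

    val df1 = Seq((1, 2)).toDF("a", "b")
    val df2 = Seq((3, 4)).toDF("a", "c")

    // Columns missing on either side ("c" in df1, "b" in df2) are filled with nulls.
    df1.unionByName(df2, allowMissingColumns = true).show()

    spark.stop()
  }
}
```

The SparkR change would presumably expose the same flag on the R `unionByName` function.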






[jira] [Updated] (SPARK-32798) Make unionByName optionally fill missing columns with nulls in PySpark

2020-09-04 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-32798:

Labels: beginner  (was: )

> Make unionByName optionally fill missing columns with nulls in PySpark
> --
>
> Key: SPARK-32798
> URL: https://issues.apache.org/jira/browse/SPARK-32798
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: beginner
>
> It would be better to expose the {{unionByName}} parameter in the Python API as
> well. Currently this is only exposed in the Scala/Java API (SPARK-29358).






[jira] [Updated] (SPARK-32793) Expose assert_true in Python/Scala APIs and add error message parameter

2020-09-04 Thread Karen Feng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karen Feng updated SPARK-32793:
---
Description: 
# Add RAISEERROR() (or RAISE_ERROR()) to the API
 # Add Scala/Python/R version of API for ASSERT_TRUE()
 # Add an extra parameter to ASSERT_TRUE() as (cond, message), and for which 
the `message` parameter is only lazily evaluated when the condition is not true
 # Change the implementation of ASSERT_TRUE() to be rewritten during 
optimization to IF() instead.

  was:
# Add RAISEERROR() (or RAISE_ERROR()) to the API
 # Add Scala/Python/R version of API for ASSERT_TRUE()
 # Add an extra parameter to ASSERT_TRUE() as (cond, message), and for which the
`message` parameter is only lazily evaluated when the condition is not true
 # Change the implementation of ASSERT_TRUE() to be rewritten during optimization
to IF() instead.


> Expose assert_true in Python/Scala APIs and add error message parameter
> ---
>
> Key: SPARK-32793
> URL: https://issues.apache.org/jira/browse/SPARK-32793
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Karen Feng
>Priority: Minor
>
> # Add RAISEERROR() (or RAISE_ERROR()) to the API
>  # Add Scala/Python/R version of API for ASSERT_TRUE()
>  # Add an extra parameter to ASSERT_TRUE() as (cond, message), and for which 
> the `message` parameter is only lazily evaluated when the condition is not 
> true
>  # Change the implementation of ASSERT_TRUE() to be rewritten during 
> optimization to IF() instead.
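A short sketch of how the proposed surface could look from Scala, assuming functions named along the lines above (`assert_true` with a message column and `raise_error`); the final names and signatures may differ:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{assert_true, col, lit, raise_error}

object AssertTrueSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("assert-true-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq(1, 2, 3).toDF("x")

    // Condition holds for every row, so assert_true evaluates to null and the
    // custom message is never needed (it is only used on failure).
    df.select(assert_true(col("x") >= 0, lit("x must be non-negative"))).collect()

    // raise_error always fails when evaluated; catch it just to show the flow.
    try {
      df.select(raise_error(lit("boom"))).collect()
    } catch {
      case e: Exception => println(s"query failed as expected: ${e.getMessage}")
    }

    spark.stop()
  }
}
```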






[jira] [Updated] (SPARK-32793) Expose assert_true in Python/Scala APIs and add error message parameter

2020-09-04 Thread Karen Feng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karen Feng updated SPARK-32793:
---
Description: 
# Add RAISEERROR() (or RAISE_ERROR()) to the API
 # Add Scala/Python/R version of API for ASSERT_TRUE()
 # Add an extra parameter to ASSERT_TRUE() as (cond, message), and for which the
`message` parameter is only lazily evaluated when the condition is not true
 # Change the implementation of ASSERT_TRUE() to be rewritten during optimization
to IF() instead.

  was:
# assert_true is only available as a Spark SQL expression, and should be 
exposed as a  function in the Scala and Python APIs for easier programmatic 
access.
 # The error message thrown when the assertion fails is often not very useful 
for the user. Add a parameter so that users can pass a custom error message.


> Expose assert_true in Python/Scala APIs and add error message parameter
> ---
>
> Key: SPARK-32793
> URL: https://issues.apache.org/jira/browse/SPARK-32793
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Karen Feng
>Priority: Minor
>
> # Add RAISEERROR() (or RAISE_ERROR()) to the API
>  # Add Scala/Python/R version of API for ASSERT_TRUE()
>  # Add an extra parameter to ASSERT_TRUE() as (cond, message), and for which the
> `message` parameter is only lazily evaluated when the condition is not true
>  # Change the implementation of ASSERT_TRUE() to be rewritten during
> optimization to IF() instead.






[jira] [Assigned] (SPARK-32801) Make InferFiltersFromConstraints take in account EqualNullSafe

2020-09-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32801:


Assignee: (was: Apache Spark)

> Make InferFiltersFromConstraints take in account EqualNullSafe
> --
>
> Key: SPARK-32801
> URL: https://issues.apache.org/jira/browse/SPARK-32801
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Tanel Kiis
>Priority: Major
>







[jira] [Commented] (SPARK-32801) Make InferFiltersFromConstraints take in account EqualNullSafe

2020-09-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190673#comment-17190673
 ] 

Apache Spark commented on SPARK-32801:
--

User 'tanelk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29650

> Make InferFiltersFromConstraints take in account EqualNullSafe
> --
>
> Key: SPARK-32801
> URL: https://issues.apache.org/jira/browse/SPARK-32801
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Tanel Kiis
>Priority: Major
>







[jira] [Assigned] (SPARK-32801) Make InferFiltersFromConstraints take in account EqualNullSafe

2020-09-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32801:


Assignee: Apache Spark

> Make InferFiltersFromConstraints take in account EqualNullSafe
> --
>
> Key: SPARK-32801
> URL: https://issues.apache.org/jira/browse/SPARK-32801
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Tanel Kiis
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-32801) Make InferFiltersFromConstraints take in account EqualNullSafe

2020-09-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190674#comment-17190674
 ] 

Apache Spark commented on SPARK-32801:
--

User 'tanelk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29650

> Make InferFiltersFromConstraints take in account EqualNullSafe
> --
>
> Key: SPARK-32801
> URL: https://issues.apache.org/jira/browse/SPARK-32801
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Tanel Kiis
>Priority: Major
>







[jira] [Assigned] (SPARK-32779) Spark/Hive3 interaction potentially causes deadlock

2020-09-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32779:


Assignee: (was: Apache Spark)

> Spark/Hive3 interaction potentially causes deadlock
> ---
>
> Key: SPARK-32779
> URL: https://issues.apache.org/jira/browse/SPARK-32779
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> This is an issue for applications that share a Spark Session across multiple 
> threads.
> sessionCatalog.loadPartition (after checking that the table exists) grabs 
> locks in this order:
>  - HiveExternalCatalog
>  - HiveSessionCatalog (in Shim_v3_0)
> Other operations (e.g., sessionCatalog.tableExists) grab locks in this order:
>  - HiveSessionCatalog
>  - HiveExternalCatalog
> [This|https://github.com/apache/spark/blob/ad6b887541bf90cc3ea830a1a3322b71ccdd80ee/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala#L1332]
>  appears to be the culprit. Maybe db name should be defaulted _before_ the 
> call to HiveClient so that Shim_v3_0 doesn't have to call back into 
> SessionCatalog. Or possibly this is not needed at all, since loadPartition in 
> Shim_v2_1 doesn't worry about the default db name, but that might be because 
> of differences between Hive client libraries.
> Reproduction case:
>  - You need to have a running Hive 3.x HMS instance and the appropriate 
> hive-site.xml for your Spark instance
>  - Adjust your spark.sql.hive.metastore.version accordingly
>  - It might take more than one try to hit the deadlock
> Launch Spark:
> {noformat}
> bin/spark-shell --conf "spark.sql.hive.metastore.jars=${HIVE_HOME}/lib/*" \
>   --conf spark.sql.hive.metastore.version=3.1
> {noformat}
> Then use the following code:
> {noformat}
> import scala.collection.mutable.ArrayBuffer
> import scala.util.Random
> val tableCount = 4
> for (i <- 0 until tableCount) {
>   val tableName = s"partitioned${i+1}"
>   sql(s"drop table if exists $tableName")
>   sql(s"create table $tableName (a bigint) partitioned by (b bigint) stored as orc")
> }
> val threads = new ArrayBuffer[Thread]
> for (i <- 0 until tableCount) {
>   threads.append(new Thread( new Runnable {
>     override def run: Unit = {
>       val tableName = s"partitioned${i + 1}"
>       val rand = Random
>       val df = spark.range(0, 2).toDF("a")
>       val location = s"/tmp/${rand.nextLong.abs}"
>       df.write.mode("overwrite").orc(location)
>       sql(
>         s"""
>         LOAD DATA LOCAL INPATH '$location' INTO TABLE $tableName partition (b=$i)""")
>     }
>   }, s"worker$i"))
>   threads(i).start()
> }
> for (i <- 0 until tableCount) {
>   println(s"Joining with thread $i")
>   threads(i).join()
> }
> println("All done")
> {noformat}
> The job often gets stuck after one or two "Joining..." lines.
> {{kill -3}} shows something like this:
> {noformat}
> Found one Java-level deadlock:
> =
> "worker3":
>   waiting to lock monitor 0x7fdc3cde6798 (object 0x000784d98ac8, a 
> org.apache.spark.sql.hive.HiveSessionCatalog),
>   which is held by "worker0"
> "worker0":
>   waiting to lock monitor 0x7fdc441d1b88 (object 0x0007861d1208, a 
> org.apache.spark.sql.hive.HiveExternalCatalog),
>   which is held by "worker3"
> {noformat}
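For readers unfamiliar with the pattern, here is a tiny self-contained Scala sketch of the lock-order inversion described above. The object names merely stand in for `HiveExternalCatalog` and `HiveSessionCatalog` and are not the real classes:

```scala
// Toy model of the inverted lock acquisition: if one thread runs
// loadPartitionPath while another runs tableExistsPath, each can end up
// holding one monitor and waiting forever for the other.
object LockOrderSketch {
  private val externalCatalog = new Object // stand-in for HiveExternalCatalog
  private val sessionCatalog  = new Object // stand-in for HiveSessionCatalog

  def loadPartitionPath(): Unit =
    externalCatalog.synchronized {
      sessionCatalog.synchronized { /* loadPartition work */ }
    }

  def tableExistsPath(): Unit =
    sessionCatalog.synchronized {
      externalCatalog.synchronized { /* tableExists work */ }
    }
}
```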






[jira] [Assigned] (SPARK-32779) Spark/Hive3 interaction potentially causes deadlock

2020-09-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32779:


Assignee: Apache Spark

> Spark/Hive3 interaction potentially causes deadlock
> ---
>
> Key: SPARK-32779
> URL: https://issues.apache.org/jira/browse/SPARK-32779
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Bruce Robbins
>Assignee: Apache Spark
>Priority: Major
>



[jira] [Commented] (SPARK-32779) Spark/Hive3 interaction potentially causes deadlock

2020-09-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190671#comment-17190671
 ] 

Apache Spark commented on SPARK-32779:
--

User 'sandeep-katta' has created a pull request for this issue:
https://github.com/apache/spark/pull/29649

> Spark/Hive3 interaction potentially causes deadlock
> ---
>
> Key: SPARK-32779
> URL: https://issues.apache.org/jira/browse/SPARK-32779
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> This is an issue for applications that share a Spark Session across multiple 
> threads.
> sessionCatalog.loadPartition (after checking that the table exists) grabs 
> locks in this order:
>  - HiveExternalCatalog
>  - HiveSessionCatalog (in Shim_v3_0)
> Other operations (e.g., sessionCatalog.tableExists) grab locks in this order:
>  - HiveSessionCatalog
>  - HiveExternalCatalog
> [This|https://github.com/apache/spark/blob/ad6b887541bf90cc3ea830a1a3322b71ccdd80ee/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala#L1332]
>  appears to be the culprit. Maybe the db name should be defaulted _before_ the 
> call to HiveClient so that Shim_v3_0 doesn't have to call back into 
> SessionCatalog. Or possibly this is not needed at all, since loadPartition in 
> Shim_v2_1 doesn't worry about the default db name, but that might be because 
> of differences between Hive client libraries.
> Reproduction case:
>  - You need to have a running Hive 3.x HMS instance and the appropriate 
> hive-site.xml for your Spark instance
>  - Adjust your spark.sql.hive.metastore.version accordingly
>  - It might take more than one try to hit the deadlock
> Launch Spark:
> {noformat}
> bin/spark-shell --conf "spark.sql.hive.metastore.jars=${HIVE_HOME}/lib/*" --conf spark.sql.hive.metastore.version=3.1
> {noformat}
> Then use the following code:
> {noformat}
> import scala.collection.mutable.ArrayBuffer
> import scala.util.Random
> val tableCount = 4
> for (i <- 0 until tableCount) {
>   val tableName = s"partitioned${i+1}"
>   sql(s"drop table if exists $tableName")
>   sql(s"create table $tableName (a bigint) partitioned by (b bigint) stored as orc")
> }
> val threads = new ArrayBuffer[Thread]
> for (i <- 0 until tableCount) {
>   threads.append(new Thread( new Runnable {
> override def run: Unit = {
>   val tableName = s"partitioned${i + 1}"
>   val rand = Random
>   val df = spark.range(0, 2).toDF("a")
>   val location = s"/tmp/${rand.nextLong.abs}"
>   df.write.mode("overwrite").orc(location)
>   sql(
> s"""
> LOAD DATA LOCAL INPATH '$location' INTO TABLE $tableName partition (b=$i)""")
> }
>   }, s"worker$i"))
>   threads(i).start()
> }
> for (i <- 0 until tableCount) {
>   println(s"Joining with thread $i")
>   threads(i).join()
> }
> println("All done")
> {noformat}
> The job often gets stuck after one or two "Joining..." lines.
> {{kill -3}} shows something like this:
> {noformat}
> Found one Java-level deadlock:
> =
> "worker3":
>   waiting to lock monitor 0x7fdc3cde6798 (object 0x000784d98ac8, a 
> org.apache.spark.sql.hive.HiveSessionCatalog),
>   which is held by "worker0"
> "worker0":
>   waiting to lock monitor 0x7fdc441d1b88 (object 0x0007861d1208, a 
> org.apache.spark.sql.hive.HiveExternalCatalog),
>   which is held by "worker3"
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32801) Make InferFiltersFromConstraints take into account EqualNullSafe

2020-09-04 Thread Tanel Kiis (Jira)
Tanel Kiis created SPARK-32801:
--

 Summary: Make InferFiltersFromConstraints take into account 
EqualNullSafe
 Key: SPARK-32801
 URL: https://issues.apache.org/jira/browse/SPARK-32801
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Tanel Kiis






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32312) Upgrade Apache Arrow to 1.0.0

2020-09-04 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190661#comment-17190661
 ] 

Hyukjin Kwon commented on SPARK-32312:
--

Thank you [~bryanc]!

> Upgrade Apache Arrow to 1.0.0
> -
>
> Key: SPARK-32312
> URL: https://issues.apache.org/jira/browse/SPARK-32312
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Apache Arrow will soon release v1.0.0, which provides backward/forward 
> compatibility guarantees as well as a number of fixes and improvements. This 
> will upgrade the Java artifact and the PySpark API. Although PySpark will not 
> need special changes, it might be a good idea to bump up the minimum supported 
> version and update CI testing.
> TBD: list of important improvements and fixes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32800) Remove ExpressionSet from the 2.13 branch

2020-09-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32800:


Assignee: (was: Apache Spark)

> Remove ExpressionSet from the 2.13 branch
> -
>
> Key: SPARK-32800
> URL: https://issues.apache.org/jira/browse/SPARK-32800
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Ali Afroozeh
>Priority: Major
>
> ExpressionSet does not extend Scala Set anymore and can therefore be 
> removed from the 2.13 branch. This is a follow-up to 
> https://issues.apache.org/jira/browse/SPARK-32755.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32800) Remove ExpressionSet from the 2.13 branch

2020-09-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190657#comment-17190657
 ] 

Apache Spark commented on SPARK-32800:
--

User 'dbaliafroozeh' has created a pull request for this issue:
https://github.com/apache/spark/pull/29648

> Remove ExpressionSet from the 2.13 branch
> -
>
> Key: SPARK-32800
> URL: https://issues.apache.org/jira/browse/SPARK-32800
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Ali Afroozeh
>Priority: Major
>
> ExpressionSet does not extend Scala Set anymore and can therefore be 
> removed from the 2.13 branch. This is a follow-up to 
> https://issues.apache.org/jira/browse/SPARK-32755.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32800) Remove ExpressionSet from the 2.13 branch

2020-09-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32800:


Assignee: Apache Spark

> Remove ExpressionSet from the 2.13 branch
> -
>
> Key: SPARK-32800
> URL: https://issues.apache.org/jira/browse/SPARK-32800
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Ali Afroozeh
>Assignee: Apache Spark
>Priority: Major
>
> ExpressionSet does not extend Scala Set anymore and can therefore be 
> removed from the 2.13 branch. This is a follow-up to 
> https://issues.apache.org/jira/browse/SPARK-32755.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32017) Make Pyspark Hadoop 3.2+ Variant available in PyPI

2020-09-04 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190654#comment-17190654
 ] 

Hyukjin Kwon commented on SPARK-32017:
--

I am working on this.

> Make Pyspark Hadoop 3.2+ Variant available in PyPI
> --
>
> Key: SPARK-32017
> URL: https://issues.apache.org/jira/browse/SPARK-32017
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: George Pongracz
>Priority: Major
>
> The version of PySpark 3.0.0 currently available in PyPI uses Hadoop 2.7.4.
> Could a variant (or the default) have its version of Hadoop aligned to 3.2.0, 
> as per the downloadable Spark binaries?
> This would enable the PyPI version to be compatible with session token 
> authorisations and assist in accessing data residing in object stores with 
> stronger encryption methods.
> If not via PyPI, then at least as a tar file in the Apache download archives, 
> please.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32798) Make unionByName optionally fill missing columns with nulls in PySpark

2020-09-04 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190650#comment-17190650
 ] 

Hyukjin Kwon commented on SPARK-32798:
--

cc [~nchammas] [~fokko] or [~zero323] are you interested in this :-)? 

> Make unionByName optionally fill missing columns with nulls in PySpark
> --
>
> Key: SPARK-32798
> URL: https://issues.apache.org/jira/browse/SPARK-32798
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> It would be better to expose the {{unionByName}} parameter in the Python API as 
> well. Currently it is only exposed in the Scala/Java APIs (SPARK-29358).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32799) Make unionByName optionally fill missing columns with nulls in SparkR

2020-09-04 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190646#comment-17190646
 ] 

Hyukjin Kwon commented on SPARK-32799:
--

cc [~michaelchirico] are you interested in this?

> Make unionByName optionally fill missing columns with nulls in SparkR
> -
>
> Key: SPARK-32799
> URL: https://issues.apache.org/jira/browse/SPARK-32799
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> It would be nicer to expose the {{unionByName}} parameter in the R API as well. 
> Currently it is only exposed on the Scala/Java side (SPARK-29358).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32800) Remove ExpressionSet from the 2.13 branch

2020-09-04 Thread Ali Afroozeh (Jira)
Ali Afroozeh created SPARK-32800:


 Summary: Remove ExpressionSet from the 2.13 branch
 Key: SPARK-32800
 URL: https://issues.apache.org/jira/browse/SPARK-32800
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Ali Afroozeh


ExpressionSet does not extend Scala Set anymore and can therefore be removed 
from the 2.13 branch. This is a follow-up to 
https://issues.apache.org/jira/browse/SPARK-32755.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32798) Make unionByName optionally fill missing columns with nulls in PySpark

2020-09-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32798:
-
Labels:   (was: starter)

> Make unionByName optionally fill missing columns with nulls in PySpark
> --
>
> Key: SPARK-32798
> URL: https://issues.apache.org/jira/browse/SPARK-32798
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> It would be better to expose the {{unionByName}} parameter in the Python API as 
> well. Currently it is only exposed in the Scala/Java APIs (SPARK-29358).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32799) Make unionByName optionally fill missing columns with nulls in SparkR

2020-09-04 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-32799:


 Summary: Make unionByName optionally fill missing columns with 
nulls in SparkR
 Key: SPARK-32799
 URL: https://issues.apache.org/jira/browse/SPARK-32799
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 3.1.0
Reporter: Hyukjin Kwon


It would be nicer to expose the {{unionByName}} parameter in the R API as well. 
Currently it is only exposed on the Scala/Java side (SPARK-29358).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32798) Make unionByName optionally fill missing columns with nulls in PySpark

2020-09-04 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-32798:


 Summary: Make unionByName optionally fill missing columns with 
nulls in PySpark
 Key: SPARK-32798
 URL: https://issues.apache.org/jira/browse/SPARK-32798
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.1.0
Reporter: Hyukjin Kwon


It would be better to expose the {{unionByName}} parameter in the Python API as well. 
Currently it is only exposed in the Scala/Java APIs (SPARK-29358).
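
For reference, a short Scala sketch of the behavior the Python API would mirror, 
assuming the Scala-side parameter name allowMissingColumns from SPARK-29358; the 
example data and app name are made up for illustration:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("unionByNameSketch").getOrCreate()
import spark.implicits._

val df1 = Seq((1, 2)).toDF("a", "b")
val df2 = Seq((3, 4)).toDF("a", "c")

// With allowMissingColumns = true, the column missing on each side is filled with null:
// result columns (a, b, c) -> rows (1, 2, null) and (3, null, 4).
df1.unionByName(df2, allowMissingColumns = true).show()
{code}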



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32783) Development - Testing PySpark

2020-09-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32783.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29634
[https://github.com/apache/spark/pull/29634]

> Development - Testing PySpark
> -
>
> Key: SPARK-32783
> URL: https://issues.apache.org/jira/browse/SPARK-32783
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> Link back https://spark.apache.org/developer-tools.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32764) compare of -0.0 < 0.0 return true

2020-09-04 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190639#comment-17190639
 ] 

Wenchen Fan commented on SPARK-32764:
-

Thanks for your analysis, [~sandeep.katta2007]

I think we need to add back the `OrderingUtil` to implement custom compare 
methods to take care of 0.0 vs -0.0. I've opened a PR for it: 
[https://github.com/apache/spark/pull/29647]
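
As a rough sketch of the kind of comparison such a utility could provide (illustrative 
only, not the contents of the linked PR): treat -0.0 and 0.0 as equal via the primitive 
equality check, and fall back to java.lang.Double.compare only for NaN handling:

{code:scala}
// Illustrative sketch of a -0.0-aware comparison helper.
object OrderingUtilSketch {
  // Primitive == considers -0.0 and 0.0 equal, so they compare as 0 here,
  // while NaN cases fall through to java.lang.Double.compare.
  def compareDoubles(x: Double, y: Double): Int =
    if (x == y) 0 else java.lang.Double.compare(x, y)

  def compareFloats(x: Float, y: Float): Int =
    if (x == y) 0 else java.lang.Float.compare(x, y)
}

// OrderingUtilSketch.compareDoubles(-0.0, 0.0) == 0
// java.lang.Double.compare(-0.0, 0.0) == -1 (the ordering that makes -0.0 < 0.0 true)
{code}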

> compare of -0.0 < 0.0 return true
> -
>
> Key: SPARK-32764
> URL: https://issues.apache.org/jira/browse/SPARK-32764
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Izek Greenfield
>Priority: Major
>  Labels: correctness
> Attachments: 2.4_codegen.txt, 3.0_Codegen.txt
>
>
> {code:scala}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions.col
>
> val spark: SparkSession = SparkSession
>   .builder()
>   .master("local")
>   .appName("SparkByExamples.com")
>   .getOrCreate()
> spark.sparkContext.setLogLevel("ERROR")
> import spark.sqlContext.implicits._
> val df = Seq((-0.0, 0.0)).toDF("neg", "pos")
>   .withColumn("comp", col("neg") < col("pos"))
>   df.show(false)
> ==
> +----+---+----+
> |neg |pos|comp|
> +----+---+----+
> |-0.0|0.0|true|
> +----+---+----+{code}
> I think that result should be false.
> **Apache Spark 2.4.6 RESULT**
> {code}
> scala> spark.version
> res0: String = 2.4.6
> scala> Seq((-0.0, 0.0)).toDF("neg", "pos").withColumn("comp", col("neg") < col("pos")).show
> +----+---+-----+
> | neg|pos| comp|
> +----+---+-----+
> |-0.0|0.0|false|
> +----+---+-----+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32764) compare of -0.0 < 0.0 return true

2020-09-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32764:


Assignee: (was: Apache Spark)

> compare of -0.0 < 0.0 return true
> -
>
> Key: SPARK-32764
> URL: https://issues.apache.org/jira/browse/SPARK-32764
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Izek Greenfield
>Priority: Major
>  Labels: correctness
> Attachments: 2.4_codegen.txt, 3.0_Codegen.txt
>
>
> {code:scala}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions.col
>
> val spark: SparkSession = SparkSession
>   .builder()
>   .master("local")
>   .appName("SparkByExamples.com")
>   .getOrCreate()
> spark.sparkContext.setLogLevel("ERROR")
> import spark.sqlContext.implicits._
> val df = Seq((-0.0, 0.0)).toDF("neg", "pos")
>   .withColumn("comp", col("neg") < col("pos"))
>   df.show(false)
> ==
> +----+---+----+
> |neg |pos|comp|
> +----+---+----+
> |-0.0|0.0|true|
> +----+---+----+{code}
> I think that result should be false.
> **Apache Spark 2.4.6 RESULT**
> {code}
> scala> spark.version
> res0: String = 2.4.6
> scala> Seq((-0.0, 0.0)).toDF("neg", "pos").withColumn("comp", col("neg") < col("pos")).show
> +----+---+-----+
> | neg|pos| comp|
> +----+---+-----+
> |-0.0|0.0|false|
> +----+---+-----+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32764) compare of -0.0 < 0.0 return true

2020-09-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32764:


Assignee: Apache Spark

> compare of -0.0 < 0.0 return true
> -
>
> Key: SPARK-32764
> URL: https://issues.apache.org/jira/browse/SPARK-32764
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Izek Greenfield
>Assignee: Apache Spark
>Priority: Major
>  Labels: correctness
> Attachments: 2.4_codegen.txt, 3.0_Codegen.txt
>
>
> {code:scala}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions.col
>
> val spark: SparkSession = SparkSession
>   .builder()
>   .master("local")
>   .appName("SparkByExamples.com")
>   .getOrCreate()
> spark.sparkContext.setLogLevel("ERROR")
> import spark.sqlContext.implicits._
> val df = Seq((-0.0, 0.0)).toDF("neg", "pos")
>   .withColumn("comp", col("neg") < col("pos"))
>   df.show(false)
> ==
> +----+---+----+
> |neg |pos|comp|
> +----+---+----+
> |-0.0|0.0|true|
> +----+---+----+{code}
> I think that result should be false.
> **Apache Spark 2.4.6 RESULT**
> {code}
> scala> spark.version
> res0: String = 2.4.6
> scala> Seq((-0.0, 0.0)).toDF("neg", "pos").withColumn("comp", col("neg") < col("pos")).show
> +----+---+-----+
> | neg|pos| comp|
> +----+---+-----+
> |-0.0|0.0|false|
> +----+---+-----+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32764) compare of -0.0 < 0.0 return true

2020-09-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190637#comment-17190637
 ] 

Apache Spark commented on SPARK-32764:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/29647

> compare of -0.0 < 0.0 return true
> -
>
> Key: SPARK-32764
> URL: https://issues.apache.org/jira/browse/SPARK-32764
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Izek Greenfield
>Priority: Major
>  Labels: correctness
> Attachments: 2.4_codegen.txt, 3.0_Codegen.txt
>
>
> {code:scala}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions.col
>
> val spark: SparkSession = SparkSession
>   .builder()
>   .master("local")
>   .appName("SparkByExamples.com")
>   .getOrCreate()
> spark.sparkContext.setLogLevel("ERROR")
> import spark.sqlContext.implicits._
> val df = Seq((-0.0, 0.0)).toDF("neg", "pos")
>   .withColumn("comp", col("neg") < col("pos"))
>   df.show(false)
> ==
> +----+---+----+
> |neg |pos|comp|
> +----+---+----+
> |-0.0|0.0|true|
> +----+---+----+{code}
> I think that result should be false.
> **Apache Spark 2.4.6 RESULT**
> {code}
> scala> spark.version
> res0: String = 2.4.6
> scala> Seq((-0.0, 0.0)).toDF("neg", "pos").withColumn("comp", col("neg") < col("pos")).show
> +----+---+-----+
> | neg|pos| comp|
> +----+---+-----+
> |-0.0|0.0|false|
> +----+---+-----+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32779) Spark/Hive3 interaction potentially causes deadlock

2020-09-04 Thread Sandeep Katta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190626#comment-17190626
 ] 

Sandeep Katta commented on SPARK-32779:
---

[~bersprockets] thanks for raising this issue. I am working on it and will 
raise a patch soon.

> Spark/Hive3 interaction potentially causes deadlock
> ---
>
> Key: SPARK-32779
> URL: https://issues.apache.org/jira/browse/SPARK-32779
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> This is an issue for applications that share a Spark Session across multiple 
> threads.
> sessionCatalog.loadPartition (after checking that the table exists) grabs 
> locks in this order:
>  - HiveExternalCatalog
>  - HiveSessionCatalog (in Shim_v3_0)
> Other operations (e.g., sessionCatalog.tableExists) grab locks in this order:
>  - HiveSessionCatalog
>  - HiveExternalCatalog
> [This|https://github.com/apache/spark/blob/ad6b887541bf90cc3ea830a1a3322b71ccdd80ee/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala#L1332]
>  appears to be the culprit. Maybe the db name should be defaulted _before_ the 
> call to HiveClient so that Shim_v3_0 doesn't have to call back into 
> SessionCatalog. Or possibly this is not needed at all, since loadPartition in 
> Shim_v2_1 doesn't worry about the default db name, but that might be because 
> of differences between Hive client libraries.
> Reproduction case:
>  - You need to have a running Hive 3.x HMS instance and the appropriate 
> hive-site.xml for your Spark instance
>  - Adjust your spark.sql.hive.metastore.version accordingly
>  - It might take more than one try to hit the deadlock
> Launch Spark:
> {noformat}
> bin/spark-shell --conf "spark.sql.hive.metastore.jars=${HIVE_HOME}/lib/*" --conf spark.sql.hive.metastore.version=3.1
> {noformat}
> Then use the following code:
> {noformat}
> import scala.collection.mutable.ArrayBuffer
> import scala.util.Random
> val tableCount = 4
> for (i <- 0 until tableCount) {
>   val tableName = s"partitioned${i+1}"
>   sql(s"drop table if exists $tableName")
>   sql(s"create table $tableName (a bigint) partitioned by (b bigint) stored as orc")
> }
> val threads = new ArrayBuffer[Thread]
> for (i <- 0 until tableCount) {
>   threads.append(new Thread( new Runnable {
> override def run: Unit = {
>   val tableName = s"partitioned${i + 1}"
>   val rand = Random
>   val df = spark.range(0, 2).toDF("a")
>   val location = s"/tmp/${rand.nextLong.abs}"
>   df.write.mode("overwrite").orc(location)
>   sql(
> s"""
> LOAD DATA LOCAL INPATH '$location' INTO TABLE $tableName partition (b=$i)""")
> }
>   }, s"worker$i"))
>   threads(i).start()
> }
> for (i <- 0 until tableCount) {
>   println(s"Joining with thread $i")
>   threads(i).join()
> }
> println("All done")
> {noformat}
> The job often gets stuck after one or two "Joining..." lines.
> {{kill -3}} shows something like this:
> {noformat}
> Found one Java-level deadlock:
> =
> "worker3":
>   waiting to lock monitor 0x7fdc3cde6798 (object 0x000784d98ac8, a 
> org.apache.spark.sql.hive.HiveSessionCatalog),
>   which is held by "worker0"
> "worker0":
>   waiting to lock monitor 0x7fdc441d1b88 (object 0x0007861d1208, a 
> org.apache.spark.sql.hive.HiveExternalCatalog),
>   which is held by "worker3"
> {noformat}
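
A hedged sketch of the fix direction suggested in the description above, i.e. 
defaulting the database name on the caller side so the shim never has to call back 
into SessionCatalog while the HiveExternalCatalog lock is held; the helper name and 
signature here are invented for illustration and are not the actual patch:

{code:scala}
import org.apache.spark.sql.catalyst.catalog.SessionCatalog

// Resolve the database before any HiveClient/shim call is made.
def qualifyTableSketch(sessionCatalog: SessionCatalog, db: Option[String], table: String): String = {
  // Default the db name up front, outside any external-catalog synchronization.
  val resolvedDb = db.getOrElse(sessionCatalog.getCurrentDatabase)
  s"$resolvedDb.$table" // pass the fully qualified name down to HiveClient.loadPartition
}
{code}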



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32797) Install mypy on the Jenkins CI workers

2020-09-04 Thread Fokko Driesprong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong updated SPARK-32797:
-
Description: 
We want to check the types of the PySpark code. This requires mypy to be 
installed on the CI. Can you do this [~shaneknapp]? 

Related PR: [https://github.com/apache/spark/pull/29180]

You can install it using pip: [https://pypi.org/project/mypy/]. It should be 
similar to flake8 and sphinx. The latest version is fine. Thanks!

> Install mypy on the Jenkins CI workers
> --
>
> Key: SPARK-32797
> URL: https://issues.apache.org/jira/browse/SPARK-32797
> Project: Spark
>  Issue Type: Improvement
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Fokko Driesprong
>Priority: Major
>
> We want to check the types of the PySpark code. This requires mypy to be 
> installed on the CI. Can you do this [~shaneknapp]? 
> Related PR: [https://github.com/apache/spark/pull/29180]
> You can install it using pip: [https://pypi.org/project/mypy/]. It should be 
> similar to flake8 and sphinx. The latest version is fine. Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32797) Install mypy on the Jenkins CI workers

2020-09-04 Thread Fokko Driesprong (Jira)
Fokko Driesprong created SPARK-32797:


 Summary: Install mypy on the Jenkins CI workers
 Key: SPARK-32797
 URL: https://issues.apache.org/jira/browse/SPARK-32797
 Project: Spark
  Issue Type: Improvement
  Components: jenkins, PySpark
Affects Versions: 3.0.0
Reporter: Fokko Driesprong






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32795) ApplicationInfo#removedExecutors can cause OOM

2020-09-04 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-32795:
-
Priority: Major  (was: Critical)

> ApplicationInfo#removedExecutors can cause OOM
> --
>
> Key: SPARK-32795
> URL: https://issues.apache.org/jira/browse/SPARK-32795
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Victor Tso
>Priority: Major
> Attachments: image-2020-09-03-23-27-11-809.png
>
>
> !image-2020-09-03-23-27-11-809.png|width=840,height=439!
> In my case, the Standalone Spark master process had a max heap of 1 GB; 738 MB 
> were consumed by ExecutorDesc objects, the vast majority of which were the 
> 18.5M entries in removedExecutors. This caused the master to OOM and leave the 
> application driver process dangling.
> The reason is that the worker node ran out of disk space, so, for whatever 
> reason, new executors were launched in a fast and endless loop, and they in 
> turn crashed too. It got up to roughly 18M removed executors before the master 
> simply couldn't handle the history anymore.
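
As a purely generic illustration of the usual mitigation for this kind of unbounded 
history (not Spark code, configuration, or any particular patch), the retained 
entries can be capped and the oldest evicted:

{code:scala}
import scala.collection.mutable.ArrayBuffer

// Generic bounded history: once the cap is exceeded, the oldest entries are dropped
// instead of accumulating forever.
class BoundedHistory[T](maxRetained: Int) {
  private val entries = new ArrayBuffer[T]

  def add(entry: T): Unit = {
    entries += entry
    if (entries.size > maxRetained) {
      entries.remove(0, entries.size - maxRetained) // evict the oldest entries
    }
  }

  def size: Int = entries.size
}
{code}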



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org