[jira] [Commented] (SPARK-32803) Catch InterruptedException when resolve rack in SparkRackResolver
[ https://issues.apache.org/jira/browse/SPARK-32803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190963#comment-17190963 ]

Apache Spark commented on SPARK-32803:
--------------------------------------

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/29652

> Catch InterruptedException when resolve rack in SparkRackResolver
> -----------------------------------------------------------------
>
>                 Key: SPARK-32803
>                 URL: https://issues.apache.org/jira/browse/SPARK-32803
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.1.0
>            Reporter: ulysses you
>            Priority: Major
>
> In YARN mode, inserting into a Hive table can leave dirty data behind when the Spark application is killed. The error message is:
> ```
> java.io.IOException: java.lang.InterruptedException
>   at org.apache.hadoop.util.Shell.runCommand(Shell.java:607)
>   at org.apache.hadoop.util.Shell.run(Shell.java:507)
>   at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:789)
>   at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.runResolveCommand(ScriptBasedMapping.java:251)
>   at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.resolve(ScriptBasedMapping.java:188)
>   at org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:119)
>   at org.apache.hadoop.yarn.util.RackResolver.coreResolve(RackResolver.java:101)
>   at org.apache.hadoop.yarn.util.RackResolver.resolve(RackResolver.java:81)
>   at org.apache.spark.scheduler.cluster.YarnScheduler.getRackForHost(YarnScheduler.scala:37)
>   at org.apache.spark.scheduler.TaskSetManager$$anonfun$addPendingTask$1.apply(TaskSetManager.scala:235)
>   at org.apache.spark.scheduler.TaskSetManager$$anonfun$addPendingTask$1.apply(TaskSetManager.scala:216)
>   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at org.apache.spark.scheduler.TaskSetManager.addPendingTask(TaskSetManager.scala:216)
>   at org.apache.spark.scheduler.TaskSetManager$$anonfun$1.apply$mcVI$sp(TaskSetManager.scala:188)
>   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
>   at org.apache.spark.scheduler.TaskSetManager.<init>(TaskSetManager.scala:187)
>   at org.apache.spark.scheduler.TaskSchedulerImpl.createTaskSetManager(TaskSchedulerImpl.scala:250)
>   at org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:208)
>   at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1215)
>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:1071)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:1074)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:1073)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:1073)
>   at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1014)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2069)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
> ```
> The reason is:
> 1. `CachedDNSToSwitchMapping` may execute a shell command to resolve a hostname; when the Spark application is killed, that command fails with an `InterruptedException`.
> 2. `DAGSchedulerEventProcessLoop` ignores `InterruptedException`, so the `FileFormatWriter` action can't see the failure.
> 3. `FileFormatWriter` therefore does not abort the commit and leaves some dirty data behind.
>
> So we should catch the `InterruptedException` and throw a `SparkException` instead.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
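The fix direction described in the issue can be sketched as follows. This is an illustrative stand-in, not the actual Spark patch: `resolveRack` here is a hypothetical method standing in for the shell-backed rack lookup. The point is catching the checked `InterruptedException`, restoring the interrupt flag, and rethrowing an unchecked exception so a caller that swallows `InterruptedException` (like the event loop above) still observes a failure and can abort the commit.

```java
// Hedged sketch only: resolveRack is a hypothetical stand-in for the
// shell-backed rack resolution, not Spark's SparkRackResolver API.
public class RackResolverSketch {

    static String resolveRack(String host) {
        try {
            Thread.sleep(1); // the real code blocks on a topology script here
            return "/default-rack";
        } catch (InterruptedException e) {
            // Restore the interrupt flag and rethrow unchecked, so callers that
            // drop InterruptedException still fail fast instead of silently
            // continuing and leaving dirty data behind.
            Thread.currentThread().interrupt();
            throw new RuntimeException("rack resolution interrupted for " + host, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolveRack("host-1"));
    }
}
```

The actual patch wraps the error in a `SparkException`; a plain `RuntimeException` is used here only to keep the sketch self-contained.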
[jira] [Assigned] (SPARK-32803) Catch InterruptedException when resolve rack in SparkRackResolver
[ https://issues.apache.org/jira/browse/SPARK-32803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-32803:
------------------------------------

    Assignee: Apache Spark
[jira] [Assigned] (SPARK-32803) Catch InterruptedException when resolve rack in SparkRackResolver
[ https://issues.apache.org/jira/browse/SPARK-32803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-32803:
------------------------------------

    Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-32803) Catch InterruptedException when resolve rack in SparkRackResolver
[ https://issues.apache.org/jira/browse/SPARK-32803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190962#comment-17190962 ]

Apache Spark commented on SPARK-32803:
--------------------------------------

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/29652
[jira] [Created] (SPARK-32803) Catch InterruptedException when resolve rack in SparkRackResolver
ulysses you created SPARK-32803:
--------------------------------

             Summary: Catch InterruptedException when resolve rack in SparkRackResolver
                 Key: SPARK-32803
                 URL: https://issues.apache.org/jira/browse/SPARK-32803
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.1.0
            Reporter: ulysses you
[jira] [Created] (SPARK-32802) Avoid using SpecificInternalRow in RunLengthEncoding#Encoder
Chao Sun created SPARK-32802:
-----------------------------

             Summary: Avoid using SpecificInternalRow in RunLengthEncoding#Encoder
                 Key: SPARK-32802
                 URL: https://issues.apache.org/jira/browse/SPARK-32802
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Chao Sun

{{RunLengthEncoding#Encoder}} currently uses {{SpecificInternalRow}}, which is more expensive than using the native types directly.
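The cost concern behind this ticket is boxing each value into a row container on the encoding hot path. A hedged illustration of the underlying idea, plain run-length encoding that tracks the current run with primitives instead (this is not Spark's actual `Encoder` code):

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: run-length encoding that keeps the current run value in
// a primitive int, avoiding the per-element boxing that a row container such
// as SpecificInternalRow would introduce.
public class RunLengthSketch {

    // Returns (value, runLength) pairs for the input.
    static List<int[]> encode(int[] values) {
        List<int[]> runs = new ArrayList<>();
        if (values.length == 0) return runs;
        int current = values[0]; // primitive comparison, no allocation per value
        int run = 1;
        for (int i = 1; i < values.length; i++) {
            if (values[i] == current) {
                run++;
            } else {
                runs.add(new int[]{current, run});
                current = values[i];
                run = 1;
            }
        }
        runs.add(new int[]{current, run});
        return runs;
    }
}
```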
[jira] [Commented] (SPARK-32778) Accidental Data Deletion on calling saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-32778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190920#comment-17190920 ]

Aman Rastogi commented on SPARK-32778:
--------------------------------------

[~maropu], I am on v2.3; I have not tried a higher version.

> Accidental Data Deletion on calling saveAsTable
> -----------------------------------------------
>
>                 Key: SPARK-32778
>                 URL: https://issues.apache.org/jira/browse/SPARK-32778
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Aman Rastogi
>            Priority: Major
>
> {code:java}
> df.write.option("path", "/already/existing/path").mode(SaveMode.Append).format("json").saveAsTable(db.table)
> {code}
> The code above deleted the data present at the path "/already/existing/path". This happened because the table was not yet in the Hive metastore, while the given path already contained data. When a table is not present in the Hive metastore, the SaveMode is internally rewritten to SaveMode.Overwrite regardless of what the user provided, which leads to data deletion. This change was introduced as part of https://issues.apache.org/jira/browse/SPARK-19583.
> Now, suppose the user is not using an external Hive metastore (the metastore is associated with a cluster), and the cluster goes down or the user has to migrate to a new cluster for some other reason. When the user runs the code above on the new cluster, Spark will first delete the existing data. It could be production data, and the user is completely unaware of the deletion because they provided SaveMode.Append or ErrorIfExists. This is accidental data deletion.
>
> Repro steps:
> # Save data through a Hive table as in the code above
> # Create another cluster and save data to a new table on the new cluster, giving the same path
>
> Proposed fix:
> Instead of modifying the SaveMode to Overwrite, we should modify it to ErrorIfExists in the class CreateDataSourceTableAsSelectCommand.
> Change (line 154)
> {code:java}
> val result = saveDataIntoTable(
>   sparkSession, table, tableLocation, child, SaveMode.Overwrite, tableExists = false)
> {code}
> to
> {code:java}
> val result = saveDataIntoTable(
>   sparkSession, table, tableLocation, child, SaveMode.ErrorIfExists, tableExists = false)
> {code}
> This should not break CTAS. Even in the case of CTAS, the user may not want to delete data that already exists, as the deletion could be accidental.
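The mode-rewriting decision the report describes can be modeled in a few lines. This is a toy model with invented names, not the actual `CreateDataSourceTableAsSelectCommand` logic:

```java
// Toy model of the reported behavior: when the table is missing from the
// metastore, the user's SaveMode is replaced. With Overwrite, whatever data
// already sits at the path is deleted; with the proposed ErrorIfExists, the
// write fails instead and the existing data survives.
public class SaveModeSketch {

    enum SaveMode { APPEND, OVERWRITE, ERROR_IF_EXISTS }

    // Current behavior per the report: table absent => Overwrite, regardless of input.
    static SaveMode currentBehavior(boolean tableInMetastore, SaveMode userMode) {
        return tableInMetastore ? userMode : SaveMode.OVERWRITE;
    }

    // Proposed behavior: table absent => ErrorIfExists, preserving data at the path.
    static SaveMode proposedBehavior(boolean tableInMetastore, SaveMode userMode) {
        return tableInMetastore ? userMode : SaveMode.ERROR_IF_EXISTS;
    }
}
```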
[jira] [Commented] (SPARK-32017) Make Pyspark Hadoop 3.2+ Variant available in PyPI
[ https://issues.apache.org/jira/browse/SPARK-32017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190908#comment-17190908 ]

George Pongracz commented on SPARK-32017:
-----------------------------------------

Thank you, this is valuable to me and your effort is appreciated.

> Make Pyspark Hadoop 3.2+ Variant available in PyPI
> --------------------------------------------------
>
>                 Key: SPARK-32017
>                 URL: https://issues.apache.org/jira/browse/SPARK-32017
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.0.0
>            Reporter: George Pongracz
>            Priority: Major
>
> The version of Pyspark 3.0.0 currently available in PyPI uses Hadoop 2.7.4.
> Could a variant (or the default) have its version of Hadoop aligned to 3.2.0, as per the downloadable Spark binaries?
> This would enable the PyPI version to be compatible with session token authorisations and assist in accessing data residing in object stores with stronger encryption methods.
> If not PyPI, then as a tar file in the Apache download archives at the least, please.
[jira] [Commented] (SPARK-32530) SPIP: Kotlin support for Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-32530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190879#comment-17190879 ]

Pasha Finkeshteyn commented on SPARK-32530:
-------------------------------------------

[~DennisJaheruddin] thank you for your idea. We've published the announcement of version [1.0.0-preview1|https://blog.jetbrains.com/kotlin/2020/08/introducing-kotlin-for-apache-spark-preview/]. We're going to implement 2.4 compatibility in the next release (I believe it will be the preview2 version).

[~kabhwan] We're interested in and committed to supporting at least our part of Apache Spark, and at most other parts of it as well. In general, we want to give end users of the API the best possible experience, which would be much easier if it were part of the whole Spark project (for example, it would be much easier to maintain compatibility with every single version of Spark, because it could be tested automatically during the build). So if you consider JetBrains such a vendor, then yes, a vendor is interested in this.

> SPIP: Kotlin support for Apache Spark
> -------------------------------------
>
>                 Key: SPARK-32530
>                 URL: https://issues.apache.org/jira/browse/SPARK-32530
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.1
>            Reporter: Pasha Finkeshteyn
>            Priority: Major
>
> h2. Background and motivation
> Kotlin is a cross-platform, statically typed, general-purpose JVM language. In the last year more than 5 million developers have used Kotlin in mobile, backend, frontend and scientific development. The number of Kotlin developers grows rapidly every year.
> * [According to redmonk|https://redmonk.com/sogrady/2020/02/28/language-rankings-1-20/]: "Kotlin, the second fastest growing language we've seen outside of Swift, made a big splash a year ago at this time when it vaulted eight full spots up the list."
> * [According to snyk.io|https://snyk.io/wp-content/uploads/jvm_2020.pdf], Kotlin is the second most popular language on the JVM.
> * [According to StackOverflow|https://insights.stackoverflow.com/survey/2020], Kotlin's share increased by 7.8% in 2020.
> We notice the increasing usage of Kotlin in data analysis ([6% of users in 2020|https://www.jetbrains.com/lp/devecosystem-2020/kotlin/], as opposed to 2% in 2019) and machine learning (3% of users in 2020, as opposed to 0% in 2019), and we expect these numbers to continue to grow.
> We, the authors of this SPIP, strongly believe that making a Kotlin API officially available to developers can bring new users to Apache Spark and help some of the existing users.
> h2. Goals
> The goal of this project is to bring first-class support for the Kotlin language into the Apache Spark project. We're going to achieve this by adding one more module to the current Apache Spark distribution.
> h2. Non-goals
> There is no goal to replace any existing language support or to change any existing Apache Spark API.
> At this time, there is no goal to support non-core APIs of Apache Spark like Spark ML and Spark structured streaming. This may change in the future based on community feedback.
> There is no goal to provide a CLI for Kotlin for Apache Spark; this will be a separate SPIP.
> There is no goal to provide support for Apache Spark < 3.0.0.
> h2. Current implementation
> A working prototype is available at [https://github.com/JetBrains/kotlin-spark-api]. It has been tested inside JetBrains and by early adopters.
> h2. What are the risks?
> There is always a risk that this product won't get enough popularity and will bring more costs than benefits. This is mitigated by the fact that we don't need to change any existing API, and support can potentially be dropped at any time.
> We also believe that the existing API is rather low maintenance. It does not bring anything more complex than already exists in the Spark codebase. Furthermore, the implementation is compact: less than 2000 lines of code.
> We are committed to maintaining, improving and evolving the API based on feedback from both the Spark and Kotlin communities. As the Kotlin data community continues to grow, we see the Kotlin API for Apache Spark as an important part of the evolving Kotlin ecosystem, and intend to fully support it.
> h2. How long will it take?
> A working implementation is already available, and if the community has any proposals for improving it, they can be implemented quickly, in weeks if not days.
[jira] [Assigned] (SPARK-32794) Rare corner case error in micro-batch engine with some stateful queries + no-data-batches + V1 streaming sources
[ https://issues.apache.org/jira/browse/SPARK-32794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32794: Assignee: Apache Spark (was: Tathagata Das) > Rare corner case error in micro-batch engine with some stateful queries + > no-data-batches + V1 streaming sources > - > > Key: SPARK-32794 > URL: https://issues.apache.org/jira/browse/SPARK-32794 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.4, 2.4.6, 3.0.0, 3.0.1 >Reporter: Tathagata Das >Assignee: Apache Spark >Priority: Major > > Structured Streaming micro-batch engine has the contract with V1 data sources > that, after a restart, it will call `source.getBatch()` on the last batch > attempted before the restart. However, a very rare combination of sequences > violates this contract. It occurs only when > - The streaming query has specific types of stateful operations with > watermarks (e.g., aggregation in append, mapGroupsWithState with timeouts). > - These queries can execute a batch even without new data when the > previous updates the watermark and the stateful ops are such that the new > watermark can cause new output/cleanup. Such batches are called > no-data-batches. > - The last batch before termination was an incomplete no-data-batch. Upon > restart, the micro-batch engine fails to call `source.getBatch` when > attempting to re-execute the incomplete no-data-batch. > This occurs because no-data-batches has the same and end offsets, and when a > batch is executed, if the start and end offset is same then calling > `source.getBatch` is skipped as it is assumed the generated plan will be > empty. This only affects V1 data sources which rely on this invariant to > initialize differently when the query is being started from scratch or > restarted. How will a source misbehave is very source-specific. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
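The skip condition described in the issue above can be sketched as follows. This is a hypothetical, heavily simplified model (offsets reduced to Long; the real decision lives in Spark's micro-batch execution code and compares streaming source Offsets), intended only to illustrate why a no-data-batch, whose start and end offsets coincide, never triggers `getBatch` on restart:

```scala
// Simplified, hypothetical model of the micro-batch engine's decision.
// Real Spark compares source Offset objects, not Longs.
object BatchPlanning {
  // A no-data-batch has identical start and end offsets for a source.
  // The engine skips source.getBatch in that case, assuming an empty plan --
  // which breaks V1 sources that rely on getBatch being called after restart.
  def shouldCallGetBatch(start: Long, end: Long): Boolean = start != end
}
```

Under this model, re-executing an incomplete no-data-batch (start == end) never reaches the source, which is exactly the contract violation the ticket describes.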
[jira] [Assigned] (SPARK-32794) Rare corner case error in micro-batch engine with some stateful queries + no-data-batches + V1 streaming sources
[ https://issues.apache.org/jira/browse/SPARK-32794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32794: Assignee: Tathagata Das (was: Apache Spark) > Rare corner case error in micro-batch engine with some stateful queries + > no-data-batches + V1 streaming sources > - > > Key: SPARK-32794 > URL: https://issues.apache.org/jira/browse/SPARK-32794 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.4, 2.4.6, 3.0.0, 3.0.1 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Major > > The Structured Streaming micro-batch engine has a contract with V1 data sources > that, after a restart, it will call `source.getBatch()` on the last batch > attempted before the restart. However, a very rare sequence of events > violates this contract. It occurs only when > - The streaming query has specific types of stateful operations with > watermarks (e.g., aggregation in append mode, mapGroupsWithState with timeouts). > - These queries can execute a batch even without new data when the > previous batch updates the watermark and the stateful ops are such that the new > watermark can cause new output/cleanup. Such batches are called > no-data-batches. > - The last batch before termination was an incomplete no-data-batch. Upon > restart, the micro-batch engine fails to call `source.getBatch` when > attempting to re-execute the incomplete no-data-batch. > This occurs because no-data-batches have the same start and end offsets, and when a > batch is executed, if the start and end offsets are the same, calling > `source.getBatch` is skipped, as the generated plan is assumed to be > empty. This only affects V1 data sources, which rely on this invariant to > initialize differently depending on whether the query is being started from scratch or > restarted. How a source misbehaves is very source-specific. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32794) Rare corner case error in micro-batch engine with some stateful queries + no-data-batches + V1 streaming sources
[ https://issues.apache.org/jira/browse/SPARK-32794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190866#comment-17190866 ] Apache Spark commented on SPARK-32794: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/29651 > Rare corner case error in micro-batch engine with some stateful queries + > no-data-batches + V1 streaming sources > - > > Key: SPARK-32794 > URL: https://issues.apache.org/jira/browse/SPARK-32794 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.4, 2.4.6, 3.0.0, 3.0.1 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Major > > The Structured Streaming micro-batch engine has a contract with V1 data sources > that, after a restart, it will call `source.getBatch()` on the last batch > attempted before the restart. However, a very rare sequence of events > violates this contract. It occurs only when > - The streaming query has specific types of stateful operations with > watermarks (e.g., aggregation in append mode, mapGroupsWithState with timeouts). > - These queries can execute a batch even without new data when the > previous batch updates the watermark and the stateful ops are such that the new > watermark can cause new output/cleanup. Such batches are called > no-data-batches. > - The last batch before termination was an incomplete no-data-batch. Upon > restart, the micro-batch engine fails to call `source.getBatch` when > attempting to re-execute the incomplete no-data-batch. > This occurs because no-data-batches have the same start and end offsets, and when a > batch is executed, if the start and end offsets are the same, calling > `source.getBatch` is skipped, as the generated plan is assumed to be > empty. This only affects V1 data sources, which rely on this invariant to > initialize differently depending on whether the query is being started from scratch or > restarted. How a source misbehaves is very source-specific. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32799) Make unionByName optionally fill missing columns with nulls in SparkR
[ https://issues.apache.org/jira/browse/SPARK-32799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-32799: Labels: beginner (was: ) > Make unionByName optionally fill missing columns with nulls in SparkR > - > > Key: SPARK-32799 > URL: https://issues.apache.org/jira/browse/SPARK-32799 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > Labels: beginner > > It would be nicer to expose the {{unionByName}} parameter in the R API as well. > Currently this is only exposed on the Scala/Java side (at SPARK-29358) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32798) Make unionByName optionally fill missing columns with nulls in PySpark
[ https://issues.apache.org/jira/browse/SPARK-32798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-32798: Labels: beginner (was: ) > Make unionByName optionally fill missing columns with nulls in PySpark > -- > > Key: SPARK-32798 > URL: https://issues.apache.org/jira/browse/SPARK-32798 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > Labels: beginner > > It would be better to expose {{unionByName}} parameter in Python APIs as > well. Currently this is only exposed in Scala/Java APIs (at SPARK-29358) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
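The behavior requested above (already available in the Scala/Java API via SPARK-29358) amounts to aligning columns by name and null-filling whatever is missing on either side. A minimal stand-alone sketch of the semantics over Map-based rows; the `UnionByName` helper and its signature are hypothetical illustrations, not the Spark API:

```scala
// Hypothetical sketch of unionByName-with-missing-columns semantics.
// Rows are modeled as column-name -> value maps; null fills the gaps,
// mirroring what allowMissingColumns = true does in Spark's Scala API.
object UnionByName {
  type Row = Map[String, Any]

  def unionByName(left: Seq[Row], right: Seq[Row]): Seq[Row] = {
    // Collect the full column set from both sides, preserving first-seen order.
    val cols = (left.flatMap(_.keys) ++ right.flatMap(_.keys)).distinct
    // Every output row carries every column; absent values become null.
    (left ++ right).map(row => cols.map(c => c -> row.getOrElse(c, null)).toMap)
  }
}
```

In PySpark the analogous call would simply be an extra flag on the existing `unionByName`, which is what the ticket asks to expose.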
[jira] [Updated] (SPARK-32793) Expose assert_true in Python/Scala APIs and add error message parameter
[ https://issues.apache.org/jira/browse/SPARK-32793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karen Feng updated SPARK-32793: --- Description: # Add RAISEERROR() (or RAISE_ERROR()) to the API # Add Scala/Python/R version of API for ASSERT_TRUE() # Add an extra parameter to ASSERT_TRUE() as (cond, message), and for which the `message` parameter is only lazily evaluated when the condition is not true # Change the implementation of ASSERT_TRUE() to be rewritten during optimization to IF() instead. was: # Add RAISEERROR() (or RAISE_ERROR()) to the API # Add Scala/Python/R version of API for ASSERT_TRUE() # Add an extra parameter to ASSERT_TRUE() as (cond, message), and for which the `message` parameter is only lazily evaluated when the condition is not true # Change the implementation of ASSERT_TRUE() to be rewritten during optimization to IF() instead. > Expose assert_true in Python/Scala APIs and add error message parameter > --- > > Key: SPARK-32793 > URL: https://issues.apache.org/jira/browse/SPARK-32793 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Karen Feng >Priority: Minor > > # Add RAISEERROR() (or RAISE_ERROR()) to the API > # Add Scala/Python/R version of API for ASSERT_TRUE() > # Add an extra parameter to ASSERT_TRUE() as (cond, message), and for which > the `message` parameter is only lazily evaluated when the condition is not > true > # Change the implementation of ASSERT_TRUE() to be rewritten during > optimization to IF() instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32793) Expose assert_true in Python/Scala APIs and add error message parameter
[ https://issues.apache.org/jira/browse/SPARK-32793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karen Feng updated SPARK-32793: --- Description: # Add RAISEERROR() (or RAISE_ERROR()) to the API # Add Scala/Python/R version of API for ASSERT_TRUE() # Add an extra parameter to ASSERT_TRUE() as (cond, message), and for which the `message` parameter is only lazily evaluated when the condition is not true # Change the implementation of ASSERT_TRUE() to be rewritten during optimization to IF() instead. was: # assert_true is only available as a Spark SQL expression, and should be exposed as a function in the Scala and Python APIs for easier programmatic access. # The error message thrown when the assertion fails is often not very useful for the user. Add a parameter so that users can pass a custom error message. > Expose assert_true in Python/Scala APIs and add error message parameter > --- > > Key: SPARK-32793 > URL: https://issues.apache.org/jira/browse/SPARK-32793 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Karen Feng >Priority: Minor > > # Add RAISEERROR() (or RAISE_ERROR()) to the API > # Add Scala/Python/R version of API for ASSERT_TRUE() > # Add an extra parameter to ASSERT_TRUE() as (cond, message), and for which the > `message` parameter is only lazily evaluated when the condition is not true > # Change the implementation of ASSERT_TRUE() to be rewritten during > optimization to IF() instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
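The lazy-message requirement in point 3 of the description above (evaluate `message` only when the condition fails) maps naturally onto a Scala by-name parameter. The `Asserts.assertTrue` helper below is a hypothetical illustration of that evaluation contract, not the actual Spark implementation:

```scala
// Hypothetical sketch of the lazy-message contract for assert_true(cond, message).
object Asserts {
  // `message` is by-name (=> String): the argument expression is only
  // evaluated if the condition is false, so building an expensive error
  // message costs nothing on the success path.
  def assertTrue(cond: Boolean, message: => String): Unit =
    if (!cond) throw new IllegalArgumentException(message)
}
```

The same effect is what the ticket asks of the SQL expression: the `message` child should not be evaluated eagerly per row when the assertion holds.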
[jira] [Assigned] (SPARK-32801) Make InferFiltersFromConstraints take in account EqualNullSafe
[ https://issues.apache.org/jira/browse/SPARK-32801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32801: Assignee: (was: Apache Spark) > Make InferFiltersFromConstraints take in account EqualNullSafe > -- > > Key: SPARK-32801 > URL: https://issues.apache.org/jira/browse/SPARK-32801 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Tanel Kiis >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32801) Make InferFiltersFromConstraints take in account EqualNullSafe
[ https://issues.apache.org/jira/browse/SPARK-32801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190673#comment-17190673 ] Apache Spark commented on SPARK-32801: -- User 'tanelk' has created a pull request for this issue: https://github.com/apache/spark/pull/29650 > Make InferFiltersFromConstraints take in account EqualNullSafe > -- > > Key: SPARK-32801 > URL: https://issues.apache.org/jira/browse/SPARK-32801 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Tanel Kiis >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32801) Make InferFiltersFromConstraints take in account EqualNullSafe
[ https://issues.apache.org/jira/browse/SPARK-32801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32801: Assignee: Apache Spark > Make InferFiltersFromConstraints take in account EqualNullSafe > -- > > Key: SPARK-32801 > URL: https://issues.apache.org/jira/browse/SPARK-32801 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Tanel Kiis >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32801) Make InferFiltersFromConstraints take in account EqualNullSafe
[ https://issues.apache.org/jira/browse/SPARK-32801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190674#comment-17190674 ] Apache Spark commented on SPARK-32801: -- User 'tanelk' has created a pull request for this issue: https://github.com/apache/spark/pull/29650 > Make InferFiltersFromConstraints take in account EqualNullSafe > -- > > Key: SPARK-32801 > URL: https://issues.apache.org/jira/browse/SPARK-32801 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Tanel Kiis >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
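For context on the SPARK-32801 notifications above: null-safe equality (SQL's `<=>`, Catalyst's EqualNullSafe) differs from ordinary `=` under three-valued logic, which is why a rule mining filters from EqualTo constraints does not automatically cover it. A stand-alone sketch of the two semantics, modeling SQL NULL as `None` (the `NullSafeEq` object is a hypothetical illustration, not Catalyst code):

```scala
// Hypothetical model of SQL equality semantics with Option as NULL.
object NullSafeEq {
  // SQL `=` returns NULL (here None) when either operand is NULL.
  def equalTo(a: Option[Int], b: Option[Int]): Option[Boolean] =
    for (x <- a; y <- b) yield x == y

  // SQL `<=>` (EqualNullSafe) always returns true/false,
  // treating NULL <=> NULL as true.
  def equalNullSafe(a: Option[Int], b: Option[Int]): Boolean =
    (a, b) match {
      case (None, None)       => true
      case (Some(x), Some(y)) => x == y
      case _                  => false
    }
}
```

Because `a <=> b` is total (never NULL), constraints inferred from it can be propagated much like those from `a = b` once the NULL-matching case is accounted for.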
[jira] [Assigned] (SPARK-32779) Spark/Hive3 interaction potentially causes deadlock
[ https://issues.apache.org/jira/browse/SPARK-32779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32779: Assignee: (was: Apache Spark) > Spark/Hive3 interaction potentially causes deadlock > --- > > Key: SPARK-32779 > URL: https://issues.apache.org/jira/browse/SPARK-32779 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Bruce Robbins >Priority: Major > > This is an issue for applications that share a Spark Session across multiple > threads. > sessionCatalog.loadPartition (after checking that the table exists) grabs > locks in this order: > - HiveExternalCatalog > - HiveSessionCatalog (in Shim_v3_0) > Other operations (e.g., sessionCatalog.tableExists) grab locks in this order: > - HiveSessionCatalog > - HiveExternalCatalog > [This|https://github.com/apache/spark/blob/ad6b887541bf90cc3ea830a1a3322b71ccdd80ee/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala#L1332] > appears to be the culprit. Maybe the db name should be defaulted _before_ the > call to HiveClient so that Shim_v3_0 doesn't have to call back into > SessionCatalog. Or possibly this is not needed at all, since loadPartition in > Shim_v2_1 doesn't worry about the default db name, but that might be because > of differences between Hive client libraries. 
> Reproduction case: > - You need to have a running Hive 3.x HMS instance and the appropriate > hive-site.xml for your Spark instance > - Adjust your spark.sql.hive.metastore.version accordingly > - It might take more than one try to hit the deadlock > Launch Spark: > {noformat} > bin/spark-shell --conf "spark.sql.hive.metastore.jars=${HIVE_HOME}/lib/*" > --conf spark.sql.hive.metastore.version=3.1 > {noformat} > Then use the following code: > {noformat} > import scala.collection.mutable.ArrayBuffer > import scala.util.Random > val tableCount = 4 > for (i <- 0 until tableCount) { > val tableName = s"partitioned${i+1}" > sql(s"drop table if exists $tableName") > sql(s"create table $tableName (a bigint) partitioned by (b bigint) stored > as orc") > } > val threads = new ArrayBuffer[Thread] > for (i <- 0 until tableCount) { > threads.append(new Thread( new Runnable { > override def run: Unit = { > val tableName = s"partitioned${i + 1}" > val rand = Random > val df = spark.range(0, 2).toDF("a") > val location = s"/tmp/${rand.nextLong.abs}" > df.write.mode("overwrite").orc(location) > sql( > s""" > LOAD DATA LOCAL INPATH '$location' INTO TABLE $tableName partition > (b=$i)""") > } > }, s"worker$i")) > threads(i).start() > } > for (i <- 0 until tableCount) { > println(s"Joining with thread $i") > threads(i).join() > } > println("All done") > {noformat} > The job often gets stuck after one or two "Joining..." lines. 
> {{kill -3}} shows something like this: > {noformat} > Found one Java-level deadlock: > = > "worker3": > waiting to lock monitor 0x7fdc3cde6798 (object 0x000784d98ac8, a > org.apache.spark.sql.hive.HiveSessionCatalog), > which is held by "worker0" > "worker0": > waiting to lock monitor 0x7fdc441d1b88 (object 0x0007861d1208, a > org.apache.spark.sql.hive.HiveExternalCatalog), > which is held by "worker3" > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
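The deadlock reported above is the classic inconsistent-lock-ordering pattern: one code path locks HiveExternalCatalog then HiveSessionCatalog, another locks them in the reverse order, and two threads can each hold one lock while waiting for the other. The standard remedy is a single global acquisition order. A self-contained sketch of that discipline; `LockOrdering` and its lock names are stand-ins, not the Spark catalog classes:

```scala
// Hypothetical sketch: both operations acquire the two monitors in the SAME
// order, so the circular wait seen in the kill -3 dump cannot form.
object LockOrdering {
  private val sessionCatalogLock  = new Object // stand-in for HiveSessionCatalog
  private val externalCatalogLock = new Object // stand-in for HiveExternalCatalog

  def loadPartition(body: => Unit): Unit =
    sessionCatalogLock.synchronized {
      externalCatalogLock.synchronized { body }
    }

  def tableExists(body: => Unit): Unit =
    sessionCatalogLock.synchronized {
      externalCatalogLock.synchronized { body }
    }
}
```

With both paths ordered session-then-external, concurrent `loadPartition` and `tableExists` calls serialize cleanly instead of deadlocking, which is the shape of fix the ticket suggests (resolve the default db name before entering the shim so it never re-enters SessionCatalog in the opposite order).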
[jira] [Assigned] (SPARK-32779) Spark/Hive3 interaction potentially causes deadlock
[ https://issues.apache.org/jira/browse/SPARK-32779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32779: Assignee: Apache Spark > Spark/Hive3 interaction potentially causes deadlock > --- > > Key: SPARK-32779 > URL: https://issues.apache.org/jira/browse/SPARK-32779 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Bruce Robbins >Assignee: Apache Spark >Priority: Major > > This is an issue for applications that share a Spark Session across multiple > threads. > sessionCatalog.loadPartition (after checking that the table exists) grabs > locks in this order: > - HiveExternalCatalog > - HiveSessionCatalog (in Shim_v3_0) > Other operations (e.g., sessionCatalog.tableExists) grab locks in this order: > - HiveSessionCatalog > - HiveExternalCatalog > [This|https://github.com/apache/spark/blob/ad6b887541bf90cc3ea830a1a3322b71ccdd80ee/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala#L1332] > appears to be the culprit. Maybe the db name should be defaulted _before_ the > call to HiveClient so that Shim_v3_0 doesn't have to call back into > SessionCatalog. Or possibly this is not needed at all, since loadPartition in > Shim_v2_1 doesn't worry about the default db name, but that might be because > of differences between Hive client libraries. 
> Reproduction case: > - You need to have a running Hive 3.x HMS instance and the appropriate > hive-site.xml for your Spark instance > - Adjust your spark.sql.hive.metastore.version accordingly > - It might take more than one try to hit the deadlock > Launch Spark: > {noformat} > bin/spark-shell --conf "spark.sql.hive.metastore.jars=${HIVE_HOME}/lib/*" > --conf spark.sql.hive.metastore.version=3.1 > {noformat} > Then use the following code: > {noformat} > import scala.collection.mutable.ArrayBuffer > import scala.util.Random > val tableCount = 4 > for (i <- 0 until tableCount) { > val tableName = s"partitioned${i+1}" > sql(s"drop table if exists $tableName") > sql(s"create table $tableName (a bigint) partitioned by (b bigint) stored > as orc") > } > val threads = new ArrayBuffer[Thread] > for (i <- 0 until tableCount) { > threads.append(new Thread( new Runnable { > override def run: Unit = { > val tableName = s"partitioned${i + 1}" > val rand = Random > val df = spark.range(0, 2).toDF("a") > val location = s"/tmp/${rand.nextLong.abs}" > df.write.mode("overwrite").orc(location) > sql( > s""" > LOAD DATA LOCAL INPATH '$location' INTO TABLE $tableName partition > (b=$i)""") > } > }, s"worker$i")) > threads(i).start() > } > for (i <- 0 until tableCount) { > println(s"Joining with thread $i") > threads(i).join() > } > println("All done") > {noformat} > The job often gets stuck after one or two "Joining..." lines. 
> {{kill -3}} shows something like this: > {noformat} > Found one Java-level deadlock: > = > "worker3": > waiting to lock monitor 0x7fdc3cde6798 (object 0x000784d98ac8, a > org.apache.spark.sql.hive.HiveSessionCatalog), > which is held by "worker0" > "worker0": > waiting to lock monitor 0x7fdc441d1b88 (object 0x0007861d1208, a > org.apache.spark.sql.hive.HiveExternalCatalog), > which is held by "worker3" > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32779) Spark/Hive3 interaction potentially causes deadlock
[ https://issues.apache.org/jira/browse/SPARK-32779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190671#comment-17190671 ] Apache Spark commented on SPARK-32779: -- User 'sandeep-katta' has created a pull request for this issue: https://github.com/apache/spark/pull/29649 > Spark/Hive3 interaction potentially causes deadlock > --- > > Key: SPARK-32779 > URL: https://issues.apache.org/jira/browse/SPARK-32779 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Bruce Robbins >Priority: Major > > This is an issue for applications that share a Spark Session across multiple > threads. > sessionCatalog.loadPartition (after checking that the table exists) grabs > locks in this order: > - HiveExternalCatalog > - HiveSessionCatalog (in Shim_v3_0) > Other operations (e.g., sessionCatalog.tableExists) grab locks in this order: > - HiveSessionCatalog > - HiveExternalCatalog > [This|https://github.com/apache/spark/blob/ad6b887541bf90cc3ea830a1a3322b71ccdd80ee/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala#L1332] > appears to be the culprit. Maybe the db name should be defaulted _before_ the > call to HiveClient so that Shim_v3_0 doesn't have to call back into > SessionCatalog. Or possibly this is not needed at all, since loadPartition in > Shim_v2_1 doesn't worry about the default db name, but that might be because > of differences between Hive client libraries. 
> Reproduction case: > - You need to have a running Hive 3.x HMS instance and the appropriate > hive-site.xml for your Spark instance > - Adjust your spark.sql.hive.metastore.version accordingly > - It might take more than one try to hit the deadlock > Launch Spark: > {noformat} > bin/spark-shell --conf "spark.sql.hive.metastore.jars=${HIVE_HOME}/lib/*" > --conf spark.sql.hive.metastore.version=3.1 > {noformat} > Then use the following code: > {noformat} > import scala.collection.mutable.ArrayBuffer > import scala.util.Random > val tableCount = 4 > for (i <- 0 until tableCount) { > val tableName = s"partitioned${i+1}" > sql(s"drop table if exists $tableName") > sql(s"create table $tableName (a bigint) partitioned by (b bigint) stored > as orc") > } > val threads = new ArrayBuffer[Thread] > for (i <- 0 until tableCount) { > threads.append(new Thread( new Runnable { > override def run: Unit = { > val tableName = s"partitioned${i + 1}" > val rand = Random > val df = spark.range(0, 2).toDF("a") > val location = s"/tmp/${rand.nextLong.abs}" > df.write.mode("overwrite").orc(location) > sql( > s""" > LOAD DATA LOCAL INPATH '$location' INTO TABLE $tableName partition > (b=$i)""") > } > }, s"worker$i")) > threads(i).start() > } > for (i <- 0 until tableCount) { > println(s"Joining with thread $i") > threads(i).join() > } > println("All done") > {noformat} > The job often gets stuck after one or two "Joining..." lines. 
> {{kill -3}} shows something like this: > {noformat} > Found one Java-level deadlock: > = > "worker3": > waiting to lock monitor 0x7fdc3cde6798 (object 0x000784d98ac8, a > org.apache.spark.sql.hive.HiveSessionCatalog), > which is held by "worker0" > "worker0": > waiting to lock monitor 0x7fdc441d1b88 (object 0x0007861d1208, a > org.apache.spark.sql.hive.HiveExternalCatalog), > which is held by "worker3" > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32801) Make InferFiltersFromConstraints take in account EqualNullSafe
Tanel Kiis created SPARK-32801: -- Summary: Make InferFiltersFromConstraints take in account EqualNullSafe Key: SPARK-32801 URL: https://issues.apache.org/jira/browse/SPARK-32801 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Tanel Kiis -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32312) Upgrade Apache Arrow to 1.0.0
[ https://issues.apache.org/jira/browse/SPARK-32312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190661#comment-17190661 ] Hyukjin Kwon commented on SPARK-32312: -- Thank you [~bryanc]! > Upgrade Apache Arrow to 1.0.0 > - > > Key: SPARK-32312 > URL: https://issues.apache.org/jira/browse/SPARK-32312 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > > Apache Arrow will soon release v1.0.0 which provides backward/forward > compatibility guarantees as well as a number of fixes and improvements. This > will upgrade the Java artifact and PySpark API. Although PySpark will not > need special changes, it might be a good idea to bump up minimum supported > version and CI testing. > TBD: list of important improvements and fixes -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32800) Remove ExpressionSet from the 2.13 branch
[ https://issues.apache.org/jira/browse/SPARK-32800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32800: Assignee: (was: Apache Spark) > Remove ExpressionSet from the 2.13 branch > - > > Key: SPARK-32800 > URL: https://issues.apache.org/jira/browse/SPARK-32800 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Ali Afroozeh >Priority: Major > > ExpressionSet does not extend Scala Set anymore and can therefore be > removed from the 2.13 branch. This is a follow-up on > https://issues.apache.org/jira/browse/SPARK-32755. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32800) Remove ExpressionSet from the 2.13 branch
[ https://issues.apache.org/jira/browse/SPARK-32800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190657#comment-17190657 ] Apache Spark commented on SPARK-32800: -- User 'dbaliafroozeh' has created a pull request for this issue: https://github.com/apache/spark/pull/29648 > Remove ExpressionSet from the 2.13 branch > - > > Key: SPARK-32800 > URL: https://issues.apache.org/jira/browse/SPARK-32800 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Ali Afroozeh >Priority: Major > > ExpressionSet does not extend Scala Set anymore and can therefore be > removed from the 2.13 branch. This is a follow-up on > https://issues.apache.org/jira/browse/SPARK-32755. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32800) Remove ExpressionSet from the 2.13 branch
[ https://issues.apache.org/jira/browse/SPARK-32800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32800: Assignee: Apache Spark > Remove ExpressionSet from the 2.13 branch > - > > Key: SPARK-32800 > URL: https://issues.apache.org/jira/browse/SPARK-32800 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Ali Afroozeh >Assignee: Apache Spark >Priority: Major > > ExpressionSet does not extend Scala Set anymore and can therefore be > removed from the 2.13 branch. This is a follow-up on > https://issues.apache.org/jira/browse/SPARK-32755. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32017) Make Pyspark Hadoop 3.2+ Variant available in PyPI
[ https://issues.apache.org/jira/browse/SPARK-32017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190654#comment-17190654 ] Hyukjin Kwon commented on SPARK-32017: -- I am working on this. > Make Pyspark Hadoop 3.2+ Variant available in PyPI > -- > > Key: SPARK-32017 > URL: https://issues.apache.org/jira/browse/SPARK-32017 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: George Pongracz >Priority: Major > > The version of Pyspark 3.0.0 currently available in PyPI currently uses > hadoop 2.7.4. > Could a variant (or the default) have its version of Hadoop aligned to 3.2.0 > as per the downloadable spark binaries. > This would enable the PyPI version to be compatible with session token > authorisations and assist in accessing data residing in object stores with > stronger encryption methods. > If not PyPI then as a tar file in the apache download archives at the least > please. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32798) Make unionByName optionally fill missing columns with nulls in PySpark
[ https://issues.apache.org/jira/browse/SPARK-32798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190650#comment-17190650 ] Hyukjin Kwon commented on SPARK-32798: -- cc [~nchammas] [~fokko] or [~zero323] are you interested in this :-)? > Make unionByName optionally fill missing columns with nulls in PySpark > -- > > Key: SPARK-32798 > URL: https://issues.apache.org/jira/browse/SPARK-32798 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > It would be better to expose {{unionByName}} parameter in Python APIs as > well. Currently this is only exposed in Scala/Java APIs (at SPARK-29358) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32799) Make unionByName optionally fill missing columns with nulls in SparkR
[ https://issues.apache.org/jira/browse/SPARK-32799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190646#comment-17190646 ] Hyukjin Kwon commented on SPARK-32799: -- cc [~michaelchirico] are you interested in this? > Make unionByName optionally fill missing columns with nulls in SparkR > - > > Key: SPARK-32799 > URL: https://issues.apache.org/jira/browse/SPARK-32799 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > It would be nicer to expose the {{unionByName}} parameter in R APIs as well. > Currently this is only exposed on the Scala/Java side (SPARK-29358). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32800) Remove ExpressionSet from the 2.13 branch
Ali Afroozeh created SPARK-32800: Summary: Remove ExpressionSet from the 2.13 branch Key: SPARK-32800 URL: https://issues.apache.org/jira/browse/SPARK-32800 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Ali Afroozeh ExpressionSet does not extend Scala Set anymore and can therefore be removed from the 2.13 branch. This is a follow-up to https://issues.apache.org/jira/browse/SPARK-32755. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32798) Make unionByName optionally fill missing columns with nulls in PySpark
[ https://issues.apache.org/jira/browse/SPARK-32798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32798: - Labels: (was: starter) > Make unionByName optionally fill missing columns with nulls in PySpark > -- > > Key: SPARK-32798 > URL: https://issues.apache.org/jira/browse/SPARK-32798 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > It would be better to expose {{unionByName}} parameter in Python APIs as > well. Currently this is only exposed in Scala/Java APIs (at SPARK-29358) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32799) Make unionByName optionally fill missing columns with nulls in SparkR
Hyukjin Kwon created SPARK-32799: Summary: Make unionByName optionally fill missing columns with nulls in SparkR Key: SPARK-32799 URL: https://issues.apache.org/jira/browse/SPARK-32799 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 3.1.0 Reporter: Hyukjin Kwon It would be nicer to expose the {{unionByName}} parameter in R APIs as well. Currently this is only exposed on the Scala/Java side (SPARK-29358). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32798) Make unionByName optionally fill missing columns with nulls in PySpark
Hyukjin Kwon created SPARK-32798: Summary: Make unionByName optionally fill missing columns with nulls in PySpark Key: SPARK-32798 URL: https://issues.apache.org/jira/browse/SPARK-32798 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.1.0 Reporter: Hyukjin Kwon It would be better to expose {{unionByName}} parameter in Python APIs as well. Currently this is only exposed in Scala/Java APIs (at SPARK-29358) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
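For readers unfamiliar with the requested behavior: plain `union` matches columns by position, while `unionByName` matches them by name, and the parameter added on the Scala/Java side by SPARK-29358 (`allowMissingColumns`) null-fills columns that exist on only one side. A toy, Spark-free Python sketch of that fill semantics (the helper and its signature are purely illustrative, not the actual Spark API):

```python
# Union two row sets by column name, padding columns missing on either
# side with None -- a plain-Python illustration of the requested option.
def union_by_name(rows_a, cols_a, rows_b, cols_b):
    # Preserve first-seen column order across both inputs.
    all_cols = list(dict.fromkeys(cols_a + cols_b))

    def pad(rows, cols):
        idx = {c: i for i, c in enumerate(cols)}
        # For each row, pull the value if the column exists, else fill None.
        return [tuple(r[idx[c]] if c in idx else None for c in all_cols)
                for r in rows]

    return all_cols, pad(rows_a, cols_a) + pad(rows_b, cols_b)

cols, rows = union_by_name([(1, 2)], ["a", "b"], [(3,)], ["a"])
print(cols)  # ['a', 'b']
print(rows)  # [(1, 2), (3, None)]
```

The real PySpark change would simply forward such a flag to the existing JVM-side implementation rather than reimplement the padding.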
[jira] [Resolved] (SPARK-32783) Development - Testing PySpark
[ https://issues.apache.org/jira/browse/SPARK-32783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32783. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29634 [https://github.com/apache/spark/pull/29634] > Development - Testing PySpark > - > > Key: SPARK-32783 > URL: https://issues.apache.org/jira/browse/SPARK-32783 > Project: Spark > Issue Type: Sub-task > Components: docs, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.0 > > > Link back https://spark.apache.org/developer-tools.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32764) comparison of -0.0 < 0.0 returns true
[ https://issues.apache.org/jira/browse/SPARK-32764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190639#comment-17190639 ] Wenchen Fan commented on SPARK-32764: - Thanks for your analysis, [~sandeep.katta2007] I think we need to add back the `OrderingUtil` to implement custom compare methods to take care of 0.0 vs -0.0. I've opened a PR for it: [https://github.com/apache/spark/pull/29647] > compare of -0.0 < 0.0 return true > - > > Key: SPARK-32764 > URL: https://issues.apache.org/jira/browse/SPARK-32764 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Izek Greenfield >Priority: Major > Labels: correctness > Attachments: 2.4_codegen.txt, 3.0_Codegen.txt > > > {code:scala} > val spark: SparkSession = SparkSession > .builder() > .master("local") > .appName("SparkByExamples.com") > .getOrCreate() > spark.sparkContext.setLogLevel("ERROR") > import spark.sqlContext.implicits._ > val df = Seq((-0.0, 0.0)).toDF("neg", "pos") > .withColumn("comp", col("neg") < col("pos")) > df.show(false) > == > ++---++ > |neg |pos|comp| > ++---++ > |-0.0|0.0|true| > ++---++{code} > I think that result should be false. > **Apache Spark 2.4.6 RESULT** > {code} > scala> spark.version > res0: String = 2.4.6 > scala> Seq((-0.0, 0.0)).toDF("neg", "pos").withColumn("comp", col("neg") < > col("pos")).show > ++---+-+ > | neg|pos| comp| > ++---+-+ > |-0.0|0.0|false| > ++---+-+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
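The discrepancy comes down to which comparison the generated code uses: IEEE 754's `<` treats -0.0 and 0.0 as equal, while a bit-based total order (what `java.lang.Double.compare` implements) ranks -0.0 strictly below 0.0, so an ordering built on it makes `-0.0 < 0.0` evaluate to true. A minimal Python sketch of the two orderings (the helper name is hypothetical, not Spark code):

```python
import struct

def total_order_compare(a: float, b: float) -> int:
    """Mimic java.lang.Double.compare: a total order in which
    -0.0 sorts below 0.0 and NaN sorts above every other value."""
    if a < b:
        return -1
    if a > b:
        return 1
    # IEEE comparison says "equal" (or a NaN is involved): fall back to
    # the signed 64-bit bit patterns, which distinguish -0.0 from 0.0.
    a_bits = struct.unpack(">q", struct.pack(">d", a))[0]
    b_bits = struct.unpack(">q", struct.pack(">d", b))[0]
    return (a_bits > b_bits) - (a_bits < b_bits)

print(-0.0 < 0.0)                      # False: IEEE 754 treats them as equal
print(total_order_compare(-0.0, 0.0))  # -1: the total order does not
```

SQL's `<` is expected to follow the IEEE behavior, which is why custom compare methods that special-case 0.0 vs -0.0 (as the proposed `OrderingUtil` does) are needed.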
[jira] [Assigned] (SPARK-32764) comparison of -0.0 < 0.0 returns true
[ https://issues.apache.org/jira/browse/SPARK-32764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32764: Assignee: (was: Apache Spark) > compare of -0.0 < 0.0 return true > - > > Key: SPARK-32764 > URL: https://issues.apache.org/jira/browse/SPARK-32764 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Izek Greenfield >Priority: Major > Labels: correctness > Attachments: 2.4_codegen.txt, 3.0_Codegen.txt > > > {code:scala} > val spark: SparkSession = SparkSession > .builder() > .master("local") > .appName("SparkByExamples.com") > .getOrCreate() > spark.sparkContext.setLogLevel("ERROR") > import spark.sqlContext.implicits._ > val df = Seq((-0.0, 0.0)).toDF("neg", "pos") > .withColumn("comp", col("neg") < col("pos")) > df.show(false) > == > ++---++ > |neg |pos|comp| > ++---++ > |-0.0|0.0|true| > ++---++{code} > I think that result should be false. > **Apache Spark 2.4.6 RESULT** > {code} > scala> spark.version > res0: String = 2.4.6 > scala> Seq((-0.0, 0.0)).toDF("neg", "pos").withColumn("comp", col("neg") < > col("pos")).show > ++---+-+ > | neg|pos| comp| > ++---+-+ > |-0.0|0.0|false| > ++---+-+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32764) comparison of -0.0 < 0.0 returns true
[ https://issues.apache.org/jira/browse/SPARK-32764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32764: Assignee: Apache Spark > compare of -0.0 < 0.0 return true > - > > Key: SPARK-32764 > URL: https://issues.apache.org/jira/browse/SPARK-32764 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Izek Greenfield >Assignee: Apache Spark >Priority: Major > Labels: correctness > Attachments: 2.4_codegen.txt, 3.0_Codegen.txt > > > {code:scala} > val spark: SparkSession = SparkSession > .builder() > .master("local") > .appName("SparkByExamples.com") > .getOrCreate() > spark.sparkContext.setLogLevel("ERROR") > import spark.sqlContext.implicits._ > val df = Seq((-0.0, 0.0)).toDF("neg", "pos") > .withColumn("comp", col("neg") < col("pos")) > df.show(false) > == > ++---++ > |neg |pos|comp| > ++---++ > |-0.0|0.0|true| > ++---++{code} > I think that result should be false. > **Apache Spark 2.4.6 RESULT** > {code} > scala> spark.version > res0: String = 2.4.6 > scala> Seq((-0.0, 0.0)).toDF("neg", "pos").withColumn("comp", col("neg") < > col("pos")).show > ++---+-+ > | neg|pos| comp| > ++---+-+ > |-0.0|0.0|false| > ++---+-+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32764) comparison of -0.0 < 0.0 returns true
[ https://issues.apache.org/jira/browse/SPARK-32764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190637#comment-17190637 ] Apache Spark commented on SPARK-32764: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/29647 > compare of -0.0 < 0.0 return true > - > > Key: SPARK-32764 > URL: https://issues.apache.org/jira/browse/SPARK-32764 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Izek Greenfield >Priority: Major > Labels: correctness > Attachments: 2.4_codegen.txt, 3.0_Codegen.txt > > > {code:scala} > val spark: SparkSession = SparkSession > .builder() > .master("local") > .appName("SparkByExamples.com") > .getOrCreate() > spark.sparkContext.setLogLevel("ERROR") > import spark.sqlContext.implicits._ > val df = Seq((-0.0, 0.0)).toDF("neg", "pos") > .withColumn("comp", col("neg") < col("pos")) > df.show(false) > == > ++---++ > |neg |pos|comp| > ++---++ > |-0.0|0.0|true| > ++---++{code} > I think that result should be false. > **Apache Spark 2.4.6 RESULT** > {code} > scala> spark.version > res0: String = 2.4.6 > scala> Seq((-0.0, 0.0)).toDF("neg", "pos").withColumn("comp", col("neg") < > col("pos")).show > ++---+-+ > | neg|pos| comp| > ++---+-+ > |-0.0|0.0|false| > ++---+-+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32779) Spark/Hive3 interaction potentially causes deadlock
[ https://issues.apache.org/jira/browse/SPARK-32779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190626#comment-17190626 ] Sandeep Katta commented on SPARK-32779: --- [~bersprockets] thanks for raising this issue, I am working on this. I will raise the patch soon > Spark/Hive3 interaction potentially causes deadlock > --- > > Key: SPARK-32779 > URL: https://issues.apache.org/jira/browse/SPARK-32779 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Bruce Robbins >Priority: Major > > This is an issue for applications that share a Spark Session across multiple > threads. > sessionCatalog.loadPartition (after checking that the table exists) grabs > locks in this order: > - HiveExternalCatalog > - HiveSessionCatalog (in Shim_v3_0) > Other operations (e.g., sessionCatalog.tableExists), grab locks in this order: > - HiveSessionCatalog > - HiveExternalCatalog > [This|https://github.com/apache/spark/blob/ad6b887541bf90cc3ea830a1a3322b71ccdd80ee/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala#L1332] > appears to be the culprit. Maybe db name should be defaulted _before_ the > call to HiveClient so that Shim_v3_0 doesn't have to call back into > SessionCatalog. Or possibly this is not needed at all, since loadPartition in > Shim_v2_1 doesn't worry about the default db name, but that might be because > of differences between Hive client libraries. 
> Reproduction case: > - You need to have a running Hive 3.x HMS instance and the appropriate > hive-site.xml for your Spark instance > - Adjust your spark.sql.hive.metastore.version accordingly > - It might take more than one try to hit the deadlock > Launch Spark: > {noformat} > bin/spark-shell --conf "spark.sql.hive.metastore.jars=${HIVE_HOME}/lib/*" > --conf spark.sql.hive.metastore.version=3.1 > {noformat} > Then use the following code: > {noformat} > import scala.collection.mutable.ArrayBuffer > import scala.util.Random > val tableCount = 4 > for (i <- 0 until tableCount) { > val tableName = s"partitioned${i+1}" > sql(s"drop table if exists $tableName") > sql(s"create table $tableName (a bigint) partitioned by (b bigint) stored > as orc") > } > val threads = new ArrayBuffer[Thread] > for (i <- 0 until tableCount) { > threads.append(new Thread( new Runnable { > override def run: Unit = { > val tableName = s"partitioned${i + 1}" > val rand = Random > val df = spark.range(0, 2).toDF("a") > val location = s"/tmp/${rand.nextLong.abs}" > df.write.mode("overwrite").orc(location) > sql( > s""" > LOAD DATA LOCAL INPATH '$location' INTO TABLE $tableName partition > (b=$i)""") > } > }, s"worker$i")) > threads(i).start() > } > for (i <- 0 until tableCount) { > println(s"Joining with thread $i") > threads(i).join() > } > println("All done") > {noformat} > The job often gets stuck after one or two "Joining..." lines. 
> {{kill -3}} shows something like this: > {noformat} > Found one Java-level deadlock: > = > "worker3": > waiting to lock monitor 0x7fdc3cde6798 (object 0x000784d98ac8, a > org.apache.spark.sql.hive.HiveSessionCatalog), > which is held by "worker0" > "worker0": > waiting to lock monitor 0x7fdc441d1b88 (object 0x0007861d1208, a > org.apache.spark.sql.hive.HiveExternalCatalog), > which is held by "worker3" > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
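The thread dump above is a classic lock-ordering cycle: `loadPartition` takes the HiveExternalCatalog monitor and then the HiveSessionCatalog monitor, while `tableExists` takes them in the opposite order. The standard fix (which defaulting the db name before the HiveClient call would achieve) is to make every path acquire the monitors in one global order, so the cycle cannot form. A hedged Python sketch of that fix pattern, with stand-in locks rather than the real catalog objects:

```python
import threading

# Stand-ins for the two monitors named in the thread dump.
SESSION_CATALOG = threading.Lock()   # HiveSessionCatalog
EXTERNAL_CATALOG = threading.Lock()  # HiveExternalCatalog

def load_partition():
    # Fixed global ordering: session catalog first, then external catalog...
    with SESSION_CATALOG:
        with EXTERNAL_CATALOG:
            pass  # ...do the actual partition load here

def table_exists():
    # ...and the same order on every other path, so no hold-and-wait
    # cycle (A holds X wants Y, B holds Y wants X) can form.
    with SESSION_CATALOG:
        with EXTERNAL_CATALOG:
            pass

threads = [threading.Thread(target=f)
           for f in [load_partition, table_exists] * 8]
for t in threads:
    t.start()
for t in threads:
    t.join(timeout=5)
print(all(not t.is_alive() for t in threads))  # True: every worker finished
```

With the inverted ordering on one of the two paths, the same harness can hang exactly as the reproduction case does.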
[jira] [Updated] (SPARK-32797) Install mypy on the Jenkins CI workers
[ https://issues.apache.org/jira/browse/SPARK-32797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fokko Driesprong updated SPARK-32797: - Description: We want to check the types of the PySpark code. This requires mypy to be installed on the CI. Can you do this [~shaneknapp]? Related PR: [https://github.com/apache/spark/pull/29180] You can install this using pip: [https://pypi.org/project/mypy/] Should be similar to flake8 and sphinx. The latest version is ok! Thanks! > Install mypy on the Jenkins CI workers > -- > > Key: SPARK-32797 > URL: https://issues.apache.org/jira/browse/SPARK-32797 > Project: Spark > Issue Type: Improvement > Components: jenkins, PySpark >Affects Versions: 3.0.0 >Reporter: Fokko Driesprong >Priority: Major > > We want to check the types of the PySpark code. This requires mypy to be > installed on the CI. Can you do this [~shaneknapp]? > Related PR: [https://github.com/apache/spark/pull/29180] > You can install this using pip: [https://pypi.org/project/mypy/] Should be > similar to flake8 and sphinx. The latest version is ok! Thanks! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32797) Install mypy on the Jenkins CI workers
Fokko Driesprong created SPARK-32797: Summary: Install mypy on the Jenkins CI workers Key: SPARK-32797 URL: https://issues.apache.org/jira/browse/SPARK-32797 Project: Spark Issue Type: Improvement Components: jenkins, PySpark Affects Versions: 3.0.0 Reporter: Fokko Driesprong -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32795) ApplicationInfo#removedExecutors can cause OOM
[ https://issues.apache.org/jira/browse/SPARK-32795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-32795: - Priority: Major (was: Critical) > ApplicationInfo#removedExecutors can cause OOM > -- > > Key: SPARK-32795 > URL: https://issues.apache.org/jira/browse/SPARK-32795 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Victor Tso >Priority: Major > Attachments: image-2020-09-03-23-27-11-809.png > > > !image-2020-09-03-23-27-11-809.png|width=840,height=439! > In my case, the Standalone Spark master process had a max heap of 1g. 738mb > were consumed by these ExecutorDesc objects, the vast majority of which were > the 18.5M removedExecutors. This caused the master to OOM and leave the > application driver process dangling. > The reason for this is that the worker node ran out of disk space, so for > whatever reason decided to go in a fast and endless loop trying to launch new > executors and they in turn crashed too. It got up to the 18M before the > master just couldn't handle the history anymore. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org