[jira] [Updated] (SPARK-31848) DAGSchedulerSuite: Break down the very huge test files, each test suite should focus on one or several major features, but not all the related behaviors

2020-09-14 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-31848:
---
Summary: DAGSchedulerSuite: Break down the very huge test files, each test 
suite should focus on one or several major features, but not all the related 
behaviors  (was: Break down the very huge test files, each test suite should 
focus on one or several major features, but not all the related behaviors)

> DAGSchedulerSuite: Break down the very huge test files, each test suite 
> should focus on one or several major features, but not all the related 
> behaviors
> 
>
> Key: SPARK-31848
> URL: https://issues.apache.org/jira/browse/SPARK-31848
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31848) DAGSchedulerSuite: Break down the very huge test files, each test suite should focus on one or several major features, but not all the related behaviors

2020-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31848:


Assignee: Apache Spark

> DAGSchedulerSuite: Break down the very huge test files, each test suite 
> should focus on one or several major features, but not all the related 
> behaviors
> 
>
> Key: SPARK-31848
> URL: https://issues.apache.org/jira/browse/SPARK-31848
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31848) DAGSchedulerSuite: Break down the very huge test files, each test suite should focus on one or several major features, but not all the related behaviors

2020-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195242#comment-17195242
 ] 

Apache Spark commented on SPARK-31848:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/29747

> DAGSchedulerSuite: Break down the very huge test files, each test suite 
> should focus on one or several major features, but not all the related 
> behaviors
> 
>
> Key: SPARK-31848
> URL: https://issues.apache.org/jira/browse/SPARK-31848
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31848) DAGSchedulerSuite: Break down the very huge test files, each test suite should focus on one or several major features, but not all the related behaviors

2020-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31848:


Assignee: (was: Apache Spark)

> DAGSchedulerSuite: Break down the very huge test files, each test suite 
> should focus on one or several major features, but not all the related 
> behaviors
> 
>
> Key: SPARK-31848
> URL: https://issues.apache.org/jira/browse/SPARK-31848
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32872) BytesToBytesMap at MAX_CAPACITY exceeds growth threshold

2020-09-14 Thread Ankur Dave (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave updated SPARK-32872:
---
Affects Version/s: 1.4.1
                   1.5.2
                   1.6.3

> BytesToBytesMap at MAX_CAPACITY exceeds growth threshold
> 
>
> Key: SPARK-32872
> URL: https://issues.apache.org/jira/browse/SPARK-32872
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.7, 
> 3.0.1
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>Priority: Major
>
> When BytesToBytesMap is at {{MAX_CAPACITY}} and reaches the growth threshold, 
> {{numKeys >= growthThreshold}} is true but {{longArray.size() / 2 < 
> MAX_CAPACITY}} is false. This correctly prevents the map from growing, but 
> {{canGrowArray}} incorrectly remains true. Therefore the map keeps accepting 
> new keys and exceeds its growth threshold. If we attempt to spill the map in 
> this state, the UnsafeKVExternalSorter will not be able to reuse the long 
> array for sorting, causing grouping aggregations to fail with the following 
> error:
> {{2020-09-13 18:33:48,765 ERROR Executor - Exception in task 0.0 in stage 7.0 
> (TID 69)
> org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 12982025696 
> bytes of memory, got 0
>   at 
> org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:160)
>   at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:100)
>   at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:118)
>   at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:253)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithKeys_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithoutKey_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:733)
>   at 
> org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
>   at 
> org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:187)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
>   at org.apache.spark.scheduler.Task.run(Task.scala:117)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:660)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:663)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)}}
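
For illustration, here is a minimal, self-contained sketch of the growth check
described above. The names (MAX_CAPACITY, numKeys, growthThreshold,
canGrowArray) follow the description; this toy model is not the actual
BytesToBytesMap source.

{code:scala}
// Toy model of the growth check described in this issue; illustrative only.
object GrowthCheckSketch {
  val MAX_CAPACITY = 8          // tiny value for illustration
  var capacity = MAX_CAPACITY   // the map is already at MAX_CAPACITY
  var numKeys = 0
  var canGrowArray = true

  def growthThreshold: Int = capacity / 2

  // Simplified append(): returns false once the map refuses new keys.
  def append(): Boolean = {
    if (!canGrowArray) return false   // the caller would spill here
    numKeys += 1
    // "capacity < MAX_CAPACITY" stands in for "longArray.size() / 2 < MAX_CAPACITY".
    if (numKeys >= growthThreshold && capacity < MAX_CAPACITY) {
      capacity *= 2                   // never taken once at MAX_CAPACITY
    }
    // Bug: when growth is skipped because capacity == MAX_CAPACITY,
    // canGrowArray stays true, so the map keeps accepting keys past its
    // growth threshold; a later spill then cannot reuse the long array.
    true
  }

  def main(args: Array[String]): Unit = {
    (1 to 10).foreach(_ => append())
    // Prints numKeys=10, growthThreshold=4: the threshold has been exceeded.
    println(s"numKeys=$numKeys, growthThreshold=$growthThreshold")
  }
}
{code}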



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32872) BytesToBytesMap at MAX_CAPACITY exceeds growth threshold

2020-09-14 Thread Ankur Dave (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195252#comment-17195252
 ] 

Ankur Dave commented on SPARK-32872:


Thanks, [~dongjoon]! Based on a quick look at the history, I believe this issue 
was introduced by PR #6159 (https://issues.apache.org/jira/browse/SPARK-7251, 
commit 
https://github.com/apache/spark/commit/f2faa7af30662e3bdf15780f8719c71108f8e30b).
 If this is true, it dates back to Spark 1.4.0. I augmented the "Affects 
Version" field accordingly.

> BytesToBytesMap at MAX_CAPACITY exceeds growth threshold
> 
>
> Key: SPARK-32872
> URL: https://issues.apache.org/jira/browse/SPARK-32872
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.7, 
> 3.0.1
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>Priority: Major
>
> When BytesToBytesMap is at {{MAX_CAPACITY}} and reaches the growth threshold, 
> {{numKeys >= growthThreshold}} is true but {{longArray.size() / 2 < 
> MAX_CAPACITY}} is false. This correctly prevents the map from growing, but 
> {{canGrowArray}} incorrectly remains true. Therefore the map keeps accepting 
> new keys and exceeds its growth threshold. If we attempt to spill the map in 
> this state, the UnsafeKVExternalSorter will not be able to reuse the long 
> array for sorting, causing grouping aggregations to fail with the following 
> error:
> {{2020-09-13 18:33:48,765 ERROR Executor - Exception in task 0.0 in stage 7.0 
> (TID 69)
> org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 12982025696 
> bytes of memory, got 0
>   at 
> org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:160)
>   at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:100)
>   at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:118)
>   at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:253)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithKeys_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithoutKey_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:733)
>   at 
> org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
>   at 
> org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:187)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
>   at org.apache.spark.scheduler.Task.run(Task.scala:117)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:660)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:663)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32872) BytesToBytesMap at MAX_CAPACITY exceeds growth threshold

2020-09-14 Thread Ankur Dave (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195252#comment-17195252
 ] 

Ankur Dave edited comment on SPARK-32872 at 9/14/20, 7:30 AM:
--

Thanks, [~dongjoon]! Based on a quick look at the history, I believe this issue 
was introduced by [PR #6159|https://github.com/apache/spark/pull/6159] 
([SPARK-7251|https://issues.apache.org/jira/browse/SPARK-7251], commit 
[f2faa7af30662e3bdf15780f8719c71108f8e30b|https://github.com/apache/spark/commit/f2faa7af30662e3bdf15780f8719c71108f8e30b]).
 If this is true, it dates back to Spark 1.4.0. I augmented the "Affects 
Version" field accordingly.


was (Author: ankurd):
Thanks, [~dongjoon]! Based on a quick look at the history, I believe this issue 
was introduced by PR #6159 (https://issues.apache.org/jira/browse/SPARK-7251, 
commit 
https://github.com/apache/spark/commit/f2faa7af30662e3bdf15780f8719c71108f8e30b).
 If this is true, it dates back to Spark 1.4.0. I augmented the "Affects 
Version" field accordingly.

> BytesToBytesMap at MAX_CAPACITY exceeds growth threshold
> 
>
> Key: SPARK-32872
> URL: https://issues.apache.org/jira/browse/SPARK-32872
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.7, 
> 3.0.1
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>Priority: Major
>
> When BytesToBytesMap is at {{MAX_CAPACITY}} and reaches the growth threshold, 
> {{numKeys >= growthThreshold}} is true but {{longArray.size() / 2 < 
> MAX_CAPACITY}} is false. This correctly prevents the map from growing, but 
> {{canGrowArray}} incorrectly remains true. Therefore the map keeps accepting 
> new keys and exceeds its growth threshold. If we attempt to spill the map in 
> this state, the UnsafeKVExternalSorter will not be able to reuse the long 
> array for sorting, causing grouping aggregations to fail with the following 
> error:
> {{2020-09-13 18:33:48,765 ERROR Executor - Exception in task 0.0 in stage 7.0 
> (TID 69)
> org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 12982025696 
> bytes of memory, got 0
>   at 
> org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:160)
>   at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:100)
>   at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:118)
>   at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:253)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithKeys_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithoutKey_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:733)
>   at 
> org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
>   at 
> org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:187)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
>   at org.apache.spark.scheduler.Task.run(Task.scala:117)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:660)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:663)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32872) BytesToBytesMap at MAX_CAPACITY exceeds growth threshold

2020-09-14 Thread Ankur Dave (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195252#comment-17195252
 ] 

Ankur Dave edited comment on SPARK-32872 at 9/14/20, 7:37 AM:
--

Thanks, [~dongjoon]! Based on a quick look at the history, I believe this issue 
was introduced by [PR #9241|https://github.com/apache/spark/pull/9241] 
([SPARK-10342|https://issues.apache.org/jira/browse/SPARK-10342]). If this is 
true, it dates back to Spark 1.6.0. I augmented the "Affects Version" field 
accordingly.


was (Author: ankurd):
Thanks, [~dongjoon]! Based on a quick look at the history, I believe this issue 
was introduced by [PR #6159|https://github.com/apache/spark/pull/6159] 
([SPARK-7251|https://issues.apache.org/jira/browse/SPARK-7251], commit 
[f2faa7af30662e3bdf15780f8719c71108f8e30b|https://github.com/apache/spark/commit/f2faa7af30662e3bdf15780f8719c71108f8e30b]).
 If this is true, it dates back to Spark 1.4.0. I augmented the "Affects 
Version" field accordingly.

> BytesToBytesMap at MAX_CAPACITY exceeds growth threshold
> 
>
> Key: SPARK-32872
> URL: https://issues.apache.org/jira/browse/SPARK-32872
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.7, 
> 3.0.1
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>Priority: Major
>
> When BytesToBytesMap is at {{MAX_CAPACITY}} and reaches the growth threshold, 
> {{numKeys >= growthThreshold}} is true but {{longArray.size() / 2 < 
> MAX_CAPACITY}} is false. This correctly prevents the map from growing, but 
> {{canGrowArray}} incorrectly remains true. Therefore the map keeps accepting 
> new keys and exceeds its growth threshold. If we attempt to spill the map in 
> this state, the UnsafeKVExternalSorter will not be able to reuse the long 
> array for sorting, causing grouping aggregations to fail with the following 
> error:
> {{2020-09-13 18:33:48,765 ERROR Executor - Exception in task 0.0 in stage 7.0 
> (TID 69)
> org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 12982025696 
> bytes of memory, got 0
>   at 
> org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:160)
>   at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:100)
>   at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:118)
>   at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:253)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithKeys_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithoutKey_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:733)
>   at 
> org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
>   at 
> org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:187)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
>   at org.apache.spark.scheduler.Task.run(Task.scala:117)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:660)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:663)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32872) BytesToBytesMap at MAX_CAPACITY exceeds growth threshold

2020-09-14 Thread Ankur Dave (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave updated SPARK-32872:
---
Affects Version/s: (was: 1.5.2)
                   (was: 1.4.1)

> BytesToBytesMap at MAX_CAPACITY exceeds growth threshold
> 
>
> Key: SPARK-32872
> URL: https://issues.apache.org/jira/browse/SPARK-32872
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.7, 3.0.1
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>Priority: Major
>
> When BytesToBytesMap is at {{MAX_CAPACITY}} and reaches the growth threshold, 
> {{numKeys >= growthThreshold}} is true but {{longArray.size() / 2 < 
> MAX_CAPACITY}} is false. This correctly prevents the map from growing, but 
> {{canGrowArray}} incorrectly remains true. Therefore the map keeps accepting 
> new keys and exceeds its growth threshold. If we attempt to spill the map in 
> this state, the UnsafeKVExternalSorter will not be able to reuse the long 
> array for sorting, causing grouping aggregations to fail with the following 
> error:
> {{2020-09-13 18:33:48,765 ERROR Executor - Exception in task 0.0 in stage 7.0 
> (TID 69)
> org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 12982025696 
> bytes of memory, got 0
>   at 
> org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:160)
>   at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:100)
>   at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:118)
>   at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:253)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithKeys_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithoutKey_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:733)
>   at 
> org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
>   at 
> org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:187)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
>   at org.apache.spark.scheduler.Task.run(Task.scala:117)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:660)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:663)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32708) Query optimization fails to reuse exchange with DataSourceV2

2020-09-14 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-32708:
--

Assignee: Mingjia Liu

> Query optimization fails to reuse exchange with DataSourceV2
> 
>
> Key: SPARK-32708
> URL: https://issues.apache.org/jira/browse/SPARK-32708
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 2.4.7
>Reporter: Mingjia Liu
>Assignee: Mingjia Liu
>Priority: Major
>
> Repro query:
> {code:java}
> spark.conf.set("spark.sql.exchange.reuse","true")
> df = spark.read.format('parquet').load('gs://dataproc-kokoro-tests-us-central1/tpcds/1G/parquet/date_dim')
> # df = spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").option("table", 'tpcds_1G.date_dim').load()
>  
> df.createOrReplaceTempView('date_dim')
>  
> df = spark.sql(""" 
> WITH t1 AS (
>  SELECT 
>  d_year, d_month_seq
>  FROM (
>  SELECT t1.d_year , t2.d_month_seq 
>  FROM 
>  date_dim t1
>  cross join
>  date_dim t2
>  where t1.d_day_name = "Monday" and t1.d_fy_year > 2000
>  and t2.d_day_name = "Monday" and t2.d_fy_year > 2000
>  )
>  GROUP BY d_year, d_month_seq)
>  
>  SELECT
>  prev_yr.d_year AS prev_year, curr_yr.d_year AS year, curr_yr.d_month_seq
>  FROM t1 curr_yr cross join t1 prev_yr
>  WHERE curr_yr.d_year=2002 AND prev_yr.d_year=2002 
>  ORDER BY d_month_seq
>  LIMIT 100
>  
>  """)
> df.explain()
> #df.show()
> {code}
>  
> *The above query produces different plans with Parquet and DataSourceV2. Both 
> plans are correct, but the DataSourceV2 plan is less optimized:*
> *Sub-plan [5-7] is exactly the same as sub-plan [1-3] (an aggregate on the 
> broadcast-hash-joined dataset of two tables that are filtered and projected 
> the same way).*
> *Therefore, in the Parquet plan below, the exchange that happens after [1-3] is 
> reused to replace [5-6].*
> *However, the DataSourceV2 plan fails to do so.*
>  
> Parquet:
> {code:java}
> == Physical Plan ==
> TakeOrderedAndProject(limit=100, orderBy=[d_month_seq#21456L ASC NULLS 
> FIRST], output=[prev_year#21451L,year#21452L,d_month_seq#21456L])
> +- *(9) Project [d_year#21487L AS prev_year#21451L, d_year#20481L AS 
> year#21452L, d_month_seq#21456L]
>+- CartesianProduct
>   :- *(4) HashAggregate(keys=[d_year#20481L, d_month_seq#21456L], 
> functions=[])
>   :  +- Exchange hashpartitioning(d_year#20481L, d_month_seq#21456L, 200)
>   : +- *(3) HashAggregate(keys=[d_year#20481L, d_month_seq#21456L], 
> functions=[])
>   :+- BroadcastNestedLoopJoin BuildRight, Cross
>   :   :- *(1) Project [d_year#20481L]
>   :   :  +- *(1) Filter (isnotnull(d_year#20481L) && 
> isnotnull(d_day_name#20489)) && isnotnull(d_fy_year#20486L)) && 
> (d_day_name#20489 = Monday)) && (d_fy_year#20486L > 2000)) && (d_year#20481L 
> = 2002))
>   :   : +- *(1) FileScan parquet 
> [d_year#20481L,d_fy_year#20486L,d_day_name#20489] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[gs://dataproc-kokoro-tests-us-central1/tpcds/1G/parquet/date_dim],
>  PartitionFilters: [], PushedFilters: [IsNotNull(d_year), 
> IsNotNull(d_day_name), IsNotNull(d_fy_year), EqualTo(d_day_name,Monday), 
> Grea..., ReadSchema: struct
>   :   +- BroadcastExchange IdentityBroadcastMode
>   :  +- *(2) Project [d_month_seq#21456L]
>   : +- *(2) Filter (((isnotnull(d_day_name#21467) && 
> isnotnull(d_fy_year#21464L)) && (d_day_name#21467 = Monday)) && 
> (d_fy_year#21464L > 2000))
>   :+- *(2) FileScan parquet 
> [d_month_seq#21456L,d_fy_year#21464L,d_day_name#21467] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[gs://dataproc-kokoro-tests-us-central1/tpcds/1G/parquet/date_dim],
>  PartitionFilters: [], PushedFilters: [IsNotNull(d_day_name), 
> IsNotNull(d_fy_year), EqualTo(d_day_name,Monday), GreaterThan(d_fy_year,2..., 
> ReadSchema: struct
>   +- *(8) HashAggregate(keys=[d_year#21487L, d_month_seq#21540L], 
> functions=[])
>  +- ReusedExchange [d_year#21487L, d_month_seq#21540L], Exchange 
> hashpartitioning(d_year#20481L, d_month_seq#21456L, 200){code}
>  
> DataSourceV2:
> {code:java}
> == Physical Plan ==
>  TakeOrderedAndProject(limit=100, orderBy=d_month_seq#22325L ASC NULLS FIRST, 
> output=prev_year#22320L,year#22321L,d_month_seq#22325L)
>  +- *(9) Project d_year#22356L AS prev_year#22320L, d_year#21696L AS 
> year#22321L, d_month_seq#22325L
>  +- CartesianProduct
>  :- *(4) HashAggregate(keys=d_year#21696L, d_month_seq#22325L, functions=[])
>  : +- Exchange hashpartitioning(d_year#21696L, d_month_seq#22325L, 200)
>  : +- *(3) HashAggregate(keys=d_year#21696L, d_month_seq#22325L, functions=[])
>  : +- BroadcastNestedLoopJoin BuildRight, Cross
>  : :- *(1) P

[jira] [Resolved] (SPARK-32708) Query optimization fails to reuse exchange with DataSourceV2

2020-09-14 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-32708.

Resolution: Fixed

The issue is resolved in https://github.com/apache/spark/pull/29564

> Query optimization fails to reuse exchange with DataSourceV2
> 
>
> Key: SPARK-32708
> URL: https://issues.apache.org/jira/browse/SPARK-32708
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 2.4.7
>Reporter: Mingjia Liu
>Assignee: Mingjia Liu
>Priority: Major
>
> Repro query:
> {code:java}
> spark.conf.set("spark.sql.exchange.reuse","true")
> df = spark.read.format('parquet').load('gs://dataproc-kokoro-tests-us-central1/tpcds/1G/parquet/date_dim')
> # df = spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").option("table", 'tpcds_1G.date_dim').load()
>  
> df.createOrReplaceTempView('date_dim')
>  
> df = spark.sql(""" 
> WITH t1 AS (
>  SELECT 
>  d_year, d_month_seq
>  FROM (
>  SELECT t1.d_year , t2.d_month_seq 
>  FROM 
>  date_dim t1
>  cross join
>  date_dim t2
>  where t1.d_day_name = "Monday" and t1.d_fy_year > 2000
>  and t2.d_day_name = "Monday" and t2.d_fy_year > 2000
>  )
>  GROUP BY d_year, d_month_seq)
>  
>  SELECT
>  prev_yr.d_year AS prev_year, curr_yr.d_year AS year, curr_yr.d_month_seq
>  FROM t1 curr_yr cross join t1 prev_yr
>  WHERE curr_yr.d_year=2002 AND prev_yr.d_year=2002 
>  ORDER BY d_month_seq
>  LIMIT 100
>  
>  """)
> df.explain()
> #df.show()
> {code}
>  
> *The above query produces different plans with Parquet and DataSourceV2. Both 
> plans are correct, but the DataSourceV2 plan is less optimized:*
> *Sub-plan [5-7] is exactly the same as sub-plan [1-3] (an aggregate on the 
> broadcast-hash-joined dataset of two tables that are filtered and projected 
> the same way).*
> *Therefore, in the Parquet plan below, the exchange that happens after [1-3] is 
> reused to replace [5-6].*
> *However, the DataSourceV2 plan fails to do so.*
>  
> Parquet:
> {code:java}
> == Physical Plan ==
> TakeOrderedAndProject(limit=100, orderBy=[d_month_seq#21456L ASC NULLS 
> FIRST], output=[prev_year#21451L,year#21452L,d_month_seq#21456L])
> +- *(9) Project [d_year#21487L AS prev_year#21451L, d_year#20481L AS 
> year#21452L, d_month_seq#21456L]
>+- CartesianProduct
>   :- *(4) HashAggregate(keys=[d_year#20481L, d_month_seq#21456L], 
> functions=[])
>   :  +- Exchange hashpartitioning(d_year#20481L, d_month_seq#21456L, 200)
>   : +- *(3) HashAggregate(keys=[d_year#20481L, d_month_seq#21456L], 
> functions=[])
>   :+- BroadcastNestedLoopJoin BuildRight, Cross
>   :   :- *(1) Project [d_year#20481L]
>   :   :  +- *(1) Filter (isnotnull(d_year#20481L) && 
> isnotnull(d_day_name#20489)) && isnotnull(d_fy_year#20486L)) && 
> (d_day_name#20489 = Monday)) && (d_fy_year#20486L > 2000)) && (d_year#20481L 
> = 2002))
>   :   : +- *(1) FileScan parquet 
> [d_year#20481L,d_fy_year#20486L,d_day_name#20489] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[gs://dataproc-kokoro-tests-us-central1/tpcds/1G/parquet/date_dim],
>  PartitionFilters: [], PushedFilters: [IsNotNull(d_year), 
> IsNotNull(d_day_name), IsNotNull(d_fy_year), EqualTo(d_day_name,Monday), 
> Grea..., ReadSchema: struct
>   :   +- BroadcastExchange IdentityBroadcastMode
>   :  +- *(2) Project [d_month_seq#21456L]
>   : +- *(2) Filter (((isnotnull(d_day_name#21467) && 
> isnotnull(d_fy_year#21464L)) && (d_day_name#21467 = Monday)) && 
> (d_fy_year#21464L > 2000))
>   :+- *(2) FileScan parquet 
> [d_month_seq#21456L,d_fy_year#21464L,d_day_name#21467] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[gs://dataproc-kokoro-tests-us-central1/tpcds/1G/parquet/date_dim],
>  PartitionFilters: [], PushedFilters: [IsNotNull(d_day_name), 
> IsNotNull(d_fy_year), EqualTo(d_day_name,Monday), GreaterThan(d_fy_year,2..., 
> ReadSchema: struct
>   +- *(8) HashAggregate(keys=[d_year#21487L, d_month_seq#21540L], 
> functions=[])
>  +- ReusedExchange [d_year#21487L, d_month_seq#21540L], Exchange 
> hashpartitioning(d_year#20481L, d_month_seq#21456L, 200){code}
>  
> DataSourceV2:
> {code:java}
> == Physical Plan ==
>  TakeOrderedAndProject(limit=100, orderBy=d_month_seq#22325L ASC NULLS FIRST, 
> output=prev_year#22320L,year#22321L,d_month_seq#22325L)
>  +- *(9) Project d_year#22356L AS prev_year#22320L, d_year#21696L AS 
> year#22321L, d_month_seq#22325L
>  +- CartesianProduct
>  :- *(4) HashAggregate(keys=d_year#21696L, d_month_seq#22325L, functions=[])
>  : +- Exchange hashpartitioning(d_year#21696L, d_month_seq#22325L, 200)
>  : +- *(3) HashAggregate(keys=d_year#21696L, d_month_seq#22325L, functions=[])
>  :

[jira] [Reopened] (SPARK-32542) Add an optimizer rule to split an Expand into multiple Expands for aggregates

2020-09-14 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro reopened SPARK-32542:
--

> Add an optimizer rule to split an Expand into multiple Expands for aggregates
> -
>
> Key: SPARK-32542
> URL: https://issues.apache.org/jira/browse/SPARK-32542
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: karl wang
>Priority: Major
>
> Split one Expand into several smaller Expands, each containing the specified 
> number of projections.
> For instance, take this SQL: select a, b, c, d, count(1) from table1 group by 
> a, b, c, d with cube. It can expand the data to 2^4 times the original size.
> If we set spark.sql.optimizer.projections.size=4, the Expand will be split 
> into 2^4/4 smaller Expands. This reduces shuffle pressure and improves 
> performance for multidimensional analysis when the data is huge.
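
For illustration, a minimal sketch of the scenario described above, runnable in 
spark-shell. The config key is the one proposed in this issue (not an existing 
Spark setting), and a tiny temp view stands in for table1.

{code:scala}
// The config key below comes from this proposal; it is not an existing setting.
spark.conf.set("spark.sql.optimizer.projections.size", "4")

// A tiny table stands in for "table1".
import spark.implicits._
Seq((1, 2, 3, 4)).toDF("a", "b", "c", "d").createOrReplaceTempView("table1")

// GROUP BY a, b, c, d WITH CUBE produces 2^4 = 16 grouping sets, i.e. one
// Expand with 16 projections; the proposal splits it into 16 / 4 = 4 smaller
// Expand operators to reduce shuffle pressure.
spark.sql(
  """SELECT a, b, c, d, count(1)
    |FROM table1
    |GROUP BY a, b, c, d WITH CUBE""".stripMargin).explain()
{code}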



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32542) Add an optimizer rule to split an Expand into multiple Expands for aggregates

2020-09-14 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-32542.
--
Resolution: Won't Fix

> Add an optimizer rule to split an Expand into multiple Expands for aggregates
> -
>
> Key: SPARK-32542
> URL: https://issues.apache.org/jira/browse/SPARK-32542
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: karl wang
>Priority: Major
>
> Split one Expand into several smaller Expands, each containing the specified 
> number of projections.
> For instance, take this SQL: select a, b, c, d, count(1) from table1 group by 
> a, b, c, d with cube. It can expand the data to 2^4 times the original size.
> If we set spark.sql.optimizer.projections.size=4, the Expand will be split 
> into 2^4/4 smaller Expands. This reduces shuffle pressure and improves 
> performance for multidimensional analysis when the data is huge.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32875) TaskSchedulerImplSuite: For the pattern of submitTasks + resourceOffers + assert, extract the general method.

2020-09-14 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-32875:
---
Summary: TaskSchedulerImplSuite: For the pattern of submitTasks + 
resourceOffers + assert, extract the general method.  (was: 
TaskSchedulerImplSuite: For the pattern of submitTasks + resourceOffers + 
assert, extract the)

> TaskSchedulerImplSuite: For the pattern of submitTasks + resourceOffers + 
> assert, extract the general method.
> -
>
> Key: SPARK-32875
> URL: https://issues.apache.org/jira/browse/SPARK-32875
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32875) TaskSchedulerImplSuite: For the pattern of submitTasks + resourceOffers + assert, extract the

2020-09-14 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-32875:
--

 Summary: TaskSchedulerImplSuite: For the pattern of submitTasks + 
resourceOffers + assert, extract the
 Key: SPARK-32875
 URL: https://issues.apache.org/jira/browse/SPARK-32875
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: jiaan.geng






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32790) FINISHED state of application is not final

2020-09-14 Thread Rosie Bloxsom (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195288#comment-17195288
 ] 

Rosie Bloxsom commented on SPARK-32790:
---

Thanks [~dongjoon],

Re. docs:
 * +1 on the snippet you posted
 * Also, the descriptions of the FAILED and FINISHED states themselves imply 
that receiving one of those states means the application finished with that 
state (e.g. "The application finished with a successful status."), so I think 
those would also need clarifying.

I was not sure what documentation changes to suggest, because I have only been 
able to verify that this is what's happening in *client mode*, and the docs do 
not distinguish between client and cluster modes.

I don't have access to a cluster to test with, but I have heard from colleagues 
that error handling has not been a problem in cluster mode, which suggests this 
is only a problem in client mode? I'm not sure, as I don't fully understand the 
differences between the modes, where processes run, and how they handle errors. 
So I'm not sure whether the docs on this need changing completely, or just 
clarifying that the states work as intended in cluster mode, but that in client 
mode the FINISHED state is not necessarily final.

 

> FINISHED state of application is not final
> --
>
> Key: SPARK-32790
> URL: https://issues.apache.org/jira/browse/SPARK-32790
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.4
> Environment: Spark 2.4.4 (and probably every version since, from 
> looking at the code?)
> On a local machine.
>Reporter: Rosie Bloxsom
>Priority: Minor
>  Labels: application, spark-submit
>
> If you launch an application with SparkLauncher.startApplication, and pass a 
> listener to listen to the returned state, there are supposed to be two 
> possible "final" states:
>  * FINISHED, denoting success
>  * FAILED, denoting a failure
> Because they are final, if you receive a FINISHED signal you should be able 
> to proceed as if there was no error.
> Unfortunately, this code:
> https://github.com/apache/spark/blob/233c214a752771f5d8ca9fb2aea93cf1776a552d/launcher/src/main/java/org/apache/spark/launcher/ChildProcAppHandle.java#L128
> which I think is related to decisions from this previous issue: 
> https://github.com/apache/spark/pull/18877
> means that in case of an error, a FINISHED event is sent, followed shortly 
> thereafter by a FAILED event, and both of these events are "final".
> I'm not sure if there's a way to fix it so that only one event is sent - 
> ideally when the child process fails, we would only send FAILED, rather than 
> sending "FINISHED" first? If we can't change it, then at least we should 
> update the docs to explain what happens, and maybe change the definition of 
> isFinal?
> To reproduce, install spark 2.4.4 and run this scala code using one of the 
> spark example jars. It shows the transition between the states for a 
> trivially erroring spark application. The states received are:
> {noformat}
> Received event updating state to CONNECTED
> Received event updating state to RUNNING
> Received event updating state to FINISHED
> Received event updating state to FAILED
> {noformat}
> {code:scala}
> package foo
> import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}
> import org.scalatest.flatspec.AnyFlatSpecLike
> import org.scalatest.matchers.should.Matchers
> import scala.concurrent.duration._
> import scala.concurrent.{Await, Promise}
> class FinishedStateNotFinalSpec extends AnyFlatSpecLike with Matchers {
>   "it" should "enter FAILED state without entering into FINISHED state" in {
> val examplesJar = 
> "file:/C:/spark/spark-2.4.4-bin-hadoop2.7/spark-2.4.4-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.4.4.jar"
> val launcher = new SparkLauncher()
>   
> .setSparkHome("""C:\spark\spark-2.4.4-bin-hadoop2.7\spark-2.4.4-bin-hadoop2.7""")
>   .setAppResource(examplesJar)
>   .redirectError()
>   .redirectOutput(java.io.File.createTempFile("spark-error", "log"))
>   .setAppName("Test")
>   .setMaster("local[1]")
>   .setMainClass("org.apache.spark.examples.SparkPi")
>   .addAppArgs("This causes an error, because it should be a number not a 
> string")
> val sparkCompletionPromise = Promise[Unit]()
> launcher.startApplication(new SparkAppListener(sparkCompletionPromise))
> Await.result(sparkCompletionPromise.future, 10 millis)
> // check in the console output to see which states were entered
>   }
> }
> class SparkAppListener(promise: Promise[Unit]) extends 
> SparkAppHandle.Listener {
>   def stateChanged(handle: SparkAppHandle): Unit = {
> val appState = handle.getState
> println(s"Received event upda

[jira] [Updated] (SPARK-32790) FINISHED state of application is not final

2020-09-14 Thread Rosie Bloxsom (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rosie Bloxsom updated SPARK-32790:
--
Description: 
If you launch an application with SparkLauncher.startApplication, and pass a 
listener to listen to the returned state, there are supposed to be two possible 
"final" states:
 * FINISHED, denoting success
 * FAILED, denoting a failure

Because they are final, if you receive a FINISHED signal you should be able to 
proceed as if there was no error.

Unfortunately, this code:
 
[https://github.com/apache/spark/blob/233c214a752771f5d8ca9fb2aea93cf1776a552d/launcher/src/main/java/org/apache/spark/launcher/ChildProcAppHandle.java#L128]
 which I think is related to decisions from this previous issue: 
[https://github.com/apache/spark/pull/18877]
 means that in case of an error, a FINISHED event is sent, followed shortly 
thereafter by a FAILED event, and both of these events are "final".

I'm not sure if there's a way to fix it so that only one event is sent - 
ideally when the child process fails, we would only send FAILED, rather than 
sending "FINISHED" first? If we can't change it, then at least we should update 
the docs to explain what happens, and maybe change the definition of isFinal?

To reproduce, install spark 2.4.4 and run this scala code using one of the 
spark example jars. It shows the transition between the states for a trivially 
erroring spark application. The states received are:
{noformat}
Received event updating state to CONNECTED
Received event updating state to RUNNING
Received event updating state to FINISHED
Received event updating state to FAILED
{noformat}
{code:scala}
package foo

import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}
import org.scalatest.flatspec.AnyFlatSpecLike
import org.scalatest.matchers.should.Matchers
import scala.concurrent.duration._
import scala.concurrent.{Await, Promise}

class FinishedStateNotFinalSpec extends AnyFlatSpecLike with Matchers {
  "it" should "enter FAILED state without entering into FINISHED state" in {

val examplesJar = 
"file:/C:/spark/spark-2.4.4-bin-hadoop2.7/spark-2.4.4-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.4.4.jar"

val launcher = new SparkLauncher()
  
.setSparkHome("""C:\spark\spark-2.4.4-bin-hadoop2.7\spark-2.4.4-bin-hadoop2.7""")
  .setAppResource(examplesJar)
  .redirectError()
  .redirectOutput(java.io.File.createTempFile("spark-error", "log"))
  .setAppName("Test")
  .setMaster("local[1]")
  .setMainClass("org.apache.spark.examples.SparkPi")
  .addAppArgs("This causes an error, because it should be a number not a 
string")

val sparkCompletionPromise = Promise[Unit]()

launcher.startApplication(new SparkAppListener(sparkCompletionPromise))

Await.result(sparkCompletionPromise.future, 10 millis)

// check in the console output to see which states were entered
  }
}

class SparkAppListener(promise: Promise[Unit]) extends SparkAppHandle.Listener {

  def stateChanged(handle: SparkAppHandle): Unit = {
val appState = handle.getState
println(s"Received event updating state to $appState")
if (appState.isFinal && appState == SparkAppHandle.State.FINISHED) {
  // Without this sleep, my program continues as if the spark-submit was a 
success.
  // With this sleep, there is a chance for the correct "FAILED" state to 
be registered.
  // But we shouldn't need this sleep, we should receive the FAILED state 
as the only "final" state.
  Thread.sleep(1000)
  promise.success(Unit)
}
else if (appState.isFinal && appState == SparkAppHandle.State.FAILED) {
  promise.failure(new RuntimeException("Spark run failed"))
}
  }

  override def infoChanged(handle: SparkAppHandle): Unit = {}
}


{code}

  was:
If you launch an application with SparkLauncher.startApplication, and pass a 
listener to listener to the returned state, there are supposed to be two 
possible "final" states:
 * FINISHED, denoting success
 * FAILED, denoting a failure

Because they are final, if you receive a FINISHED signal you should be able to 
proceed as if there was no error.

Unfortunately, this code:
https://github.com/apache/spark/blob/233c214a752771f5d8ca9fb2aea93cf1776a552d/launcher/src/main/java/org/apache/spark/launcher/ChildProcAppHandle.java#L128
which I think is related to decisions from this previous issue: 
https://github.com/apache/spark/pull/18877
means that in case of an error, a FINISHED event is sent, followed shortly 
thereafter by a FAILED event, and both of these events are "final".

I'm not sure if there's a way to fix it so that only one event is sent - 
ideally when the child process fails, we would only send FAILED, rather than 
sending "FINISHED" first? If we can't change it, then at least we should update 
the docs to explain what happens, and maybe change the definition of isFinal?

To reproduce, install sp

[jira] [Resolved] (SPARK-32854) Avoid per-row join type check in stream-stream join

2020-09-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32854.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29724
[https://github.com/apache/spark/pull/29724]

> Avoid per-row join type check in stream-stream join
> ---
>
> Key: SPARK-32854
> URL: https://issues.apache.org/jira/browse/SPARK-32854
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Trivial
> Fix For: 3.1.0
>
>
> In stream-stream join (`StreamingSymmetricHashJoinExec`), we can avoid the 
> per-row check for the join type 
> ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L486-L492]
>  ) by creating the method once, before the loop that reads rows. A similar 
> optimization (i.e. creating an auxiliary method/variable per join type 
> beforehand) has already been done on the batch join side (`SortMergeJoinExec`, 
> `ShuffledHashJoinExec`). This is a minor change.
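
For illustration, a minimal sketch of the pattern described above: the 
join-type branch is evaluated once, before the row loop, instead of once per 
row. The trait and method below are illustrative stand-ins, not Spark's actual 
internals.

{code:scala}
// Illustrative stand-ins for a join type and per-row handler.
sealed trait JoinType
case object Inner extends JoinType
case object LeftOuter extends JoinType

def processRows(rows: Iterator[Int], joinType: JoinType): Iterator[Int] = {
  // Decide once, before the loop, instead of matching on joinType per row.
  val handleRow: Int => Int = joinType match {
    case Inner     => row => row       // placeholder per-row logic
    case LeftOuter => row => row + 1   // placeholder per-row logic
  }
  rows.map(handleRow)
}

// Usage: the branch above runs once, regardless of how many rows flow through.
processRows(Iterator(1, 2, 3), Inner).foreach(println)
{code}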



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32854) Avoid per-row join type check in stream-stream join

2020-09-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32854:
---

Assignee: Cheng Su

> Avoid per-row join type check in stream-stream join
> ---
>
> Key: SPARK-32854
> URL: https://issues.apache.org/jira/browse/SPARK-32854
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Trivial
>
> In stream-stream join (`StreamingSymmetricHashJoinExec`), we can avoid the 
> per-row check for the join type 
> ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L486-L492]
>  ) by creating the method once, before the loop that reads rows. A similar 
> optimization (i.e. creating an auxiliary method/variable per join type 
> beforehand) has already been done on the batch join side (`SortMergeJoinExec`, 
> `ShuffledHashJoinExec`). This is a minor change.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32844) Make `DataFrameReader.table` take the specified options for datasource v1

2020-09-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32844:
---

Assignee: Yuanjian Li

> Make `DataFrameReader.table` take the specified options for datasource v1
> -
>
> Key: SPARK-32844
> URL: https://issues.apache.org/jira/browse/SPARK-32844
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
>
> Following the work in SPARK-32592, we need to keep the same behavior in 
> datasource v1: the DataFrameReader.table API should take the specified options.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32844) Make `DataFrameReader.table` take the specified options for datasource v1

2020-09-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32844.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29712
[https://github.com/apache/spark/pull/29712]

> Make `DataFrameReader.table` take the specified options for datasource v1
> -
>
> Key: SPARK-32844
> URL: https://issues.apache.org/jira/browse/SPARK-32844
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.1.0
>
>
> Following the work in SPARK-32592, we need to keep the same behavior in 
> datasource v1: the DataFrameReader.table API should take the specified options.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32876) Change default fallback versions in HiveExternalCatalogVersionsSuite

2020-09-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32876:
-
Priority: Minor  (was: Major)

> Change default fallback versions in HiveExternalCatalogVersionsSuite
> 
>
> Key: SPARK-32876
> URL: https://issues.apache.org/jira/browse/SPARK-32876
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> We temporarily set the fallback versions in 
> https://issues.apache.org/jira/browse/SPARK-31716. Spark 2.3 is now EOL and 
> 2.4.7 has been released, so we should update the fallback versions to match.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32876) Change default fallback versions in HiveExternalCatalogVersionsSuite

2020-09-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32876:
-
Issue Type: Test  (was: Bug)

> Change default fallback versions in HiveExternalCatalogVersionsSuite
> 
>
> Key: SPARK-32876
> URL: https://issues.apache.org/jira/browse/SPARK-32876
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> We temporarily set the fallback versions in 
> https://issues.apache.org/jira/browse/SPARK-31716. Spark 2.3 is now EOL and 
> 2.4.7 has been released, so we should update the fallback versions to match.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32876) Change default fallback versions in HiveExternalCatalogVersionsSuite

2020-09-14 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-32876:


 Summary: Change default fallback versions in 
HiveExternalCatalogVersionsSuite
 Key: SPARK-32876
 URL: https://issues.apache.org/jira/browse/SPARK-32876
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 3.0.1, 2.4.7
Reporter: Hyukjin Kwon


We temporarily set the fallback versions in 
https://issues.apache.org/jira/browse/SPARK-31716. Spark 2.3 is now EOL and 
2.4.7 has been released.

We should update the fallback versions to match.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32877) Fix Hive UDF not support decimal type in complex type

2020-09-14 Thread ulysses you (Jira)
ulysses you created SPARK-32877:
---

 Summary: Fix Hive UDF not support decimal type in complex type
 Key: SPARK-32877
 URL: https://issues.apache.org/jira/browse/SPARK-32877
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0
Reporter: ulysses you


This PR aims to support Hive UDFs whose complex input type contains a decimal type.

Before this PR, the following code fails.
{code:java}
import org.apache.hadoop.hive.ql.exec.UDF

class ArraySumUDF extends UDF {
  import scala.collection.JavaConverters._
  def evaluate(values: java.util.List[java.lang.Double]): java.lang.Double = {
    var r = 0d
    for (v <- values.asScala) {
      r += v
    }
    r
  }
}

sql(s"CREATE FUNCTION testArraySum AS '${classOf[ArraySumUDF].getName}'")
sql("SELECT testArraySum(array(1, 1.1, 1.2))")
-- failed msg
Error in query: No handler for UDF/UDAF/UDTF 'ArraySumUDF': 
org.apache.hadoop.hive.ql.exec.NoMatchingMethodException: No matching method 
for class ArraySumUDF with (array). Possible choices: 
_FUNC_(array) ; line 1 pos 7
{code}
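
Until the fix is in, one workaround (an assumption on my part, not something 
stated in this issue) is to cast the array elements explicitly so the element 
type resolves to DOUBLE rather than a decimal:

{code:scala}
// With explicit casts the array element type is double, which matches the
// UDF's java.util.List[java.lang.Double] signature and avoids the decimal path.
sql("SELECT testArraySum(array(CAST(1 AS DOUBLE), CAST(1.1 AS DOUBLE), CAST(1.2 AS DOUBLE)))").show()
{code}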



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32876) Change default fallback versions in HiveExternalCatalogVersionsSuite

2020-09-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32876:
-
Affects Version/s: 3.1.0

> Change default fallback versions in HiveExternalCatalogVersionsSuite
> 
>
> Key: SPARK-32876
> URL: https://issues.apache.org/jira/browse/SPARK-32876
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> We temporarily set the fallback versions in 
> https://issues.apache.org/jira/browse/SPARK-31716. Spark 2.3 is now EOL and 
> 2.4.7 has been released, so the fallback versions should be updated to match.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32876) Change default fallback versions in HiveExternalCatalogVersionsSuite

2020-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32876:


Assignee: Apache Spark

> Change default fallback versions in HiveExternalCatalogVersionsSuite
> 
>
> Key: SPARK-32876
> URL: https://issues.apache.org/jira/browse/SPARK-32876
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> We temporarily set the fallback versions in 
> https://issues.apache.org/jira/browse/SPARK-31716. Spark 2.3 is now EOL and 
> 2.4.7 has been released, so the fallback versions should be updated to match.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32876) Change default fallback versions in HiveExternalCatalogVersionsSuite

2020-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195363#comment-17195363
 ] 

Apache Spark commented on SPARK-32876:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29748

> Change default fallback versions in HiveExternalCatalogVersionsSuite
> 
>
> Key: SPARK-32876
> URL: https://issues.apache.org/jira/browse/SPARK-32876
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> We temporarily set the fallback versions in 
> https://issues.apache.org/jira/browse/SPARK-31716. Spark 2.3 is now EOL and 
> 2.4.7 has been released, so the fallback versions should be updated to match.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32876) Change default fallback versions in HiveExternalCatalogVersionsSuite

2020-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32876:


Assignee: (was: Apache Spark)

> Change default fallback versions in HiveExternalCatalogVersionsSuite
> 
>
> Key: SPARK-32876
> URL: https://issues.apache.org/jira/browse/SPARK-32876
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> We temporarily set the fallback versions in 
> https://issues.apache.org/jira/browse/SPARK-31716. Spark 2.3 is now EOL and 
> 2.4.7 has been released, so the fallback versions should be updated to match.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32877) Fix Hive UDF not support decimal type in complex type

2020-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195364#comment-17195364
 ] 

Apache Spark commented on SPARK-32877:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/29749

> Fix Hive UDF not support decimal type in complex type
> -
>
> Key: SPARK-32877
> URL: https://issues.apache.org/jira/browse/SPARK-32877
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>
> This PR aims to support Hive UDFs whose complex-typed input contains a decimal 
> type.
> Before this change, the following code failed:
> {code:java}
> class ArraySumUDF extends UDF {
>  import scala.collection.JavaConverters._
>  def evaluate(values: java.util.List[java.lang.Double]): java.lang.Double = {
>  var r = 0d
>  for (v <- values.asScala) {
>  r += v
>  }
>  r
>  }
> }
> sql(s"CREATE FUNCTION testArraySum AS '${classOf[ArraySumUDF].getName}'")
> sql("SELECT testArraySum(array(1, 1.1, 1.2))")
> -- failed msg
> Error in query: No handler for UDF/UDAF/UDTF 'ArraySumUDF': 
> org.apache.hadoop.hive.ql.exec.NoMatchingMethodException: No matching method 
> for class ArraySumUDF with (array). Possible choices: 
> _FUNC_(array) ; line 1 pos 7
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32877) Fix Hive UDF not support decimal type in complex type

2020-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32877:


Assignee: (was: Apache Spark)

> Fix Hive UDF not support decimal type in complex type
> -
>
> Key: SPARK-32877
> URL: https://issues.apache.org/jira/browse/SPARK-32877
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>
> This PR aims to support Hive UDFs whose complex-typed input contains a decimal 
> type.
> Before this change, the following code failed:
> {code:java}
> class ArraySumUDF extends UDF {
>  import scala.collection.JavaConverters._
>  def evaluate(values: java.util.List[java.lang.Double]): java.lang.Double = {
>  var r = 0d
>  for (v <- values.asScala) {
>  r += v
>  }
>  r
>  }
> }
> sql(s"CREATE FUNCTION testArraySum AS '${classOf[ArraySumUDF].getName}'")
> sql("SELECT testArraySum(array(1, 1.1, 1.2))")
> -- failed msg
> Error in query: No handler for UDF/UDAF/UDTF 'ArraySumUDF': 
> org.apache.hadoop.hive.ql.exec.NoMatchingMethodException: No matching method 
> for class ArraySumUDF with (array). Possible choices: 
> _FUNC_(array) ; line 1 pos 7
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32877) Fix Hive UDF not support decimal type in complex type

2020-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32877:


Assignee: Apache Spark

> Fix Hive UDF not support decimal type in complex type
> -
>
> Key: SPARK-32877
> URL: https://issues.apache.org/jira/browse/SPARK-32877
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Assignee: Apache Spark
>Priority: Minor
>
> This PR aims to support Hive UDFs whose complex-typed input contains a decimal 
> type.
> Before this change, the following code failed:
> {code:java}
> class ArraySumUDF extends UDF {
>  import scala.collection.JavaConverters._
>  def evaluate(values: java.util.List[java.lang.Double]): java.lang.Double = {
>  var r = 0d
>  for (v <- values.asScala) {
>  r += v
>  }
>  r
>  }
> }
> sql(s"CREATE FUNCTION testArraySum AS '${classOf[ArraySumUDF].getName}'")
> sql("SELECT testArraySum(array(1, 1.1, 1.2))")
> -- failed msg
> Error in query: No handler for UDF/UDAF/UDTF 'ArraySumUDF': 
> org.apache.hadoop.hive.ql.exec.NoMatchingMethodException: No matching method 
> for class ArraySumUDF with (array). Possible choices: 
> _FUNC_(array) ; line 1 pos 7
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32877) Fix Hive UDF not support decimal type in complex type

2020-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195365#comment-17195365
 ] 

Apache Spark commented on SPARK-32877:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/29749

> Fix Hive UDF not support decimal type in complex type
> -
>
> Key: SPARK-32877
> URL: https://issues.apache.org/jira/browse/SPARK-32877
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>
> This PR aims to support Hive UDFs whose complex-typed input contains a decimal 
> type.
> Before this change, the following code failed:
> {code:java}
> class ArraySumUDF extends UDF {
>  import scala.collection.JavaConverters._
>  def evaluate(values: java.util.List[java.lang.Double]): java.lang.Double = {
>  var r = 0d
>  for (v <- values.asScala) {
>  r += v
>  }
>  r
>  }
> }
> sql(s"CREATE FUNCTION testArraySum AS '${classOf[ArraySumUDF].getName}'")
> sql("SELECT testArraySum(array(1, 1.1, 1.2))")
> -- failed msg
> Error in query: No handler for UDF/UDAF/UDTF 'ArraySumUDF': 
> org.apache.hadoop.hive.ql.exec.NoMatchingMethodException: No matching method 
> for class ArraySumUDF with (array). Possible choices: 
> _FUNC_(array) ; line 1 pos 7
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32868) Mark more aggregate functions as order irrelevant

2020-09-14 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-32868.
--
Fix Version/s: 3.1.0
 Assignee: Tanel Kiis
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/29740

> Mark more aggregate functions as order irrelevant
> -
>
> Key: SPARK-32868
> URL: https://issues.apache.org/jira/browse/SPARK-32868
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Tanel Kiis
>Assignee: Tanel Kiis
>Priority: Major
> Fix For: 3.1.0
>
>
> In the EliminateSorts rule there is a list of aggregate functions that are 
> order irrelevant. The list is incomplete; for example, HyperLogLogPlusPlus 
> and BitAggregate are also order irrelevant.
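
As a rough sketch of what marking them order irrelevant could look like (the 
method name and exact match arms below are assumptions about the rule's 
internals, not the merged change):

{code:scala}
import org.apache.spark.sql.catalyst.expressions.aggregate._

// Hypothetical predicate for an EliminateSorts-style rule: these aggregates
// return the same result regardless of input order, so a Sort directly
// beneath them can be removed.
def isOrderIrrelevantAgg(func: AggregateFunction): Boolean = func match {
  case _: Min | _: Max | _: Count => true
  case _: Sum | _: Average => true
  case _: HyperLogLogPlusPlus => true // proposed addition
  case _: BitAggregate => true // proposed addition
  case _ => false
}
{code}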



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32780) Fill since fields for all the expressions

2020-09-14 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-32780.
--
Resolution: Duplicate

> Fill since fields for all the expressions
> -
>
> Key: SPARK-32780
> URL: https://issues.apache.org/jira/browse/SPARK-32780
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>  Labels: starter
>
> Some since fields in ExpressionDescription are currently missing; it is worth 
> filling them in to make the documentation better:
> {code:java}
>   test("Since has a valid value") {
> val badExpressions = spark.sessionState.functionRegistry.listFunction()
>   .map(spark.sessionState.catalog.lookupFunctionInfo)
>   .filter(funcInfo => 
> !funcInfo.getSince.matches("[0-9]+\\.[0-9]+\\.[0-9]+"))
>   .map(_.getClassName)
>   .distinct
>   .sorted
> if (badExpressions.nonEmpty) {
>   fail(s"${badExpressions.length} expressions with invalid 'since':\n"
> + badExpressions.mkString("\n"))
> }
>   }
> [info] - Since has a valid value *** FAILED *** (16 milliseconds)
> [info]   67 expressions with invalid 'since':
> [info]   org.apache.spark.sql.catalyst.expressions.Abs
> [info]   org.apache.spark.sql.catalyst.expressions.Add
> [info]   org.apache.spark.sql.catalyst.expressions.And
> [info]   org.apache.spark.sql.catalyst.expressions.ArrayContains
> [info]   org.apache.spark.sql.catalyst.expressions.AssertTrue
> [info]   org.apache.spark.sql.catalyst.expressions.BitwiseAnd
> [info]   org.apache.spark.sql.catalyst.expressions.BitwiseNot
> [info]   org.apache.spark.sql.catalyst.expressions.BitwiseOr
> [info]   org.apache.spark.sql.catalyst.expressions.BitwiseXor
> [info]   org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection
> [info]   org.apache.spark.sql.catalyst.expressions.CaseWhen
> [info]   org.apache.spark.sql.catalyst.expressions.Cast
> [info]   org.apache.spark.sql.catalyst.expressions.Concat
> [info]   org.apache.spark.sql.catalyst.expressions.Crc32
> [info]   org.apache.spark.sql.catalyst.expressions.CreateArray
> [info]   org.apache.spark.sql.catalyst.expressions.CreateMap
> [info]   org.apache.spark.sql.catalyst.expressions.CreateNamedStruct
> [info]   org.apache.spark.sql.catalyst.expressions.CurrentDatabase
> [info]   org.apache.spark.sql.catalyst.expressions.Divide
> [info]   org.apache.spark.sql.catalyst.expressions.EqualNullSafe
> [info]   org.apache.spark.sql.catalyst.expressions.EqualTo
> [info]   org.apache.spark.sql.catalyst.expressions.Explode
> [info]   org.apache.spark.sql.catalyst.expressions.GetJsonObject
> [info]   org.apache.spark.sql.catalyst.expressions.GreaterThan
> [info]   org.apache.spark.sql.catalyst.expressions.GreaterThanOrEqual
> [info]   org.apache.spark.sql.catalyst.expressions.Greatest
> [info]   org.apache.spark.sql.catalyst.expressions.If
> [info]   org.apache.spark.sql.catalyst.expressions.In
> [info]   org.apache.spark.sql.catalyst.expressions.Inline
> [info]   org.apache.spark.sql.catalyst.expressions.InputFileBlockLength
> [info]   org.apache.spark.sql.catalyst.expressions.InputFileBlockStart
> [info]   org.apache.spark.sql.catalyst.expressions.InputFileName
> [info]   org.apache.spark.sql.catalyst.expressions.JsonTuple
> [info]   org.apache.spark.sql.catalyst.expressions.Least
> [info]   org.apache.spark.sql.catalyst.expressions.LessThan
> [info]   org.apache.spark.sql.catalyst.expressions.LessThanOrEqual
> [info]   org.apache.spark.sql.catalyst.expressions.MapKeys
> [info]   org.apache.spark.sql.catalyst.expressions.MapValues
> [info]   org.apache.spark.sql.catalyst.expressions.Md5
> [info]   org.apache.spark.sql.catalyst.expressions.MonotonicallyIncreasingID
> [info]   org.apache.spark.sql.catalyst.expressions.Multiply
> [info]   org.apache.spark.sql.catalyst.expressions.Murmur3Hash
> [info]   org.apache.spark.sql.catalyst.expressions.Not
> [info]   org.apache.spark.sql.catalyst.expressions.Or
> [info]   org.apache.spark.sql.catalyst.expressions.Overlay
> [info]   org.apache.spark.sql.catalyst.expressions.Pmod
> [info]   org.apache.spark.sql.catalyst.expressions.PosExplode
> [info]   org.apache.spark.sql.catalyst.expressions.Remainder
> [info]   org.apache.spark.sql.catalyst.expressions.Sha1
> [info]   org.apache.spark.sql.catalyst.expressions.Sha2
> [info]   org.apache.spark.sql.catalyst.expressions.Size
> [info]   org.apache.spark.sql.catalyst.expressions.SortArray
> [info]   org.apache.spark.sql.catalyst.expressions.SparkPartitionID
> [info]   org.apache.spark.sql.catalyst.expressions.Stack
> [info]   org.apache.spark.sql.catalyst.expressions.Subtract
> [info]   org.apache.spark.sql.catalyst.expressions.TimeWindow
> [info]   org.apache.spark.sql.catalyst.expressions.UnaryMinus
> [in

[jira] [Created] (SPARK-32878) Avoid scheduling TaskSetManager which has no pending tasks

2020-09-14 Thread wuyi (Jira)
wuyi created SPARK-32878:


 Summary: Avoid scheduling TaskSetManager which has no pending tasks
 Key: SPARK-32878
 URL: https://issues.apache.org/jira/browse/SPARK-32878
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: wuyi


Spark always tries to schedule a TaskSetManager even if it has no pending 
tasks. We can skip such TaskSetManagers to avoid unnecessary overhead.
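
A minimal, self-contained model of the idea (stub names only, not the actual 
TaskSchedulerImpl code):

{code:scala}
// Stub standing in for a real TaskSetManager; only the fields needed to
// illustrate the filtering are modeled.
final case class TaskSetManagerStub(name: String, pendingTaskCount: Int, isZombie: Boolean = false)

// Offer resources only to managers that still have runnable work.
def schedulable(taskSets: Seq[TaskSetManagerStub]): Seq[TaskSetManagerStub] =
  taskSets.filter(tsm => !tsm.isZombie && tsm.pendingTaskCount > 0)

val offers = schedulable(Seq(
  TaskSetManagerStub("stage 0.0", pendingTaskCount = 4),   // scheduled
  TaskSetManagerStub("stage 1.0", pendingTaskCount = 0)))  // skipped
{code}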



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32879) SessionStateBuilder should apply options set when creating the session state

2020-09-14 Thread Jira
Herman van Hövell created SPARK-32879:
-

 Summary: SessionStateBuilder should apply options set when 
creating the session state
 Key: SPARK-32879
 URL: https://issues.apache.org/jira/browse/SPARK-32879
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Herman van Hövell
Assignee: Herman van Hövell






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32037) Rename blacklisting feature to avoid language with racist connotation

2020-09-14 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195531#comment-17195531
 ] 

Thomas Graves commented on SPARK-32037:
---

It is a good point about "blocklist" being easy to typo (though I would hope that 
would be caught in reviews), but if you are looking at the amount of change it is 
only one character. Also, I don't really see how BlocklistTracker sounds any worse 
than BlacklistTracker; both might be a bit weird. HealthTracker might be better 
there, although it would help if we could give context to whose health is tracked, 
and in this case it is either a node or an executor, which is hard to capture in a 
single name. Like you pointed out, you then end up with TaskSetHealthTracker, which 
isn't really right because it tracks the health of the node/executor for that 
taskset, not the taskset itself.

If you look at the description of the config, "denied" seems a bit weird to me:

_If set to "true", prevent Spark from scheduling tasks on executors that have 
been blacklisted due to too many task failures. The blacklisting algorithm can 
be further controlled by the other "spark.blacklist" configuration options._

If we look at the options in the context of this sentence:

executors that have been denied due to too many task failures

executors that have been blocked due to too many task failures

executors that have been excluded due to too many task failures

The last two definitely make more sense in that context. You could certainly 
rewrite the sentence for "denied", but the other issue is that executors can be 
removed from the list, so denied/allowed or "removed from denied" doesn't make as 
much sense to me here. Block or exclude make more sense to me when an executor can 
go active again (blocked/unblocked or excluded/included).

Naming things is always a pain. Based on all the feedback, if no one has strong 
objections I will go with "blocklist". I'll start making the changes and should be 
able to tell in context whether anything doesn't make sense. Perhaps we can do a 
mix where BlacklistTracker is renamed HealthTracker but other things internally 
are referred to as blocklist or blocked.

> Rename blacklisting feature to avoid language with racist connotation
> -
>
> Key: SPARK-32037
> URL: https://issues.apache.org/jira/browse/SPARK-32037
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Erik Krogen
>Priority: Minor
>
> As per [discussion on the Spark dev 
> list|https://lists.apache.org/thread.html/rf6b2cdcba4d3875350517a2339619e5d54e12e66626a88553f9fe275%40%3Cdev.spark.apache.org%3E],
>  it will be beneficial to remove references to problematic language that can 
> alienate potential community members. One such reference is "blacklist". 
> While it seems to me that there is some valid debate as to whether this term 
> has racist origins, the cultural connotations are inescapable in today's 
> world.
> I've created a separate task, SPARK-32036, to remove references outside of 
> this feature. Given the large surface area of this feature and the 
> public-facing UI / configs / etc., more care will need to be taken here.
> I'd like to start by opening up debate on what the best replacement name 
> would be. Reject-/deny-/ignore-/block-list are common replacements for 
> "blacklist", but I'm not sure that any of them work well for this situation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32878) Avoid scheduling TaskSetManager which has no pending tasks

2020-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32878:


Assignee: (was: Apache Spark)

> Avoid scheduling TaskSetManager which has no pending tasks
> --
>
> Key: SPARK-32878
> URL: https://issues.apache.org/jira/browse/SPARK-32878
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Priority: Major
>
> Spark always tries to schedule a TaskSetManager even if it has no pending 
> tasks. We can skip them to avoid any potential overhead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32878) Avoid scheduling TaskSetManager which has no pending tasks

2020-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195532#comment-17195532
 ] 

Apache Spark commented on SPARK-32878:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/29750

> Avoid scheduling TaskSetManager which has no pending tasks
> --
>
> Key: SPARK-32878
> URL: https://issues.apache.org/jira/browse/SPARK-32878
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Priority: Major
>
> Spark always tries to schedule a TaskSetManager even if it has no pending 
> tasks. We can skip them to avoid any potential overhead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32878) Avoid scheduling TaskSetManager which has no pending tasks

2020-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32878:


Assignee: Apache Spark

> Avoid scheduling TaskSetManager which has no pending tasks
> --
>
> Key: SPARK-32878
> URL: https://issues.apache.org/jira/browse/SPARK-32878
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Assignee: Apache Spark
>Priority: Major
>
> Spark always tries to schedule a TaskSetManager even if it has no pending 
> tasks. We can skip them to avoid any potential overhead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32878) Avoid scheduling TaskSetManager which has no pending tasks

2020-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195533#comment-17195533
 ] 

Apache Spark commented on SPARK-32878:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/29750

> Avoid scheduling TaskSetManager which has no pending tasks
> --
>
> Key: SPARK-32878
> URL: https://issues.apache.org/jira/browse/SPARK-32878
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Priority: Major
>
> Spark always tries to schedule a TaskSetManager even if it has no pending 
> tasks. We can skip them to avoid any potential overhead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32037) Rename blacklisting feature to avoid language with racist connotation

2020-09-14 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195542#comment-17195542
 ] 

Erik Krogen commented on SPARK-32037:
-

Thanks for continuing to push this forward [~tgraves]! Let me know if I can be 
of assistance :)

> Rename blacklisting feature to avoid language with racist connotation
> -
>
> Key: SPARK-32037
> URL: https://issues.apache.org/jira/browse/SPARK-32037
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Erik Krogen
>Priority: Minor
>
> As per [discussion on the Spark dev 
> list|https://lists.apache.org/thread.html/rf6b2cdcba4d3875350517a2339619e5d54e12e66626a88553f9fe275%40%3Cdev.spark.apache.org%3E],
>  it will be beneficial to remove references to problematic language that can 
> alienate potential community members. One such reference is "blacklist". 
> While it seems to me that there is some valid debate as to whether this term 
> has racist origins, the cultural connotations are inescapable in today's 
> world.
> I've created a separate task, SPARK-32036, to remove references outside of 
> this feature. Given the large surface area of this feature and the 
> public-facing UI / configs / etc., more care will need to be taken here.
> I'd like to start by opening up debate on what the best replacement name 
> would be. Reject-/deny-/ignore-/block-list are common replacements for 
> "blacklist", but I'm not sure that any of them work well for this situation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32708) Query optimization fails to reuse exchange with DataSourceV2

2020-09-14 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-32708:
---
Affects Version/s: (was: 2.4.6)

> Query optimization fails to reuse exchange with DataSourceV2
> 
>
> Key: SPARK-32708
> URL: https://issues.apache.org/jira/browse/SPARK-32708
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7
>Reporter: Mingjia Liu
>Assignee: Mingjia Liu
>Priority: Major
>
> Repro query:
> {code:java}
> spark.conf.set("spark.sql.exchange.reuse","true")
> df = spark.read.format('parquet').load('gs://dataproc-kokoro-tests-us-central1/tpcds/1G/parquet/date_dim')
> # df = spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").option("table", 'tpcds_1G.date_dim').load()
>  
> df.createOrReplaceTempView('date_dim')
>  
> df = spark.sql(""" 
> WITH t1 AS (
>  SELECT 
>  d_year, d_month_seq
>  FROM (
>  SELECT t1.d_year , t2.d_month_seq 
>  FROM 
>  date_dim t1
>  cross join
>  date_dim t2
>  where t1.d_day_name = "Monday" and t1.d_fy_year > 2000
>  and t2.d_day_name = "Monday" and t2.d_fy_year > 2000
>  )
>  GROUP BY d_year, d_month_seq)
>  
>  SELECT
>  prev_yr.d_year AS prev_year, curr_yr.d_year AS year, curr_yr.d_month_seq
>  FROM t1 curr_yr cross join t1 prev_yr
>  WHERE curr_yr.d_year=2002 AND prev_yr.d_year=2002 
>  ORDER BY d_month_seq
>  LIMIT 100
>  
>  """)
> df.explain()
> #df.show()
> {code}
>  
> *The above query produces different plans with Parquet and DataSourceV2. Both 
> plans are correct, but the DataSourceV2 plan is less optimized:*
> *Sub-plan [5-7] is exactly the same as sub-plan [1-3] (an aggregate on the 
> broadcast-hash-joined dataset of two tables that are filtered and projected 
> the same way).*
> *Therefore, in the Parquet plan below, the exchange produced after [1-3] is 
> reused to replace [5-6]. The DataSourceV2 plan, however, fails to do so.*
>  
> Parquet:
> {code:java}
> == Physical Plan ==
> TakeOrderedAndProject(limit=100, orderBy=[d_month_seq#21456L ASC NULLS 
> FIRST], output=[prev_year#21451L,year#21452L,d_month_seq#21456L])
> +- *(9) Project [d_year#21487L AS prev_year#21451L, d_year#20481L AS 
> year#21452L, d_month_seq#21456L]
>+- CartesianProduct
>   :- *(4) HashAggregate(keys=[d_year#20481L, d_month_seq#21456L], 
> functions=[])
>   :  +- Exchange hashpartitioning(d_year#20481L, d_month_seq#21456L, 200)
>   : +- *(3) HashAggregate(keys=[d_year#20481L, d_month_seq#21456L], 
> functions=[])
>   :+- BroadcastNestedLoopJoin BuildRight, Cross
>   :   :- *(1) Project [d_year#20481L]
>   :   :  +- *(1) Filter (isnotnull(d_year#20481L) && 
> isnotnull(d_day_name#20489)) && isnotnull(d_fy_year#20486L)) && 
> (d_day_name#20489 = Monday)) && (d_fy_year#20486L > 2000)) && (d_year#20481L 
> = 2002))
>   :   : +- *(1) FileScan parquet 
> [d_year#20481L,d_fy_year#20486L,d_day_name#20489] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[gs://dataproc-kokoro-tests-us-central1/tpcds/1G/parquet/date_dim],
>  PartitionFilters: [], PushedFilters: [IsNotNull(d_year), 
> IsNotNull(d_day_name), IsNotNull(d_fy_year), EqualTo(d_day_name,Monday), 
> Grea..., ReadSchema: struct
>   :   +- BroadcastExchange IdentityBroadcastMode
>   :  +- *(2) Project [d_month_seq#21456L]
>   : +- *(2) Filter (((isnotnull(d_day_name#21467) && 
> isnotnull(d_fy_year#21464L)) && (d_day_name#21467 = Monday)) && 
> (d_fy_year#21464L > 2000))
>   :+- *(2) FileScan parquet 
> [d_month_seq#21456L,d_fy_year#21464L,d_day_name#21467] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[gs://dataproc-kokoro-tests-us-central1/tpcds/1G/parquet/date_dim],
>  PartitionFilters: [], PushedFilters: [IsNotNull(d_day_name), 
> IsNotNull(d_fy_year), EqualTo(d_day_name,Monday), GreaterThan(d_fy_year,2..., 
> ReadSchema: struct
>   +- *(8) HashAggregate(keys=[d_year#21487L, d_month_seq#21540L], 
> functions=[])
>  +- ReusedExchange [d_year#21487L, d_month_seq#21540L], Exchange 
> hashpartitioning(d_year#20481L, d_month_seq#21456L, 200){code}
>  
> DataSourceV2:
> {code:java}
> == Physical Plan ==
>  TakeOrderedAndProject(limit=100, orderBy=d_month_seq#22325L ASC NULLS FIRST, 
> output=prev_year#22320L,year#22321L,d_month_seq#22325L)
>  +- *(9) Project d_year#22356L AS prev_year#22320L, d_year#21696L AS 
> year#22321L, d_month_seq#22325L
>  +- CartesianProduct
>  :- *(4) HashAggregate(keys=d_year#21696L, d_month_seq#22325L, functions=[])
>  : +- Exchange hashpartitioning(d_year#21696L, d_month_seq#22325L, 200)
>  : +- *(3) HashAggregate(keys=d_year#21696L, d_month_seq#22325L, functions=[])
>  : +- BroadcastNestedLoopJoin BuildRight, Cross
>  : :- *(1) P

[jira] [Created] (SPARK-32880) Improve and refactor parallel listing in HadoopFSUtils

2020-09-14 Thread Chao Sun (Jira)
Chao Sun created SPARK-32880:


 Summary: Improve and refactor parallel listing in HadoopFSUtils
 Key: SPARK-32880
 URL: https://issues.apache.org/jira/browse/SPARK-32880
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: Chao Sun


We have a few follow-ups after refactoring the parallel listing feature into 
Spark Core:
 # Expose metrics as a callback (see the sketch after this list)
 # Enhance listing support for S3A/other cloud `FileSystem` impls
 # Other cleanups and refactorings on the code.
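
For follow-up 1, a metrics callback could be shaped roughly like the trait 
below; the trait and parameter names are assumptions, not an existing Spark API.

{code:scala}
// Hypothetical callback surface for the parallel listing in HadoopFSUtils:
// callers are told how many paths and files were listed and how long the
// listing took, without HadoopFSUtils depending on any metrics system.
trait ListingMetricsCallback {
  def onListingCompleted(pathsListed: Long, filesDiscovered: Long, wallTimeNanos: Long): Unit
}

// No-op default for callers that do not care about listing metrics.
object NoopListingMetrics extends ListingMetricsCallback {
  override def onListingCompleted(pathsListed: Long, filesDiscovered: Long, wallTimeNanos: Long): Unit = ()
}
{code}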



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32881) NoSuchElementException occurs during decommissioning

2020-09-14 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-32881:
-

 Summary: NoSuchElementException occurs during decommissioning
 Key: SPARK-32881
 URL: https://issues.apache.org/jira/browse/SPARK-32881
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: Dongjoon Hyun


`BlockManagerMasterEndpoint` seems to fail in `getReplicateInfoForRDDBlocks` 
due to `java.util.NoSuchElementException`. This happens during K8s integration 
testing, but the main code should handle `NoSuchElementException` gracefully 
instead of surfacing a raw error message.
{code}
  private def getReplicateInfoForRDDBlocks(blockManagerId: BlockManagerId): Seq[ReplicateBlock] = {
    val info = blockManagerInfo(blockManagerId)
    ...
  }
{code}
{code}
  20/09/14 18:56:54 INFO ExecutorPodsAllocator: Going to request 1 executors 
from Kubernetes.
  20/09/14 18:56:54 INFO BasicExecutorFeatureStep: Adding decommission script 
to lifecycle
  20/09/14 18:56:55 ERROR TaskSchedulerImpl: Lost executor 1 on 172.17.0.4: 
Executor decommission.
  20/09/14 18:56:55 INFO BlockManagerMaster: Removal of executor 1 requested
  20/09/14 18:56:55 INFO 
KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove 
non-existent executor 1
  20/09/14 18:56:55 INFO BlockManagerMasterEndpoint: Trying to remove executor 
1 from BlockManagerMaster.
  20/09/14 18:56:55 INFO BlockManagerMasterEndpoint: Removing block manager 
BlockManagerId(1, 172.17.0.4, 41235, None)
  20/09/14 18:56:55 INFO DAGScheduler: Executor lost: 1 (epoch 1)
  20/09/14 18:56:55 ERROR Inbox: Ignoring error
  java.util.NoSuchElementException
at scala.collection.concurrent.TrieMap.apply(TrieMap.scala:833)
at 
org.apache.spark.storage.BlockManagerMasterEndpoint.org$apache$spark$storage$BlockManagerMasterEndpoint$$getReplicateInfoForRDDBlocks(BlockManagerMasterEndpoint.scala:383)
at 
org.apache.spark.storage.BlockManagerMasterEndpoint$$anonfun$receiveAndReply$1.applyOrElse(BlockManagerMasterEndpoint.scala:171)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:103)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at 
org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
  20/09/14 18:56:55 INFO BlockManagerMasterEndpoint: Trying to remove executor 
1 from BlockManagerMaster.
  20/09/14 18:56:55 INFO BlockManagerMaster: Removed 1 successfully in 
removeExecutor
  20/09/14 18:56:55 INFO DAGScheduler: Shuffle files lost for executor: 1 
(epoch 1)
  20/09/14 18:56:58 INFO 
KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered executor 
NettyRpcEndpointRef(spark-client://Executor) (172.17.0.7:46674) with ID 4,  
ResourceProfileId 0
  20/09/14 18:56:58 INFO BlockManagerMasterEndpoint: Registering block manager 
172.17.0.7:40495 with 593.9 MiB RAM, BlockManagerId(4, 172.17.0.7, 40495, None)
  20/09/14 18:57:23 INFO SparkContext: Starting job: count at 
/opt/spark/tests/decommissioning.py:49
{code}
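
A graceful variant could look roughly like the sketch below (whether to return 
an empty sequence, log a warning, or both is a design choice for the actual fix, 
not something decided here):

{code:scala}
  // Hypothetical defensive version: look the block manager up with get() and
  // treat an already-removed executor as "nothing to replicate" instead of
  // letting TrieMap.apply throw NoSuchElementException.
  private def getReplicateInfoForRDDBlocks(blockManagerId: BlockManagerId): Seq[ReplicateBlock] = {
    blockManagerInfo.get(blockManagerId) match {
      case Some(info) =>
        // existing logic that builds ReplicateBlock entries from info goes here
        Seq.empty
      case None =>
        logWarning(s"Block manager $blockManagerId not found; skipping replication info.")
        Seq.empty
    }
  }
{code}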



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32881) NoSuchElementException occurs during decommissioning

2020-09-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32881:
--
Description: 
`BlockManagerMasterEndpoint` seems to fail at `getReplicateInfoForRDDBlocks` 
due to `java.util.NoSuchElementException`. This happens on K8s IT testing, but 
the main code seems to need a graceful handling of `NoSuchElementException` 
instead of showing a naive error message.
{code}
private def getReplicateInfoForRDDBlocks(blockManagerId: BlockManagerId): 
Seq[ReplicateBlock] = {
val info = blockManagerInfo(blockManagerId)
   ...
}
{code}
{code}
  20/09/14 18:56:54 INFO ExecutorPodsAllocator: Going to request 1 executors 
from Kubernetes.
  20/09/14 18:56:54 INFO BasicExecutorFeatureStep: Adding decommission script 
to lifecycle
  20/09/14 18:56:55 ERROR TaskSchedulerImpl: Lost executor 1 on 172.17.0.4: 
Executor decommission.
  20/09/14 18:56:55 INFO BlockManagerMaster: Removal of executor 1 requested
  20/09/14 18:56:55 INFO 
KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove 
non-existent executor 1
  20/09/14 18:56:55 INFO BlockManagerMasterEndpoint: Trying to remove executor 
1 from BlockManagerMaster.
  20/09/14 18:56:55 INFO BlockManagerMasterEndpoint: Removing block manager 
BlockManagerId(1, 172.17.0.4, 41235, None)
  20/09/14 18:56:55 INFO DAGScheduler: Executor lost: 1 (epoch 1)
  20/09/14 18:56:55 ERROR Inbox: Ignoring error
  java.util.NoSuchElementException
at scala.collection.concurrent.TrieMap.apply(TrieMap.scala:833)
at 
org.apache.spark.storage.BlockManagerMasterEndpoint.org$apache$spark$storage$BlockManagerMasterEndpoint$$getReplicateInfoForRDDBlocks(BlockManagerMasterEndpoint.scala:383)
at 
org.apache.spark.storage.BlockManagerMasterEndpoint$$anonfun$receiveAndReply$1.applyOrElse(BlockManagerMasterEndpoint.scala:171)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:103)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at 
org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
  20/09/14 18:56:55 INFO BlockManagerMasterEndpoint: Trying to remove executor 
1 from BlockManagerMaster.
  20/09/14 18:56:55 INFO BlockManagerMaster: Removed 1 successfully in 
removeExecutor
  20/09/14 18:56:55 INFO DAGScheduler: Shuffle files lost for executor: 1 
(epoch 1)
  20/09/14 18:56:58 INFO 
KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered executor 
NettyRpcEndpointRef(spark-client://Executor) (172.17.0.7:46674) with ID 4,  
ResourceProfileId 0
  20/09/14 18:56:58 INFO BlockManagerMasterEndpoint: Registering block manager 
172.17.0.7:40495 with 593.9 MiB RAM, BlockManagerId(4, 172.17.0.7, 40495, None)
  20/09/14 18:57:23 INFO SparkContext: Starting job: count at 
/opt/spark/tests/decommissioning.py:49
{code}

  was:
`BlockManagerMasterEndpoint` seems to fail at `getReplicateInfoForRDDBlocks` 
due to `java.util.NoSuchElementException`. This happens on K8s IT testing, but 
the main code seems to need a graceful handling of `NoSuchElementException` 
instead of showing a naive error message.
{code}
  private def getReplicateInfoForRDDBlocks(blockManagerId: BlockManagerId): 
Seq[ReplicateBlock] = {
val info = blockManagerInfo(blockManagerId)
   ...
}
{code}
{code}
  20/09/14 18:56:54 INFO ExecutorPodsAllocator: Going to request 1 executors 
from Kubernetes.
  20/09/14 18:56:54 INFO BasicExecutorFeatureStep: Adding decommission script 
to lifecycle
  20/09/14 18:56:55 ERROR TaskSchedulerImpl: Lost executor 1 on 172.17.0.4: 
Executor decommission.
  20/09/14 18:56:55 INFO BlockManagerMaster: Removal of executor 1 requested
  20/09/14 18:56:55 INFO 
KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove 
non-existent executor 1
  20/09/14 18:56:55 INFO BlockManagerMasterEndpoint: Trying to remove executor 
1 from BlockManagerMaster.
  20/09/14 18:56:55 INFO BlockManagerMasterEndpoint: Removing block manager 
BlockManagerId(1, 172.17.0.4, 41235, None)
  20/09/14 18:56:55 INFO DAGScheduler: Executor lost: 1 (epoch 1)
  20/09/14 18:56:55 ERROR Inbox: Ignoring error
  java.util.NoSuchElementException
at scala.collection.concurrent.TrieMap.apply(TrieMap.scala:833)
at 
org.apache.spark.storage.BlockManagerMasterEndpoint.org$apache$spark$storage$Blo

[jira] [Commented] (SPARK-32881) NoSuchElementException occurs during decommissioning

2020-09-14 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195700#comment-17195700
 ] 

Holden Karau commented on SPARK-32881:
--

Thanks for the catch, I'll take a look at this issue.

> NoSuchElementException occurs during decommissioning
> 
>
> Key: SPARK-32881
> URL: https://issues.apache.org/jira/browse/SPARK-32881
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> `BlockManagerMasterEndpoint` seems to fail at `getReplicateInfoForRDDBlocks` 
> due to `java.util.NoSuchElementException`. This happens on K8s IT testing, 
> but the main code seems to need a graceful handling of 
> `NoSuchElementException` instead of showing a naive error message.
> {code}
> private def getReplicateInfoForRDDBlocks(blockManagerId: BlockManagerId): 
> Seq[ReplicateBlock] = {
> val info = blockManagerInfo(blockManagerId)
>...
> }
> {code}
> {code}
>   20/09/14 18:56:54 INFO ExecutorPodsAllocator: Going to request 1 executors 
> from Kubernetes.
>   20/09/14 18:56:54 INFO BasicExecutorFeatureStep: Adding decommission script 
> to lifecycle
>   20/09/14 18:56:55 ERROR TaskSchedulerImpl: Lost executor 1 on 172.17.0.4: 
> Executor decommission.
>   20/09/14 18:56:55 INFO BlockManagerMaster: Removal of executor 1 requested
>   20/09/14 18:56:55 INFO 
> KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove 
> non-existent executor 1
>   20/09/14 18:56:55 INFO BlockManagerMasterEndpoint: Trying to remove 
> executor 1 from BlockManagerMaster.
>   20/09/14 18:56:55 INFO BlockManagerMasterEndpoint: Removing block manager 
> BlockManagerId(1, 172.17.0.4, 41235, None)
>   20/09/14 18:56:55 INFO DAGScheduler: Executor lost: 1 (epoch 1)
>   20/09/14 18:56:55 ERROR Inbox: Ignoring error
>   java.util.NoSuchElementException
>   at scala.collection.concurrent.TrieMap.apply(TrieMap.scala:833)
>   at 
> org.apache.spark.storage.BlockManagerMasterEndpoint.org$apache$spark$storage$BlockManagerMasterEndpoint$$getReplicateInfoForRDDBlocks(BlockManagerMasterEndpoint.scala:383)
>   at 
> org.apache.spark.storage.BlockManagerMasterEndpoint$$anonfun$receiveAndReply$1.applyOrElse(BlockManagerMasterEndpoint.scala:171)
>   at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:103)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>   at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
>   at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
>   20/09/14 18:56:55 INFO BlockManagerMasterEndpoint: Trying to remove 
> executor 1 from BlockManagerMaster.
>   20/09/14 18:56:55 INFO BlockManagerMaster: Removed 1 successfully in 
> removeExecutor
>   20/09/14 18:56:55 INFO DAGScheduler: Shuffle files lost for executor: 1 
> (epoch 1)
>   20/09/14 18:56:58 INFO 
> KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered 
> executor NettyRpcEndpointRef(spark-client://Executor) (172.17.0.7:46674) with 
> ID 4,  ResourceProfileId 0
>   20/09/14 18:56:58 INFO BlockManagerMasterEndpoint: Registering block 
> manager 172.17.0.7:40495 with 593.9 MiB RAM, BlockManagerId(4, 172.17.0.7, 
> 40495, None)
>   20/09/14 18:57:23 INFO SparkContext: Starting job: count at 
> /opt/spark/tests/decommissioning.py:49
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32876) Change default fallback versions in HiveExternalCatalogVersionsSuite

2020-09-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-32876:
-

Assignee: Hyukjin Kwon

> Change default fallback versions in HiveExternalCatalogVersionsSuite
> 
>
> Key: SPARK-32876
> URL: https://issues.apache.org/jira/browse/SPARK-32876
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> We temporarily set the fallback versions in 
> https://issues.apache.org/jira/browse/SPARK-31716. Spark 2.3 is now EOL and 
> 2.4.7 has been released, so the fallback versions should be updated to match.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32876) Change default fallback versions in HiveExternalCatalogVersionsSuite

2020-09-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32876.
---
Fix Version/s: 3.0.2
   3.1.0
   Resolution: Fixed

Issue resolved by pull request 29748
[https://github.com/apache/spark/pull/29748]

> Change default fallback versions in HiveExternalCatalogVersionsSuite
> 
>
> Key: SPARK-32876
> URL: https://issues.apache.org/jira/browse/SPARK-32876
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.1.0, 3.0.2
>
>
> We temporarily set the fallback versions in 
> https://issues.apache.org/jira/browse/SPARK-31716. Spark 2.3 is now EOL and 
> 2.4.7 has been released, so the fallback versions should be updated to match.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32872) BytesToBytesMap at MAX_CAPACITY exceeds growth threshold

2020-09-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32872.
---
Fix Version/s: 2.4.8
   3.0.2
   3.1.0
   Resolution: Fixed

Issue resolved by pull request 29744
[https://github.com/apache/spark/pull/29744]

> BytesToBytesMap at MAX_CAPACITY exceeds growth threshold
> 
>
> Key: SPARK-32872
> URL: https://issues.apache.org/jira/browse/SPARK-32872
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.7, 3.0.1
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>Priority: Major
> Fix For: 3.1.0, 3.0.2, 2.4.8
>
>
> When BytesToBytesMap is at {{MAX_CAPACITY}} and reaches the growth threshold, 
> {{numKeys >= growthThreshold}} is true but {{longArray.size() / 2 < 
> MAX_CAPACITY}} is false. This correctly prevents the map from growing, but 
> {{canGrowArray}} incorrectly remains true. Therefore the map keeps accepting 
> new keys and exceeds its growth threshold. If we attempt to spill the map in 
> this state, the UnsafeKVExternalSorter will not be able to reuse the long 
> array for sorting, causing grouping aggregations to fail with the following 
> error:
> {{2020-09-13 18:33:48,765 ERROR Executor - Exception in task 0.0 in stage 7.0 
> (TID 69)
> org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 12982025696 
> bytes of memory, got 0
>   at 
> org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:160)
>   at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:100)
>   at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:118)
>   at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:253)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithKeys_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithoutKey_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:733)
>   at 
> org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
>   at 
> org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:187)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
>   at org.apache.spark.scheduler.Task.run(Task.scala:117)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:660)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:663)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)}}
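
The shape of a fix, as I read the description above (a sketch only, not 
necessarily what the linked pull request does), is to clear {{canGrowArray}} once 
the map reaches {{MAX_CAPACITY}} so that no further keys are accepted past the 
growth threshold:

{code:scala}
// Sketch of the corrected growth bookkeeping. The names mirror the fields
// quoted in the report; the real implementation is Java code in BytesToBytesMap.
// Returns the new value of canGrowArray after a record has been appended.
def updateCanGrowAfterInsert(numKeys: Int, longArraySize: Long, maxCapacity: Long,
                             growthThreshold: Int, canGrowArray: Boolean): Boolean = {
  if (numKeys < growthThreshold) {
    canGrowArray      // below the threshold: nothing changes
  } else if (longArraySize / 2 < maxCapacity) {
    true              // still room: grow the array and keep accepting keys
  } else {
    false             // at MAX_CAPACITY: stop accepting new keys
  }
}
{code}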



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32882) Remove python2 installation in K8s python image

2020-09-14 Thread William Hyun (Jira)
William Hyun created SPARK-32882:


 Summary: Remove python2 installation in K8s python image
 Key: SPARK-32882
 URL: https://issues.apache.org/jira/browse/SPARK-32882
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.1.0
Reporter: William Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32882) Remove python2 installation in K8s python image

2020-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32882:


Assignee: (was: Apache Spark)

> Remove python2 installation in K8s python image
> ---
>
> Key: SPARK-32882
> URL: https://issues.apache.org/jira/browse/SPARK-32882
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: William Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32882) Remove python2 installation in K8s python image

2020-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195738#comment-17195738
 ] 

Apache Spark commented on SPARK-32882:
--

User 'williamhyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/29751

> Remove python2 installation in K8s python image
> ---
>
> Key: SPARK-32882
> URL: https://issues.apache.org/jira/browse/SPARK-32882
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: William Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32882) Remove python2 installation in K8s python image

2020-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32882:


Assignee: Apache Spark

> Remove python2 installation in K8s python image
> ---
>
> Key: SPARK-32882
> URL: https://issues.apache.org/jira/browse/SPARK-32882
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: William Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32879) SessionStateBuilder should apply options set when creating the session state

2020-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32879:


Assignee: Apache Spark  (was: Herman van Hövell)

> SessionStateBuilder should apply options set when creating the session state
> 
>
> Key: SPARK-32879
> URL: https://issues.apache.org/jira/browse/SPARK-32879
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Herman van Hövell
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32879) SessionStateBuilder should apply options set when creating the session state

2020-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32879:


Assignee: Herman van Hövell  (was: Apache Spark)

> SessionStateBuilder should apply options set when creating the session state
> 
>
> Key: SPARK-32879
> URL: https://issues.apache.org/jira/browse/SPARK-32879
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32879) SessionStateBuilder should apply options set when creating the session state

2020-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195740#comment-17195740
 ] 

Apache Spark commented on SPARK-32879:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/29752

> SessionStateBuilder should apply options set when creating the session state
> 
>
> Key: SPARK-32879
> URL: https://issues.apache.org/jira/browse/SPARK-32879
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32872) BytesToBytesMap at MAX_CAPACITY exceeds growth threshold

2020-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195746#comment-17195746
 ] 

Apache Spark commented on SPARK-32872:
--

User 'ankurdave' has created a pull request for this issue:
https://github.com/apache/spark/pull/29753

> BytesToBytesMap at MAX_CAPACITY exceeds growth threshold
> 
>
> Key: SPARK-32872
> URL: https://issues.apache.org/jira/browse/SPARK-32872
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.7, 3.0.1
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>Priority: Major
> Fix For: 2.4.8, 3.1.0, 3.0.2
>
>
> When BytesToBytesMap is at {{MAX_CAPACITY}} and reaches the growth threshold, 
> {{numKeys >= growthThreshold}} is true but {{longArray.size() / 2 < 
> MAX_CAPACITY}} is false. This correctly prevents the map from growing, but 
> {{canGrowArray}} incorrectly remains true. Therefore the map keeps accepting 
> new keys and exceeds its growth threshold. If we attempt to spill the map in 
> this state, the UnsafeKVExternalSorter will not be able to reuse the long 
> array for sorting, causing grouping aggregations to fail with the following 
> error:
> {{2020-09-13 18:33:48,765 ERROR Executor - Exception in task 0.0 in stage 7.0 
> (TID 69)
> org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 12982025696 
> bytes of memory, got 0
>   at 
> org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:160)
>   at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:100)
>   at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:118)
>   at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:253)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithKeys_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithoutKey_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:733)
>   at 
> org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
>   at 
> org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:187)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
>   at org.apache.spark.scheduler.Task.run(Task.scala:117)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:660)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:663)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)}}
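
For illustration only, a sketch of the guard implied by the description above (written in Scala here, while the real class is Java; the actual patch may differ): once the map is at {{MAX_CAPACITY}} and the growth threshold is reached, {{canGrowArray}} should be cleared so that no further keys are accepted and the map remains spillable.

{code}
// Illustrative sketch only; the names mirror the description above, but this is
// not the actual BytesToBytesMap (Java) code and the real fix may differ.
object GrowthCheckSketch {
  var canGrowArray = true

  def maybeGrow(numKeys: Long, growthThreshold: Long,
                longArraySize: Long, maxCapacity: Long): Unit = {
    if (numKeys >= growthThreshold) {
      if (longArraySize / 2 < maxCapacity) {
        // Normal case: grow and rehash the backing long array.
      } else {
        // Already at MAX_CAPACITY: the map cannot grow, so stop accepting new
        // keys instead of leaving canGrowArray true and exceeding the threshold.
        canGrowArray = false
      }
    }
  }
}
{code}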



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32872) BytesToBytesMap at MAX_CAPACITY exceeds growth threshold

2020-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195749#comment-17195749
 ] 

Apache Spark commented on SPARK-32872:
--

User 'ankurdave' has created a pull request for this issue:
https://github.com/apache/spark/pull/29753

> BytesToBytesMap at MAX_CAPACITY exceeds growth threshold
> 
>
> Key: SPARK-32872
> URL: https://issues.apache.org/jira/browse/SPARK-32872
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.7, 3.0.1
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>Priority: Major
> Fix For: 2.4.8, 3.1.0, 3.0.2
>
>
> When BytesToBytesMap is at {{MAX_CAPACITY}} and reaches the growth threshold, 
> {{numKeys >= growthThreshold}} is true but {{longArray.size() / 2 < 
> MAX_CAPACITY}} is false. This correctly prevents the map from growing, but 
> {{canGrowArray}} incorrectly remains true. Therefore the map keeps accepting 
> new keys and exceeds its growth threshold. If we attempt to spill the map in 
> this state, the UnsafeKVExternalSorter will not be able to reuse the long 
> array for sorting, causing grouping aggregations to fail with the following 
> error:
> {{2020-09-13 18:33:48,765 ERROR Executor - Exception in task 0.0 in stage 7.0 
> (TID 69)
> org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 12982025696 
> bytes of memory, got 0
>   at 
> org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:160)
>   at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:100)
>   at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:118)
>   at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:253)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithKeys_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithoutKey_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:733)
>   at 
> org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
>   at 
> org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:187)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
>   at org.apache.spark.scheduler.Task.run(Task.scala:117)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:660)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:663)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32882) Remove python2 installation in K8s python image

2020-09-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-32882:
-

Assignee: William Hyun

> Remove python2 installation in K8s python image
> ---
>
> Key: SPARK-32882
> URL: https://issues.apache.org/jira/browse/SPARK-32882
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: William Hyun
>Assignee: William Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32882) Remove python2 installation in K8s python image

2020-09-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32882.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29751
[https://github.com/apache/spark/pull/29751]

> Remove python2 installation in K8s python image
> ---
>
> Key: SPARK-32882
> URL: https://issues.apache.org/jira/browse/SPARK-32882
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: William Hyun
>Assignee: William Hyun
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32871) Append toMap to Map#filterKeys if the result of filter is concatenated with another Map for Scala 2.13

2020-09-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32871.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29742
[https://github.com/apache/spark/pull/29742]

> Append toMap to Map#filterKeys if the result of filter is concatenated with 
> another Map for Scala 2.13
> --
>
> Key: SPARK-32871
> URL: https://issues.apache.org/jira/browse/SPARK-32871
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.1.0
>
>
> As of Scala 2.13, Map#filterKeys returns a MapView, not the original Map type.
> This can cause a compile error.
> {code:java}
> /sql/DataFrameReader.scala:279: type mismatch;
> [error]  found   : Iterable[(String, String)]
> [error]  required: java.util.Map[String,String]
> [error] Error occurred in an application involving default arguments.
> [error]   val dsOptions = new 
> CaseInsensitiveStringMap(finalOptions.asJava) {code}
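
For illustration only, a minimal sketch of the workaround named in the title; the option maps below are made up and do not reflect Spark's actual options handling.

{code}
// Illustrative sketch only; the maps are hypothetical.
// On Scala 2.13, Map#filterKeys returns a MapView, so concatenating the result
// with another Map no longer yields a Map; calling .toMap first restores the type.
val base  = Map("a" -> "1", "b" -> "2")
val extra = Map("c" -> "3")

// Scala 2.12: Map ++ Map => Map
// Scala 2.13 without .toMap: MapView ++ Map => Iterable[(String, String)]
val merged: Map[String, String] = base.filterKeys(_ != "b").toMap ++ extra
{code}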



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32862) Left semi stream-stream join

2020-09-14 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195790#comment-17195790
 ] 

Hyukjin Kwon commented on SPARK-32862:
--

Nice. [~chengsu], it would have been better if you had linked related tickets like 
SPARK-32863 or filed an umbrella ticket.

> Left semi stream-stream join
> 
>
> Key: SPARK-32862
> URL: https://issues.apache.org/jira/browse/SPARK-32862
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Minor
>
> Current stream-stream join supports inner, left outer and right outer joins 
> ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L166]
>  ). Internally we see a lot of users implementing left semi stream-stream 
> joins themselves (outside of Spark Structured Streaming), e.g. "I want to get the ad 
> impression (left side of the join) that has a click (right side of the join), but I 
> don't care how many clicks per ad" (left semi semantics).
>  
> Left semi stream-stream join will work as follows:
> (1). For a left side input row, check if there's a match in the right side state 
> store.
>   (1.1). If there's a match, output the left side row.
>   (1.2). If there's no match, put the row in the left side state store (with the 
> "matched" field set to false in the state store).
> (2). For a right side input row, check if there's a match in the left side state 
> store. If there's a match, update the matching left side rows, setting the "matched" 
> field to true. Put the right side row in the right side state store.
> (3). For a left side row that needs to be evicted from the state store, output the 
> row if its "matched" field is true.
> (4). For a right side row that needs to be evicted from the state store, do nothing.
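
For illustration only, a rough sketch of the four steps above. The types and the "state store" here are simplified placeholders, not Spark's actual StateStore API.

{code}
// Illustrative sketch only; StoredRow and SideState are placeholders.
import scala.collection.mutable.ArrayBuffer

case class StoredRow(key: String, var matched: Boolean)

class SideState {
  private val store = ArrayBuffer[StoredRow]()
  def matches(key: String): List[StoredRow] = store.filter(_.key == key).toList
  def put(row: StoredRow): Unit = store += row
  def evict(): List[StoredRow] = { val out = store.toList; store.clear(); out }
}

// (1) Left input: emit on match, otherwise buffer as unmatched.
def onLeftRow(key: String, left: SideState, right: SideState): Option[String] =
  if (right.matches(key).nonEmpty) Some(key)
  else { left.put(StoredRow(key, matched = false)); None }

// (2) Right input: mark matching left rows, then buffer the right row.
def onRightRow(key: String, left: SideState, right: SideState): Unit = {
  left.matches(key).foreach(r => r.matched = true)
  right.put(StoredRow(key, matched = false))
}

// (3) Left eviction: emit rows that were matched only after being buffered.
def onLeftEviction(left: SideState): List[String] =
  left.evict().filter(_.matched).map(_.key)

// (4) Right eviction: emit nothing.
{code}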



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32834) from_avro is giving empty result

2020-09-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32834:
-
Description: 
I am trying to read a Kafka topic with Spark readStream but I am having a problem 
while applying the Avro schema.

 

Code:

{code}
df = spark\
  .readStream\
  .format("kafka")\
  .option("kafka.bootstrap.servers", "host:6667")\
  .option("subscribe", "utopic1")\
  .option("failOnDataLoss", "false")\
  .option("startingOffsets", "earliest")\
  .option("checkpointLocation", "/home/abc/wspace/spark_test/data/")\
  .load()
 
outputDF = df\
        .select(from_avro("value", jsonFormatSchema, 
options={"mode":"FASTFAIL"}).alias("user"))

outputDF.printSchema()

query = outputDF.writeStream.format("console").start()
time.sleep(10)
{code}

Input:

avro schema file: 
[user.avsc|https://github.com/apache/spark/raw/4ad9bfd53b84a6d2497668c73af6899bae14c187/examples/src/main/resources/user.avsc]

Kafka topic: \{'favorite_color': 'Red', 'name': 'Alyssa'}

Expected Output:

It should print values. 

Actual Output:

{code}
++
|user|
++
|[,]|
++
{code}

Additional information:
 - Searched the internet and found that another person faced the same issue: 
[https://stackoverflow.com/questions/59222774/spark-from-avro-function-returning-null-values]
 - I am able to print values to the console if I cast the value to String using the line below: 
df.selectExpr("CAST(value AS STRING)")

 

  was:
I am trying to read a Kafka topic with Spark readStream but getting problem 
while applying avro schema

 

Code:

{code}
df = spark\
  .readStream\
  .format("kafka")\
  .option("kafka.bootstrap.servers", "host:6667")\
  .option("subscribe", "utopic1")\
  .option("failOnDataLoss", "false")\
  .option("startingOffsets", "earliest")\
  .option("checkpointLocation", "/home/abc/wspace/spark_test/data/")\
  .load()
 
outputDF = df\
        .select(from_avro("value", jsonFormatSchema, 
options=\{"mode":"FASTFAIL"}).alias("user"))

outputDF.printSchema()

query = outputDF.writeStream.format("console").start()
time.sleep(10)
{code}

Input:

avro schema file: 
[user.avsc|https://github.com/apache/spark/raw/4ad9bfd53b84a6d2497668c73af6899bae14c187/examples/src/main/resources/user.avsc]

Kafka topic: \{'favorite_color': 'Red', 'name': 'Alyssa'}

Expected Output:

It should print values. 

Actual Output:

{code}
++
|user|
++
|[,]|
++
{code}

Additional information:
 - Searched in the internet and found that other peson faced same issue. 
[https://stackoverflow.com/questions/59222774/spark-from-avro-function-returning-null-values]
 - I am able to print values to console if I cast to String using below line 
df.selectExpr("CAST(value AS STRING)")

 


> from_avro is giving empty result
> 
>
> Key: SPARK-32834
> URL: https://issues.apache.org/jira/browse/SPARK-32834
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
> Environment: Ubuntu 18
> Spark 3.0
> Kafka 2.0.0
>Reporter: Chaitanya
>Priority: Major
>
> I am trying to read a Kafka topic with Spark readStream but I am having a problem 
> while applying the Avro schema.
>  
> Code:
> {code}
> df = spark\
>   .readStream\
>   .format("kafka")\
>   .option("kafka.bootstrap.servers", "host:6667")\
>   .option("subscribe", "utopic1")\
>   .option("failOnDataLoss", "false")\
>   .option("startingOffsets", "earliest")\
>   .option("checkpointLocation", "/home/abc/wspace/spark_test/data/")\
>   .load()
>  
> outputDF = df\
>         .select(from_avro("value", jsonFormatSchema, 
> options={"mode":"FASTFAIL"}).alias("user"))
> outputDF.printSchema()
> query = outputDF.writeStream.format("console").start()
> time.sleep(10)
> {code}
> Input:
> avro schema file: 
> [user.avsc|https://github.com/apache/spark/raw/4ad9bfd53b84a6d2497668c73af6899bae14c187/examples/src/main/resources/user.avsc]
> Kafka topic: \{'favorite_color': 'Red', 'name': 'Alyssa'}
> Expected Output:
> It should print values. 
> Actual Output:
> {code}
> ++
> |user|
> ++
> |[,]|
> ++
> {code}
> Additional information:
>  - Searched the internet and found that another person faced the same issue: 
> [https://stackoverflow.com/questions/59222774/spark-from-avro-function-returning-null-values]
>  - I am able to print values to the console if I cast the value to String using the line below: 
> df.selectExpr("CAST(value AS STRING)")
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32834) from_avro is giving empty result

2020-09-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32834:
-
Description: 
I am trying to read a Kafka topic with Spark readStream but I am having a problem 
while applying the Avro schema.

 

Code:

{code}
df = spark\
  .readStream\
  .format("kafka")\
  .option("kafka.bootstrap.servers", "host:6667")\
  .option("subscribe", "utopic1")\
  .option("failOnDataLoss", "false")\
  .option("startingOffsets", "earliest")\
  .option("checkpointLocation", "/home/abc/wspace/spark_test/data/")\
  .load()
 
outputDF = df\
        .select(from_avro("value", jsonFormatSchema, 
options=\{"mode":"FASTFAIL"}).alias("user"))

outputDF.printSchema()

query = outputDF.writeStream.format("console").start()
time.sleep(10)
{code}

Input:

avro schema file: 
[user.avsc|https://github.com/apache/spark/raw/4ad9bfd53b84a6d2497668c73af6899bae14c187/examples/src/main/resources/user.avsc]

Kafka topic: \{'favorite_color': 'Red', 'name': 'Alyssa'}

Expected Output:

It should print values. 

Actual Output:

{code}
++
|user|
++
|[,]|
++
{code}

Additional information:
 - Searched the internet and found that another person faced the same issue: 
[https://stackoverflow.com/questions/59222774/spark-from-avro-function-returning-null-values]
 - I am able to print values to the console if I cast the value to String using the line below: 
df.selectExpr("CAST(value AS STRING)")

 

  was:
I am trying to read a Kafka topic with Spark readStream but getting problem 
while applying avro schema

 

Code:

df = spark\

  .readStream\

  .format("kafka")\

  .option("kafka.bootstrap.servers", "host:6667")\

  .option("subscribe", "utopic1")\

  .option("failOnDataLoss", "false")\

  .option("startingOffsets", "earliest")\

  .option("checkpointLocation", "/home/abc/wspace/spark_test/data/")\

  .load()

 

outputDF = df\

        .select(from_avro("value", jsonFormatSchema, 
options=\{"mode":"FASTFAIL"}).alias("user"))

outputDF.printSchema()

 

query = outputDF.writeStream.format("console").start()

time.sleep(10)

Input:

avro schema file: 
[user.avsc|https://github.com/apache/spark/raw/4ad9bfd53b84a6d2497668c73af6899bae14c187/examples/src/main/resources/user.avsc]

Kafka topic: \{'favorite_color': 'Red', 'name': 'Alyssa'}

Expected Output:

It should print values. 

Actual Output:

++
|user|

++
|[,]|

++

Additional information:
 # Searched in the internet and found that other peson faced same issue. 
[https://stackoverflow.com/questions/59222774/spark-from-avro-function-returning-null-values]
 # I am able to print values to console if I cast to String using below line 
df.selectExpr("CAST(value AS STRING)")

 


> from_avro is giving empty result
> 
>
> Key: SPARK-32834
> URL: https://issues.apache.org/jira/browse/SPARK-32834
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
> Environment: Ubuntu 18
> Spark 3.0
> Kafka 2.0.0
>Reporter: Chaitanya
>Priority: Major
>
> I am trying to read a Kafka topic with Spark readStream but I am having a problem 
> while applying the Avro schema.
>  
> Code:
> {code}
> df = spark\
>   .readStream\
>   .format("kafka")\
>   .option("kafka.bootstrap.servers", "host:6667")\
>   .option("subscribe", "utopic1")\
>   .option("failOnDataLoss", "false")\
>   .option("startingOffsets", "earliest")\
>   .option("checkpointLocation", "/home/abc/wspace/spark_test/data/")\
>   .load()
>  
> outputDF = df\
>         .select(from_avro("value", jsonFormatSchema, 
> options=\{"mode":"FASTFAIL"}).alias("user"))
> outputDF.printSchema()
> query = outputDF.writeStream.format("console").start()
> time.sleep(10)
> {code}
> Input:
> avro schema file: 
> [user.avsc|https://github.com/apache/spark/raw/4ad9bfd53b84a6d2497668c73af6899bae14c187/examples/src/main/resources/user.avsc]
> Kafka topic: \{'favorite_color': 'Red', 'name': 'Alyssa'}
> Expected Output:
> It should print values. 
> Actual Output:
> {code}
> ++
> |user|
> ++
> |[,]|
> ++
> {code}
> Additional information:
>  - Searched the internet and found that another person faced the same issue: 
> [https://stackoverflow.com/questions/59222774/spark-from-avro-function-returning-null-values]
>  - I am able to print values to the console if I cast the value to String using the line below: 
> df.selectExpr("CAST(value AS STRING)")
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32821) cannot group by with window in sql statement for structured streaming with watermark

2020-09-14 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195792#comment-17195792
 ] 

Hyukjin Kwon commented on SPARK-32821:
--

[~johnny bai] can you provide a self-contained reproducer, BTW? It's easier to 
discuss with a reproducible example.

> cannot group by with window in sql statement for structured streaming with 
> watermark
> 
>
> Key: SPARK-32821
> URL: https://issues.apache.org/jira/browse/SPARK-32821
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Johnny Bai
>Priority: Major
>
> Currently, only the DSL style is supported, as below: 
> {code}
> import spark.implicits._
> val words = ... // streaming DataFrame of schema { timestamp: Timestamp, 
> word: String }
> // Group the data by window and word and compute the count of each group
> val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 
> minutes"),$"word").count()
> {code}
>  
> but group by with window is not supported in SQL style, as below:
> {code}
> select ts_field,count(\*) as cnt over window(ts_field, '1 minute', '1 
> minute') with watermark 1 minute from tableX group by ts_field
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32821) cannot group by with window in sql statement for structured streaming with watermark

2020-09-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32821:
-
Description: 
Currently, only the DSL style is supported, as below: 

{code}
import spark.implicits._
val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: 
String }
// Group the data by window and word and compute the count of each group
val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 
minutes"),$"word").count()
{code}
 
but group by with window is not supported in SQL style, as below:

{code}
select ts_field,count(\*) as cnt over window(ts_field, '1 minute', '1 minute') 
with watermark 1 minute from tableX group by ts_field
{code}
 

  was:
current only support dsl style as below: 

import spark.implicits._

val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: 
String }

// Group the data by window and word and compute the count of each group

val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 
minutes"),$"word").count()

 
but not support group by with window in sql style as below:

"select ts_field,count(\*) as cnt over window(ts_field, '1 minute', '1 minute') 
with watermark 1 minute from tableX group by ts_field"

 


> cannot group by with window in sql statement for structured streaming with 
> watermark
> 
>
> Key: SPARK-32821
> URL: https://issues.apache.org/jira/browse/SPARK-32821
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Johnny Bai
>Priority: Major
>
> Currently, only the DSL style is supported, as below: 
> {code}
> import spark.implicits._
> val words = ... // streaming DataFrame of schema { timestamp: Timestamp, 
> word: String }
> // Group the data by window and word and compute the count of each group
> val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 
> minutes"),$"word").count()
> {code}
>  
> but group by with window is not supported in SQL style, as below:
> {code}
> select ts_field,count(\*) as cnt over window(ts_field, '1 minute', '1 
> minute') with watermark 1 minute from tableX group by ts_field
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32715) Broadcast block pieces may memory leak

2020-09-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32715.
---
Fix Version/s: 2.4.8
   3.0.2
   3.1.0
   Resolution: Fixed

Issue resolved by pull request 29558
[https://github.com/apache/spark/pull/29558]

> Broadcast block pieces may memory leak
> --
>
> Key: SPARK-32715
> URL: https://issues.apache.org/jira/browse/SPARK-32715
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.6, 3.0.0, 3.1.0
>Reporter: Lantao Jin
>Assignee: Lantao Jin
>Priority: Major
> Fix For: 3.1.0, 3.0.2, 2.4.8
>
>
> We use Spark thrift-server as a long-running service. A bad query submitted a 
> heavy BroadcastNestedLoopJoin operation and drove the driver into full GC. We 
> killed the bad query, but the driver's memory usage stayed high and full GCs 
> remained very frequent. By investigating the GC dump and logs, we found that the 
> broadcast block pieces may leak memory.
> 2020-08-19T18:54:02.824-0700: [Full GC (Allocation Failure) 
> 2020-08-19T18:54:02.824-0700: [Class Histogram (before full gc):
> 116G->112G(170G), 184.9121920 secs]
> [Eden: 32.0M(7616.0M)->0.0B(8704.0M) Survivors: 1088.0M->0.0B Heap: 
> 116.4G(170.0G)->112.9G(170.0G)], [Metaspace: 177285K->177270K(182272K)]
> num #instances #bytes class name
> --
> 1: 676531691 72035438432 [B
> 2: 676502528 32472121344 org.apache.spark.sql.catalyst.expressions.UnsafeRow
> 3: 99551 12018117568 [Ljava.lang.Object;
> 4: 26570 4349629040 [I
> 5: 6 3264536688 [Lorg.apache.spark.sql.catalyst.InternalRow;
> 6: 1708819 256299456 [C
> 7: 2338 179615208 [J
> 8: 1703669 54517408 java.lang.String
> 9: 103860 34896960 org.apache.spark.status.TaskDataWrapper
> 10: 177396 25545024 java.net.URI
> ...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32715) Broadcast block pieces may memory leak

2020-09-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-32715:
-

Assignee: Lantao Jin

> Broadcast block pieces may memory leak
> --
>
> Key: SPARK-32715
> URL: https://issues.apache.org/jira/browse/SPARK-32715
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.6, 3.0.0, 3.1.0
>Reporter: Lantao Jin
>Assignee: Lantao Jin
>Priority: Major
>
> We use Spark thrift-server as a long-running service. A bad query submitted a 
> heavy BroadcastNestedLoopJoin operation and drove the driver into full GC. We 
> killed the bad query, but the driver's memory usage stayed high and full GCs 
> remained very frequent. By investigating the GC dump and logs, we found that the 
> broadcast block pieces may leak memory.
> 2020-08-19T18:54:02.824-0700: [Full GC (Allocation Failure) 
> 2020-08-19T18:54:02.824-0700: [Class Histogram (before full gc):
> 116G->112G(170G), 184.9121920 secs]
> [Eden: 32.0M(7616.0M)->0.0B(8704.0M) Survivors: 1088.0M->0.0B Heap: 
> 116.4G(170.0G)->112.9G(170.0G)], [Metaspace: 177285K->177270K(182272K)]
> num #instances #bytes class name
> --
> 1: 676531691 72035438432 [B
> 2: 676502528 32472121344 org.apache.spark.sql.catalyst.expressions.UnsafeRow
> 3: 99551 12018117568 [Ljava.lang.Object;
> 4: 26570 4349629040 [I
> 5: 6 3264536688 [Lorg.apache.spark.sql.catalyst.InternalRow;
> 6: 1708819 256299456 [C
> 7: 2338 179615208 [J
> 8: 1703669 54517408 java.lang.String
> 9: 103860 34896960 org.apache.spark.status.TaskDataWrapper
> 10: 177396 25545024 java.net.URI
> ...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32883) Stream-stream join improvement

2020-09-14 Thread Cheng Su (Jira)
Cheng Su created SPARK-32883:


 Summary: Stream-stream join improvement
 Key: SPARK-32883
 URL: https://issues.apache.org/jira/browse/SPARK-32883
 Project: Spark
  Issue Type: Umbrella
  Components: Structured Streaming
Affects Versions: 3.1.0
Reporter: Cheng Su


Creating this umbrella Jira to track overall progress for stream-stream join 
improvement. See each individual sub-task for details.
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32862) Left semi stream-stream join

2020-09-14 Thread Cheng Su (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Su updated SPARK-32862:
-
Parent: SPARK-32883
Issue Type: Sub-task  (was: New Feature)

> Left semi stream-stream join
> 
>
> Key: SPARK-32862
> URL: https://issues.apache.org/jira/browse/SPARK-32862
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Minor
>
> Current stream-stream join supports inner, left outer and right outer joins 
> ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L166]
>  ). Internally we see a lot of users implementing left semi stream-stream 
> joins themselves (outside of Spark Structured Streaming), e.g. "I want to get the ad 
> impression (left side of the join) that has a click (right side of the join), but I 
> don't care how many clicks per ad" (left semi semantics).
>  
> Left semi stream-stream join will work as follows:
> (1). For a left side input row, check if there's a match in the right side state 
> store.
>   (1.1). If there's a match, output the left side row.
>   (1.2). If there's no match, put the row in the left side state store (with the 
> "matched" field set to false in the state store).
> (2). For a right side input row, check if there's a match in the left side state 
> store. If there's a match, update the matching left side rows, setting the "matched" 
> field to true. Put the right side row in the right side state store.
> (3). For a left side row that needs to be evicted from the state store, output the 
> row if its "matched" field is true.
> (4). For a right side row that needs to be evicted from the state store, do nothing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32854) Avoid per-row join type check in stream-stream join

2020-09-14 Thread Cheng Su (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Su updated SPARK-32854:
-
Parent: SPARK-32883
Issue Type: Sub-task  (was: Improvement)

> Avoid per-row join type check in stream-stream join
> ---
>
> Key: SPARK-32854
> URL: https://issues.apache.org/jira/browse/SPARK-32854
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Trivial
> Fix For: 3.1.0
>
>
> In stream-stream join (`StreamingSymmetricHashJoinExec`), we can optimize away the 
> per-row join type check 
> ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L486-L492]
>  ) by creating the method before the row-reading loop. A similar optimization 
> (i.e. creating an auxiliary method/variable per join type beforehand) has already 
> been done in the batch join world (`SortMergeJoinExec`, 
> `ShuffledHashJoinExec`). This is a minor change.
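
For illustration only, a minimal sketch of the idea in generic Scala (not the actual StreamingSymmetricHashJoinExec code): resolve the join-type-specific handler once, before the loop, instead of re-checking the join type for every row.

{code}
// Illustrative sketch only; JoinType and the handlers are stand-ins, not Spark code.
sealed trait JoinType
case object Inner extends JoinType
case object LeftOuter extends JoinType

def processPartition(rows: Iterator[String], joinType: JoinType): Unit = {
  // Pick the handler once, outside the row loop...
  val handleRow: String => Unit = joinType match {
    case Inner     => row => println(s"inner-join handling of $row")
    case LeftOuter => row => println(s"left-outer handling of $row")
  }
  // ...so the loop body no longer matches on the join type per row.
  rows.foreach(handleRow)
}
{code}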



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32863) Full outer stream-stream join

2020-09-14 Thread Cheng Su (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Su updated SPARK-32863:
-
Parent: SPARK-32883
Issue Type: Sub-task  (was: New Feature)

> Full outer stream-stream join
> -
>
> Key: SPARK-32863
> URL: https://issues.apache.org/jira/browse/SPARK-32863
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Trivial
>
> Current stream-stream join supports inner, left outer and right outer joins 
> ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L166]
>  ). With the current design of stream-stream join (which marks in the state store 
> whether a row has been matched or not), it would be very easy to support full outer 
> join as well.
>  
> Full outer stream-stream join will work as follows:
> (1). For a left side input row, check if there's a match in the right side state 
> store. If there's a match, output all matched rows. Put the row in the left side 
> state store.
> (2). For a right side input row, check if there's a match in the left side state 
> store. If there's a match, output all matched rows and update the matching left 
> side rows, setting the "matched" field to true. Put the right side row in the 
> right side state store.
> (3). For a left side row that needs to be evicted from the state store, output the 
> row if its "matched" field is false.
> (4). For a right side row that needs to be evicted from the state store, output the 
> row if its "matched" field is false.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32862) Left semi stream-stream join

2020-09-14 Thread Cheng Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195802#comment-17195802
 ] 

Cheng Su commented on SPARK-32862:
--

Thanks [~hyukjin.kwon], I created an umbrella Jira - 
https://issues.apache.org/jira/browse/SPARK-32883 - to track all of these.

> Left semi stream-stream join
> 
>
> Key: SPARK-32862
> URL: https://issues.apache.org/jira/browse/SPARK-32862
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Minor
>
> Current stream-stream join supports inner, left outer and right outer joins 
> ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L166]
>  ). Internally we see a lot of users implementing left semi stream-stream 
> joins themselves (outside of Spark Structured Streaming), e.g. "I want to get the ad 
> impression (left side of the join) that has a click (right side of the join), but I 
> don't care how many clicks per ad" (left semi semantics).
>  
> Left semi stream-stream join will work as follows:
> (1). For a left side input row, check if there's a match in the right side state 
> store.
>   (1.1). If there's a match, output the left side row.
>   (1.2). If there's no match, put the row in the left side state store (with the 
> "matched" field set to false in the state store).
> (2). For a right side input row, check if there's a match in the left side state 
> store. If there's a match, update the matching left side rows, setting the "matched" 
> field to true. Put the right side row in the right side state store.
> (3). For a left side row that needs to be evicted from the state store, output the 
> row if its "matched" field is true.
> (4). For a right side row that needs to be evicted from the state store, do nothing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32454) Fix HistoryServerSuite in Scala 2.13

2020-09-14 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195847#comment-17195847
 ] 

Dongjoon Hyun commented on SPARK-32454:
---

Please close this if you check it~ This seems to be outdated. Thanks, 
[~sarutak].

> Fix HistoryServerSuite in Scala 2.13
> 
>
> Key: SPARK-32454
> URL: https://issues.apache.org/jira/browse/SPARK-32454
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> {code}
> - app environment *** FAILED ***
>   java.lang.AssertionError: assertion failed: Expected:
> ...
> - download all logs for app with multiple attempts *** FAILED ***
>   404 was not equal to 200 (HistoryServerSuite.scala:249)
> ...
> - response codes on bad paths *** FAILED ***
> ...
> - ui and api authorization checks *** FAILED ***
>   404 did not equal 200 Unexpected status code 404 for 
> http://localhost:18080/api/v1/applications/local-1430917381535/logs (user = 
> irashid) (HistoryServerSuite.scala:580)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32875) TaskSchedulerImplSuite: For the pattern of submitTasks + resourceOffers + assert, extract the general method.

2020-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32875:


Assignee: (was: Apache Spark)

> TaskSchedulerImplSuite: For the pattern of submitTasks + resourceOffers + 
> assert, extract the general method.
> -
>
> Key: SPARK-32875
> URL: https://issues.apache.org/jira/browse/SPARK-32875
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32875) TaskSchedulerImplSuite: For the pattern of submitTasks + resourceOffers + assert, extract the general method.

2020-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195857#comment-17195857
 ] 

Apache Spark commented on SPARK-32875:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/29754

> TaskSchedulerImplSuite: For the pattern of submitTasks + resourceOffers + 
> assert, extract the general method.
> -
>
> Key: SPARK-32875
> URL: https://issues.apache.org/jira/browse/SPARK-32875
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32875) TaskSchedulerImplSuite: For the pattern of submitTasks + resourceOffers + assert, extract the general method.

2020-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32875:


Assignee: Apache Spark

> TaskSchedulerImplSuite: For the pattern of submitTasks + resourceOffers + 
> assert, extract the general method.
> -
>
> Key: SPARK-32875
> URL: https://issues.apache.org/jira/browse/SPARK-32875
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32878) Avoid scheduling TaskSetManager which has no pending tasks

2020-09-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32878.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29750
[https://github.com/apache/spark/pull/29750]

> Avoid scheduling TaskSetManager which has no pending tasks
> --
>
> Key: SPARK-32878
> URL: https://issues.apache.org/jira/browse/SPARK-32878
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.1.0
>
>
> Spark always tries to schedule a TaskSetManager even if it has no pending 
> tasks. We can skip them to avoid any potential overhead.
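
For illustration only, a tiny sketch of the idea with hypothetical names (not Spark's actual scheduler API): filter out task set managers that have nothing pending before offering them resources.

{code}
// Illustrative sketch only; SketchTaskSetManager is a stand-in, not Spark's class.
case class SketchTaskSetManager(name: String, pendingTaskCount: Int)

// Offer resources only to managers that can actually launch a task.
def schedulableTaskSets(all: Seq[SketchTaskSetManager]): Seq[SketchTaskSetManager] =
  all.filter(_.pendingTaskCount > 0)
{code}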



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32878) Avoid scheduling TaskSetManager which has no pending tasks

2020-09-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-32878:
-

Assignee: wuyi

> Avoid scheduling TaskSetManager which has no pending tasks
> --
>
> Key: SPARK-32878
> URL: https://issues.apache.org/jira/browse/SPARK-32878
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>
> Spark always tries to schedule a TaskSetManager even if it has no pending 
> tasks. We can skip them to avoid any potential overhead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32884) Mark TPCDSQuery*Suite as ExtendedSQLTest

2020-09-14 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-32884:
-

 Summary: Mark TPCDSQuery*Suite as ExtendedSQLTest
 Key: SPARK-32884
 URL: https://issues.apache.org/jira/browse/SPARK-32884
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 3.1.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32884) Mark TPCDSQuery*Suite as ExtendedSQLTest

2020-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195864#comment-17195864
 ] 

Apache Spark commented on SPARK-32884:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/29755

> Mark TPCDSQuery*Suite as ExtendedSQLTest
> 
>
> Key: SPARK-32884
> URL: https://issues.apache.org/jira/browse/SPARK-32884
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32884) Mark TPCDSQuery*Suite as ExtendedSQLTest

2020-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32884:


Assignee: Apache Spark

> Mark TPCDSQuery*Suite as ExtendedSQLTest
> 
>
> Key: SPARK-32884
> URL: https://issues.apache.org/jira/browse/SPARK-32884
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32884) Mark TPCDSQuery*Suite as ExtendedSQLTest

2020-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32884:


Assignee: (was: Apache Spark)

> Mark TPCDSQuery*Suite as ExtendedSQLTest
> 
>
> Key: SPARK-32884
> URL: https://issues.apache.org/jira/browse/SPARK-32884
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32884) Mark TPCDSQuery*Suite as ExtendedSQLTest

2020-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195865#comment-17195865
 ] 

Apache Spark commented on SPARK-32884:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/29755

> Mark TPCDSQuery*Suite as ExtendedSQLTest
> 
>
> Key: SPARK-32884
> URL: https://issues.apache.org/jira/browse/SPARK-32884
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32885) Add DataStreamReader.table API

2020-09-14 Thread Yuanjian Li (Jira)
Yuanjian Li created SPARK-32885:
---

 Summary: Add DataStreamReader.table API
 Key: SPARK-32885
 URL: https://issues.apache.org/jira/browse/SPARK-32885
 Project: Spark
  Issue Type: New Feature
  Components: Structured Streaming
Affects Versions: 3.1.0
Reporter: Yuanjian Li


This ticket aims to add a new `table` API in DataStreamReader, which is similar 
to the table API in DataFrameReader. Users can directly use this API to get a 
Streaming DataFrame on a table. Below is a simple example:

Application 1 for initializing and starting the streaming job:
{code:java}
val path = "/home/yuanjian.li/runtime/to_be_deleted"
val tblName = "my_table"

// Write some data to `my_table`
spark.range(3).write.format("parquet").option("path", path).saveAsTable(tblName)

// Read the table as a streaming source, write result to destination directory
val table = spark.readStream.table(tblName)
table.writeStream.format("parquet").option("checkpointLocation", 
"/home/yuanjian.li/runtime/to_be_deleted_ck").start("/home/yuanjian.li/runtime/to_be_deleted_2")
{code}
Application 2 for appending new data:
{code:java}
// Append new data into the path
spark.range(5).write.format("parquet").option("path", 
"/home/yuanjian.li/runtime/to_be_deleted").mode("append").save(){code}
Check result:
{code:java}
// The destination directory should contain all written data
spark.read.parquet("/home/yuanjian.li/runtime/to_be_deleted_2").show()
{code}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32879) SessionStateBuilder should apply options set when creating the session state

2020-09-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32879.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29752
[https://github.com/apache/spark/pull/29752]

> SessionStateBuilder should apply options set when creating the session state
> 
>
> Key: SPARK-32879
> URL: https://issues.apache.org/jira/browse/SPARK-32879
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32885) Add DataStreamReader.table API

2020-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32885:


Assignee: Apache Spark

> Add DataStreamReader.table API
> --
>
> Key: SPARK-32885
> URL: https://issues.apache.org/jira/browse/SPARK-32885
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Yuanjian Li
>Assignee: Apache Spark
>Priority: Major
>
> This ticket aims to add a new `table` API in DataStreamReader, which is 
> similar to the table API in DataFrameReader. Users can directly use this API 
> to get a Streaming DataFrame on a table. Below is a simple example:
> Application 1 for initializing and starting the streaming job:
> {code:java}
> val path = "/home/yuanjian.li/runtime/to_be_deleted"
> val tblName = "my_table"
> // Write some data to `my_table`
> spark.range(3).write.format("parquet").option("path", 
> path).saveAsTable(tblName)
> // Read the table as a streaming source, write result to destination directory
> val table = spark.readStream.table(tblName)
> table.writeStream.format("parquet").option("checkpointLocation", 
> "/home/yuanjian.li/runtime/to_be_deleted_ck").start("/home/yuanjian.li/runtime/to_be_deleted_2")
> {code}
> Application 2 for appending new data:
> {code:java}
> // Append new data into the path
> spark.range(5).write.format("parquet").option("path", 
> "/home/yuanjian.li/runtime/to_be_deleted").mode("append").save(){code}
> Check result:
> {code:java}
> // The destination directory should contain all written data
> spark.read.parquet("/home/yuanjian.li/runtime/to_be_deleted_2").show()
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32885) Add DataStreamReader.table API

2020-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32885:


Assignee: (was: Apache Spark)

> Add DataStreamReader.table API
> --
>
> Key: SPARK-32885
> URL: https://issues.apache.org/jira/browse/SPARK-32885
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Yuanjian Li
>Priority: Major
>
> This ticket aims to add a new `table` API in DataStreamReader, which is 
> similar to the table API in DataFrameReader. Users can directly use this API 
> to get a Streaming DataFrame on a table. Below is a simple example:
> Application 1 for initializing and starting the streaming job:
> {code:java}
> val path = "/home/yuanjian.li/runtime/to_be_deleted"
> val tblName = "my_table"
> // Write some data to `my_table`
> spark.range(3).write.format("parquet").option("path", 
> path).saveAsTable(tblName)
> // Read the table as a streaming source, write result to destination directory
> val table = spark.readStream.table(tblName)
> table.writeStream.format("parquet").option("checkpointLocation", 
> "/home/yuanjian.li/runtime/to_be_deleted_ck").start("/home/yuanjian.li/runtime/to_be_deleted_2")
> {code}
> Application 2 for appending new data:
> {code:java}
> // Append new data into the path
> spark.range(5).write.format("parquet").option("path", 
> "/home/yuanjian.li/runtime/to_be_deleted").mode("append").save(){code}
> Check result:
> {code:java}
> // The destination directory should contain all written data
> spark.read.parquet("/home/yuanjian.li/runtime/to_be_deleted_2").show()
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32885) Add DataStreamReader.table API

2020-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195904#comment-17195904
 ] 

Apache Spark commented on SPARK-32885:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/29756

> Add DataStreamReader.table API
> --
>
> Key: SPARK-32885
> URL: https://issues.apache.org/jira/browse/SPARK-32885
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Yuanjian Li
>Priority: Major
>
> This ticket aims to add a new `table` API in DataStreamReader, similar to the 
> table API in DataFrameReader. Users can call this API directly to get a 
> streaming DataFrame from a table. Below is a simple example:
> Application 1 for initializing and starting the streaming job:
> {code:java}
> val path = "/home/yuanjian.li/runtime/to_be_deleted"
> val tblName = "my_table"
> // Write some data to `my_table`
> spark.range(3).write.format("parquet").option("path", path).saveAsTable(tblName)
> // Read the table as a streaming source, write result to destination directory
> val table = spark.readStream.table(tblName)
> table.writeStream
>   .format("parquet")
>   .option("checkpointLocation", "/home/yuanjian.li/runtime/to_be_deleted_ck")
>   .start("/home/yuanjian.li/runtime/to_be_deleted_2")
> {code}
> Application 2 for appending new data:
> {code:java}
> // Append new data into the path
> spark.range(5).write.format("parquet")
>   .option("path", "/home/yuanjian.li/runtime/to_be_deleted")
>   .mode("append")
>   .save()
> {code}
> Check result:
> {code:java}
> // The destination directory should contain all written data
> spark.read.parquet("/home/yuanjian.li/runtime/to_be_deleted_2").show()
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32454) Fix HistoryServerSuite in Scala 2.13

2020-09-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32454.
--
Resolution: Duplicate

> Fix HistoryServerSuite in Scala 2.13
> 
>
> Key: SPARK-32454
> URL: https://issues.apache.org/jira/browse/SPARK-32454
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> {code}
> - app environment *** FAILED ***
>   java.lang.AssertionError: assertion failed: Expected:
> ...
> - download all logs for app with multiple attempts *** FAILED ***
>   404 was not equal to 200 (HistoryServerSuite.scala:249)
> ...
> - response codes on bad paths *** FAILED ***
> ...
> - ui and api authorization checks *** FAILED ***
>   404 did not equal 200 Unexpected status code 404 for 
> http://localhost:18080/api/v1/applications/local-1430917381535/logs (user = 
> irashid) (HistoryServerSuite.scala:580)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32872) BytesToBytesMap at MAX_CAPACITY exceeds growth threshold

2020-09-14 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195907#comment-17195907
 ] 

Dongjoon Hyun commented on SPARK-32872:
---

Thank you for updating, [~ankurd]!

> BytesToBytesMap at MAX_CAPACITY exceeds growth threshold
> 
>
> Key: SPARK-32872
> URL: https://issues.apache.org/jira/browse/SPARK-32872
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.7, 3.0.1
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>Priority: Major
> Fix For: 2.4.8, 3.1.0, 3.0.2
>
>
> When BytesToBytesMap is at {{MAX_CAPACITY}} and reaches the growth threshold, 
> {{numKeys >= growthThreshold}} is true but {{longArray.size() / 2 < 
> MAX_CAPACITY}} is false. This correctly prevents the map from growing, but 
> {{canGrowArray}} incorrectly remains true. Therefore the map keeps accepting 
> new keys and exceeds its growth threshold. If we attempt to spill the map in 
> this state, the UnsafeKVExternalSorter will not be able to reuse the long 
> array for sorting, causing grouping aggregations to fail with the following 
> error:
> {code}
> 2020-09-13 18:33:48,765 ERROR Executor - Exception in task 0.0 in stage 7.0 (TID 69)
> org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 12982025696 bytes of memory, got 0
>   at org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:160)
>   at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:100)
>   at org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:118)
>   at org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:253)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doConsume_0$(Unknown Source)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithKeys_0$(Unknown Source)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithoutKey_0$(Unknown Source)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
>   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:733)
>   at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
>   at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:187)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
>   at org.apache.spark.scheduler.Task.run(Task.scala:117)
>   at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:660)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:663)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
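
To make the bookkeeping above concrete, here is a minimal, hypothetical Scala
sketch of the growth logic (the real class is the Java BytesToBytesMap and is
considerably more involved; `GrowthState`, `MaxCapacity` and the threshold
formula are illustrative assumptions, not the actual implementation):

{code:java}
// Hypothetical, simplified model of the growth bookkeeping described above.
final class GrowthState(initialLongArraySize: Long) {
  val MaxCapacity: Long = 1L << 29            // illustrative stand-in for MAX_CAPACITY
  var longArraySize: Long = initialLongArraySize
  var numKeys: Long = 0L
  var canGrowArray: Boolean = true

  // Illustrative threshold: grow once a fraction of the slots are used.
  def growthThreshold: Long = longArraySize / 4

  // Called after a key has been appended.
  def afterAppend(): Unit = {
    numKeys += 1
    if (numKeys >= growthThreshold) {
      if (longArraySize / 2 < MaxCapacity) {
        longArraySize *= 2                    // normal growth path
      } else {
        // The essence of the fix: growth is impossible, so the map must also
        // stop accepting new keys instead of exceeding its growth threshold.
        canGrowArray = false
      }
    }
  }

  // Append-side guard: reject new keys once the map can no longer grow.
  def canAcceptNewKey: Boolean = canGrowArray || numKeys < growthThreshold
}
{code}

The essential point is the else branch: once the long array can no longer
double, `canGrowArray` must be cleared so that a later spill can still reuse
the array for sorting.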



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32738) thread safe endpoints may hang due to fatal error

2020-09-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32738.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29580
[https://github.com/apache/spark/pull/29580]

> thread safe endpoints may hang due to fatal error
> -
>
> Key: SPARK-32738
> URL: https://issues.apache.org/jira/browse/SPARK-32738
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.4, 2.4.6, 3.0.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
>Priority: Major
> Fix For: 3.1.0
>
>
> Processing for a `ThreadSafeRpcEndpoint` is controlled by 'numActiveThreads' in 
> `Inbox`. If a fatal error happens during `Inbox.process`, 'numActiveThreads' is 
> not decremented, so other threads cannot process messages in that inbox and the 
> endpoint hangs.
> This problem is more serious in earlier Spark 2.x versions, where the driver, 
> executor and block manager endpoints are all thread-safe endpoints.
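
For illustration, a minimal Scala sketch of the pattern behind the fix (purely
hypothetical and far simpler than the real org.apache.spark.rpc.netty.Inbox;
`ToyInbox` and its members are made-up names):

{code:java}
// Hypothetical, simplified inbox; only the active-thread bookkeeping matters here.
final class ToyInbox {
  private val messages = new java.util.LinkedList[Runnable]()
  private var numActiveThreads = 0

  def post(message: Runnable): Unit = synchronized { messages.add(message) }

  def activeThreads: Int = synchronized { numActiveThreads }

  def process(): Unit = {
    synchronized { numActiveThreads += 1 }
    try {
      var message = synchronized { messages.poll() }
      while (message != null) {
        message.run() // a fatal error (e.g. OutOfMemoryError) may escape from here
        message = synchronized { messages.poll() }
      }
    } finally {
      // Decrementing in a finally block keeps the counter consistent even when a
      // fatal error escapes, so other threads can still process this inbox.
      synchronized { numActiveThreads -= 1 }
    }
  }
}
{code}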



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32738) thread safe endpoints may hang due to fatal error

2020-09-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32738:
---

Assignee: Zhenhua Wang

> thread safe endpoints may hang due to fatal error
> -
>
> Key: SPARK-32738
> URL: https://issues.apache.org/jira/browse/SPARK-32738
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.4, 2.4.6, 3.0.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
>Priority: Major
>
> Processing for a `ThreadSafeRpcEndpoint` is controlled by 'numActiveThreads' in 
> `Inbox`. If a fatal error happens during `Inbox.process`, 'numActiveThreads' is 
> not decremented, so other threads cannot process messages in that inbox and the 
> endpoint hangs.
> This problem is more serious in earlier Spark 2.x versions, where the driver, 
> executor and block manager endpoints are all thread-safe endpoints.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org