[jira] [Commented] (SPARK-32330) Preserve shuffled hash join build side partitioning

2020-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158906#comment-17158906
 ] 

Apache Spark commented on SPARK-32330:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/29130

> Preserve shuffled hash join build side partitioning
> ---
>
> Key: SPARK-32330
> URL: https://issues.apache.org/jira/browse/SPARK-32330
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Trivial
>
> Currently `ShuffledHashJoin.outputPartitioning` inherits from 
> `HashJoin.outputPartitioning`, which only preserves stream side partitioning:
> `HashJoin.scala`
> {code:java}
> override def outputPartitioning: Partitioning = 
> streamedPlan.outputPartitioning
> {code}
> This loses build side partitioning information and causes an extra shuffle if 
> there's another join or group-by after this join.
> Example:
>  
> {code:java}
> // code placeholder
> withSQLConf(
> SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50",
> SQLConf.SHUFFLE_PARTITIONS.key -> "2",
> SQLConf.PREFER_SORTMERGEJOIN.key -> "false") {
>   val df1 = spark.range(10).select($"id".as("k1"))
>   val df2 = spark.range(30).select($"id".as("k2"))
>   Seq("inner", "cross").foreach(joinType => {
> val plan = df1.join(df2, $"k1" === $"k2", joinType).groupBy($"k1").count()
>   .queryExecution.executedPlan
> assert(plan.collect { case _: ShuffledHashJoinExec => true }.size === 1)
> // No extra shuffle before aggregate
> assert(plan.collect { case _: ShuffleExchangeExec => true }.size === 2)
>   })
> }{code}
>  
> Current physical plan (having an extra shuffle on `k1` before aggregate)
>  
> {code:java}
> *(4) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, 
> count#235L])
> +- Exchange hashpartitioning(k1#220L, 2), true, [id=#117]
>+- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], 
> output=[k1#220L, count#239L])
>   +- *(3) Project [k1#220L]
>  +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft
> :- Exchange hashpartitioning(k1#220L, 2), true, [id=#109]
> :  +- *(1) Project [id#218L AS k1#220L]
> : +- *(1) Range (0, 10, step=1, splits=2)
> +- Exchange hashpartitioning(k2#224L, 2), true, [id=#111]
>+- *(2) Project [id#222L AS k2#224L]
>   +- *(2) Range (0, 30, step=1, splits=2){code}
>  
> Ideal physical plan (no shuffle on `k1` before aggregate)
> {code:java}
>  *(3) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, 
> count#235L])
> +- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], 
> output=[k1#220L, count#239L])
>+- *(3) Project [k1#220L]
>   +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft
>  :- Exchange hashpartitioning(k1#220L, 2), true, [id=#107]
>  :  +- *(1) Project [id#218L AS k1#220L]
>  : +- *(1) Range (0, 10, step=1, splits=2)
>  +- Exchange hashpartitioning(k2#224L, 2), true, [id=#109]
> +- *(2) Project [id#222L AS k2#224L]
>+- *(2) Range (0, 30, step=1, splits=2){code}
>  
> This can be fixed by overriding the `outputPartitioning` method in 
> `ShuffledHashJoinExec`, similar to `SortMergeJoinExec`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32330) Preserve shuffled hash join build side partitioning

2020-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32330:


Assignee: Apache Spark

> Preserve shuffled hash join build side partitioning
> ---
>
> Key: SPARK-32330
> URL: https://issues.apache.org/jira/browse/SPARK-32330
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Apache Spark
>Priority: Trivial
>
> Currently `ShuffledHashJoin.outputPartitioning` inherits from 
> `HashJoin.outputPartitioning`, which only preserves stream side partitioning:
> `HashJoin.scala`
> {code:java}
> override def outputPartitioning: Partitioning = 
> streamedPlan.outputPartitioning
> {code}
> This loses build side partitioning information and causes an extra shuffle if 
> there's another join or group-by after this join.
> Example:
>  
> {code:java}
> // code placeholder
> withSQLConf(
> SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50",
> SQLConf.SHUFFLE_PARTITIONS.key -> "2",
> SQLConf.PREFER_SORTMERGEJOIN.key -> "false") {
>   val df1 = spark.range(10).select($"id".as("k1"))
>   val df2 = spark.range(30).select($"id".as("k2"))
>   Seq("inner", "cross").foreach(joinType => {
> val plan = df1.join(df2, $"k1" === $"k2", joinType).groupBy($"k1").count()
>   .queryExecution.executedPlan
> assert(plan.collect { case _: ShuffledHashJoinExec => true }.size === 1)
> // No extra shuffle before aggregate
> assert(plan.collect { case _: ShuffleExchangeExec => true }.size === 2)
>   })
> }{code}
>  
> Current physical plan (having an extra shuffle on `k1` before aggregate)
>  
> {code:java}
> *(4) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, 
> count#235L])
> +- Exchange hashpartitioning(k1#220L, 2), true, [id=#117]
>+- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], 
> output=[k1#220L, count#239L])
>   +- *(3) Project [k1#220L]
>  +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft
> :- Exchange hashpartitioning(k1#220L, 2), true, [id=#109]
> :  +- *(1) Project [id#218L AS k1#220L]
> : +- *(1) Range (0, 10, step=1, splits=2)
> +- Exchange hashpartitioning(k2#224L, 2), true, [id=#111]
>+- *(2) Project [id#222L AS k2#224L]
>   +- *(2) Range (0, 30, step=1, splits=2){code}
>  
> Ideal physical plan (no shuffle on `k1` before aggregate)
> {code:java}
>  *(3) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, 
> count#235L])
> +- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], 
> output=[k1#220L, count#239L])
>+- *(3) Project [k1#220L]
>   +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft
>  :- Exchange hashpartitioning(k1#220L, 2), true, [id=#107]
>  :  +- *(1) Project [id#218L AS k1#220L]
>  : +- *(1) Range (0, 10, step=1, splits=2)
>  +- Exchange hashpartitioning(k2#224L, 2), true, [id=#109]
> +- *(2) Project [id#222L AS k2#224L]
>+- *(2) Range (0, 30, step=1, splits=2){code}
>  
> This can be fixed by overriding the `outputPartitioning` method in 
> `ShuffledHashJoinExec`, similar to `SortMergeJoinExec`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32330) Preserve shuffled hash join build side partitioning

2020-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32330:


Assignee: (was: Apache Spark)

> Preserve shuffled hash join build side partitioning
> ---
>
> Key: SPARK-32330
> URL: https://issues.apache.org/jira/browse/SPARK-32330
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Trivial
>
> Currently `ShuffledHashJoin.outputPartitioning` inherits from 
> `HashJoin.outputPartitioning`, which only preserves stream side partitioning:
> `HashJoin.scala`
> {code:java}
> override def outputPartitioning: Partitioning = 
> streamedPlan.outputPartitioning
> {code}
> This loses build side partitioning information and causes an extra shuffle if 
> there's another join or group-by after this join.
> Example:
>  
> {code:java}
> // code placeholder
> withSQLConf(
> SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50",
> SQLConf.SHUFFLE_PARTITIONS.key -> "2",
> SQLConf.PREFER_SORTMERGEJOIN.key -> "false") {
>   val df1 = spark.range(10).select($"id".as("k1"))
>   val df2 = spark.range(30).select($"id".as("k2"))
>   Seq("inner", "cross").foreach(joinType => {
> val plan = df1.join(df2, $"k1" === $"k2", joinType).groupBy($"k1").count()
>   .queryExecution.executedPlan
> assert(plan.collect { case _: ShuffledHashJoinExec => true }.size === 1)
> // No extra shuffle before aggregate
> assert(plan.collect { case _: ShuffleExchangeExec => true }.size === 2)
>   })
> }{code}
>  
> Current physical plan (having an extra shuffle on `k1` before aggregate)
>  
> {code:java}
> *(4) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, 
> count#235L])
> +- Exchange hashpartitioning(k1#220L, 2), true, [id=#117]
>+- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], 
> output=[k1#220L, count#239L])
>   +- *(3) Project [k1#220L]
>  +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft
> :- Exchange hashpartitioning(k1#220L, 2), true, [id=#109]
> :  +- *(1) Project [id#218L AS k1#220L]
> : +- *(1) Range (0, 10, step=1, splits=2)
> +- Exchange hashpartitioning(k2#224L, 2), true, [id=#111]
>+- *(2) Project [id#222L AS k2#224L]
>   +- *(2) Range (0, 30, step=1, splits=2){code}
>  
> Ideal physical plan (no shuffle on `k1` before aggregate)
> {code:java}
>  *(3) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, 
> count#235L])
> +- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], 
> output=[k1#220L, count#239L])
>+- *(3) Project [k1#220L]
>   +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft
>  :- Exchange hashpartitioning(k1#220L, 2), true, [id=#107]
>  :  +- *(1) Project [id#218L AS k1#220L]
>  : +- *(1) Range (0, 10, step=1, splits=2)
>  +- Exchange hashpartitioning(k2#224L, 2), true, [id=#109]
> +- *(2) Project [id#222L AS k2#224L]
>+- *(2) Range (0, 30, step=1, splits=2){code}
>  
> This can be fixed by overriding the `outputPartitioning` method in 
> `ShuffledHashJoinExec`, similar to `SortMergeJoinExec`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32330) Preserve shuffled hash join build side partitioning

2020-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158905#comment-17158905
 ] 

Apache Spark commented on SPARK-32330:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/29130

> Preserve shuffled hash join build side partitioning
> ---
>
> Key: SPARK-32330
> URL: https://issues.apache.org/jira/browse/SPARK-32330
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Trivial
>
> Currently `ShuffledHashJoin.outputPartitioning` inherits from 
> `HashJoin.outputPartitioning`, which only preserves stream side partitioning:
> `HashJoin.scala`
> {code:java}
> override def outputPartitioning: Partitioning = 
> streamedPlan.outputPartitioning
> {code}
> This loses build side partitioning information and causes an extra shuffle if 
> there's another join or group-by after this join.
> Example:
>  
> {code:java}
> // code placeholder
> withSQLConf(
> SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50",
> SQLConf.SHUFFLE_PARTITIONS.key -> "2",
> SQLConf.PREFER_SORTMERGEJOIN.key -> "false") {
>   val df1 = spark.range(10).select($"id".as("k1"))
>   val df2 = spark.range(30).select($"id".as("k2"))
>   Seq("inner", "cross").foreach(joinType => {
> val plan = df1.join(df2, $"k1" === $"k2", joinType).groupBy($"k1").count()
>   .queryExecution.executedPlan
> assert(plan.collect { case _: ShuffledHashJoinExec => true }.size === 1)
> // No extra shuffle before aggregate
> assert(plan.collect { case _: ShuffleExchangeExec => true }.size === 2)
>   })
> }{code}
>  
> Current physical plan (having an extra shuffle on `k1` before aggregate)
>  
> {code:java}
> *(4) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, 
> count#235L])
> +- Exchange hashpartitioning(k1#220L, 2), true, [id=#117]
>+- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], 
> output=[k1#220L, count#239L])
>   +- *(3) Project [k1#220L]
>  +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft
> :- Exchange hashpartitioning(k1#220L, 2), true, [id=#109]
> :  +- *(1) Project [id#218L AS k1#220L]
> : +- *(1) Range (0, 10, step=1, splits=2)
> +- Exchange hashpartitioning(k2#224L, 2), true, [id=#111]
>+- *(2) Project [id#222L AS k2#224L]
>   +- *(2) Range (0, 30, step=1, splits=2){code}
>  
> Ideal physical plan (no shuffle on `k1` before aggregate)
> {code:java}
>  *(3) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, 
> count#235L])
> +- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], 
> output=[k1#220L, count#239L])
>+- *(3) Project [k1#220L]
>   +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft
>  :- Exchange hashpartitioning(k1#220L, 2), true, [id=#107]
>  :  +- *(1) Project [id#218L AS k1#220L]
>  : +- *(1) Range (0, 10, step=1, splits=2)
>  +- Exchange hashpartitioning(k2#224L, 2), true, [id=#109]
> +- *(2) Project [id#222L AS k2#224L]
>+- *(2) Range (0, 30, step=1, splits=2){code}
>  
> This can be fixed by overriding the `outputPartitioning` method in 
> `ShuffledHashJoinExec`, similar to `SortMergeJoinExec`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32325) JSON predicate pushdown for nested fields

2020-07-15 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158902#comment-17158902
 ] 

Maxim Gekk commented on SPARK-32325:


The JIRA ticket was opened while addressing [~dongjoon]'s comments in the PR 
https://github.com/apache/spark/pull/27366, but that PR has not been merged yet.

> JSON predicate pushdown for nested fields
> -
>
> Key: SPARK-32325
> URL: https://issues.apache.org/jira/browse/SPARK-32325
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> SPARK-30648 is meant to support filter pushdown to the JSON datasource, but it 
> supports only filters that refer to top-level fields. This ticket aims to 
> support nested fields as well. See the needed changes: 
> https://github.com/apache/spark/pull/27366#discussion_r443340603
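> For illustration only (the file path, schema, and values below are invented, and the 
> config name is the pushdown flag proposed in SPARK-30648), the gap looks roughly like this:
> {code:java}
> import org.apache.spark.sql.functions.col
> 
> // Illustration only: path and columns are made up.
> spark.conf.set("spark.sql.json.filterPushdown.enabled", "true")
> val people = spark.read.json("/tmp/people.json")
> 
> // Top-level field: this predicate is eligible for pushdown into the JSON parser.
> people.filter(col("age") > 21).count()
> 
> // Nested field: this predicate cannot be pushed down yet, which is the gap this ticket targets.
> people.filter(col("address.city") === "Seoul").count()
> {code}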



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32330) Preserve shuffled hash join build side partitioning

2020-07-15 Thread Cheng Su (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Su updated SPARK-32330:
-
Description: 
Currently `ShuffledHashJoin.outputPartitioning` inherits from 
`HashJoin.outputPartitioning`, which only preserves stream side partitioning:

`HashJoin.scala`
{code:java}
override def outputPartitioning: Partitioning = streamedPlan.outputPartitioning
{code}
This loses build side partitioning information and causes an extra shuffle if 
there's another join or group-by after this join.

Example:

 
{code:java}
// code placeholder
withSQLConf(
SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50",
SQLConf.SHUFFLE_PARTITIONS.key -> "2",
SQLConf.PREFER_SORTMERGEJOIN.key -> "false") {
  val df1 = spark.range(10).select($"id".as("k1"))
  val df2 = spark.range(30).select($"id".as("k2"))
  Seq("inner", "cross").foreach(joinType => {
val plan = df1.join(df2, $"k1" === $"k2", joinType).groupBy($"k1").count()
  .queryExecution.executedPlan
assert(plan.collect { case _: ShuffledHashJoinExec => true }.size === 1)
// No extra shuffle before aggregate
assert(plan.collect { case _: ShuffleExchangeExec => true }.size === 2)
  })
}{code}
 

Current physical plan (having an extra shuffle on `k1` before aggregate)

 
{code:java}
*(4) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, 
count#235L])
+- Exchange hashpartitioning(k1#220L, 2), true, [id=#117]
   +- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], 
output=[k1#220L, count#239L])
  +- *(3) Project [k1#220L]
 +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft
:- Exchange hashpartitioning(k1#220L, 2), true, [id=#109]
:  +- *(1) Project [id#218L AS k1#220L]
: +- *(1) Range (0, 10, step=1, splits=2)
+- Exchange hashpartitioning(k2#224L, 2), true, [id=#111]
   +- *(2) Project [id#222L AS k2#224L]
  +- *(2) Range (0, 30, step=1, splits=2){code}
 

Ideal physical plan (no shuffle on `k1` before aggregate)
{code:java}
 *(3) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, 
count#235L])
+- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], 
output=[k1#220L, count#239L])
   +- *(3) Project [k1#220L]
  +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft
 :- Exchange hashpartitioning(k1#220L, 2), true, [id=#107]
 :  +- *(1) Project [id#218L AS k1#220L]
 : +- *(1) Range (0, 10, step=1, splits=2)
 +- Exchange hashpartitioning(k2#224L, 2), true, [id=#109]
+- *(2) Project [id#222L AS k2#224L]
   +- *(2) Range (0, 30, step=1, splits=2){code}
 

This can be fixed by overriding the `outputPartitioning` method in 
`ShuffledHashJoinExec`, similar to `SortMergeJoinExec`.

  was:
Currently `ShuffledHashJoin.outputPartitioning` inherits from 
`HashJoin.outputPartitioning`, which only preserves stream side partitioning:

`HashJoin.scala`
{code:java}
override def outputPartitioning: Partitioning = streamedPlan.outputPartitioning
{code}
This loses build side partitioning information and causes an extra shuffle if 
there's another join or group-by after this join.

Example:

 
{code:java}
// code placeholder
withSQLConf(
SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50",
SQLConf.SHUFFLE_PARTITIONS.key -> "2",
SQLConf.PREFER_SORTMERGEJOIN.key -> "false") {
  val df1 = spark.range(10).select($"id".as("k1"))
  val df2 = spark.range(30).select($"id".as("k2"))
  Seq("inner", "cross").foreach(joinType => {
val plan = df1.join(df2, $"k1" === $"k2", joinType).groupBy($"k1").count()
  .queryExecution.executedPlan
assert(plan.collect { case _: ShuffledHashJoinExec => true }.size === 1)
// No extra shuffle before aggregate
assert(plan.collect { case _: ShuffleExchangeExec => true }.size === 2)
  })
}{code}
 

Current physical plan (having an extra shuffle on `k1` before aggregate)

 
{code:java}
*(4) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, 
count#235L])
+- Exchange hashpartitioning(k1#220L, 2), true, [id=#117]
   +- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], 
output=[k1#220L, count#239L])
  +- *(3) Project [k1#220L]
 +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft
:- Exchange hashpartitioning(k1#220L, 2), true, [id=#109]
:  +- *(1) Project [id#218L AS k1#220L]
: +- *(1) Range (0, 10, step=1, splits=2)
+- Exchange hashpartitioning(k2#224L, 2), true, [id=#111]
   +- *(2) Project [id#222L AS k2#224L]
  +- *(2) Range (0, 30, step=1, splits=2){code}
 

Ideal physical plan (no shuffle on `k1` before aggregate)
{code:java}
 *(3) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, 
count#235L])
+- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)],

[jira] [Created] (SPARK-32330) Preserve shuffled hash join build side partitioning

2020-07-15 Thread Cheng Su (Jira)
Cheng Su created SPARK-32330:


 Summary: Preserve shuffled hash join build side partitioning
 Key: SPARK-32330
 URL: https://issues.apache.org/jira/browse/SPARK-32330
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Cheng Su


Currently `ShuffledHashJoin.outputPartitioning` inherits from 
`HashJoin.outputPartitioning`, which only preserves stream side partitioning:

`HashJoin.scala`
{code:java}
override def outputPartitioning: Partitioning = streamedPlan.outputPartitioning
{code}
This loses build side partitioning information and causes an extra shuffle if 
there's another join or group-by after this join.

Example:

 
{code:java}
// code placeholder
withSQLConf(
SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50",
SQLConf.SHUFFLE_PARTITIONS.key -> "2",
SQLConf.PREFER_SORTMERGEJOIN.key -> "false") {
  val df1 = spark.range(10).select($"id".as("k1"))
  val df2 = spark.range(30).select($"id".as("k2"))
  Seq("inner", "cross").foreach(joinType => {
val plan = df1.join(df2, $"k1" === $"k2", joinType).groupBy($"k1").count()
  .queryExecution.executedPlan
assert(plan.collect { case _: ShuffledHashJoinExec => true }.size === 1)
// No extra shuffle before aggregate
assert(plan.collect { case _: ShuffleExchangeExec => true }.size === 2)
  })
}{code}
 

Current physical plan (having an extra shuffle on `k1` before aggregate)

 
{code:java}
*(4) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, 
count#235L])
+- Exchange hashpartitioning(k1#220L, 2), true, [id=#117]
   +- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], 
output=[k1#220L, count#239L])
  +- *(3) Project [k1#220L]
 +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft
:- Exchange hashpartitioning(k1#220L, 2), true, [id=#109]
:  +- *(1) Project [id#218L AS k1#220L]
: +- *(1) Range (0, 10, step=1, splits=2)
+- Exchange hashpartitioning(k2#224L, 2), true, [id=#111]
   +- *(2) Project [id#222L AS k2#224L]
  +- *(2) Range (0, 30, step=1, splits=2){code}
 

Ideal physical plan (no shuffle on `k1` before aggregate)
{code:java}
 *(3) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, 
count#235L])
+- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], 
output=[k1#220L, count#239L])
   +- *(3) Project [k1#220L]
  +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft
 :- Exchange hashpartitioning(k1#220L, 2), true, [id=#107]
 :  +- *(1) Project [id#218L AS k1#220L]
 : +- *(1) Range (0, 10, step=1, splits=2)
 +- Exchange hashpartitioning(k2#224L, 2), true, [id=#109]
+- *(2) Project [id#222L AS k2#224L]
   +- *(2) Range (0, 30, step=1, splits=2){code}
 

This can be fixed by overriding the `outputPartitioning` method in 
`ShuffledHashJoinExec`, similar to `SortMergeJoinExec`.
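
For illustration, a minimal sketch of what such an override could look like, modeled on 
the `SortMergeJoinExec` behavior just mentioned (a sketch of the idea only, not the 
actual fix):
{code:java}
// Sketch only, inside ShuffledHashJoinExec: for inner-like joins both inputs keep their
// hash partitioning, so both can be exposed; outer joins can only preserve the side
// whose rows are passed through unchanged.
override def outputPartitioning: Partitioning = joinType match {
  case _: InnerLike =>
    PartitioningCollection(Seq(left.outputPartitioning, right.outputPartitioning))
  case LeftOuter => left.outputPartitioning
  case RightOuter => right.outputPartitioning
  case FullOuter => UnknownPartitioning(left.outputPartitioning.numPartitions)
  case LeftExistence(_) => left.outputPartitioning
  case x =>
    throw new IllegalArgumentException(s"ShuffledHashJoin should not take $x as the JoinType")
}
{code}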




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31831) Flaky test: org.apache.spark.sql.hive.thriftserver.HiveSessionImplSuite.(It is not a test it is a sbt.testing.SuiteSelector)

2020-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158886#comment-17158886
 ] 

Apache Spark commented on SPARK-31831:
--

User 'frankyin-factual' has created a pull request for this issue:
https://github.com/apache/spark/pull/29129

> Flaky test: org.apache.spark.sql.hive.thriftserver.HiveSessionImplSuite.(It 
> is not a test it is a sbt.testing.SuiteSelector)
> 
>
> Key: SPARK-31831
> URL: https://issues.apache.org/jira/browse/SPARK-31831
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Jungtaek Lim
>Assignee: Frank Yin
>Priority: Major
> Fix For: 3.1.0
>
>
> I've seen this failure twice (not in a row, but close together), which seems to 
> require investigation.
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123147/testReport
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123150/testReport
> {noformat}
> org.mockito.exceptions.base.MockitoException:  ClassCastException occurred 
> while creating the mockito mock :   class to mock : 
> 'org.apache.hive.service.cli.session.SessionManager', loaded by classloader : 
> 'sun.misc.Launcher$AppClassLoader@483bf400'   created class : 
> 'org.mockito.codegen.SessionManager$MockitoMock$1696557705', loaded by 
> classloader : 
> 'net.bytebuddy.dynamic.loading.MultipleParentClassLoader@47ecf2c6'   proxy 
> instance class : 'org.mockito.codegen.SessionManager$MockitoMock$1696557705', 
> loaded by classloader : 
> 'net.bytebuddy.dynamic.loading.MultipleParentClassLoader@47ecf2c6'   instance 
> creation by : ObjenesisInstantiator  You might experience classloading 
> issues, please ask the mockito mailing-list. 
>  Stack Trace
> sbt.ForkMain$ForkError: org.mockito.exceptions.base.MockitoException: 
> ClassCastException occurred while creating the mockito mock :
>   class to mock : 'org.apache.hive.service.cli.session.SessionManager', 
> loaded by classloader : 'sun.misc.Launcher$AppClassLoader@483bf400'
>   created class : 
> 'org.mockito.codegen.SessionManager$MockitoMock$1696557705', loaded by 
> classloader : 
> 'net.bytebuddy.dynamic.loading.MultipleParentClassLoader@47ecf2c6'
>   proxy instance class : 
> 'org.mockito.codegen.SessionManager$MockitoMock$1696557705', loaded by 
> classloader : 
> 'net.bytebuddy.dynamic.loading.MultipleParentClassLoader@47ecf2c6'
>   instance creation by : ObjenesisInstantiator
> You might experience classloading issues, please ask the mockito mailing-list.
>   at 
> org.apache.spark.sql.hive.thriftserver.HiveSessionImplSuite.beforeAll(HiveSessionImplSuite.scala:44)
>   at 
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
>   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
>   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:59)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:317)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:510)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: sbt.ForkMain$ForkError: java.lang.ClassCastException: 
> org.mockito.codegen.SessionManager$MockitoMock$1696557705 cannot be cast to 
> org.mockito.internal.creation.bytebuddy.MockAccess
>   at 
> org.mockito.internal.creation.bytebuddy.SubclassByteBuddyMockMaker.createMock(SubclassByteBuddyMockMaker.java:48)
>   at 
> org.mockito.internal.creation.bytebuddy.ByteBuddyMockMaker.createMock(ByteBuddyMockMaker.java:25)
>   at org.mockito.internal.util.MockUtil.createMock(MockUtil.java:35)
>   at org.mockito.internal.MockitoCore.mock(MockitoCore.java:63)
>   at org.mockito.Mockito.mock(Mockito.java:1908)
>   at org.mockito.Mockito.mock(Mockito.java:1817)
>   ... 13 more
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31831) Flaky test: org.apache.spark.sql.hive.thriftserver.HiveSessionImplSuite.(It is not a test it is a sbt.testing.SuiteSelector)

2020-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158885#comment-17158885
 ] 

Apache Spark commented on SPARK-31831:
--

User 'frankyin-factual' has created a pull request for this issue:
https://github.com/apache/spark/pull/29129

> Flaky test: org.apache.spark.sql.hive.thriftserver.HiveSessionImplSuite.(It 
> is not a test it is a sbt.testing.SuiteSelector)
> 
>
> Key: SPARK-31831
> URL: https://issues.apache.org/jira/browse/SPARK-31831
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Jungtaek Lim
>Assignee: Frank Yin
>Priority: Major
> Fix For: 3.1.0
>
>
> I've seen this failure twice (not in a row, but close together), which seems to 
> require investigation.
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123147/testReport
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123150/testReport
> {noformat}
> org.mockito.exceptions.base.MockitoException:  ClassCastException occurred 
> while creating the mockito mock :   class to mock : 
> 'org.apache.hive.service.cli.session.SessionManager', loaded by classloader : 
> 'sun.misc.Launcher$AppClassLoader@483bf400'   created class : 
> 'org.mockito.codegen.SessionManager$MockitoMock$1696557705', loaded by 
> classloader : 
> 'net.bytebuddy.dynamic.loading.MultipleParentClassLoader@47ecf2c6'   proxy 
> instance class : 'org.mockito.codegen.SessionManager$MockitoMock$1696557705', 
> loaded by classloader : 
> 'net.bytebuddy.dynamic.loading.MultipleParentClassLoader@47ecf2c6'   instance 
> creation by : ObjenesisInstantiator  You might experience classloading 
> issues, please ask the mockito mailing-list. 
>  Stack Trace
> sbt.ForkMain$ForkError: org.mockito.exceptions.base.MockitoException: 
> ClassCastException occurred while creating the mockito mock :
>   class to mock : 'org.apache.hive.service.cli.session.SessionManager', 
> loaded by classloader : 'sun.misc.Launcher$AppClassLoader@483bf400'
>   created class : 
> 'org.mockito.codegen.SessionManager$MockitoMock$1696557705', loaded by 
> classloader : 
> 'net.bytebuddy.dynamic.loading.MultipleParentClassLoader@47ecf2c6'
>   proxy instance class : 
> 'org.mockito.codegen.SessionManager$MockitoMock$1696557705', loaded by 
> classloader : 
> 'net.bytebuddy.dynamic.loading.MultipleParentClassLoader@47ecf2c6'
>   instance creation by : ObjenesisInstantiator
> You might experience classloading issues, please ask the mockito mailing-list.
>   at 
> org.apache.spark.sql.hive.thriftserver.HiveSessionImplSuite.beforeAll(HiveSessionImplSuite.scala:44)
>   at 
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
>   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
>   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:59)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:317)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:510)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: sbt.ForkMain$ForkError: java.lang.ClassCastException: 
> org.mockito.codegen.SessionManager$MockitoMock$1696557705 cannot be cast to 
> org.mockito.internal.creation.bytebuddy.MockAccess
>   at 
> org.mockito.internal.creation.bytebuddy.SubclassByteBuddyMockMaker.createMock(SubclassByteBuddyMockMaker.java:48)
>   at 
> org.mockito.internal.creation.bytebuddy.ByteBuddyMockMaker.createMock(ByteBuddyMockMaker.java:25)
>   at org.mockito.internal.util.MockUtil.createMock(MockUtil.java:35)
>   at org.mockito.internal.MockitoCore.mock(MockitoCore.java:63)
>   at org.mockito.Mockito.mock(Mockito.java:1908)
>   at org.mockito.Mockito.mock(Mockito.java:1817)
>   ... 13 more
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32329) Rename HADOOP2_MODULE_PROFILES to HADOOP_MODULE_PROFILES

2020-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32329:


Assignee: (was: Apache Spark)

> Rename HADOOP2_MODULE_PROFILES to HADOOP_MODULE_PROFILES
> 
>
> Key: SPARK-32329
> URL: https://issues.apache.org/jira/browse/SPARK-32329
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: William Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32329) Rename HADOOP2_MODULE_PROFILES to HADOOP_MODULE_PROFILES

2020-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32329:


Assignee: Apache Spark

> Rename HADOOP2_MODULE_PROFILES to HADOOP_MODULE_PROFILES
> 
>
> Key: SPARK-32329
> URL: https://issues.apache.org/jira/browse/SPARK-32329
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: William Hyun
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32329) Rename HADOOP2_MODULE_PROFILES to HADOOP_MODULE_PROFILES

2020-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158881#comment-17158881
 ] 

Apache Spark commented on SPARK-32329:
--

User 'williamhyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/29128

> Rename HADOOP2_MODULE_PROFILES to HADOOP_MODULE_PROFILES
> 
>
> Key: SPARK-32329
> URL: https://issues.apache.org/jira/browse/SPARK-32329
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: William Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32329) Rename HADOOP2_MODULE_PROFILES to HADOOP_MODULE_PROFILES

2020-07-15 Thread William Hyun (Jira)
William Hyun created SPARK-32329:


 Summary: Rename HADOOP2_MODULE_PROFILES to HADOOP_MODULE_PROFILES
 Key: SPARK-32329
 URL: https://issues.apache.org/jira/browse/SPARK-32329
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 3.1.0
Reporter: William Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32325) JSON predicate pushdown for nested fields

2020-07-15 Thread pavithra ramachandran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158878#comment-17158878
 ] 

pavithra ramachandran commented on SPARK-32325:
---

I would like to work on this.

> JSON predicate pushdown for nested fields
> -
>
> Key: SPARK-32325
> URL: https://issues.apache.org/jira/browse/SPARK-32325
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> SPARK-30648 is meant to support filter pushdown to the JSON datasource, but it 
> supports only filters that refer to top-level fields. This ticket aims to 
> support nested fields as well. See the needed changes: 
> https://github.com/apache/spark/pull/27366#discussion_r443340603



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32328) Avro predicate pushdown for nested fields

2020-07-15 Thread pavithra ramachandran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158877#comment-17158877
 ] 

pavithra ramachandran commented on SPARK-32328:
---

I would like to work on this.

> Avro predicate pushdown for nested fields
> -
>
> Key: SPARK-32328
> URL: https://issues.apache.org/jira/browse/SPARK-32328
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jobit mathew
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32125) [UI] Support get taskList by status in Web UI and SHS Rest API

2020-07-15 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-32125:
--

Assignee: Zhongwei Zhu

> [UI] Support get taskList by status in Web UI and SHS Rest API
> --
>
> Key: SPARK-32125
> URL: https://issues.apache.org/jira/browse/SPARK-32125
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Zhongwei Zhu
>Assignee: Zhongwei Zhu
>Priority: Minor
>
> Support fetching taskList by status as below:
> /applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskList?status=failed



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32125) [UI] Support get taskList by status in Web UI and SHS Rest API

2020-07-15 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-32125.

Resolution: Fixed

This issue is resolved in https://github.com/apache/spark/pull/28942

> [UI] Support get taskList by status in Web UI and SHS Rest API
> --
>
> Key: SPARK-32125
> URL: https://issues.apache.org/jira/browse/SPARK-32125
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Zhongwei Zhu
>Priority: Minor
>
> Support fetching taskList by status as below:
> /applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskList?status=failed
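> For example (the application and stage IDs below are hypothetical, and the history 
> server is assumed at its default port 18080 under the usual /api/v1 root), the new 
> status filter can be exercised like this:
> {code:java}
> import scala.io.Source
> 
> // Hypothetical IDs: fetch only the failed tasks of stage 3, attempt 0.
> val url = "http://localhost:18080/api/v1/applications/app-20200715120000-0001/" +
>   "stages/3/0/taskList?status=failed"
> println(Source.fromURL(url).mkString)
> {code}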



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results

2020-07-15 Thread Ankit Raj Boudh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158863#comment-17158863
 ] 

Ankit Raj Boudh commented on SPARK-32306:
-

[~seanmalory], I will raise the PR for this soon.

> `approx_percentile` in Spark SQL gives incorrect results
> 
>
> Key: SPARK-32306
> URL: https://issues.apache.org/jira/browse/SPARK-32306
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.4
>Reporter: Sean Malory
>Priority: Major
>
> The `approx_percentile` function in Spark SQL does not give the correct 
> result. I'm not sure how incorrect it is; it may just be a boundary issue. 
> From the docs:
> {quote}The accuracy parameter (default: 10000) is a positive numeric literal 
> which controls approximation accuracy at the cost of memory. Higher value of 
> accuracy yields better accuracy, 1.0/accuracy is the relative error of the 
> approximation.
> {quote}
> This is not true. Here is a minimal example in `pyspark` where, essentially, 
> the median of 5 and 8 is calculated as 5:
> {code:python}
> import pyspark.sql.functions as psf
> df = spark.createDataFrame(
>     [('bar', 5), ('bar', 8)], ['name', 'val']
> )
> median = psf.expr('percentile_approx(val, 0.5, 2147483647)')
> df.groupBy('name').agg(median.alias('median')).show()  # gives the median as 5 (exact median is 6.5)
> {code}
> I've tested this with Spark v2.4.4 and pyspark v2.4.5, although I suspect this 
> is an issue with the underlying algorithm.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32324) Fix error messages during using PIVOT and lateral view

2020-07-15 Thread philipse (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

philipse updated SPARK-32324:
-
Description: 
Currently, when we use `lateral view` and `pivot` together in a FROM clause, if the 
`lateral view` comes before the `pivot`, the error message is "LATERAL cannot be used 
together with PIVOT in FROM clause". If the `lateral view` comes after the `pivot`, the 
query runs normally. So the error message "LATERAL cannot be used together with PIVOT 
in FROM clause" is not accurate, and we may improve it.

 

Steps to reproduce:
{code:java}
CREATE TABLE person (id INT, name STRING, age INT, class int, address STRING);
 INSERT INTO person VALUES
 (100, 'John', 30, 1, 'Street 1'),
 (200, 'Mary', NULL, 1, 'Street 2'),
 (300, 'Mike', 80, 3, 'Street 3'),
 (400, 'Dan', 50, 4, 'Street 4');
{code}
 

Query1:

 
{code:java}
SELECT * FROM person
 lateral view outer explode(array(30,60)) tabelName as c_age
 lateral view explode(array(40,80)) as d_age
 PIVOT (
 count(distinct age) as a
 for name in ('Mary','John')
 )
{code}
Result 1:

 
{code:java}
Error: org.apache.spark.sql.catalyst.parser.ParseException: 
 LATERAL cannot be used together with PIVOT in FROM clause(line 1, pos 9)
== SQL ==
 SELECT * FROM person
 -^^^
 lateral view outer explode(array(30,60)) tabelName as c_age
 lateral view explode(array(40,80)) as d_age
 PIVOT (
 count(distinct age) as a
 for name in ('Mary','John')
 ) (state=,code=0)
{code}
 

 

Query2:

 
{code:java}
SELECT * FROM person
 PIVOT (
 count(distinct age) as a
 for name in ('Mary','John')
 )
 lateral view outer explode(array(30,60)) tabelName as c_age
 lateral view explode(array(40,80)) as d_age
{code}
 

Result 2:

{code:java}
+-----+------+------+-------+-------+
| id  | Mary | John | c_age | d_age |
+-----+------+------+-------+-------+
| 300 | NULL | NULL | 30    | 40    |
| 300 | NULL | NULL | 30    | 80    |
| 300 | NULL | NULL | 60    | 40    |
| 300 | NULL | NULL | 60    | 80    |
| 100 | 0    | NULL | 30    | 40    |
| 100 | 0    | NULL | 30    | 80    |
| 100 | 0    | NULL | 60    | 40    |
| 100 | 0    | NULL | 60    | 80    |
| 400 | NULL | NULL | 30    | 40    |
| 400 | NULL | NULL | 30    | 80    |
| 400 | NULL | NULL | 60    | 40    |
| 400 | NULL | NULL | 60    | 80    |
| 200 | NULL | 1    | 30    | 40    |
| 200 | NULL | 1    | 30    | 80    |
| 200 | NULL | 1    | 60    | 40    |
| 200 | NULL | 1    | 60    | 80    |
+-----+------+------+-------+-------+
{code}

 

  was:
Currently, when we use `lateral view` and `pivot` together in a FROM clause, if the 
`lateral view` comes before the `pivot`, the error message is "LATERAL cannot be used 
together with PIVOT in FROM clause". If the `lateral view` comes after the `pivot`, the 
query runs normally. So the error message "LATERAL cannot be used together with PIVOT 
in FROM clause" is not accurate, and we may improve it.

 

Steps to reproduce:

```

CREATE TABLE person (id INT, name STRING, age INT, class int, address STRING);
INSERT INTO person VALUES
(100, 'John', 30, 1, 'Street 1'),
(200, 'Mary', NULL, 1, 'Street 2'),
(300, 'Mike', 80, 3, 'Street 3'),
(400, 'Dan', 50, 4, 'Street 4');

```

Query1:

```

SELECT * FROM person
lateral view outer explode(array(30,60)) tabelName as c_age
lateral view explode(array(40,80)) as d_age
PIVOT (
 count(distinct age) as a
for name in ('Mary','John')
)

```

Result 1:

```

Error: org.apache.spark.sql.catalyst.parser.ParseException: 
LATERAL cannot be used together with PIVOT in FROM clause(line 1, pos 9)

== SQL ==
SELECT * FROM person
-^^^
lateral view outer explode(array(30,60)) tabelName as c_age
lateral view explode(array(40,80)) as d_age
PIVOT (
 count(distinct age) as a
for name in ('Mary','John')
) (state=,code=0)

```

 

Query2:

```

SELECT * FROM person
PIVOT (
 count(distinct age) as a
for name in ('Mary','John')
)
lateral view outer explode(array(30,60)) tabelName as c_age
lateral view explode(array(40,80)) as d_age

```

Result 2:

```

+-----+------+------+-------+-------+
| id  | Mary | John | c_age | d_age |
+-----+------+------+-------+-------+
| 300 | NULL | NULL | 30    | 40    |
| 300 | NULL | NULL | 30    | 80    |
| 300 | NULL | NULL | 60    | 40    |
| 300 | NULL | NULL | 60    | 80    |
| 100 | 0    | NULL | 30    | 40    |
| 100 | 0    | NULL | 30    | 80    |
| 100 | 0    | NULL | 60    | 40    |
| 100 | 0    | NULL | 60    | 80    |
| 400 | NULL | NULL | 30    | 40    |
| 400 | NULL | NULL | 30    | 80    |
| 400 | NULL | NULL | 60    | 40    |
| 400 | NULL | NULL | 60    | 80    |
| 200 | NULL | 1    | 30    | 40    |
| 200 | NULL | 1    | 30    | 80    |
| 200 | NULL | 1    | 60    | 40    |
| 200 | NULL | 1    | 60    | 80    |
+-----+------+------+-------+-------+

```

 


> Fix error messages during using PIVOT and lateral view
> --
>
> Key: SPARK-32324
> URL: https://issues.apache.org/jira/browse/SPARK-32324
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: philipse
>Priority: Minor
>
> Currently when we use `lateral view` and `pivot` together in from clause, if  
> `lateral view` is before `pivot`, the error message is "LATERAL cannot be 
> used together with PIVOT in FROM clau

[jira] [Created] (SPARK-32328) Avro predicate pushdown for nested fields

2020-07-15 Thread jobit mathew (Jira)
jobit mathew created SPARK-32328:


 Summary: Avro predicate pushdown for nested fields
 Key: SPARK-32328
 URL: https://issues.apache.org/jira/browse/SPARK-32328
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: jobit mathew






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32307) Aggression that use map type input UDF as group expression can fail

2020-07-15 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158814#comment-17158814
 ] 

Dongjoon Hyun commented on SPARK-32307:
---

Hi, [~Ngone51]. It seems that we need to pass a full Jenkins run on branch-3.0. 
Could you make a backporting PR, please?

> Aggression that use map type input UDF as group expression can fail
> ---
>
> Key: SPARK-32307
> URL: https://issues.apache.org/jira/browse/SPARK-32307
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.1.0
>
>
> {code:java}
> spark.udf.register("key", udf((m: Map[String, String]) => m.keys.head.toInt))
> Seq(Map("1" -> "one", "2" -> "two")).toDF("a").createOrReplaceTempView("t")
> checkAnswer(sql("SELECT key(a) AS k FROM t GROUP BY key(a)"), Row(1) :: Nil)
> [info]   org.apache.spark.sql.AnalysisException: expression 't.`a`' is 
> neither present in the group by, nor is it an aggregate function. Add to 
> group by or wrap in first() (or first_value) if you don't care which value 
> you get.;;
> [info] Aggregate [UDF(a#6)], [UDF(a#6) AS k#8]
> [info] +- SubqueryAlias t
> [info]+- Project [value#3 AS a#6]
> [info]   +- LocalRelation [value#3]
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:49)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:48)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:130)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidAggregateExpression$1(CheckAnalysis.scala:257)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10(CheckAnalysis.scala:259)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10$adapted(CheckAnalysis.scala:259)
> [info]   at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
> [info]   at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
> [info]   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidAggregateExpression$1(CheckAnalysis.scala:259)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10(CheckAnalysis.scala:259)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10$adapted(CheckAnalysis.scala:259)
> [info]   at scala.collection.immutable.List.foreach(List.scala:392)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidAggregateExpression$1(CheckAnalysis.scala:259)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$13(CheckAnalysis.scala:286)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$13$adapted(CheckAnalysis.scala:286)
> [info]   at scala.collection.immutable.List.foreach(List.scala:392)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:286)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:92)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:177)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:92)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:89)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:130)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:156)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:153)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:70)
> [info]   at 
> org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:135)
> [info]   at 
> org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:135)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:70)
> [info]   at 
> org.apach

[jira] [Updated] (SPARK-32307) Aggression that use map type input UDF as group expression can fail

2020-07-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32307:
--
Fix Version/s: (was: 3.0.1)

> Aggression that use map type input UDF as group expression can fail
> ---
>
> Key: SPARK-32307
> URL: https://issues.apache.org/jira/browse/SPARK-32307
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.1.0
>
>
> {code:java}
> spark.udf.register("key", udf((m: Map[String, String]) => m.keys.head.toInt))
> Seq(Map("1" -> "one", "2" -> "two")).toDF("a").createOrReplaceTempView("t")
> checkAnswer(sql("SELECT key(a) AS k FROM t GROUP BY key(a)"), Row(1) :: Nil)
> [info]   org.apache.spark.sql.AnalysisException: expression 't.`a`' is 
> neither present in the group by, nor is it an aggregate function. Add to 
> group by or wrap in first() (or first_value) if you don't care which value 
> you get.;;
> [info] Aggregate [UDF(a#6)], [UDF(a#6) AS k#8]
> [info] +- SubqueryAlias t
> [info]+- Project [value#3 AS a#6]
> [info]   +- LocalRelation [value#3]
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:49)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:48)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:130)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidAggregateExpression$1(CheckAnalysis.scala:257)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10(CheckAnalysis.scala:259)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10$adapted(CheckAnalysis.scala:259)
> [info]   at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
> [info]   at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
> [info]   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidAggregateExpression$1(CheckAnalysis.scala:259)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10(CheckAnalysis.scala:259)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10$adapted(CheckAnalysis.scala:259)
> [info]   at scala.collection.immutable.List.foreach(List.scala:392)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidAggregateExpression$1(CheckAnalysis.scala:259)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$13(CheckAnalysis.scala:286)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$13$adapted(CheckAnalysis.scala:286)
> [info]   at scala.collection.immutable.List.foreach(List.scala:392)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:286)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:92)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:177)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:92)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:89)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:130)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:156)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:153)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:70)
> [info]   at 
> org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:135)
> [info]   at 
> org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:135)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:70)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:68)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.assertAna

[jira] [Commented] (SPARK-32307) Aggression that use map type input UDF as group expression can fail

2020-07-15 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158813#comment-17158813
 ] 

Dongjoon Hyun commented on SPARK-32307:
---

This was reverted from `branch-3.0` due to the UT failure.

> Aggression that use map type input UDF as group expression can fail
> ---
>
> Key: SPARK-32307
> URL: https://issues.apache.org/jira/browse/SPARK-32307
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.1.0
>
>
> {code:java}
> spark.udf.register("key", udf((m: Map[String, String]) => m.keys.head.toInt))
> Seq(Map("1" -> "one", "2" -> "two")).toDF("a").createOrReplaceTempView("t")
> checkAnswer(sql("SELECT key(a) AS k FROM t GROUP BY key(a)"), Row(1) :: Nil)
> [info]   org.apache.spark.sql.AnalysisException: expression 't.`a`' is 
> neither present in the group by, nor is it an aggregate function. Add to 
> group by or wrap in first() (or first_value) if you don't care which value 
> you get.;;
> [info] Aggregate [UDF(a#6)], [UDF(a#6) AS k#8]
> [info] +- SubqueryAlias t
> [info]+- Project [value#3 AS a#6]
> [info]   +- LocalRelation [value#3]
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:49)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:48)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:130)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidAggregateExpression$1(CheckAnalysis.scala:257)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10(CheckAnalysis.scala:259)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10$adapted(CheckAnalysis.scala:259)
> [info]   at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
> [info]   at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
> [info]   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidAggregateExpression$1(CheckAnalysis.scala:259)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10(CheckAnalysis.scala:259)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10$adapted(CheckAnalysis.scala:259)
> [info]   at scala.collection.immutable.List.foreach(List.scala:392)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidAggregateExpression$1(CheckAnalysis.scala:259)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$13(CheckAnalysis.scala:286)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$13$adapted(CheckAnalysis.scala:286)
> [info]   at scala.collection.immutable.List.foreach(List.scala:392)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:286)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:92)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:177)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:92)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:89)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:130)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:156)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:153)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:70)
> [info]   at 
> org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:135)
> [info]   at 
> org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:135)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:70)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.sc

[jira] [Assigned] (SPARK-32327) Introduce UnresolvedTableOrPermanentView for commands that support a table/view but not a temporary view

2020-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32327:


Assignee: Apache Spark

> Introduce UnresolvedTableOrPermanentView for commands that support a 
> table/view but not a temporary view
> 
>
> Key: SPARK-32327
> URL: https://issues.apache.org/jira/browse/SPARK-32327
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Assignee: Apache Spark
>Priority: Major
>
> We should have UnresolvedTableOrPermanentView for commands that do support a 
> table or a view, but not a temporary view, such that an analysis can fail if 
> an identifier is resolved to a temporary view for those commands 
>  
> For example, SHOW TBLPROPERTIES should not support a temp view since it 
> always returns an empty result, which could be misleading.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32327) Introduce UnresolvedTableOrPermanentView for commands that support a table/view but not a temporary view

2020-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158745#comment-17158745
 ] 

Apache Spark commented on SPARK-32327:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/29127

> Introduce UnresolvedTableOrPermanentView for commands that support a 
> table/view but not a temporary view
> 
>
> Key: SPARK-32327
> URL: https://issues.apache.org/jira/browse/SPARK-32327
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Priority: Major
>
> We should have UnresolvedTableOrPermanentView for commands that do support a 
> table or a view, but not a temporary view, such that an analysis can fail if 
> an identifier is resolved to a temporary view for those commands 
>  
> For example, SHOW TBLPROPERTIES should not support a temp view since it 
> always returns an empty result, which could be misleading.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32327) Introduce UnresolvedTableOrPermanentView for commands that support a table/view but not a temporary view

2020-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32327:


Assignee: (was: Apache Spark)

> Introduce UnresolvedTableOrPermanentView for commands that support a 
> table/view but not a temporary view
> 
>
> Key: SPARK-32327
> URL: https://issues.apache.org/jira/browse/SPARK-32327
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Priority: Major
>
> We should have UnresolvedTableOrPermanentView for commands that do support a 
> table or a view, but not a temporary view, such that an analysis can fail if 
> an identifier is resolved to a temporary view for those commands 
>  
> For example, SHOW TBLPROPERTIES should not support a temp view since it 
> always returns an empty result, which could be misleading.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32327) Introduce UnresolvedTableOrPermanentView for commands that support a table/view but not a temporary view

2020-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158744#comment-17158744
 ] 

Apache Spark commented on SPARK-32327:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/29127

> Introduce UnresolvedTableOrPermanentView for commands that support a 
> table/view but not a temporary view
> 
>
> Key: SPARK-32327
> URL: https://issues.apache.org/jira/browse/SPARK-32327
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Priority: Major
>
> We should have UnresolvedTableOrPermanentView for commands that do support a 
> table or a view, but not a temporary view, such that an analysis can fail if 
> an identifier is resolved to a temporary view for those commands 
>  
> For example, SHOW TBLPROPERTIES should not support a temp view since it 
> always returns an empty result, which could be misleading.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32037) Rename blacklisting feature to avoid language with racist connotation

2020-07-15 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158739#comment-17158739
 ] 

Thomas Graves commented on SPARK-32037:
---

Any other opinions on what we should go with here?

> Rename blacklisting feature to avoid language with racist connotation
> -
>
> Key: SPARK-32037
> URL: https://issues.apache.org/jira/browse/SPARK-32037
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Erik Krogen
>Priority: Minor
>
> As per [discussion on the Spark dev 
> list|https://lists.apache.org/thread.html/rf6b2cdcba4d3875350517a2339619e5d54e12e66626a88553f9fe275%40%3Cdev.spark.apache.org%3E],
>  it will be beneficial to remove references to problematic language that can 
> alienate potential community members. One such reference is "blacklist". 
> While it seems to me that there is some valid debate as to whether this term 
> has racist origins, the cultural connotations are inescapable in today's 
> world.
> I've created a separate task, SPARK-32036, to remove references outside of 
> this feature. Given the large surface area of this feature and the 
> public-facing UI / configs / etc., more care will need to be taken here.
> I'd like to start by opening up debate on what the best replacement name 
> would be. Reject-/deny-/ignore-/block-list are common replacements for 
> "blacklist", but I'm not sure that any of them work well for this situation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32327) Introduce UnresolvedTableOrPermanentView for commands that support a table/view but not a temporary view

2020-07-15 Thread Terry Kim (Jira)
Terry Kim created SPARK-32327:
-

 Summary: Introduce UnresolvedTableOrPermanentView for commands 
that support a table/view but not a temporary view
 Key: SPARK-32327
 URL: https://issues.apache.org/jira/browse/SPARK-32327
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Terry Kim


We should have UnresolvedTableOrPermanentView for commands that do support a 
table or a view, but not a temporary view, so that analysis can fail if an 
identifier is resolved to a temporary view for those commands.

 

For example, SHOW TBLPROPERTIES should not support a temp view since it always 
returns an empty result, which could be misleading.
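For illustration, a minimal sketch of the misleading behaviour (the view name and query below are made up for this example and not taken from the report; a spark-shell session where `spark` is in scope is assumed):

{code:scala}
// Hypothetical repro: a temporary view has no table properties, so the command
// silently returns an empty result instead of failing analysis.
spark.sql("CREATE TEMPORARY VIEW temp_v AS SELECT 1 AS id")
spark.sql("SHOW TBLPROPERTIES temp_v").show()  // empty output today, which is misleading
{code}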

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32326) R version is too old on Jenkins k8s PRB

2020-07-15 Thread Holden Karau (Jira)
Holden Karau created SPARK-32326:


 Summary: R version is too old on Jenkins k8s PRB
 Key: SPARK-32326
 URL: https://issues.apache.org/jira/browse/SPARK-32326
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 3.1.0
Reporter: Holden Karau
Assignee: Shane Knapp


I'm seeing a consistent failure indicating the R version is out of date - 
[https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/30513/console]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32324) Fix error messages during using PIVOT and lateral view

2020-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158584#comment-17158584
 ] 

Apache Spark commented on SPARK-32324:
--

User 'GuoPhilipse' has created a pull request for this issue:
https://github.com/apache/spark/pull/29126

> Fix error messages during using PIVOT and lateral view
> --
>
> Key: SPARK-32324
> URL: https://issues.apache.org/jira/browse/SPARK-32324
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: philipse
>Priority: Minor
>
> Currently, when we use `lateral view` and `pivot` together in the FROM clause, 
> if `lateral view` comes before `pivot`, the error message is "LATERAL cannot be 
> used together with PIVOT in FROM clause". If `lateral view` comes after 
> `pivot`, the query runs normally, so the error message "LATERAL cannot be 
> used together with PIVOT in FROM clause" is not accurate, and we may improve it.
>  
> Steps to reproduce:
> ```
> CREATE TABLE person (id INT, name STRING, age INT, class int, address STRING);
> INSERT INTO person VALUES
> (100, 'John', 30, 1, 'Street 1'),
> (200, 'Mary', NULL, 1, 'Street 2'),
> (300, 'Mike', 80, 3, 'Street 3'),
> (400, 'Dan', 50, 4, 'Street 4');
> ```
> Query1:
> ```
> SELECT * FROM person
> lateral view outer explode(array(30,60)) tabelName as c_age
> lateral view explode(array(40,80)) as d_age
> PIVOT (
>  count(distinct age) as a
> for name in ('Mary','John')
> )
> ```
> Result 1:
> ```
> Error: org.apache.spark.sql.catalyst.parser.ParseException: 
> LATERAL cannot be used together with PIVOT in FROM clause(line 1, pos 9)
> == SQL ==
> SELECT * FROM person
> -^^^
> lateral view outer explode(array(30,60)) tabelName as c_age
> lateral view explode(array(40,80)) as d_age
> PIVOT (
>  count(distinct age) as a
> for name in ('Mary','John')
> ) (state=,code=0)
> ```
>  
> Query2:
> ```
> SELECT * FROM person
> PIVOT (
>  count(distinct age) as a
> for name in ('Mary','John')
> )
> lateral view outer explode(array(30,60)) tabelName as c_age
> lateral view explode(array(40,80)) as d_age
> ```
> Result 2:
> ```
> +-----+------+------+-------+-------+
> | id  | Mary | John | c_age | d_age |
> +-----+------+------+-------+-------+
> | 300 | NULL | NULL | 30    | 40    |
> | 300 | NULL | NULL | 30    | 80    |
> | 300 | NULL | NULL | 60    | 40    |
> | 300 | NULL | NULL | 60    | 80    |
> | 100 | 0    | NULL | 30    | 40    |
> | 100 | 0    | NULL | 30    | 80    |
> | 100 | 0    | NULL | 60    | 40    |
> | 100 | 0    | NULL | 60    | 80    |
> | 400 | NULL | NULL | 30    | 40    |
> | 400 | NULL | NULL | 30    | 80    |
> | 400 | NULL | NULL | 60    | 40    |
> | 400 | NULL | NULL | 60    | 80    |
> | 200 | NULL | 1    | 30    | 40    |
> | 200 | NULL | 1    | 30    | 80    |
> | 200 | NULL | 1    | 60    | 40    |
> | 200 | NULL | 1    | 60    | 80    |
> +-----+------+------+-------+-------+
> ```
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32324) Fix error messages during using PIVOT and lateral view

2020-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32324:


Assignee: (was: Apache Spark)

> Fix error messages during using PIVOT and lateral view
> --
>
> Key: SPARK-32324
> URL: https://issues.apache.org/jira/browse/SPARK-32324
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: philipse
>Priority: Minor
>
> Currently, when we use `lateral view` and `pivot` together in the FROM clause, 
> if `lateral view` comes before `pivot`, the error message is "LATERAL cannot be 
> used together with PIVOT in FROM clause". If `lateral view` comes after 
> `pivot`, the query runs normally, so the error message "LATERAL cannot be 
> used together with PIVOT in FROM clause" is not accurate, and we may improve it.
>  
> Steps to reproduce:
> ```
> CREATE TABLE person (id INT, name STRING, age INT, class int, address STRING);
> INSERT INTO person VALUES
> (100, 'John', 30, 1, 'Street 1'),
> (200, 'Mary', NULL, 1, 'Street 2'),
> (300, 'Mike', 80, 3, 'Street 3'),
> (400, 'Dan', 50, 4, 'Street 4');
> ```
> Query1:
> ```
> SELECT * FROM person
> lateral view outer explode(array(30,60)) tabelName as c_age
> lateral view explode(array(40,80)) as d_age
> PIVOT (
>  count(distinct age) as a
> for name in ('Mary','John')
> )
> ```
> Result 1:
> ```
> Error: org.apache.spark.sql.catalyst.parser.ParseException: 
> LATERAL cannot be used together with PIVOT in FROM clause(line 1, pos 9)
> == SQL ==
> SELECT * FROM person
> -^^^
> lateral view outer explode(array(30,60)) tabelName as c_age
> lateral view explode(array(40,80)) as d_age
> PIVOT (
>  count(distinct age) as a
> for name in ('Mary','John')
> ) (state=,code=0)
> ```
>  
> Query2:
> ```
> SELECT * FROM person
> PIVOT (
>  count(distinct age) as a
> for name in ('Mary','John')
> )
> lateral view outer explode(array(30,60)) tabelName as c_age
> lateral view explode(array(40,80)) as d_age
> ```
> Result 2:
> ```
> +-----+------+------+-------+-------+
> | id  | Mary | John | c_age | d_age |
> +-----+------+------+-------+-------+
> | 300 | NULL | NULL | 30    | 40    |
> | 300 | NULL | NULL | 30    | 80    |
> | 300 | NULL | NULL | 60    | 40    |
> | 300 | NULL | NULL | 60    | 80    |
> | 100 | 0    | NULL | 30    | 40    |
> | 100 | 0    | NULL | 30    | 80    |
> | 100 | 0    | NULL | 60    | 40    |
> | 100 | 0    | NULL | 60    | 80    |
> | 400 | NULL | NULL | 30    | 40    |
> | 400 | NULL | NULL | 30    | 80    |
> | 400 | NULL | NULL | 60    | 40    |
> | 400 | NULL | NULL | 60    | 80    |
> | 200 | NULL | 1    | 30    | 40    |
> | 200 | NULL | 1    | 30    | 80    |
> | 200 | NULL | 1    | 60    | 40    |
> | 200 | NULL | 1    | 60    | 80    |
> +-----+------+------+-------+-------+
> ```
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32324) Fix error messages during using PIVOT and lateral view

2020-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158582#comment-17158582
 ] 

Apache Spark commented on SPARK-32324:
--

User 'GuoPhilipse' has created a pull request for this issue:
https://github.com/apache/spark/pull/29126

> Fix error messages during using PIVOT and lateral view
> --
>
> Key: SPARK-32324
> URL: https://issues.apache.org/jira/browse/SPARK-32324
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: philipse
>Priority: Minor
>
> Currently, when we use `lateral view` and `pivot` together in the FROM clause, 
> if `lateral view` comes before `pivot`, the error message is "LATERAL cannot be 
> used together with PIVOT in FROM clause". If `lateral view` comes after 
> `pivot`, the query runs normally, so the error message "LATERAL cannot be 
> used together with PIVOT in FROM clause" is not accurate, and we may improve it.
>  
> Steps to reproduce:
> ```
> CREATE TABLE person (id INT, name STRING, age INT, class int, address STRING);
> INSERT INTO person VALUES
> (100, 'John', 30, 1, 'Street 1'),
> (200, 'Mary', NULL, 1, 'Street 2'),
> (300, 'Mike', 80, 3, 'Street 3'),
> (400, 'Dan', 50, 4, 'Street 4');
> ```
> Query1:
> ```
> SELECT * FROM person
> lateral view outer explode(array(30,60)) tabelName as c_age
> lateral view explode(array(40,80)) as d_age
> PIVOT (
>  count(distinct age) as a
> for name in ('Mary','John')
> )
> ```
> Result 1:
> ```
> Error: org.apache.spark.sql.catalyst.parser.ParseException: 
> LATERAL cannot be used together with PIVOT in FROM clause(line 1, pos 9)
> == SQL ==
> SELECT * FROM person
> -^^^
> lateral view outer explode(array(30,60)) tabelName as c_age
> lateral view explode(array(40,80)) as d_age
> PIVOT (
>  count(distinct age) as a
> for name in ('Mary','John')
> ) (state=,code=0)
> ```
>  
> Query2:
> ```
> SELECT * FROM person
> PIVOT (
>  count(distinct age) as a
> for name in ('Mary','John')
> )
> lateral view outer explode(array(30,60)) tabelName as c_age
> lateral view explode(array(40,80)) as d_age
> ```
> Result 2:
> ```
> +-----+------+------+-------+-------+
> | id  | Mary | John | c_age | d_age |
> +-----+------+------+-------+-------+
> | 300 | NULL | NULL | 30    | 40    |
> | 300 | NULL | NULL | 30    | 80    |
> | 300 | NULL | NULL | 60    | 40    |
> | 300 | NULL | NULL | 60    | 80    |
> | 100 | 0    | NULL | 30    | 40    |
> | 100 | 0    | NULL | 30    | 80    |
> | 100 | 0    | NULL | 60    | 40    |
> | 100 | 0    | NULL | 60    | 80    |
> | 400 | NULL | NULL | 30    | 40    |
> | 400 | NULL | NULL | 30    | 80    |
> | 400 | NULL | NULL | 60    | 40    |
> | 400 | NULL | NULL | 60    | 80    |
> | 200 | NULL | 1    | 30    | 40    |
> | 200 | NULL | 1    | 30    | 80    |
> | 200 | NULL | 1    | 60    | 40    |
> | 200 | NULL | 1    | 60    | 80    |
> +-----+------+------+-------+-------+
> ```
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32324) Fix error messages during using PIVOT and lateral view

2020-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32324:


Assignee: Apache Spark

> Fix error messages during using PIVOT and lateral view
> --
>
> Key: SPARK-32324
> URL: https://issues.apache.org/jira/browse/SPARK-32324
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: philipse
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, when we use `lateral view` and `pivot` together in the FROM clause, 
> if `lateral view` comes before `pivot`, the error message is "LATERAL cannot be 
> used together with PIVOT in FROM clause". If `lateral view` comes after 
> `pivot`, the query runs normally, so the error message "LATERAL cannot be 
> used together with PIVOT in FROM clause" is not accurate, and we may improve it.
>  
> Steps to reproduce:
> ```
> CREATE TABLE person (id INT, name STRING, age INT, class int, address STRING);
> INSERT INTO person VALUES
> (100, 'John', 30, 1, 'Street 1'),
> (200, 'Mary', NULL, 1, 'Street 2'),
> (300, 'Mike', 80, 3, 'Street 3'),
> (400, 'Dan', 50, 4, 'Street 4');
> ```
> Query1:
> ```
> SELECT * FROM person
> lateral view outer explode(array(30,60)) tabelName as c_age
> lateral view explode(array(40,80)) as d_age
> PIVOT (
>  count(distinct age) as a
> for name in ('Mary','John')
> )
> ```
> Result 1:
> ```
> Error: org.apache.spark.sql.catalyst.parser.ParseException: 
> LATERAL cannot be used together with PIVOT in FROM clause(line 1, pos 9)
> == SQL ==
> SELECT * FROM person
> -^^^
> lateral view outer explode(array(30,60)) tabelName as c_age
> lateral view explode(array(40,80)) as d_age
> PIVOT (
>  count(distinct age) as a
> for name in ('Mary','John')
> ) (state=,code=0)
> ```
>  
> Query2:
> ```
> SELECT * FROM person
> PIVOT (
>  count(distinct age) as a
> for name in ('Mary','John')
> )
> lateral view outer explode(array(30,60)) tabelName as c_age
> lateral view explode(array(40,80)) as d_age
> ```
> Result 2:
> ```
> +-----+------+------+-------+-------+
> | id  | Mary | John | c_age | d_age |
> +-----+------+------+-------+-------+
> | 300 | NULL | NULL | 30    | 40    |
> | 300 | NULL | NULL | 30    | 80    |
> | 300 | NULL | NULL | 60    | 40    |
> | 300 | NULL | NULL | 60    | 80    |
> | 100 | 0    | NULL | 30    | 40    |
> | 100 | 0    | NULL | 30    | 80    |
> | 100 | 0    | NULL | 60    | 40    |
> | 100 | 0    | NULL | 60    | 80    |
> | 400 | NULL | NULL | 30    | 40    |
> | 400 | NULL | NULL | 30    | 80    |
> | 400 | NULL | NULL | 60    | 40    |
> | 400 | NULL | NULL | 60    | 80    |
> | 200 | NULL | 1    | 30    | 40    |
> | 200 | NULL | 1    | 30    | 80    |
> | 200 | NULL | 1    | 60    | 40    |
> | 200 | NULL | 1    | 60    | 80    |
> +-----+------+------+-------+-------+
> ```
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32325) JSON predicate pushdown for nested fields

2020-07-15 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-32325:
--

 Summary: JSON predicate pushdown for nested fields
 Key: SPARK-32325
 URL: https://issues.apache.org/jira/browse/SPARK-32325
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


SPARK-30648 adds filter pushdown to the JSON datasource, but it supports 
only filters that refer to top-level fields. This ticket aims to support nested 
fields as well. See the needed changes: 
https://github.com/apache/spark/pull/27366#discussion_r443340603
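As a rough sketch of the kind of query that would benefit (the file path and schema below are made up for illustration, assuming a spark-shell session where `spark` is in scope):

{code:scala}
import org.apache.spark.sql.functions.col

// Hypothetical input, one JSON object per line, e.g.:
// {"id": 1, "address": {"city": "Berlin", "zip": "10115"}}
val df = spark.read.json("/tmp/events.json")

// A top-level predicate such as col("id") === 1 can already be pushed to the
// JSON datasource after SPARK-30648; the nested predicate below is the kind of
// filter this ticket aims to push down as well.
df.filter(col("address.city") === "Berlin").show()
{code}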



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32324) Fix error messages during using PIVOT and lateral view

2020-07-15 Thread philipse (Jira)
philipse created SPARK-32324:


 Summary: Fix error messages during using PIVOT and lateral view
 Key: SPARK-32324
 URL: https://issues.apache.org/jira/browse/SPARK-32324
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: philipse


Currently, when we use `lateral view` and `pivot` together in the FROM clause, if 
`lateral view` comes before `pivot`, the error message is "LATERAL cannot be used 
together with PIVOT in FROM clause". If `lateral view` comes after `pivot`, the 
query runs normally, so the error message "LATERAL cannot be used together 
with PIVOT in FROM clause" is not accurate, and we may improve it.

 

Steps to reproduce:

```

CREATE TABLE person (id INT, name STRING, age INT, class int, address STRING);
INSERT INTO person VALUES
(100, 'John', 30, 1, 'Street 1'),
(200, 'Mary', NULL, 1, 'Street 2'),
(300, 'Mike', 80, 3, 'Street 3'),
(400, 'Dan', 50, 4, 'Street 4');

```

Query1:

```

SELECT * FROM person
lateral view outer explode(array(30,60)) tabelName as c_age
lateral view explode(array(40,80)) as d_age
PIVOT (
 count(distinct age) as a
for name in ('Mary','John')
)

```

Result 1:

```

Error: org.apache.spark.sql.catalyst.parser.ParseException: 
LATERAL cannot be used together with PIVOT in FROM clause(line 1, pos 9)

== SQL ==
SELECT * FROM person
-^^^
lateral view outer explode(array(30,60)) tabelName as c_age
lateral view explode(array(40,80)) as d_age
PIVOT (
 count(distinct age) as a
for name in ('Mary','John')
) (state=,code=0)

```

 

Query2:

```

SELECT * FROM person
PIVOT (
 count(distinct age) as a
for name in ('Mary','John')
)
lateral view outer explode(array(30,60)) tabelName as c_age
lateral view explode(array(40,80)) as d_age

```

Result 2:

```

+-----+------+------+-------+-------+
| id  | Mary | John | c_age | d_age |
+-----+------+------+-------+-------+
| 300 | NULL | NULL | 30    | 40    |
| 300 | NULL | NULL | 30    | 80    |
| 300 | NULL | NULL | 60    | 40    |
| 300 | NULL | NULL | 60    | 80    |
| 100 | 0    | NULL | 30    | 40    |
| 100 | 0    | NULL | 30    | 80    |
| 100 | 0    | NULL | 60    | 40    |
| 100 | 0    | NULL | 60    | 80    |
| 400 | NULL | NULL | 30    | 40    |
| 400 | NULL | NULL | 30    | 80    |
| 400 | NULL | NULL | 60    | 40    |
| 400 | NULL | NULL | 60    | 80    |
| 200 | NULL | 1    | 30    | 40    |
| 200 | NULL | 1    | 30    | 80    |
| 200 | NULL | 1    | 60    | 40    |
| 200 | NULL | 1    | 60    | 80    |
+-----+------+------+-------+-------+

```

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32018) Fix UnsafeRow set overflowed decimal

2020-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158553#comment-17158553
 ] 

Apache Spark commented on SPARK-32018:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/29125

> Fix UnsafeRow set overflowed decimal
> 
>
> Key: SPARK-32018
> URL: https://issues.apache.org/jira/browse/SPARK-32018
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Allison Wang
>Priority: Major
>
> There is a bug where writing an overflowed decimal into UnsafeRow succeeds, but 
> reading it back out throws ArithmeticException. The exception is thrown when 
> calling {{getDecimal}} on UnsafeRow with the input decimal's precision greater 
> than the requested precision. Setting the value of the overflowed decimal to null 
> when writing into UnsafeRow should fix this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32018) Fix UnsafeRow set overflowed decimal

2020-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158548#comment-17158548
 ] 

Apache Spark commented on SPARK-32018:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/29125

> Fix UnsafeRow set overflowed decimal
> 
>
> Key: SPARK-32018
> URL: https://issues.apache.org/jira/browse/SPARK-32018
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Allison Wang
>Priority: Major
>
> There is a bug where writing an overflowed decimal into UnsafeRow succeeds, but 
> reading it back out throws ArithmeticException. The exception is thrown when 
> calling {{getDecimal}} on UnsafeRow with the input decimal's precision greater 
> than the requested precision. Setting the value of the overflowed decimal to null 
> when writing into UnsafeRow should fix this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32140) Add summary to FMClassificationModel

2020-07-15 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao resolved SPARK-32140.

Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28960
[https://github.com/apache/spark/pull/28960]
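For reference, a hedged sketch of how the added summary is expected to be used. The member names below are assumptions modeled on the existing ML classification summaries and are not taken from the resolved patch; a spark-shell session with `spark` in scope is assumed.

{code:scala}
import org.apache.spark.ml.classification.FMClassifier

val training = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val model = new FMClassifier().setMaxIter(10).fit(training)

// `hasSummary` / `summary` are the members this ticket adds; fields such as
// `accuracy` and `objectiveHistory` are assumed to mirror other ML summaries.
if (model.hasSummary) {
  val summary = model.summary
  println(s"accuracy = ${summary.accuracy}")
  println(s"objectiveHistory = ${summary.objectiveHistory.mkString(", ")}")
}
{code}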

> Add summary to FMClassificationModel
> 
>
> Key: SPARK-32140
> URL: https://issues.apache.org/jira/browse/SPARK-32140
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32140) Add summary to FMClassificationModel

2020-07-15 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao reassigned SPARK-32140:
--

Assignee: Huaxin Gao

> Add summary to FMClassificationModel
> 
>
> Key: SPARK-32140
> URL: https://issues.apache.org/jira/browse/SPARK-32140
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32287) Flaky Test: ExecutorAllocationManagerSuite.add executors default profile

2020-07-15 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158521#comment-17158521
 ] 

Thomas Graves commented on SPARK-32287:
---

I'll try to reproduce and investigate locally

> Flaky Test: ExecutorAllocationManagerSuite.add executors default profile
> 
>
> Key: SPARK-32287
> URL: https://issues.apache.org/jira/browse/SPARK-32287
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Tests
>Affects Versions: 3.1.0
>Reporter: wuyi
>Priority: Major
>
>  This test has become flaky in GitHub Actions; see: 
> https://github.com/apache/spark/pull/29072/checks?check_run_id=861689509
> {code:java}
> [info] - add executors default profile *** FAILED *** (33 milliseconds)
> [info]   4 did not equal 2 (ExecutorAllocationManagerSuite.scala:132)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
> [info]   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
> [info]   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503)
> [info]   at 
> org.apache.spark.ExecutorAllocationManagerSuite.$anonfun$new$7(ExecutorAllocationManagerSuite.scala:132)
> [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
> [info]   at 
> org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:157)
> [info]   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
> [info]   at 
> org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
> [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286)
> [info]   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
> [info]   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
> [info]   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:59)
> [info]   at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
> [info]   at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
> [info]   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:59)
> [info]   at 
> org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
> [info]   ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32036) Remove references to "blacklist"/"whitelist" language (outside of blacklisting feature)

2020-07-15 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-32036:
-

Assignee: Erik Krogen

> Remove references to "blacklist"/"whitelist" language (outside of 
> blacklisting feature)
> ---
>
> Key: SPARK-32036
> URL: https://issues.apache.org/jira/browse/SPARK-32036
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Minor
> Fix For: 3.1.0
>
>
> As per [discussion on the Spark dev 
> list|https://lists.apache.org/thread.html/rf6b2cdcba4d3875350517a2339619e5d54e12e66626a88553f9fe275%40%3Cdev.spark.apache.org%3E],
>  it will be beneficial to remove references to problematic language that can 
> alienate potential community members. One such reference is "blacklist" and 
> "whitelist". While it seems to me that there is some valid debate as to 
> whether these terms have racist origins, the cultural connotations are 
> inescapable in today's world.
> Renaming the entire blacklisting feature would be a large effort with lots of 
> care needed to maintain public-facing APIs and configurations. Though I think 
> this will be a very rewarding effort for which I've filed SPARK-32037, I'd 
> like to start by tackling all of the other references to such terminology in 
> the codebase, of which there are still dozens or hundreds beyond the 
> blacklisting feature.
> I'm not sure what the best "Component" is for this so I put Spark Core for 
> now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32036) Remove references to "blacklist"/"whitelist" language (outside of blacklisting feature)

2020-07-15 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-32036.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

> Remove references to "blacklist"/"whitelist" language (outside of 
> blacklisting feature)
> ---
>
> Key: SPARK-32036
> URL: https://issues.apache.org/jira/browse/SPARK-32036
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Erik Krogen
>Priority: Minor
> Fix For: 3.1.0
>
>
> As per [discussion on the Spark dev 
> list|https://lists.apache.org/thread.html/rf6b2cdcba4d3875350517a2339619e5d54e12e66626a88553f9fe275%40%3Cdev.spark.apache.org%3E],
>  it will be beneficial to remove references to problematic language that can 
> alienate potential community members. One such reference is "blacklist" and 
> "whitelist". While it seems to me that there is some valid debate as to 
> whether these terms have racist origins, the cultural connotations are 
> inescapable in today's world.
> Renaming the entire blacklisting feature would be a large effort with lots of 
> care needed to maintain public-facing APIs and configurations. Though I think 
> this will be a very rewarding effort for which I've filed SPARK-32037, I'd 
> like to start by tackling all of the other references to such terminology in 
> the codebase, of which there are still dozens or hundreds beyond the 
> blacklisting feature.
> I'm not sure what the best "Component" is for this so I put Spark Core for 
> now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32321) Rollback SPARK-26267 workaround since KAFKA-7703 resolved

2020-07-15 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158484#comment-17158484
 ] 

Gabor Somogyi commented on SPARK-32321:
---

[~zsxwing] since you're the author and wrote the following:
{quote}In addition, to avoid other unknown issues, we also use the previous 
known offsets to audit the latest offsets returned by Kafka.{quote}
What do you think about this extra safety feature? Should we keep it or drop it?


> Rollback SPARK-26267 workaround since KAFKA-7703 resolved
> -
>
> Key: SPARK-32321
> URL: https://issues.apache.org/jira/browse/SPARK-32321
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32321) Rollback SPARK-26267 workaround since KAFKA-7703 resolved

2020-07-15 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158475#comment-17158475
 ] 

Gabor Somogyi commented on SPARK-32321:
---

Hopefully it will make the KafkaOffsetReader area clearer, because a couple of 
users have hit SPARK-28367.

> Rollback SPARK-26267 workaround since KAFKA-7703 resolved
> -
>
> Key: SPARK-32321
> URL: https://issues.apache.org/jira/browse/SPARK-32321
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32318) Add a test case to EliminateSortsSuite for protecting ORDER BY in DISTRIBUTE BY

2020-07-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32318.
---
Fix Version/s: 3.1.0
   2.4.7
   3.0.1
   Resolution: Fixed

Issue resolved by pull request 29118
[https://github.com/apache/spark/pull/29118]

> Add a test case to EliminateSortsSuite for protecting ORDER BY in DISTRIBUTE 
> BY
> ---
>
> Key: SPARK-32318
> URL: https://issues.apache.org/jira/browse/SPARK-32318
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.0.1, 2.4.7, 3.1.0
>
>
> This is found during reviewing SPARK-32276.
> *AFTER SPARK-32276*
> {code}
> scala> scala.util.Random.shuffle((1 to 10).map(x => (x % 2, 
> x))).toDF("a", "b").repartition(2).createOrReplaceTempView("t")
> scala> sql("select * from (select * from t order by b) distribute by 
> a").write.orc("/tmp/SPARK-32276")
> $ ls -al /tmp/SPARK-32276/
> total 632
> drwxr-xr-x  10 dongjoon  wheel 320 Jul 14 22:08 ./
> drwxrwxrwt  14 root  wheel 448 Jul 14 22:08 ../
> -rw-r--r--   1 dongjoon  wheel   8 Jul 14 22:08 ._SUCCESS.crc
> -rw-r--r--   1 dongjoon  wheel  12 Jul 14 22:08 
> .part-0-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc.crc
> -rw-r--r--   1 dongjoon  wheel1188 Jul 14 22:08 
> .part-00043-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc.crc
> -rw-r--r--   1 dongjoon  wheel1188 Jul 14 22:08 
> .part-00191-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc.crc
> -rw-r--r--   1 dongjoon  wheel   0 Jul 14 22:08 _SUCCESS
> -rw-r--r--   1 dongjoon  wheel 119 Jul 14 22:08 
> part-0-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc
> -rw-r--r--   1 dongjoon  wheel  150735 Jul 14 22:08 
> part-00043-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc
> -rw-r--r--   1 dongjoon  wheel  150741 Jul 14 22:08 
> part-00191-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc
> {code}
> *BEFORE*
> {code}
> scala> scala.util.Random.shuffle((1 to 10).map(x => (x % 2, 
> x))).toDF("a", "b").repartition(2).createOrReplaceTempView("t")
> scala> sql("select * from (select * from t order by b) distribute by 
> a").write.orc("/tmp/master")
> $ ls -al /tmp/master/
> total 56
> drwxr-xr-x  10 dongjoon  wheel  320 Jul 14 22:12 ./
> drwxrwxrwt  15 root  wheel  480 Jul 14 22:12 ../
> -rw-r--r--   1 dongjoon  wheel8 Jul 14 22:12 ._SUCCESS.crc
> -rw-r--r--   1 dongjoon  wheel   12 Jul 14 22:12 
> .part-0-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc.crc
> -rw-r--r--   1 dongjoon  wheel   16 Jul 14 22:12 
> .part-00043-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc.crc
> -rw-r--r--   1 dongjoon  wheel   16 Jul 14 22:12 
> .part-00191-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc.crc
> -rw-r--r--   1 dongjoon  wheel0 Jul 14 22:12 _SUCCESS
> -rw-r--r--   1 dongjoon  wheel  119 Jul 14 22:12 
> part-0-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc
> -rw-r--r--   1 dongjoon  wheel  932 Jul 14 22:12 
> part-00043-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc
> -rw-r--r--   1 dongjoon  wheel  939 Jul 14 22:12 
> part-00191-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32318) Add a test case to EliminateSortsSuite for protecting ORDER BY in DISTRIBUTE BY

2020-07-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-32318:
-

Assignee: Dongjoon Hyun

> Add a test case to EliminateSortsSuite for protecting ORDER BY in DISTRIBUTE 
> BY
> ---
>
> Key: SPARK-32318
> URL: https://issues.apache.org/jira/browse/SPARK-32318
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>
> This is found during reviewing SPARK-32276.
> *AFTER SPARK-32276*
> {code}
> scala> scala.util.Random.shuffle((1 to 10).map(x => (x % 2, 
> x))).toDF("a", "b").repartition(2).createOrReplaceTempView("t")
> scala> sql("select * from (select * from t order by b) distribute by 
> a").write.orc("/tmp/SPARK-32276")
> $ ls -al /tmp/SPARK-32276/
> total 632
> drwxr-xr-x  10 dongjoon  wheel 320 Jul 14 22:08 ./
> drwxrwxrwt  14 root  wheel 448 Jul 14 22:08 ../
> -rw-r--r--   1 dongjoon  wheel   8 Jul 14 22:08 ._SUCCESS.crc
> -rw-r--r--   1 dongjoon  wheel  12 Jul 14 22:08 
> .part-0-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc.crc
> -rw-r--r--   1 dongjoon  wheel1188 Jul 14 22:08 
> .part-00043-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc.crc
> -rw-r--r--   1 dongjoon  wheel1188 Jul 14 22:08 
> .part-00191-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc.crc
> -rw-r--r--   1 dongjoon  wheel   0 Jul 14 22:08 _SUCCESS
> -rw-r--r--   1 dongjoon  wheel 119 Jul 14 22:08 
> part-0-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc
> -rw-r--r--   1 dongjoon  wheel  150735 Jul 14 22:08 
> part-00043-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc
> -rw-r--r--   1 dongjoon  wheel  150741 Jul 14 22:08 
> part-00191-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc
> {code}
> *BEFORE*
> {code}
> scala> scala.util.Random.shuffle((1 to 10).map(x => (x % 2, 
> x))).toDF("a", "b").repartition(2).createOrReplaceTempView("t")
> scala> sql("select * from (select * from t order by b) distribute by 
> a").write.orc("/tmp/master")
> $ ls -al /tmp/master/
> total 56
> drwxr-xr-x  10 dongjoon  wheel  320 Jul 14 22:12 ./
> drwxrwxrwt  15 root  wheel  480 Jul 14 22:12 ../
> -rw-r--r--   1 dongjoon  wheel8 Jul 14 22:12 ._SUCCESS.crc
> -rw-r--r--   1 dongjoon  wheel   12 Jul 14 22:12 
> .part-0-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc.crc
> -rw-r--r--   1 dongjoon  wheel   16 Jul 14 22:12 
> .part-00043-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc.crc
> -rw-r--r--   1 dongjoon  wheel   16 Jul 14 22:12 
> .part-00191-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc.crc
> -rw-r--r--   1 dongjoon  wheel0 Jul 14 22:12 _SUCCESS
> -rw-r--r--   1 dongjoon  wheel  119 Jul 14 22:12 
> part-0-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc
> -rw-r--r--   1 dongjoon  wheel  932 Jul 14 22:12 
> part-00043-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc
> -rw-r--r--   1 dongjoon  wheel  939 Jul 14 22:12 
> part-00191-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32323) Javascript/HTML bug in spark application UI

2020-07-15 Thread Ihor Bobak (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ihor Bobak updated SPARK-32323:
---
Description: 
I attached a screenshot - everything is written on it.

This appeared in Spark 3.0.0 in the Firefox browser (latest version)

 

  was:
I attached a screenshot - everything is written on it.

This appeared in Spark 3.0.0

 


> Javascript/HTML bug in spark application UI
> ---
>
> Key: SPARK-32323
> URL: https://issues.apache.org/jira/browse/SPARK-32323
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 3.0.0
> Environment: Ubuntu 18,  Spark 3.0.0 standalone cluster
>Reporter: Ihor Bobak
>Priority: Major
> Attachments: 2020-07-15 16_36_31-pyspark-shell - Spark Jobs.png
>
>
> I attached a screenshot - everything is written on it.
> This appeared in Spark 3.0.0 in the Firefox browser (latest version)
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32323) Javascript/HTML bug in spark application UI

2020-07-15 Thread Ihor Bobak (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ihor Bobak updated SPARK-32323:
---
Attachment: 2020-07-15 16_36_31-pyspark-shell - Spark Jobs.png

> Javascript/HTML bug in spark application UI
> ---
>
> Key: SPARK-32323
> URL: https://issues.apache.org/jira/browse/SPARK-32323
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 3.0.0
> Environment: Ubuntu 18,  Spark 3.0.0 standalone cluster
>Reporter: Ihor Bobak
>Priority: Major
> Attachments: 2020-07-15 16_36_31-pyspark-shell - Spark Jobs.png
>
>
> I attached a screenshot - everything is written on it.
> This appeared in Spark 3.0.0
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32323) Javascript/HTML bug in spark application UI

2020-07-15 Thread Ihor Bobak (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ihor Bobak updated SPARK-32323:
---
Description: 
I attached a screenshot - everything is written on it.

This appeared in Spark 3.0.0

 

  was:
I attached a screenshot - everything is written on it.

This appeared in Spark 3.0.0

!image-2020-07-15-16-40-42-328.png!


> Javascript/HTML bug in spark application UI
> ---
>
> Key: SPARK-32323
> URL: https://issues.apache.org/jira/browse/SPARK-32323
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 3.0.0
> Environment: Ubuntu 18,  Spark 3.0.0 standalone cluster
>Reporter: Ihor Bobak
>Priority: Major
>
> I attached a screenshot - everything is written on it.
> This appeared in Spark 3.0.0
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32323) Javascript/HTML bug in spark application UI

2020-07-15 Thread Ihor Bobak (Jira)
Ihor Bobak created SPARK-32323:
--

 Summary: Javascript/HTML bug in spark application UI
 Key: SPARK-32323
 URL: https://issues.apache.org/jira/browse/SPARK-32323
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 3.0.0
 Environment: Ubuntu 18,  Spark 3.0.0 standalone cluster
Reporter: Ihor Bobak


I attached a screenshot - everything is written on it.

This appeared in Spark 3.0.0

!image-2020-07-15-16-40-42-328.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32322) Pyspark not launching in Spark IPV6 environment

2020-07-15 Thread pavithra ramachandran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158165#comment-17158165
 ] 

pavithra ramachandran commented on SPARK-32322:
---

I would like to check this.

> Pyspark not launching in Spark IPV6 environment
> ---
>
> Key: SPARK-32322
> URL: https://issues.apache.org/jira/browse/SPARK-32322
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: jobit mathew
>Priority: Minor
>
> pyspark  is not launching in Spark IPV6 environment.
> Initial analysis looks like python is not supporting IPV6.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32322) Pyspark not launching in Spark IPV6 environment

2020-07-15 Thread jobit mathew (Jira)
jobit mathew created SPARK-32322:


 Summary: Pyspark not launching in Spark IPV6 environment
 Key: SPARK-32322
 URL: https://issues.apache.org/jira/browse/SPARK-32322
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.1.0
Reporter: jobit mathew


pyspark  is not launching in Spark IPV6 environment.

Initial analysis looks like python is not supporting IPV6.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32321) Rollback SPARK-26267 workaround since KAFKA-7703 resolved

2020-07-15 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158161#comment-17158161
 ] 

Gabor Somogyi commented on SPARK-32321:
---

I've started to work on this

> Rollback SPARK-26267 workaround since KAFKA-7703 resolved
> -
>
> Key: SPARK-32321
> URL: https://issues.apache.org/jira/browse/SPARK-32321
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32321) Rollback SPARK-26267 workaround since KAFKA-7703 resolved

2020-07-15 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158160#comment-17158160
 ] 

Gabor Somogyi commented on SPARK-32321:
---

FYI [~zsxwing]

> Rollback SPARK-26267 workaround since KAFKA-7703 resolved
> -
>
> Key: SPARK-32321
> URL: https://issues.apache.org/jira/browse/SPARK-32321
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32321) Rollback SPARK-26267 workaround since KAFKA-7703 resolved

2020-07-15 Thread Gabor Somogyi (Jira)
Gabor Somogyi created SPARK-32321:
-

 Summary: Rollback SPARK-26267 workaround since KAFKA-7703 resolved
 Key: SPARK-32321
 URL: https://issues.apache.org/jira/browse/SPARK-32321
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.1.0
Reporter: Gabor Somogyi






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32281) Spark wipes out SORTED spec in metastore when DESC is used

2020-07-15 Thread Ankit Raj Boudh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158125#comment-17158125
 ] 

Ankit Raj Boudh commented on SPARK-32281:
-

[~bersprockets], I will raise a PR for this soon.

> Spark wipes out SORTED spec in metastore when DESC is used
> --
>
> Key: SPARK-32281
> URL: https://issues.apache.org/jira/browse/SPARK-32281
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> When altering a Hive bucketed table or updating its statistics, Spark will 
> wipe out the SORTED specification in the metastore if the specification uses 
> DESC.
>  For example:
> {noformat}
> 0: jdbc:hive2://localhost:1> -- in beeline
> 0: jdbc:hive2://localhost:1> create table bucketed (a int, b int, c int, 
> d int) clustered by (c) sorted by (c asc, d desc) into 10 buckets;
> No rows affected (0.045 seconds)
> 0: jdbc:hive2://localhost:1> show create table bucketed;
> ++
> |   createtab_stmt   |
> ++
> | CREATE TABLE `bucketed`(   |
> |   `a` int, |
> |   `b` int, |
> |   `c` int, |
> |   `d` int) |
> | CLUSTERED BY ( |
> |   c)   |
> | SORTED BY (|
> |   c ASC,   |
> |   d DESC)  |
> | INTO 10 BUCKETS|
> | ROW FORMAT SERDE   |
> |   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
> | STORED AS INPUTFORMAT  |
> |   'org.apache.hadoop.mapred.TextInputFormat'   |
> | OUTPUTFORMAT   |
> |   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
> | LOCATION   |
> |   'file:/Users/bruce/hadoop/apache-hive-2.3.7-bin/warehouse/bucketed' |
> | TBLPROPERTIES (|
> |   'transient_lastDdlTime'='1594488043')|
> ++
> 21 rows selected (0.042 seconds)
> 0: jdbc:hive2://localhost:1> 
> -
> -
> -
> scala> // in spark
> scala> sql("alter table bucketed set tblproperties ('foo'='bar')")
> 20/07/11 10:21:36 WARN HiveConf: HiveConf of name hive.metastore.local does 
> not exist
> 20/07/11 10:21:38 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> res0: org.apache.spark.sql.DataFrame = []
> scala> 
> -
> -
> -
> 0: jdbc:hive2://localhost:1> -- back in beeline
> 0: jdbc:hive2://localhost:1> show create table bucketed;
> ++
> |   createtab_stmt   |
> ++
> | CREATE TABLE `bucketed`(   |
> |   `a` int, |
> |   `b` int, |
> |   `c` int, |
> |   `d` int) |
> | CLUSTERED BY ( |
> |   c)   |
> | INTO 10 BUCKETS|
> | ROW FORMAT SERDE   |
> |   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
> | STORED AS INPUTFORMAT  |
> |   'org.apache.hadoop.mapred.TextInputFormat'   |
> | OUTPUTFORMAT   |
> |   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
> | LOCATION   |
> |   'file:/Users/bruce/hadoop/apache-hive-2.3.7-bin/warehouse/bucketed' |
> | TBLPROPERTIES (|
> |   'foo'='bar', |
> |   'spark.sql.partitionProvider'='catalog', |
> |   'transient_lastDdlTime'='1594488098')|
> ++
> 20 rows selected (0.038 seconds)
> 0: jdbc:hive2://localhost:1> 
> {noformat}
> Note that the SORTED specification disappears.
> Another example, this time using insert:
> {noformat}
> 0: jdbc:hive2://localhost:1> -- in beeline
> 0: jdbc:hive2://localhost:1> create table bucketed (a int,

[jira] [Assigned] (SPARK-31168) Upgrade Scala to 2.12.12

2020-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31168:


Assignee: Apache Spark

> Upgrade Scala to 2.12.12
> 
>
> Key: SPARK-31168
> URL: https://issues.apache.org/jira/browse/SPARK-31168
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> h2. Highlights
>  * Performance improvements in the collections library: algorithmic 
> improvements and changes to avoid unnecessary allocations ([list of 
> PRs|https://github.com/scala/scala/pulls?q=is%3Apr+milestone%3A2.12.11+is%3Aclosed+sort%3Acreated-desc+label%3Alibrary%3Acollections+label%3Aperformance])
>  * Performance improvements in the compiler ([list of 
> PRs|https://github.com/scala/scala/pulls?q=is%3Apr+milestone%3A2.12.11+is%3Aclosed+sort%3Acreated-desc+-label%3Alibrary%3Acollections+label%3Aperformance+],
>  minor [effects in our 
> benchmarks|https://scala-ci.typesafe.com/grafana/dashboard/db/scala-benchmark?orgId=1&from=1567985515850&to=1584355915694&var-branch=2.12.x&var-source=All&var-bench=HotScalacBenchmark.compile&var-host=scalabench@scalabench@])
>  * Improvements to {{-Yrepl-class-based}}, an alternative internal REPL 
> encoding that avoids deadlocks (details on 
> [#8712|https://github.com/scala/scala/pull/8712])
>  * A new {{-Yrepl-use-magic-imports}} flag that avoids deep class nesting in 
> the REPL, which can lead to deteriorating performance in long sessions 
> ([#8576|https://github.com/scala/scala/pull/8576])
>  * Fix some {{toX}} methods that could expose the underlying mutability of a 
> {{ListBuffer}}-generated collection 
> ([#8674|https://github.com/scala/scala/pull/8674])
> h3. JDK 9+ support
>  * ASM was upgraded to 7.3.1, allowing the optimizer to run on JDK 13+ 
> ([#8676|https://github.com/scala/scala/pull/8676])
>  * {{:javap}} in the REPL now works on JDK 9+ 
> ([#8400|https://github.com/scala/scala/pull/8400])
> h3. Other changes
>  * Support new labels for creating durations for consistency: 
> {{Duration("1m")}}, {{Duration("3 hrs")}} 
> ([#8325|https://github.com/scala/scala/pull/8325], 
> [#8450|https://github.com/scala/scala/pull/8450])
>  * Fix memory leak in runtime reflection's {{TypeTag}} caches 
> ([#8470|https://github.com/scala/scala/pull/8470]) and some thread safety 
> issues in runtime reflection 
> ([#8433|https://github.com/scala/scala/pull/8433])
>  * When using compiler plugins, the ordering of compiler phases may change 
> due to [#8427|https://github.com/scala/scala/pull/8427]
> For more details, see [https://github.com/scala/scala/releases/tag/v2.12.11].
>  
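> The new duration labels mentioned above can be exercised directly; a minimal sketch in plain Scala, assuming 2.12.11+ is on the classpath:
> {code:scala}
> import scala.concurrent.duration.Duration
> 
> // Label spellings accepted by the parser per the release notes quoted above
> val oneMinute = Duration("1m")
> val threeHours = Duration("3 hrs")
> println(oneMinute.toSeconds)   // 60
> println(threeHours.toMinutes)  // 180
> {code}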



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31168) Upgrade Scala to 2.12.12

2020-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31168:


Assignee: (was: Apache Spark)

> Upgrade Scala to 2.12.12
> 
>
> Key: SPARK-31168
> URL: https://issues.apache.org/jira/browse/SPARK-31168
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> h2. Highlights
>  * Performance improvements in the collections library: algorithmic 
> improvements and changes to avoid unnecessary allocations ([list of 
> PRs|https://github.com/scala/scala/pulls?q=is%3Apr+milestone%3A2.12.11+is%3Aclosed+sort%3Acreated-desc+label%3Alibrary%3Acollections+label%3Aperformance])
>  * Performance improvements in the compiler ([list of 
> PRs|https://github.com/scala/scala/pulls?q=is%3Apr+milestone%3A2.12.11+is%3Aclosed+sort%3Acreated-desc+-label%3Alibrary%3Acollections+label%3Aperformance+],
>  minor [effects in our 
> benchmarks|https://scala-ci.typesafe.com/grafana/dashboard/db/scala-benchmark?orgId=1&from=1567985515850&to=1584355915694&var-branch=2.12.x&var-source=All&var-bench=HotScalacBenchmark.compile&var-host=scalabench@scalabench@])
>  * Improvements to {{-Yrepl-class-based}}, an alternative internal REPL 
> encoding that avoids deadlocks (details on 
> [#8712|https://github.com/scala/scala/pull/8712])
>  * A new {{-Yrepl-use-magic-imports}} flag that avoids deep class nesting in 
> the REPL, which can lead to deteriorating performance in long sessions 
> ([#8576|https://github.com/scala/scala/pull/8576])
>  * Fix some {{toX}} methods that could expose the underlying mutability of a 
> {{ListBuffer}}-generated collection 
> ([#8674|https://github.com/scala/scala/pull/8674])
> h3. JDK 9+ support
>  * ASM was upgraded to 7.3.1, allowing the optimizer to run on JDK 13+ 
> ([#8676|https://github.com/scala/scala/pull/8676])
>  * {{:javap}} in the REPL now works on JDK 9+ 
> ([#8400|https://github.com/scala/scala/pull/8400])
> h3. Other changes
>  * Support new labels for creating durations for consistency: 
> {{Duration("1m")}}, {{Duration("3 hrs")}} 
> ([#8325|https://github.com/scala/scala/pull/8325], 
> [#8450|https://github.com/scala/scala/pull/8450])
>  * Fix memory leak in runtime reflection's {{TypeTag}} caches 
> ([#8470|https://github.com/scala/scala/pull/8470]) and some thread safety 
> issues in runtime reflection 
> ([#8433|https://github.com/scala/scala/pull/8433])
>  * When using compiler plugins, the ordering of compiler phases may change 
> due to [#8427|https://github.com/scala/scala/pull/8427]
> For more details, see [https://github.com/scala/scala/releases/tag/v2.12.11].
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31168) Upgrade Scala to 2.12.12

2020-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158056#comment-17158056
 ] 

Apache Spark commented on SPARK-31168:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/29124

> Upgrade Scala to 2.12.12
> 
>
> Key: SPARK-31168
> URL: https://issues.apache.org/jira/browse/SPARK-31168
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> h2. Highlights
>  * Performance improvements in the collections library: algorithmic 
> improvements and changes to avoid unnecessary allocations ([list of 
> PRs|https://github.com/scala/scala/pulls?q=is%3Apr+milestone%3A2.12.11+is%3Aclosed+sort%3Acreated-desc+label%3Alibrary%3Acollections+label%3Aperformance])
>  * Performance improvements in the compiler ([list of 
> PRs|https://github.com/scala/scala/pulls?q=is%3Apr+milestone%3A2.12.11+is%3Aclosed+sort%3Acreated-desc+-label%3Alibrary%3Acollections+label%3Aperformance+],
>  minor [effects in our 
> benchmarks|https://scala-ci.typesafe.com/grafana/dashboard/db/scala-benchmark?orgId=1&from=1567985515850&to=1584355915694&var-branch=2.12.x&var-source=All&var-bench=HotScalacBenchmark.compile&var-host=scalabench@scalabench@])
>  * Improvements to {{-Yrepl-class-based}}, an alternative internal REPL 
> encoding that avoids deadlocks (details on 
> [#8712|https://github.com/scala/scala/pull/8712])
>  * A new {{-Yrepl-use-magic-imports}} flag that avoids deep class nesting in 
> the REPL, which can lead to deteriorating performance in long sessions 
> ([#8576|https://github.com/scala/scala/pull/8576])
>  * Fix some {{toX}} methods that could expose the underlying mutability of a 
> {{ListBuffer}}-generated collection 
> ([#8674|https://github.com/scala/scala/pull/8674])
> h3. JDK 9+ support
>  * ASM was upgraded to 7.3.1, allowing the optimizer to run on JDK 13+ 
> ([#8676|https://github.com/scala/scala/pull/8676])
>  * {{:javap}} in the REPL now works on JDK 9+ 
> ([#8400|https://github.com/scala/scala/pull/8400])
> h3. Other changes
>  * Support new labels for creating durations for consistency: 
> {{Duration("1m")}}, {{Duration("3 hrs")}} 
> ([#8325|https://github.com/scala/scala/pull/8325], 
> [#8450|https://github.com/scala/scala/pull/8450])
>  * Fix memory leak in runtime reflection's {{TypeTag}} caches 
> ([#8470|https://github.com/scala/scala/pull/8470]) and some thread safety 
> issues in runtime reflection 
> ([#8433|https://github.com/scala/scala/pull/8433])
>  * When using compiler plugins, the ordering of compiler phases may change 
> due to [#8427|https://github.com/scala/scala/pull/8427]
> For more details, see [https://github.com/scala/scala/releases/tag/v2.12.11].
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31480) Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node

2020-07-15 Thread Dilip Biswal (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dilip Biswal reassigned SPARK-31480:


Assignee: Dilip Biswal

> Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node
> ---
>
> Key: SPARK-31480
> URL: https://issues.apache.org/jira/browse/SPARK-31480
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Dilip Biswal
>Priority: Major
>
> Below is the EXPLAIN output when using *DSV2*.
> *Output of EXPLAIN EXTENDED* 
> {code:java}
> +- BatchScan[col.dots#39L] JsonScan DataFilters: [isnotnull(col.dots#39L), 
> (col.dots#39L = 500)], Location: 
> InMemoryFileIndex[file:/private/var/folders/nr/j6hw4kr51wv0zynvr6srwgr0gp/T/spark-7dad6f63-dc...,
>  PartitionFilters: [], ReadSchema: struct
> {code}
> *Output of EXPLAIN FORMATTED* 
> {code:java}
>  (1) BatchScan
> Output [1]: [col.dots#39L]
> Arguments: [col.dots#39L], 
> JsonScan(org.apache.spark.sql.test.TestSparkSession@45eab322,org.apache.spark.sql.execution.datasources.InMemoryFileIndex@72065f16,StructType(StructField(col.dots,LongType,true)),StructType(StructField(col.dots,LongType,true)),StructType(),org.apache.spark.sql.util.CaseInsensitiveStringMap@8822c5e0,Vector(),List(isnotnull(col.dots#39L),
>  (col.dots#39L = 500)))
> {code}
> When using *DSV1*, the output is much cleaner than the output of DSV2, 
> especially for EXPLAIN FORMATTED.
> *Output of EXPLAIN EXTENDED* 
> {code:java}
> +- FileScan json [col.dots#37L] Batched: false, DataFilters: 
> [isnotnull(col.dots#37L), (col.dots#37L = 500)], Format: JSON, Location: 
> InMemoryFileIndex[file:/private/var/folders/nr/j6hw4kr51wv0zynvr6srwgr0gp/T/spark-89021d76-59...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(`col.dots`), 
> EqualTo(`col.dots`,500)], ReadSchema: struct 
> {code}
> *Output of EXPLAIN FORMATTED* 
> {code:java}
>  (1) Scan json 
> Output [1]: [col.dots#37L]
> Batched: false
> Location: InMemoryFileIndex 
> [file:/private/var/folders/nr/j6hw4kr51wv0zynvr6srwgr0gp/T/spark-89021d76-5971-4a96-bf10-0730873f6ce0]
> PushedFilters: [IsNotNull(`col.dots`), EqualTo(`col.dots`,500)]
> ReadSchema: struct{code}
>  
>  
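> For reference, a minimal sketch of producing such plans in the spark-shell (the path and data below are only illustrative, not taken from this report):
> {code:scala}
> // spark-shell; spark.implicits._ is in scope for the $ syntax
> val df = spark.read.json("/tmp/dots.json")             // hypothetical JSON file with a "col.dots" column
> df.filter($"`col.dots`" === 500).explain("extended")   // DSV1 vs. DSV2 output depends on the configured source
> df.filter($"`col.dots`" === 500).explain("formatted")  // Dataset.explain(mode) is available since Spark 3.0
> {code}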



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31480) Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node

2020-07-15 Thread Dilip Biswal (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dilip Biswal resolved SPARK-31480.
--
Resolution: Fixed

> Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node
> ---
>
> Key: SPARK-31480
> URL: https://issues.apache.org/jira/browse/SPARK-31480
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Dilip Biswal
>Priority: Major
>
> Below is the EXPLAIN output when using *DSV2*.
> *Output of EXPLAIN EXTENDED* 
> {code:java}
> +- BatchScan[col.dots#39L] JsonScan DataFilters: [isnotnull(col.dots#39L), 
> (col.dots#39L = 500)], Location: 
> InMemoryFileIndex[file:/private/var/folders/nr/j6hw4kr51wv0zynvr6srwgr0gp/T/spark-7dad6f63-dc...,
>  PartitionFilters: [], ReadSchema: struct
> {code}
> *Output of EXPLAIN FORMATTED* 
> {code:java}
>  (1) BatchScan
> Output [1]: [col.dots#39L]
> Arguments: [col.dots#39L], 
> JsonScan(org.apache.spark.sql.test.TestSparkSession@45eab322,org.apache.spark.sql.execution.datasources.InMemoryFileIndex@72065f16,StructType(StructField(col.dots,LongType,true)),StructType(StructField(col.dots,LongType,true)),StructType(),org.apache.spark.sql.util.CaseInsensitiveStringMap@8822c5e0,Vector(),List(isnotnull(col.dots#39L),
>  (col.dots#39L = 500)))
> {code}
> When using *DSV1*, the output is much cleaner than the output of DSV2, 
> especially for EXPLAIN FORMATTED.
> *Output of EXPLAIN EXTENDED* 
> {code:java}
> +- FileScan json [col.dots#37L] Batched: false, DataFilters: 
> [isnotnull(col.dots#37L), (col.dots#37L = 500)], Format: JSON, Location: 
> InMemoryFileIndex[file:/private/var/folders/nr/j6hw4kr51wv0zynvr6srwgr0gp/T/spark-89021d76-59...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(`col.dots`), 
> EqualTo(`col.dots`,500)], ReadSchema: struct 
> {code}
> *Output of EXPLAIN FORMATTED* 
> {code:java}
>  (1) Scan json 
> Output [1]: [col.dots#37L]
> Batched: false
> Location: InMemoryFileIndex 
> [file:/private/var/folders/nr/j6hw4kr51wv0zynvr6srwgr0gp/T/spark-89021d76-5971-4a96-bf10-0730873f6ce0]
> PushedFilters: [IsNotNull(`col.dots`), EqualTo(`col.dots`,500)]
> ReadSchema: struct{code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32283) Multiple Kryo registrators can't be used anymore

2020-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32283:


Assignee: (was: Apache Spark)

> Multiple Kryo registrators can't be used anymore
> 
>
> Key: SPARK-32283
> URL: https://issues.apache.org/jira/browse/SPARK-32283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Lorenz Bühmann
>Priority: Minor
>
> This is a regression in Spark 3.0; it works in Spark 2.
> According to the docs, it should be possible to register multiple Kryo 
> registrators via the Spark config option spark.kryo.registrator.
> In Spark 3.0 the code to parse Kryo config options has been refactored into 
> Scala class 
> [Kryo|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/Kryo.scala].
>  The code to parse the registrators is in [Line 
> 29-32|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/Kryo.scala#L29-L32]
> {code:scala}
> val KRYO_USER_REGISTRATORS = ConfigBuilder("spark.kryo.registrator")
> .version("0.5.0")
> .stringConf
> .createOptional
> {code}
> but it should be
> {code:scala}
> val KRYO_USER_REGISTRATORS = ConfigBuilder("spark.kryo.registrator")
> .version("0.5.0")
> .stringConf
> .toSequence
> .createOptional
> {code}
>  to split the comma-separated list.
> In previous Spark 2.x it was done differently directly in [KryoSerializer 
> Line 
> 77-79|https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala#L77-L79]
> {code:scala}
> private val userRegistrators = conf.get("spark.kryo.registrator", "")
> .split(',').map(_.trim)
> .filter(!_.isEmpty)
> {code}
> Hope this helps.
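> For context, a minimal sketch of the configuration the docs describe (the registrator class names here are hypothetical):
> {code:scala}
> import org.apache.spark.SparkConf
> 
> val conf = new SparkConf()
>   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>   // Comma-separated list of registrators, as documented; this is the value that 3.0.0 no longer splits
>   .set("spark.kryo.registrator", "com.example.RegistratorA,com.example.RegistratorB")
> {code}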



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32283) Multiple Kryo registrators can't be used anymore

2020-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32283:


Assignee: Apache Spark

> Multiple Kryo registrators can't be used anymore
> 
>
> Key: SPARK-32283
> URL: https://issues.apache.org/jira/browse/SPARK-32283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Lorenz Bühmann
>Assignee: Apache Spark
>Priority: Minor
>
> This is a regression in Spark 3.0; it works in Spark 2.
> According to the docs, it should be possible to register multiple Kryo 
> registrators via the Spark config option spark.kryo.registrator.
> In Spark 3.0 the code to parse Kryo config options has been refactored into 
> Scala class 
> [Kryo|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/Kryo.scala].
>  The code to parse the registrators is in [Line 
> 29-32|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/Kryo.scala#L29-L32]
> {code:scala}
> val KRYO_USER_REGISTRATORS = ConfigBuilder("spark.kryo.registrator")
> .version("0.5.0")
> .stringConf
> .createOptional
> {code}
> but it should be
> {code:scala}
> val KRYO_USER_REGISTRATORS = ConfigBuilder("spark.kryo.registrator")
> .version("0.5.0")
> .stringConf
> .toSequence
> .createOptional
> {code}
>  to split the comma-separated list.
> In previous Spark 2.x it was done differently directly in [KryoSerializer 
> Line 
> 77-79|https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala#L77-L79]
> {code:scala}
> private val userRegistrators = conf.get("spark.kryo.registrator", "")
> .split(',').map(_.trim)
> .filter(!_.isEmpty)
> {code}
> Hope this helps.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32283) Multiple Kryo registrators can't be used anymore

2020-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158034#comment-17158034
 ] 

Apache Spark commented on SPARK-32283:
--

User 'LantaoJin' has created a pull request for this issue:
https://github.com/apache/spark/pull/29123

> Multiple Kryo registrators can't be used anymore
> 
>
> Key: SPARK-32283
> URL: https://issues.apache.org/jira/browse/SPARK-32283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Lorenz Bühmann
>Priority: Minor
>
> This is a regression in Spark 3.0; it works in Spark 2.
> According to the docs, it should be possible to register multiple Kryo 
> registrators via the Spark config option spark.kryo.registrator.
> In Spark 3.0 the code to parse Kryo config options has been refactored into 
> Scala class 
> [Kryo|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/Kryo.scala].
>  The code to parse the registrators is in [Line 
> 29-32|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/Kryo.scala#L29-L32]
> {code:scala}
> val KRYO_USER_REGISTRATORS = ConfigBuilder("spark.kryo.registrator")
> .version("0.5.0")
> .stringConf
> .createOptional
> {code}
> but it should be
> {code:scala}
> val KRYO_USER_REGISTRATORS = ConfigBuilder("spark.kryo.registrator")
> .version("0.5.0")
> .stringConf
> .toSequence
> .createOptional
> {code}
>  to split the comma-separated list.
> In previous Spark 2.x it was done differently directly in [KryoSerializer 
> Line 
> 77-79|https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala#L77-L79]
> {code:scala}
> private val userRegistrators = conf.get("spark.kryo.registrator", "")
> .split(',').map(_.trim)
> .filter(!_.isEmpty)
> {code}
> Hope this helps.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32283) Multiple Kryo registrators can't be used anymore

2020-07-15 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158018#comment-17158018
 ] 

Lantao Jin commented on SPARK-32283:


Thanks for reporting this. Will file a patch.

> Multiple Kryo registrators can't be used anymore
> 
>
> Key: SPARK-32283
> URL: https://issues.apache.org/jira/browse/SPARK-32283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Lorenz Bühmann
>Priority: Minor
>
> This is a regression in Spark 3.0; it works in Spark 2.
> According to the docs, it should be possible to register multiple Kryo 
> registrators via the Spark config option spark.kryo.registrator.
> In Spark 3.0 the code to parse Kryo config options has been refactored into 
> Scala class 
> [Kryo|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/Kryo.scala].
>  The code to parse the registrators is in [Line 
> 29-32|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/Kryo.scala#L29-L32]
> {code:scala}
> val KRYO_USER_REGISTRATORS = ConfigBuilder("spark.kryo.registrator")
> .version("0.5.0")
> .stringConf
> .createOptional
> {code}
> but it should be
> {code:scala}
> val KRYO_USER_REGISTRATORS = ConfigBuilder("spark.kryo.registrator")
> .version("0.5.0")
> .stringConf
> .toSequence
> .createOptional
> {code}
>  to split the comma-separated list.
> In previous Spark 2.x it was done differently directly in [KryoSerializer 
> Line 
> 77-79|https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala#L77-L79]
> {code:scala}
> private val userRegistrators = conf.get("spark.kryo.registrator", "")
> .split(',').map(_.trim)
> .filter(!_.isEmpty)
> {code}
> Hope this helps.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28367) Kafka connector infinite wait because metadata never updated

2020-07-15 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-28367:
--
Affects Version/s: 3.1.0

> Kafka connector infinite wait because metadata never updated
> 
>
> Key: SPARK-28367
> URL: https://issues.apache.org/jira/browse/SPARK-28367
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.3, 2.2.3, 2.3.3, 2.4.3, 3.0.0, 3.1.0
>Reporter: Gabor Somogyi
>Priority: Critical
>
> Spark uses an old, deprecated API, poll(long), which may never return and can stay 
> in a live lock if metadata is never updated (for instance, when the broker 
> disappears at consumer creation).
> I've created a small standalone application to test it and the alternatives: 
> https://github.com/gaborgsomogyi/kafka-get-assignment
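> For illustration, a minimal sketch of the non-deprecated consumer API (broker address, group id, and topic are placeholders):
> {code:scala}
> import java.time.Duration
> import java.util.Properties
> import org.apache.kafka.clients.consumer.KafkaConsumer
> 
> val props = new Properties()
> props.put("bootstrap.servers", "localhost:9092")  // placeholder
> props.put("group.id", "test-group")               // placeholder
> props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
> props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
> 
> val consumer = new KafkaConsumer[String, String](props)
> consumer.subscribe(java.util.Collections.singletonList("topic"))
> // poll(java.time.Duration) (Kafka 2.0+) honours the timeout even while metadata is
> // unavailable, unlike the deprecated poll(long), which can block indefinitely.
> val records = consumer.poll(Duration.ofSeconds(5))
> {code}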



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32320) Remove mutable default arguments

2020-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157996#comment-17157996
 ] 

Apache Spark commented on SPARK-32320:
--

User 'Fokko' has created a pull request for this issue:
https://github.com/apache/spark/pull/29122

> Remove mutable default arguments
> 
>
> Key: SPARK-32320
> URL: https://issues.apache.org/jira/browse/SPARK-32320
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Fokko Driesprong
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32320) Remove mutable default arguments

2020-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32320:


Assignee: Apache Spark

> Remove mutable default arguments
> 
>
> Key: SPARK-32320
> URL: https://issues.apache.org/jira/browse/SPARK-32320
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Fokko Driesprong
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32320) Remove mutable default arguments

2020-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32320:


Assignee: (was: Apache Spark)

> Remove mutable default arguments
> 
>
> Key: SPARK-32320
> URL: https://issues.apache.org/jira/browse/SPARK-32320
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Fokko Driesprong
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32320) Remove mutable default arguments

2020-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157995#comment-17157995
 ] 

Apache Spark commented on SPARK-32320:
--

User 'Fokko' has created a pull request for this issue:
https://github.com/apache/spark/pull/29122

> Remove mutable default arguments
> 
>
> Key: SPARK-32320
> URL: https://issues.apache.org/jira/browse/SPARK-32320
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Fokko Driesprong
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32320) Remove mutable default arguments

2020-07-15 Thread Fokko Driesprong (Jira)
Fokko Driesprong created SPARK-32320:


 Summary: Remove mutable default arguments
 Key: SPARK-32320
 URL: https://issues.apache.org/jira/browse/SPARK-32320
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Fokko Driesprong






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32271) Add option for k-fold cross-validation to CrossValidator

2020-07-15 Thread Austin Jordan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Austin Jordan updated SPARK-32271:
--
Summary: Add option for k-fold cross-validation to CrossValidator  (was: 
Update CrossValidator to parallelize fit method across folds)

> Add option for k-fold cross-validation to CrossValidator
> 
>
> Key: SPARK-32271
> URL: https://issues.apache.org/jira/browse/SPARK-32271
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: Austin Jordan
>Priority: Minor
>
> *What changes were proposed in this pull request?*
> I have added a `method` parameter to `CrossValidator.scala` to allow the user 
> to choose between repeated random sub-sampling cross-validation (current 
> behavior) and _k_-fold cross-validation (optional new behavior). The default 
> method is random sub-sampling cross-validation.
> If _k_-fold cross-validation is chosen, the new behavior is as follows:
>  # Instead of splitting the input dataset into _k_ training and validation 
> sets, I split them into _k_ folds; for each fold of training, one of the _k_ 
> splits is selected for validation, and the others are unioned together for 
> training.
>  # Instead of caching each training and validation set _k_ times, I cache 
> each of the folds once.
>  # Instead of waiting for every model to finish training on fold _n_ before 
> moving on to fold _n+1_, new fold/model combinations will be trained as soon 
> as resources are available.
>  # Instead of creating one `Future` per model for each fold in series, all 
> `Future`s for each fold & parameter grid pair are created and trained in 
> parallel.
>  # A new `Int` parameter is added to the `Future` (now `Future[Int, Double]` 
> instead of `Future[Double]`) in order to keep track of which `Future` belongs 
> to which parameter grid.
> *Why are the changes needed?*
> These changes allow the user to choose between repeated random sub-sampling 
> cross-validation (current behavior) and _k_-fold cross-validation (optional 
> new behavior). These changes:
>  1. allow the user to choose between two types of cross-validation.
>  2. (If _k_-fold is chosen) only require caching the entire dataset once 
> (instead of _k_ times in repeated random sub-sampling cross-validation, as it 
> does now).
>  3. (if _k_-fold is chosen) free resources to train new model/fold 
> combinations as soon as the previous one finishes. Currently, a model can 
> only train one fold at a time. If _k_-fold is chosen, the added functionality 
> will allow the `fit` to train multiple folds at once for the same model, and, 
> in the case of a grid search, allow it to train multiple model/fold 
> combinations at once, without needing to wait for the slowest model to fit 
> the first fold before moving onto the second.
> *Does this PR introduce _any_ user-facing change?*
> Yes. This PR introduces the `setMethod` method to `CrossValidator`. If the 
> `method` parameter is not set, the behavior will be the same as it has always 
> been.
> *How was this patch tested?*
> Unit tests will be added.
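> A sketch of how the proposed option might look from user code ({{setMethod}} and its value are taken from this proposal, not from a released API):
> {code:scala}
> import org.apache.spark.ml.classification.LogisticRegression
> import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
> import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
> 
> val lr = new LogisticRegression()
> val grid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.01, 0.1)).build()
> 
> val cv = new CrossValidator()
>   .setEstimator(lr)
>   .setEvaluator(new BinaryClassificationEvaluator)
>   .setEstimatorParamMaps(grid)
>   .setNumFolds(10)
>   .setMethod("kfold")  // proposed; omitting it keeps the current random sub-sampling behavior
> {code}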



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32271) Update CrossValidator to parallelize fit method across folds

2020-07-15 Thread Austin Jordan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Austin Jordan updated SPARK-32271:
--
Description: 
*What changes were proposed in this pull request?*

I have added a `method` parameter to `CrossValidator.scala` to allow the user 
to choose between repeated random sub-sampling cross-validation (current 
behavior) and _k_-fold cross-validation (optional new behavior). The default 
method is random sub-sampling cross-validation.

If _k_-fold cross-validation is chosen, the new behavior is as follows:
 # Instead of splitting the input dataset into _k_ training and validation 
sets, I split them into _k_ folds; for each fold of training, one of the _k_ 
splits is selected for validation, and the others are unioned together for 
training.
 # Instead of caching each training and validation set _k_ times, I cache each 
of the folds once.
 # Instead of waiting for every model to finish training on fold _n_ before 
moving on to fold _n+1_, new fold/model combinations will be trained as soon as 
resources are available.
 # Instead of creating one `Future` per model for each fold in series, all 
`Future`s for each fold & parameter grid pair are created and trained in 
parallel.
 # A new `Int` parameter is added to the `Future` (now `Future[Int, Double]` 
instead of `Future[Double]`) in order to keep track of which `Future` belongs 
to which parameter grid.

*Why are the changes needed?*

These changes allow the user to choose between repeated random sub-sampling 
cross-validation (current behavior) and _k_-fold cross-validation (optional new 
behavior). These changes:
 1. allow the user to choose between two types of cross-validation.
 2. (If _k_-fold is chosen) only require caching the entire dataset once 
(instead of _k_ times in repeated random sub-sampling cross-validation, as it 
does now).
 3. (if _k_-fold is chosen) free resources to train new model/fold combinations 
as soon as the previous one finishes. Currently, a model can only train one 
fold at a time. If _k_-fold is chosen, the added functionality will allow the 
`fit` to train multiple folds at once for the same model, and, in the case of a 
grid search, allow it to train multiple model/fold combinations at once, 
without needing to wait for the slowest model to fit the first fold before 
moving onto the second.

*Does this PR introduce _any_ user-facing change?*

Yes. This PR introduces the `setMethod` method to `CrossValidator`. If the 
`method` parameter is not set, the behavior will be the same as it has always 
been.

*How was this patch tested?*

Unit tests will be added.

  was:
### What changes were proposed in this pull request?

I have added a `method` parameter to `CrossValidator.scala` to allow the user 
to choose between repeated random sub-sampling cross-validation (current 
behavior) and _k_-fold cross-validation (optional new behavior). The default 
method is random sub-sampling cross-validation.

If _k_-fold cross-validation is chosen, the new behavior is as follows:

1. Instead of splitting the input dataset into _k_ training and validation 
sets, I split them into _k_ folds; for each fold of training, one of the _k_ 
splits is selected for validation, and the others are unioned together for 
training.
2. Instead of caching each training and validation set _k_ times, I cache each 
of the folds once.
3. Instead of waiting for every model to finish training on fold _n_ before 
moving on to fold _n+1_, new fold/model combinations will be trained as soon as 
resources are available.
4. Instead of creating one `Future` per model for each fold in series, all 
`Future`s for each fold & parameter grid pair are created and trained in 
parallel.
5. A new `Int` parameter is added to the `Future` (now `Future[Int, Double]` 
instead of `Future[Double]`) in order to keep track of which `Future` belongs 
to which parameter grid.

### Why are the changes needed?

These changes allow the user to choose between repeated random sub-sampling 
cross-validation (current behavior) and _k_-fold cross-validation (optional new 
behavior). These changes:
1. allow the user to choose between two types of cross-validation.
2. (If _k_-fold is chosen) only require caching the entire dataset once 
(instead of _k_ times in repeated random sub-sampling cross-validation, as it 
does now).
3. (if _k_-fold is chosen) free resources to train new model/fold combinations 
as soon as the previous one finishes. Currently, a model can only train one 
fold at a time. If _k_-fold is chosen, the added functionality will allow the 
`fit` to train multiple folds at once for the same model, and, in the case of a 
grid search, allow it to train multiple model/fold combinations at once, 
without needing to wait for the slowest model to fit the first fold before 
moving onto the second.

### Does this PR introduce _any_ user-facing change?

Yes. This PR introduces the `

[jira] [Updated] (SPARK-32271) Update CrossValidator to parallelize fit method across folds

2020-07-15 Thread Austin Jordan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Austin Jordan updated SPARK-32271:
--
Description: 
### What changes were proposed in this pull request?

I have added a `method` parameter to `CrossValidator.scala` to allow the user 
to choose between repeated random sub-sampling cross-validation (current 
behavior) and _k_-fold cross-validation (optional new behavior). The default 
method is random sub-sampling cross-validation.

If _k_-fold cross-validation is chosen, the new behavior is as follows:

1. Instead of splitting the input dataset into _k_ training and validation 
sets, I split them into _k_ folds; for each fold of training, one of the _k_ 
splits is selected for validation, and the others are unioned together for 
training.
2. Instead of caching each training and validation set _k_ times, I cache each 
of the folds once.
3. Instead of waiting for every model to finish training on fold _n_ before 
moving on to fold _n+1_, new fold/model combinations will be trained as soon as 
resources are available.
4. Instead of creating one `Future` per model for each fold in series, all 
`Future`s for each fold & parameter grid pair are created and trained in 
parallel.
5. A new `Int` parameter is added to the `Future` (now `Future[Int, Double]` 
instead of `Future[Double]`) in order to keep track of which `Future` belongs 
to which parameter grid.

### Why are the changes needed?

These changes allow the user to choose between repeated random sub-sampling 
cross-validation (current behavior) and _k_-fold cross-validation (optional new 
behavior). These changes:
1. allow the user to choose between two types of cross-validation.
2. (If _k_-fold is chosen) only require caching the entire dataset once 
(instead of _k_ times in repeated random sub-sampling cross-validation, as it 
does now).
3. (if _k_-fold is chosen) free resources to train new model/fold combinations 
as soon as the previous one finishes. Currently, a model can only train one 
fold at a time. If _k_-fold is chosen, the added functionality will allow the 
`fit` to train multiple folds at once for the same model, and, in the case of a 
grid search, allow it to train multiple model/fold combinations at once, 
without needing to wait for the slowest model to fit the first fold before 
moving onto the second.

### Does this PR introduce _any_ user-facing change?

Yes. This PR introduces the `setMethod` method to `CrossValidator`. If the 
`method` parameter is not set, the behavior will be the same as it has always 
been.

### How was this patch tested?

Unit tests will be added.

  was:
Currently, fitting a CrossValidator is only parallelized across models. This 
means that a CrossValidator will only fit as quickly as the slowest-to-train 
model would fit by itself.

If a 2x2x3 parameter grid is provided for 10-fold cross validation, all 12 
models will begin training on the first fold. However, if 6 of these models 
will train for 1 hour/fold and the other 6 will train for 3 hours/fold (e.g. 
when tuning number of early stopping rounds in XGBoost), the first 6 models 
will not move on to the second fold until the last 6 are finished.

If fitting was parallelized across folds, the first 6 models would finish after 
10 hours, freeing up cluster resources to run multiple folds for the last 6 
models in parallel.

Changes to be made:
 * Instead of splitting data into multiple training and validation sets, split 
into the folds.
 * Cache each of the folds (so each fold only ends up getting cached once, 
instead of 10 times how it is now).
 * For each fold index, form the training and validation sets by selecting the 
current fold as the validation set and unioning the rest into the training set.
 * Make associated changes to calculate fold metrics, now that folds are being 
parallelized as well.


> Update CrossValidator to parallelize fit method across folds
> 
>
> Key: SPARK-32271
> URL: https://issues.apache.org/jira/browse/SPARK-32271
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: Austin Jordan
>Priority: Minor
>
> ### What changes were proposed in this pull request?
> I have added a `method` parameter to `CrossValidator.scala` to allow the user 
> to choose between repeated random sub-sampling cross-validation (current 
> behavior) and _k_-fold cross-validation (optional new behavior). The default 
> method is random sub-sampling cross-validation.
> If _k_-fold cross-validation is chosen, the new behavior is as follows:
> 1. Instead of splitting the input dataset into _k_ training and validation 
> sets, I split them into _k_ folds; for each fold of training, one of the _k_ 
> splits is selected for validation, and the others are unio

[jira] [Commented] (SPARK-32319) Remove unused imports

2020-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157983#comment-17157983
 ] 

Apache Spark commented on SPARK-32319:
--

User 'Fokko' has created a pull request for this issue:
https://github.com/apache/spark/pull/29121

> Remove unused imports
> -
>
> Key: SPARK-32319
> URL: https://issues.apache.org/jira/browse/SPARK-32319
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Fokko Driesprong
>Priority: Major
>
> We don't want to import stuff that we're not going to use, to reduce the 
> memory pressure.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32319) Remove unused imports

2020-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157980#comment-17157980
 ] 

Apache Spark commented on SPARK-32319:
--

User 'Fokko' has created a pull request for this issue:
https://github.com/apache/spark/pull/29121

> Remove unused imports
> -
>
> Key: SPARK-32319
> URL: https://issues.apache.org/jira/browse/SPARK-32319
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Fokko Driesprong
>Priority: Major
>
> We don't want to import stuff that we're not going to use, to reduce the 
> memory pressure.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32319) Remove unused imports

2020-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32319:


Assignee: (was: Apache Spark)

> Remove unused imports
> -
>
> Key: SPARK-32319
> URL: https://issues.apache.org/jira/browse/SPARK-32319
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Fokko Driesprong
>Priority: Major
>
> We don't want to import stuff that we're not going to use, to reduce the 
> memory pressure.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32319) Remove unused imports

2020-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32319:


Assignee: Apache Spark

> Remove unused imports
> -
>
> Key: SPARK-32319
> URL: https://issues.apache.org/jira/browse/SPARK-32319
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Fokko Driesprong
>Assignee: Apache Spark
>Priority: Major
>
> We don't want to import stuff that we're not going to use, to reduce the 
> memory pressure.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32319) Remove unused imports

2020-07-15 Thread Fokko Driesprong (Jira)
Fokko Driesprong created SPARK-32319:


 Summary: Remove unused imports
 Key: SPARK-32319
 URL: https://issues.apache.org/jira/browse/SPARK-32319
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Fokko Driesprong


We don't want to import stuff that we're not going to use, to reduce the memory 
pressure.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32291) COALESCE should not reduce the child parallelism if it is Join

2020-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32291:


Assignee: Apache Spark

> COALESCE should not reduce the child parallelism if it is Join
> --
>
> Key: SPARK-32291
> URL: https://issues.apache.org/jira/browse/SPARK-32291
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
> Attachments: COALESCE.png, coalesce.png, repartition.png
>
>
> How to reproduce this issue:
> {code:scala}
> spark.range(100).createTempView("t1")
> spark.range(200).createTempView("t2")
> spark.sql("set spark.sql.autoBroadcastJoinThreshold=0")
> spark.sql("select /*+ COALESCE(1) */ t1.* from t1 join t2 on (t1.id = 
> t2.id)").show
> {code}
> The dag is:
>  !COALESCE.png! 
> A real case:
>  !coalesce.png! 
>  !repartition.png! 
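> A possible point of comparison, assuming the goal is a single output partition without shrinking the join itself: the REPARTITION hint adds a shuffle after the join rather than propagating the reduced partitioning into it.
> {code:scala}
> // Same setup as above; only the hint differs
> spark.sql("select /*+ REPARTITION(1) */ t1.* from t1 join t2 on (t1.id = t2.id)").show
> {code}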



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32291) COALESCE should not reduce the child parallelism if it is Join

2020-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32291:


Assignee: (was: Apache Spark)

> COALESCE should not reduce the child parallelism if it is Join
> --
>
> Key: SPARK-32291
> URL: https://issues.apache.org/jira/browse/SPARK-32291
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: COALESCE.png, coalesce.png, repartition.png
>
>
> How to reproduce this issue:
> {code:scala}
> spark.range(100).createTempView("t1")
> spark.range(200).createTempView("t2")
> spark.sql("set spark.sql.autoBroadcastJoinThreshold=0")
> spark.sql("select /*+ COALESCE(1) */ t1.* from t1 join t2 on (t1.id = 
> t2.id)").show
> {code}
> The dag is:
>  !COALESCE.png! 
> A real case:
>  !coalesce.png! 
>  !repartition.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32291) COALESCE should not reduce the child parallelism if it is Join

2020-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157969#comment-17157969
 ] 

Apache Spark commented on SPARK-32291:
--

User 'LantaoJin' has created a pull request for this issue:
https://github.com/apache/spark/pull/29120

> COALESCE should not reduce the child parallelism if it is Join
> --
>
> Key: SPARK-32291
> URL: https://issues.apache.org/jira/browse/SPARK-32291
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: COALESCE.png, coalesce.png, repartition.png
>
>
> How to reproduce this issue:
> {code:scala}
> spark.range(100).createTempView("t1")
> spark.range(200).createTempView("t2")
> spark.sql("set spark.sql.autoBroadcastJoinThreshold=0")
> spark.sql("select /*+ COALESCE(1) */ t1.* from t1 join t2 on (t1.id = 
> t2.id)").show
> {code}
> The dag is:
>  !COALESCE.png! 
> A real case:
>  !coalesce.png! 
>  !repartition.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org