[jira] [Commented] (SPARK-32330) Preserve shuffled hash join build side partitioning
[ https://issues.apache.org/jira/browse/SPARK-32330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158906#comment-17158906 ] Apache Spark commented on SPARK-32330: -- User 'c21' has created a pull request for this issue: https://github.com/apache/spark/pull/29130 > Preserve shuffled hash join build side partitioning > --- > > Key: SPARK-32330 > URL: https://issues.apache.org/jira/browse/SPARK-32330 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Priority: Trivial > > Currently `ShuffledHashJoin.outputPartitioning` inherits from > `HashJoin.outputPartitioning`, which only preserves stream side partitioning: > `HashJoin.scala` > {code:java} > override def outputPartitioning: Partitioning = > streamedPlan.outputPartitioning > {code} > This loses build side partitioning information, and causes extra shuffle if > there's another join / group-by after this join. > Example: > > {code:java} > // code placeholder > withSQLConf( > SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50", > SQLConf.SHUFFLE_PARTITIONS.key -> "2", > SQLConf.PREFER_SORTMERGEJOIN.key -> "false") { > val df1 = spark.range(10).select($"id".as("k1")) > val df2 = spark.range(30).select($"id".as("k2")) > Seq("inner", "cross").foreach(joinType => { > val plan = df1.join(df2, $"k1" === $"k2", joinType).groupBy($"k1").count() > .queryExecution.executedPlan > assert(plan.collect { case _: ShuffledHashJoinExec => true }.size === 1) > // No extra shuffle before aggregate > assert(plan.collect { case _: ShuffleExchangeExec => true }.size === 2) > }) > }{code} > > Current physical plan (having an extra shuffle on `k1` before aggregate) > > {code:java} > *(4) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, > count#235L]) > +- Exchange hashpartitioning(k1#220L, 2), true, [id=#117] >+- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], > output=[k1#220L, count#239L]) > +- *(3) Project [k1#220L] > +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft > :- Exchange hashpartitioning(k1#220L, 2), true, [id=#109] > : +- *(1) Project [id#218L AS k1#220L] > : +- *(1) Range (0, 10, step=1, splits=2) > +- Exchange hashpartitioning(k2#224L, 2), true, [id=#111] >+- *(2) Project [id#222L AS k2#224L] > +- *(2) Range (0, 30, step=1, splits=2){code} > > Ideal physical plan (no shuffle on `k1` before aggregate) > {code:java} > *(3) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, > count#235L]) > +- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], > output=[k1#220L, count#239L]) >+- *(3) Project [k1#220L] > +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft > :- Exchange hashpartitioning(k1#220L, 2), true, [id=#107] > : +- *(1) Project [id#218L AS k1#220L] > : +- *(1) Range (0, 10, step=1, splits=2) > +- Exchange hashpartitioning(k2#224L, 2), true, [id=#109] > +- *(2) Project [id#222L AS k2#224L] >+- *(2) Range (0, 30, step=1, splits=2){code} > > This can be fixed by overriding `outputPartitioning` method in > `ShuffledHashJoinExec`, similar to `SortMergeJoinExec`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
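For illustration, a minimal sketch of the proposed direction (this is not the actual patch in the PR above): mirror `SortMergeJoinExec.outputPartitioning` and report both children's partitioning for inner-like joins; a complete implementation also has to cover full outer and existence joins.

{code:scala}
import org.apache.spark.sql.catalyst.plans.{InnerLike, LeftOuter, RightOuter}
import org.apache.spark.sql.catalyst.plans.physical.{Partitioning, PartitioningCollection}

// Sketch of an override inside ShuffledHashJoinExec. For inner-like joins both
// inputs were hash-partitioned on the join keys, so both partitionings remain
// valid on the join output, and a later join or group-by on either key can
// reuse them instead of inserting another Exchange.
override def outputPartitioning: Partitioning = joinType match {
  case _: InnerLike =>
    PartitioningCollection(Seq(left.outputPartitioning, right.outputPartitioning))
  case LeftOuter => left.outputPartitioning
  case RightOuter => right.outputPartitioning
  case _ => streamedPlan.outputPartitioning
}
{code}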
[jira] [Assigned] (SPARK-32330) Preserve shuffled hash join build side partitioning
[ https://issues.apache.org/jira/browse/SPARK-32330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32330: Assignee: Apache Spark > Preserve shuffled hash join build side partitioning > --- > > Key: SPARK-32330 > URL: https://issues.apache.org/jira/browse/SPARK-32330 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Assignee: Apache Spark >Priority: Trivial > > Currently `ShuffledHashJoin.outputPartitioning` inherits from > `HashJoin.outputPartitioning`, which only preserves stream side partitioning: > `HashJoin.scala` > {code:java} > override def outputPartitioning: Partitioning = > streamedPlan.outputPartitioning > {code} > This loses build side partitioning information, and causes extra shuffle if > there's another join / group-by after this join. > Example: > > {code:java} > // code placeholder > withSQLConf( > SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50", > SQLConf.SHUFFLE_PARTITIONS.key -> "2", > SQLConf.PREFER_SORTMERGEJOIN.key -> "false") { > val df1 = spark.range(10).select($"id".as("k1")) > val df2 = spark.range(30).select($"id".as("k2")) > Seq("inner", "cross").foreach(joinType => { > val plan = df1.join(df2, $"k1" === $"k2", joinType).groupBy($"k1").count() > .queryExecution.executedPlan > assert(plan.collect { case _: ShuffledHashJoinExec => true }.size === 1) > // No extra shuffle before aggregate > assert(plan.collect { case _: ShuffleExchangeExec => true }.size === 2) > }) > }{code} > > Current physical plan (having an extra shuffle on `k1` before aggregate) > > {code:java} > *(4) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, > count#235L]) > +- Exchange hashpartitioning(k1#220L, 2), true, [id=#117] >+- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], > output=[k1#220L, count#239L]) > +- *(3) Project [k1#220L] > +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft > :- Exchange hashpartitioning(k1#220L, 2), true, [id=#109] > : +- *(1) Project [id#218L AS k1#220L] > : +- *(1) Range (0, 10, step=1, splits=2) > +- Exchange hashpartitioning(k2#224L, 2), true, [id=#111] >+- *(2) Project [id#222L AS k2#224L] > +- *(2) Range (0, 30, step=1, splits=2){code} > > Ideal physical plan (no shuffle on `k1` before aggregate) > {code:java} > *(3) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, > count#235L]) > +- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], > output=[k1#220L, count#239L]) >+- *(3) Project [k1#220L] > +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft > :- Exchange hashpartitioning(k1#220L, 2), true, [id=#107] > : +- *(1) Project [id#218L AS k1#220L] > : +- *(1) Range (0, 10, step=1, splits=2) > +- Exchange hashpartitioning(k2#224L, 2), true, [id=#109] > +- *(2) Project [id#222L AS k2#224L] >+- *(2) Range (0, 30, step=1, splits=2){code} > > This can be fixed by overriding `outputPartitioning` method in > `ShuffledHashJoinExec`, similar to `SortMergeJoinExec`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32330) Preserve shuffled hash join build side partitioning
[ https://issues.apache.org/jira/browse/SPARK-32330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32330: Assignee: (was: Apache Spark) > Preserve shuffled hash join build side partitioning > --- > > Key: SPARK-32330 > URL: https://issues.apache.org/jira/browse/SPARK-32330 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Priority: Trivial > > Currently `ShuffledHashJoin.outputPartitioning` inherits from > `HashJoin.outputPartitioning`, which only preserves stream side partitioning: > `HashJoin.scala` > {code:java} > override def outputPartitioning: Partitioning = > streamedPlan.outputPartitioning > {code} > This loses build side partitioning information, and causes extra shuffle if > there's another join / group-by after this join. > Example: > > {code:java} > // code placeholder > withSQLConf( > SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50", > SQLConf.SHUFFLE_PARTITIONS.key -> "2", > SQLConf.PREFER_SORTMERGEJOIN.key -> "false") { > val df1 = spark.range(10).select($"id".as("k1")) > val df2 = spark.range(30).select($"id".as("k2")) > Seq("inner", "cross").foreach(joinType => { > val plan = df1.join(df2, $"k1" === $"k2", joinType).groupBy($"k1").count() > .queryExecution.executedPlan > assert(plan.collect { case _: ShuffledHashJoinExec => true }.size === 1) > // No extra shuffle before aggregate > assert(plan.collect { case _: ShuffleExchangeExec => true }.size === 2) > }) > }{code} > > Current physical plan (having an extra shuffle on `k1` before aggregate) > > {code:java} > *(4) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, > count#235L]) > +- Exchange hashpartitioning(k1#220L, 2), true, [id=#117] >+- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], > output=[k1#220L, count#239L]) > +- *(3) Project [k1#220L] > +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft > :- Exchange hashpartitioning(k1#220L, 2), true, [id=#109] > : +- *(1) Project [id#218L AS k1#220L] > : +- *(1) Range (0, 10, step=1, splits=2) > +- Exchange hashpartitioning(k2#224L, 2), true, [id=#111] >+- *(2) Project [id#222L AS k2#224L] > +- *(2) Range (0, 30, step=1, splits=2){code} > > Ideal physical plan (no shuffle on `k1` before aggregate) > {code:java} > *(3) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, > count#235L]) > +- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], > output=[k1#220L, count#239L]) >+- *(3) Project [k1#220L] > +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft > :- Exchange hashpartitioning(k1#220L, 2), true, [id=#107] > : +- *(1) Project [id#218L AS k1#220L] > : +- *(1) Range (0, 10, step=1, splits=2) > +- Exchange hashpartitioning(k2#224L, 2), true, [id=#109] > +- *(2) Project [id#222L AS k2#224L] >+- *(2) Range (0, 30, step=1, splits=2){code} > > This can be fixed by overriding `outputPartitioning` method in > `ShuffledHashJoinExec`, similar to `SortMergeJoinExec`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32330) Preserve shuffled hash join build side partitioning
[ https://issues.apache.org/jira/browse/SPARK-32330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158905#comment-17158905 ] Apache Spark commented on SPARK-32330: -- User 'c21' has created a pull request for this issue: https://github.com/apache/spark/pull/29130 > Preserve shuffled hash join build side partitioning > --- > > Key: SPARK-32330 > URL: https://issues.apache.org/jira/browse/SPARK-32330 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Priority: Trivial > > Currently `ShuffledHashJoin.outputPartitioning` inherits from > `HashJoin.outputPartitioning`, which only preserves stream side partitioning: > `HashJoin.scala` > {code:java} > override def outputPartitioning: Partitioning = > streamedPlan.outputPartitioning > {code} > This loses build side partitioning information, and causes extra shuffle if > there's another join / group-by after this join. > Example: > > {code:java} > // code placeholder > withSQLConf( > SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50", > SQLConf.SHUFFLE_PARTITIONS.key -> "2", > SQLConf.PREFER_SORTMERGEJOIN.key -> "false") { > val df1 = spark.range(10).select($"id".as("k1")) > val df2 = spark.range(30).select($"id".as("k2")) > Seq("inner", "cross").foreach(joinType => { > val plan = df1.join(df2, $"k1" === $"k2", joinType).groupBy($"k1").count() > .queryExecution.executedPlan > assert(plan.collect { case _: ShuffledHashJoinExec => true }.size === 1) > // No extra shuffle before aggregate > assert(plan.collect { case _: ShuffleExchangeExec => true }.size === 2) > }) > }{code} > > Current physical plan (having an extra shuffle on `k1` before aggregate) > > {code:java} > *(4) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, > count#235L]) > +- Exchange hashpartitioning(k1#220L, 2), true, [id=#117] >+- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], > output=[k1#220L, count#239L]) > +- *(3) Project [k1#220L] > +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft > :- Exchange hashpartitioning(k1#220L, 2), true, [id=#109] > : +- *(1) Project [id#218L AS k1#220L] > : +- *(1) Range (0, 10, step=1, splits=2) > +- Exchange hashpartitioning(k2#224L, 2), true, [id=#111] >+- *(2) Project [id#222L AS k2#224L] > +- *(2) Range (0, 30, step=1, splits=2){code} > > Ideal physical plan (no shuffle on `k1` before aggregate) > {code:java} > *(3) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, > count#235L]) > +- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], > output=[k1#220L, count#239L]) >+- *(3) Project [k1#220L] > +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft > :- Exchange hashpartitioning(k1#220L, 2), true, [id=#107] > : +- *(1) Project [id#218L AS k1#220L] > : +- *(1) Range (0, 10, step=1, splits=2) > +- Exchange hashpartitioning(k2#224L, 2), true, [id=#109] > +- *(2) Project [id#222L AS k2#224L] >+- *(2) Range (0, 30, step=1, splits=2){code} > > This can be fixed by overriding `outputPartitioning` method in > `ShuffledHashJoinExec`, similar to `SortMergeJoinExec`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32325) JSON predicate pushdown for nested fields
[ https://issues.apache.org/jira/browse/SPARK-32325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158902#comment-17158902 ] Maxim Gekk commented on SPARK-32325: The JIRA ticket was opened while addressing [~dongjoon]'s comments in the PR https://github.com/apache/spark/pull/27366, but the PR has not been merged yet. > JSON predicate pushdown for nested fields > - > > Key: SPARK-32325 > URL: https://issues.apache.org/jira/browse/SPARK-32325 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > SPARK-30648 adds filter pushdown to the JSON datasource, but it supports only > filters that refer to top-level fields. This ticket aims to support nested > fields as well. See the needed changes: > https://github.com/apache/spark/pull/27366#discussion_r443340603 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
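To make the gap concrete, a minimal sketch (the file path, schema, and data are hypothetical): after SPARK-30648, a filter on a top-level field such as `id` is eligible for pushdown to the JSON datasource, while an equivalent filter on a nested field such as `person.age` is not pushed down yet, which is what this ticket targets.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical input, one JSON object per line:
// {"id": 1, "person": {"name": "a", "age": 32}}
val df = spark.read
  .schema("id LONG, person STRUCT<name: STRING, age: INT>")
  .json("/tmp/people.json")

df.filter($"id" > 0).count()          // top-level field: pushdown supported by SPARK-30648
df.filter($"person.age" > 30).count() // nested field: the case this ticket aims to cover
{code}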
[jira] [Updated] (SPARK-32330) Preserve shuffled hash join build side partitioning
[ https://issues.apache.org/jira/browse/SPARK-32330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Su updated SPARK-32330: - Description: Currently `ShuffledHashJoin.outputPartitioning` inherits from `HashJoin.outputPartitioning`, which only preserves stream side partitioning: `HashJoin.scala` {code:java} override def outputPartitioning: Partitioning = streamedPlan.outputPartitioning {code} This loses build side partitioning information, and causes extra shuffle if there's another join / group-by after this join. Example: {code:java} // code placeholder withSQLConf( SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50", SQLConf.SHUFFLE_PARTITIONS.key -> "2", SQLConf.PREFER_SORTMERGEJOIN.key -> "false") { val df1 = spark.range(10).select($"id".as("k1")) val df2 = spark.range(30).select($"id".as("k2")) Seq("inner", "cross").foreach(joinType => { val plan = df1.join(df2, $"k1" === $"k2", joinType).groupBy($"k1").count() .queryExecution.executedPlan assert(plan.collect { case _: ShuffledHashJoinExec => true }.size === 1) // No extra shuffle before aggregate assert(plan.collect { case _: ShuffleExchangeExec => true }.size === 2) }) }{code} Current physical plan (having an extra shuffle on `k1` before aggregate) {code:java} *(4) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, count#235L]) +- Exchange hashpartitioning(k1#220L, 2), true, [id=#117] +- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], output=[k1#220L, count#239L]) +- *(3) Project [k1#220L] +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft :- Exchange hashpartitioning(k1#220L, 2), true, [id=#109] : +- *(1) Project [id#218L AS k1#220L] : +- *(1) Range (0, 10, step=1, splits=2) +- Exchange hashpartitioning(k2#224L, 2), true, [id=#111] +- *(2) Project [id#222L AS k2#224L] +- *(2) Range (0, 30, step=1, splits=2){code} Ideal physical plan (no shuffle on `k1` before aggregate) {code:java} *(3) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, count#235L]) +- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], output=[k1#220L, count#239L]) +- *(3) Project [k1#220L] +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft :- Exchange hashpartitioning(k1#220L, 2), true, [id=#107] : +- *(1) Project [id#218L AS k1#220L] : +- *(1) Range (0, 10, step=1, splits=2) +- Exchange hashpartitioning(k2#224L, 2), true, [id=#109] +- *(2) Project [id#222L AS k2#224L] +- *(2) Range (0, 30, step=1, splits=2){code} This can be fixed by overriding `outputPartitioning` method in `ShuffledHashJoinExec`, similar to `SortMergeJoinExec`. was: Currently `ShuffledHashJoin.outputPartitioning` inherits from `HashJoin.outputPartitioning`, which only preserves stream side partitioning: `HashJoin.scala` {code:java} override def outputPartitioning: Partitioning = streamedPlan.outputPartitioning {code} This loses build side partitioning information, and causes extra shuffle if there's another join / group-by after this join. 
Example: {code:java} // code placeholder withSQLConf( SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50", SQLConf.SHUFFLE_PARTITIONS.key -> "2", SQLConf.PREFER_SORTMERGEJOIN.key -> "false") { val df1 = spark.range(10).select($"id".as("k1")) val df2 = spark.range(30).select($"id".as("k2")) Seq("inner", "cross").foreach(joinType => { val plan = df1.join(df2, $"k1" === $"k2", joinType).groupBy($"k1").count() .queryExecution.executedPlan assert(plan.collect { case _: ShuffledHashJoinExec => true }.size === 1) // No extra shuffle before aggregate assert(plan.collect { case _: ShuffleExchangeExec => true }.size === 2) }) }{code} Current physical plan (having an extra shuffle on `k1` before aggregate) {code:java} *(4) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, count#235L]) +- Exchange hashpartitioning(k1#220L, 2), true, [id=#117] +- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], output=[k1#220L, count#239L]) +- *(3) Project [k1#220L] +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft :- Exchange hashpartitioning(k1#220L, 2), true, [id=#109] : +- *(1) Project [id#218L AS k1#220L] : +- *(1) Range (0, 10, step=1, splits=2) +- Exchange hashpartitioning(k2#224L, 2), true, [id=#111] +- *(2) Project [id#222L AS k2#224L] +- *(2) Range (0, 30, step=1, splits=2){code} Ideal physical plan (no shuffle on `k1` before aggregate) {code:java} *(3) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, count#235L]) +- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], output=[k1#220L, count#239L]) +- *(3) Project [k1#220L] +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft :- Exchange hashpartitioning(k1#220L, 2), true, [id=#107] : +- *(1) Project [id#218L AS k1#220L] : +- *(1) Range (0, 10, step=1, splits=2) +- Exchange hashpartitioning(k2#224L, 2), true, [id=#109] +- *(2) Project [id#222L AS k2#224L] +- *(2) Range (0, 30, step=1, splits=2){code} This can be fixed by overriding `outputPartitioning` method in `ShuffledHashJoinExec`, similar to `SortMergeJoinExec`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32330) Preserve shuffled hash join build side partitioning
Cheng Su created SPARK-32330: Summary: Preserve shuffled hash join build side partitioning Key: SPARK-32330 URL: https://issues.apache.org/jira/browse/SPARK-32330 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Cheng Su Currently `ShuffledHashJoin.outputPartitioning` inherits from `HashJoin.outputPartitioning`, which only preserves stream side partitioning: `HashJoin.scala` {code:java} override def outputPartitioning: Partitioning = streamedPlan.outputPartitioning {code} This loses build side partitioning information, and causes extra shuffle if there's another join / group-by after this join. Example: {code:java} // code placeholder withSQLConf( SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50", SQLConf.SHUFFLE_PARTITIONS.key -> "2", SQLConf.PREFER_SORTMERGEJOIN.key -> "false") { val df1 = spark.range(10).select($"id".as("k1")) val df2 = spark.range(30).select($"id".as("k2")) Seq("inner", "cross").foreach(joinType => { val plan = df1.join(df2, $"k1" === $"k2", joinType).groupBy($"k1").count() .queryExecution.executedPlan assert(plan.collect { case _: ShuffledHashJoinExec => true }.size === 1) // No extra shuffle before aggregate assert(plan.collect { case _: ShuffleExchangeExec => true }.size === 2) }) }{code} Current physical plan (having an extra shuffle on `k1` before aggregate) {code:java} *(4) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, count#235L]) +- Exchange hashpartitioning(k1#220L, 2), true, [id=#117] +- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], output=[k1#220L, count#239L]) +- *(3) Project [k1#220L] +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft :- Exchange hashpartitioning(k1#220L, 2), true, [id=#109] : +- *(1) Project [id#218L AS k1#220L] : +- *(1) Range (0, 10, step=1, splits=2) +- Exchange hashpartitioning(k2#224L, 2), true, [id=#111] +- *(2) Project [id#222L AS k2#224L] +- *(2) Range (0, 30, step=1, splits=2){code} Ideal physical plan (no shuffle on `k1` before aggregate) {code:java} *(3) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, count#235L]) +- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], output=[k1#220L, count#239L]) +- *(3) Project [k1#220L] +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft :- Exchange hashpartitioning(k1#220L, 2), true, [id=#107] : +- *(1) Project [id#218L AS k1#220L] : +- *(1) Range (0, 10, step=1, splits=2) +- Exchange hashpartitioning(k2#224L, 2), true, [id=#109] +- *(2) Project [id#222L AS k2#224L] +- *(2) Range (0, 30, step=1, splits=2){code} This can be fixed by overriding `outputPartitioning` method in `ShuffledHashJoinExec`, similar to `SortMergeJoinExec`. ` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31831) Flaky test: org.apache.spark.sql.hive.thriftserver.HiveSessionImplSuite.(It is not a test it is a sbt.testing.SuiteSelector)
[ https://issues.apache.org/jira/browse/SPARK-31831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158886#comment-17158886 ] Apache Spark commented on SPARK-31831: -- User 'frankyin-factual' has created a pull request for this issue: https://github.com/apache/spark/pull/29129 > Flaky test: org.apache.spark.sql.hive.thriftserver.HiveSessionImplSuite.(It > is not a test it is a sbt.testing.SuiteSelector) > > > Key: SPARK-31831 > URL: https://issues.apache.org/jira/browse/SPARK-31831 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: Jungtaek Lim >Assignee: Frank Yin >Priority: Major > Fix For: 3.1.0 > > > I've seen the failures two times (not in a row but closely) which seems to > require investigation. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123147/testReport > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123150/testReport > {noformat} > org.mockito.exceptions.base.MockitoException: ClassCastException occurred > while creating the mockito mock : class to mock : > 'org.apache.hive.service.cli.session.SessionManager', loaded by classloader : > 'sun.misc.Launcher$AppClassLoader@483bf400' created class : > 'org.mockito.codegen.SessionManager$MockitoMock$1696557705', loaded by > classloader : > 'net.bytebuddy.dynamic.loading.MultipleParentClassLoader@47ecf2c6' proxy > instance class : 'org.mockito.codegen.SessionManager$MockitoMock$1696557705', > loaded by classloader : > 'net.bytebuddy.dynamic.loading.MultipleParentClassLoader@47ecf2c6' instance > creation by : ObjenesisInstantiator You might experience classloading > issues, please ask the mockito mailing-list. > Stack Trace > sbt.ForkMain$ForkError: org.mockito.exceptions.base.MockitoException: > ClassCastException occurred while creating the mockito mock : > class to mock : 'org.apache.hive.service.cli.session.SessionManager', > loaded by classloader : 'sun.misc.Launcher$AppClassLoader@483bf400' > created class : > 'org.mockito.codegen.SessionManager$MockitoMock$1696557705', loaded by > classloader : > 'net.bytebuddy.dynamic.loading.MultipleParentClassLoader@47ecf2c6' > proxy instance class : > 'org.mockito.codegen.SessionManager$MockitoMock$1696557705', loaded by > classloader : > 'net.bytebuddy.dynamic.loading.MultipleParentClassLoader@47ecf2c6' > instance creation by : ObjenesisInstantiator > You might experience classloading issues, please ask the mockito mailing-list. 
> at > org.apache.spark.sql.hive.thriftserver.HiveSessionImplSuite.beforeAll(HiveSessionImplSuite.scala:44) > at > org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212) > at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) > at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:59) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:317) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:510) > at sbt.ForkMain$Run$2.call(ForkMain.java:296) > at sbt.ForkMain$Run$2.call(ForkMain.java:286) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: sbt.ForkMain$ForkError: java.lang.ClassCastException: > org.mockito.codegen.SessionManager$MockitoMock$1696557705 cannot be cast to > org.mockito.internal.creation.bytebuddy.MockAccess > at > org.mockito.internal.creation.bytebuddy.SubclassByteBuddyMockMaker.createMock(SubclassByteBuddyMockMaker.java:48) > at > org.mockito.internal.creation.bytebuddy.ByteBuddyMockMaker.createMock(ByteBuddyMockMaker.java:25) > at org.mockito.internal.util.MockUtil.createMock(MockUtil.java:35) > at org.mockito.internal.MockitoCore.mock(MockitoCore.java:63) > at org.mockito.Mockito.mock(Mockito.java:1908) > at org.mockito.Mockito.mock(Mockito.java:1817) > ... 13 more > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31831) Flaky test: org.apache.spark.sql.hive.thriftserver.HiveSessionImplSuite.(It is not a test it is a sbt.testing.SuiteSelector)
[ https://issues.apache.org/jira/browse/SPARK-31831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158885#comment-17158885 ] Apache Spark commented on SPARK-31831: -- User 'frankyin-factual' has created a pull request for this issue: https://github.com/apache/spark/pull/29129 > Flaky test: org.apache.spark.sql.hive.thriftserver.HiveSessionImplSuite.(It > is not a test it is a sbt.testing.SuiteSelector) > > > Key: SPARK-31831 > URL: https://issues.apache.org/jira/browse/SPARK-31831 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: Jungtaek Lim >Assignee: Frank Yin >Priority: Major > Fix For: 3.1.0 > > > I've seen the failures two times (not in a row but closely) which seems to > require investigation. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123147/testReport > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123150/testReport > {noformat} > org.mockito.exceptions.base.MockitoException: ClassCastException occurred > while creating the mockito mock : class to mock : > 'org.apache.hive.service.cli.session.SessionManager', loaded by classloader : > 'sun.misc.Launcher$AppClassLoader@483bf400' created class : > 'org.mockito.codegen.SessionManager$MockitoMock$1696557705', loaded by > classloader : > 'net.bytebuddy.dynamic.loading.MultipleParentClassLoader@47ecf2c6' proxy > instance class : 'org.mockito.codegen.SessionManager$MockitoMock$1696557705', > loaded by classloader : > 'net.bytebuddy.dynamic.loading.MultipleParentClassLoader@47ecf2c6' instance > creation by : ObjenesisInstantiator You might experience classloading > issues, please ask the mockito mailing-list. > Stack Trace > sbt.ForkMain$ForkError: org.mockito.exceptions.base.MockitoException: > ClassCastException occurred while creating the mockito mock : > class to mock : 'org.apache.hive.service.cli.session.SessionManager', > loaded by classloader : 'sun.misc.Launcher$AppClassLoader@483bf400' > created class : > 'org.mockito.codegen.SessionManager$MockitoMock$1696557705', loaded by > classloader : > 'net.bytebuddy.dynamic.loading.MultipleParentClassLoader@47ecf2c6' > proxy instance class : > 'org.mockito.codegen.SessionManager$MockitoMock$1696557705', loaded by > classloader : > 'net.bytebuddy.dynamic.loading.MultipleParentClassLoader@47ecf2c6' > instance creation by : ObjenesisInstantiator > You might experience classloading issues, please ask the mockito mailing-list. 
> at > org.apache.spark.sql.hive.thriftserver.HiveSessionImplSuite.beforeAll(HiveSessionImplSuite.scala:44) > at > org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212) > at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) > at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:59) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:317) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:510) > at sbt.ForkMain$Run$2.call(ForkMain.java:296) > at sbt.ForkMain$Run$2.call(ForkMain.java:286) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: sbt.ForkMain$ForkError: java.lang.ClassCastException: > org.mockito.codegen.SessionManager$MockitoMock$1696557705 cannot be cast to > org.mockito.internal.creation.bytebuddy.MockAccess > at > org.mockito.internal.creation.bytebuddy.SubclassByteBuddyMockMaker.createMock(SubclassByteBuddyMockMaker.java:48) > at > org.mockito.internal.creation.bytebuddy.ByteBuddyMockMaker.createMock(ByteBuddyMockMaker.java:25) > at org.mockito.internal.util.MockUtil.createMock(MockUtil.java:35) > at org.mockito.internal.MockitoCore.mock(MockitoCore.java:63) > at org.mockito.Mockito.mock(Mockito.java:1908) > at org.mockito.Mockito.mock(Mockito.java:1817) > ... 13 more > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32329) Rename HADOOP2_MODULE_PROFILES to HADOOP_MODULE_PROFILES
[ https://issues.apache.org/jira/browse/SPARK-32329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32329: Assignee: (was: Apache Spark) > Rename HADOOP2_MODULE_PROFILES to HADOOP_MODULE_PROFILES > > > Key: SPARK-32329 > URL: https://issues.apache.org/jira/browse/SPARK-32329 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.1.0 >Reporter: William Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32329) Rename HADOOP2_MODULE_PROFILES to HADOOP_MODULE_PROFILES
[ https://issues.apache.org/jira/browse/SPARK-32329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32329: Assignee: Apache Spark > Rename HADOOP2_MODULE_PROFILES to HADOOP_MODULE_PROFILES > > > Key: SPARK-32329 > URL: https://issues.apache.org/jira/browse/SPARK-32329 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.1.0 >Reporter: William Hyun >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32329) Rename HADOOP2_MODULE_PROFILES to HADOOP_MODULE_PROFILES
[ https://issues.apache.org/jira/browse/SPARK-32329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158881#comment-17158881 ] Apache Spark commented on SPARK-32329: -- User 'williamhyun' has created a pull request for this issue: https://github.com/apache/spark/pull/29128 > Rename HADOOP2_MODULE_PROFILES to HADOOP_MODULE_PROFILES > > > Key: SPARK-32329 > URL: https://issues.apache.org/jira/browse/SPARK-32329 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.1.0 >Reporter: William Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32329) Rename HADOOP2_MODULE_PROFILES to HADOOP_MODULE_PROFILES
William Hyun created SPARK-32329: Summary: Rename HADOOP2_MODULE_PROFILES to HADOOP_MODULE_PROFILES Key: SPARK-32329 URL: https://issues.apache.org/jira/browse/SPARK-32329 Project: Spark Issue Type: Improvement Components: Tests Affects Versions: 3.1.0 Reporter: William Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32325) JSON predicate pushdown for nested fields
[ https://issues.apache.org/jira/browse/SPARK-32325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158878#comment-17158878 ] pavithra ramachandran commented on SPARK-32325: --- I would like to work on this. > JSON predicate pushdown for nested fields > - > > Key: SPARK-32325 > URL: https://issues.apache.org/jira/browse/SPARK-32325 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > SPARK-30648 adds filter pushdown to the JSON datasource, but it supports only > filters that refer to top-level fields. This ticket aims to support nested > fields as well. See the needed changes: > https://github.com/apache/spark/pull/27366#discussion_r443340603 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32328) Avro predicate pushdown for nested fields
[ https://issues.apache.org/jira/browse/SPARK-32328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158877#comment-17158877 ] pavithra ramachandran commented on SPARK-32328: --- I would like to work on this. > Avro predicate pushdown for nested fields > - > > Key: SPARK-32328 > URL: https://issues.apache.org/jira/browse/SPARK-32328 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: jobit mathew >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32125) [UI] Support get taskList by status in Web UI and SHS Rest API
[ https://issues.apache.org/jira/browse/SPARK-32125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-32125: -- Assignee: Zhongwei Zhu > [UI] Support get taskList by status in Web UI and SHS Rest API > -- > > Key: SPARK-32125 > URL: https://issues.apache.org/jira/browse/SPARK-32125 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Zhongwei Zhu >Assignee: Zhongwei Zhu >Priority: Minor > > Support fetching taskList by status as below: > /applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskList?status=failed -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32125) [UI] Support get taskList by status in Web UI and SHS Rest API
[ https://issues.apache.org/jira/browse/SPARK-32125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-32125. Resolution: Fixed This issue is resolved in https://github.com/apache/spark/pull/28942 > [UI] Support get taskList by status in Web UI and SHS Rest API > -- > > Key: SPARK-32125 > URL: https://issues.apache.org/jira/browse/SPARK-32125 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Zhongwei Zhu >Priority: Minor > > Support fetching taskList by status as below: > /applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskList?status=failed -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
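As a usage note, the new parameter can be exercised directly against the REST API, which is rooted at /api/v1. A minimal sketch follows; the host, application id, and stage below are placeholders.

{code:scala}
import scala.io.Source

// Placeholders: point this at a running Spark UI or History Server with a
// real application id and stage. The status query parameter is the new part.
val url = "http://localhost:18080/api/v1/applications/app-1234/stages/1/0/taskList?status=failed"
val failedTasks = Source.fromURL(url).mkString // JSON array of failed tasks only
println(failedTasks)
{code}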
[jira] [Commented] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results
[ https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158863#comment-17158863 ] Ankit Raj Boudh commented on SPARK-32306: - [~seanmalory], I will raise the PR for this soon. > `approx_percentile` in Spark SQL gives incorrect results > > > Key: SPARK-32306 > URL: https://issues.apache.org/jira/browse/SPARK-32306 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.4 >Reporter: Sean Malory >Priority: Major > > The `approx_percentile` function in Spark SQL does not give the correct > result. I'm not sure how incorrect it is; it may just be a boundary issue. > From the docs: > {quote}The accuracy parameter (default: 1) is a positive numeric literal > which controls approximation accuracy at the cost of memory. Higher value of > accuracy yields better accuracy, 1.0/accuracy is the relative error of the > approximation. > {quote} > This is not true. Here is a minimal example in `pyspark` where, essentially, > the median of 5 and 8 is being calculated as 5: > {code:python} > import pyspark.sql.functions as psf > df = spark.createDataFrame( > [('bar', 5), ('bar', 8)], ['name', 'val'] > ) > median = psf.expr('percentile_approx(val, 0.5, 2147483647)') > df.groupBy('name').agg(median.alias('median')) # gives the median as 5 > {code} > I've tested this with Spark v2.4.4 and pyspark v2.4.5, although I suspect > this is an issue with the underlying algorithm. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
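For reference, a Scala equivalent of the pyspark reproduction above, with one plausible (unconfirmed here) explanation: the quantile sketch behind `percentile_approx` returns an actual element of the data at the queried rank rather than interpolating between neighbours, so the median of {5, 8} comes back as 5 even at maximum accuracy.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("bar", 5), ("bar", 8)).toDF("name", "val")

// Still reports 5: the sketch picks a data point at the target rank instead
// of averaging 5 and 8 the way an exact median of two values would.
df.groupBy("name")
  .agg(expr("percentile_approx(val, 0.5, 2147483647)").alias("median"))
  .show()
{code}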
[jira] [Updated] (SPARK-32324) Fix error messages when using PIVOT and lateral view
[ https://issues.apache.org/jira/browse/SPARK-32324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] philipse updated SPARK-32324: - Description: Currently, when we use `lateral view` and `pivot` together in a FROM clause, if `lateral view` comes before `pivot`, the error message is "LATERAL cannot be used together with PIVOT in FROM clause". If `lateral view` comes after `pivot`, the query runs normally. So the error message "LATERAL cannot be used together with PIVOT in FROM clause" is not accurate; we may improve it.

Steps to reproduce:
{code:java}
CREATE TABLE person (id INT, name STRING, age INT, class int, address STRING);

INSERT INTO person VALUES
    (100, 'John', 30, 1, 'Street 1'),
    (200, 'Mary', NULL, 1, 'Street 2'),
    (300, 'Mike', 80, 3, 'Street 3'),
    (400, 'Dan', 50, 4, 'Street 4');
{code}
Query 1:
{code:java}
SELECT * FROM person
lateral view outer explode(array(30,60)) tabelName as c_age
lateral view explode(array(40,80)) as d_age
PIVOT (
    count(distinct age) as a
    for name in ('Mary','John')
)
{code}
Result 1:
{code:java}
Error: org.apache.spark.sql.catalyst.parser.ParseException:
LATERAL cannot be used together with PIVOT in FROM clause(line 1, pos 9)

== SQL ==
SELECT * FROM person
---------^^^
lateral view outer explode(array(30,60)) tabelName as c_age
lateral view explode(array(40,80)) as d_age
PIVOT (
    count(distinct age) as a
    for name in ('Mary','John')
) (state=,code=0)
{code}
Query 2:
{code:java}
SELECT * FROM person
PIVOT (
    count(distinct age) as a
    for name in ('Mary','John')
)
lateral view outer explode(array(30,60)) tabelName as c_age
lateral view explode(array(40,80)) as d_age
{code}
Result 2:
{code:java}
+-----+------+------+-------+-------+
| id  | Mary | John | c_age | d_age |
+-----+------+------+-------+-------+
| 300 | NULL | NULL | 30    | 40    |
| 300 | NULL | NULL | 30    | 80    |
| 300 | NULL | NULL | 60    | 40    |
| 300 | NULL | NULL | 60    | 80    |
| 100 | 0    | NULL | 30    | 40    |
| 100 | 0    | NULL | 30    | 80    |
| 100 | 0    | NULL | 60    | 40    |
| 100 | 0    | NULL | 60    | 80    |
| 400 | NULL | NULL | 30    | 40    |
| 400 | NULL | NULL | 30    | 80    |
| 400 | NULL | NULL | 60    | 40    |
| 400 | NULL | NULL | 60    | 80    |
| 200 | NULL | 1    | 30    | 40    |
| 200 | NULL | 1    | 30    | 80    |
| 200 | NULL | 1    | 60    | 40    |
| 200 | NULL | 1    | 60    | 80    |
+-----+------+------+-------+-------+
{code}

was: Currently when we use `lateral view` and `pivot` together in from clause, if `lateral view` is before `pivot`, the error message is "LATERAL cannot be used together with PIVOT in FROM clause".if if `lateral view` is after `pivot`,the query will be normal ,So the error messages "LATERAL cannot be used together with PIVOT in FROM clause" is not accurate, we may improve it. Steps to reproduce: ``` CREATE TABLE person (id INT, name STRING, age INT, class int, address STRING); INSERT INTO person VALUES (100, 'John', 30, 1, 'Street 1'), (200, 'Mary', NULL, 1, 'Street 2'), (300, 'Mike', 80, 3, 'Street 3'), (400, 'Dan', 50, 4, 'Street 4'); ``` Query1: ``` SELECT * FROM person lateral view outer explode(array(30,60)) tabelName as c_age lateral view explode(array(40,80)) as d_age PIVOT ( count(distinct age) as a for name in ('Mary','John') ) ``` Result 1: ``` Error: org.apache.spark.sql.catalyst.parser.ParseException: LATERAL cannot be used together with PIVOT in FROM clause(line 1, pos 9) == SQL == SELECT * FROM person -^^^ lateral view outer explode(array(30,60)) tabelName as c_age lateral view explode(array(40,80)) as d_age PIVOT ( count(distinct age) as a for name in ('Mary','John') ) (state=,code=0) ``` Query2: ``` SELECT * FROM person PIVOT ( count(distinct age) as a for name in ('Mary','John') ) lateral view outer explode(array(30,60)) tabelName as c_age lateral view explode(array(40,80)) as d_age ``` Reuslt2: ``` +--+---+---+++ | id | Mary | John | c_age | d_age | +--+---+---+++ | 300 | NULL | NULL | 30 | 40 | | 300 | NULL | NULL | 30 | 80 | | 300 | NULL | NULL | 60 | 40 | | 300 | NULL | NULL | 60 | 80 | | 100 | 0 | NULL | 30 | 40 | | 100 | 0 | NULL | 30 | 80 | | 100 | 0 | NULL | 60 | 40 | | 100 | 0 | NULL | 60 | 80 | | 400 | NULL | NULL | 30 | 40 | | 400 | NULL | NULL | 30 | 80 | | 400 | NULL | NULL | 60 | 40 | | 400 | NULL | NULL | 60 | 80 | | 200 | NULL | 1 | 30 | 40 | | 200 | NULL | 1 | 30 | 80 | | 200 | NULL | 1 | 60 | 40 | | 200 | NULL | 1 | 60 | 80 | +--+---+---+++ ```

> Fix error messages when using PIVOT and lateral view > -- > > Key: SPARK-32324 > URL: https://issues.apache.org/jira/browse/SPARK-32324 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: philipse >Priority: Minor > > Currently, when we use `lateral view` and `pivot` together in a FROM clause, > if `lateral view` comes before `pivot`, the error message is "LATERAL cannot > be used together with PIVOT in FROM clause". If `lateral view` comes after > `pivot`, the query runs normally. So the error message is not accurate; we > may improve it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32328) Avro predicate pushdown for nested fields
jobit mathew created SPARK-32328: Summary: Avro predicate pushdown for nested fields Key: SPARK-32328 URL: https://issues.apache.org/jira/browse/SPARK-32328 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: jobit mathew -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32307) Aggregation that uses map type input UDF as group expression can fail
[ https://issues.apache.org/jira/browse/SPARK-32307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158814#comment-17158814 ] Dongjoon Hyun commented on SPARK-32307: --- Hi, [~Ngone51]. It seems that we need to pass a full Jenkins run on branch-3.0. Could you make a backporting PR please? > Aggregation that uses map type input UDF as group expression can fail > --- > > Key: SPARK-32307 > URL: https://issues.apache.org/jira/browse/SPARK-32307 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.1.0 > > > {code:java} > spark.udf.register("key", udf((m: Map[String, String]) => m.keys.head.toInt)) > Seq(Map("1" -> "one", "2" -> "two")).toDF("a").createOrReplaceTempView("t") > checkAnswer(sql("SELECT key(a) AS k FROM t GROUP BY key(a)"), Row(1) :: Nil) > [info] org.apache.spark.sql.AnalysisException: expression 't.`a`' is > neither present in the group by, nor is it an aggregate function. Add to > group by or wrap in first() (or first_value) if you don't care which value > you get.;; > [info] Aggregate [UDF(a#6)], [UDF(a#6) AS k#8] > [info] +- SubqueryAlias t > [info]+- Project [value#3 AS a#6] > [info] +- LocalRelation [value#3] > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:49) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:48) > [info] at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:130) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidAggregateExpression$1(CheckAnalysis.scala:257) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10(CheckAnalysis.scala:259) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10$adapted(CheckAnalysis.scala:259) > [info] at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > [info] at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > [info] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidAggregateExpression$1(CheckAnalysis.scala:259) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10(CheckAnalysis.scala:259) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10$adapted(CheckAnalysis.scala:259) > [info] at scala.collection.immutable.List.foreach(List.scala:392) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidAggregateExpression$1(CheckAnalysis.scala:259) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$13(CheckAnalysis.scala:286) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$13$adapted(CheckAnalysis.scala:286) > [info] at scala.collection.immutable.List.foreach(List.scala:392) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:286) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:92) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:177) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:92) > [info] at >
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:89) > [info] at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:130) > [info] at > org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:156) > [info] at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201) > [info] at > org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:153) > [info] at > org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:70) > [info] at > org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) > [info] at > org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:135) > [info] at > org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763) > [info] at > org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:135) > [info] at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:70) > [info] at > org.apach
[jira] [Updated] (SPARK-32307) Aggregation that uses map type input UDF as group expression can fail
[ https://issues.apache.org/jira/browse/SPARK-32307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32307: -- Fix Version/s: (was: 3.0.1) > Aggregation that uses map type input UDF as group expression can fail > --- > > Key: SPARK-32307 > URL: https://issues.apache.org/jira/browse/SPARK-32307 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.1.0 > > > {code:java} > spark.udf.register("key", udf((m: Map[String, String]) => m.keys.head.toInt)) > Seq(Map("1" -> "one", "2" -> "two")).toDF("a").createOrReplaceTempView("t") > checkAnswer(sql("SELECT key(a) AS k FROM t GROUP BY key(a)"), Row(1) :: Nil) > [info] org.apache.spark.sql.AnalysisException: expression 't.`a`' is > neither present in the group by, nor is it an aggregate function. Add to > group by or wrap in first() (or first_value) if you don't care which value > you get.;; > [info] Aggregate [UDF(a#6)], [UDF(a#6) AS k#8] > [info] +- SubqueryAlias t > [info]+- Project [value#3 AS a#6] > [info] +- LocalRelation [value#3] > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:49) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:48) > [info] at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:130) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidAggregateExpression$1(CheckAnalysis.scala:257) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10(CheckAnalysis.scala:259) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10$adapted(CheckAnalysis.scala:259) > [info] at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > [info] at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > [info] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidAggregateExpression$1(CheckAnalysis.scala:259) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10(CheckAnalysis.scala:259) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10$adapted(CheckAnalysis.scala:259) > [info] at scala.collection.immutable.List.foreach(List.scala:392) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidAggregateExpression$1(CheckAnalysis.scala:259) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$13(CheckAnalysis.scala:286) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$13$adapted(CheckAnalysis.scala:286) > [info] at scala.collection.immutable.List.foreach(List.scala:392) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:286) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:92) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:177) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:92) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:89) > [info] at >
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:130) > [info] at > org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:156) > [info] at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201) > [info] at > org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:153) > [info] at > org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:70) > [info] at > org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) > [info] at > org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:135) > [info] at > org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763) > [info] at > org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:135) > [info] at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:70) > [info] at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:68) > [info] at > org.apache.spark.sql.execution.QueryExecution.assertAna
[jira] [Commented] (SPARK-32307) Aggregation that uses map type input UDF as group expression can fail
[ https://issues.apache.org/jira/browse/SPARK-32307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158813#comment-17158813 ] Dongjoon Hyun commented on SPARK-32307: --- This is reverted from `branch-3.0` due to the UT failure. > Aggregation that uses map type input UDF as group expression can fail > --- > > Key: SPARK-32307 > URL: https://issues.apache.org/jira/browse/SPARK-32307 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.1.0 > > > {code:java} > spark.udf.register("key", udf((m: Map[String, String]) => m.keys.head.toInt)) > Seq(Map("1" -> "one", "2" -> "two")).toDF("a").createOrReplaceTempView("t") > checkAnswer(sql("SELECT key(a) AS k FROM t GROUP BY key(a)"), Row(1) :: Nil) > [info] org.apache.spark.sql.AnalysisException: expression 't.`a`' is > neither present in the group by, nor is it an aggregate function. Add to > group by or wrap in first() (or first_value) if you don't care which value > you get.;; > [info] Aggregate [UDF(a#6)], [UDF(a#6) AS k#8] > [info] +- SubqueryAlias t > [info]+- Project [value#3 AS a#6] > [info] +- LocalRelation [value#3] > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:49) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:48) > [info] at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:130) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidAggregateExpression$1(CheckAnalysis.scala:257) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10(CheckAnalysis.scala:259) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10$adapted(CheckAnalysis.scala:259) > [info] at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > [info] at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > [info] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidAggregateExpression$1(CheckAnalysis.scala:259) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10(CheckAnalysis.scala:259) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10$adapted(CheckAnalysis.scala:259) > [info] at scala.collection.immutable.List.foreach(List.scala:392) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidAggregateExpression$1(CheckAnalysis.scala:259) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$13(CheckAnalysis.scala:286) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$13$adapted(CheckAnalysis.scala:286) > [info] at scala.collection.immutable.List.foreach(List.scala:392) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:286) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:92) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:177) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:92) > [info] at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:89) > [info]
at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:130) > [info] at > org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:156) > [info] at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201) > [info] at > org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:153) > [info] at > org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:70) > [info] at > org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) > [info] at > org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:135) > [info] at > org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763) > [info] at > org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:135) > [info] at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:70) > [info] at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.sc
[jira] [Assigned] (SPARK-32327) Introduce UnresolvedTableOrPermanentView for commands that support a table/view but not a temporary view
[ https://issues.apache.org/jira/browse/SPARK-32327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32327: Assignee: Apache Spark > Introduce UnresolvedTableOrPermanentView for commands that support a > table/view but not a temporary view > > > Key: SPARK-32327 > URL: https://issues.apache.org/jira/browse/SPARK-32327 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Assignee: Apache Spark >Priority: Major > > We should have UnresolvedTableOrPermanentView for commands that support a > table or a view, but not a temporary view, so that analysis can fail if > an identifier is resolved to a temporary view for those commands. > > For example, SHOW TBLPROPERTIES should not support a temp view since it > always returns an empty result, which could be misleading. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32327) Introduce UnresolvedTableOrPermanentView for commands that support a table/view but not a temporary view
[ https://issues.apache.org/jira/browse/SPARK-32327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158745#comment-17158745 ] Apache Spark commented on SPARK-32327: -- User 'imback82' has created a pull request for this issue: https://github.com/apache/spark/pull/29127 > Introduce UnresolvedTableOrPermanentView for commands that support a > table/view but not a temporary view > > > Key: SPARK-32327 > URL: https://issues.apache.org/jira/browse/SPARK-32327 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Priority: Major > > We should have UnresolvedTableOrPermanentView for commands that support a > table or a view, but not a temporary view, so that analysis can fail if > an identifier is resolved to a temporary view for those commands. > > For example, SHOW TBLPROPERTIES should not support a temp view since it > always returns an empty result, which could be misleading. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32327) Introduce UnresolvedTableOrPermanentView for commands that support a table/view but not a temporary view
[ https://issues.apache.org/jira/browse/SPARK-32327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32327: Assignee: (was: Apache Spark) > Introduce UnresolvedTableOrPermanentView for commands that support a > table/view but not a temporary view > > > Key: SPARK-32327 > URL: https://issues.apache.org/jira/browse/SPARK-32327 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Priority: Major > > We should have UnresolvedTableOrPermanentView for commands that support a > table or a view, but not a temporary view, so that analysis can fail if > an identifier is resolved to a temporary view for those commands. > > For example, SHOW TBLPROPERTIES should not support a temp view since it > always returns an empty result, which could be misleading. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32327) Introduce UnresolvedTableOrPermanentView for commands that support a table/view but not a temporary view
[ https://issues.apache.org/jira/browse/SPARK-32327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158744#comment-17158744 ] Apache Spark commented on SPARK-32327: -- User 'imback82' has created a pull request for this issue: https://github.com/apache/spark/pull/29127 > Introduce UnresolvedTableOrPermanentView for commands that support a > table/view but not a temporary view > > > Key: SPARK-32327 > URL: https://issues.apache.org/jira/browse/SPARK-32327 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Priority: Major > > We should have UnresolvedTableOrPermanentView for commands that support a > table or a view, but not a temporary view, so that analysis can fail if > an identifier is resolved to a temporary view for those commands. > > For example, SHOW TBLPROPERTIES should not support a temp view since it > always returns an empty result, which could be misleading. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32037) Rename blacklisting feature to avoid language with racist connotation
[ https://issues.apache.org/jira/browse/SPARK-32037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158739#comment-17158739 ] Thomas Graves commented on SPARK-32037: --- Any other opinions on what we should go with here? > Rename blacklisting feature to avoid language with racist connotation > - > > Key: SPARK-32037 > URL: https://issues.apache.org/jira/browse/SPARK-32037 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Erik Krogen >Priority: Minor > > As per [discussion on the Spark dev > list|https://lists.apache.org/thread.html/rf6b2cdcba4d3875350517a2339619e5d54e12e66626a88553f9fe275%40%3Cdev.spark.apache.org%3E], > it will be beneficial to remove references to problematic language that can > alienate potential community members. One such reference is "blacklist". > While it seems to me that there is some valid debate as to whether this term > has racist origins, the cultural connotations are inescapable in today's > world. > I've created a separate task, SPARK-32036, to remove references outside of > this feature. Given the large surface area of this feature and the > public-facing UI / configs / etc., more care will need to be taken here. > I'd like to start by opening up debate on what the best replacement name > would be. Reject-/deny-/ignore-/block-list are common replacements for > "blacklist", but I'm not sure that any of them work well for this situation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32327) Introduce UnresolvedTableOrPermanentView for commands that support a table/view but not a temporary view
Terry Kim created SPARK-32327: - Summary: Introduce UnresolvedTableOrPermanentView for commands that support a table/view but not a temporary view Key: SPARK-32327 URL: https://issues.apache.org/jira/browse/SPARK-32327 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Terry Kim We should have UnresolvedTableOrPermanentView for commands that support a table or a view, but not a temporary view, so that analysis can fail if an identifier is resolved to a temporary view for those commands. For example, SHOW TBLPROPERTIES should not support a temp view since it always returns an empty result, which could be misleading. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
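[Editor's note] For readers skimming the thread, here is a minimal sketch of what such an unresolved leaf node could look like in Catalyst. The class shape mirrors the ticket's proposal but is illustrative only; the actual node and its resolution rule live in Spark's analyzer and may differ.

{code:java}
// A minimal sketch, assuming Catalyst's LeafNode API. In the real change the
// Analyzer would substitute this node with a resolved table or permanent view,
// and fail analysis when the identifier resolves to a temporary view.
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.LeafNode

case class UnresolvedTableOrPermanentView(multipartIdentifier: Seq[String])
  extends LeafNode {
  // Stays unresolved until the Analyzer replaces it.
  override lazy val resolved: Boolean = false
  override def output: Seq[Attribute] = Nil
}
{code}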
[jira] [Created] (SPARK-32326) R version is too old on Jenkins k8s PRB
Holden Karau created SPARK-32326: Summary: R version is too old on Jenkins k8s PRB Key: SPARK-32326 URL: https://issues.apache.org/jira/browse/SPARK-32326 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 3.1.0 Reporter: Holden Karau Assignee: Shane Knapp I'm seeing a consistent failure indicating the R version is out of date - [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/30513/console] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32324) Fix error messages when using PIVOT and lateral view
[ https://issues.apache.org/jira/browse/SPARK-32324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158584#comment-17158584 ] Apache Spark commented on SPARK-32324: -- User 'GuoPhilipse' has created a pull request for this issue: https://github.com/apache/spark/pull/29126 > Fix error messages when using PIVOT and lateral view > -- > > Key: SPARK-32324 > URL: https://issues.apache.org/jira/browse/SPARK-32324 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: philipse >Priority: Minor > > Currently, when we use `lateral view` and `pivot` together in a FROM clause, if > `lateral view` comes before `pivot`, the error message is "LATERAL cannot be > used together with PIVOT in FROM clause". If `lateral view` comes after > `pivot`, the query runs normally, so the error message "LATERAL cannot be > used together with PIVOT in FROM clause" is not accurate; we may improve it. > > Steps to reproduce: > ``` > CREATE TABLE person (id INT, name STRING, age INT, class int, address STRING); > INSERT INTO person VALUES > (100, 'John', 30, 1, 'Street 1'), > (200, 'Mary', NULL, 1, 'Street 2'), > (300, 'Mike', 80, 3, 'Street 3'), > (400, 'Dan', 50, 4, 'Street 4'); > ``` > Query1: > ``` > SELECT * FROM person > lateral view outer explode(array(30,60)) tabelName as c_age > lateral view explode(array(40,80)) as d_age > PIVOT ( > count(distinct age) as a > for name in ('Mary','John') > ) > ``` > Result 1: > ``` > Error: org.apache.spark.sql.catalyst.parser.ParseException: > LATERAL cannot be used together with PIVOT in FROM clause(line 1, pos 9) > == SQL == > SELECT * FROM person > -^^^ > lateral view outer explode(array(30,60)) tabelName as c_age > lateral view explode(array(40,80)) as d_age > PIVOT ( > count(distinct age) as a > for name in ('Mary','John') > ) (state=,code=0) > ``` > > Query2: > ``` > SELECT * FROM person > PIVOT ( > count(distinct age) as a > for name in ('Mary','John') > ) > lateral view outer explode(array(30,60)) tabelName as c_age > lateral view explode(array(40,80)) as d_age > ``` > Result 2: > ``` > +----+------+------+-------+-------+ > | id | Mary | John | c_age | d_age | > +----+------+------+-------+-------+ > | 300 | NULL | NULL | 30 | 40 | > | 300 | NULL | NULL | 30 | 80 | > | 300 | NULL | NULL | 60 | 40 | > | 300 | NULL | NULL | 60 | 80 | > | 100 | 0 | NULL | 30 | 40 | > | 100 | 0 | NULL | 30 | 80 | > | 100 | 0 | NULL | 60 | 40 | > | 100 | 0 | NULL | 60 | 80 | > | 400 | NULL | NULL | 30 | 40 | > | 400 | NULL | NULL | 30 | 80 | > | 400 | NULL | NULL | 60 | 40 | > | 400 | NULL | NULL | 60 | 80 | > | 200 | NULL | 1 | 30 | 40 | > | 200 | NULL | 1 | 30 | 80 | > | 200 | NULL | 1 | 60 | 40 | > | 200 | NULL | 1 | 60 | 80 | > +----+------+------+-------+-------+ > ``` > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32324) Fix error messages when using PIVOT and lateral view
[ https://issues.apache.org/jira/browse/SPARK-32324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32324: Assignee: (was: Apache Spark) > Fix error messages when using PIVOT and lateral view > -- > > Key: SPARK-32324 > URL: https://issues.apache.org/jira/browse/SPARK-32324 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: philipse >Priority: Minor > > Currently, when we use `lateral view` and `pivot` together in a FROM clause, if > `lateral view` comes before `pivot`, the error message is "LATERAL cannot be > used together with PIVOT in FROM clause". If `lateral view` comes after > `pivot`, the query runs normally, so the error message "LATERAL cannot be > used together with PIVOT in FROM clause" is not accurate; we may improve it. > > Steps to reproduce: > ``` > CREATE TABLE person (id INT, name STRING, age INT, class int, address STRING); > INSERT INTO person VALUES > (100, 'John', 30, 1, 'Street 1'), > (200, 'Mary', NULL, 1, 'Street 2'), > (300, 'Mike', 80, 3, 'Street 3'), > (400, 'Dan', 50, 4, 'Street 4'); > ``` > Query1: > ``` > SELECT * FROM person > lateral view outer explode(array(30,60)) tabelName as c_age > lateral view explode(array(40,80)) as d_age > PIVOT ( > count(distinct age) as a > for name in ('Mary','John') > ) > ``` > Result 1: > ``` > Error: org.apache.spark.sql.catalyst.parser.ParseException: > LATERAL cannot be used together with PIVOT in FROM clause(line 1, pos 9) > == SQL == > SELECT * FROM person > -^^^ > lateral view outer explode(array(30,60)) tabelName as c_age > lateral view explode(array(40,80)) as d_age > PIVOT ( > count(distinct age) as a > for name in ('Mary','John') > ) (state=,code=0) > ``` > > Query2: > ``` > SELECT * FROM person > PIVOT ( > count(distinct age) as a > for name in ('Mary','John') > ) > lateral view outer explode(array(30,60)) tabelName as c_age > lateral view explode(array(40,80)) as d_age > ``` > Result 2: > ``` > +----+------+------+-------+-------+ > | id | Mary | John | c_age | d_age | > +----+------+------+-------+-------+ > | 300 | NULL | NULL | 30 | 40 | > | 300 | NULL | NULL | 30 | 80 | > | 300 | NULL | NULL | 60 | 40 | > | 300 | NULL | NULL | 60 | 80 | > | 100 | 0 | NULL | 30 | 40 | > | 100 | 0 | NULL | 30 | 80 | > | 100 | 0 | NULL | 60 | 40 | > | 100 | 0 | NULL | 60 | 80 | > | 400 | NULL | NULL | 30 | 40 | > | 400 | NULL | NULL | 30 | 80 | > | 400 | NULL | NULL | 60 | 40 | > | 400 | NULL | NULL | 60 | 80 | > | 200 | NULL | 1 | 30 | 40 | > | 200 | NULL | 1 | 30 | 80 | > | 200 | NULL | 1 | 60 | 40 | > | 200 | NULL | 1 | 60 | 80 | > +----+------+------+-------+-------+ > ``` > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32324) Fix error messages when using PIVOT and lateral view
[ https://issues.apache.org/jira/browse/SPARK-32324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158582#comment-17158582 ] Apache Spark commented on SPARK-32324: -- User 'GuoPhilipse' has created a pull request for this issue: https://github.com/apache/spark/pull/29126 > Fix error messages when using PIVOT and lateral view > -- > > Key: SPARK-32324 > URL: https://issues.apache.org/jira/browse/SPARK-32324 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: philipse >Priority: Minor > > Currently, when we use `lateral view` and `pivot` together in a FROM clause, if > `lateral view` comes before `pivot`, the error message is "LATERAL cannot be > used together with PIVOT in FROM clause". If `lateral view` comes after > `pivot`, the query runs normally, so the error message "LATERAL cannot be > used together with PIVOT in FROM clause" is not accurate; we may improve it. > > Steps to reproduce: > ``` > CREATE TABLE person (id INT, name STRING, age INT, class int, address STRING); > INSERT INTO person VALUES > (100, 'John', 30, 1, 'Street 1'), > (200, 'Mary', NULL, 1, 'Street 2'), > (300, 'Mike', 80, 3, 'Street 3'), > (400, 'Dan', 50, 4, 'Street 4'); > ``` > Query1: > ``` > SELECT * FROM person > lateral view outer explode(array(30,60)) tabelName as c_age > lateral view explode(array(40,80)) as d_age > PIVOT ( > count(distinct age) as a > for name in ('Mary','John') > ) > ``` > Result 1: > ``` > Error: org.apache.spark.sql.catalyst.parser.ParseException: > LATERAL cannot be used together with PIVOT in FROM clause(line 1, pos 9) > == SQL == > SELECT * FROM person > -^^^ > lateral view outer explode(array(30,60)) tabelName as c_age > lateral view explode(array(40,80)) as d_age > PIVOT ( > count(distinct age) as a > for name in ('Mary','John') > ) (state=,code=0) > ``` > > Query2: > ``` > SELECT * FROM person > PIVOT ( > count(distinct age) as a > for name in ('Mary','John') > ) > lateral view outer explode(array(30,60)) tabelName as c_age > lateral view explode(array(40,80)) as d_age > ``` > Result 2: > ``` > +----+------+------+-------+-------+ > | id | Mary | John | c_age | d_age | > +----+------+------+-------+-------+ > | 300 | NULL | NULL | 30 | 40 | > | 300 | NULL | NULL | 30 | 80 | > | 300 | NULL | NULL | 60 | 40 | > | 300 | NULL | NULL | 60 | 80 | > | 100 | 0 | NULL | 30 | 40 | > | 100 | 0 | NULL | 30 | 80 | > | 100 | 0 | NULL | 60 | 40 | > | 100 | 0 | NULL | 60 | 80 | > | 400 | NULL | NULL | 30 | 40 | > | 400 | NULL | NULL | 30 | 80 | > | 400 | NULL | NULL | 60 | 40 | > | 400 | NULL | NULL | 60 | 80 | > | 200 | NULL | 1 | 30 | 40 | > | 200 | NULL | 1 | 30 | 80 | > | 200 | NULL | 1 | 60 | 40 | > | 200 | NULL | 1 | 60 | 80 | > +----+------+------+-------+-------+ > ``` > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32324) Fix error messages when using PIVOT and lateral view
[ https://issues.apache.org/jira/browse/SPARK-32324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32324: Assignee: Apache Spark > Fix error messages when using PIVOT and lateral view > -- > > Key: SPARK-32324 > URL: https://issues.apache.org/jira/browse/SPARK-32324 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: philipse >Assignee: Apache Spark >Priority: Minor > > Currently, when we use `lateral view` and `pivot` together in a FROM clause, if > `lateral view` comes before `pivot`, the error message is "LATERAL cannot be > used together with PIVOT in FROM clause". If `lateral view` comes after > `pivot`, the query runs normally, so the error message "LATERAL cannot be > used together with PIVOT in FROM clause" is not accurate; we may improve it. > > Steps to reproduce: > ``` > CREATE TABLE person (id INT, name STRING, age INT, class int, address STRING); > INSERT INTO person VALUES > (100, 'John', 30, 1, 'Street 1'), > (200, 'Mary', NULL, 1, 'Street 2'), > (300, 'Mike', 80, 3, 'Street 3'), > (400, 'Dan', 50, 4, 'Street 4'); > ``` > Query1: > ``` > SELECT * FROM person > lateral view outer explode(array(30,60)) tabelName as c_age > lateral view explode(array(40,80)) as d_age > PIVOT ( > count(distinct age) as a > for name in ('Mary','John') > ) > ``` > Result 1: > ``` > Error: org.apache.spark.sql.catalyst.parser.ParseException: > LATERAL cannot be used together with PIVOT in FROM clause(line 1, pos 9) > == SQL == > SELECT * FROM person > -^^^ > lateral view outer explode(array(30,60)) tabelName as c_age > lateral view explode(array(40,80)) as d_age > PIVOT ( > count(distinct age) as a > for name in ('Mary','John') > ) (state=,code=0) > ``` > > Query2: > ``` > SELECT * FROM person > PIVOT ( > count(distinct age) as a > for name in ('Mary','John') > ) > lateral view outer explode(array(30,60)) tabelName as c_age > lateral view explode(array(40,80)) as d_age > ``` > Result 2: > ``` > +----+------+------+-------+-------+ > | id | Mary | John | c_age | d_age | > +----+------+------+-------+-------+ > | 300 | NULL | NULL | 30 | 40 | > | 300 | NULL | NULL | 30 | 80 | > | 300 | NULL | NULL | 60 | 40 | > | 300 | NULL | NULL | 60 | 80 | > | 100 | 0 | NULL | 30 | 40 | > | 100 | 0 | NULL | 30 | 80 | > | 100 | 0 | NULL | 60 | 40 | > | 100 | 0 | NULL | 60 | 80 | > | 400 | NULL | NULL | 30 | 40 | > | 400 | NULL | NULL | 30 | 80 | > | 400 | NULL | NULL | 60 | 40 | > | 400 | NULL | NULL | 60 | 80 | > | 200 | NULL | 1 | 30 | 40 | > | 200 | NULL | 1 | 30 | 80 | > | 200 | NULL | 1 | 60 | 40 | > | 200 | NULL | 1 | 60 | 80 | > +----+------+------+-------+-------+ > ``` > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32325) JSON predicate pushdown for nested fields
Maxim Gekk created SPARK-32325: -- Summary: JSON predicate pushdown for nested fields Key: SPARK-32325 URL: https://issues.apache.org/jira/browse/SPARK-32325 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk SPARK-30648 added filter pushdown to the JSON datasource, but it supports only filters that refer to top-level fields. This ticket aims to support nested fields as well. See the needed changes: https://github.com/apache/spark/pull/27366#discussion_r443340603 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
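[Editor's note] To make the gap concrete, here is a small illustration; the schema, path, and data below are made up for the example, and only the first filter benefits from pushdown today:

{code:java}
// Sketch in spark-shell style; `spark` and `$` come from the shell session.
import spark.implicits._

val df = spark.read
  .schema("a STRUCT<b: INT>, c STRING")
  .json("/tmp/events.json")  // hypothetical input path

// Top-level field: pushed down to the JSON parser after SPARK-30648.
df.filter($"c" === "x").count()

// Nested field: not pushed down yet; this ticket aims to support it.
df.filter($"a.b" > 1).count()
{code}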
[jira] [Created] (SPARK-32324) Fix error messages when using PIVOT and lateral view
philipse created SPARK-32324: Summary: Fix error messages when using PIVOT and lateral view Key: SPARK-32324 URL: https://issues.apache.org/jira/browse/SPARK-32324 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: philipse Currently, when we use `lateral view` and `pivot` together in a FROM clause, if `lateral view` comes before `pivot`, the error message is "LATERAL cannot be used together with PIVOT in FROM clause". If `lateral view` comes after `pivot`, the query runs normally, so the error message "LATERAL cannot be used together with PIVOT in FROM clause" is not accurate; we may improve it. Steps to reproduce: ``` CREATE TABLE person (id INT, name STRING, age INT, class int, address STRING); INSERT INTO person VALUES (100, 'John', 30, 1, 'Street 1'), (200, 'Mary', NULL, 1, 'Street 2'), (300, 'Mike', 80, 3, 'Street 3'), (400, 'Dan', 50, 4, 'Street 4'); ``` Query1: ``` SELECT * FROM person lateral view outer explode(array(30,60)) tabelName as c_age lateral view explode(array(40,80)) as d_age PIVOT ( count(distinct age) as a for name in ('Mary','John') ) ``` Result 1: ``` Error: org.apache.spark.sql.catalyst.parser.ParseException: LATERAL cannot be used together with PIVOT in FROM clause(line 1, pos 9) == SQL == SELECT * FROM person -^^^ lateral view outer explode(array(30,60)) tabelName as c_age lateral view explode(array(40,80)) as d_age PIVOT ( count(distinct age) as a for name in ('Mary','John') ) (state=,code=0) ``` Query2: ``` SELECT * FROM person PIVOT ( count(distinct age) as a for name in ('Mary','John') ) lateral view outer explode(array(30,60)) tabelName as c_age lateral view explode(array(40,80)) as d_age ``` Result 2: ``` +----+------+------+-------+-------+ | id | Mary | John | c_age | d_age | +----+------+------+-------+-------+ | 300 | NULL | NULL | 30 | 40 | | 300 | NULL | NULL | 30 | 80 | | 300 | NULL | NULL | 60 | 40 | | 300 | NULL | NULL | 60 | 80 | | 100 | 0 | NULL | 30 | 40 | | 100 | 0 | NULL | 30 | 80 | | 100 | 0 | NULL | 60 | 40 | | 100 | 0 | NULL | 60 | 80 | | 400 | NULL | NULL | 30 | 40 | | 400 | NULL | NULL | 30 | 80 | | 400 | NULL | NULL | 60 | 40 | | 400 | NULL | NULL | 60 | 80 | | 200 | NULL | 1 | 30 | 40 | | 200 | NULL | 1 | 30 | 80 | | 200 | NULL | 1 | 60 | 40 | | 200 | NULL | 1 | 60 | 80 | +----+------+------+-------+-------+ ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32018) Fix UnsafeRow set overflowed decimal
[ https://issues.apache.org/jira/browse/SPARK-32018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158553#comment-17158553 ] Apache Spark commented on SPARK-32018: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/29125 > Fix UnsafeRow set overflowed decimal > > > Key: SPARK-32018 > URL: https://issues.apache.org/jira/browse/SPARK-32018 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6, 3.0.0 >Reporter: Allison Wang >Priority: Major > > There is a bug where writing an overflowed decimal into UnsafeRow succeeds, but > reading it back throws ArithmeticException. The exception is thrown when > calling {{getDecimal}} in UnsafeRow with the stored decimal's precision greater > than the requested precision. Setting the value of the overflowed decimal to null > when writing into UnsafeRow should fix this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
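[Editor's note] A minimal sketch of the direction described above, assuming Catalyst's {{UnsafeRowWriter}} and {{Decimal}} APIs; the helper below is hypothetical and not the actual patch in the pull request:

{code:java}
import org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter
import org.apache.spark.sql.types.Decimal

// Hypothetical helper: detect overflow before writing and store null instead,
// so a later getDecimal cannot observe a value wider than its precision.
def writeDecimalOrNull(writer: UnsafeRowWriter, ordinal: Int,
    value: Decimal, precision: Int, scale: Int): Unit = {
  // changePrecision returns false when the value does not fit the target
  // precision/scale, i.e. it overflowed.
  if (value.changePrecision(precision, scale)) {
    writer.write(ordinal, value, precision, scale)
  } else {
    writer.setNullAt(ordinal)
  }
}
{code}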
[jira] [Commented] (SPARK-32018) Fix UnsafeRow set overflowed decimal
[ https://issues.apache.org/jira/browse/SPARK-32018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158548#comment-17158548 ] Apache Spark commented on SPARK-32018: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/29125 > Fix UnsafeRow set overflowed decimal > > > Key: SPARK-32018 > URL: https://issues.apache.org/jira/browse/SPARK-32018 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6, 3.0.0 >Reporter: Allison Wang >Priority: Major > > There is a bug where writing an overflowed decimal into UnsafeRow succeeds, but > reading it back throws ArithmeticException. The exception is thrown when > calling {{getDecimal}} in UnsafeRow with the stored decimal's precision greater > than the requested precision. Setting the value of the overflowed decimal to null > when writing into UnsafeRow should fix this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32140) Add summary to FMClassificationModel
[ https://issues.apache.org/jira/browse/SPARK-32140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao resolved SPARK-32140. Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28960 [https://github.com/apache/spark/pull/28960] > Add summary to FMClassificationModel > > > Key: SPARK-32140 > URL: https://issues.apache.org/jira/browse/SPARK-32140 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
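[Editor's note] A hedged usage sketch of the added API, mirroring how other classifier training summaries work; {{trainingData}} is an assumed DataFrame with "features" and "label" columns, and the exact summary fields may differ:

{code:java}
import org.apache.spark.ml.classification.FMClassifier

// Sketch: train a factorization machines classifier, then read its summary.
val model = new FMClassifier().fit(trainingData)
if (model.hasSummary) {
  val summary = model.summary
  println(s"accuracy = ${summary.accuracy}")
  println(s"iterations = ${summary.totalIterations}")
}
{code}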
[jira] [Assigned] (SPARK-32140) Add summary to FMClassificationModel
[ https://issues.apache.org/jira/browse/SPARK-32140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao reassigned SPARK-32140: -- Assignee: Huaxin Gao > Add summary to FMClassificationModel > > > Key: SPARK-32140 > URL: https://issues.apache.org/jira/browse/SPARK-32140 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32287) Flaky Test: ExecutorAllocationManagerSuite.add executors default profile
[ https://issues.apache.org/jira/browse/SPARK-32287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158521#comment-17158521 ] Thomas Graves commented on SPARK-32287: --- I'll try to reproduce and investigate locally > Flaky Test: ExecutorAllocationManagerSuite.add executors default profile > > > Key: SPARK-32287 > URL: https://issues.apache.org/jira/browse/SPARK-32287 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Tests >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > This test becomes flaky in Github Actions, see: > https://github.com/apache/spark/pull/29072/checks?check_run_id=861689509 > {code:java} > [info] - add executors default profile *** FAILED *** (33 milliseconds) > [info] 4 did not equal 2 (ExecutorAllocationManagerSuite.scala:132) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530) > [info] at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529) > [info] at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > [info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503) > [info] at > org.apache.spark.ExecutorAllocationManagerSuite.$anonfun$new$7(ExecutorAllocationManagerSuite.scala:132) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > [info] at > org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:157) > [info] at > org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) > [info] at > org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286) > [info] at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196) > [info] at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178) > [info] at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:59) > [info] at > org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221) > [info] at > org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214) > [info] at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:59) > [info] at > org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229) > [info] ... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32036) Remove references to "blacklist"/"whitelist" language (outside of blacklisting feature)
[ https://issues.apache.org/jira/browse/SPARK-32036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-32036: - Assignee: Erik Krogen > Remove references to "blacklist"/"whitelist" language (outside of > blacklisting feature) > --- > > Key: SPARK-32036 > URL: https://issues.apache.org/jira/browse/SPARK-32036 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Minor > Fix For: 3.1.0 > > > As per [discussion on the Spark dev > list|https://lists.apache.org/thread.html/rf6b2cdcba4d3875350517a2339619e5d54e12e66626a88553f9fe275%40%3Cdev.spark.apache.org%3E], > it will be beneficial to remove references to problematic language that can > alienate potential community members. One such reference is "blacklist" and > "whitelist". While it seems to me that there is some valid debate as to > whether these terms have racist origins, the cultural connotations are > inescapable in today's world. > Renaming the entire blacklisting feature would be a large effort with lots of > care needed to maintain public-facing APIs and configurations. Though I think > this will be a very rewarding effort for which I've filed SPARK-32037, I'd > like to start by tackling all of the other references to such terminology in > the codebase, of which there are still dozens or hundreds beyond the > blacklisting feature. > I'm not sure what the best "Component" is for this so I put Spark Core for > now. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32036) Remove references to "blacklist"/"whitelist" language (outside of blacklisting feature)
[ https://issues.apache.org/jira/browse/SPARK-32036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-32036. --- Fix Version/s: 3.1.0 Resolution: Fixed > Remove references to "blacklist"/"whitelist" language (outside of > blacklisting feature) > --- > > Key: SPARK-32036 > URL: https://issues.apache.org/jira/browse/SPARK-32036 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Erik Krogen >Priority: Minor > Fix For: 3.1.0 > > > As per [discussion on the Spark dev > list|https://lists.apache.org/thread.html/rf6b2cdcba4d3875350517a2339619e5d54e12e66626a88553f9fe275%40%3Cdev.spark.apache.org%3E], > it will be beneficial to remove references to problematic language that can > alienate potential community members. One such reference is "blacklist" and > "whitelist". While it seems to me that there is some valid debate as to > whether these terms have racist origins, the cultural connotations are > inescapable in today's world. > Renaming the entire blacklisting feature would be a large effort with lots of > care needed to maintain public-facing APIs and configurations. Though I think > this will be a very rewarding effort for which I've filed SPARK-32037, I'd > like to start by tackling all of the other references to such terminology in > the codebase, of which there are still dozens or hundreds beyond the > blacklisting feature. > I'm not sure what the best "Component" is for this so I put Spark Core for > now. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32321) Rollback SPARK-26267 workaround since KAFKA-7703 resolved
[ https://issues.apache.org/jira/browse/SPARK-32321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158484#comment-17158484 ] Gabor Somogyi commented on SPARK-32321: --- [~zsxwing] since you're the author and wrote the following: {quote}In addition, to avoid other unknown issues, we also use the previous known offsets to audit the latest offsets returned by Kafka.{quote} What do you think about this extra safety feature? Should we keep it or drop it? > Rollback SPARK-26267 workaround since KAFKA-7703 resolved > - > > Key: SPARK-32321 > URL: https://issues.apache.org/jira/browse/SPARK-32321 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32321) Rollback SPARK-26267 workaround since KAFKA-7703 resolved
[ https://issues.apache.org/jira/browse/SPARK-32321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158475#comment-17158475 ] Gabor Somogyi commented on SPARK-32321: --- Hopefully this will make the KafkaOffsetReader area clearer, because a couple of users have hit SPARK-28367 > Rollback SPARK-26267 workaround since KAFKA-7703 resolved > - > > Key: SPARK-32321 > URL: https://issues.apache.org/jira/browse/SPARK-32321 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32318) Add a test case to EliminateSortsSuite for protecting ORDER BY in DISTRIBUTE BY
[ https://issues.apache.org/jira/browse/SPARK-32318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-32318. --- Fix Version/s: 3.1.0 2.4.7 3.0.1 Resolution: Fixed Issue resolved by pull request 29118 [https://github.com/apache/spark/pull/29118] > Add a test case to EliminateSortsSuite for protecting ORDER BY in DISTRIBUTE > BY > --- > > Key: SPARK-32318 > URL: https://issues.apache.org/jira/browse/SPARK-32318 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 2.4.7, 3.0.1, 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.0.1, 2.4.7, 3.1.0 > > > This was found while reviewing SPARK-32276. > *AFTER SPARK-32276* > {code} > scala> scala.util.Random.shuffle((1 to 10).map(x => (x % 2, > x))).toDF("a", "b").repartition(2).createOrReplaceTempView("t") > scala> sql("select * from (select * from t order by b) distribute by > a").write.orc("/tmp/SPARK-32276") > $ ls -al /tmp/SPARK-32276/ > total 632 > drwxr-xr-x 10 dongjoon wheel 320 Jul 14 22:08 ./ > drwxrwxrwt 14 root wheel 448 Jul 14 22:08 ../ > -rw-r--r-- 1 dongjoon wheel 8 Jul 14 22:08 ._SUCCESS.crc > -rw-r--r-- 1 dongjoon wheel 12 Jul 14 22:08 > .part-0-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc.crc > -rw-r--r-- 1 dongjoon wheel 1188 Jul 14 22:08 > .part-00043-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc.crc > -rw-r--r-- 1 dongjoon wheel 1188 Jul 14 22:08 > .part-00191-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc.crc > -rw-r--r-- 1 dongjoon wheel 0 Jul 14 22:08 _SUCCESS > -rw-r--r-- 1 dongjoon wheel 119 Jul 14 22:08 > part-0-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc > -rw-r--r-- 1 dongjoon wheel 150735 Jul 14 22:08 > part-00043-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc > -rw-r--r-- 1 dongjoon wheel 150741 Jul 14 22:08 > part-00191-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc > {code} > *BEFORE* > {code} > scala> scala.util.Random.shuffle((1 to 10).map(x => (x % 2, > x))).toDF("a", "b").repartition(2).createOrReplaceTempView("t") > scala> sql("select * from (select * from t order by b) distribute by > a").write.orc("/tmp/master") > $ ls -al /tmp/master/ > total 56 > drwxr-xr-x 10 dongjoon wheel 320 Jul 14 22:12 ./ > drwxrwxrwt 15 root wheel 480 Jul 14 22:12 ../ > -rw-r--r-- 1 dongjoon wheel 8 Jul 14 22:12 ._SUCCESS.crc > -rw-r--r-- 1 dongjoon wheel 12 Jul 14 22:12 > .part-0-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc.crc > -rw-r--r-- 1 dongjoon wheel 16 Jul 14 22:12 > .part-00043-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc.crc > -rw-r--r-- 1 dongjoon wheel 16 Jul 14 22:12 > .part-00191-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc.crc > -rw-r--r-- 1 dongjoon wheel 0 Jul 14 22:12 _SUCCESS > -rw-r--r-- 1 dongjoon wheel 119 Jul 14 22:12 > part-0-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc > -rw-r--r-- 1 dongjoon wheel 932 Jul 14 22:12 > part-00043-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc > -rw-r--r-- 1 dongjoon wheel 939 Jul 14 22:12 > part-00191-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
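[Editor's note] For readers unfamiliar with EliminateSortsSuite, the added coverage is roughly of this shape; the test name and plan are illustrative, assuming the suite's usual {{testRelation}} and {{Optimize}} fixtures, and not necessarily the merged test:

{code:java}
// Sketch: the sort under a repartition-by-expression must not be eliminated.
test("do not remove sort under RepartitionByExpression") {
  val plan = testRelation.orderBy('b.asc).distribute('a)(2)
  val optimized = Optimize.execute(plan.analyze)
  // The ORDER BY survives, so the optimized plan equals the analyzed input.
  comparePlans(optimized, plan.analyze)
}
{code}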
[jira] [Assigned] (SPARK-32318) Add a test case to EliminateSortsSuite for protecting ORDER BY in DISTRIBUTE BY
[ https://issues.apache.org/jira/browse/SPARK-32318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-32318: - Assignee: Dongjoon Hyun > Add a test case to EliminateSortsSuite for protecting ORDER BY in DISTRIBUTE > BY > --- > > Key: SPARK-32318 > URL: https://issues.apache.org/jira/browse/SPARK-32318 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 2.4.7, 3.0.1, 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > > This was found while reviewing SPARK-32276. > *AFTER SPARK-32276* > {code} > scala> scala.util.Random.shuffle((1 to 10).map(x => (x % 2, > x))).toDF("a", "b").repartition(2).createOrReplaceTempView("t") > scala> sql("select * from (select * from t order by b) distribute by > a").write.orc("/tmp/SPARK-32276") > $ ls -al /tmp/SPARK-32276/ > total 632 > drwxr-xr-x 10 dongjoon wheel 320 Jul 14 22:08 ./ > drwxrwxrwt 14 root wheel 448 Jul 14 22:08 ../ > -rw-r--r-- 1 dongjoon wheel 8 Jul 14 22:08 ._SUCCESS.crc > -rw-r--r-- 1 dongjoon wheel 12 Jul 14 22:08 > .part-0-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc.crc > -rw-r--r-- 1 dongjoon wheel 1188 Jul 14 22:08 > .part-00043-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc.crc > -rw-r--r-- 1 dongjoon wheel 1188 Jul 14 22:08 > .part-00191-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc.crc > -rw-r--r-- 1 dongjoon wheel 0 Jul 14 22:08 _SUCCESS > -rw-r--r-- 1 dongjoon wheel 119 Jul 14 22:08 > part-0-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc > -rw-r--r-- 1 dongjoon wheel 150735 Jul 14 22:08 > part-00043-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc > -rw-r--r-- 1 dongjoon wheel 150741 Jul 14 22:08 > part-00191-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc > {code} > *BEFORE* > {code} > scala> scala.util.Random.shuffle((1 to 10).map(x => (x % 2, > x))).toDF("a", "b").repartition(2).createOrReplaceTempView("t") > scala> sql("select * from (select * from t order by b) distribute by > a").write.orc("/tmp/master") > $ ls -al /tmp/master/ > total 56 > drwxr-xr-x 10 dongjoon wheel 320 Jul 14 22:12 ./ > drwxrwxrwt 15 root wheel 480 Jul 14 22:12 ../ > -rw-r--r-- 1 dongjoon wheel 8 Jul 14 22:12 ._SUCCESS.crc > -rw-r--r-- 1 dongjoon wheel 12 Jul 14 22:12 > .part-0-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc.crc > -rw-r--r-- 1 dongjoon wheel 16 Jul 14 22:12 > .part-00043-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc.crc > -rw-r--r-- 1 dongjoon wheel 16 Jul 14 22:12 > .part-00191-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc.crc > -rw-r--r-- 1 dongjoon wheel 0 Jul 14 22:12 _SUCCESS > -rw-r--r-- 1 dongjoon wheel 119 Jul 14 22:12 > part-0-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc > -rw-r--r-- 1 dongjoon wheel 932 Jul 14 22:12 > part-00043-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc > -rw-r--r-- 1 dongjoon wheel 939 Jul 14 22:12 > part-00191-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32323) Javascript/HTML bug in spark application UI
[ https://issues.apache.org/jira/browse/SPARK-32323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ihor Bobak updated SPARK-32323: --- Description: I attached a screenshot - everything is written on it. This appeared in Spark 3.0.0 in the Firefox browser (latest version) was: I attached a screenshot - everything is written on it. This appeared in Spark 3.0.0 > Javascript/HTML bug in spark application UI > --- > > Key: SPARK-32323 > URL: https://issues.apache.org/jira/browse/SPARK-32323 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 3.0.0 > Environment: Ubuntu 18, Spark 3.0.0 standalone cluster >Reporter: Ihor Bobak >Priority: Major > Attachments: 2020-07-15 16_36_31-pyspark-shell - Spark Jobs.png > > > I attached a screenshot - everything is written on it. > This appeared in Spark 3.0.0 in the Firefox browser (latest version) > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32323) Javascript/HTML bug in spark application UI
[ https://issues.apache.org/jira/browse/SPARK-32323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ihor Bobak updated SPARK-32323: --- Attachment: 2020-07-15 16_36_31-pyspark-shell - Spark Jobs.png > Javascript/HTML bug in spark application UI > --- > > Key: SPARK-32323 > URL: https://issues.apache.org/jira/browse/SPARK-32323 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 3.0.0 > Environment: Ubuntu 18, Spark 3.0.0 standalone cluster >Reporter: Ihor Bobak >Priority: Major > Attachments: 2020-07-15 16_36_31-pyspark-shell - Spark Jobs.png > > > I attached a screenshot - everything is written on it. > This appeared in Spark 3.0.0 > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32323) Javascript/HTML bug in spark application UI
[ https://issues.apache.org/jira/browse/SPARK-32323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ihor Bobak updated SPARK-32323: --- Description: I attached a screenshot - everything is written on it. This appeared in Spark 3.0.0 was: I attached a screenshot - everything is written on it. This appeared in Spark 3.0.0 !image-2020-07-15-16-40-42-328.png! > Javascript/HTML bug in spark application UI > --- > > Key: SPARK-32323 > URL: https://issues.apache.org/jira/browse/SPARK-32323 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 3.0.0 > Environment: Ubuntu 18, Spark 3.0.0 standalone cluster >Reporter: Ihor Bobak >Priority: Major > > I attached a screenshot - everything is written on it. > This appeared in Spark 3.0.0 > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32323) Javascript/HTML bug in spark application UI
Ihor Bobak created SPARK-32323: -- Summary: Javascript/HTML bug in spark application UI Key: SPARK-32323 URL: https://issues.apache.org/jira/browse/SPARK-32323 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 3.0.0 Environment: Ubuntu 18, Spark 3.0.0 standalone cluster Reporter: Ihor Bobak I attached a screenshot - everything is written on it. This appeared in Spark 3.0.0 !image-2020-07-15-16-40-42-328.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32322) Pyspark not launching in Spark IPV6 environment
[ https://issues.apache.org/jira/browse/SPARK-32322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158165#comment-17158165 ] pavithra ramachandran commented on SPARK-32322: --- I would like to check this. > Pyspark not launching in Spark IPV6 environment > --- > > Key: SPARK-32322 > URL: https://issues.apache.org/jira/browse/SPARK-32322 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.1.0 >Reporter: jobit mathew >Priority: Minor > > pyspark is not launching in a Spark IPV6 environment. > Initial analysis suggests that Python does not support IPV6. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32322) Pyspark not launching in Spark IPV6 environment
jobit mathew created SPARK-32322: Summary: Pyspark not launching in Spark IPV6 environment Key: SPARK-32322 URL: https://issues.apache.org/jira/browse/SPARK-32322 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.1.0 Reporter: jobit mathew pyspark is not launching in a Spark IPV6 environment. Initial analysis suggests that Python does not support IPV6. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32321) Rollback SPARK-26267 workaround since KAFKA-7703 resolved
[ https://issues.apache.org/jira/browse/SPARK-32321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158161#comment-17158161 ] Gabor Somogyi commented on SPARK-32321: --- I've started to work on this > Rollback SPARK-26267 workaround since KAFKA-7703 resolved > - > > Key: SPARK-32321 > URL: https://issues.apache.org/jira/browse/SPARK-32321 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32321) Rollback SPARK-26267 workaround since KAFKA-7703 resolved
[ https://issues.apache.org/jira/browse/SPARK-32321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158160#comment-17158160 ] Gabor Somogyi commented on SPARK-32321: --- FYI [~zsxwing] > Rollback SPARK-26267 workaround since KAFKA-7703 resolved > - > > Key: SPARK-32321 > URL: https://issues.apache.org/jira/browse/SPARK-32321 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32321) Rollback SPARK-26267 workaround since KAFKA-7703 resolved
Gabor Somogyi created SPARK-32321: - Summary: Rollback SPARK-26267 workaround since KAFKA-7703 resolved Key: SPARK-32321 URL: https://issues.apache.org/jira/browse/SPARK-32321 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.1.0 Reporter: Gabor Somogyi -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32281) Spark wipes out SORTED spec in metastore when DESC is used
[ https://issues.apache.org/jira/browse/SPARK-32281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158125#comment-17158125 ] Ankit Raj Boudh commented on SPARK-32281: - [~bersprockets], I will raise PR for this soon. > Spark wipes out SORTED spec in metastore when DESC is used > -- > > Key: SPARK-32281 > URL: https://issues.apache.org/jira/browse/SPARK-32281 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Bruce Robbins >Priority: Major > > When altering a Hive bucketed table or updating its statistics, Spark will > wipe out the SORTED specification in the metastore if the specification uses > DESC. > For example: > {noformat} > 0: jdbc:hive2://localhost:1> -- in beeline > 0: jdbc:hive2://localhost:1> create table bucketed (a int, b int, c int, > d int) clustered by (c) sorted by (c asc, d desc) into 10 buckets; > No rows affected (0.045 seconds) > 0: jdbc:hive2://localhost:1> show create table bucketed; > ++ > | createtab_stmt | > ++ > | CREATE TABLE `bucketed`( | > | `a` int, | > | `b` int, | > | `c` int, | > | `d` int) | > | CLUSTERED BY ( | > | c) | > | SORTED BY (| > | c ASC, | > | d DESC) | > | INTO 10 BUCKETS| > | ROW FORMAT SERDE | > | 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' | > | STORED AS INPUTFORMAT | > | 'org.apache.hadoop.mapred.TextInputFormat' | > | OUTPUTFORMAT | > | 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' | > | LOCATION | > | 'file:/Users/bruce/hadoop/apache-hive-2.3.7-bin/warehouse/bucketed' | > | TBLPROPERTIES (| > | 'transient_lastDdlTime'='1594488043')| > ++ > 21 rows selected (0.042 seconds) > 0: jdbc:hive2://localhost:1> > - > - > - > scala> // in spark > scala> sql("alter table bucketed set tblproperties ('foo'='bar')") > 20/07/11 10:21:36 WARN HiveConf: HiveConf of name hive.metastore.local does > not exist > 20/07/11 10:21:38 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, > since hive.security.authorization.manager is set to instance of > HiveAuthorizerFactory. > res0: org.apache.spark.sql.DataFrame = [] > scala> > - > - > - > 0: jdbc:hive2://localhost:1> -- back in beeline > 0: jdbc:hive2://localhost:1> show create table bucketed; > ++ > | createtab_stmt | > ++ > | CREATE TABLE `bucketed`( | > | `a` int, | > | `b` int, | > | `c` int, | > | `d` int) | > | CLUSTERED BY ( | > | c) | > | INTO 10 BUCKETS| > | ROW FORMAT SERDE | > | 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' | > | STORED AS INPUTFORMAT | > | 'org.apache.hadoop.mapred.TextInputFormat' | > | OUTPUTFORMAT | > | 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' | > | LOCATION | > | 'file:/Users/bruce/hadoop/apache-hive-2.3.7-bin/warehouse/bucketed' | > | TBLPROPERTIES (| > | 'foo'='bar', | > | 'spark.sql.partitionProvider'='catalog', | > | 'transient_lastDdlTime'='1594488098')| > ++ > 20 rows selected (0.038 seconds) > 0: jdbc:hive2://localhost:1> > {noformat} > Note that the SORTED specification disappears. > Another example, this time using insert: > {noformat} > 0: jdbc:hive2://localhost:1> -- in beeline > 0: jdbc:hive2://localhost:1> create table bucketed (a int,
[jira] [Assigned] (SPARK-31168) Upgrade Scala to 2.12.12
[ https://issues.apache.org/jira/browse/SPARK-31168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31168: Assignee: Apache Spark > Upgrade Scala to 2.12.12 > > > Key: SPARK-31168 > URL: https://issues.apache.org/jira/browse/SPARK-31168 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > h2. Highlights > * Performance improvements in the collections library: algorithmic > improvements and changes to avoid unnecessary allocations ([list of > PRs|https://github.com/scala/scala/pulls?q=is%3Apr+milestone%3A2.12.11+is%3Aclosed+sort%3Acreated-desc+label%3Alibrary%3Acollections+label%3Aperformance]) > * Performance improvements in the compiler ([list of > PRs|https://github.com/scala/scala/pulls?q=is%3Apr+milestone%3A2.12.11+is%3Aclosed+sort%3Acreated-desc+-label%3Alibrary%3Acollections+label%3Aperformance+], > minor [effects in our > benchmarks|https://scala-ci.typesafe.com/grafana/dashboard/db/scala-benchmark?orgId=1&from=1567985515850&to=1584355915694&var-branch=2.12.x&var-source=All&var-bench=HotScalacBenchmark.compile&var-host=scalabench@scalabench@]) > * Improvements to {{-Yrepl-class-based}}, an alternative internal REPL > encoding that avoids deadlocks (details on > [#8712|https://github.com/scala/scala/pull/8712]) > * A new {{-Yrepl-use-magic-imports}} flag that avoids deep class nesting in > the REPL, which can lead to deteriorating performance in long sessions > ([#8576|https://github.com/scala/scala/pull/8576]) > * Fix some {{toX}} methods that could expose the underlying mutability of a > {{ListBuffer}}-generated collection > ([#8674|https://github.com/scala/scala/pull/8674]) > h3. JDK 9+ support > * ASM was upgraded to 7.3.1, allowing the optimizer to run on JDK 13+ > ([#8676|https://github.com/scala/scala/pull/8676]) > * {{:javap}} in the REPL now works on JDK 9+ > ([#8400|https://github.com/scala/scala/pull/8400]) > h3. Other changes > * Support new labels for creating durations for consistency: > {{Duration("1m")}}, {{Duration("3 hrs")}} > ([#8325|https://github.com/scala/scala/pull/8325], > [#8450|https://github.com/scala/scala/pull/8450]) > * Fix memory leak in runtime reflection's {{TypeTag}} caches > ([#8470|https://github.com/scala/scala/pull/8470]) and some thread safety > issues in runtime reflection > ([#8433|https://github.com/scala/scala/pull/8433]) > * When using compiler plugins, the ordering of compiler phases may change > due to [#8427|https://github.com/scala/scala/pull/8427] > For more details, see [https://github.com/scala/scala/releases/tag/v2.12.11]. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31168) Upgrade Scala to 2.12.12
[ https://issues.apache.org/jira/browse/SPARK-31168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31168: Assignee: (was: Apache Spark) > Upgrade Scala to 2.12.12 > > > Key: SPARK-31168 > URL: https://issues.apache.org/jira/browse/SPARK-31168 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > h2. Highlights > * Performance improvements in the collections library: algorithmic > improvements and changes to avoid unnecessary allocations ([list of > PRs|https://github.com/scala/scala/pulls?q=is%3Apr+milestone%3A2.12.11+is%3Aclosed+sort%3Acreated-desc+label%3Alibrary%3Acollections+label%3Aperformance]) > * Performance improvements in the compiler ([list of > PRs|https://github.com/scala/scala/pulls?q=is%3Apr+milestone%3A2.12.11+is%3Aclosed+sort%3Acreated-desc+-label%3Alibrary%3Acollections+label%3Aperformance+], > minor [effects in our > benchmarks|https://scala-ci.typesafe.com/grafana/dashboard/db/scala-benchmark?orgId=1&from=1567985515850&to=1584355915694&var-branch=2.12.x&var-source=All&var-bench=HotScalacBenchmark.compile&var-host=scalabench@scalabench@]) > * Improvements to {{-Yrepl-class-based}}, an alternative internal REPL > encoding that avoids deadlocks (details on > [#8712|https://github.com/scala/scala/pull/8712]) > * A new {{-Yrepl-use-magic-imports}} flag that avoids deep class nesting in > the REPL, which can lead to deteriorating performance in long sessions > ([#8576|https://github.com/scala/scala/pull/8576]) > * Fix some {{toX}} methods that could expose the underlying mutability of a > {{ListBuffer}}-generated collection > ([#8674|https://github.com/scala/scala/pull/8674]) > h3. JDK 9+ support > * ASM was upgraded to 7.3.1, allowing the optimizer to run on JDK 13+ > ([#8676|https://github.com/scala/scala/pull/8676]) > * {{:javap}} in the REPL now works on JDK 9+ > ([#8400|https://github.com/scala/scala/pull/8400]) > h3. Other changes > * Support new labels for creating durations for consistency: > {{Duration("1m")}}, {{Duration("3 hrs")}} > ([#8325|https://github.com/scala/scala/pull/8325], > [#8450|https://github.com/scala/scala/pull/8450]) > * Fix memory leak in runtime reflection's {{TypeTag}} caches > ([#8470|https://github.com/scala/scala/pull/8470]) and some thread safety > issues in runtime reflection > ([#8433|https://github.com/scala/scala/pull/8433]) > * When using compiler plugins, the ordering of compiler phases may change > due to [#8427|https://github.com/scala/scala/pull/8427] > For more details, see [https://github.com/scala/scala/releases/tag/v2.12.11]. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31168) Upgrade Scala to 2.12.12
[ https://issues.apache.org/jira/browse/SPARK-31168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158056#comment-17158056 ] Apache Spark commented on SPARK-31168: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/29124 > Upgrade Scala to 2.12.12 > > > Key: SPARK-31168 > URL: https://issues.apache.org/jira/browse/SPARK-31168 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > h2. Highlights > * Performance improvements in the collections library: algorithmic > improvements and changes to avoid unnecessary allocations ([list of > PRs|https://github.com/scala/scala/pulls?q=is%3Apr+milestone%3A2.12.11+is%3Aclosed+sort%3Acreated-desc+label%3Alibrary%3Acollections+label%3Aperformance]) > * Performance improvements in the compiler ([list of > PRs|https://github.com/scala/scala/pulls?q=is%3Apr+milestone%3A2.12.11+is%3Aclosed+sort%3Acreated-desc+-label%3Alibrary%3Acollections+label%3Aperformance+], > minor [effects in our > benchmarks|https://scala-ci.typesafe.com/grafana/dashboard/db/scala-benchmark?orgId=1&from=1567985515850&to=1584355915694&var-branch=2.12.x&var-source=All&var-bench=HotScalacBenchmark.compile&var-host=scalabench@scalabench@]) > * Improvements to {{-Yrepl-class-based}}, an alternative internal REPL > encoding that avoids deadlocks (details on > [#8712|https://github.com/scala/scala/pull/8712]) > * A new {{-Yrepl-use-magic-imports}} flag that avoids deep class nesting in > the REPL, which can lead to deteriorating performance in long sessions > ([#8576|https://github.com/scala/scala/pull/8576]) > * Fix some {{toX}} methods that could expose the underlying mutability of a > {{ListBuffer}}-generated collection > ([#8674|https://github.com/scala/scala/pull/8674]) > h3. JDK 9+ support > * ASM was upgraded to 7.3.1, allowing the optimizer to run on JDK 13+ > ([#8676|https://github.com/scala/scala/pull/8676]) > * {{:javap}} in the REPL now works on JDK 9+ > ([#8400|https://github.com/scala/scala/pull/8400]) > h3. Other changes > * Support new labels for creating durations for consistency: > {{Duration("1m")}}, {{Duration("3 hrs")}} > ([#8325|https://github.com/scala/scala/pull/8325], > [#8450|https://github.com/scala/scala/pull/8450]) > * Fix memory leak in runtime reflection's {{TypeTag}} caches > ([#8470|https://github.com/scala/scala/pull/8470]) and some thread safety > issues in runtime reflection > ([#8433|https://github.com/scala/scala/pull/8433]) > * When using compiler plugins, the ordering of compiler phases may change > due to [#8427|https://github.com/scala/scala/pull/8427] > For more details, see [https://github.com/scala/scala/releases/tag/v2.12.11]. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
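One of the changes listed above is directly observable from user code: the duration parser accepts the new short unit labels. A small sketch of the behaviour described in the release notes (variable names are illustrative):

{code:scala}
import scala.concurrent.duration.Duration

// New labels from scala/scala#8325 and #8450: short forms such as "m"
// and "hrs" now parse alongside the long-standing "minute"/"hours".
val oneMinute  = Duration("1m")
val threeHours = Duration("3 hrs")
assert(oneMinute.toSeconds == 60)
assert(threeHours.toHours == 3)
{code}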
[jira] [Assigned] (SPARK-31480) Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node
[ https://issues.apache.org/jira/browse/SPARK-31480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dilip Biswal reassigned SPARK-31480: Assignee: Dilip Biswal > Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node > --- > > Key: SPARK-31480 > URL: https://issues.apache.org/jira/browse/SPARK-31480 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Dilip Biswal >Priority: Major > > Below is the EXPLAIN OUTPUT when using the *DSV2* > *Output of EXPLAIN EXTENDED* > {code:java} > +- BatchScan[col.dots#39L] JsonScan DataFilters: [isnotnull(col.dots#39L), > (col.dots#39L = 500)], Location: > InMemoryFileIndex[file:/private/var/folders/nr/j6hw4kr51wv0zynvr6srwgr0gp/T/spark-7dad6f63-dc..., > PartitionFilters: [], ReadSchema: struct > {code} > *Output of EXPLAIN FORMATTED* > {code:java} > (1) BatchScan > Output [1]: [col.dots#39L] > Arguments: [col.dots#39L], > JsonScan(org.apache.spark.sql.test.TestSparkSession@45eab322,org.apache.spark.sql.execution.datasources.InMemoryFileIndex@72065f16,StructType(StructField(col.dots,LongType,true)),StructType(StructField(col.dots,LongType,true)),StructType(),org.apache.spark.sql.util.CaseInsensitiveStringMap@8822c5e0,Vector(),List(isnotnull(col.dots#39L), > (col.dots#39L = 500))) > {code} > When using *DSV1*, the output is much cleaner than the output of DSV2, > especially for EXPLAIN FORMATTED. > *Output of EXPLAIN EXTENDED* > {code:java} > +- FileScan json [col.dots#37L] Batched: false, DataFilters: > [isnotnull(col.dots#37L), (col.dots#37L = 500)], Format: JSON, Location: > InMemoryFileIndex[file:/private/var/folders/nr/j6hw4kr51wv0zynvr6srwgr0gp/T/spark-89021d76-59..., > PartitionFilters: [], PushedFilters: [IsNotNull(`col.dots`), > EqualTo(`col.dots`,500)], ReadSchema: struct > {code} > *Output of EXPLAIN FORMATTED* > {code:java} > (1) Scan json > Output [1]: [col.dots#37L] > Batched: false > Location: InMemoryFileIndex > [file:/private/var/folders/nr/j6hw4kr51wv0zynvr6srwgr0gp/T/spark-89021d76-5971-4a96-bf10-0730873f6ce0] > PushedFilters: [IsNotNull(`col.dots`), EqualTo(`col.dots`,500)] > ReadSchema: struct{code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31480) Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node
[ https://issues.apache.org/jira/browse/SPARK-31480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dilip Biswal resolved SPARK-31480. -- Resolution: Fixed > Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node > --- > > Key: SPARK-31480 > URL: https://issues.apache.org/jira/browse/SPARK-31480 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Dilip Biswal >Priority: Major > > Below is the EXPLAIN OUTPUT when using the *DSV2* > *Output of EXPLAIN EXTENDED* > {code:java} > +- BatchScan[col.dots#39L] JsonScan DataFilters: [isnotnull(col.dots#39L), > (col.dots#39L = 500)], Location: > InMemoryFileIndex[file:/private/var/folders/nr/j6hw4kr51wv0zynvr6srwgr0gp/T/spark-7dad6f63-dc..., > PartitionFilters: [], ReadSchema: struct > {code} > *Output of EXPLAIN FORMATTED* > {code:java} > (1) BatchScan > Output [1]: [col.dots#39L] > Arguments: [col.dots#39L], > JsonScan(org.apache.spark.sql.test.TestSparkSession@45eab322,org.apache.spark.sql.execution.datasources.InMemoryFileIndex@72065f16,StructType(StructField(col.dots,LongType,true)),StructType(StructField(col.dots,LongType,true)),StructType(),org.apache.spark.sql.util.CaseInsensitiveStringMap@8822c5e0,Vector(),List(isnotnull(col.dots#39L), > (col.dots#39L = 500))) > {code} > When using *DSV1*, the output is much cleaner than the output of DSV2, > especially for EXPLAIN FORMATTED. > *Output of EXPLAIN EXTENDED* > {code:java} > +- FileScan json [col.dots#37L] Batched: false, DataFilters: > [isnotnull(col.dots#37L), (col.dots#37L = 500)], Format: JSON, Location: > InMemoryFileIndex[file:/private/var/folders/nr/j6hw4kr51wv0zynvr6srwgr0gp/T/spark-89021d76-59..., > PartitionFilters: [], PushedFilters: [IsNotNull(`col.dots`), > EqualTo(`col.dots`,500)], ReadSchema: struct > {code} > *Output of EXPLAIN FORMATTED* > {code:java} > (1) Scan json > Output [1]: [col.dots#37L] > Batched: false > Location: InMemoryFileIndex > [file:/private/var/folders/nr/j6hw4kr51wv0zynvr6srwgr0gp/T/spark-89021d76-5971-4a96-bf10-0730873f6ce0] > PushedFilters: [IsNotNull(`col.dots`), EqualTo(`col.dots`,500)] > ReadSchema: struct{code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
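For reference, the two outputs compared above can be reproduced as follows; a sketch assuming a DataFrame {{df}} read through the DSV2 JSON source with a {{col.dots}} column (the names mirror the example and are not part of any API):

{code:scala}
import org.apache.spark.sql.functions.col

// `df` is assumed to be a DataFrame read via the JSON data source.
// Backticks let us reference the dotted column name literally.
val filtered = df.filter(col("`col.dots`") === 500)

// Spark 3.0+ accepts an explain mode string:
filtered.explain("extended")   // produces the EXPLAIN EXTENDED output
filtered.explain("formatted")  // produces the EXPLAIN FORMATTED output
{code}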
[jira] [Assigned] (SPARK-32283) Multiple Kryo registrators can't be used anymore
[ https://issues.apache.org/jira/browse/SPARK-32283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32283: Assignee: (was: Apache Spark) > Multiple Kryo registrators can't be used anymore > > > Key: SPARK-32283 > URL: https://issues.apache.org/jira/browse/SPARK-32283 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Lorenz Bühmann >Priority: Minor > > This is a regression in Spark 3.0; it still works in Spark 2.x. > According to the docs, it should be possible to register multiple Kryo > registrators via the Spark config option spark.kryo.registrator. > In Spark 3.0 the code to parse Kryo config options has been refactored into > the Scala class > [Kryo|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/Kryo.scala]. > The code to parse the registrators is at [lines > 29-32|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/Kryo.scala#L29-L32] > {code:scala} > val KRYO_USER_REGISTRATORS = ConfigBuilder("spark.kryo.registrator") > .version("0.5.0") > .stringConf > .createOptional > {code} > but it should be > {code:scala} > val KRYO_USER_REGISTRATORS = ConfigBuilder("spark.kryo.registrator") > .version("0.5.0") > .stringConf > .toSequence > .createOptional > {code} > to split the comma-separated list. > In Spark 2.x this was done directly in [KryoSerializer lines > 77-79|https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala#L77-L79] > {code:scala} > private val userRegistrators = conf.get("spark.kryo.registrator", "") > .split(',').map(_.trim) > .filter(!_.isEmpty) > {code} > Hope this helps. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32283) Multiple Kryo registrators can't be used anymore
[ https://issues.apache.org/jira/browse/SPARK-32283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32283: Assignee: Apache Spark > Multiple Kryo registrators can't be used anymore > > > Key: SPARK-32283 > URL: https://issues.apache.org/jira/browse/SPARK-32283 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Lorenz Bühmann >Assignee: Apache Spark >Priority: Minor > > This is a regression in Spark 3.0; it still works in Spark 2.x. > According to the docs, it should be possible to register multiple Kryo > registrators via the Spark config option spark.kryo.registrator. > In Spark 3.0 the code to parse Kryo config options has been refactored into > the Scala class > [Kryo|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/Kryo.scala]. > The code to parse the registrators is at [lines > 29-32|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/Kryo.scala#L29-L32] > {code:scala} > val KRYO_USER_REGISTRATORS = ConfigBuilder("spark.kryo.registrator") > .version("0.5.0") > .stringConf > .createOptional > {code} > but it should be > {code:scala} > val KRYO_USER_REGISTRATORS = ConfigBuilder("spark.kryo.registrator") > .version("0.5.0") > .stringConf > .toSequence > .createOptional > {code} > to split the comma-separated list. > In Spark 2.x this was done directly in [KryoSerializer lines > 77-79|https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala#L77-L79] > {code:scala} > private val userRegistrators = conf.get("spark.kryo.registrator", "") > .split(',').map(_.trim) > .filter(!_.isEmpty) > {code} > Hope this helps. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32283) Multiple Kryo registrators can't be used anymore
[ https://issues.apache.org/jira/browse/SPARK-32283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158034#comment-17158034 ] Apache Spark commented on SPARK-32283: -- User 'LantaoJin' has created a pull request for this issue: https://github.com/apache/spark/pull/29123 > Multiple Kryo registrators can't be used anymore > > > Key: SPARK-32283 > URL: https://issues.apache.org/jira/browse/SPARK-32283 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Lorenz Bühmann >Priority: Minor > > This is a regression in Spark 3.0; it still works in Spark 2.x. > According to the docs, it should be possible to register multiple Kryo > registrators via the Spark config option spark.kryo.registrator. > In Spark 3.0 the code to parse Kryo config options has been refactored into > the Scala class > [Kryo|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/Kryo.scala]. > The code to parse the registrators is at [lines > 29-32|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/Kryo.scala#L29-L32] > {code:scala} > val KRYO_USER_REGISTRATORS = ConfigBuilder("spark.kryo.registrator") > .version("0.5.0") > .stringConf > .createOptional > {code} > but it should be > {code:scala} > val KRYO_USER_REGISTRATORS = ConfigBuilder("spark.kryo.registrator") > .version("0.5.0") > .stringConf > .toSequence > .createOptional > {code} > to split the comma-separated list. > In Spark 2.x this was done directly in [KryoSerializer lines > 77-79|https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala#L77-L79] > {code:scala} > private val userRegistrators = conf.get("spark.kryo.registrator", "") > .split(',').map(_.trim) > .filter(!_.isEmpty) > {code} > Hope this helps. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32283) Multiple Kryo registrators can't be used anymore
[ https://issues.apache.org/jira/browse/SPARK-32283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158018#comment-17158018 ] Lantao Jin commented on SPARK-32283: Thanks for reporting this. Will file a patch. > Multiple Kryo registrators can't be used anymore > > > Key: SPARK-32283 > URL: https://issues.apache.org/jira/browse/SPARK-32283 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Lorenz Bühmann >Priority: Minor > > This is a regression in Spark 3.0; it still works in Spark 2.x. > According to the docs, it should be possible to register multiple Kryo > registrators via the Spark config option spark.kryo.registrator. > In Spark 3.0 the code to parse Kryo config options has been refactored into > the Scala class > [Kryo|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/Kryo.scala]. > The code to parse the registrators is at [lines > 29-32|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/Kryo.scala#L29-L32] > {code:scala} > val KRYO_USER_REGISTRATORS = ConfigBuilder("spark.kryo.registrator") > .version("0.5.0") > .stringConf > .createOptional > {code} > but it should be > {code:scala} > val KRYO_USER_REGISTRATORS = ConfigBuilder("spark.kryo.registrator") > .version("0.5.0") > .stringConf > .toSequence > .createOptional > {code} > to split the comma-separated list. > In Spark 2.x this was done directly in [KryoSerializer lines > 77-79|https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala#L77-L79] > {code:scala} > private val userRegistrators = conf.get("spark.kryo.registrator", "") > .split(',').map(_.trim) > .filter(!_.isEmpty) > {code} > Hope this helps. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
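For context, a configuration sketch of the multi-registrator setup the report describes; the registrator class names below are placeholders, not real classes:

{code:scala}
import org.apache.spark.SparkConf

// Hypothetical registrator classes, for illustration only.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Documented as a comma-separated list. On Spark 3.0.0 the whole string
  // is kept as a single class name, so serializer creation fails.
  .set("spark.kryo.registrator",
    "com.example.FooRegistrator,com.example.BarRegistrator")
{code}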
[jira] [Updated] (SPARK-28367) Kafka connector infinite wait because metadata never updated
[ https://issues.apache.org/jira/browse/SPARK-28367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Somogyi updated SPARK-28367: -- Affects Version/s: 3.1.0 > Kafka connector infinite wait because metadata never updated > > > Key: SPARK-28367 > URL: https://issues.apache.org/jira/browse/SPARK-28367 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.3, 2.2.3, 2.3.3, 2.4.3, 3.0.0, 3.1.0 >Reporter: Gabor Somogyi >Priority: Critical > > Spark uses an old, deprecated API, poll(long), which never returns and > stays in a livelock if metadata is not updated (for instance, when the broker > disappears at consumer creation). > I've created a small standalone application to test it and the alternatives: > https://github.com/gaborgsomogyi/kafka-get-assignment -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
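The deprecated call and its bounded replacement, sketched below against the Kafka 2.0+ client API (consumer construction omitted; the method name {{fetch}} is illustrative):

{code:scala}
import java.time.Duration
import org.apache.kafka.clients.consumer.{ConsumerRecords, KafkaConsumer}

// `consumer` is an already-configured KafkaConsumer[String, String].
def fetch(consumer: KafkaConsumer[String, String]): ConsumerRecords[String, String] = {
  // Deprecated alternative: consumer.poll(1000L). poll(long) can wait
  // forever for a metadata update, which is the livelock described above.
  // Preferred: poll(Duration) bounds the total wait, metadata included.
  consumer.poll(Duration.ofSeconds(10))
}
{code}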
[jira] [Commented] (SPARK-32320) Remove mutable default arguments
[ https://issues.apache.org/jira/browse/SPARK-32320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157996#comment-17157996 ] Apache Spark commented on SPARK-32320: -- User 'Fokko' has created a pull request for this issue: https://github.com/apache/spark/pull/29122 > Remove mutable default arguments > > > Key: SPARK-32320 > URL: https://issues.apache.org/jira/browse/SPARK-32320 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Fokko Driesprong >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32320) Remove mutable default arguments
[ https://issues.apache.org/jira/browse/SPARK-32320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32320: Assignee: Apache Spark > Remove mutable default arguments > > > Key: SPARK-32320 > URL: https://issues.apache.org/jira/browse/SPARK-32320 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Fokko Driesprong >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32320) Remove mutable default arguments
[ https://issues.apache.org/jira/browse/SPARK-32320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32320: Assignee: (was: Apache Spark) > Remove mutable default arguments > > > Key: SPARK-32320 > URL: https://issues.apache.org/jira/browse/SPARK-32320 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Fokko Driesprong >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32320) Remove mutable default arguments
[ https://issues.apache.org/jira/browse/SPARK-32320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157995#comment-17157995 ] Apache Spark commented on SPARK-32320: -- User 'Fokko' has created a pull request for this issue: https://github.com/apache/spark/pull/29122 > Remove mutable default arguments > > > Key: SPARK-32320 > URL: https://issues.apache.org/jira/browse/SPARK-32320 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Fokko Driesprong >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32320) Remove mutable default arguments
Fokko Driesprong created SPARK-32320: Summary: Remove mutable default arguments Key: SPARK-32320 URL: https://issues.apache.org/jira/browse/SPARK-32320 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.0.0 Reporter: Fokko Driesprong -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32271) Add option for k-fold cross-validation to CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-32271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Austin Jordan updated SPARK-32271: -- Summary: Add option for k-fold cross-validation to CrossValidator (was: Update CrossValidator to parallelize fit method across folds) > Add option for k-fold cross-validation to CrossValidator > > > Key: SPARK-32271 > URL: https://issues.apache.org/jira/browse/SPARK-32271 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: Austin Jordan >Priority: Minor > > *What changes were proposed in this pull request?* > I have added a `method` parameter to `CrossValidator.scala` to allow the user > to choose between repeated random sub-sampling cross-validation (current > behavior) and _k_-fold cross-validation (optional new behavior). The default > method is random sub-sampling cross-validation. > If _k_-fold cross-validation is chosen, the new behavior is as follows: > # Instead of splitting the input dataset into _k_ training and validation > sets, I split them into _k_ folds; for each fold of training, one of the _k_ > splits is selected for validation, and the others are unioned together for > training. > # Instead of caching each training and validation set _k_ times, I cache > each of the folds once. > # Instead of waiting for every model to finish training on fold _n_ before > moving on to fold _n+1_, new fold/model combinations will be trained as soon > as resources are available. > # Instead of creating one `Future` per model for each fold in series, all > `Future`s for each fold & parameter grid pair are created and trained in > parallel. > # A new `Int` parameter is added to the `Future` (now `Future[Int, Double]` > instead of `Future[Double]`) in order to keep track of which `Future` belongs > to which parameter grid. > *Why are the changes needed?* > These changes allow the user to choose between repeated random sub-sampling > cross-validation (current behavior) and _k_-fold cross-validation (optional > new behavior). These changes: > 1. allow the user to choose between two types of cross-validation. > 2. (If _k_-fold is chosen) only require caching the entire dataset once > (instead of _k_ times in repeated random sub-sampling cross-validation, as it > does now). > 3. (if _k_-fold is chosen) free resources to train new model/fold > combinations as soon as the previous one finishes. Currently, a model can > only train one fold at a time. If _k_-fold is chosen, the added functionality > will allow the `fit` to train multiple folds at once for the same model, and, > in the case of a grid search, allow it to train multiple model/fold > combinations at once, without needing to wait for the slowest model to fit > the first fold before moving onto the second. > *Does this PR introduce _any_ user-facing change?* > Yes. This PR introduces the `setMethod` method to `CrossValidator`. If the > `method` parameter is not set, the behavior will be the same as it has always > been. > *How was this patch tested?* > Unit tests will be added. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32271) Update CrossValidator to parallelize fit method across folds
[ https://issues.apache.org/jira/browse/SPARK-32271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Austin Jordan updated SPARK-32271: -- Description: *What changes were proposed in this pull request?* I have added a `method` parameter to `CrossValidator.scala` to allow the user to choose between repeated random sub-sampling cross-validation (current behavior) and _k_-fold cross-validation (optional new behavior). The default method is random sub-sampling cross-validation. If _k_-fold cross-validation is chosen, the new behavior is as follows: # Instead of splitting the input dataset into _k_ training and validation sets, I split them into _k_ folds; for each fold of training, one of the _k_ splits is selected for validation, and the others are unioned together for training. # Instead of caching each training and validation set _k_ times, I cache each of the folds once. # Instead of waiting for every model to finish training on fold _n_ before moving on to fold _n+1_, new fold/model combinations will be trained as soon as resources are available. # Instead of creating one `Future` per model for each fold in series, all `Future`s for each fold & parameter grid pair are created and trained in parallel. # A new `Int` parameter is added to the `Future` (now `Future[Int, Double]` instead of `Future[Double]`) in order to keep track of which `Future` belongs to which parameter grid. *Why are the changes needed?* These changes allow the user to choose between repeated random sub-sampling cross-validation (current behavior) and _k_-fold cross-validation (optional new behavior). These changes: 1. allow the user to choose between two types of cross-validation. 2. (If _k_-fold is chosen) only require caching the entire dataset once (instead of _k_ times in repeated random sub-sampling cross-validation, as it does now). 3. (if _k_-fold is chosen) free resources to train new model/fold combinations as soon as the previous one finishes. Currently, a model can only train one fold at a time. If _k_-fold is chosen, the added functionality will allow the `fit` to train multiple folds at once for the same model, and, in the case of a grid search, allow it to train multiple model/fold combinations at once, without needing to wait for the slowest model to fit the first fold before moving onto the second. *Does this PR introduce _any_ user-facing change?* Yes. This PR introduces the `setMethod` method to `CrossValidator`. If the `method` parameter is not set, the behavior will be the same as it has always been. *How was this patch tested?* Unit tests will be added. was: ### What changes were proposed in this pull request? I have added a `method` parameter to `CrossValidator.scala` to allow the user to choose between repeated random sub-sampling cross-validation (current behavior) and _k_-fold cross-validation (optional new behavior). The default method is random sub-sampling cross-validation. If _k_-fold cross-validation is chosen, the new behavior is as follows: 1. Instead of splitting the input dataset into _k_ training and validation sets, I split them into _k_ folds; for each fold of training, one of the _k_ splits is selected for validation, and the others are unioned together for training. 2. Instead of caching each training and validation set _k_ times, I cache each of the folds once. 3. Instead of waiting for every model to finish training on fold _n_ before moving on to fold _n+1_, new fold/model combinations will be trained as soon as resources are available. 4. 
Instead of creating one `Future` per model for each fold in series, all `Future`s for each fold & parameter grid pair are created and trained in parallel. 5. A new `Int` parameter is added to the `Future` (now `Future[Int, Double]` instead of `Future[Double]`) in order to keep track of which `Future` belongs to which parameter grid. ### Why are the changes needed? These changes allow the user to choose between repeated random sub-sampling cross-validation (current behavior) and _k_-fold cross-validation (optional new behavior). These changes: 1. allow the user to choose between two types of cross-validation. 2. (If _k_-fold is chosen) only require caching the entire dataset once (instead of _k_ times in repeated random sub-sampling cross-validation, as it does now). 3. (if _k_-fold is chosen) free resources to train new model/fold combinations as soon as the previous one finishes. Currently, a model can only train one fold at a time. If _k_-fold is chosen, the added functionality will allow the `fit` to train multiple folds at once for the same model, and, in the case of a grid search, allow it to train multiple model/fold combinations at once, without needing to wait for the slowest model to fit the first fold before moving onto the second. ### Does this PR introduce _any_ user-facing change? Yes. This PR introduces the `
[jira] [Updated] (SPARK-32271) Update CrossValidator to parallelize fit method across folds
[ https://issues.apache.org/jira/browse/SPARK-32271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Austin Jordan updated SPARK-32271: -- Description: ### What changes were proposed in this pull request? I have added a `method` parameter to `CrossValidator.scala` to allow the user to choose between repeated random sub-sampling cross-validation (current behavior) and _k_-fold cross-validation (optional new behavior). The default method is random sub-sampling cross-validation. If _k_-fold cross-validation is chosen, the new behavior is as follows: 1. Instead of splitting the input dataset into _k_ training and validation sets, I split them into _k_ folds; for each fold of training, one of the _k_ splits is selected for validation, and the others are unioned together for training. 2. Instead of caching each training and validation set _k_ times, I cache each of the folds once. 3. Instead of waiting for every model to finish training on fold _n_ before moving on to fold _n+1_, new fold/model combinations will be trained as soon as resources are available. 4. Instead of creating one `Future` per model for each fold in series, all `Future`s for each fold & parameter grid pair are created and trained in parallel. 5. A new `Int` parameter is added to the `Future` (now `Future[Int, Double]` instead of `Future[Double]`) in order to keep track of which `Future` belongs to which parameter grid. ### Why are the changes needed? These changes allow the user to choose between repeated random sub-sampling cross-validation (current behavior) and _k_-fold cross-validation (optional new behavior). These changes: 1. allow the user to choose between two types of cross-validation. 2. (If _k_-fold is chosen) only require caching the entire dataset once (instead of _k_ times in repeated random sub-sampling cross-validation, as it does now). 3. (if _k_-fold is chosen) free resources to train new model/fold combinations as soon as the previous one finishes. Currently, a model can only train one fold at a time. If _k_-fold is chosen, the added functionality will allow the `fit` to train multiple folds at once for the same model, and, in the case of a grid search, allow it to train multiple model/fold combinations at once, without needing to wait for the slowest model to fit the first fold before moving onto the second. ### Does this PR introduce _any_ user-facing change? Yes. This PR introduces the `setMethod` method to `CrossValidator`. If the `method` parameter is not set, the behavior will be the same as it has always been. ### How was this patch tested? Unit tests will be added. was: Currently, fitting a CrossValidator is only parallelized across models. This means that a CrossValidator will only fit as quickly as the slowest-to-train model would fit by itself. If a 2x2x3 parameter grid is provided for 10-fold cross validation, all 12 models will begin training on the first fold. However, if 6 of these models will train for 1 hour/fold and the other 6 will train for 3 hours/fold (e.g. when tuning number of early stopping rounds in XGBoost), the first 6 models will not move on to the second fold until the last 6 are finished. If fitting was parallelized across folds, the first 6 models would finish after 10 hours, freeing up cluster resources to run multiple folds for the last 6 models in parallel. Changes to be made: * Instead of splitting data into multiple training and validation sets, split into the folds. 
* Cache each of the folds (so each fold only ends up getting cached once, instead of 10 times how it is now). * For each fold index, form the training and validation sets by selecting the current fold as the validation set and unioning the rest into the training set. * Make associated changes to calculate fold metrics, now that folds are being parallelized as well. > Update CrossValidator to parallelize fit method across folds > > > Key: SPARK-32271 > URL: https://issues.apache.org/jira/browse/SPARK-32271 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: Austin Jordan >Priority: Minor > > ### What changes were proposed in this pull request? > I have added a `method` parameter to `CrossValidator.scala` to allow the user > to choose between repeated random sub-sampling cross-validation (current > behavior) and _k_-fold cross-validation (optional new behavior). The default > method is random sub-sampling cross-validation. > If _k_-fold cross-validation is chosen, the new behavior is as follows: > 1. Instead of splitting the input dataset into _k_ training and validation > sets, I split them into _k_ folds; for each fold of training, one of the _k_ > splits is selected for validation, and the others are unio
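How the proposed option might look from user code; a hypothetical sketch, since {{setMethod}} and its accepted values exist only in this proposal, and the estimator, evaluator, and grid here are placeholders:

{code:scala}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(10)
  .setParallelism(4)
// Proposed in this ticket, not a released API; the string value "kfold"
// is a guess at how the method would be selected:
// .setMethod("kfold")
{code}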
[jira] [Commented] (SPARK-32319) Remove unused imports
[ https://issues.apache.org/jira/browse/SPARK-32319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157983#comment-17157983 ] Apache Spark commented on SPARK-32319: -- User 'Fokko' has created a pull request for this issue: https://github.com/apache/spark/pull/29121 > Remove unused imports > - > > Key: SPARK-32319 > URL: https://issues.apache.org/jira/browse/SPARK-32319 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Fokko Driesprong >Priority: Major > > We don't want to import stuff that we're not going to use, to reduce the > memory pressure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32319) Remove unused imports
[ https://issues.apache.org/jira/browse/SPARK-32319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157980#comment-17157980 ] Apache Spark commented on SPARK-32319: -- User 'Fokko' has created a pull request for this issue: https://github.com/apache/spark/pull/29121 > Remove unused imports > - > > Key: SPARK-32319 > URL: https://issues.apache.org/jira/browse/SPARK-32319 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Fokko Driesprong >Priority: Major > > We don't want to import stuff that we're not going to use, to reduce the > memory pressure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32319) Remove unused imports
[ https://issues.apache.org/jira/browse/SPARK-32319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32319: Assignee: (was: Apache Spark) > Remove unused imports > - > > Key: SPARK-32319 > URL: https://issues.apache.org/jira/browse/SPARK-32319 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Fokko Driesprong >Priority: Major > > We don't want to import stuff that we're not going to use, to reduce the > memory pressure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32319) Remove unused imports
[ https://issues.apache.org/jira/browse/SPARK-32319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32319: Assignee: Apache Spark > Remove unused imports > - > > Key: SPARK-32319 > URL: https://issues.apache.org/jira/browse/SPARK-32319 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Fokko Driesprong >Assignee: Apache Spark >Priority: Major > > We don't want to import stuff that we're not going to use, to reduce the > memory pressure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32319) Remove unused imports
Fokko Driesprong created SPARK-32319: Summary: Remove unused imports Key: SPARK-32319 URL: https://issues.apache.org/jira/browse/SPARK-32319 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.0.0 Reporter: Fokko Driesprong We don't want to import stuff that we're not going to use, to reduce the memory pressure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32291) COALESCE should not reduce the child parallelism if it is Join
[ https://issues.apache.org/jira/browse/SPARK-32291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32291: Assignee: Apache Spark > COALESCE should not reduce the child parallelism if it is Join > -- > > Key: SPARK-32291 > URL: https://issues.apache.org/jira/browse/SPARK-32291 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > Attachments: COALESCE.png, coalesce.png, repartition.png > > > How to reproduce this issue: > {code:scala} > spark.range(100).createTempView("t1") > spark.range(200).createTempView("t2") > spark.sql("set spark.sql.autoBroadcastJoinThreshold=0") > spark.sql("select /*+ COALESCE(1) */ t1.* from t1 join t2 on (t1.id = > t2.id)").show > {code} > The dag is: > !COALESCE.png! > A real case: > !coalesce.png! > !repartition.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32291) COALESCE should not reduce the child parallelism if it is Join
[ https://issues.apache.org/jira/browse/SPARK-32291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32291: Assignee: (was: Apache Spark) > COALESCE should not reduce the child parallelism if it is Join > -- > > Key: SPARK-32291 > URL: https://issues.apache.org/jira/browse/SPARK-32291 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > Attachments: COALESCE.png, coalesce.png, repartition.png > > > How to reproduce this issue: > {code:scala} > spark.range(100).createTempView("t1") > spark.range(200).createTempView("t2") > spark.sql("set spark.sql.autoBroadcastJoinThreshold=0") > spark.sql("select /*+ COALESCE(1) */ t1.* from t1 join t2 on (t1.id = > t2.id)").show > {code} > The dag is: > !COALESCE.png! > A real case: > !coalesce.png! > !repartition.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32291) COALESCE should not reduce the child parallelism if it is Join
[ https://issues.apache.org/jira/browse/SPARK-32291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157969#comment-17157969 ] Apache Spark commented on SPARK-32291: -- User 'LantaoJin' has created a pull request for this issue: https://github.com/apache/spark/pull/29120 > COALESCE should not reduce the child parallelism if it is Join > -- > > Key: SPARK-32291 > URL: https://issues.apache.org/jira/browse/SPARK-32291 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > Attachments: COALESCE.png, coalesce.png, repartition.png > > > How to reproduce this issue: > {code:scala} > spark.range(100).createTempView("t1") > spark.range(200).createTempView("t2") > spark.sql("set spark.sql.autoBroadcastJoinThreshold=0") > spark.sql("select /*+ COALESCE(1) */ t1.* from t1 join t2 on (t1.id = > t2.id)").show > {code} > The dag is: > !COALESCE.png! > A real case: > !coalesce.png! > !repartition.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
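A possible workaround sketch for the behaviour above: the REPARTITION hint (available since Spark 2.4) inserts an exchange after the join rather than collapsing the join's own parallelism the way COALESCE(1) does:

{code:scala}
// Same setup as the repro above (t1/t2 temp views); only the hint changes.
spark.sql("set spark.sql.autoBroadcastJoinThreshold=0")
spark.sql("select /*+ REPARTITION(1) */ t1.* from t1 join t2 on (t1.id = t2.id)").show
// REPARTITION shuffles the join output down to a single partition,
// so the join itself still runs at full parallelism.
{code}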