[jira] [Updated] (SPARK-36057) SPIP: Support Customized Kubernetes Schedulers
[ https://issues.apache.org/jira/browse/SPARK-36057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yikun Jiang updated SPARK-36057: Labels: SPIP (was: ) > SPIP: Support Customized Kubernetes Schedulers > -- > > Key: SPARK-36057 > URL: https://issues.apache.org/jira/browse/SPARK-36057 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Holden Karau >Priority: Major > Labels: SPIP > > This is an umbrella issue for tracking the work for supporting Volcano & > Yunikorn on Kubernetes. These schedulers provide more YARN-like features > (such as queues and minimum resources before scheduling jobs) that many folks > want on Kubernetes. > > Yunikorn is an ASF project & Volcano is a CNCF project (sig-batch). > > They've taken slightly different approaches to solving the same problem, but > from Spark's point of view we should be able to share much of the code. > > See the initial brainstorming discussion in SPARK-35623. > > DISCUSSION: [https://lists.apache.org/thread/zv3o62xrob4dvgkbftbv5w5wy75hkbxg] > VOTE: [https://lists.apache.org/thread/cz3cpp8q4pgmh7h35h6lvkwf6g3lwhcd] > VOTE Result: > [https://lists.apache.org/thread/nvwfo0yo0q8997vs86o7wkjyby4tbp0m] > Design DOC: > [https://docs.google.com/document/d/1xgQGRpaHQX6-QH_J9YV2C2Dh6RpXefUpLM7KGkzL6Fg] > Recap slide: > [https://lists.apache.org/thread/mwswfwkycj71npwz8gmv1r5nrvpwj77s]
[jira] [Updated] (SPARK-36057) SPIP: Support Customized Kubernetes Schedulers
[ https://issues.apache.org/jira/browse/SPARK-36057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yikun Jiang updated SPARK-36057: Summary: SPIP: Support Customized Kubernetes Schedulers (was: Support Customized Kubernetes Schedulers) > SPIP: Support Customized Kubernetes Schedulers > -- > > Key: SPARK-36057 > URL: https://issues.apache.org/jira/browse/SPARK-36057 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Holden Karau >Priority: Major > > This is an umbrella issue for tracking the work for supporting Volcano & > Yunikorn on Kubernetes. These schedulers provide more YARN-like features > (such as queues and minimum resources before scheduling jobs) that many folks > want on Kubernetes. > > Yunikorn is an ASF project & Volcano is a CNCF project (sig-batch). > > They've taken slightly different approaches to solving the same problem, but > from Spark's point of view we should be able to share much of the code. > > See the initial brainstorming discussion in SPARK-35623. > > DISCUSSION: [https://lists.apache.org/thread/zv3o62xrob4dvgkbftbv5w5wy75hkbxg] > VOTE: [https://lists.apache.org/thread/cz3cpp8q4pgmh7h35h6lvkwf6g3lwhcd] > VOTE Result: > [https://lists.apache.org/thread/nvwfo0yo0q8997vs86o7wkjyby4tbp0m] > Design DOC: > [https://docs.google.com/document/d/1xgQGRpaHQX6-QH_J9YV2C2Dh6RpXefUpLM7KGkzL6Fg] > Recap slide: > [https://lists.apache.org/thread/mwswfwkycj71npwz8gmv1r5nrvpwj77s]
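For readers who want a concrete picture of what this SPIP enables, here is a minimal sketch. It is hedged: the spark.kubernetes.scheduler.name property and the "volcano" value follow the linked design doc and may differ from the final implementation, and the master URL is a placeholder.

{code:java}
import org.apache.spark.sql.SparkSession

// Minimal sketch: ask Spark on Kubernetes to schedule its pods with a
// customized scheduler (Volcano here) instead of the default kube-scheduler.
object CustomSchedulerExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("custom-scheduler-demo")
      .master("k8s://https://kubernetes.example.com:6443") // placeholder API server
      .config("spark.kubernetes.scheduler.name", "volcano") // or "yunikorn" (assumed value)
      .getOrCreate()
    spark.range(100).count() // trivial job just to exercise pod scheduling
    spark.stop()
  }
}
{code}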
[jira] [Resolved] (SPARK-38542) UnsafeHashedRelation should serialize numKeys out
[ https://issues.apache.org/jira/browse/SPARK-38542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-38542. - Fix Version/s: 3.3.0 3.2.2 Resolution: Fixed Issue resolved by pull request 35836 [https://github.com/apache/spark/pull/35836] > UnsafeHashedRelation should serialize numKeys out > - > > Key: SPARK-38542 > URL: https://issues.apache.org/jira/browse/SPARK-38542 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: mcdull_zhang >Priority: Critical > Fix For: 3.3.0, 3.2.2 > > > At present, UnsafeHashedRelation does not write out numKeys during > serialization, so the numKeys of an UnsafeHashedRelation obtained by > deserialization is 0. The UnsafeRows returned by > UnsafeHashedRelation.keys() then all have numFields 0, which can lead to missing or incorrect > data. > > For example, in SubqueryBroadcastExec, the HashedRelation.keys() function is > called: > {code:java} > val broadcastRelation = child.executeBroadcast[HashedRelation]().value > val (iter, expr) = if (broadcastRelation.isInstanceOf[LongHashedRelation]) { > (broadcastRelation.keys(), HashJoin.extractKeyExprAt(buildKeys, index)) > } else { > (broadcastRelation.keys(), > BoundReference(index, buildKeys(index).dataType, > buildKeys(index).nullable)) > }{code}
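A self-contained illustration of the bug pattern (a toy stand-in, not Spark's actual class): a field that custom serialization never writes silently reads back as 0 after a round trip, which is exactly how a deserialized relation ends up with numKeys = 0.

{code:java}
import java.io._

// Toy stand-in for UnsafeHashedRelation: numKeys is never written out,
// so after a serialize/deserialize round trip it reads back as 0.
class Relation(var numKeys: Int, var numFields: Int) extends Serializable {
  private def writeObject(out: ObjectOutputStream): Unit = {
    out.writeInt(numFields) // bug: numKeys is omitted
  }
  private def readObject(in: ObjectInputStream): Unit = {
    numFields = in.readInt() // numKeys stays at the JVM default, 0
  }
}

object RoundTrip extends App {
  val buf = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(buf)
  oos.writeObject(new Relation(numKeys = 5, numFields = 3))
  oos.close()
  val ois = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
  val back = ois.readObject().asInstanceOf[Relation]
  println(s"numKeys=${back.numKeys}, numFields=${back.numFields}") // numKeys=0, numFields=3
}
{code}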
[jira] [Updated] (SPARK-38324) The second range is not [0, 59] in the day time ANSI interval
[ https://issues.apache.org/jira/browse/SPARK-38324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chong updated SPARK-38324: -- Description: [https://spark.apache.org/docs/latest/sql-ref-datatypes.html] * SECOND, seconds within minutes and possibly fractions of a second [0..59.99] The doc says SECOND is seconds within minutes, so its range should be [0, 59]. But testing shows 99 seconds is accepted: {{>>> spark.sql("select INTERVAL '10 01:01:99' DAY TO SECOND")}} {{DataFrame[INTERVAL '10 01:02:39' DAY TO SECOND: interval day to second]}} Meanwhile, the minute range check works, see below: >>> spark.sql("select INTERVAL '10 01:60:01' DAY TO SECOND") requirement failed: {color:#de350b}*minute 60 outside range [0, 59]*{color} (line 1, pos 16) == SQL == select INTERVAL '10 01:60:01' DAY TO SECOND ^^^ was: [https://spark.apache.org/docs/latest/sql-ref-datatypes.html] * SECOND, seconds within minutes and possibly fractions of a second [0..59.99] The doc says SECOND is seconds within minutes, so its range should be [0, 59]. But testing shows 99 seconds is accepted: {{>>> spark.sql("select INTERVAL '10 01:01:99' DAY TO SECOND")}} {{DataFrame[INTERVAL '10 01:02:39' DAY TO SECOND: interval day to second]}} > The second range is not [0, 59] in the day time ANSI interval > - > > Key: SPARK-38324 > URL: https://issues.apache.org/jira/browse/SPARK-38324 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.3.0 > Environment: Spark 3.3.0 snapshot >Reporter: chong >Priority: Major > > [https://spark.apache.org/docs/latest/sql-ref-datatypes.html] > * SECOND, seconds within minutes and possibly fractions of a second > [0..59.99] > The doc says SECOND is seconds within minutes, so its range should be [0, 59]. > > But testing shows 99 seconds is accepted: > {{>>> spark.sql("select INTERVAL '10 01:01:99' DAY TO SECOND")}} > {{DataFrame[INTERVAL '10 01:02:39' DAY TO SECOND: interval day to > second]}} > > Meanwhile, the minute range check works, see below: > >>> spark.sql("select INTERVAL '10 01:60:01' DAY TO SECOND") > requirement failed: {color:#de350b}*minute 60 outside range [0, > 59]*{color} (line 1, pos 16) > == SQL == > select INTERVAL '10 01:60:01' DAY TO SECOND > ^^^
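As a side note on the observed output, the tiny check below shows why '10 01:01:99' comes back as '10 01:02:39': the out-of-range seconds are silently normalized (99 s = 1 min 39 s) rather than rejected the way minute 60 is.

{code:java}
// 99 seconds overflow into the minutes field instead of failing a range check.
object IntervalNormalization extends App {
  val seconds = 99
  println(s"${seconds / 60} min ${seconds % 60} s") // prints: 1 min 39 s
}
{code}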
[jira] [Updated] (SPARK-38558) Remove unnecessary casts between IntegerType and IntDecimal
[ https://issues.apache.org/jira/browse/SPARK-38558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-38558: Fix Version/s: 3.3.0 (was: 3.4.0) > Remove unnecessary casts between IntegerType and IntDecimal > --- > > Key: SPARK-38558 > URL: https://issues.apache.org/jira/browse/SPARK-38558 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: David Cashman >Priority: Minor > Fix For: 3.3.0 > > > In {{NTile}}, the number of rows per bucket is computed as {{n / > buckets}}, where {{n}} is the partition size, and {{buckets}} is the > argument to {{NTile}} (number of buckets). The code currently casts the > arguments to IntDecimal, then casts the result back to IntegerType. This is > unnecessary, since it is equivalent to just doing integer division, i.e. > {{n div buckets}}.
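A quick, self-contained check of the equivalence claimed above: assuming non-negative n and buckets, truncating the decimal quotient back to an integer gives the same result as plain integer division.

{code:java}
object NTileDivision extends App {
  val n = 1000    // partition size
  val buckets = 7 // NTile argument
  val viaDecimal = (BigDecimal(n) / BigDecimal(buckets)).toInt // cast to decimal, divide, cast back
  val viaIntDiv = n / buckets                                  // plain integer division
  println(s"viaDecimal=$viaDecimal, viaIntDiv=$viaIntDiv")     // both print 142
}
{code}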
[jira] [Assigned] (SPARK-38558) Remove unnecessary casts between IntegerType and IntDecimal
[ https://issues.apache.org/jira/browse/SPARK-38558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-38558: --- Assignee: David Cashman > Remove unnecessary casts between IntegerType and IntDecimal > --- > > Key: SPARK-38558 > URL: https://issues.apache.org/jira/browse/SPARK-38558 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: David Cashman >Assignee: David Cashman >Priority: Minor > Fix For: 3.3.0 > > > In {{NTile}}, the number of rows per bucket is computed as {{n / > buckets}}, where {{n}} is the partition size, and {{buckets}} is the > argument to {{NTile}} (number of buckets). The code currently casts the > arguments to IntDecimal, then casts the result back to IntegerType. This is > unnecessary, since it is equivalent to just doing integer division, i.e. > {{n div buckets}}.
[jira] [Resolved] (SPARK-38558) Remove unnecessary casts between IntegerType and IntDecimal
[ https://issues.apache.org/jira/browse/SPARK-38558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-38558. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 35863 [https://github.com/apache/spark/pull/35863] > Remove unnecessary casts between IntegerType and IntDecimal > --- > > Key: SPARK-38558 > URL: https://issues.apache.org/jira/browse/SPARK-38558 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: David Cashman >Priority: Minor > Fix For: 3.4.0 > > > In {{NTile}}, the number of rows per bucket is computed as {{n / > buckets}}, where {{n}} is the partition size, and {{buckets}} is the > argument to {{NTile}} (number of buckets). The code currently casts the > arguments to IntDecimal, then casts the result back to IntegerType. This is > unnecessary, since it is equivalent to just doing integer division, i.e. > {{n div buckets}}.
[jira] [Assigned] (SPARK-38564) Support collecting metrics from streaming sinks
[ https://issues.apache.org/jira/browse/SPARK-38564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38564: Assignee: Apache Spark > Support collecting metrics from streaming sinks > --- > > Key: SPARK-38564 > URL: https://issues.apache.org/jira/browse/SPARK-38564 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.2.1 >Reporter: Boyang Jerry Peng >Assignee: Apache Spark >Priority: Major > > Currently, only streaming sources have the capability to return custom > metrics, but not sinks. Allowing streaming sinks to also return custom metrics would be > very useful.
[jira] [Assigned] (SPARK-38564) Support collecting metrics from streaming sinks
[ https://issues.apache.org/jira/browse/SPARK-38564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38564: Assignee: (was: Apache Spark) > Support collecting metrics from streaming sinks > --- > > Key: SPARK-38564 > URL: https://issues.apache.org/jira/browse/SPARK-38564 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.2.1 >Reporter: Boyang Jerry Peng >Priority: Major > > Currently, only streaming sources have the capability to return custom > metrics, but not sinks. Allowing streaming sinks to also return custom metrics would be > very useful.
[jira] [Commented] (SPARK-38564) Support collecting metrics from streaming sinks
[ https://issues.apache.org/jira/browse/SPARK-38564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507359#comment-17507359 ] Apache Spark commented on SPARK-38564: -- User 'jerrypeng' has created a pull request for this issue: https://github.com/apache/spark/pull/35872 > Support collecting metrics from streaming sinks > --- > > Key: SPARK-38564 > URL: https://issues.apache.org/jira/browse/SPARK-38564 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.2.1 >Reporter: Boyang Jerry Peng >Priority: Major > > Currently, only streaming sources have the capability to return custom > metrics, but not sinks. Allowing streaming sinks to also return custom metrics would be > very useful.
[jira] [Created] (SPARK-38564) Support collecting metrics from streaming sinks
Boyang Jerry Peng created SPARK-38564: - Summary: Support collecting metrics from streaming sinks Key: SPARK-38564 URL: https://issues.apache.org/jira/browse/SPARK-38564 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.2.1 Reporter: Boyang Jerry Peng Currently, only streaming sources have the capability to return custom metrics, but not sinks. Allowing streaming sinks to also return custom metrics would be very useful.
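Purely to illustrate the idea, here is a hypothetical sketch (not the API this ticket eventually adds): a sink-side metrics hook could mirror what streaming sources already do and expose a map of custom metric values for the engine to fold into the streaming progress. The trait name, method signature, and the rowsWritten metric are all assumptions.

{code:java}
import java.util.{Collections, Map => JMap}

// Hypothetical trait: a sink that can report custom metrics after a batch.
trait ReportsSinkMetrics {
  def metrics(): JMap[String, String]
}

// Example sink tracking a hypothetical rowsWritten counter.
class CountingSink extends ReportsSinkMetrics {
  @volatile private var rowsWritten = 0L
  def write(rows: Seq[String]): Unit = { rowsWritten += rows.size }
  override def metrics(): JMap[String, String] =
    Collections.singletonMap("rowsWritten", rowsWritten.toString)
}

object SinkMetricsDemo extends App {
  val sink = new CountingSink
  sink.write(Seq("a", "b", "c"))
  println(sink.metrics()) // {rowsWritten=3}
}
{code}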
[jira] [Commented] (SPARK-38563) Upgrade to Py4J 0.10.9.4
[ https://issues.apache.org/jira/browse/SPARK-38563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507354#comment-17507354 ] Apache Spark commented on SPARK-38563: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/35871 > Upgrade to Py4J 0.10.9.4 > > > Key: SPARK-38563 > URL: https://issues.apache.org/jira/browse/SPARK-38563 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.1, 3.3.0 >Reporter: Hyukjin Kwon >Priority: Critical > > There is a resource leak bug; see https://github.com/py4j/py4j/pull/471. We > should upgrade Py4J to 0.10.9.4 to fix this.
[jira] [Assigned] (SPARK-38563) Upgrade to Py4J 0.10.9.4
[ https://issues.apache.org/jira/browse/SPARK-38563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38563: Assignee: (was: Apache Spark) > Upgrade to Py4J 0.10.9.4 > > > Key: SPARK-38563 > URL: https://issues.apache.org/jira/browse/SPARK-38563 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.1, 3.3.0 >Reporter: Hyukjin Kwon >Priority: Critical > > There is a resource leak bug; see https://github.com/py4j/py4j/pull/471. We > should upgrade Py4J to 0.10.9.4 to fix this.
[jira] [Assigned] (SPARK-38563) Upgrade to Py4J 0.10.9.4
[ https://issues.apache.org/jira/browse/SPARK-38563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38563: Assignee: Apache Spark > Upgrade to Py4J 0.10.9.4 > > > Key: SPARK-38563 > URL: https://issues.apache.org/jira/browse/SPARK-38563 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.1, 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Critical > > There is a resource leak bug; see https://github.com/py4j/py4j/pull/471. We > should upgrade Py4J to 0.10.9.4 to fix this.
[jira] [Commented] (SPARK-38563) Upgrade to Py4J 0.10.9.4
[ https://issues.apache.org/jira/browse/SPARK-38563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507352#comment-17507352 ] Apache Spark commented on SPARK-38563: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/35871 > Upgrade to Py4J 0.10.9.4 > > > Key: SPARK-38563 > URL: https://issues.apache.org/jira/browse/SPARK-38563 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.1, 3.3.0 >Reporter: Hyukjin Kwon >Priority: Critical > > There is a resource leak bug; see https://github.com/py4j/py4j/pull/471. We > should upgrade Py4J to 0.10.9.4 to fix this.
[jira] [Created] (SPARK-38563) Upgrade to Py4J 0.10.9.4
Hyukjin Kwon created SPARK-38563: Summary: Upgrade to Py4J 0.10.9.4 Key: SPARK-38563 URL: https://issues.apache.org/jira/browse/SPARK-38563 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.2.1, 3.3.0 Reporter: Hyukjin Kwon There is a resource leak bug; see https://github.com/py4j/py4j/pull/471. We should upgrade Py4J to 0.10.9.4 to fix this.
[jira] [Assigned] (SPARK-38562) Add doc for Volcano scheduler
[ https://issues.apache.org/jira/browse/SPARK-38562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38562: Assignee: Apache Spark > Add doc for Volcano scheduler > - > > Key: SPARK-38562 > URL: https://issues.apache.org/jira/browse/SPARK-38562 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Kubernetes >Affects Versions: 3.3.0 >Reporter: Yikun Jiang >Assignee: Apache Spark >Priority: Major >
[jira] [Commented] (SPARK-38562) Add doc for Volcano scheduler
[ https://issues.apache.org/jira/browse/SPARK-38562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507334#comment-17507334 ] Apache Spark commented on SPARK-38562: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/35870 > Add doc for Volcano scheduler > - > > Key: SPARK-38562 > URL: https://issues.apache.org/jira/browse/SPARK-38562 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Kubernetes >Affects Versions: 3.3.0 >Reporter: Yikun Jiang >Priority: Major >
[jira] [Commented] (SPARK-38561) Add doc for "Customized Kubernetes Schedulers"
[ https://issues.apache.org/jira/browse/SPARK-38561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507335#comment-17507335 ] Apache Spark commented on SPARK-38561: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/35869 > Add doc for "Customized Kubernetes Schedulers" > -- > > Key: SPARK-38561 > URL: https://issues.apache.org/jira/browse/SPARK-38561 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Kubernetes >Affects Versions: 3.3.0 >Reporter: Yikun Jiang >Priority: Major >
[jira] [Commented] (SPARK-38562) Add doc for Volcano scheduler
[ https://issues.apache.org/jira/browse/SPARK-38562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507336#comment-17507336 ] Apache Spark commented on SPARK-38562: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/35870 > Add doc for Volcano scheduler > - > > Key: SPARK-38562 > URL: https://issues.apache.org/jira/browse/SPARK-38562 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Kubernetes >Affects Versions: 3.3.0 >Reporter: Yikun Jiang >Priority: Major >
[jira] [Assigned] (SPARK-38562) Add doc for Volcano scheduler
[ https://issues.apache.org/jira/browse/SPARK-38562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38562: Assignee: (was: Apache Spark) > Add doc for Volcano scheduler > - > > Key: SPARK-38562 > URL: https://issues.apache.org/jira/browse/SPARK-38562 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Kubernetes >Affects Versions: 3.3.0 >Reporter: Yikun Jiang >Priority: Major >
[jira] [Assigned] (SPARK-38561) Add doc for "Customized Kubernetes Schedulers"
[ https://issues.apache.org/jira/browse/SPARK-38561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38561: Assignee: (was: Apache Spark) > Add doc for "Customized Kubernetes Schedulers" > -- > > Key: SPARK-38561 > URL: https://issues.apache.org/jira/browse/SPARK-38561 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Kubernetes >Affects Versions: 3.3.0 >Reporter: Yikun Jiang >Priority: Major >
[jira] [Commented] (SPARK-38561) Add doc for "Customized Kubernetes Schedulers"
[ https://issues.apache.org/jira/browse/SPARK-38561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507333#comment-17507333 ] Apache Spark commented on SPARK-38561: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/35869 > Add doc for "Customized Kubernetes Schedulers" > -- > > Key: SPARK-38561 > URL: https://issues.apache.org/jira/browse/SPARK-38561 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Kubernetes >Affects Versions: 3.3.0 >Reporter: Yikun Jiang >Priority: Major >
[jira] [Assigned] (SPARK-38561) Add doc for "Customized Kubernetes Schedulers"
[ https://issues.apache.org/jira/browse/SPARK-38561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38561: Assignee: Apache Spark > Add doc for "Customized Kubernetes Schedulers" > -- > > Key: SPARK-38561 > URL: https://issues.apache.org/jira/browse/SPARK-38561 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Kubernetes >Affects Versions: 3.3.0 >Reporter: Yikun Jiang >Assignee: Apache Spark >Priority: Major >
[jira] [Created] (SPARK-38562) Add doc for Volcano scheduler
Yikun Jiang created SPARK-38562: --- Summary: Add doc for Volcano scheduler Key: SPARK-38562 URL: https://issues.apache.org/jira/browse/SPARK-38562 Project: Spark Issue Type: Sub-task Components: Documentation, Kubernetes Affects Versions: 3.3.0 Reporter: Yikun Jiang
[jira] [Created] (SPARK-38561) Add doc for "Customized Kubernetes Schedulers"
Yikun Jiang created SPARK-38561: --- Summary: Add doc for "Customized Kubernetes Schedulers" Key: SPARK-38561 URL: https://issues.apache.org/jira/browse/SPARK-38561 Project: Spark Issue Type: Sub-task Components: Documentation, Kubernetes Affects Versions: 3.3.0 Reporter: Yikun Jiang
[jira] [Resolved] (SPARK-38424) Disallow unused casts and ignores
[ https://issues.apache.org/jira/browse/SPARK-38424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38424. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35740 [https://github.com/apache/spark/pull/35740] > Disallow unused casts and ignores > - > > Key: SPARK-38424 > URL: https://issues.apache.org/jira/browse/SPARK-38424 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.3.0 > > > Now that we have almost full typing coverage, we should consider setting the > following mypy options: > {code} > warn_unused_ignores = True > warn_redundant_casts = True > {code}
[jira] [Assigned] (SPARK-38424) Disallow unused casts and ignores
[ https://issues.apache.org/jira/browse/SPARK-38424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-38424: Assignee: Maciej Szymkiewicz > Disallow unused casts and ignores > - > > Key: SPARK-38424 > URL: https://issues.apache.org/jira/browse/SPARK-38424 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > > Now that we have almost full typing coverage, we should consider setting the > following mypy options: > {code} > warn_unused_ignores = True > warn_redundant_casts = True > {code}
[jira] [Created] (SPARK-38560) If `Sum`, `Count` with distinct, cannot do partial agg push down.
jiaan.geng created SPARK-38560: -- Summary: If `Sum`, `Count` with distinct, cannot do partial agg push down. Key: SPARK-38560 URL: https://issues.apache.org/jira/browse/SPARK-38560 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: jiaan.geng If Spark partially pushes down sum(distinct col) or count(distinct col) to a data source that has multiple partitions, Spark will sum the partial values again, so the result may be incorrect.
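A worked illustration of the problem: when the same value appears in more than one partition, per-partition distinct counts cannot simply be summed, so a partial pushdown of count(distinct) over a multi-partition source gives a wrong answer.

{code:java}
object DistinctPushdown extends App {
  val partition1 = Seq(1, 2, 3)
  val partition2 = Seq(3, 4) // value 3 also lives in partition1
  val pushedDown = partition1.distinct.size + partition2.distinct.size // 3 + 2 = 5 (wrong)
  val correct = (partition1 ++ partition2).distinct.size               // 4
  println(s"summed partials=$pushedDown, true count(distinct)=$correct")
}
{code}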
[jira] [Commented] (SPARK-38559) display the number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507323#comment-17507323 ] Apache Spark commented on SPARK-38559: -- User 'caican00' has created a pull request for this issue: https://github.com/apache/spark/pull/35867 > display the number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: image-2022-03-16-10-56-46-446.png, > image-2022-03-16-11-07-39-182.png > > > When demoting a join from broadcast-hash to SMJ, I think it is necessary to > display the number of empty partitions on the Spark UI. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. > Before updating the UI: > !image-2022-03-16-10-56-46-446.png! > After updating the UI, displaying the number of empty partitions: > !image-2022-03-16-11-07-39-182.png!
[jira] [Commented] (SPARK-38559) display the number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507325#comment-17507325 ] Apache Spark commented on SPARK-38559: -- User 'caican00' has created a pull request for this issue: https://github.com/apache/spark/pull/35867 > display the number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: image-2022-03-16-10-56-46-446.png, > image-2022-03-16-11-07-39-182.png > > > When demoting a join from broadcast-hash to SMJ, I think it is necessary to > display the number of empty partitions on the Spark UI. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. > Before updating the UI: > !image-2022-03-16-10-56-46-446.png! > After updating the UI, displaying the number of empty partitions: > !image-2022-03-16-11-07-39-182.png!
[jira] [Assigned] (SPARK-38559) display the number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38559: Assignee: (was: Apache Spark) > display the number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: image-2022-03-16-10-56-46-446.png, > image-2022-03-16-11-07-39-182.png > > > When demoting a join from broadcast-hash to SMJ, I think it is necessary to > display the number of empty partitions on the Spark UI. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. > Before updating the UI: > !image-2022-03-16-10-56-46-446.png! > After updating the UI, displaying the number of empty partitions: > !image-2022-03-16-11-07-39-182.png!
[jira] [Assigned] (SPARK-38559) display the number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38559: Assignee: Apache Spark > display the number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Assignee: Apache Spark >Priority: Major > Attachments: image-2022-03-16-10-56-46-446.png, > image-2022-03-16-11-07-39-182.png > > > When demoting a join from broadcast-hash to SMJ, I think it is necessary to > display the number of empty partitions on the Spark UI. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. > Before updating the UI: > !image-2022-03-16-10-56-46-446.png! > After updating the UI, displaying the number of empty partitions: > !image-2022-03-16-11-07-39-182.png!
[jira] [Updated] (SPARK-38559) display the number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Description: When demoting a join from broadcast-hash to SMJ, I think it is necessary to display the number of empty partitions on the Spark UI. Otherwise, users might wonder why SMJ is used when joining a small table. Displaying the number of empty partitions is useful for users to understand changes to the execution plan. Before updating the UI: !image-2022-03-16-10-56-46-446.png! After updating the UI, displaying the number of empty partitions: !image-2022-03-16-11-07-39-182.png! was: When demoting join from broadcast-hash to smj, i think it is necessary to display the number of empty partitions on spark ui. Otherwise, users might wonder why SMJ is used when joining a small table. Displaying the number of empty partitions is useful for users to understand changes to the execution plan. Before updated the ui:) !image-2022-03-16-10-56-46-446.png! After updated the ui, display the number of empty partitions:) !image-2022-03-16-11-07-39-182.png! > display the number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: image-2022-03-16-10-56-46-446.png, > image-2022-03-16-11-07-39-182.png > > > When demoting a join from broadcast-hash to SMJ, I think it is necessary to > display the number of empty partitions on the Spark UI. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. > Before updating the UI: > !image-2022-03-16-10-56-46-446.png! > After updating the UI, displaying the number of empty partitions: > !image-2022-03-16-11-07-39-182.png!
[jira] [Updated] (SPARK-38559) display the number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Description: When demoting a join from broadcast-hash to SMJ, I think it is necessary to display the number of empty partitions on the Spark UI. Otherwise, users might wonder why SMJ is used when joining a small table. Displaying the number of empty partitions is useful for users to understand changes to the execution plan. Before updating the UI: !image-2022-03-16-10-56-46-446.png! After updating the UI, displaying the number of empty partitions: !image-2022-03-16-11-07-39-182.png! was: When demoting join from broadcast-hash to smj, i think it is necessary to display the number of empty partitions on spark ui. Otherwise, users might wonder why SMJ is used when joining a small table. Displaying the number of empty partitions is useful for users to understand changes to the execution plan. Before modify the ui: !image-2022-03-16-10-56-46-446.png! > display the number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: image-2022-03-16-10-56-46-446.png, > image-2022-03-16-11-07-39-182.png > > > When demoting a join from broadcast-hash to SMJ, I think it is necessary to > display the number of empty partitions on the Spark UI. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. > Before updating the UI: > !image-2022-03-16-10-56-46-446.png! > After updating the UI, displaying the number of empty partitions: > !image-2022-03-16-11-07-39-182.png!
[jira] [Updated] (SPARK-38559) display the number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Attachment: image-2022-03-16-11-07-39-182.png > display the number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: image-2022-03-16-10-56-46-446.png, > image-2022-03-16-11-07-39-182.png > > > When demoting a join from broadcast-hash to SMJ, I think it is necessary to > display the number of empty partitions on the Spark UI. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. > Before modifying the UI: > !image-2022-03-16-10-56-46-446.png!
[jira] [Updated] (SPARK-38559) display the number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Description: When demoting a join from broadcast-hash to SMJ, I think it is necessary to display the number of empty partitions on the Spark UI. Otherwise, users might wonder why SMJ is used when joining a small table. Displaying the number of empty partitions is useful for users to understand changes to the execution plan. Before modifying the UI: !image-2022-03-16-10-56-46-446.png! was: When demoting join from broadcast-hash to smj, i think it is necessary to display number of empty partitions on spark ui. Otherwise, users might wonder why SMJ is used when joining a small table. Displaying the number of empty partitions is useful for users to understand changes to the execution plan. Before modify the ui: !image-2022-03-16-10-56-46-446.png! > display the number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: image-2022-03-16-10-56-46-446.png > > > When demoting a join from broadcast-hash to SMJ, I think it is necessary to > display the number of empty partitions on the Spark UI. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. > Before modifying the UI: > !image-2022-03-16-10-56-46-446.png!
[jira] [Updated] (SPARK-38559) display the number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Summary: display the number of empty partitions on spark ui (was: display number of empty partitions on spark ui) > display the number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: image-2022-03-16-10-56-46-446.png > > > When demoting a join from broadcast-hash to SMJ, I think it is necessary to > display the number of empty partitions on the Spark UI. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. > Before modifying the UI: > !image-2022-03-16-10-56-46-446.png!
[jira] [Updated] (SPARK-38559) display number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Attachment: (was: ui.png) > display number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: image-2022-03-16-10-56-46-446.png > > > When demoting a join from broadcast-hash to SMJ, I think it is necessary to > display the number of empty partitions on the Spark UI. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. > Before modifying the UI: > !image-2022-03-16-10-56-46-446.png!
[jira] [Updated] (SPARK-38559) display number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Description: When demoting a join from broadcast-hash to SMJ, I think it is necessary to display the number of empty partitions on the Spark UI. Otherwise, users might wonder why SMJ is used when joining a small table. Displaying the number of empty partitions is useful for users to understand changes to the execution plan. Before modifying the UI: !image-2022-03-16-10-56-46-446.png! was: When demoting join from broadcast-hash to smj, i think it is necessary to display number of empty partitions on spark ui. Otherwise, users might wonder why SMJ is used when joining a small table. Displaying the number of empty partitions is useful for users to understand changes to the execution plan. > display number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: image-2022-03-16-10-56-46-446.png, ui.png > > > When demoting a join from broadcast-hash to SMJ, I think it is necessary to > display the number of empty partitions on the Spark UI. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. > Before modifying the UI: > !image-2022-03-16-10-56-46-446.png!
[jira] [Updated] (SPARK-38559) display number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Attachment: image-2022-03-16-10-56-46-446.png > display number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: image-2022-03-16-10-56-46-446.png, ui.png > > > When demoting a join from broadcast-hash to SMJ, I think it is necessary to > display the number of empty partitions on the Spark UI. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan.
[jira] [Updated] (SPARK-38559) display number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Attachment: ui.png > display number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: image-2022-03-16-10-56-46-446.png, ui.png > > > When demoting a join from broadcast-hash to SMJ, I think it is necessary to > display the number of empty partitions on the Spark UI. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan.
[jira] [Updated] (SPARK-38559) display number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Attachment: (was: 小米办公20220316-105510.png) > display number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > > When demoting a join from broadcast-hash to SMJ, I think it is necessary to > display the number of empty partitions on the Spark UI. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan.
[jira] [Updated] (SPARK-38559) display number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Attachment: 小米办公20220316-105510.png > display number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > > When demoting a join from broadcast-hash to SMJ, I think it is necessary to > display the number of empty partitions on the Spark UI. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan.
[jira] [Updated] (SPARK-38559) display number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Summary: display number of empty partitions on spark ui (was: display number of empty partitions on spark ui when demoting join from broadcast-hash to smj) > display number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > > When demoting a join from broadcast-hash to SMJ, I think it is necessary to > display the number of empty partitions on the Spark UI. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan.
[jira] [Updated] (SPARK-38527) Set the minimum Volcano version
[ https://issues.apache.org/jira/browse/SPARK-38527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-38527: Labels: release-notes (was: ) > Set the minimum Volcano version > --- > > Key: SPARK-38527 > URL: https://issues.apache.org/jira/browse/SPARK-38527 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Kubernetes >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: release-notes > Fix For: 3.3.0 >
[jira] [Updated] (SPARK-38559) display number of empty partitions on spark ui when demoting join from broadcast-hash to smj
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Description: When demoting a join from broadcast-hash to SMJ, I think it is necessary to display the number of empty partitions on the Spark UI. Otherwise, users might wonder why SMJ is used when joining a small table. Displaying the number of empty partitions is useful for users to understand changes to the execution plan. was: When demoting join from broadcast-hash to smj, i think it is necessary to show number of empty partitions on spark ui. Otherwise, users might wonder why SMJ is used when joining a small table. Displaying the number of empty partitions is useful for users to understand changes to the execution plan. > display number of empty partitions on spark ui when demoting join from > broadcast-hash to smj > > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > > When demoting a join from broadcast-hash to SMJ, I think it is necessary to > display the number of empty partitions on the Spark UI. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan.
[jira] [Created] (SPARK-38559) show number of empty partitions on spark ui when demoting join from broadcast-hash to smj
caican created SPARK-38559: -- Summary: show number of empty partitions on spark ui when demoting join from broadcast-hash to smj Key: SPARK-38559 URL: https://issues.apache.org/jira/browse/SPARK-38559 Project: Spark Issue Type: Improvement Components: SQL, Web UI Affects Versions: 3.1.2 Reporter: caican When demoting a join from broadcast-hash to SMJ, I think it is necessary to show the number of empty partitions on the Spark UI. Otherwise, users might wonder why SMJ is used when joining a small table. Displaying the number of empty partitions is useful for users to understand changes to the execution plan.
[jira] [Updated] (SPARK-38559) display number of empty partitions on spark ui when demoting join from broadcast-hash to smj
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Summary: display number of empty partitions on spark ui when demoting join from broadcast-hash to smj (was: show number of empty partitions on spark ui when demoting join from broadcast-hash to smj) > display number of empty partitions on spark ui when demoting join from > broadcast-hash to smj > > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > > When demoting a join from broadcast-hash to SMJ, I think it is necessary to > show the number of empty partitions on the Spark UI. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan.
[jira] [Resolved] (SPARK-38508) Volcano feature doesn't work on EKS graviton instances
[ https://issues.apache.org/jira/browse/SPARK-38508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-38508. --- Fix Version/s: 3.3.0 Assignee: Yikun Jiang Resolution: Fixed > Volcano feature doesn't work on EKS graviton instances > -- > > Key: SPARK-38508 > URL: https://issues.apache.org/jira/browse/SPARK-38508 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.3.0 >
[jira] [Commented] (SPARK-38508) Volcano feature doesn't work on EKS graviton instances
[ https://issues.apache.org/jira/browse/SPARK-38508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507314#comment-17507314 ] Dongjoon Hyun commented on SPARK-38508: --- Thanks! > Volcano feature doesn't work on EKS graviton instances > -- > > Key: SPARK-38508 > URL: https://issues.apache.org/jira/browse/SPARK-38508 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major >
[jira] [Resolved] (SPARK-38515) Volcano queue is not deleted
[ https://issues.apache.org/jira/browse/SPARK-38515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-38515. --- Fix Version/s: 3.3.0 Assignee: Yikun Jiang Resolution: Fixed > Volcano queue is not deleted > > > Key: SPARK-38515 > URL: https://issues.apache.org/jira/browse/SPARK-38515 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Yikun Jiang >Priority: Critical > Fix For: 3.3.0 > > > {code} > $ k delete queue queue0 > Error from server: admission webhook "validatequeue.volcano.sh" denied the > request: only queue with state `Closed` can be deleted, queue `queue0` state > is `Open` > {code} > {code} > [info] org.apache.spark.deploy.k8s.integrationtest.VolcanoSuite *** ABORTED > *** (7 minutes, 40 seconds) > [info] io.fabric8.kubernetes.client.KubernetesClientException: Failure > executing: DELETE at: > https://44bea09e70a5147f6b5b347ec26de85f.gr7.us-west-2.eks.amazonaws.com/apis/scheduling.volcano.sh/v1beta1/queues/queue-2u-3g. > Message: admission webhook "validatequeue.volcano.sh" denied the request: > only queue with state `Closed` can be deleted, queue `queue-2u-3g` state is > `Open`. Received status: Status(apiVersion=v1, code=400, details=null, > kind=Status, message=admission webhook "validatequeue.volcano.sh" denied the > request: only queue with state `Closed` can be deleted, queue `queue-2u-3g` > state is `Open`, metadata=ListMeta(_continue=null, remainingItemCount=null, > resourceVersion=null, selfLink=null, additionalProperties={}), reason=null, > status=Failure, additionalProperties={}). > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38515) Volcano queue is not deleted
[ https://issues.apache.org/jira/browse/SPARK-38515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507313#comment-17507313 ] Dongjoon Hyun commented on SPARK-38515: --- Thanks! > Volcano queue is not deleted > > > Key: SPARK-38515 > URL: https://issues.apache.org/jira/browse/SPARK-38515 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Yikun Jiang >Priority: Critical > Fix For: 3.3.0 > > > {code} > $ k delete queue queue0 > Error from server: admission webhook "validatequeue.volcano.sh" denied the > request: only queue with state `Closed` can be deleted, queue `queue0` state > is `Open` > {code} > {code} > [info] org.apache.spark.deploy.k8s.integrationtest.VolcanoSuite *** ABORTED > *** (7 minutes, 40 seconds) > [info] io.fabric8.kubernetes.client.KubernetesClientException: Failure > executing: DELETE at: > https://44bea09e70a5147f6b5b347ec26de85f.gr7.us-west-2.eks.amazonaws.com/apis/scheduling.volcano.sh/v1beta1/queues/queue-2u-3g. > Message: admission webhook "validatequeue.volcano.sh" denied the request: > only queue with state `Closed` can be deleted, queue `queue-2u-3g` state is > `Open`. Received status: Status(apiVersion=v1, code=400, details=null, > kind=Status, message=admission webhook "validatequeue.volcano.sh" denied the > request: only queue with state `Closed` can be deleted, queue `queue-2u-3g` > state is `Open`, metadata=ListMeta(_continue=null, remainingItemCount=null, > resourceVersion=null, selfLink=null, additionalProperties={}), reason=null, > status=Failure, additionalProperties={}). > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38397) Support Kueue: K8s-native Job Queueing
[ https://issues.apache.org/jira/browse/SPARK-38397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38397: -- Affects Version/s: 3.4.0 (was: 3.3.0) > Support Kueue: K8s-native Job Queueing > -- > > Key: SPARK-38397 > URL: https://issues.apache.org/jira/browse/SPARK-38397 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > There are several ways to run Spark on K8s including vanilla `spark-submit` > with built-in `KubernetesClusterManager`, `spark-submit` with custom > `ExternalClusterManager`, CRD-based operators (like spark-on-k8s-operator), > custom K8s `schedulers`, custom `standalone pod definitions`, and so on. > This issue is tracking K8s-native Job Queueing related work. > * [https://github.com/kubernetes-sigs/kueue] > {code} > metadata: > generateName: sample-job- > annotations: > kueue.k8s.io/queue-name: main > {code} > The best case is Apache Spark users use it in the future via pod templates or > existing configuration. In other words, we don't need to do anything and > close this JIRA without any patches. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
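[Editor's note] If Kueue's admission really keys only off the queue-name annotation shown above, Spark's existing annotation passthrough configs (`spark.kubernetes.driver.annotation.[AnnotationName]` and the executor equivalent) might already be enough to experiment with, with no Spark change, which is exactly the "close this JIRA without any patches" outcome the description hopes for. A hedged sketch; whether Kueue's webhook actually gates bare driver/executor pods (rather than only Jobs) is an assumption, and the image/jar paths are placeholders:
{code}
# Hedged sketch, not a verified recipe: attach the Kueue queue-name annotation
# from the description above to the driver and executor pods via Spark's
# existing annotation passthrough configs. Whether Kueue then queues these
# pods is the open question this JIRA tracks.
spark-submit \
  --master k8s://https://kubernetes.default.svc \
  --deploy-mode cluster \
  --conf spark.kubernetes.driver.annotation.kueue.k8s.io/queue-name=main \
  --conf spark.kubernetes.executor.annotation.kueue.k8s.io/queue-name=main \
  --class org.apache.spark.examples.SparkPi \
  local:///opt/spark/examples/jars/spark-examples.jar
{code}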
[jira] [Updated] (SPARK-38558) Remove unnecessary casts between IntegerType and IntDecimal
[ https://issues.apache.org/jira/browse/SPARK-38558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Cashman updated SPARK-38558: -- Priority: Minor (was: Major) > Remove unnecessary casts between IntegerType and IntDecimal > --- > > Key: SPARK-38558 > URL: https://issues.apache.org/jira/browse/SPARK-38558 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: David Cashman >Priority: Minor > > In {{NTile}}, the number of rows per bucket is computed as {{n / buckets}}, > where {{n}} is the partition size, and {{buckets}} is the > argument to {{NTile}} (number of buckets). The code currently casts the > arguments to IntDecimal, then casts the result back to IntegerType. This is > unnecessary, since it is equivalent to just doing integer division, i.e. > {{n div buckets}}. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38558) Remove unnecessary casts between IntegerType and IntDecimal
[ https://issues.apache.org/jira/browse/SPARK-38558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38558: Assignee: (was: Apache Spark) > Remove unnecessary casts between IntegerType and IntDecimal > --- > > Key: SPARK-38558 > URL: https://issues.apache.org/jira/browse/SPARK-38558 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: David Cashman >Priority: Major > > In {{NTile}}, the number of rows per bucket is computed as {{n / buckets}}, > where {{n}} is the partition size, and {{buckets}} is the > argument to {{NTile}} (number of buckets). The code currently casts the > arguments to IntDecimal, then casts the result back to IntegerType. This is > unnecessary, since it is equivalent to just doing integer division, i.e. > {{n div buckets}}. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38558) Remove unnecessary casts between IntegerType and IntDecimal
[ https://issues.apache.org/jira/browse/SPARK-38558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38558: Assignee: Apache Spark > Remove unnecessary casts between IntegerType and IntDecimal > --- > > Key: SPARK-38558 > URL: https://issues.apache.org/jira/browse/SPARK-38558 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: David Cashman >Assignee: Apache Spark >Priority: Major > > In {{NTile}}, the number of rows per bucket is computed as {{n / buckets}}, > where {{n}} is the partition size, and {{buckets}} is the > argument to {{NTile}} (number of buckets). The code currently casts the > arguments to IntDecimal, then casts the result back to IntegerType. This is > unnecessary, since it is equivalent to just doing integer division, i.e. > {{n div buckets}}. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38558) Remove unnecessary casts between IntegerType and IntDecimal
[ https://issues.apache.org/jira/browse/SPARK-38558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507300#comment-17507300 ] Apache Spark commented on SPARK-38558: -- User 'cashmand' has created a pull request for this issue: https://github.com/apache/spark/pull/35863 > Remove unnecessary casts between IntegerType and IntDecimal > --- > > Key: SPARK-38558 > URL: https://issues.apache.org/jira/browse/SPARK-38558 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: David Cashman >Priority: Major > > In {{NTile}}, the number of rows per bucket is computed as {{n / buckets}}, > where {{n}} is the partition size, and {{buckets}} is the > argument to {{NTile}} (number of buckets). The code currently casts the > arguments to IntDecimal, then casts the result back to IntegerType. This is > unnecessary, since it is equivalent to just doing integer division, i.e. > {{n div buckets}}. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38558) Remove unnecessary casts between IntegerType and IntDecimal
David Cashman created SPARK-38558: - Summary: Remove unnecessary casts between IntegerType and IntDecimal Key: SPARK-38558 URL: https://issues.apache.org/jira/browse/SPARK-38558 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.1 Reporter: David Cashman In {{NTile}}, the number of rows per bucket is computed as {{n / buckets}}, where {{n}} is the partition size, and {{buckets}} is the argument to {{NTile}} (number of buckets). The code currently casts the arguments to IntDecimal, then casts the result back to IntegerType. This is unnecessary, since it is equivalent to just doing integer division, i.e. {{n div buckets}}. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
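[Editor's note] Since the claim above is purely arithmetic, it can be sanity-checked outside Spark. A minimal standalone Scala sketch (not Spark's expression code) showing that a cast-to-decimal, divide, cast-back-to-int round trip matches plain integer division on the non-negative operands NTile deals with:
{code}
// Standalone check of the equivalence claimed above, outside Spark.
object NTileDivCheck {
  // Mimics Cast(Divide(Cast(n, DecimalType), Cast(buckets, DecimalType)), IntegerType)
  def viaDecimal(n: Int, buckets: Int): Int =
    (BigDecimal(n) / BigDecimal(buckets)).toInt // toInt truncates toward zero

  def viaIntDiv(n: Int, buckets: Int): Int = n / buckets

  def main(args: Array[String]): Unit = {
    val ok = (for { n <- 0 to 10000; b <- 1 to 32 } yield (n, b))
      .forall { case (n, b) => viaDecimal(n, b) == viaIntDiv(n, b) }
    println(s"decimal round-trip == integer division on all samples: $ok")
  }
}
{code}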
[jira] [Updated] (SPARK-38529) Prevent GeneratorNestedColumnAliasing to be applied to non-Explode generators
[ https://issues.apache.org/jira/browse/SPARK-38529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-38529: Summary: Prevent GeneratorNestedColumnAliasing to be applied to non-Explode generators (was: GeneratorNestedColumnAliasing works incorrectly for non-Explode generators) > Prevent GeneratorNestedColumnAliasing to be applied to non-Explode generators > - > > Key: SPARK-38529 > URL: https://issues.apache.org/jira/browse/SPARK-38529 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1 >Reporter: Min Yang >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > The Project(_, g: Generate) branch in GeneratorNestedColumnAliasing is only > supposed to work for ExplodeBase generators but we do not explicitly return > for other types like Inline. Currently the bug is not triggered because there > is another bug in the "prune unrequired child" branch of ColumnPruning > which makes other generators like Inline always go to that branch even if it > is not applicable. > > An easy example to show the bug: > Input: <col2: array<struct<field1: struct<field1: int>, field2: int>>> > Project(field1.field1 as ...) > - Generate(Inline(col2), ..., field1, field2) > > We will try to incorrectly push the .field1 on field1 into the input of the > Inline (col2). > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38529) Prevent GeneratorNestedColumnAliasing to be applied to non-Explode generators
[ https://issues.apache.org/jira/browse/SPARK-38529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-38529: Issue Type: Improvement (was: Bug) > Prevent GeneratorNestedColumnAliasing to be applied to non-Explode generators > - > > Key: SPARK-38529 > URL: https://issues.apache.org/jira/browse/SPARK-38529 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 3.2.1 >Reporter: Min Yang >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > The Project(_, g: Generate) branch in GeneratorNestedColumnAliasing is only > supposed to work for ExplodeBase generators but we do not explicitly return > for other types like Inline. Currently the bug is not triggered because there > is another bug in the "prune unrequired child" branch of ColumnPruning > which makes other generators like Inline always go to that branch even if it > is not applicable. > > An easy example to show the bug: > Input: <col2: array<struct<field1: struct<field1: int>, field2: int>>> > Project(field1.field1 as ...) > - Generate(Inline(col2), ..., field1, field2) > > We will try to incorrectly push the .field1 on field1 into the input of the > Inline (col2). > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
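[Editor's note] The fix shape the new summary implies is essentially an explicit type guard. A sketch with stand-in types, not the real Catalyst classes (the actual rule matches on Spark's ExplodeBase):
{code}
// Stand-in types to illustrate the guard; Spark's real rule pattern-matches
// on Catalyst classes. The point: only ExplodeBase-like generators take the
// aliasing rewrite, and Inline and friends now explicitly fall through.
sealed trait Generator
final case class Explode(child: String) extends Generator // stands in for ExplodeBase
final case class Inline(child: String) extends Generator

def tryRewrite(g: Generator): Option[String] = g match {
  case Explode(child) => Some(s"push field extraction into $child")
  case _              => None // explicit return for non-Explode generators, per this fix
}
{code}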
[jira] [Commented] (SPARK-38530) GeneratorNestedColumnAliasing does not work correctly for some expressions
[ https://issues.apache.org/jira/browse/SPARK-38530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507294#comment-17507294 ] Apache Spark commented on SPARK-38530: -- User 'minyyy' has created a pull request for this issue: https://github.com/apache/spark/pull/35866 > GeneratorNestedColumnAliasing does not work correctly for some expressions > -- > > Key: SPARK-38530 > URL: https://issues.apache.org/jira/browse/SPARK-38530 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1 >Reporter: Min Yang >Priority: Major > > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala#L226] > The code to collect ExtractValue expressions is wrong. We should do it in a > bottom-up way instead of only checking 2 levels. It can cause incorrect results > if the expression looks like ExtractValue(ExtractValue(some_other_expr)). > > An example to trigger the bug is: > > input: <col1: array<struct<a: int, b: struct<a: struct<a: int>>>>> > > Project(ExtractValue(ExtractValue(CaseWhen([col.a == 1, col.b]), "a"), "a") > - Generate(Explode(col1)) > > We will try to incorrectly push down the whole expression into the input of > the Explode, now the input of CaseWhen has array<...> as its input, so we will get > a wrong result. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38530) GeneratorNestedColumnAliasing does not work correctly for some expressions
[ https://issues.apache.org/jira/browse/SPARK-38530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38530: Assignee: (was: Apache Spark) > GeneratorNestedColumnAliasing does not work correctly for some expressions > -- > > Key: SPARK-38530 > URL: https://issues.apache.org/jira/browse/SPARK-38530 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1 >Reporter: Min Yang >Priority: Major > > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala#L226] > The code to collect ExtractValue expressions is wrong. We should do it in a > bottom-up way instead of only checking 2 levels. It can cause incorrect results > if the expression looks like ExtractValue(ExtractValue(some_other_expr)). > > An example to trigger the bug is: > > input: <col1: array<struct<a: int, b: struct<a: struct<a: int>>>>> > > Project(ExtractValue(ExtractValue(CaseWhen([col.a == 1, col.b]), "a"), "a") > - Generate(Explode(col1)) > > We will try to incorrectly push down the whole expression into the input of > the Explode, now the input of CaseWhen has array<...> as its input, so we will get > a wrong result. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38530) GeneratorNestedColumnAliasing does not work correctly for some expressions
[ https://issues.apache.org/jira/browse/SPARK-38530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38530: Assignee: Apache Spark > GeneratorNestedColumnAliasing does not work correctly for some expressions > -- > > Key: SPARK-38530 > URL: https://issues.apache.org/jira/browse/SPARK-38530 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1 >Reporter: Min Yang >Assignee: Apache Spark >Priority: Major > > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala#L226] > The code to collect ExtractValue expressions is wrong. We should do it in a > bottom-up way instead of only checking 2 levels. It can cause incorrect results > if the expression looks like ExtractValue(ExtractValue(some_other_expr)). > > An example to trigger the bug is: > > input: <col1: array<struct<a: int, b: struct<a: struct<a: int>>>>> > > Project(ExtractValue(ExtractValue(CaseWhen([col.a == 1, col.b]), "a"), "a") > - Generate(Explode(col1)) > > We will try to incorrectly push down the whole expression into the input of > the Explode, now the input of CaseWhen has array<...> as its input, so we will get > a wrong result. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
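[Editor's note] To make the "bottom-up instead of two levels" point concrete, a toy sketch with stand-in classes (not Catalyst): a chain only counts as pushable if every node from the top ExtractValue down to the leaf attribute is itself an ExtractValue, which a fixed-depth check cannot guarantee.
{code}
// Toy expression nodes (not Catalyst) showing why the walk must be bottom-up:
// an ExtractValue can sit above a non-ExtractValue node at any depth, e.g.
// ExtractValue(ExtractValue(CaseWhen(...), "a"), "a") from the example above.
sealed trait Expr
final case class AttrRef(name: String) extends Expr
final case class ExtractValue(child: Expr, field: String) extends Expr
final case class CaseWhen(branches: List[Expr]) extends Expr

// Pushable only if the whole chain bottoms out at an attribute reference.
def pushableChain(e: Expr): Boolean = e match {
  case ExtractValue(child, _) => pushableChain(child)
  case AttrRef(_)             => true
  case _                      => false // CaseWhen etc. block the push-down
}

// The buggy two-level check would accept this; the bottom-up walk rejects it:
val bad = ExtractValue(ExtractValue(CaseWhen(List(AttrRef("col"))), "a"), "a")
assert(!pushableChain(bad))
{code}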
[jira] [Commented] (SPARK-29091) spark-shell don't support added jar's class as Serde class
[ https://issues.apache.org/jira/browse/SPARK-29091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507293#comment-17507293 ] leesf commented on SPARK-29091: --- any updates here? we also encountered this problem. > spark-shell don't support added jar's class as Serde class > --- > > Key: SPARK-29091 > URL: https://issues.apache.org/jira/browse/SPARK-29091 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.4, 3.0.0 >Reporter: angerszhu >Priority: Major > > {code:java} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT > /_/Using Scala version 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java > 1.8.0_191) > Type in expressions to have them evaluated. > Type :help for more information.scala> spark.sql("add jar > /Users/angerszhu/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar") > 19/09/16 07:38:01 main WARN ObjectStore: Version information not found in > metastore. hive.metastore.schema.verification is not enabled so recording the > schema version 1.2.0 > 19/09/16 07:38:01 main WARN ObjectStore: Failed to get database default, > returning NoSuchObjectException > res0: org.apache.spark.sql.DataFrame = [result: int]scala> spark.sql("CREATE > TABLE addJar27(key string) ROW FORMAT SERDE > 'org.apache.hive.hcatalog.data.JsonSerDe'") > 19/09/16 07:38:05 main WARN HiveMetaStore: Location: > file:/Users/angerszhu/Documents/project/AngersZhu/spark/spark-warehouse/addjar27 > specified for non-external table:addjar27 > res1: org.apache.spark.sql.DataFrame = []scala> spark.sql("select * from > addJar27").show > 19/09/16 07:38:08 main WARN ObjectStore: Failed to get database global_temp, > returning NoSuchObjectException > java.lang.RuntimeException: java.lang.ClassNotFoundException: > org.apache.hive.hcatalog.data.JsonSerDe > at > org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializerClass(TableDesc.java:74) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec.addColumnMetadataToConf(HiveTableScanExec.scala:123) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopConf$lzycompute(HiveTableScanExec.scala:101) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopConf(HiveTableScanExec.scala:98) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopReader$lzycompute(HiveTableScanExec.scala:110) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopReader(HiveTableScanExec.scala:105) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec.$anonfun$doExecute$1(HiveTableScanExec.scala:188) > at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2488) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:188) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:189) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:227) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:224) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:185) > at > org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:329) > at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:421) > at > org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47) > at 
org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3382) > at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2509) > at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3372) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$4(SQLExecution.scala:100) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3368) > at org.apache.spark.sql.Dataset.head(Dataset.scala:2509) > at org.apache.spark.sql.Dataset.take(Dataset.scala:2716) > at org.apache.spark.sql.Dataset.getRows(Dataset.scala:290) > at org.apache.spark.sql.Dataset.showString(Dataset.scala:327) > at org.apache.spark.sql.Dataset.show(Dataset.scala:792) > at org.apache.spark.sql.Dataset.show(Dataset.scala:751) > at org.apache.spark.sql.Dataset.show(Dataset.scala:760) > ... 47 elided > Caused by: java.lang.ClassNotFoundException: > org.apache.hive.hcatalog.data.JsonSerDe > at > scala.reflect.internal.util.AbstractFileClassLoade
[jira] [Assigned] (SPARK-38106) Use error classes in the parsing errors of functions
[ https://issues.apache.org/jira/browse/SPARK-38106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38106: Assignee: Apache Spark > Use error classes in the parsing errors of functions > > > Key: SPARK-38106 > URL: https://issues.apache.org/jira/browse/SPARK-38106 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Migrate the following errors in QueryParsingErrors: > * functionNameUnsupportedError > * showFunctionsUnsupportedError > * showFunctionsInvalidPatternError > * createFuncWithBothIfNotExistsAndReplaceError > * defineTempFuncWithIfNotExistsError > * unsupportedFunctionNameError > * specifyingDBInCreateTempFuncError > * invalidNameForDropTempFunc > to use error classes. Throw an implementation of SparkThrowable. Also write > a test for every error in QueryParsingErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38106) Use error classes in the parsing errors of functions
[ https://issues.apache.org/jira/browse/SPARK-38106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38106: Assignee: (was: Apache Spark) > Use error classes in the parsing errors of functions > > > Key: SPARK-38106 > URL: https://issues.apache.org/jira/browse/SPARK-38106 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > Migrate the following errors in QueryParsingErrors: > * functionNameUnsupportedError > * showFunctionsUnsupportedError > * showFunctionsInvalidPatternError > * createFuncWithBothIfNotExistsAndReplaceError > * defineTempFuncWithIfNotExistsError > * unsupportedFunctionNameError > * specifyingDBInCreateTempFuncError > * invalidNameForDropTempFunc > to use error classes. Throw an implementation of SparkThrowable. Also write > a test for every error in QueryParsingErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38106) Use error classes in the parsing errors of functions
[ https://issues.apache.org/jira/browse/SPARK-38106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507280#comment-17507280 ] Apache Spark commented on SPARK-38106: -- User 'ivoson' has created a pull request for this issue: https://github.com/apache/spark/pull/35865 > Use error classes in the parsing errors of functions > > > Key: SPARK-38106 > URL: https://issues.apache.org/jira/browse/SPARK-38106 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > Migrate the following errors in QueryParsingErrors: > * functionNameUnsupportedError > * showFunctionsUnsupportedError > * showFunctionsInvalidPatternError > * createFuncWithBothIfNotExistsAndReplaceError > * defineTempFuncWithIfNotExistsError > * unsupportedFunctionNameError > * specifyingDBInCreateTempFuncError > * invalidNameForDropTempFunc > to use error classes. Throw an implementation of SparkThrowable. Also write > a test for every error in QueryParsingErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
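[Editor's note] For readers unfamiliar with the error-class migration, the general before/after shape looks roughly like the sketch below. The class, error-class name, and constructor here are stand-ins, not Spark's actual ParseException signature; see the linked pull request for the real change.
{code}
// Stand-in sketch of the migration pattern only; Spark's real ParseException
// and error-class registry differ. The idea: replace a free-form message with
// a stable error class plus message parameters, queryable by tooling.
class ParseErrorWithClass(val errorClass: String, val parameters: Array[String])
  extends Exception(s"[$errorClass] " + parameters.mkString(", "))

// Before (conceptually): new ParseException("Unsupported function name " + name, ctx)
// After (conceptually):
def functionNameUnsupportedError(name: String): Throwable =
  new ParseErrorWithClass("INVALID_SQL_SYNTAX", Array(s"Unsupported function name $name"))
{code}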
[jira] [Commented] (SPARK-38508) Volcano feature doesn't work on EKS graviton instances
[ https://issues.apache.org/jira/browse/SPARK-38508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507278#comment-17507278 ] Yikun Jiang commented on SPARK-38508: - Resolved by https://github.com/apache/spark/pull/35819 > Volcano feature doesn't work on EKS graviton instances > -- > > Key: SPARK-38508 > URL: https://issues.apache.org/jira/browse/SPARK-38508 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38515) Volcano queue is not deleted
[ https://issues.apache.org/jira/browse/SPARK-38515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507279#comment-17507279 ] Yikun Jiang commented on SPARK-38515: - Resolved by https://github.com/apache/spark/pull/35819 > Volcano queue is not deleted > > > Key: SPARK-38515 > URL: https://issues.apache.org/jira/browse/SPARK-38515 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Critical > > {code} > $ k delete queue queue0 > Error from server: admission webhook "validatequeue.volcano.sh" denied the > request: only queue with state `Closed` can be deleted, queue `queue0` state > is `Open` > {code} > {code} > [info] org.apache.spark.deploy.k8s.integrationtest.VolcanoSuite *** ABORTED > *** (7 minutes, 40 seconds) > [info] io.fabric8.kubernetes.client.KubernetesClientException: Failure > executing: DELETE at: > https://44bea09e70a5147f6b5b347ec26de85f.gr7.us-west-2.eks.amazonaws.com/apis/scheduling.volcano.sh/v1beta1/queues/queue-2u-3g. > Message: admission webhook "validatequeue.volcano.sh" denied the request: > only queue with state `Closed` can be deleted, queue `queue-2u-3g` state is > `Open`. Received status: Status(apiVersion=v1, code=400, details=null, > kind=Status, message=admission webhook "validatequeue.volcano.sh" denied the > request: only queue with state `Closed` can be deleted, queue `queue-2u-3g` > state is `Open`, metadata=ListMeta(_continue=null, remainingItemCount=null, > resourceVersion=null, selfLink=null, additionalProperties={}), reason=null, > status=Failure, additionalProperties={}). > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-38508) Volcano feature doesn't work on EKS graviton instances
[ https://issues.apache.org/jira/browse/SPARK-38508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507276#comment-17507276 ] Yikun Jiang edited comment on SPARK-38508 at 3/16/22, 12:42 AM: [~dongjoon] Yes: [https://github.com/volcano-sh/volcano/releases/tag/v1.5.1] [1] bug fix: regenerate installer/volcano-development-arm64.yaml to fix arm64 deploy [2] https://github.com/volcano-sh/volcano/commit/42fd4883189e47d2555f71b26182ae5e13651931 was (Author: yikunkero): [~dongjoon] Yes > Volcano feature doesn't work on EKS graviton instances > -- > > Key: SPARK-38508 > URL: https://issues.apache.org/jira/browse/SPARK-38508 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-38515) Volcano queue is not deleted
[ https://issues.apache.org/jira/browse/SPARK-38515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507277#comment-17507277 ] Yikun Jiang edited comment on SPARK-38515 at 3/16/22, 12:40 AM: [~dongjoon] Yes [1] bug fix: {{Open}} state queue can be deleted: https://github.com/volcano-sh/volcano/releases/tag/v1.5.1 [2] https://github.com/volcano-sh/volcano/pull/2077/commits/54446650eca749594fc21949223c14fb7cabc8de was (Author: yikunkero): [~dongjoon] Yes > Volcano queue is not deleted > > > Key: SPARK-38515 > URL: https://issues.apache.org/jira/browse/SPARK-38515 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Critical > > {code} > $ k delete queue queue0 > Error from server: admission webhook "validatequeue.volcano.sh" denied the > request: only queue with state `Closed` can be deleted, queue `queue0` state > is `Open` > {code} > {code} > [info] org.apache.spark.deploy.k8s.integrationtest.VolcanoSuite *** ABORTED > *** (7 minutes, 40 seconds) > [info] io.fabric8.kubernetes.client.KubernetesClientException: Failure > executing: DELETE at: > https://44bea09e70a5147f6b5b347ec26de85f.gr7.us-west-2.eks.amazonaws.com/apis/scheduling.volcano.sh/v1beta1/queues/queue-2u-3g. > Message: admission webhook "validatequeue.volcano.sh" denied the request: > only queue with state `Closed` can be deleted, queue `queue-2u-3g` state is > `Open`. Received status: Status(apiVersion=v1, code=400, details=null, > kind=Status, message=admission webhook "validatequeue.volcano.sh" denied the > request: only queue with state `Closed` can be deleted, queue `queue-2u-3g` > state is `Open`, metadata=ListMeta(_continue=null, remainingItemCount=null, > resourceVersion=null, selfLink=null, additionalProperties={}), reason=null, > status=Failure, additionalProperties={}). > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38515) Volcano queue is not deleted
[ https://issues.apache.org/jira/browse/SPARK-38515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507277#comment-17507277 ] Yikun Jiang commented on SPARK-38515: - [~dongjoon] Yes > Volcano queue is not deleted > > > Key: SPARK-38515 > URL: https://issues.apache.org/jira/browse/SPARK-38515 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Critical > > {code} > $ k delete queue queue0 > Error from server: admission webhook "validatequeue.volcano.sh" denied the > request: only queue with state `Closed` can be deleted, queue `queue0` state > is `Open` > {code} > {code} > [info] org.apache.spark.deploy.k8s.integrationtest.VolcanoSuite *** ABORTED > *** (7 minutes, 40 seconds) > [info] io.fabric8.kubernetes.client.KubernetesClientException: Failure > executing: DELETE at: > https://44bea09e70a5147f6b5b347ec26de85f.gr7.us-west-2.eks.amazonaws.com/apis/scheduling.volcano.sh/v1beta1/queues/queue-2u-3g. > Message: admission webhook "validatequeue.volcano.sh" denied the request: > only queue with state `Closed` can be deleted, queue `queue-2u-3g` state is > `Open`. Received status: Status(apiVersion=v1, code=400, details=null, > kind=Status, message=admission webhook "validatequeue.volcano.sh" denied the > request: only queue with state `Closed` can be deleted, queue `queue-2u-3g` > state is `Open`, metadata=ListMeta(_continue=null, remainingItemCount=null, > resourceVersion=null, selfLink=null, additionalProperties={}), reason=null, > status=Failure, additionalProperties={}). > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38508) Volcano feature doesn't work on EKS graviton instances
[ https://issues.apache.org/jira/browse/SPARK-38508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507276#comment-17507276 ] Yikun Jiang commented on SPARK-38508: - [~dongjoon] Yes > Volcano feature doesn't work on EKS graviton instances > -- > > Key: SPARK-38508 > URL: https://issues.apache.org/jira/browse/SPARK-38508 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38531) "Prune unrequired child index" branch of ColumnPruning has wrong condition
[ https://issues.apache.org/jira/browse/SPARK-38531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507268#comment-17507268 ] Apache Spark commented on SPARK-38531: -- User 'minyyy' has created a pull request for this issue: https://github.com/apache/spark/pull/35864 > "Prune unrequired child index" branch of ColumnPruning has wrong condition > -- > > Key: SPARK-38531 > URL: https://issues.apache.org/jira/browse/SPARK-38531 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1 >Reporter: Min Yang >Priority: Minor > > The "prune unrequired references" branch has the condition: > {code:java} > case p @ Project(_, g: Generate) if p.references != g.outputSet => {code} > This is wrong as generators like Inline will always enter this branch as long > as the project does not use all the generator output. > > Example: > > input: <col1: array<struct<a: struct<a: int>, b: int>>> > > Project(a.a as x) > - Generate(Inline(col1), ..., a, b) > > p.references is [a] > g.outputSet is [a, b] > > This bug makes us never enter the GeneratorNestedColumnAliasing branch below, > thus missing some optimization opportunities. The condition should be > {code:java} > g.requiredChildOutput.exists(!p.references.contains(_)) {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38531) "Prune unrequired child index" branch of ColumnPruning has wrong condition
[ https://issues.apache.org/jira/browse/SPARK-38531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38531: Assignee: (was: Apache Spark) > "Prune unrequired child index" branch of ColumnPruning has wrong condition > -- > > Key: SPARK-38531 > URL: https://issues.apache.org/jira/browse/SPARK-38531 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1 >Reporter: Min Yang >Priority: Minor > > The "prune unrequired references" branch has the condition: > {code:java} > case p @ Project(_, g: Generate) if p.references != g.outputSet => {code} > This is wrong as generators like Inline will always enter this branch as long > as the project does not use all the generator output. > > Example: > > input: <col1: array<struct<a: struct<a: int>, b: int>>> > > Project(a.a as x) > - Generate(Inline(col1), ..., a, b) > > p.references is [a] > g.outputSet is [a, b] > > This bug makes us never enter the GeneratorNestedColumnAliasing branch below, > thus missing some optimization opportunities. The condition should be > {code:java} > g.requiredChildOutput.exists(!p.references.contains(_)) {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38531) "Prune unrequired child index" branch of ColumnPruning has wrong condition
[ https://issues.apache.org/jira/browse/SPARK-38531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507266#comment-17507266 ] Apache Spark commented on SPARK-38531: -- User 'minyyy' has created a pull request for this issue: https://github.com/apache/spark/pull/35864 > "Prune unrequired child index" branch of ColumnPruning has wrong condition > -- > > Key: SPARK-38531 > URL: https://issues.apache.org/jira/browse/SPARK-38531 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1 >Reporter: Min Yang >Priority: Minor > > The "prune unrequired references" branch has the condition: > {code:java} > case p @ Project(_, g: Generate) if p.references != g.outputSet => {code} > This is wrong as generators like Inline will always enter this branch as long > as the project does not use all the generator output. > > Example: > > input: <col1: array<struct<a: struct<a: int>, b: int>>> > > Project(a.a as x) > - Generate(Inline(col1), ..., a, b) > > p.references is [a] > g.outputSet is [a, b] > > This bug makes us never enter the GeneratorNestedColumnAliasing branch below, > thus missing some optimization opportunities. The condition should be > {code:java} > g.requiredChildOutput.exists(!p.references.contains(_)) {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38531) "Prune unrequired child index" branch of ColumnPruning has wrong condition
[ https://issues.apache.org/jira/browse/SPARK-38531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38531: Assignee: Apache Spark > "Prune unrequired child index" branch of ColumnPruning has wrong condition > -- > > Key: SPARK-38531 > URL: https://issues.apache.org/jira/browse/SPARK-38531 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1 >Reporter: Min Yang >Assignee: Apache Spark >Priority: Minor > > The "prune unrequired references" branch has the condition: > {code:java} > case p @ Project(_, g: Generate) if p.references != g.outputSet => {code} > This is wrong as generators like Inline will always enter this branch as long > as the project does not use all the generator output. > > Example: > > input: <col1: array<struct<a: struct<a: int>, b: int>>> > > Project(a.a as x) > - Generate(Inline(col1), ..., a, b) > > p.references is [a] > g.outputSet is [a, b] > > This bug makes us never enter the GeneratorNestedColumnAliasing branch below, > thus missing some optimization opportunities. The condition should be > {code:java} > g.requiredChildOutput.exists(!p.references.contains(_)) {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
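[Editor's note] A self-contained illustration of the condition fix on the exact example above, with plain Sets standing in for Catalyst's AttributeSet:
{code}
// Sets stand in for AttributeSet. From the example: Inline(col1) produces
// generator output {a, b}, no child columns pass through Generate, and the
// Project only references a.
val generatorOutput     = Set("a", "b")
val requiredChildOutput = Set.empty[String]
val gOutputSet          = requiredChildOutput ++ generatorOutput
val pReferences         = Set("a")

// Buggy guard: fires whenever the project skips any generator column ...
val buggy = pReferences != gOutputSet                                 // true
// ... fixed guard: fires only when some required *child* output is unused.
val fixed = requiredChildOutput.exists(a => !pReferences.contains(a)) // false

assert(buggy && !fixed) // so the rule now falls through to GeneratorNestedColumnAliasing
{code}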
[jira] [Updated] (SPARK-38329) High I/O wait when Spark Structured Streaming checkpoint changed to EFS
[ https://issues.apache.org/jira/browse/SPARK-38329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-38329: - Affects Version/s: 3.2.1 (was: 2.4.6) > High I/O wait when Spark Structured Streaming checkpoint changed to EFS > --- > > Key: SPARK-38329 > URL: https://issues.apache.org/jira/browse/SPARK-38329 > Project: Spark > Issue Type: Question > Components: EC2, Input/Output, PySpark, Structured Streaming >Affects Versions: 3.2.1 >Reporter: Neven Jovic >Priority: Major > Attachments: Screenshot from 2022-02-25 14-16-11.png, q.png > > > I'm currently running a spark structured streaming application written in > python (pyspark) where my source is a kafka topic and my sink is mongodb. I changed > my checkpoint to Amazon EFS, which is mounted on all spark workers, and > after that I got increased I/O wait, averaging 8% > > !Screenshot from 2022-02-25 14-16-11.png! > Currently I have 6000 messages coming to kafka every second, and every > once in a while I get a WARN message: > {quote}22/02/25 13:12:31 WARN HDFSBackedStateStoreProvider: Error cleaning up > files for HDFSStateStoreProvider[id = (op=0,part=90),dir = > file:/mnt/efs_max_io/spark/state/0/90] java.lang.NumberFormatException: For > input string: "" > {quote} > I'm not quite sure whether that message has anything to do with the high I/O wait, and > whether this behavior is expected or something to be concerned about. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38557) What may be a cause for HDFSMetadataCommitter: Error while fetching MetaData and how to fix or work around this?
[ https://issues.apache.org/jira/browse/SPARK-38557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507263#comment-17507263 ] Dmitry Goldenberg commented on SPARK-38557: --- Likely a DUP of https://github.com/qubole/kinesis-sql/issues/57. > What may be a cause for HDFSMetadataCommitter: Error while fetching MetaData > and how to fix or work around this? > > > Key: SPARK-38557 > URL: https://issues.apache.org/jira/browse/SPARK-38557 > Project: Spark > Issue Type: Question > Components: Structured Streaming >Affects Versions: 3.1.1 > Environment: Spark 3.1.1 > AWS EMR 6.3.0 > python 3.7.2 >Reporter: Dmitry Goldenberg >Priority: Major > > I'm seeing errors such as the below when executing structured Spark Streaming > app which streams data from AWS Kinesis. > > I've googled the error but can't tell what may be the cause. Is Spark running > out of disk space? something else? > {code:java} > // From the stderr log in EMR > 22/03/15 00:54:00 WARN HDFSMetadataCommitter: Error while fetching MetaData > [attempt = 1] > java.lang.IllegalStateException: > hdfs://ip-10-2-XXX-XXX.awsinternal.acme.com:8020/mnt/tmp/temporary-03b8fecf-32d5-422c-9375-4c3450ed0bb8/sources/0/shard-commit/0 > does not exist > at > org.apache.spark.sql.kinesis.HDFSMetadataCommitter.$anonfun$get$1(HDFSMetadataCommitter.scala:163) > at > org.apache.spark.sql.kinesis.HDFSMetadataCommitter.withRetry(HDFSMetadataCommitter.scala:229) > at > org.apache.spark.sql.kinesis.HDFSMetadataCommitter.get(HDFSMetadataCommitter.scala:151) > at > org.apache.spark.sql.kinesis.KinesisSource.prevBatchShardInfo(KinesisSource.scala:275) > at > org.apache.spark.sql.kinesis.KinesisSource.getOffset(KinesisSource.scala:163) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$6(MicroBatchExecution.scala:399) > at > org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:357) > at > org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:355) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:68) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$2(MicroBatchExecution.scala:399) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.Map$Map1.foreach(Map.scala:128) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$1(MicroBatchExecution.scala:382) > at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:613) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.constructNextBatch(MicroBatchExecution.scala:378) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:211) > at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at > org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:357) > at > org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:355) > at > 
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:68) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:194) > at > org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:57) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:188) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:333) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:244){code} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38557) What may be a cause for HDFSMetadataCommitter: Error while fetching MetaData and how to fix or work around this?
Dmitry Goldenberg created SPARK-38557: - Summary: What may be a cause for HDFSMetadataCommitter: Error while fetching MetaData and how to fix or work around this? Key: SPARK-38557 URL: https://issues.apache.org/jira/browse/SPARK-38557 Project: Spark Issue Type: Question Components: Structured Streaming Affects Versions: 3.1.1 Environment: Spark 3.1.1 AWS EMR 6.3.0 python 3.7.2 Reporter: Dmitry Goldenberg I'm seeing errors such as the below when executing structured Spark Streaming app which streams data from AWS Kinesis. I've googled the error but can't tell what may be the cause. Is Spark running out of disk space? something else? {code:java} // From the stderr log in EMR 22/03/15 00:54:00 WARN HDFSMetadataCommitter: Error while fetching MetaData [attempt = 1] java.lang.IllegalStateException: hdfs://ip-10-2-XXX-XXX.awsinternal.acme.com:8020/mnt/tmp/temporary-03b8fecf-32d5-422c-9375-4c3450ed0bb8/sources/0/shard-commit/0 does not exist at org.apache.spark.sql.kinesis.HDFSMetadataCommitter.$anonfun$get$1(HDFSMetadataCommitter.scala:163) at org.apache.spark.sql.kinesis.HDFSMetadataCommitter.withRetry(HDFSMetadataCommitter.scala:229) at org.apache.spark.sql.kinesis.HDFSMetadataCommitter.get(HDFSMetadataCommitter.scala:151) at org.apache.spark.sql.kinesis.KinesisSource.prevBatchShardInfo(KinesisSource.scala:275) at org.apache.spark.sql.kinesis.KinesisSource.getOffset(KinesisSource.scala:163) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$6(MicroBatchExecution.scala:399) at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:357) at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:355) at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:68) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$2(MicroBatchExecution.scala:399) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at scala.collection.immutable.Map$Map1.foreach(Map.scala:128) at scala.collection.TraversableLike.map(TraversableLike.scala:238) at scala.collection.TraversableLike.map$(TraversableLike.scala:231) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$1(MicroBatchExecution.scala:382) at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:613) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.constructNextBatch(MicroBatchExecution.scala:378) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:211) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:357) at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:355) at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:68) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:194) at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:57) at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:188) at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:333) at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:244){code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38388) Repartition + Stage retries could lead to incorrect data
[ https://issues.apache.org/jira/browse/SPARK-38388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507257#comment-17507257 ] Mridul Muralidharan commented on SPARK-38388: - Agree with [~jiangxb1987]: either the computation should be repeatable (specify ordering, for example) or it should be marked as indeterminate (if the input source changes the order of tuples, the computation is not repeatable, etc.). > Repartition + Stage retries could lead to incorrect data > - > > Key: SPARK-38388 > URL: https://issues.apache.org/jira/browse/SPARK-38388 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.1.1 > Environment: Spark 2.4 and 3.1 >Reporter: Jason Xu >Priority: Major > Labels: correctness, data-loss > > Spark repartition uses RoundRobinPartitioning; the generated result is > non-deterministic when data has some randomness and stage/task retries happen. > The bug can be triggered when upstream data has some randomness, a > repartition is called on it, followed by a result stage (there could be more > stages). > As the pattern shows below: > upstream stage (data with randomness) -> (repartition shuffle) -> result stage > When one executor goes down at the result stage, some tasks of that stage might > have finished, others would fail, shuffle files on that executor also get > lost, and some tasks from the previous stage (upstream data generation, repartition) > will need to rerun to regenerate the dependent shuffle data files. > Because the data has some randomness, the regenerated data in upstream retried tasks > is slightly different, repartition then produces an inconsistent ordering, and > retried tasks at the result stage generate different data. > This is similar to but different from > https://issues.apache.org/jira/browse/SPARK-23207; the fix for that uses an extra > local sort to make the row ordering deterministic, and the sorting algorithm it > uses simply compares row/record hashes. But in this case, the upstream data has > some randomness, so the sorting algorithm doesn't help keep the order, thus > RoundRobinPartitioning introduced a non-deterministic result. > The following code returns 986415, instead of 1000000: > {code:java} > import scala.sys.process._ > import org.apache.spark.TaskContext > case class TestObject(id: Long, value: Double) > val ds = spark.range(0, 1000 * 1000, 1).repartition(100, > $"id").withColumn("val", rand()).repartition(100).map { > row => if (TaskContext.get.stageAttemptNumber == 0 && > TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId > 97) { > throw new Exception("pkill -f java".!!) > } > TestObject(row.getLong(0), row.getDouble(1)) > } > ds.toDF("id", "value").write.mode("overwrite").saveAsTable("tmp.test_table") > spark.sql("select count(distinct id) from tmp.test_table").show{code} > Command: > {code:java} > spark-shell --num-executors 10 (--conf spark.dynamicAllocation.enabled=false > --conf spark.shuffle.service.enabled=false){code} > To simulate the issue, disabling the external shuffle service is needed (if it's > also enabled by default in your environment); this is to trigger shuffle > file loss and previous stage retries. > In our production, we have the external shuffle service enabled, and this data > correctness issue happened when there were node losses. > Although there's some non-deterministic factor in the upstream data, users > wouldn't expect to see incorrect results. 
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
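[Editor's note] One hedged illustration of the "make the computation repeatable" option from the comment above, applied to the repro: seed the rand() so regenerated upstream tasks produce identical rows on retry. A sketch of the idea, not a general fix, and it assumes the input partitioning itself is stable across retries:
{code}
// Sketch of the "repeatable computation" option on the repro above: a seeded
// rand() is deterministic per partition, so a retried upstream task
// regenerates the same rows in the same order before the round-robin
// repartition. Assumes the hash-partitioned input is stable across retries.
import org.apache.spark.sql.functions.rand

val ds = spark.range(0, 1000 * 1000, 1)
  .repartition(100, $"id")      // stable hash partitioning, as in the repro
  .withColumn("val", rand(42L)) // seeded instead of unseeded rand()
  .repartition(100)             // round-robin now sees repeatable input
{code}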
[jira] [Commented] (SPARK-38329) High I/O wait when Spark Structured Streaming checkpoint changed to EFS
[ https://issues.apache.org/jira/browse/SPARK-38329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507241#comment-17507241 ] Neven Jovic commented on SPARK-38329: - [~hyukjin.kwon] I updated Spark to 3.2.1, and the I/O wait is still there. I used a structured streaming monitoring tool and found out that my aggregated states in memory were continuously growing. I added a watermark and that probably solved the issue with the State Store Provider (I haven't seen that WARN message yet). As for the high I/O wait, I assume that it comes from writing to EFS. Here is a screenshot of CPU utilization with the updated Spark and the same load: !100k_zbx_21.png! > High I/O wait when Spark Structured Streaming checkpoint changed to EFS > --- > > Key: SPARK-38329 > URL: https://issues.apache.org/jira/browse/SPARK-38329 > Project: Spark > Issue Type: Question > Components: EC2, Input/Output, PySpark, Structured Streaming >Affects Versions: 2.4.6 >Reporter: Neven Jovic >Priority: Major > Attachments: 100k_zbx_21.png, Screenshot from 2022-02-25 14-16-11.png > > > I'm currently running a spark structured streaming application written in > python (pyspark) where my source is a kafka topic and my sink is mongodb. I changed > my checkpoint to Amazon EFS, which is mounted on all spark workers, and > after that I got increased I/O wait, averaging 8% > > !Screenshot from 2022-02-25 14-16-11.png! > Currently I have 6000 messages coming to kafka every second, and every > once in a while I get a WARN message: > {quote}22/02/25 13:12:31 WARN HDFSBackedStateStoreProvider: Error cleaning up > files for HDFSStateStoreProvider[id = (op=0,part=90),dir = > file:/mnt/efs_max_io/spark/state/0/90] java.lang.NumberFormatException: For > input string: "" > {quote} > I'm not quite sure whether that message has anything to do with the high I/O wait, and > whether this behavior is expected or something to be concerned about. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
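[Editor's note] For anyone hitting the same growing-state symptom, the watermark change mentioned above looks roughly like this in the Scala API; the stream, column name, and thresholds are placeholders, not the reporter's actual pipeline (which is PySpark):
{code}
// Placeholder sketch (Scala API; the reporter's job is PySpark): bound the
// streaming aggregation state with an event-time watermark so state for old
// windows can be dropped instead of growing without limit in the state store.
import org.apache.spark.sql.functions.{col, window}

val counts = inputStream                    // hypothetical streaming DataFrame
  .withWatermark("eventTime", "10 minutes") // tolerate 10 minutes of lateness
  .groupBy(window(col("eventTime"), "5 minutes"), col("key"))
  .count()
{code}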
[jira] [Updated] (SPARK-38329) High I/O wait when Spark Structured Streaming checkpoint changed to EFS
[ https://issues.apache.org/jira/browse/SPARK-38329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neven Jovic updated SPARK-38329: Attachment: Screenshot from 2022-02-25 14-16-11.png > High I/O wait when Spark Structured Streaming checkpoint changed to EFS > --- > > Key: SPARK-38329 > URL: https://issues.apache.org/jira/browse/SPARK-38329 > Project: Spark > Issue Type: Question > Components: EC2, Input/Output, PySpark, Structured Streaming >Affects Versions: 2.4.6 >Reporter: Neven Jovic >Priority: Major > Attachments: Screenshot from 2022-02-25 14-16-11.png, q.png > > > I'm currently running a Spark Structured Streaming application written in > Python (PySpark) where my source is a Kafka topic and my sink is MongoDB. I changed > my checkpoint location to Amazon EFS, which is mounted on all Spark workers, and > after that I got increased I/O wait, averaging 8%. > > !Screenshot from 2022-02-25 14-16-11.png! > Currently I have 6000 messages coming to Kafka every second, and every once in a > while I get a WARN message: > {quote}22/02/25 13:12:31 WARN HDFSBackedStateStoreProvider: Error cleaning up > files for HDFSStateStoreProvider[id = (op=0,part=90),dir = > file:/mnt/efs_max_io/spark/state/0/90] java.lang.NumberFormatException: For > input string: "" > {quote} > I'm not quite sure whether that message has anything to do with the high I/O wait, > and whether this behavior is expected or something to be concerned about. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-38329) High I/O wait when Spark Structured Streaming checkpoint changed to EFS
[ https://issues.apache.org/jira/browse/SPARK-38329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507241#comment-17507241 ] Neven Jovic edited comment on SPARK-38329 at 3/15/22, 9:41 PM: --- !q.png![~hyukjin.kwon] I updated Spark to 3.2.1, and the I/O wait is still there. I used a Structured Streaming monitoring tool and found that my aggregated state in memory was continuously growing. I added a watermark, and that probably solved the issue with the State Store Provider (I haven't seen that WARN message since). As for the high I/O wait, I assume it comes from writing to EFS. Here is a screenshot of CPU utilization with the updated Spark and the same load was (Author: JIRAUSER285811): [~hyukjin.kwon] I updated Spark to 3.2.1, and the I/O wait is still there. I used a Structured Streaming monitoring tool and found that my aggregated state in memory was continuously growing. I added a watermark, and that probably solved the issue with the State Store Provider (I haven't seen that WARN message since). As for the high I/O wait, I assume it comes from writing to EFS. Here is a screenshot of CPU utilization with the updated Spark and the same load > High I/O wait when Spark Structured Streaming checkpoint changed to EFS > --- > > Key: SPARK-38329 > URL: https://issues.apache.org/jira/browse/SPARK-38329 > Project: Spark > Issue Type: Question > Components: EC2, Input/Output, PySpark, Structured Streaming >Affects Versions: 2.4.6 >Reporter: Neven Jovic >Priority: Major > Attachments: q.png > > > I'm currently running a Spark Structured Streaming application written in > Python (PySpark) where my source is a Kafka topic and my sink is MongoDB. I changed > my checkpoint location to Amazon EFS, which is mounted on all Spark workers, and > after that I got increased I/O wait, averaging 8%. > > !Screenshot from 2022-02-25 14-16-11.png! > Currently I have 6000 messages coming to Kafka every second, and every once in a > while I get a WARN message: > {quote}22/02/25 13:12:31 WARN HDFSBackedStateStoreProvider: Error cleaning up > files for HDFSStateStoreProvider[id = (op=0,part=90),dir = > file:/mnt/efs_max_io/spark/state/0/90] java.lang.NumberFormatException: For > input string: "" > {quote} > I'm not quite sure whether that message has anything to do with the high I/O wait, > and whether this behavior is expected or something to be concerned about. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38329) High I/O wait when Spark Structured Streaming checkpoint changed to EFS
[ https://issues.apache.org/jira/browse/SPARK-38329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neven Jovic updated SPARK-38329: Attachment: (was: 100k_zbx_21.png) > High I/O wait when Spark Structured Streaming checkpoint changed to EFS > --- > > Key: SPARK-38329 > URL: https://issues.apache.org/jira/browse/SPARK-38329 > Project: Spark > Issue Type: Question > Components: EC2, Input/Output, PySpark, Structured Streaming >Affects Versions: 2.4.6 >Reporter: Neven Jovic >Priority: Major > Attachments: q.png > > > I'm currently running a Spark Structured Streaming application written in > Python (PySpark) where my source is a Kafka topic and my sink is MongoDB. I changed > my checkpoint location to Amazon EFS, which is mounted on all Spark workers, and > after that I got increased I/O wait, averaging 8%. > > !Screenshot from 2022-02-25 14-16-11.png! > Currently I have 6000 messages coming to Kafka every second, and every once in a > while I get a WARN message: > {quote}22/02/25 13:12:31 WARN HDFSBackedStateStoreProvider: Error cleaning up > files for HDFSStateStoreProvider[id = (op=0,part=90),dir = > file:/mnt/efs_max_io/spark/state/0/90] java.lang.NumberFormatException: For > input string: "" > {quote} > I'm not quite sure whether that message has anything to do with the high I/O wait, > and whether this behavior is expected or something to be concerned about. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38329) High I/O wait when Spark Structured Streaming checkpoint changed to EFS
[ https://issues.apache.org/jira/browse/SPARK-38329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neven Jovic updated SPARK-38329: Attachment: q.png > High I/O wait when Spark Structured Streaming checkpoint changed to EFS > --- > > Key: SPARK-38329 > URL: https://issues.apache.org/jira/browse/SPARK-38329 > Project: Spark > Issue Type: Question > Components: EC2, Input/Output, PySpark, Structured Streaming >Affects Versions: 2.4.6 >Reporter: Neven Jovic >Priority: Major > Attachments: q.png > > > I'm currently running a Spark Structured Streaming application written in > Python (PySpark) where my source is a Kafka topic and my sink is MongoDB. I changed > my checkpoint location to Amazon EFS, which is mounted on all Spark workers, and > after that I got increased I/O wait, averaging 8%. > > !Screenshot from 2022-02-25 14-16-11.png! > Currently I have 6000 messages coming to Kafka every second, and every once in a > while I get a WARN message: > {quote}22/02/25 13:12:31 WARN HDFSBackedStateStoreProvider: Error cleaning up > files for HDFSStateStoreProvider[id = (op=0,part=90),dir = > file:/mnt/efs_max_io/spark/state/0/90] java.lang.NumberFormatException: For > input string: "" > {quote} > I'm not quite sure whether that message has anything to do with the high I/O wait, > and whether this behavior is expected or something to be concerned about. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-38329) High I/O wait when Spark Structured Streaming checkpoint changed to EFS
[ https://issues.apache.org/jira/browse/SPARK-38329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507241#comment-17507241 ] Neven Jovic edited comment on SPARK-38329 at 3/15/22, 9:40 PM: --- [~hyukjin.kwon] I updated Spark to 3.2.1, and the I/O wait is still there. I used a Structured Streaming monitoring tool and found that my aggregated state in memory was continuously growing. I added a watermark, and that probably solved the issue with the State Store Provider (I haven't seen that WARN message since). As for the high I/O wait, I assume it comes from writing to EFS. Here is a screenshot of CPU utilization with the updated Spark and the same load was (Author: JIRAUSER285811): [~hyukjin.kwon] I updated Spark to 3.2.1, and the I/O wait is still there. I used a Structured Streaming monitoring tool and found that my aggregated state in memory was continuously growing. I added a watermark, and that probably solved the issue with the State Store Provider (I haven't seen that WARN message since). As for the high I/O wait, I assume it comes from writing to EFS. Here is a screenshot of CPU utilization with the updated Spark and the same load !100k_zbx_21.png! > High I/O wait when Spark Structured Streaming checkpoint changed to EFS > --- > > Key: SPARK-38329 > URL: https://issues.apache.org/jira/browse/SPARK-38329 > Project: Spark > Issue Type: Question > Components: EC2, Input/Output, PySpark, Structured Streaming >Affects Versions: 2.4.6 >Reporter: Neven Jovic >Priority: Major > Attachments: 100k_zbx_21.png > > > I'm currently running a Spark Structured Streaming application written in > Python (PySpark) where my source is a Kafka topic and my sink is MongoDB. I changed > my checkpoint location to Amazon EFS, which is mounted on all Spark workers, and > after that I got increased I/O wait, averaging 8%. > > !Screenshot from 2022-02-25 14-16-11.png! > Currently I have 6000 messages coming to Kafka every second, and every once in a > while I get a WARN message: > {quote}22/02/25 13:12:31 WARN HDFSBackedStateStoreProvider: Error cleaning up > files for HDFSStateStoreProvider[id = (op=0,part=90),dir = > file:/mnt/efs_max_io/spark/state/0/90] java.lang.NumberFormatException: For > input string: "" > {quote} > I'm not quite sure whether that message has anything to do with the high I/O wait, > and whether this behavior is expected or something to be concerned about. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-38329) High I/O wait when Spark Structured Streaming checkpoint changed to EFS
[ https://issues.apache.org/jira/browse/SPARK-38329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507241#comment-17507241 ] Neven Jovic edited comment on SPARK-38329 at 3/15/22, 9:40 PM: --- [~hyukjin.kwon] I updated Spark to 3.2.1, and the I/O wait is still there. I used a Structured Streaming monitoring tool and found that my aggregated state in memory was continuously growing. I added a watermark, and that probably solved the issue with the State Store Provider (I haven't seen that WARN message since). As for the high I/O wait, I assume it comes from writing to EFS. Here is a screenshot of CPU utilization with the updated Spark and the same load was (Author: JIRAUSER285811): [~hyukjin.kwon] I updated Spark to 3.2.1, and the I/O wait is still there. I used a Structured Streaming monitoring tool and found that my aggregated state in memory was continuously growing. I added a watermark, and that probably solved the issue with the State Store Provider (I haven't seen that WARN message since). As for the high I/O wait, I assume it comes from writing to EFS. Here is a screenshot of CPU utilization with the updated Spark and the same load > High I/O wait when Spark Structured Streaming checkpoint changed to EFS > --- > > Key: SPARK-38329 > URL: https://issues.apache.org/jira/browse/SPARK-38329 > Project: Spark > Issue Type: Question > Components: EC2, Input/Output, PySpark, Structured Streaming >Affects Versions: 2.4.6 >Reporter: Neven Jovic >Priority: Major > Attachments: 100k_zbx_21.png > > > I'm currently running a Spark Structured Streaming application written in > Python (PySpark) where my source is a Kafka topic and my sink is MongoDB. I changed > my checkpoint location to Amazon EFS, which is mounted on all Spark workers, and > after that I got increased I/O wait, averaging 8%. > > !Screenshot from 2022-02-25 14-16-11.png! > Currently I have 6000 messages coming to Kafka every second, and every once in a > while I get a WARN message: > {quote}22/02/25 13:12:31 WARN HDFSBackedStateStoreProvider: Error cleaning up > files for HDFSStateStoreProvider[id = (op=0,part=90),dir = > file:/mnt/efs_max_io/spark/state/0/90] java.lang.NumberFormatException: For > input string: "" > {quote} > I'm not quite sure whether that message has anything to do with the high I/O wait, > and whether this behavior is expected or something to be concerned about. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38329) High I/O wait when Spark Structured Streaming checkpoint changed to EFS
[ https://issues.apache.org/jira/browse/SPARK-38329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neven Jovic updated SPARK-38329: Attachment: (was: Screenshot from 2022-02-25 14-16-11.png) > High I/O wait when Spark Structured Streaming checkpoint changed to EFS > --- > > Key: SPARK-38329 > URL: https://issues.apache.org/jira/browse/SPARK-38329 > Project: Spark > Issue Type: Question > Components: EC2, Input/Output, PySpark, Structured Streaming >Affects Versions: 2.4.6 >Reporter: Neven Jovic >Priority: Major > Attachments: 100k_zbx_21.png > > > I'm currently running a Spark Structured Streaming application written in > Python (PySpark) where my source is a Kafka topic and my sink is MongoDB. I changed > my checkpoint location to Amazon EFS, which is mounted on all Spark workers, and > after that I got increased I/O wait, averaging 8%. > > !Screenshot from 2022-02-25 14-16-11.png! > Currently I have 6000 messages coming to Kafka every second, and every once in a > while I get a WARN message: > {quote}22/02/25 13:12:31 WARN HDFSBackedStateStoreProvider: Error cleaning up > files for HDFSStateStoreProvider[id = (op=0,part=90),dir = > file:/mnt/efs_max_io/spark/state/0/90] java.lang.NumberFormatException: For > input string: "" > {quote} > I'm not quite sure whether that message has anything to do with the high I/O wait, > and whether this behavior is expected or something to be concerned about. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38329) High I/O wait when Spark Structured Streaming checkpoint changed to EFS
[ https://issues.apache.org/jira/browse/SPARK-38329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neven Jovic updated SPARK-38329: Attachment: 100k_zbx_21.png > High I/O wait when Spark Structured Streaming checkpoint changed to EFS > --- > > Key: SPARK-38329 > URL: https://issues.apache.org/jira/browse/SPARK-38329 > Project: Spark > Issue Type: Question > Components: EC2, Input/Output, PySpark, Structured Streaming >Affects Versions: 2.4.6 >Reporter: Neven Jovic >Priority: Major > Attachments: 100k_zbx_21.png > > > I'm currently running a Spark Structured Streaming application written in > Python (PySpark) where my source is a Kafka topic and my sink is MongoDB. I changed > my checkpoint location to Amazon EFS, which is mounted on all Spark workers, and > after that I got increased I/O wait, averaging 8%. > > !Screenshot from 2022-02-25 14-16-11.png! > Currently I have 6000 messages coming to Kafka every second, and every once in a > while I get a WARN message: > {quote}22/02/25 13:12:31 WARN HDFSBackedStateStoreProvider: Error cleaning up > files for HDFSStateStoreProvider[id = (op=0,part=90),dir = > file:/mnt/efs_max_io/spark/state/0/90] java.lang.NumberFormatException: For > input string: "" > {quote} > I'm not quite sure whether that message has anything to do with the high I/O wait, > and whether this behavior is expected or something to be concerned about. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38204) All state operators are at a risk of inconsistency between state partitioning and operator partitioning
[ https://issues.apache.org/jira/browse/SPARK-38204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanjian Li resolved SPARK-38204. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35673 [https://github.com/apache/spark/pull/35673] > All state operators are at a risk of inconsistency between state partitioning > and operator partitioning > --- > > Key: SPARK-38204 > URL: https://issues.apache.org/jira/browse/SPARK-38204 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.3, 2.3.4, 2.4.8, 3.0.3, 3.1.2, 3.2.1, 3.3.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Blocker > Labels: correctness > Fix For: 3.3.0 > > > Except for stream-stream join, all stateful operators use ClusteredDistribution > as the requirement for child distribution. > ClusteredDistribution is a very relaxed one - any output partitioning can > satisfy the distribution if the partitioning ensures that all tuples having the > same grouping keys are placed in the same partition. > To illustrate with an example, suppose we do a streaming aggregation like the code below: > {code:java} > df > .withWatermark("timestamp", "30 minutes") > .groupBy("group1", "group2", window("timestamp", "10 minutes")) > .agg(count("*")) {code} > In this code, a streaming aggregation operator will be involved in the physical > plan, which would have ClusteredDistribution("group1", "group2", "window"). > The problem is, various output partitionings can satisfy this distribution: > * RangePartitioning > ** This accepts the exact grouping keys or any subset of them, with any order of keys > (combination), with any sort order (asc/desc) > * HashPartitioning > ** This accepts the exact grouping keys or any subset of them, with any order of keys > (combination) > * (upcoming Spark 3.3.0+) DataSourcePartitioning > ** An output partitioning provided by a data source will be able to satisfy > ClusteredDistribution, which will make things worse (assuming a data source can > provide a different output partitioning relatively easily) > E.g., even if we only consider HashPartitioning: HashPartitioning("group1"), > HashPartitioning("group2"), HashPartitioning("group1", "group2"), > HashPartitioning("group2", "group1"), HashPartitioning("group1", "group2", > "window"), etc. > The requirement on state partitioning is much stricter, since we must not > change the partitioning once the state is partitioned and built. *It should > ensure that all tuples having the same grouping keys are placed in the same partition > (same partition ID) across the query lifetime.* > *The mismatch in distribution requirements between ClusteredDistribution and > state partitioning leads to correctness issues silently.* > For example, let's assume we have a streaming query like below: > {code:java} > df > .withWatermark("timestamp", "30 minutes") > .repartition("group2") > .groupBy("group1", "group2", window("timestamp", "10 minutes")) > .agg(count("*")) {code} > repartition("group2") satisfies ClusteredDistribution("group1", "group2", > "window"), so Spark won't introduce an additional shuffle there, and the state > partitioning would be HashPartitioning("group2"). > We run this query for a while, then stop the query and change the manual > partitioning like below: > {code:java} > df > .withWatermark("timestamp", "30 minutes") > .repartition("group1") > .groupBy("group1", "group2", window("timestamp", "10 minutes")) > .agg(count("*")) {code} > repartition("group1") also satisfies ClusteredDistribution("group1", > "group2", "window"), so Spark won't introduce an additional shuffle there. That > said, the child output partitioning of the streaming aggregation operator would be > HashPartitioning("group1"), whereas the state partitioning is > HashPartitioning("group2"). > [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovery-semantics-after-changes-in-a-streaming-query] > In the SS guide doc we enumerate the unsupported modifications of a query > during the lifetime of a streaming query, but there is no mention of this. > Making this worse, Spark doesn't store any information on state partitioning > (that said, there is no way to validate it), so *Spark simply allows this change > and brings up a correctness issue while the streaming query runs as if there were no > problem at all.* The only indication of the correctness issue is the result of the > query. > We have no idea whether end users already suffer from this in their queries > or not. *The only way to look into it is to list all state rows, apply the hash > function with the expected grouping keys, and confirm that all rows are in the > exact partition ID where they should be.* If it turns out to be broken, we will > have to provide a tool to "re"partition the state correctly, or in the worst case, > ask users to throw out the checkpoint and reprocess.
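The hazard described above can be seen end-to-end in a small PySpark sketch. The rate source and the derived grouping columns below are assumptions for illustration, and the sketch shows the user-side pattern to avoid rather than the fix implemented in the linked pull request:
{code:python}
# Minimal sketch of the risky vs. safer pattern from the ticket.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("state-partitioning-demo").getOrCreate()

# Assumed input: the rate source provides `timestamp` and `value`;
# group1/group2 are derived here purely for illustration.
events = (spark.readStream.format("rate")
          .option("rowsPerSecond", "10").load()
          .withColumn("group1", F.col("value") % 3)
          .withColumn("group2", F.col("value") % 5))

# Risky: repartition("group2") satisfies ClusteredDistribution, so it
# silently becomes the state partitioning; switching it to "group1"
# after a restart is exactly the correctness hazard described above.
risky = (events
         .withWatermark("timestamp", "30 minutes")
         .repartition("group2")  # must never vary across restarts
         .groupBy("group1", "group2", F.window("timestamp", "10 minutes"))
         .agg(F.count("*")))

# Safer: omit the manual repartition; Spark inserts its own shuffle on
# the full grouping keys, which stays stable across query restarts.
safer = (events
         .withWatermark("timestamp", "30 minutes")
         .groupBy("group1", "group2", F.window("timestamp", "10 minutes"))
         .agg(F.count("*")))
{code}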
[jira] [Assigned] (SPARK-38204) All state operators are at a risk of inconsistency between state partitioning and operator partitioning
[ https://issues.apache.org/jira/browse/SPARK-38204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanjian Li reassigned SPARK-38204: --- Assignee: Jungtaek Lim > All state operators are at a risk of inconsistency between state partitioning > and operator partitioning > --- > > Key: SPARK-38204 > URL: https://issues.apache.org/jira/browse/SPARK-38204 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.3, 2.3.4, 2.4.8, 3.0.3, 3.1.2, 3.2.1, 3.3.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Blocker > Labels: correctness > > > Except for stream-stream join, all stateful operators use ClusteredDistribution > as the requirement for child distribution. > ClusteredDistribution is a very relaxed one - any output partitioning can > satisfy the distribution if the partitioning ensures that all tuples having the > same grouping keys are placed in the same partition. > To illustrate with an example, suppose we do a streaming aggregation like the code below: > {code:java} > df > .withWatermark("timestamp", "30 minutes") > .groupBy("group1", "group2", window("timestamp", "10 minutes")) > .agg(count("*")) {code} > In this code, a streaming aggregation operator will be involved in the physical > plan, which would have ClusteredDistribution("group1", "group2", "window"). > The problem is, various output partitionings can satisfy this distribution: > * RangePartitioning > ** This accepts the exact grouping keys or any subset of them, with any order of keys > (combination), with any sort order (asc/desc) > * HashPartitioning > ** This accepts the exact grouping keys or any subset of them, with any order of keys > (combination) > * (upcoming Spark 3.3.0+) DataSourcePartitioning > ** An output partitioning provided by a data source will be able to satisfy > ClusteredDistribution, which will make things worse (assuming a data source can > provide a different output partitioning relatively easily) > E.g., even if we only consider HashPartitioning: HashPartitioning("group1"), > HashPartitioning("group2"), HashPartitioning("group1", "group2"), > HashPartitioning("group2", "group1"), HashPartitioning("group1", "group2", > "window"), etc. > The requirement on state partitioning is much stricter, since we must not > change the partitioning once the state is partitioned and built. *It should > ensure that all tuples having the same grouping keys are placed in the same partition > (same partition ID) across the query lifetime.* > *The mismatch in distribution requirements between ClusteredDistribution and > state partitioning leads to correctness issues silently.* > For example, let's assume we have a streaming query like below: > {code:java} > df > .withWatermark("timestamp", "30 minutes") > .repartition("group2") > .groupBy("group1", "group2", window("timestamp", "10 minutes")) > .agg(count("*")) {code} > repartition("group2") satisfies ClusteredDistribution("group1", "group2", > "window"), so Spark won't introduce an additional shuffle there, and the state > partitioning would be HashPartitioning("group2"). > We run this query for a while, then stop the query and change the manual > partitioning like below: > {code:java} > df > .withWatermark("timestamp", "30 minutes") > .repartition("group1") > .groupBy("group1", "group2", window("timestamp", "10 minutes")) > .agg(count("*")) {code} > repartition("group1") also satisfies ClusteredDistribution("group1", > "group2", "window"), so Spark won't introduce an additional shuffle there. That > said, the child output partitioning of the streaming aggregation operator would be > HashPartitioning("group1"), whereas the state partitioning is > HashPartitioning("group2"). > [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovery-semantics-after-changes-in-a-streaming-query] > In the SS guide doc we enumerate the unsupported modifications of a query > during the lifetime of a streaming query, but there is no mention of this. > Making this worse, Spark doesn't store any information on state partitioning > (that said, there is no way to validate it), so *Spark simply allows this change > and brings up a correctness issue while the streaming query runs as if there were no > problem at all.* The only indication of the correctness issue is the result of the > query. > We have no idea whether end users already suffer from this in their queries > or not. *The only way to look into it is to list all state rows, apply the hash > function with the expected grouping keys, and confirm that all rows are in the > exact partition ID where they should be.* If it turns out to be broken, we will > have to provide a tool to "re"partition the state correctly, or in the worst case, > ask users to throw out the checkpoint and reprocess. > {*}
[jira] [Comment Edited] (SPARK-38483) Column name or alias as an attribute of the PySpark Column class
[ https://issues.apache.org/jira/browse/SPARK-38483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503782#comment-17503782 ] Brian Schaefer edited comment on SPARK-38483 at 3/15/22, 8:11 PM: -- Extracting the column name from the {{Column.__repr__}} method has been discussed on StackExchange: [https://stackoverflow.com/a/43150264|https://stackoverflow.com/a/43150264]. However, it would be useful to have the column name more easily accessible. was (Author: JIRAUSER286367): Extracting the column name from the {{Column.__repr__}} method has been discussed on StackExchange: [https://stackoverflow.com/a/43150264|https://stackoverflow.com/a/43150264.]. However, it would be useful to have the column name more easily accessible. > Column name or alias as an attribute of the PySpark Column class > > > Key: SPARK-38483 > URL: https://issues.apache.org/jira/browse/SPARK-38483 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.2.1 >Reporter: Brian Schaefer >Priority: Minor > Labels: starter > > Having the name of a column as an attribute of PySpark {{Column}} class > instances can enable some convenient patterns, for example: > Applying a function to a column and aliasing with the original name: > {code:java} > values = F.col("values") > # repeating the column name as an alias > distinct_values = F.array_distinct(values).alias("values") > # re-using the existing column name > distinct_values = F.array_distinct(values).alias(values._name){code} > Checking the column name inside a custom function and applying conditional > logic on the name: > {code:java} > def custom_function(col: Column) -> Column: > if col._name == "my_column": > return col.astype("int") > return col.astype("string"){code} > The proposal in this issue is to add a property {{Column.\_name}} that > obtains the name or alias of a column in a similar way as currently done in > the {{Column.\_\_repr\_\_}} method: > [https://github.com/apache/spark/blob/master/python/pyspark/sql/column.py#L1062]. > The choice of {{_name}} intentionally avoids collision with the existing > {{Column.name}} method, which is an alias for {{{}Column.alias{}}}. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-38483) Column name or alias as an attribute of the PySpark Column class
[ https://issues.apache.org/jira/browse/SPARK-38483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503782#comment-17503782 ] Brian Schaefer edited comment on SPARK-38483 at 3/15/22, 8:11 PM: -- Extracting the column name from the {{Column.__repr__}} method has been discussed on StackExchange: [https://stackoverflow.com/a/43150264|https://stackoverflow.com/a/43150264.]. However, it would be useful to have the column name more easily accessible. was (Author: JIRAUSER286367): Extracting the column name from the {{Column.\_\_repr\_\_}} method has been discussed on StackExchange: [https://stackoverflow.com/a/43150264.] However, it would be useful to have the column name more easily accessible. > Column name or alias as an attribute of the PySpark Column class > > > Key: SPARK-38483 > URL: https://issues.apache.org/jira/browse/SPARK-38483 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.2.1 >Reporter: Brian Schaefer >Priority: Minor > Labels: starter > > Having the name of a column as an attribute of PySpark {{Column}} class > instances can enable some convenient patterns, for example: > Applying a function to a column and aliasing with the original name: > {code:java} > values = F.col("values") > # repeating the column name as an alias > distinct_values = F.array_distinct(values).alias("values") > # re-using the existing column name > distinct_values = F.array_distinct(values).alias(values._name){code} > Checking the column name inside a custom function and applying conditional > logic on the name: > {code:java} > def custom_function(col: Column) -> Column: > if col._name == "my_column": > return col.astype("int") > return col.astype("string"){code} > The proposal in this issue is to add a property {{Column.\_name}} that > obtains the name or alias of a column in a similar way as currently done in > the {{Column.\_\_repr\_\_}} method: > [https://github.com/apache/spark/blob/master/python/pyspark/sql/column.py#L1062]. > The choice of {{_name}} intentionally avoids collision with the existing > {{Column.name}} method, which is an alias for {{{}Column.alias{}}}. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
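Until such a property exists, the {{Column.\_\_repr\_\_}} workaround from the linked answer looks roughly like the sketch below. The repr format is an implementation detail of PySpark and may change across versions, so treat the parsing as an assumption:
{code:python}
# Fragile workaround sketch: recover a column's name by parsing repr().
from pyspark.sql import Column
from pyspark.sql import functions as F

def column_name(col: Column) -> str:
    # Assumption: repr looks like "Column<'values'>" in recent PySpark
    # versions (older releases render "Column<b'values'>"); both split
    # the name out at the single quotes.
    return repr(col).split("'")[1]

values = F.col("values")
# Re-use the original name as the alias without repeating the literal.
distinct_values = F.array_distinct(values).alias(column_name(values))
{code}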
[jira] [Updated] (SPARK-38334) Implement support for DEFAULT values for columns in tables
[ https://issues.apache.org/jira/browse/SPARK-38334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38334: -- Affects Version/s: 3.4.0 (was: 3.3.0) > Implement support for DEFAULT values for columns in tables > --- > > Key: SPARK-38334 > URL: https://issues.apache.org/jira/browse/SPARK-38334 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Priority: Major > > This story tracks the implementation of DEFAULT values for columns in tables. > CREATE TABLE and ALTER TABLE invocations will support setting column default > values for future operations. Subsequent INSERT, UPDATE, and MERGE statements may > then reference the value using the DEFAULT keyword as needed. > Examples: > {code:sql} > CREATE TABLE T(a INT, b INT NOT NULL); > -- The default default is NULL > INSERT INTO T VALUES (DEFAULT, 0); > INSERT INTO T(b) VALUES (1); > SELECT * FROM T; > (NULL, 0) > (NULL, 1) > -- Adding a default to a table with rows sets the values for the > -- existing rows (exist default) and new rows (current default). > ALTER TABLE T ADD COLUMN c INT DEFAULT 5; > INSERT INTO T VALUES (1, 2, DEFAULT); > SELECT * FROM T; > (NULL, 0, 5) > (NULL, 1, 5) > (1, 2, 5) {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38335) Parser changes for DEFAULT column support
[ https://issues.apache.org/jira/browse/SPARK-38335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38335: -- Fix Version/s: 3.4.0 (was: 3.3.0) > Parser changes for DEFAULT column support > - > > Key: SPARK-38335 > URL: https://issues.apache.org/jira/browse/SPARK-38335 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.1 >Reporter: Daniel >Assignee: Daniel >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38453) Add volcano section to K8s IT README.md
[ https://issues.apache.org/jira/browse/SPARK-38453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38453: -- Component/s: Documentation > Add volcano section to K8s IT README.md > --- > > Key: SPARK-38453 > URL: https://issues.apache.org/jira/browse/SPARK-38453 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Kubernetes >Affects Versions: 3.3.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38553) Bump minimum Volcano version to v1.5.1
[ https://issues.apache.org/jira/browse/SPARK-38553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507147#comment-17507147 ] Dongjoon Hyun commented on SPARK-38553: --- Since this is a documentation JIRA, I added the component, `Documentation`. > Bump minimum Volcano version to v1.5.1 > -- > > Key: SPARK-38553 > URL: https://issues.apache.org/jira/browse/SPARK-38553 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Kubernetes >Affects Versions: 3.3.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38553) Bump minimum Volcano version to v1.5.1
[ https://issues.apache.org/jira/browse/SPARK-38553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38553: -- Component/s: Documentation > Bump minimum Volcano version to v1.5.1 > -- > > Key: SPARK-38553 > URL: https://issues.apache.org/jira/browse/SPARK-38553 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Kubernetes >Affects Versions: 3.3.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org