[jira] [Updated] (SPARK-37967) ConstantFolding/ Literal.create support ObjectType
[ https://issues.apache.org/jira/browse/SPARK-37967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-37967: -- Summary: ConstantFolding/ Literal.create support ObjectType (was: Literal support ObjectType) > ConstantFolding/ Literal.create support ObjectType > -- > > Key: SPARK-37967 > URL: https://issues.apache.org/jira/browse/SPARK-37967 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37967) Literal support ObjectType
angerszhu created SPARK-37967: - Summary: Literal support ObjectType Key: SPARK-37967 URL: https://issues.apache.org/jira/browse/SPARK-37967 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.2.0 Reporter: angerszhu
[jira] [Commented] (SPARK-37948) Disable mapreduce.fileoutputcommitter.algorithm.version=2 by default
[ https://issues.apache.org/jira/browse/SPARK-37948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479096#comment-17479096 ] Hyukjin Kwon commented on SPARK-37948: -- The problem is that users might intentionally enable the v2 protocol, and it makes less sense to warn and disable it. They might already know the risk and still enable v2. I personally think we should not assume that the user's input is wrong. > Disable mapreduce.fileoutputcommitter.algorithm.version=2 by default > > > Key: SPARK-37948 > URL: https://issues.apache.org/jira/browse/SPARK-37948 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: hujiahua >Priority: Major > > The Hadoop MR v2 commit algorithm had a correctness issue, described in > SPARK-33019, which changed the default to > spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1. > But some Spark users like me were unaware of this correctness issue and > had used the v2 commit algorithm in Spark 2.x for performance reasons. > After upgrading to Spark 3.x, we encountered this correctness issue in a > production environment, which caused a very serious failure. The trigger > probability of this issue was higher in Spark 3.x, and I didn't delve into > the specific reasons. So I propose we disable > spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 by default: if > users are using the v2 commit algorithm, fail the job and warn them about this > correctness issue. Users could still force v2 usage through a new > configuration.
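For context, the configuration key under discussion is set per job. A hedged sketch of how a user would explicitly pin the committer algorithm version at submit time (the application jar name is a placeholder; the key itself is the one named in the issue):

```shell
# Explicitly pin the safe v1 commit algorithm for this job.
# "my-app.jar" is a placeholder for the user's application.
spark-submit \
  --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 \
  my-app.jar
```

The debate above is about what Spark should do when a user passes `...version=2` instead: fail fast with a warning, or trust that the user has chosen v2 deliberately.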
[jira] [Updated] (SPARK-37954) old columns should not be available after select or drop
[ https://issues.apache.org/jira/browse/SPARK-37954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37954: - Component/s: SQL > old columns should not be available after select or drop > > > Key: SPARK-37954 > URL: https://issues.apache.org/jira/browse/SPARK-37954 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.1 >Reporter: Jean Bon >Priority: Major > > > {code:java} > from pyspark.sql import SparkSession > from pyspark.sql.functions import col > spark = SparkSession.builder.appName('available_columns').getOrCreate() > df = spark.range(5).select((col("id")+10).alias("id2")) > assert df.columns==["id2"] #OK > try: > df.select("id") > error_raise = False > except Exception: > error_raise = True > assert error_raise #OK > df = df.drop("id") #should raise an error > df.filter(col("id")!=2).count() #returns 4, should raise an error > {code} >
[jira] [Resolved] (SPARK-37839) DS V2 supports partial aggregate push-down AVG
[ https://issues.apache.org/jira/browse/SPARK-37839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-37839. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35130 [https://github.com/apache/spark/pull/35130] > DS V2 supports partial aggregate push-down AVG > -- > > Key: SPARK-37839 > URL: https://issues.apache.org/jira/browse/SPARK-37839 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.3.0 > > > Currently, DS V2 supports complete aggregate push-down of AVG. But support for > partial aggregate push-down of AVG would be very useful.
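The complete/partial distinction matters because AVG cannot be merged from per-partition averages; a partial push-down has to ship back SUM and COUNT instead. A plain-Python sketch of the idea (hypothetical data, no Spark APIs):

```python
def partial_avg(rows):
    # What a data source could push back for a partial AVG: a (sum, count) pair.
    return (sum(rows), len(rows))

def merge_avg(partials):
    # Final merge on the Spark side: divide total sum by total count.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

partitions = [[1, 2, 3], [10]]
partials = [partial_avg(p) for p in partitions]
print(merge_avg(partials))  # 4.0, i.e. (1+2+3+10) / 4
# Averaging the per-partition averages would wrongly give (2.0 + 10.0) / 2 = 6.0.
```

This is why partial push-down is expressed as SUM and COUNT at the source, with the division applied only after the partial results are combined.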
[jira] [Assigned] (SPARK-37839) DS V2 supports partial aggregate push-down AVG
[ https://issues.apache.org/jira/browse/SPARK-37839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-37839: --- Assignee: jiaan.geng > DS V2 supports partial aggregate push-down AVG > -- > > Key: SPARK-37839 > URL: https://issues.apache.org/jira/browse/SPARK-37839 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > > > Currently, DS V2 supports complete aggregate push-down of AVG. But support for > partial aggregate push-down of AVG would be very useful.
[jira] [Assigned] (SPARK-37966) Static insert should write _SUCCESS under partition path
[ https://issues.apache.org/jira/browse/SPARK-37966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37966: Assignee: Apache Spark > Static insert should write _SUCCESS under partition path > > > Key: SPARK-37966 > URL: https://issues.apache.org/jira/browse/SPARK-37966 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: Apache Spark >Priority: Major > > Currently, a static insert writes the _SUCCESS file under the table path when > using a DataSource insert; this file should be written under the partition path.
[jira] [Assigned] (SPARK-37966) Static insert should write _SUCCESS under partition path
[ https://issues.apache.org/jira/browse/SPARK-37966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37966: Assignee: (was: Apache Spark) > Static insert should write _SUCCESS under partition path > > > Key: SPARK-37966 > URL: https://issues.apache.org/jira/browse/SPARK-37966 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > > Currently, a static insert writes the _SUCCESS file under the table path when > using a DataSource insert; this file should be written under the partition path.
[jira] [Commented] (SPARK-37966) Static insert should write _SUCCESS under partition path
[ https://issues.apache.org/jira/browse/SPARK-37966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479082#comment-17479082 ] Apache Spark commented on SPARK-37966: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/35254 > Static insert should write _SUCCESS under partition path > > > Key: SPARK-37966 > URL: https://issues.apache.org/jira/browse/SPARK-37966 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > > Currently, a static insert writes the _SUCCESS file under the table path when > using a DataSource insert; this file should be written under the partition path.
[jira] [Commented] (SPARK-37965) Remove check field name when reading/writing existing data in ORC
[ https://issues.apache.org/jira/browse/SPARK-37965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479079#comment-17479079 ] Apache Spark commented on SPARK-37965: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/35253 > Remove check field name when reading/writing existing data in ORC > - > > Key: SPARK-37965 > URL: https://issues.apache.org/jira/browse/SPARK-37965 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > > Remove the field-name check when reading existing data in ORC
[jira] [Assigned] (SPARK-37965) Remove check field name when reading/writing existing data in ORC
[ https://issues.apache.org/jira/browse/SPARK-37965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37965: Assignee: Apache Spark > Remove check field name when reading/writing existing data in ORC > - > > Key: SPARK-37965 > URL: https://issues.apache.org/jira/browse/SPARK-37965 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: Apache Spark >Priority: Major > > Remove the field-name check when reading existing data in ORC
[jira] [Assigned] (SPARK-37965) Remove check field name when reading/writing existing data in ORC
[ https://issues.apache.org/jira/browse/SPARK-37965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37965: Assignee: (was: Apache Spark) > Remove check field name when reading/writing existing data in ORC > - > > Key: SPARK-37965 > URL: https://issues.apache.org/jira/browse/SPARK-37965 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > > Remove the field-name check when reading existing data in ORC
[jira] [Created] (SPARK-37966) Static insert should write _SUCCESS under partition path
angerszhu created SPARK-37966: - Summary: Static insert should write _SUCCESS under partition path Key: SPARK-37966 URL: https://issues.apache.org/jira/browse/SPARK-37966 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.2.0 Reporter: angerszhu Currently, a static insert writes the _SUCCESS file under the table path when using a DataSource insert; this file should be written under the partition path.
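For readers unfamiliar with the term, a "static" insert is one where the partition value is pinned in the statement itself, so the job only ever writes into a single partition directory. A hedged, hypothetical illustration (table and column names are invented):

```sql
-- Hypothetical table `sales` partitioned by `dt`. Because the partition
-- value is fixed in the statement, all output lands under
-- .../sales/dt=2022-01-19/ -- the issue argues the _SUCCESS marker
-- should be written there, rather than at the table root.
INSERT OVERWRITE TABLE sales PARTITION (dt = '2022-01-19')
SELECT id, amount FROM staging_sales;
```

A dynamic insert, by contrast, leaves the partition column unpinned and may write many partition directories in one job.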
[jira] [Updated] (SPARK-37965) Remove check field name when reading/writing existing data in ORC
[ https://issues.apache.org/jira/browse/SPARK-37965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-37965: -- Summary: Remove check field name when reading/writing existing data in ORC (was: Remove check field name when reading existing data in parquet) > Remove check field name when reading/writing existing data in ORC > - > > Key: SPARK-37965 > URL: https://issues.apache.org/jira/browse/SPARK-37965 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major >
[jira] [Updated] (SPARK-37965) Remove check field name when reading/writing existing data in ORC
[ https://issues.apache.org/jira/browse/SPARK-37965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-37965: -- Description: Remove the field-name check when reading existing data in ORC > Remove check field name when reading/writing existing data in ORC > - > > Key: SPARK-37965 > URL: https://issues.apache.org/jira/browse/SPARK-37965 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > > Remove the field-name check when reading existing data in ORC
[jira] [Created] (SPARK-37965) Remove check field name when reading existing data in parquet
angerszhu created SPARK-37965: - Summary: Remove check field name when reading existing data in parquet Key: SPARK-37965 URL: https://issues.apache.org/jira/browse/SPARK-37965 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.2.0 Reporter: angerszhu
[jira] [Created] (SPARK-37964) Replace usages of slaveTracker to workerTracker in MapOutputTrackerSuite
Venkata krishnan Sowrirajan created SPARK-37964: --- Summary: Replace usages of slaveTracker to workerTracker in MapOutputTrackerSuite Key: SPARK-37964 URL: https://issues.apache.org/jira/browse/SPARK-37964 Project: Spark Issue Type: Sub-task Components: Shuffle Affects Versions: 3.2.0 Reporter: Venkata krishnan Sowrirajan
[jira] [Updated] (SPARK-37957) Deterministic flag is not handled for V2 functions
[ https://issues.apache.org/jira/browse/SPARK-37957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-37957: - Fix Version/s: 3.2.1 > Deterministic flag is not handled for V2 functions > -- > > Key: SPARK-37957 > URL: https://issues.apache.org/jira/browse/SPARK-37957 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.2.1, 3.3.0 > >
[jira] [Resolved] (SPARK-37957) Deterministic flag is not handled for V2 functions
[ https://issues.apache.org/jira/browse/SPARK-37957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37957. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35243 [https://github.com/apache/spark/pull/35243] > Deterministic flag is not handled for V2 functions > -- > > Key: SPARK-37957 > URL: https://issues.apache.org/jira/browse/SPARK-37957 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.3.0 > >
[jira] [Assigned] (SPARK-37154) Inline type hints for python/pyspark/rdd.py
[ https://issues.apache.org/jira/browse/SPARK-37154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37154: Assignee: (was: Apache Spark) > Inline type hints for python/pyspark/rdd.py > --- > > Key: SPARK-37154 > URL: https://issues.apache.org/jira/browse/SPARK-37154 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Byron Hsu >Priority: Major >
[jira] [Assigned] (SPARK-37154) Inline type hints for python/pyspark/rdd.py
[ https://issues.apache.org/jira/browse/SPARK-37154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37154: Assignee: Apache Spark > Inline type hints for python/pyspark/rdd.py > --- > > Key: SPARK-37154 > URL: https://issues.apache.org/jira/browse/SPARK-37154 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Byron Hsu >Assignee: Apache Spark >Priority: Major >
[jira] [Commented] (SPARK-37154) Inline type hints for python/pyspark/rdd.py
[ https://issues.apache.org/jira/browse/SPARK-37154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478948#comment-17478948 ] Apache Spark commented on SPARK-37154: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/35252 > Inline type hints for python/pyspark/rdd.py > --- > > Key: SPARK-37154 > URL: https://issues.apache.org/jira/browse/SPARK-37154 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Byron Hsu >Priority: Major >
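For readers unfamiliar with the task: "inlining" type hints means moving signatures from a separate `.pyi` stub file into annotations on the `.py` source itself. A minimal, hypothetical sketch of the before/after shape (this is not Spark's actual `rdd.py` code):

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")

# Before inlining, this signature would live only in a .pyi stub;
# after inlining, the annotations sit directly on the definition,
# so type checkers and readers see them in one place.
def map_values(f: Callable[[T], U], xs: Iterable[T]) -> List[U]:
    return [f(x) for x in xs]

print(map_values(str, [1, 2]))  # ['1', '2']
```

The runtime behavior is unchanged; only where the type information lives moves.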
[jira] [Closed] (SPARK-37910) Spark executor self-exiting due to driver disassociated in Kubernetes with client deploy-mode
[ https://issues.apache.org/jira/browse/SPARK-37910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-37910. - > Spark executor self-exiting due to driver disassociated in Kubernetes with > client deploy-mode > - > > Key: SPARK-37910 > URL: https://issues.apache.org/jira/browse/SPARK-37910 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: Petri >Priority: Major > > I have a Spark driver running in a Kubernetes pod in client deploy-mode, and > it tries to start an executor. > The executor fails with the error: > \{"type":"log", "level":"ERROR", "name":"STREAMING_OTHERS", > "time":"2022-01-14T12:29:38.318Z", "timezone":"UTC", > "class":"dispatcher-Executor", > "method":"spark.executor.CoarseGrainedExecutorBackend.logError(73)", > "log":"Executor self-exiting due to : Driver > 192-168-39-71.mni-system.pod.cluster.local:40752 disassociated! Shutting > down.\n"} > Then the driver attempts to start another executor, which fails with the same > error, and this goes on and on.
> In the driver pod, I see only the following errors: > 22/01/14 12:26:32 ERROR TaskSchedulerImpl: Lost executor 1 on > 192.168.43.250: > 22/01/14 12:27:16 ERROR TaskSchedulerImpl: Lost executor 2 on > 192.168.43.233: > 22/01/14 12:27:59 ERROR TaskSchedulerImpl: Lost executor 3 on > 192.168.43.221: > 22/01/14 12:28:43 ERROR TaskSchedulerImpl: Lost executor 4 on > 192.168.43.217: > 22/01/14 12:29:27 ERROR TaskSchedulerImpl: Lost executor 5 on > 192.168.43.197: > 22/01/14 12:30:10 ERROR TaskSchedulerImpl: Lost executor 6 on > 192.168.43.237: > 22/01/14 12:30:53 ERROR TaskSchedulerImpl: Lost executor 7 on > 192.168.43.196: > 22/01/14 12:31:42 ERROR TaskSchedulerImpl: Lost executor 8 on > 192.168.43.228: > 22/01/14 12:32:31 ERROR TaskSchedulerImpl: Lost executor 9 on > 192.168.43.254: > 22/01/14 12:33:14 ERROR TaskSchedulerImpl: Lost executor 10 on > 192.168.43.204: > 22/01/14 12:33:57 ERROR TaskSchedulerImpl: Lost executor 11 on > 192.168.43.231: > What is wrong? And how can I get executors running correctly?
[jira] [Resolved] (SPARK-37910) Spark executor self-exiting due to driver disassociated in Kubernetes with client deploy-mode
[ https://issues.apache.org/jira/browse/SPARK-37910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-37910. --- Resolution: Invalid > Spark executor self-exiting due to driver disassociated in Kubernetes with > client deploy-mode > - > > Key: SPARK-37910 > URL: https://issues.apache.org/jira/browse/SPARK-37910 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: Petri >Priority: Major > > I have a Spark driver running in a Kubernetes pod in client deploy-mode, and > it tries to start an executor. > The executor fails with the error: > \{"type":"log", "level":"ERROR", "name":"STREAMING_OTHERS", > "time":"2022-01-14T12:29:38.318Z", "timezone":"UTC", > "class":"dispatcher-Executor", > "method":"spark.executor.CoarseGrainedExecutorBackend.logError(73)", > "log":"Executor self-exiting due to : Driver > 192-168-39-71.mni-system.pod.cluster.local:40752 disassociated! Shutting > down.\n"} > Then the driver attempts to start another executor, which fails with the same > error, and this goes on and on.
> In the driver pod, I see only the following errors: > 22/01/14 12:26:32 ERROR TaskSchedulerImpl: Lost executor 1 on > 192.168.43.250: > 22/01/14 12:27:16 ERROR TaskSchedulerImpl: Lost executor 2 on > 192.168.43.233: > 22/01/14 12:27:59 ERROR TaskSchedulerImpl: Lost executor 3 on > 192.168.43.221: > 22/01/14 12:28:43 ERROR TaskSchedulerImpl: Lost executor 4 on > 192.168.43.217: > 22/01/14 12:29:27 ERROR TaskSchedulerImpl: Lost executor 5 on > 192.168.43.197: > 22/01/14 12:30:10 ERROR TaskSchedulerImpl: Lost executor 6 on > 192.168.43.237: > 22/01/14 12:30:53 ERROR TaskSchedulerImpl: Lost executor 7 on > 192.168.43.196: > 22/01/14 12:31:42 ERROR TaskSchedulerImpl: Lost executor 8 on > 192.168.43.228: > 22/01/14 12:32:31 ERROR TaskSchedulerImpl: Lost executor 9 on > 192.168.43.254: > 22/01/14 12:33:14 ERROR TaskSchedulerImpl: Lost executor 10 on > 192.168.43.204: > 22/01/14 12:33:57 ERROR TaskSchedulerImpl: Lost executor 11 on > 192.168.43.231: > What is wrong? And how can I get executors running correctly?
[jira] [Commented] (SPARK-37910) Spark executor self-exiting due to driver disassociated in Kubernetes with client deploy-mode
[ https://issues.apache.org/jira/browse/SPARK-37910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478934#comment-17478934 ] Dongjoon Hyun commented on SPARK-37910: --- Hi, [~Silen]. Apache Spark JIRA issues are not supposed to be used for Q&A. Could you use the mailing list or StackOverflow instead? - https://spark.apache.org/community.html Let me close this first, since it seems to be misused. > Spark executor self-exiting due to driver disassociated in Kubernetes with > client deploy-mode > - > > Key: SPARK-37910 > URL: https://issues.apache.org/jira/browse/SPARK-37910 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: Petri >Priority: Major > > I have a Spark driver running in a Kubernetes pod in client deploy-mode, and > it tries to start an executor. > The executor fails with the error: > \{"type":"log", "level":"ERROR", "name":"STREAMING_OTHERS", > "time":"2022-01-14T12:29:38.318Z", "timezone":"UTC", > "class":"dispatcher-Executor", > "method":"spark.executor.CoarseGrainedExecutorBackend.logError(73)", > "log":"Executor self-exiting due to : Driver > 192-168-39-71.mni-system.pod.cluster.local:40752 disassociated! Shutting > down.\n"} > Then the driver attempts to start another executor, which fails with the same > error, and this goes on and on.
> In the driver pod, I see only the following errors: > 22/01/14 12:26:32 ERROR TaskSchedulerImpl: Lost executor 1 on > 192.168.43.250: > 22/01/14 12:27:16 ERROR TaskSchedulerImpl: Lost executor 2 on > 192.168.43.233: > 22/01/14 12:27:59 ERROR TaskSchedulerImpl: Lost executor 3 on > 192.168.43.221: > 22/01/14 12:28:43 ERROR TaskSchedulerImpl: Lost executor 4 on > 192.168.43.217: > 22/01/14 12:29:27 ERROR TaskSchedulerImpl: Lost executor 5 on > 192.168.43.197: > 22/01/14 12:30:10 ERROR TaskSchedulerImpl: Lost executor 6 on > 192.168.43.237: > 22/01/14 12:30:53 ERROR TaskSchedulerImpl: Lost executor 7 on > 192.168.43.196: > 22/01/14 12:31:42 ERROR TaskSchedulerImpl: Lost executor 8 on > 192.168.43.228: > 22/01/14 12:32:31 ERROR TaskSchedulerImpl: Lost executor 9 on > 192.168.43.254: > 22/01/14 12:33:14 ERROR TaskSchedulerImpl: Lost executor 10 on > 192.168.43.204: > 22/01/14 12:33:57 ERROR TaskSchedulerImpl: Lost executor 11 on > 192.168.43.231: > What is wrong? And how can I get executors running correctly?
[jira] [Commented] (SPARK-24432) Add support for dynamic resource allocation
[ https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478907#comment-17478907 ] Dongjoon Hyun commented on SPARK-24432: --- [~pralabhkumar]. What you are looking at is not a single PR. For the configuration, please check {code} spark.dynamicAllocation.* (including spark.dynamicAllocation.shuffleTracking.*) spark.decommission.* spark.storage.decommission.* {code} In addition, the `master` branch is already for Apache Spark 3.3.0. It seems that you are using outdated Spark versions. bq. The K8s dynamic allocation with storage migration between executors is already in `master` branch for Apache Spark 3.1.0. If you haven't tried the latest Apache Spark 3.2, please try Apache Spark 3.2.1 RC2. Although it's not Apache Spark 3.3.0-SNAPSHOT, it has most of the features you need. - https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-bin/ - https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-docs/ > Add support for dynamic resource allocation > --- > > Key: SPARK-24432 > URL: https://issues.apache.org/jira/browse/SPARK-24432 > Project: Spark > Issue Type: New Feature > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Yinan Li >Priority: Major > > This is an umbrella ticket for work on adding support for dynamic resource > allocation into the Kubernetes mode. This requires a Kubernetes-specific > external shuffle service. The feature is available in our fork at > github.com/apache-spark-on-k8s/spark.
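The configuration prefixes listed in the comment could be combined roughly as follows when submitting to Kubernetes. This is a hedged sketch, not a complete recipe: the `.enabled` keys shown exist in Spark 3.x, but exact availability should be checked against the docs for your Spark version, and `my-app.jar` is a placeholder:

```shell
# Sketch: dynamic allocation on K8s without an external shuffle service,
# using shuffle tracking plus block decommissioning (Spark 3.x).
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
  --conf spark.decommission.enabled=true \
  --conf spark.storage.decommission.enabled=true \
  my-app.jar
```

As the comment notes, this behavior spans several features and releases rather than one PR, which is why the pointers are config prefixes rather than a single change.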
[jira] [Assigned] (SPARK-37934) Upgrade Jetty version to 9.4.44
[ https://issues.apache.org/jira/browse/SPARK-37934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-37934: Assignee: Sajith A > Upgrade Jetty version to 9.4.44 > --- > > Key: SPARK-37934 > URL: https://issues.apache.org/jira/browse/SPARK-37934 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0, 3.3.0 >Reporter: Sajith A >Assignee: Sajith A >Priority: Minor > Fix For: 3.3.0 > > > Upgrade the Jetty version to 9.4.44.v20210927 in the current Spark master to bring > in the fix for the > [jetty#6973|https://github.com/eclipse/jetty.project/issues/6973] issue.
[jira] [Updated] (SPARK-37934) Upgrade Jetty version to 9.4.44
[ https://issues.apache.org/jira/browse/SPARK-37934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-37934: - Issue Type: Improvement (was: Bug) > Upgrade Jetty version to 9.4.44 > --- > > Key: SPARK-37934 > URL: https://issues.apache.org/jira/browse/SPARK-37934 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0, 3.3.0 >Reporter: Sajith A >Priority: Minor > > Upgrade the Jetty version to 9.4.44.v20210927 in the current Spark master to bring > in the fix for the > [jetty#6973|https://github.com/eclipse/jetty.project/issues/6973] issue.
[jira] [Resolved] (SPARK-37934) Upgrade Jetty version to 9.4.44
[ https://issues.apache.org/jira/browse/SPARK-37934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-37934. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35230 [https://github.com/apache/spark/pull/35230] > Upgrade Jetty version to 9.4.44 > --- > > Key: SPARK-37934 > URL: https://issues.apache.org/jira/browse/SPARK-37934 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0, 3.3.0 >Reporter: Sajith A >Priority: Minor > Fix For: 3.3.0 > > > Upgrade the Jetty version to 9.4.44.v20210927 in the current Spark master to bring > in the fix for the > [jetty#6973|https://github.com/eclipse/jetty.project/issues/6973] issue.
[jira] [Commented] (SPARK-37690) Recursive view `df` detected (cycle: `df` -> `df`)
[ https://issues.apache.org/jira/browse/SPARK-37690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478861#comment-17478861 ] Kiran commented on SPARK-37690: --- Hit this issue with Spark 3.2.0. Looking for workarounds, but none has worked so far. > Recursive view `df` detected (cycle: `df` -> `df`) > -- > > Key: SPARK-37690 > URL: https://issues.apache.org/jira/browse/SPARK-37690 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Robin >Priority: Major > > In Spark 3.2.0, you can no longer reuse the same name for a temporary view. > This change is backwards incompatible, and means a common way of running > pipelines of SQL queries no longer works. The following is a simple > reproducible example that works in Spark 2.x and 3.1.2, but not in 3.2.0: > {code:python}from pyspark.context import SparkContext > from pyspark.sql import SparkSession > sc = SparkContext.getOrCreate() > spark = SparkSession(sc) > sql = """ SELECT id as col_1, rand() AS col_2 FROM RANGE(10); """ > df = spark.sql(sql) > df.createOrReplaceTempView("df") > sql = """ SELECT * FROM df """ > df = spark.sql(sql) > df.createOrReplaceTempView("df") > sql = """ SELECT * FROM df """ > df = spark.sql(sql) {code} > The following error is now produced: > {code:python}AnalysisException: Recursive view `df` detected (cycle: `df` -> > `df`) > {code} > I'm reasonably sure this change is unintentional in 3.2.0, since it breaks a > lot of legacy code, and the `createOrReplaceTempView` method is named > explicitly such that replacing an existing view should be allowed. An > internet search suggests other users have run into similar problems, e.g.
> [here|https://community.databricks.com/s/question/0D53f1Qugr7CAB/upgrading-from-spark-24-to-32-recursive-view-errors-when-using] > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
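[Editor's note] A workaround commonly suggested for this class of error (not taken from the ticket; an illustrative sketch) is to stop reusing the name `df` and instead register each pipeline stage under a fresh view name. The Spark calls below are shown only as comments so the sketch runs without Spark; `fresh_view_names` is a hypothetical helper name.

```python
import itertools


def fresh_view_names(base):
    """Yield df_0, df_1, ... so each pipeline stage registers a new temp view
    instead of replacing `df` with a view that reads from `df` itself."""
    for i in itertools.count():
        yield f"{base}_{i}"


# Intended PySpark usage (hypothetical, shown as comments):
#   names = fresh_view_names("df")
#   prev_name = None
#   for step_sql in pipeline_steps:
#       name = next(names)
#       df = spark.sql(step_sql)          # step_sql reads FROM prev_name
#       df.createOrReplaceTempView(name)  # never re-registers the source view
#       prev_name = name

names = fresh_view_names("df")
assert [next(names) for _ in range(3)] == ["df_0", "df_1", "df_2"]
```

Each `SELECT * FROM df_N` then reads from the previous stage's view, so no view ever references itself and the cycle check is never triggered.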
[jira] [Assigned] (SPARK-37928) Add Parquet Data Page V2 bench scenario to DataSourceReadBenchmark
[ https://issues.apache.org/jira/browse/SPARK-37928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-37928: Assignee: Yang Jie > Add Parquet Data Page V2 bench scenario to DataSourceReadBenchmark > -- > > Key: SPARK-37928 > URL: https://issues.apache.org/jira/browse/SPARK-37928 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37928) Add Parquet Data Page V2 bench scenario to DataSourceReadBenchmark
[ https://issues.apache.org/jira/browse/SPARK-37928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37928. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35226 [https://github.com/apache/spark/pull/35226] > Add Parquet Data Page V2 bench scenario to DataSourceReadBenchmark > -- > > Key: SPARK-37928 > URL: https://issues.apache.org/jira/browse/SPARK-37928 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37959) Fix the UT of checking norm in KMeans & BiKMeans
[ https://issues.apache.org/jira/browse/SPARK-37959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao resolved SPARK-37959. Fix Version/s: 3.2.1 3.3.0 Assignee: zhengruifeng (was: Apache Spark) Resolution: Fixed > Fix the UT of checking norm in KMeans & BiKMeans > > > Key: SPARK-37959 > URL: https://issues.apache.org/jira/browse/SPARK-37959 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.3.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > Fix For: 3.2.1, 3.3.0 > > > In KMeansSuite and BisectingKMeansSuite, there are some unused lines: > > {code:java} > model1.clusterCenters.forall(Vectors.norm(_, 2) == 1.0) {code} > > For cosine distance, the norm of each center vector should be 1, so the norm > check is meaningful; > For Euclidean distance, the norm check is meaningless. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
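[Editor's note] Why the norm check only makes sense for cosine distance can be shown in a few lines (plain Python, not Spark ML code): cosine-distance KMeans keeps every cluster center on the unit sphere, so asserting an L2 norm of 1 is meaningful; Euclidean centers are plain means whose norm can be anything.

```python
import math


def l2_norm(v):
    """Euclidean (L2) norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def normalize(v):
    """Project a vector onto the unit sphere, as cosine-distance KMeans
    does with its cluster centers."""
    n = l2_norm(v)
    return [x / n for x in v]


center = normalize([3.0, 4.0])
# Meaningful assertion under cosine distance: the center is unit-norm.
assert abs(l2_norm(center) - 1.0) < 1e-9
# Under Euclidean distance the center is just a mean, e.g. norm 5.0 here,
# so the same check carries no information.
assert l2_norm([3.0, 4.0]) == 5.0
```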
[jira] [Commented] (SPARK-32165) SessionState leaks SparkListener with multiple SparkSession
[ https://issues.apache.org/jira/browse/SPARK-32165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478698#comment-17478698 ] Wenchen Fan commented on SPARK-32165: - I'm closing this as it won't happen in the real world. There should only be one `SharedState` instance per driver JVM. > SessionState leaks SparkListener with multiple SparkSession > --- > > Key: SPARK-32165 > URL: https://issues.apache.org/jira/browse/SPARK-32165 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianjin YE >Priority: Major > > Copied from > [https://github.com/apache/spark/pull/28128#issuecomment-653102770] > I'd like to point out that this pr > (https://github.com/apache/spark/pull/28128) doesn't fix the memory leak > completely. Once {{SessionState}} is touched, it will add two more listeners > into the SparkContext, namely {{SQLAppStatusListener}} and > {{ExecutionListenerBus}}. > It can be reproduced easily as > {code:java} > test("SPARK-31354: SparkContext only register one SparkSession > ApplicationEnd listener") { > val conf = new SparkConf() > .setMaster("local") > .setAppName("test-app-SPARK-31354-1") > val context = new SparkContext(conf) > SparkSession > .builder() > .sparkContext(context) > .master("local") > .getOrCreate() > .sessionState // this touches the sessionState > val postFirstCreation = context.listenerBus.listeners.size() > SparkSession.clearActiveSession() > SparkSession.clearDefaultSession() > SparkSession > .builder() > .sparkContext(context) > .master("local") > .getOrCreate() > .sessionState // this touches the sessionState > val postSecondCreation = context.listenerBus.listeners.size() > SparkSession.clearActiveSession() > SparkSession.clearDefaultSession() > assert(postFirstCreation == postSecondCreation) > } > {code} > The problem can be reproduced by the above code. 
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32165) SessionState leaks SparkListener with multiple SparkSession
[ https://issues.apache.org/jira/browse/SPARK-32165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32165. - Resolution: Not A Problem > SessionState leaks SparkListener with multiple SparkSession > --- > > Key: SPARK-32165 > URL: https://issues.apache.org/jira/browse/SPARK-32165 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianjin YE >Priority: Major > > Copied from > [https://github.com/apache/spark/pull/28128#issuecomment-653102770] > I'd like to point out that this pr > (https://github.com/apache/spark/pull/28128) doesn't fix the memory leak > completely. Once {{SessionState}} is touched, it will add two more listeners > into the SparkContext, namely {{SQLAppStatusListener}} and > {{ExecutionListenerBus}}. > It can be reproduced easily as > {code:java} > test("SPARK-31354: SparkContext only register one SparkSession > ApplicationEnd listener") { > val conf = new SparkConf() > .setMaster("local") > .setAppName("test-app-SPARK-31354-1") > val context = new SparkContext(conf) > SparkSession > .builder() > .sparkContext(context) > .master("local") > .getOrCreate() > .sessionState // this touches the sessionState > val postFirstCreation = context.listenerBus.listeners.size() > SparkSession.clearActiveSession() > SparkSession.clearDefaultSession() > SparkSession > .builder() > .sparkContext(context) > .master("local") > .getOrCreate() > .sessionState // this touches the sessionState > val postSecondCreation = context.listenerBus.listeners.size() > SparkSession.clearActiveSession() > SparkSession.clearDefaultSession() > assert(postFirstCreation == postSecondCreation) > } > {code} > The problem can be reproduced by the above code. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
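[Editor's note] The property the Scala test above checks — that touching `sessionState` twice must not grow the listener list — amounts to idempotent listener registration. A minimal sketch in plain Python (illustrative only; `ListenerBus` here is a toy class, not Spark's actual bus API):

```python
class ListenerBus:
    """Toy bus that refuses duplicate registrations, so repeated
    SparkSession/SessionState creation cannot grow the listener list."""

    def __init__(self):
        self._listeners = []

    def add_listener_once(self, listener):
        # Registering the same listener object twice is a no-op.
        if listener not in self._listeners:
            self._listeners.append(listener)

    def size(self):
        return len(self._listeners)


bus = ListenerBus()
status_listener = object()   # stands in for e.g. SQLAppStatusListener
for _ in range(2):           # two "sessionState touches"
    bus.add_listener_once(status_listener)
assert bus.size() == 1       # same invariant the Scala test asserts
```

A distinct listener object is still accepted, so the guard only suppresses genuine duplicates.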
[jira] [Commented] (SPARK-30661) KMeans blockify input vectors
[ https://issues.apache.org/jira/browse/SPARK-30661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478678#comment-17478678 ] Sean R. Owen commented on SPARK-30661: -- How much difference does it make? I'm weighing the cost of a new user parameter and more code against the benefit. I would, I suppose, not expect clustering input to be exceptionally sparse. Sparse often implies high-dimensional, and everything is far from everything in high dimensions, so clustering makes less sense there. If anything, that is an argument for your change. I am just wondering out loud whether we should even change the default to the blocked implementation, if this proceeds. > KMeans blockify input vectors > - > > Key: SPARK-30661 > URL: https://issues.apache.org/jira/browse/SPARK-30661 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37955) PartitioningAwareFileIndex->basePath incorrectly contains the partition filters
[ https://issues.apache.org/jira/browse/SPARK-37955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas Chatzistergiou resolved SPARK-37955. Resolution: Not A Bug > PartitioningAwareFileIndex->basePath incorrectly contains the partition > filters > --- > > Key: SPARK-37955 > URL: https://issues.apache.org/jira/browse/SPARK-37955 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Andreas Chatzistergiou >Priority: Minor > > PartitioningAwareFileIndex.getBasePath method returns paths that contain the > partitioning directories. This violates the definition of the basePath per > FileIndex, i.e. the parent directory of a file path with all the partitioning > directories being stripped off. > This PR fixes the issue by separating the notion of the partitioningPaths and > the basePaths in the PartitioningAwareFileIndex. The basePaths are derived by > removing from the partitioningPaths any partitioning columns with the aid of > the PartitioningSchema. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
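[Editor's note] The intended relationship between a partitioning path and the base path — the base path is the path with `col=value` partition directories stripped off — can be illustrated with a small helper (hypothetical function name; not Spark's actual `PartitioningAwareFileIndex` implementation):

```python
def strip_partition_dirs(path, partition_cols):
    """Drop `col=value` segments whose column is a known partition column,
    yielding the base path of the table."""
    segments = [s for s in path.split("/") if s]
    kept = [
        s for s in segments
        if "=" not in s or s.split("=", 1)[0] not in partition_cols
    ]
    return "/" + "/".join(kept)


# A partitioning path includes the partition directories; the base path
# strips them with the aid of the partitioning schema.
assert strip_partition_dirs("/data/warehouse/foo/j=2", ["j"]) == "/data/warehouse/foo"
assert strip_partition_dirs("/t/j=2/k=3", ["j", "k"]) == "/t"
```

Segments without `=`, or with a column not in the partitioning schema, are kept, which is why the partitioning schema is needed to do the stripping safely.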
[jira] [Commented] (SPARK-37963) Need to update Partition URI after renaming table in InMemoryCatalog
[ https://issues.apache.org/jira/browse/SPARK-37963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478635#comment-17478635 ] Apache Spark commented on SPARK-37963: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/35251 > Need to update Partition URI after renaming table in InMemoryCatalog > > > Key: SPARK-37963 > URL: https://issues.apache.org/jira/browse/SPARK-37963 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > After renaming a partitioned table, select from the new table from > InMemoryCatalog will get an empty result. > The following checkAnswer will fail as the result is empty. > {code:java} > sql(s"create table foo(i int, j int) using PARQUET partitioned by (j)") > sql("insert into table foo partition(j=2) values (1)") > sql(s"alter table foo rename to bar") > checkAnswer(spark.table("bar"), Row(1, 2)) {code} > To fix the bug, we need to update Partition URI after renaming a table in > InMemoryCatalog > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37963) Need to update Partition URI after renaming table in InMemoryCatalog
[ https://issues.apache.org/jira/browse/SPARK-37963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37963: Assignee: Gengliang Wang (was: Apache Spark) > Need to update Partition URI after renaming table in InMemoryCatalog > > > Key: SPARK-37963 > URL: https://issues.apache.org/jira/browse/SPARK-37963 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > After renaming a partitioned table, select from the new table from > InMemoryCatalog will get an empty result. > The following checkAnswer will fail as the result is empty. > {code:java} > sql(s"create table foo(i int, j int) using PARQUET partitioned by (j)") > sql("insert into table foo partition(j=2) values (1)") > sql(s"alter table foo rename to bar") > checkAnswer(spark.table("bar"), Row(1, 2)) {code} > To fix the bug, we need to update Partition URI after renaming a table in > InMemoryCatalog > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37963) Need to update Partition URI after renaming table in InMemoryCatalog
[ https://issues.apache.org/jira/browse/SPARK-37963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37963: Assignee: Apache Spark (was: Gengliang Wang) > Need to update Partition URI after renaming table in InMemoryCatalog > > > Key: SPARK-37963 > URL: https://issues.apache.org/jira/browse/SPARK-37963 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > After renaming a partitioned table, select from the new table from > InMemoryCatalog will get an empty result. > The following checkAnswer will fail as the result is empty. > {code:java} > sql(s"create table foo(i int, j int) using PARQUET partitioned by (j)") > sql("insert into table foo partition(j=2) values (1)") > sql(s"alter table foo rename to bar") > checkAnswer(spark.table("bar"), Row(1, 2)) {code} > To fix the bug, we need to update Partition URI after renaming a table in > InMemoryCatalog > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37963) Need to update Partition URI after renaming table in InMemoryCatalog
Gengliang Wang created SPARK-37963: -- Summary: Need to update Partition URI after renaming table in InMemoryCatalog Key: SPARK-37963 URL: https://issues.apache.org/jira/browse/SPARK-37963 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0 Reporter: Gengliang Wang Assignee: Gengliang Wang After renaming a partitioned table, selecting from the new table in InMemoryCatalog returns an empty result. The following checkAnswer will fail because the result is empty. {code:java} sql(s"create table foo(i int, j int) using PARQUET partitioned by (j)") sql("insert into table foo partition(j=2) values (1)") sql(s"alter table foo rename to bar") checkAnswer(spark.table("bar"), Row(1, 2)) {code} To fix the bug, we need to update the partition URIs after renaming a table in InMemoryCatalog. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
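[Editor's note] The shape of the fix is simple: when the table directory moves from `.../foo` to `.../bar`, every partition location that still points under the old directory must be rewritten. A rough sketch (illustrative only; not the actual InMemoryCatalog code):

```python
def rename_partition_uri(uri, old_table_dir, new_table_dir):
    """After `ALTER TABLE foo RENAME TO bar`, rewrite partition locations
    under the old table directory to point under the new one. Locations
    outside the table directory (custom partition locations) are untouched."""
    if uri.startswith(old_table_dir):
        return new_table_dir + uri[len(old_table_dir):]
    return uri


old_dir = "file:/warehouse/foo"
new_dir = "file:/warehouse/bar"
assert rename_partition_uri("file:/warehouse/foo/j=2", old_dir, new_dir) == "file:/warehouse/bar/j=2"
```

Without this rewrite, the catalog still lists the partition under `.../foo/j=2`, so a scan of `bar` finds no files, which matches the empty result described above.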
[jira] [Commented] (SPARK-24432) Add support for dynamic resource allocation
[ https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478618#comment-17478618 ] pralabhkumar commented on SPARK-24432: -- [~dongjoon] One quick question. - The K8s dynamic allocation with storage migration between executors is already in the `master` branch for Apache Spark 3.1.0. Could you please point me to the PR that implements this? It would be really helpful. > Add support for dynamic resource allocation > --- > > Key: SPARK-24432 > URL: https://issues.apache.org/jira/browse/SPARK-24432 > Project: Spark > Issue Type: New Feature > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Yinan Li >Priority: Major > > This is an umbrella ticket for work on adding support for dynamic resource > allocation into the Kubernetes mode. This requires a Kubernetes-specific > external shuffle service. The feature is available in our fork at > github.com/apache-spark-on-k8s/spark. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34805) PySpark loses metadata in DataFrame fields when selecting nested columns
[ https://issues.apache.org/jira/browse/SPARK-34805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478610#comment-17478610 ] Kevin Wallimann commented on SPARK-34805: - The problem happens in Scala as well. I attached a scala file [^nested_columns_metadata.scala] to demonstrate the issue. I tried it in the spark-shell of versions 2.4.7, 3.1.2 and 3.2.0, always with the same result. This behavior is a bug, because the documentation for {{StructField}} clearly says that the "metadata should be preserved during transformation if the content of the column is not modified, e.g, in selection" > PySpark loses metadata in DataFrame fields when selecting nested columns > > > Key: SPARK-34805 > URL: https://issues.apache.org/jira/browse/SPARK-34805 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.1, 3.1.1 >Reporter: Mark Ressler >Priority: Major > Attachments: jsonMetadataTest.py, nested_columns_metadata.scala > > > For a DataFrame schema with nested StructTypes, where metadata is set for > fields in the schema, that metadata is lost when a DataFrame selects nested > fields. For example, suppose > {code:java} > df.schema.fields[0].dataType.fields[0].metadata > {code} > returns a non-empty dictionary, then > {code:java} > df.select('Field0.SubField0').schema.fields[0].metadata{code} > returns an empty dictionary, where "Field0" is the name of the first field in > the DataFrame and "SubField0" is the name of the first nested field under > "Field0". > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
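[Editor's note] What "metadata should be preserved during selection" means can be modeled with plain dicts (a toy model of a nested schema, not Spark's `StructField` implementation; `select_nested` is a hypothetical helper):

```python
def select_nested(schema, path):
    """Walk a dotted path like 'Field0.SubField0' through a nested schema,
    returning the leaf field with its metadata intact - the behavior the
    StructField documentation promises for selections."""
    field = schema
    for name in path.split("."):
        field = field["fields"][name]
    return field


schema = {
    "fields": {
        "Field0": {
            "fields": {
                "SubField0": {"type": "int", "metadata": {"comment": "id"}}
            }
        }
    }
}

leaf = select_nested(schema, "Field0.SubField0")
# The selection does not modify the column, so metadata must survive.
assert leaf["metadata"] == {"comment": "id"}
```

The bug reported above is the Spark analogue of `leaf["metadata"]` coming back as `{}` after the projection.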
[jira] [Updated] (SPARK-34805) PySpark loses metadata in DataFrame fields when selecting nested columns
[ https://issues.apache.org/jira/browse/SPARK-34805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Wallimann updated SPARK-34805: Attachment: nested_columns_metadata.scala > PySpark loses metadata in DataFrame fields when selecting nested columns > > > Key: SPARK-34805 > URL: https://issues.apache.org/jira/browse/SPARK-34805 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.1, 3.1.1 >Reporter: Mark Ressler >Priority: Major > Attachments: jsonMetadataTest.py, nested_columns_metadata.scala > > > For a DataFrame schema with nested StructTypes, where metadata is set for > fields in the schema, that metadata is lost when a DataFrame selects nested > fields. For example, suppose > {code:java} > df.schema.fields[0].dataType.fields[0].metadata > {code} > returns a non-empty dictionary, then > {code:java} > df.select('Field0.SubField0').schema.fields[0].metadata{code} > returns an empty dictionary, where "Field0" is the name of the first field in > the DataFrame and "SubField0" is the name of the first nested field under > "Field0". > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37932) Analyzer can fail when join left side and right side are the same view
[ https://issues.apache.org/jira/browse/SPARK-37932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478590#comment-17478590 ] Zhixiong Chen commented on SPARK-37932: --- I'm working on this issue. > Analyzer can fail when join left side and right side are the same view > -- > > Key: SPARK-37932 > URL: https://issues.apache.org/jira/browse/SPARK-37932 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Feng Zhu >Priority: Major > Attachments: sql_and_exception > > > See the attachment for details, including SQL and the exception information. > * sql1, there is a normal filter (LO_SUPPKEY > 10) in the right side > subquery, Analyzer works as expected; > * sql2, there is a HAVING filter(HAVING COUNT(DISTINCT LO_SUPPKEY) > 1) in > the right side subquery, Analyzer failed with "Resolved attribute(s) > LO_SUPPKEY#337 missing ...". > From the debug info, the problem seems to occur after the rule > DeduplicateRelations is applied. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-37932) Analyzer can fail when join left side and right side are the same view
[ https://issues.apache.org/jira/browse/SPARK-37932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478542#comment-17478542 ] Feng Zhu edited comment on SPARK-37932 at 1/19/22, 10:31 AM: - test {code:scala} test("SPARK-37932: view join view self with having filter") { withTable("t") { withView("v1") { Seq((2, "test2"), (3, "test3"), (1, "test1")).toDF("id", "name") .write.format("parquet").saveAsTable("t") sql("CREATE VIEW v1 (id, name) AS SELECT id, name FROM t") sql(""" |SELECT l1.id | FROM v1 l1 | INNER JOIN ( | SELECT id | FROM v1 | GROUP BY id | HAVING COUNT(DISTINCT name) > 1 | ) l2 | ON l1.id = l2.id | GROUP BY l1.name, l1.id; """.stripMargin) } } } {code} exception {code:java} org.apache.spark.sql.AnalysisException: Resolved attribute(s) name#25 missing from id#29,name#30 in operator !Aggregate [id#29], [id#29, count(distinct name#25) AS count(distinct name#25)#31L]. Attribute(s) with the same name appear in the operation: name. Please check if the right attribute(s) are used.; Aggregate [name#25, id#24], [id#24] +- Join Inner, (id#24 = id#29) :- SubqueryAlias l1 : +- SubqueryAlias spark_catalog.default.v1 : +- View (`default`.`v1`, [id#24,name#25]) : +- Project [cast(id#20 as int) AS id#24, cast(name#21 as string) AS name#25] : +- Project [id#20, name#21] : +- SubqueryAlias spark_catalog.default.t : +- Relation default.t[id#20,name#21] parquet +- SubqueryAlias l2 +- Project [id#29] +- Filter (count(distinct name#25)#31L > cast(1 as bigint)) +- !Aggregate [id#29], [id#29, count(distinct name#25) AS count(distinct name#25)#31L] +- SubqueryAlias spark_catalog.default.v1 +- View (`default`.`v1`, [id#29,name#30]) +- Project [cast(id#26 as int) AS id#29, cast(name#27 as string) AS name#30] +- Project [id#26, name#27] +- SubqueryAlias spark_catalog.default.t +- Relation default.t[id#26,name#27] parquet at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:51) at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:50) {code} was (Author: fishcus): test {code:scala} test("SPARK-37932: view join view self with having filter") { withTable("t") { withView("v1") { Seq((2, "test2"), (3, "test3"), (1, "test1")).toDF("id", "name") .write.format("parquet").saveAsTable("t") sql("CREATE VIEW v1 (id, name) AS SELECT id, name FROM t") sql(""" |SELECT l1.id | FROM v1 l1 | INNER JOIN ( | SELECT id | FROM v1 | GROUP BY id | HAVING COUNT(DISTINCT name) > 1 | ) l2 | ON l1.id = l2.id | GROUP BY l1.name, l1.id; """.stripMargin) } } } {code} exception {code:java} org.apache.spark.sql.AnalysisException: Resolved attribute(s) name#25 missing from id#29,name#30 in operator !Aggregate [id#29], [id#29, count(distinct name#25) AS count(distinct name#25)#31L]. Attribute(s) with the same name appear in the operation: name. Please check if the right attribute(s) are used.; Aggregate [name#25, id#24], [id#24] +- Join Inner, (id#24 = id#29) :- SubqueryAlias l1 : +- SubqueryAlias spark_catalog.default.v1 : +- View (`default`.`v1`, [id#24,name#25]) : +- Project [cast(id#20 as int) AS id#24, cast(name#21 as string) AS name#25] : +- Project [id#20, name#21] : +- SubqueryAlias spark_catalog.default.t : +- Relation default.t[id#20,name#21] parquet +- SubqueryAlias l2 +- Project [id#29] +- Filter (count(distinct name#25)#31L > cast(1 as bigint)) +- !Aggregate [id#29], [id#29, count(distinct name#25) AS count(distinct name#25)#31L] +- SubqueryAlias spark_catalog.default.v1 +- View (`default`.`v1`, [id#29,name#30]) +- Project [cast(id#26 as int) AS id#29, cast(name#27 as string) AS name#30] +- Project [id#26, name#27] +- SubqueryAlias spark_catalog.default.t +- Relation default.t[id#26,name#27] parquet at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:51) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:50) {code} > Analyzer can fail when join left 
side and right side are the same view > -- > > Key: SPARK-37932 > URL: https://issues.a
[jira] [Comment Edited] (SPARK-37932) Analyzer can fail when join left side and right side are the same view
[ https://issues.apache.org/jira/browse/SPARK-37932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478542#comment-17478542 ] Feng Zhu edited comment on SPARK-37932 at 1/19/22, 10:30 AM: - test {code:scala} test("SPARK-37932: view join view self with having filter") { withTable("t") { withView("v1") { Seq((2, "test2"), (3, "test3"), (1, "test1")).toDF("id", "name") .write.format("parquet").saveAsTable("t") sql("CREATE VIEW v1 (id, name) AS SELECT id, name FROM t") sql(""" |SELECT l1.id | FROM v1 l1 | INNER JOIN ( | SELECT id | FROM v1 | GROUP BY id | HAVING COUNT(DISTINCT name) > 1 | ) l2 | ON l1.id = l2.id | GROUP BY l1.name, l1.id; """.stripMargin) } } } {code} exception {code:java} org.apache.spark.sql.AnalysisException: Resolved attribute(s) name#25 missing from id#29,name#30 in operator !Aggregate [id#29], [id#29, count(distinct name#25) AS count(distinct name#25)#31L]. Attribute(s) with the same name appear in the operation: name. Please check if the right attribute(s) are used.; Aggregate [name#25, id#24], [id#24] +- Join Inner, (id#24 = id#29) :- SubqueryAlias l1 : +- SubqueryAlias spark_catalog.default.v1 : +- View (`default`.`v1`, [id#24,name#25]) : +- Project [cast(id#20 as int) AS id#24, cast(name#21 as string) AS name#25] : +- Project [id#20, name#21] : +- SubqueryAlias spark_catalog.default.t : +- Relation default.t[id#20,name#21] parquet +- SubqueryAlias l2 +- Project [id#29] +- Filter (count(distinct name#25)#31L > cast(1 as bigint)) +- !Aggregate [id#29], [id#29, count(distinct name#25) AS count(distinct name#25)#31L] +- SubqueryAlias spark_catalog.default.v1 +- View (`default`.`v1`, [id#29,name#30]) +- Project [cast(id#26 as int) AS id#29, cast(name#27 as string) AS name#30] +- Project [id#26, name#27] +- SubqueryAlias spark_catalog.default.t +- Relation default.t[id#26,name#27] parquet at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:51) at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:50) {code} was (Author: fishcus): test {code:scala} test("SPARK-37932: view join view self with having filter") { withTable("t") { withView("v1") { Seq((2, "test2"), (3, "test3"), (1, "test1")).toDF("id", "name") .write.format("parquet").saveAsTable("t") sql("CREATE VIEW v1 (id, name) AS SELECT id, name FROM t") sql(""" |SELECT l1.id | FROM v1 l1 | INNER JOIN ( | SELECT id | FROM v1 | GROUP BY id | HAVING COUNT(DISTINCT name) > 1 | ) l2 | ON l1.id = l2.id | GROUP BY l1.name, l1.id; """.stripMargin) } } } {code} exception {code} org.apache.spark.sql.AnalysisException: Resolved attribute(s) name#25 missing from id#29,name#30 in operator !Aggregate [id#29], [id#29, count(distinct name#25) AS count(distinct name#25)#31L]. Attribute(s) with the same name appear in the operation: name. Please check if the right attribute(s) are used.; Aggregate [name#25, id#24], [id#24] +- Join Inner, (id#24 = id#29) :- SubqueryAlias l1 : +- SubqueryAlias spark_catalog.default.v1 : +- View (`default`.`v1`, [id#24,name#25]) : +- Project [cast(id#20 as int) AS id#24, cast(name#21 as string) AS name#25] : +- Project [id#20, name#21] : +- SubqueryAlias spark_catalog.default.t : +- Relation default.t[id#20,name#21] parquet +- SubqueryAlias l2 +- Project [id#29] +- Filter (count(distinct name#25)#31L > cast(1 as bigint)) +- !Aggregate [id#29], [id#29, count(distinct name#25) AS count(distinct name#25)#31L] +- SubqueryAlias spark_catalog.default.v1 +- View (`default`.`v1`, [id#29,name#30]) +- Project [cast(id#26 as int) AS id#29, cast(name#27 as string) AS name#30] +- Project [id#26, name#27] +- SubqueryAlias spark_catalog.default.t +- Relation default.t[id#26,name#27] parquet at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:51) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:50) {code} > Analyzer can fail when join left side 
and right side are the same view > -- > > Key: SPARK-37932 > URL: https://issues.apache.org/jira/browse/SPARK-37932 > Project: Spark > Issue Type: Bug > Components: SQL >Aff
[jira] [Commented] (SPARK-37932) Analyzer can fail when join left side and right side are the same view
[ https://issues.apache.org/jira/browse/SPARK-37932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478542#comment-17478542 ] Feng Zhu commented on SPARK-37932: -- test {code:scala} test("SPARK-37932: view join view self with having filter") { withTable("t") { withView("v1") { Seq((2, "test2"), (3, "test3"), (1, "test1")).toDF("id", "name") .write.format("parquet").saveAsTable("t") sql("CREATE VIEW v1 (id, name) AS SELECT id, name FROM t") sql(""" |SELECT l1.id | FROM v1 l1 | INNER JOIN ( | SELECT id | FROM v1 | GROUP BY id | HAVING COUNT(DISTINCT name) > 1 | ) l2 | ON l1.id = l2.id | GROUP BY l1.name, l1.id; """.stripMargin) } } } {code} exception {code} org.apache.spark.sql.AnalysisException: Resolved attribute(s) name#25 missing from id#29,name#30 in operator !Aggregate [id#29], [id#29, count(distinct name#25) AS count(distinct name#25)#31L]. Attribute(s) with the same name appear in the operation: name. Please check if the right attribute(s) are used.; Aggregate [name#25, id#24], [id#24] +- Join Inner, (id#24 = id#29) :- SubqueryAlias l1 : +- SubqueryAlias spark_catalog.default.v1 : +- View (`default`.`v1`, [id#24,name#25]) : +- Project [cast(id#20 as int) AS id#24, cast(name#21 as string) AS name#25] : +- Project [id#20, name#21] : +- SubqueryAlias spark_catalog.default.t : +- Relation default.t[id#20,name#21] parquet +- SubqueryAlias l2 +- Project [id#29] +- Filter (count(distinct name#25)#31L > cast(1 as bigint)) +- !Aggregate [id#29], [id#29, count(distinct name#25) AS count(distinct name#25)#31L] +- SubqueryAlias spark_catalog.default.v1 +- View (`default`.`v1`, [id#29,name#30]) +- Project [cast(id#26 as int) AS id#29, cast(name#27 as string) AS name#30] +- Project [id#26, name#27] +- SubqueryAlias spark_catalog.default.t +- Relation default.t[id#26,name#27] parquet at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:51) at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:50) {code} > Analyzer can fail when join left side and right side are the same view > -- > > Key: SPARK-37932 > URL: https://issues.apache.org/jira/browse/SPARK-37932 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Feng Zhu >Priority: Major > Attachments: sql_and_exception > > > See the attachment for details, including SQL and the exception information. > * sql1, there is a normal filter (LO_SUPPKEY > 10) in the right side > subquery, Analyzer works as expected; > * sql2, there is a HAVING filter(HAVING COUNT(DISTINCT LO_SUPPKEY) > 1) in > the right side subquery, Analyzer failed with "Resolved attribute(s) > LO_SUPPKEY#337 missing ...". > From the debug info, the problem seems to be occurred after the rule > DeduplicateRelations is applied. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
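[Editor's note] The failure pattern in the plan above — the `!Aggregate` still referencing `name#25` while its child now produces `name#30` — is a missing expression-id remap. When the same view appears on both sides of a join, the second copy gets fresh attribute ids, and every operator above it must be rewritten with the same remap. A toy model (plain Python; ids stand in for Catalyst `ExprId`s, not Catalyst code):

```python
def remap_refs(expr_ids, remap):
    """Rewrite attribute references with the fresh ids assigned to a
    duplicated relation. Any reference left unmapped is exactly the
    'Resolved attribute(s) ... missing' error seen in the analyzer."""
    return [remap.get(i, i) for i in expr_ids]


# Fresh ids for the second copy of view v1: id#24 -> id#29, name#25 -> name#30.
remap = {24: 29, 25: 30}

# The buggy aggregate: `id` was remapped to 29, but `name` still uses 25.
buggy_agg_refs = [29, 25]
assert remap_refs(buggy_agg_refs, remap) == [29, 30]  # consistent after remap
```

In the reported bug, the HAVING filter's aggregate keeps the stale `name#25`, so `CheckAnalysis` sees a reference its child no longer produces.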
[jira] (SPARK-37932) Analyzer can fail when join left side and right side are the same view
[ https://issues.apache.org/jira/browse/SPARK-37932 ] Feng Zhu deleted comment on SPARK-37932: -- was (Author: fishcus): {code:scala} {code} > Analyzer can fail when join left side and right side are the same view > -- > > Key: SPARK-37932 > URL: https://issues.apache.org/jira/browse/SPARK-37932 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Feng Zhu >Priority: Major > Attachments: sql_and_exception > > > See the attachment for details, including SQL and the exception information. > * sql1, there is a normal filter (LO_SUPPKEY > 10) in the right side > subquery, Analyzer works as expected; > * sql2, there is a HAVING filter (HAVING COUNT(DISTINCT LO_SUPPKEY) > 1) in > the right side > subquery, Analyzer failed with "Resolved attribute(s) > LO_SUPPKEY#337 missing ...". > From the debug info, the problem seems to occur after the rule > DeduplicateRelations is applied. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37932) Analyzer can fail when join left side and right side are the same view
[ https://issues.apache.org/jira/browse/SPARK-37932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478537#comment-17478537 ] Feng Zhu commented on SPARK-37932: -- {code:scala} {code} > Analyzer can fail when join left side and right side are the same view > -- > > Key: SPARK-37932 > URL: https://issues.apache.org/jira/browse/SPARK-37932 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Feng Zhu >Priority: Major > Attachments: sql_and_exception > > > See the attachment for details, including SQL and the exception information. > * sql1, there is a normal filter (LO_SUPPKEY > 10) in the right side > subquery, Analyzer works as expected; > * sql2, there is a HAVING filter (HAVING COUNT(DISTINCT LO_SUPPKEY) > 1) in > the right side > subquery, Analyzer failed with "Resolved attribute(s) > LO_SUPPKEY#337 missing ...". > From the debug info, the problem seems to occur after the rule > DeduplicateRelations is applied. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37910) Spark executor self-exiting due to driver disassociated in Kubernetes with client deploy-mode
[ https://issues.apache.org/jira/browse/SPARK-37910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478482#comment-17478482 ] Petri commented on SPARK-37910: --- Also, the error message we get in the executor is pretty vague: Executor self-exiting due to : Driver 192-168-39-71.mni-system.pod.cluster.local:40752 disassociated! Shutting down. It raises questions: * What does the disassociation mean? Is it related to a disconnection? * Why must the executor self-exit? Would it be possible to retry driver association? It would be good to improve the error message and the related documentation. > Spark executor self-exiting due to driver disassociated in Kubernetes with > client deploy-mode > - > > Key: SPARK-37910 > URL: https://issues.apache.org/jira/browse/SPARK-37910 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: Petri >Priority: Major > > I have a Spark driver running in a Kubernetes pod with client deploy-mode and > it tries to start an executor. > The executor fails with this error: > \{"type":"log", "level":"ERROR", "name":"STREAMING_OTHERS", > "time":"2022-01-14T12:29:38.318Z", "timezone":"UTC", > "class":"dispatcher-Executor", > "method":"spark.executor.CoarseGrainedExecutorBackend.logError(73)", > "log":"Executor self-exiting due to : Driver > 192-168-39-71.mni-system.pod.cluster.local:40752 disassociated! Shutting > down.\n"} > Then the driver will attempt to start another executor, which fails with the same > error, and this goes on and on. 
> In the driver pod, I see only the following errors: > 22/01/14 12:26:32 ERROR TaskSchedulerImpl: Lost executor 1 on > 192.168.43.250: > 22/01/14 12:27:16 ERROR TaskSchedulerImpl: Lost executor 2 on > 192.168.43.233: > 22/01/14 12:27:59 ERROR TaskSchedulerImpl: Lost executor 3 on > 192.168.43.221: > 22/01/14 12:28:43 ERROR TaskSchedulerImpl: Lost executor 4 on > 192.168.43.217: > 22/01/14 12:29:27 ERROR TaskSchedulerImpl: Lost executor 5 on > 192.168.43.197: > 22/01/14 12:30:10 ERROR TaskSchedulerImpl: Lost executor 6 on > 192.168.43.237: > 22/01/14 12:30:53 ERROR TaskSchedulerImpl: Lost executor 7 on > 192.168.43.196: > 22/01/14 12:31:42 ERROR TaskSchedulerImpl: Lost executor 8 on > 192.168.43.228: > 22/01/14 12:32:31 ERROR TaskSchedulerImpl: Lost executor 9 on > 192.168.43.254: > 22/01/14 12:33:14 ERROR TaskSchedulerImpl: Lost executor 10 on > 192.168.43.204: > 22/01/14 12:33:57 ERROR TaskSchedulerImpl: Lost executor 11 on > 192.168.43.231: > What is wrong? And how can I get executors running correctly? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37910) Spark executor self-exiting due to driver disassociated in Kubernetes with client deploy-mode
[ https://issues.apache.org/jira/browse/SPARK-37910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478478#comment-17478478 ] Petri commented on SPARK-37910: --- In deployment.yaml we have: * name: POD_NAME valueFrom: fieldRef: fieldPath: metadata.name * name: SPARK_DRIVER_BIND_ADDRESS valueFrom: fieldRef: fieldPath: status.podIP * name: K8S_NS valueFrom: fieldRef: fieldPath: metadata.namespace We are setting the following confs for spark-submit: DRIVER_HOSTNAME=$(echo $SPARK_DRIVER_BIND_ADDRESS | sed 's/\./-/g') --conf spark.kubernetes.driver.pod.name=$POD_NAME \ --conf spark.driver.host=$DRIVER_HOSTNAME.$K8S_NS.pod.cluster.local \ So we are using the pod DNS name; is that OK, or should we use a headless service? Your documentation is not clear about it. What we are missing in our confs is spark.driver.port. Is that conf mandatory? Can you give exact steps for checking the pod network status? We have quite a similar setup in our other microservice, which is working OK with (Spark 3.2.0 and Java 11), but for some reason this microservice in question has the problem. > Spark executor self-exiting due to driver disassociated in Kubernetes with > client deploy-mode > - > > Key: SPARK-37910 > URL: https://issues.apache.org/jira/browse/SPARK-37910 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: Petri >Priority: Major > > I have a Spark driver running in a Kubernetes pod with client deploy-mode and > it tries to start an executor. > The executor fails with this error: > \{"type":"log", "level":"ERROR", "name":"STREAMING_OTHERS", > "time":"2022-01-14T12:29:38.318Z", "timezone":"UTC", > "class":"dispatcher-Executor", > "method":"spark.executor.CoarseGrainedExecutorBackend.logError(73)", > "log":"Executor self-exiting due to : Driver > 192-168-39-71.mni-system.pod.cluster.local:40752 disassociated! 
Shutting > down.\n"} > Then the driver will attempt to start another executor, which fails with the same > error, and this goes on and on. > In the driver pod, I see only the following errors: > 22/01/14 12:26:32 ERROR TaskSchedulerImpl: Lost executor 1 on > 192.168.43.250: > 22/01/14 12:27:16 ERROR TaskSchedulerImpl: Lost executor 2 on > 192.168.43.233: > 22/01/14 12:27:59 ERROR TaskSchedulerImpl: Lost executor 3 on > 192.168.43.221: > 22/01/14 12:28:43 ERROR TaskSchedulerImpl: Lost executor 4 on > 192.168.43.217: > 22/01/14 12:29:27 ERROR TaskSchedulerImpl: Lost executor 5 on > 192.168.43.197: > 22/01/14 12:30:10 ERROR TaskSchedulerImpl: Lost executor 6 on > 192.168.43.237: > 22/01/14 12:30:53 ERROR TaskSchedulerImpl: Lost executor 7 on > 192.168.43.196: > 22/01/14 12:31:42 ERROR TaskSchedulerImpl: Lost executor 8 on > 192.168.43.228: > 22/01/14 12:32:31 ERROR TaskSchedulerImpl: Lost executor 9 on > 192.168.43.254: > 22/01/14 12:33:14 ERROR TaskSchedulerImpl: Lost executor 10 on > 192.168.43.204: > 22/01/14 12:33:57 ERROR TaskSchedulerImpl: Lost executor 11 on > 192.168.43.231: > What is wrong? And how can I get executors running correctly? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37962) Cannot fetch remote jar correctly
[ https://issues.apache.org/jira/browse/SPARK-37962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478469#comment-17478469 ] Jinpeng Chi commented on SPARK-37962: - The root cause was that the link I had encoded was being decoded. > Cannot fetch remote jar correctly > - > > Key: SPARK-37962 > URL: https://issues.apache.org/jira/browse/SPARK-37962 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2, 3.2.0 >Reporter: Jinpeng Chi >Priority: Major > Attachments: image-2022-01-19-17-18-24-795.png, > image-2022-01-19-17-21-53-011.png > > > When my jar link address is encoded, the jar cannot be pulled correctly. > Log: > !image-2022-01-19-17-18-24-795.png! > > My static file server (Tomcat) log: > !image-2022-01-19-17-21-53-011.png! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
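The root cause described in the comment above (an already percent-encoded jar URL being decoded before the fetch) can be illustrated outside Spark with a few lines of plain Python. The URL below is hypothetical, and this is not Spark's actual fetch code; it only shows why a once-decoded link no longer matches the published path:

```python
from urllib.parse import unquote

# A jar link whose file name legitimately contains a percent-encoded '+'.
published_url = "http://files.example.com/builds/my%2Bapp.jar"

# If the fetcher decodes the URL before issuing the HTTP request, the
# server is asked for a different path than the one that was published,
# so the jar cannot be found.
requested_url = unquote(published_url)

print(requested_url)  # http://files.example.com/builds/my+app.jar
assert requested_url != published_url
```

In other words, decoding should happen at most once, on the server side; a client that decodes before fetching corrupts any link containing escaped characters.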
[jira] [Updated] (SPARK-37962) Cannot fetch remote jar correctly
[ https://issues.apache.org/jira/browse/SPARK-37962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinpeng Chi updated SPARK-37962: Description: When my Jar link address is encoded, the Jar cannot be pulled correctly Log: !image-2022-01-19-17-18-24-795.png! My static file server(tomcat) log: !image-2022-01-19-17-21-53-011.png! > Cannot fetch remote jar correctly > - > > Key: SPARK-37962 > URL: https://issues.apache.org/jira/browse/SPARK-37962 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2, 3.2.0 >Reporter: Jinpeng Chi >Priority: Major > Attachments: image-2022-01-19-17-18-24-795.png, > image-2022-01-19-17-21-53-011.png > > > When my Jar link address is encoded, the Jar cannot be pulled correctly > Log: > !image-2022-01-19-17-18-24-795.png! > > My static file server(tomcat) log: > !image-2022-01-19-17-21-53-011.png! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37962) Cannot fetch remote jar correctly
[ https://issues.apache.org/jira/browse/SPARK-37962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinpeng Chi updated SPARK-37962: Attachment: image-2022-01-19-17-21-53-011.png > Cannot fetch remote jar correctly > - > > Key: SPARK-37962 > URL: https://issues.apache.org/jira/browse/SPARK-37962 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2, 3.2.0 >Reporter: Jinpeng Chi >Priority: Major > Attachments: image-2022-01-19-17-18-24-795.png, > image-2022-01-19-17-21-53-011.png > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37962) Cannot fetch remote jar correctly
[ https://issues.apache.org/jira/browse/SPARK-37962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinpeng Chi updated SPARK-37962: Attachment: image-2022-01-19-17-18-24-795.png > Cannot fetch remote jar correctly > - > > Key: SPARK-37962 > URL: https://issues.apache.org/jira/browse/SPARK-37962 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2, 3.2.0 >Reporter: Jinpeng Chi >Priority: Major > Attachments: image-2022-01-19-17-18-24-795.png > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37962) Cannot fetch remote jar correctly
Jinpeng Chi created SPARK-37962: --- Summary: Cannot fetch remote jar correctly Key: SPARK-37962 URL: https://issues.apache.org/jira/browse/SPARK-37962 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.2.0, 3.1.2 Reporter: Jinpeng Chi Attachments: image-2022-01-19-17-18-24-795.png -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37961) override maxRows/maxRowsPerPartition for some logical operators
[ https://issues.apache.org/jira/browse/SPARK-37961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37961: Assignee: (was: Apache Spark) > override maxRows/maxRowsPerPartition for some logical operators > --- > > Key: SPARK-37961 > URL: https://issues.apache.org/jira/browse/SPARK-37961 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: zhengruifeng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37915) Push down deterministic projection through SQL UNION
[ https://issues.apache.org/jira/browse/SPARK-37915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478451#comment-17478451 ] Apache Spark commented on SPARK-37915: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/35249 > Push down deterministic projection through SQL UNION > > > Key: SPARK-37915 > URL: https://issues.apache.org/jira/browse/SPARK-37915 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > > {code:scala} > spark.range(11).selectExpr("cast(id as decimal(18, 1)) as a", "id as b", "id > as c").write.saveAsTable("t1") > spark.range(12).selectExpr("cast(id as decimal(18, 2)) as a", "id as b", "id > as c").write.saveAsTable("t2") > spark.range(13).selectExpr("cast(id as decimal(18, 3)) as a", "id as b", "id > as c").write.saveAsTable("t3") > spark.range(14).selectExpr("cast(id as decimal(18, 4)) as a", "id as b", "id > as c").write.saveAsTable("t4") > spark.range(15).selectExpr("cast(id as decimal(18, 5)) as a", "id as b", "id > as c").write.saveAsTable("t5") > sql("select a from t1 union select a from t2 union select a from t3 union > select a from t4 union select a from t5").explain(true) > {code} > Current: > {noformat} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[a#76], functions=[], output=[a#76]) >+- Exchange hashpartitioning(a#76, 5), ENSURE_REQUIREMENTS, [id=#159] > +- HashAggregate(keys=[a#76], functions=[], output=[a#76]) > +- Union > :- HashAggregate(keys=[a#74], functions=[], output=[a#76]) > : +- Exchange hashpartitioning(a#74, 5), ENSURE_REQUIREMENTS, > [id=#154] > : +- HashAggregate(keys=[a#74], functions=[], output=[a#74]) > :+- Union > : :- HashAggregate(keys=[a#72], functions=[], > output=[a#74]) > : : +- Exchange hashpartitioning(a#72, 5), > ENSURE_REQUIREMENTS, [id=#149] > : : +- HashAggregate(keys=[a#72], functions=[], > 
output=[a#72]) > : :+- Union > : : :- HashAggregate(keys=[a#70], > functions=[], output=[a#72]) > : : : +- Exchange hashpartitioning(a#70, 5), > ENSURE_REQUIREMENTS, [id=#144] > : : : +- HashAggregate(keys=[a#70], > functions=[], output=[a#70]) > : : :+- Union > : : : :- Project [cast(a#55 as > decimal(19,2)) AS a#70] > : : : : +- FileScan parquet > default.t1[a#55] Batched: true, DataFilters: [], Format: Parquet, Location: > InMemoryFileIndex(1 > paths)[file:/Users/yumwang/spark/SPARK-31890/external/avro/spark-warehouse/or..., > PartitionFilters: [], PushedFilters: [], ReadSchema: struct > : : : +- Project [cast(a#58 as > decimal(19,2)) AS a#71] > : : : +- FileScan parquet > default.t2[a#58] Batched: true, DataFilters: [], Format: Parquet, Location: > InMemoryFileIndex(1 > paths)[file:/Users/yumwang/spark/SPARK-31890/external/avro/spark-warehouse/or..., > PartitionFilters: [], PushedFilters: [], ReadSchema: struct > : : +- Project [cast(a#61 as decimal(20,3)) > AS a#73] > : : +- FileScan parquet default.t3[a#61] > Batched: true, DataFilters: [], Format: Parquet, Location: > InMemoryFileIndex(1 > paths)[file:/Users/yumwang/spark/SPARK-31890/external/avro/spark-warehouse/or..., > PartitionFilters: [], PushedFilters: [], ReadSchema: struct > : +- Project [cast(a#64 as decimal(21,4)) AS a#75] > : +- FileScan parquet default.t4[a#64] Batched: > true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 > paths)[file:/Users/yumwang/spark/SPARK-31890/external/avro/spark-warehouse/or..., > PartitionFilters: [], PushedFilters: [], ReadSchema: struct > +- Project [cast(a#67 as decimal(22,5)) AS a#77] >+- FileScan parquet default.t5[a#67] Batched: true, > DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 > paths)[file:/Users/yumwang/spark/SPARK-31890/external/avro/spark-warehouse/or..., > PartitionFilters: [], PushedFilters: [], ReadSchema: struct > {noformat} > Expected: > {noformat} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- 
HashAggregate(keys=[a#76], functions=[], output=[a#76]) >+- Exchange hashpartitioning(a#76, 5), ENSURE_REQUIREMENTS, [id=#111] > +- HashAggregat
[jira] [Assigned] (SPARK-37961) override maxRows/maxRowsPerPartition for some logical operators
[ https://issues.apache.org/jira/browse/SPARK-37961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37961: Assignee: Apache Spark > override maxRows/maxRowsPerPartition for some logical operators > --- > > Key: SPARK-37961 > URL: https://issues.apache.org/jira/browse/SPARK-37961 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37961) override maxRows/maxRowsPerPartition for some logical operators
[ https://issues.apache.org/jira/browse/SPARK-37961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478452#comment-17478452 ] Apache Spark commented on SPARK-37961: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/35250 > override maxRows/maxRowsPerPartition for some logical operators > --- > > Key: SPARK-37961 > URL: https://issues.apache.org/jira/browse/SPARK-37961 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: zhengruifeng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37960) Support aggregate push down SUM(CASE ... WHEN ... ELSE ... END)
[ https://issues.apache.org/jira/browse/SPARK-37960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37960: Assignee: (was: Apache Spark) > Support aggregate push down SUM(CASE ... WHEN ... ELSE ... END) > --- > > Key: SPARK-37960 > URL: https://issues.apache.org/jira/browse/SPARK-37960 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Priority: Major > > Currently, Spark supports aggregate push down SUM(column) into JDBC data > source. > SUM(CASE ... WHEN ... ELSE ... END) is very useful for users. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37960) Support aggregate push down SUM(CASE ... WHEN ... ELSE ... END)
[ https://issues.apache.org/jira/browse/SPARK-37960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37960: Assignee: Apache Spark > Support aggregate push down SUM(CASE ... WHEN ... ELSE ... END) > --- > > Key: SPARK-37960 > URL: https://issues.apache.org/jira/browse/SPARK-37960 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Assignee: Apache Spark >Priority: Major > > Currently, Spark supports aggregate push down SUM(column) into JDBC data > source. > SUM(CASE ... WHEN ... ELSE ... END) is very useful for users. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37960) Support aggregate push down SUM(CASE ... WHEN ... ELSE ... END)
[ https://issues.apache.org/jira/browse/SPARK-37960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478445#comment-17478445 ] Apache Spark commented on SPARK-37960: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/35248 > Support aggregate push down SUM(CASE ... WHEN ... ELSE ... END) > --- > > Key: SPARK-37960 > URL: https://issues.apache.org/jira/browse/SPARK-37960 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Priority: Major > > Currently, Spark supports aggregate push down SUM(column) into JDBC data > source. > SUM(CASE ... WHEN ... ELSE ... END) is very useful for users. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37951) Refactor ImageFileFormatSuite
[ https://issues.apache.org/jira/browse/SPARK-37951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-37951: --- Assignee: angerszhu > Refactor ImageFileFormatSuite > - > > Key: SPARK-37951 > URL: https://issues.apache.org/jira/browse/SPARK-37951 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > > It does not use the standard API and sometimes fails; optimize it. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37951) Refactor ImageFileFormatSuite
[ https://issues.apache.org/jira/browse/SPARK-37951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-37951. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35237 [https://github.com/apache/spark/pull/35237] > Refactor ImageFileFormatSuite > - > > Key: SPARK-37951 > URL: https://issues.apache.org/jira/browse/SPARK-37951 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.3.0 > > > It does not use the standard API and sometimes fails; optimize it. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37961) override maxRows/maxRowsPerPartition for some logical operators
zhengruifeng created SPARK-37961: Summary: override maxRows/maxRowsPerPartition for some logical operators Key: SPARK-37961 URL: https://issues.apache.org/jira/browse/SPARK-37961 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: zhengruifeng -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37960) Support aggregate push down SUM(CASE ... WHEN ... ELSE ... END)
[ https://issues.apache.org/jira/browse/SPARK-37960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-37960: --- Description: Currently, Spark supports aggregate push down SUM(column) into JDBC data source. SUM(CASE ... WHEN ... ELSE ... END) is very useful for users. was: Currently, Spark supports complete push down SUM(column) into JDBC data source. SUM(CASE ... WHEN ... ELSE ... END) is very useful for users. > Support aggregate push down SUM(CASE ... WHEN ... ELSE ... END) > --- > > Key: SPARK-37960 > URL: https://issues.apache.org/jira/browse/SPARK-37960 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Priority: Major > > Currently, Spark supports aggregate push down SUM(column) into JDBC data > source. > SUM(CASE ... WHEN ... ELSE ... END) is very useful for users. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37960) Support aggregate push down SUM(CASE ... WHEN ... ELSE ... END)
[ https://issues.apache.org/jira/browse/SPARK-37960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-37960: --- Summary: Support aggregate push down SUM(CASE ... WHEN ... ELSE ... END) (was: Support complete push down SUM(CASE ... WHEN ... ELSE ... END)) > Support aggregate push down SUM(CASE ... WHEN ... ELSE ... END) > --- > > Key: SPARK-37960 > URL: https://issues.apache.org/jira/browse/SPARK-37960 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Priority: Major > > Currently, Spark supports complete push down SUM(column) into JDBC data > source. > SUM(CASE ... WHEN ... ELSE ... END) is very useful for users. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30661) KMeans blockify input vectors
[ https://issues.apache.org/jira/browse/SPARK-30661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478437#comment-17478437 ] zhengruifeng commented on SPARK-30661: -- Recently, I spent some time testing blockified KMeans and applying GEMM to find the closest cluster. In short: 1, for sparse datasets, blockified KMeans still causes regressions in most cases (the existing impl with the triangle inequality can skip some distance computations, but Scala-based sparse BLAS always computes all distances); 2, for dense datasets and small k, blockified KMeans (without native BLAS) is competitive; with native BLAS, it should be significantly faster than the existing impl. So I plan to add a new parameter {{solver}} by making KMeans extend HasSolver and support both training impls, so that end users can switch to the blockified version. What do you think? [~srowen] [~WeichenXu123] [~mengxr] [~huaxingao] > KMeans blockify input vectors > - > > Key: SPARK-30661 > URL: https://issues.apache.org/jira/browse/SPARK-30661 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
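For context, the GEMM approach mentioned above relies on the expansion ||x - c||^2 = ||x||^2 - 2*x.c + ||c||^2: the ||x||^2 term is the same for every candidate center, so ranking centers only needs the cross terms x.c, which for a whole block of points is a single matrix-matrix multiply. Below is a dependency-free Python sketch of the idea; it is not the Spark MLlib implementation, and a real blockified KMeans would hand the dot products to native BLAS:

```python
def closest_centers(points, centers):
    """Return, for each point, the index of its nearest center.

    Uses ||x - c||^2 = ||x||^2 - 2*x.c + ||c||^2; since ||x||^2 is
    constant per point, comparing ||c||^2 - 2*x.c is sufficient.
    """
    center_sq = [sum(v * v for v in c) for c in centers]
    best = []
    for x in points:
        # In blockified KMeans this inner loop of dot products is
        # replaced by one GEMM over a whole block of points.
        scores = [sq - 2 * sum(a * b for a, b in zip(x, c))
                  for c, sq in zip(centers, center_sq)]
        best.append(scores.index(min(scores)))
    return best

points = [[0.0, 0.0], [5.0, 5.0], [0.9, 1.1]]
centers = [[0.0, 0.0], [5.0, 5.0], [1.0, 1.0]]
print(closest_centers(points, centers))  # [0, 1, 2]
```

This also illustrates the trade-off in the comment: a GEMM computes every point-center dot product unconditionally, so it cannot skip work the way the triangle-inequality bound does, which is why sparse datasets can regress.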
[jira] [Created] (SPARK-37960) Support complete push down SUM(CASE ... WHEN ... ELSE ... END)
jiaan.geng created SPARK-37960: -- Summary: Support complete push down SUM(CASE ... WHEN ... ELSE ... END) Key: SPARK-37960 URL: https://issues.apache.org/jira/browse/SPARK-37960 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.3.0 Reporter: jiaan.geng Currently, Spark supports complete push down SUM(column) into JDBC data source. SUM(CASE ... WHEN ... ELSE ... END) is very useful for users. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37959) Fix the UT of checking norm in KMeans & BiKMeans
[ https://issues.apache.org/jira/browse/SPARK-37959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37959: Assignee: Apache Spark > Fix the UT of checking norm in KMeans & BiKMeans > > > Key: SPARK-37959 > URL: https://issues.apache.org/jira/browse/SPARK-37959 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.3.0 >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Minor > > In KMeansSuite and BisectingKMeansSuite, there are some unused lines: > > {code:java} > model1.clusterCenters.forall(Vectors.norm(_, 2) == 1.0) {code} > > For cosine distance, the norm of each center vector should be 1, so the norm > checking is meaningful; > For Euclidean distance, the norm checking is meaningless; > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37959) Fix the UT of checking norm in KMeans & BiKMeans
[ https://issues.apache.org/jira/browse/SPARK-37959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37959: Assignee: (was: Apache Spark) > Fix the UT of checking norm in KMeans & BiKMeans > > > Key: SPARK-37959 > URL: https://issues.apache.org/jira/browse/SPARK-37959 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.3.0 >Reporter: zhengruifeng >Priority: Minor > > In KMeansSuite and BisectingKMeansSuite, there are some unused lines: > > {code:java} > model1.clusterCenters.forall(Vectors.norm(_, 2) == 1.0) {code} > > For cosine distance, the norm of each center vector should be 1, so the norm > checking is meaningful; > For Euclidean distance, the norm checking is meaningless; > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37959) Fix the UT of checking norm in KMeans & BiKMeans
[ https://issues.apache.org/jira/browse/SPARK-37959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37959: Assignee: Apache Spark > Fix the UT of checking norm in KMeans & BiKMeans > > > Key: SPARK-37959 > URL: https://issues.apache.org/jira/browse/SPARK-37959 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.3.0 >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Minor > > In KMeansSuite and BisectingKMeansSuite, there are some unused lines: > > {code:java} > model1.clusterCenters.forall(Vectors.norm(_, 2) == 1.0) {code} > > For cosine distance, the norm of each center vector should be 1, so the norm > checking is meaningful; > For Euclidean distance, the norm checking is meaningless; > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37959) Fix the UT of checking norm in KMeans & BiKMeans
[ https://issues.apache.org/jira/browse/SPARK-37959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478405#comment-17478405 ] Apache Spark commented on SPARK-37959: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/35247 > Fix the UT of checking norm in KMeans & BiKMeans > > > Key: SPARK-37959 > URL: https://issues.apache.org/jira/browse/SPARK-37959 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.3.0 >Reporter: zhengruifeng >Priority: Minor > > In KMeansSuite and BisectingKMeansSuite, there are some unused lines: > > {code:java} > model1.clusterCenters.forall(Vectors.norm(_, 2) == 1.0) {code} > > For cosine distance, the norm of each center vector should be 1, so the norm > checking is meaningful; > For Euclidean distance, the norm checking is meaningless; > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37959) Fix the UT of checking norm in KMeans & BiKMeans
zhengruifeng created SPARK-37959: Summary: Fix the UT of checking norm in KMeans & BiKMeans Key: SPARK-37959 URL: https://issues.apache.org/jira/browse/SPARK-37959 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.3.0 Reporter: zhengruifeng In KMeansSuite and BisectingKMeansSuite, there are some unused lines: {code:java} model1.clusterCenters.forall(Vectors.norm(_, 2) == 1.0) {code} For cosine distance, the norm of each center vector should be 1, so the norm checking is meaningful; For Euclidean distance, the norm checking is meaningless; -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
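The invariant behind the check in SPARK-37959 only makes sense for cosine distance, where cluster centers are kept at unit L2 norm. The following is an illustrative plain-Python sketch of that invariant, not the MLlib code; note also that an exact == 1.0 comparison on a floating-point norm (as in the quoted test line) is fragile, and a tolerance-based check is safer:

```python
import math

def l2_norm(v):
    """L2 (Euclidean) norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def normalize(v):
    """Scale a vector to unit L2 norm, as cosine-distance KMeans
    effectively does with its cluster centers."""
    n = l2_norm(v)
    return [x / n for x in v]

center = normalize([3.0, 4.0])
print(center)  # [0.6, 0.8]

# The meaningful assertion for cosine distance: every center has norm 1.
# Use a tolerance rather than ==, since the norm is a floating-point value.
assert math.isclose(l2_norm(center), 1.0)
```

For Euclidean distance, centers are plain means of the assigned points and carry no such norm constraint, which is why the same check is meaningless there.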