[jira] [Updated] (SPARK-37967) ConstantFolding/ Literal.create support ObjectType

2022-01-19 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-37967:
--
Summary: ConstantFolding/ Literal.create support ObjectType  (was: Literal 
support ObjectType)

> ConstantFolding/ Literal.create support ObjectType
> --
>
> Key: SPARK-37967
> URL: https://issues.apache.org/jira/browse/SPARK-37967
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>







[jira] [Created] (SPARK-37967) Literal support ObjectType

2022-01-19 Thread angerszhu (Jira)
angerszhu created SPARK-37967:
-

 Summary: Literal support ObjectType
 Key: SPARK-37967
 URL: https://issues.apache.org/jira/browse/SPARK-37967
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.2.0
Reporter: angerszhu









[jira] [Commented] (SPARK-37948) Disable mapreduce.fileoutputcommitter.algorithm.version=2 by default

2022-01-19 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479096#comment-17479096
 ] 

Hyukjin Kwon commented on SPARK-37948:
--

The problem is that users might intentionally enable the v2 protocol, so it makes 
less sense to warn about it and disable it. They may already know the risk and 
choose v2 deliberately. I personally think we should avoid assuming that the 
user's input is wrong.
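
For reference, a minimal sketch (app name and output path are illustrative, not 
from this ticket) of how a user explicitly opts in to the v2 committer, which is 
the intentional case described above:

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: explicitly opting in to the MR v2 commit algorithm.
// The committer config key is the one discussed in this ticket; the app name
// and output path are illustrative.
val spark = SparkSession.builder()
  .appName("v2-committer-example")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()

spark.range(10).write.mode("overwrite").parquet("/tmp/v2-committer-output")
{code}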

> Disable mapreduce.fileoutputcommitter.algorithm.version=2 by default
> 
>
> Key: SPARK-37948
> URL: https://issues.apache.org/jira/browse/SPARK-37948
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: hujiahua
>Priority: Major
>
> The Hadoop MR v2 commit algorithm has a correctness issue described in 
> SPARK-33019, which changed the default to 
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1. 
> But some Spark users like me were unaware of this correctness issue and had 
> used the v2 commit algorithm in Spark 2.x for performance reasons. After 
> upgrading to Spark 3.x, we hit this correctness issue in our production 
> environment, and it caused a very serious failure. The issue seemed to trigger 
> more often on Spark 3.x, though I didn't delve into the specific reasons. So I 
> propose we disable 
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 by default: if 
> users configure the v2 commit algorithm, fail the job and warn them about this 
> correctness issue. Alternatively, users could force v2 usage through a new 
> configuration.






[jira] [Updated] (SPARK-37954) old columns should not be available after select or drop

2022-01-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37954:
-
Component/s: SQL

> old columns should not be available after select or drop
> 
>
> Key: SPARK-37954
> URL: https://issues.apache.org/jira/browse/SPARK-37954
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.1
>Reporter: Jean Bon
>Priority: Major
>
>  
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import col as col
> spark = SparkSession.builder.appName('available_columns').getOrCreate()
> df = spark.range(5).select((col("id")+10).alias("id2"))
> assert df.columns==["id2"] #OK
> try:
>     df.select("id")
>     error_raise = False
> except:
>     error_raise = True
> assert error_raise #OK
> df = df.drop("id") #should raise an error
> df.filter(col("id")!=2).count() #returns 4, should raise an error
> {code}
>  






[jira] [Resolved] (SPARK-37839) DS V2 supports partial aggregate push-down AVG

2022-01-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37839.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35130
[https://github.com/apache/spark/pull/35130]

> DS V2 supports partial aggregate push-down AVG
> --
>
> Key: SPARK-37839
> URL: https://issues.apache.org/jira/browse/SPARK-37839
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently, DS V2 supports complete aggregate push-down of AVG, but supporting 
> partial aggregate push-down for AVG would also be very useful.






[jira] [Assigned] (SPARK-37839) DS V2 supports partial aggregate push-down AVG

2022-01-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-37839:
---

Assignee: jiaan.geng

> DS V2 supports partial aggregate push-down AVG
> --
>
> Key: SPARK-37839
> URL: https://issues.apache.org/jira/browse/SPARK-37839
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>
> Currently, DS V2 supports complete aggregate push-down of AVG, but supporting 
> partial aggregate push-down for AVG would also be very useful.






[jira] [Assigned] (SPARK-37966) Static insert should write _SUCCESS under partition path

2022-01-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37966:


Assignee: Apache Spark

> Static insert should write _SUCCESS under partition path
> 
>
> Key: SPARK-37966
> URL: https://issues.apache.org/jira/browse/SPARK-37966
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>
> Currently, a static partition insert writes the _SUCCESS file under the table 
> path when using a DataSource insert; this file should be written under the 
> partition path.






[jira] [Assigned] (SPARK-37966) Static insert should write _SUCCESS under partition path

2022-01-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37966:


Assignee: (was: Apache Spark)

> Static insert should write _SUCCESS under partition path
> 
>
> Key: SPARK-37966
> URL: https://issues.apache.org/jira/browse/SPARK-37966
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> Currently, a static partition insert writes the _SUCCESS file under the table 
> path when using a DataSource insert; this file should be written under the 
> partition path.






[jira] [Commented] (SPARK-37966) Static insert should write _SUCCESS under partition path

2022-01-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479082#comment-17479082
 ] 

Apache Spark commented on SPARK-37966:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/35254

> Static insert should write _SUCCESS under partition path
> 
>
> Key: SPARK-37966
> URL: https://issues.apache.org/jira/browse/SPARK-37966
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> Currently, a static partition insert writes the _SUCCESS file under the table 
> path when using a DataSource insert; this file should be written under the 
> partition path.






[jira] [Commented] (SPARK-37965) Remove check field name when reading/writing existing data in ORC

2022-01-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479079#comment-17479079
 ] 

Apache Spark commented on SPARK-37965:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/35253

> Remove check field name when reading/writing existing data in ORC
> -
>
> Key: SPARK-37965
> URL: https://issues.apache.org/jira/browse/SPARK-37965
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> Remove the field name check when reading existing data in ORC






[jira] [Assigned] (SPARK-37965) Remove check field name when reading/writing existing data in ORC

2022-01-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37965:


Assignee: Apache Spark

> Remove check field name when reading/writing existing data in ORC
> -
>
> Key: SPARK-37965
> URL: https://issues.apache.org/jira/browse/SPARK-37965
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>
> Remove the field name check when reading existing data in ORC






[jira] [Assigned] (SPARK-37965) Remove check field name when reading/writing existing data in ORC

2022-01-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37965:


Assignee: (was: Apache Spark)

> Remove check field name when reading/writing existing data in ORC
> -
>
> Key: SPARK-37965
> URL: https://issues.apache.org/jira/browse/SPARK-37965
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> Remove the field name check when reading existing data in ORC






[jira] [Created] (SPARK-37966) Static insert should write _SUCCESS under partition path

2022-01-19 Thread angerszhu (Jira)
angerszhu created SPARK-37966:
-

 Summary: Static insert should write _SUCCESS under partition path
 Key: SPARK-37966
 URL: https://issues.apache.org/jira/browse/SPARK-37966
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.2.0
Reporter: angerszhu


Currently, a static partition insert writes the _SUCCESS file under the table 
path when using a DataSource insert; this file should be written under the 
partition path.
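
As an illustration (table name, columns, and paths are hypothetical), a static 
partition insert like the one below currently leaves _SUCCESS at the table root 
rather than under the partition directory:

{code:scala}
// Sketch only: identifiers and values are illustrative.
spark.sql("CREATE TABLE t (i INT, p INT) USING PARQUET PARTITIONED BY (p)")
spark.sql("INSERT INTO t PARTITION (p = 1) VALUES (10)")

// Observed today:    <table location>/_SUCCESS
// Proposed instead:  <table location>/p=1/_SUCCESS
{code}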






[jira] [Updated] (SPARK-37965) Remove check field name when reading/writing existing data in ORC

2022-01-19 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-37965:
--
Summary: Remove check field name when reading/writing existing data in ORC  
(was: Remove check field name when reading existing data in parquet)

> Remove check field name when reading/writing existing data in ORC
> -
>
> Key: SPARK-37965
> URL: https://issues.apache.org/jira/browse/SPARK-37965
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>







[jira] [Updated] (SPARK-37965) Remove check field name when reading/writing existing data in ORC

2022-01-19 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-37965:
--
Description: Remove the field name check when reading existing data in ORC

> Remove check field name when reading/writing existing data in ORC
> -
>
> Key: SPARK-37965
> URL: https://issues.apache.org/jira/browse/SPARK-37965
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> Remove the field name check when reading existing data in ORC






[jira] [Created] (SPARK-37965) Remove check field name when reading existing data in parquet

2022-01-19 Thread angerszhu (Jira)
angerszhu created SPARK-37965:
-

 Summary: Remove check field name when reading existing data in 
parquet
 Key: SPARK-37965
 URL: https://issues.apache.org/jira/browse/SPARK-37965
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.2.0
Reporter: angerszhu









[jira] [Created] (SPARK-37964) Replace usages of slaveTracker to workerTracker in MapOutputTrackerSuite

2022-01-19 Thread Venkata krishnan Sowrirajan (Jira)
Venkata krishnan Sowrirajan created SPARK-37964:
---

 Summary: Replace usages of slaveTracker to workerTracker in 
MapOutputTrackerSuite
 Key: SPARK-37964
 URL: https://issues.apache.org/jira/browse/SPARK-37964
 Project: Spark
  Issue Type: Sub-task
  Components: Shuffle
Affects Versions: 3.2.0
Reporter: Venkata krishnan Sowrirajan









[jira] [Updated] (SPARK-37957) Deterministic flag is not handled for V2 functions

2022-01-19 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-37957:
-
Fix Version/s: 3.2.1

> Deterministic flag is not handled for V2 functions
> --
>
> Key: SPARK-37957
> URL: https://issues.apache.org/jira/browse/SPARK-37957
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.2.1, 3.3.0
>
>







[jira] [Resolved] (SPARK-37957) Deterministic flag is not handled for V2 functions

2022-01-19 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-37957.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35243
[https://github.com/apache/spark/pull/35243]

> Deterministic flag is not handled for V2 functions
> --
>
> Key: SPARK-37957
> URL: https://issues.apache.org/jira/browse/SPARK-37957
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.3.0
>
>







[jira] [Assigned] (SPARK-37154) Inline type hints for python/pyspark/rdd.py

2022-01-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37154:


Assignee: (was: Apache Spark)

> Inline type hints for python/pyspark/rdd.py
> ---
>
> Key: SPARK-37154
> URL: https://issues.apache.org/jira/browse/SPARK-37154
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Byron Hsu
>Priority: Major
>







[jira] [Assigned] (SPARK-37154) Inline type hints for python/pyspark/rdd.py

2022-01-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37154:


Assignee: Apache Spark

> Inline type hints for python/pyspark/rdd.py
> ---
>
> Key: SPARK-37154
> URL: https://issues.apache.org/jira/browse/SPARK-37154
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Byron Hsu
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-37154) Inline type hints for python/pyspark/rdd.py

2022-01-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478948#comment-17478948
 ] 

Apache Spark commented on SPARK-37154:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/35252

> Inline type hints for python/pyspark/rdd.py
> ---
>
> Key: SPARK-37154
> URL: https://issues.apache.org/jira/browse/SPARK-37154
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Byron Hsu
>Priority: Major
>







[jira] [Closed] (SPARK-37910) Spark executor self-exiting due to driver disassociated in Kubernetes with client deploy-mode

2022-01-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-37910.
-

> Spark executor self-exiting due to driver disassociated in Kubernetes with 
> client deploy-mode
> -
>
> Key: SPARK-37910
> URL: https://issues.apache.org/jira/browse/SPARK-37910
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Petri
>Priority: Major
>
> I have a Spark driver running in a Kubernetes pod in client deploy-mode, and 
> it tries to start an executor.
> The executor fails with this error:
>     \{"type":"log", "level":"ERROR", "name":"STREAMING_OTHERS", 
> "time":"2022-01-14T12:29:38.318Z", "timezone":"UTC", 
> "class":"dispatcher-Executor", 
> "method":"spark.executor.CoarseGrainedExecutorBackend.logError(73)", 
> "log":"Executor self-exiting due to : Driver 
> 192-168-39-71.mni-system.pod.cluster.local:40752 disassociated! Shutting 
> down.\n"}
> Then the driver attempts to start another executor, which fails with the same 
> error, and this goes on and on.
> In the driver pod, I see only the following errors:
>     22/01/14 12:26:32 ERROR TaskSchedulerImpl: Lost executor 1 on 
> 192.168.43.250:
>     22/01/14 12:27:16 ERROR TaskSchedulerImpl: Lost executor 2 on 
> 192.168.43.233:
>     22/01/14 12:27:59 ERROR TaskSchedulerImpl: Lost executor 3 on 
> 192.168.43.221:
>     22/01/14 12:28:43 ERROR TaskSchedulerImpl: Lost executor 4 on 
> 192.168.43.217:
>     22/01/14 12:29:27 ERROR TaskSchedulerImpl: Lost executor 5 on 
> 192.168.43.197:
>     22/01/14 12:30:10 ERROR TaskSchedulerImpl: Lost executor 6 on 
> 192.168.43.237:
>     22/01/14 12:30:53 ERROR TaskSchedulerImpl: Lost executor 7 on 
> 192.168.43.196:
>     22/01/14 12:31:42 ERROR TaskSchedulerImpl: Lost executor 8 on 
> 192.168.43.228:
>     22/01/14 12:32:31 ERROR TaskSchedulerImpl: Lost executor 9 on 
> 192.168.43.254:
>     22/01/14 12:33:14 ERROR TaskSchedulerImpl: Lost executor 10 on 
> 192.168.43.204:
>     22/01/14 12:33:57 ERROR TaskSchedulerImpl: Lost executor 11 on 
> 192.168.43.231:
> What is wrong? And how can I get executors running correctly?






[jira] [Resolved] (SPARK-37910) Spark executor self-exiting due to driver disassociated in Kubernetes with client deploy-mode

2022-01-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-37910.
---
Resolution: Invalid

> Spark executor self-exiting due to driver disassociated in Kubernetes with 
> client deploy-mode
> -
>
> Key: SPARK-37910
> URL: https://issues.apache.org/jira/browse/SPARK-37910
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Petri
>Priority: Major
>
> I have a Spark driver running in a Kubernetes pod in client deploy-mode, and 
> it tries to start an executor.
> The executor fails with this error:
>     \{"type":"log", "level":"ERROR", "name":"STREAMING_OTHERS", 
> "time":"2022-01-14T12:29:38.318Z", "timezone":"UTC", 
> "class":"dispatcher-Executor", 
> "method":"spark.executor.CoarseGrainedExecutorBackend.logError(73)", 
> "log":"Executor self-exiting due to : Driver 
> 192-168-39-71.mni-system.pod.cluster.local:40752 disassociated! Shutting 
> down.\n"}
> Then the driver attempts to start another executor, which fails with the same 
> error, and this goes on and on.
> In the driver pod, I see only the following errors:
>     22/01/14 12:26:32 ERROR TaskSchedulerImpl: Lost executor 1 on 
> 192.168.43.250:
>     22/01/14 12:27:16 ERROR TaskSchedulerImpl: Lost executor 2 on 
> 192.168.43.233:
>     22/01/14 12:27:59 ERROR TaskSchedulerImpl: Lost executor 3 on 
> 192.168.43.221:
>     22/01/14 12:28:43 ERROR TaskSchedulerImpl: Lost executor 4 on 
> 192.168.43.217:
>     22/01/14 12:29:27 ERROR TaskSchedulerImpl: Lost executor 5 on 
> 192.168.43.197:
>     22/01/14 12:30:10 ERROR TaskSchedulerImpl: Lost executor 6 on 
> 192.168.43.237:
>     22/01/14 12:30:53 ERROR TaskSchedulerImpl: Lost executor 7 on 
> 192.168.43.196:
>     22/01/14 12:31:42 ERROR TaskSchedulerImpl: Lost executor 8 on 
> 192.168.43.228:
>     22/01/14 12:32:31 ERROR TaskSchedulerImpl: Lost executor 9 on 
> 192.168.43.254:
>     22/01/14 12:33:14 ERROR TaskSchedulerImpl: Lost executor 10 on 
> 192.168.43.204:
>     22/01/14 12:33:57 ERROR TaskSchedulerImpl: Lost executor 11 on 
> 192.168.43.231:
> What is wrong? And how can I get executors running correctly?






[jira] [Commented] (SPARK-37910) Spark executor self-exiting due to driver disassociated in Kubernetes with client deploy-mode

2022-01-19 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478934#comment-17478934
 ] 

Dongjoon Hyun commented on SPARK-37910:
---

Hi, [~Silen].
Apache Spark JIRA issues are not supposed to be used for Q&A. Could you use the 
mailing list or StackOverflow instead?
- https://spark.apache.org/community.html

Let me close this first, because this issue seems to be a misuse of JIRA.

> Spark executor self-exiting due to driver disassociated in Kubernetes with 
> client deploy-mode
> -
>
> Key: SPARK-37910
> URL: https://issues.apache.org/jira/browse/SPARK-37910
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Petri
>Priority: Major
>
> I have a Spark driver running in a Kubernetes pod in client deploy-mode, and 
> it tries to start an executor.
> The executor fails with this error:
>     \{"type":"log", "level":"ERROR", "name":"STREAMING_OTHERS", 
> "time":"2022-01-14T12:29:38.318Z", "timezone":"UTC", 
> "class":"dispatcher-Executor", 
> "method":"spark.executor.CoarseGrainedExecutorBackend.logError(73)", 
> "log":"Executor self-exiting due to : Driver 
> 192-168-39-71.mni-system.pod.cluster.local:40752 disassociated! Shutting 
> down.\n"}
> Then the driver attempts to start another executor, which fails with the same 
> error, and this goes on and on.
> In the driver pod, I see only the following errors:
>     22/01/14 12:26:32 ERROR TaskSchedulerImpl: Lost executor 1 on 
> 192.168.43.250:
>     22/01/14 12:27:16 ERROR TaskSchedulerImpl: Lost executor 2 on 
> 192.168.43.233:
>     22/01/14 12:27:59 ERROR TaskSchedulerImpl: Lost executor 3 on 
> 192.168.43.221:
>     22/01/14 12:28:43 ERROR TaskSchedulerImpl: Lost executor 4 on 
> 192.168.43.217:
>     22/01/14 12:29:27 ERROR TaskSchedulerImpl: Lost executor 5 on 
> 192.168.43.197:
>     22/01/14 12:30:10 ERROR TaskSchedulerImpl: Lost executor 6 on 
> 192.168.43.237:
>     22/01/14 12:30:53 ERROR TaskSchedulerImpl: Lost executor 7 on 
> 192.168.43.196:
>     22/01/14 12:31:42 ERROR TaskSchedulerImpl: Lost executor 8 on 
> 192.168.43.228:
>     22/01/14 12:32:31 ERROR TaskSchedulerImpl: Lost executor 9 on 
> 192.168.43.254:
>     22/01/14 12:33:14 ERROR TaskSchedulerImpl: Lost executor 10 on 
> 192.168.43.204:
>     22/01/14 12:33:57 ERROR TaskSchedulerImpl: Lost executor 11 on 
> 192.168.43.231:
> What is wrong? And how can I get executors running correctly?






[jira] [Commented] (SPARK-24432) Add support for dynamic resource allocation

2022-01-19 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478907#comment-17478907
 ] 

Dongjoon Hyun commented on SPARK-24432:
---

[~pralabhkumar]. What you are looking at is not a single PR. For the 
configuration, please check 
{code}
spark.dynamicAllocation.* (including spark.dynamicAllocation.shuffleTracking.*)
spark.decommission.*
spark.storage.decommission.*
{code}

In addition, the `master` branch already targets Apache Spark 3.3.0. It seems 
that you are using outdated Spark versions.

bq. The K8s dynamic allocation with storage migration between executors is 
already in `master` branch for Apache Spark 3.1.0.

If you haven't tried the latest Apache Spark 3.2 yet, please try Apache Spark 
3.2.1 RC2. Although it's not Apache Spark 3.3.0-SNAPSHOT, it has most of the 
features you need.
- https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-bin/
- https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-docs/
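
For reference, a minimal sketch of the kind of settings involved (values are 
illustrative, not recommendations):

{code:scala}
import org.apache.spark.SparkConf

// Sketch only: enabling dynamic allocation with shuffle tracking plus storage
// decommissioning on K8s. Tune the values for your workload.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.shuffleTracking.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "10")
  .set("spark.decommission.enabled", "true")
  .set("spark.storage.decommission.enabled", "true")
  .set("spark.storage.decommission.shuffleBlocks.enabled", "true")
{code}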

> Add support for dynamic resource allocation
> ---
>
> Key: SPARK-24432
> URL: https://issues.apache.org/jira/browse/SPARK-24432
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Yinan Li
>Priority: Major
>
> This is an umbrella ticket for work on adding support for dynamic resource 
> allocation into the Kubernetes mode. This requires a Kubernetes-specific 
> external shuffle service. The feature is available in our fork at 
> github.com/apache-spark-on-k8s/spark.






[jira] [Assigned] (SPARK-37934) Upgrade Jetty version to 9.4.44

2022-01-19 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-37934:


Assignee: Sajith A

> Upgrade Jetty version to 9.4.44
> ---
>
> Key: SPARK-37934
> URL: https://issues.apache.org/jira/browse/SPARK-37934
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Sajith A
>Assignee: Sajith A
>Priority: Minor
> Fix For: 3.3.0
>
>
> Upgrade Jetty version to 9.4.44.v20210927 in current Spark master to bring-in 
> the fixes for the 
> [jetty#6973|https://github.com/eclipse/jetty.project/issues/6973] issue.






[jira] [Updated] (SPARK-37934) Upgrade Jetty version to 9.4.44

2022-01-19 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-37934:
-
Issue Type: Improvement  (was: Bug)

> Upgrade Jetty version to 9.4.44
> ---
>
> Key: SPARK-37934
> URL: https://issues.apache.org/jira/browse/SPARK-37934
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Sajith A
>Priority: Minor
>
> Upgrade Jetty version to 9.4.44.v20210927 in current Spark master to bring-in 
> the fixes for the 
> [jetty#6973|https://github.com/eclipse/jetty.project/issues/6973] issue.






[jira] [Resolved] (SPARK-37934) Upgrade Jetty version to 9.4.44

2022-01-19 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-37934.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35230
[https://github.com/apache/spark/pull/35230]

> Upgrade Jetty version to 9.4.44
> ---
>
> Key: SPARK-37934
> URL: https://issues.apache.org/jira/browse/SPARK-37934
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Sajith A
>Priority: Minor
> Fix For: 3.3.0
>
>
> Upgrade Jetty version to 9.4.44.v20210927 in current Spark master to bring-in 
> the fixes for the 
> [jetty#6973|https://github.com/eclipse/jetty.project/issues/6973] issue.






[jira] [Commented] (SPARK-37690) Recursive view `df` detected (cycle: `df` -> `df`)

2022-01-19 Thread Kiran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478861#comment-17478861
 ] 

Kiran commented on SPARK-37690:
---

Got this issue with Spark 3.2.0. I have been looking for workarounds, but none 
have worked so far.

> Recursive view `df` detected (cycle: `df` -> `df`)
> --
>
> Key: SPARK-37690
> URL: https://issues.apache.org/jira/browse/SPARK-37690
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Robin
>Priority: Major
>
> In Spark 3.2.0, you can no longer reuse the same name for a temporary view.  
> This change is backwards incompatible, and means a common way of running 
> pipelines of SQL queries no longer works.   The following is a simple 
> reproducible example that works in Spark 2.x and 3.1.2, but not in 3.2.0: 
> {code:python}from pyspark.context import SparkContext 
> from pyspark.sql import SparkSession 
> sc = SparkContext.getOrCreate() 
> spark = SparkSession(sc) 
> sql = """ SELECT id as col_1, rand() AS col_2 FROM RANGE(10); """ 
> df = spark.sql(sql) 
> df.createOrReplaceTempView("df") 
> sql = """ SELECT * FROM df """ 
> df = spark.sql(sql) 
> df.createOrReplaceTempView("df") 
> sql = """ SELECT * FROM df """ 
> df = spark.sql(sql) {code}   
> The following error is now produced:   
> {code:python}AnalysisException: Recursive view `df` detected (cycle: `df` -> 
> `df`) 
> {code} 
> I'm reasonably sure this change is unintentional in 3.2.0 since it breaks a 
> lot of legacy code, and the `createOrReplaceTempView` method is named 
> explicitly such that replacing an existing view should be allowed.   An 
> internet search suggests other users have run into similar problems, e.g. 
> [here|https://community.databricks.com/s/question/0D53f1Qugr7CAB/upgrading-from-spark-24-to-32-recursive-view-errors-when-using]
>   






[jira] [Assigned] (SPARK-37928) Add Parquet Data Page V2 bench scenario to DataSourceReadBenchmark

2022-01-19 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-37928:


Assignee: Yang Jie

> Add Parquet Data Page V2 bench scenario to DataSourceReadBenchmark
> --
>
> Key: SPARK-37928
> URL: https://issues.apache.org/jira/browse/SPARK-37928
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>







[jira] [Resolved] (SPARK-37928) Add Parquet Data Page V2 bench scenario to DataSourceReadBenchmark

2022-01-19 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-37928.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35226
[https://github.com/apache/spark/pull/35226]

> Add Parquet Data Page V2 bench scenario to DataSourceReadBenchmark
> --
>
> Key: SPARK-37928
> URL: https://issues.apache.org/jira/browse/SPARK-37928
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.3.0
>
>







[jira] [Resolved] (SPARK-37959) Fix the UT of checking norm in KMeans & BiKMeans

2022-01-19 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao resolved SPARK-37959.

Fix Version/s: 3.2.1
   3.3.0
 Assignee: zhengruifeng  (was: Apache Spark)
   Resolution: Fixed

> Fix the UT of checking norm in KMeans & BiKMeans
> 
>
> Key: SPARK-37959
> URL: https://issues.apache.org/jira/browse/SPARK-37959
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.3.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 3.2.1, 3.3.0
>
>
> In KMeansSuite and BisectingKMeansSuite, there are some unused lines:
>  
> {code:java}
> model1.clusterCenters.forall(Vectors.norm(_, 2) == 1.0) {code}
>  
> For cosine distance, the norm of each cluster center should be 1, so the norm 
> check is meaningful;
> For Euclidean distance, the norm check is meaningless.
>  
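
A minimal sketch of what the corrected check could look like (assuming `model1` 
and the `distanceMeasure` string are available in the suite; names are 
illustrative):

{code:scala}
import org.apache.spark.ml.linalg.Vectors

// Sketch only: the unit-norm property holds for cosine distance, so assert it
// only in that case; for Euclidean distance there is nothing to check.
if (distanceMeasure == "cosine") {
  assert(model1.clusterCenters.forall(c => math.abs(Vectors.norm(c, 2) - 1.0) < 1e-6))
}
{code}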






[jira] [Commented] (SPARK-32165) SessionState leaks SparkListener with multiple SparkSession

2022-01-19 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478698#comment-17478698
 ] 

Wenchen Fan commented on SPARK-32165:
-

I'm closing this as it won't happen in the real world. There should only be one 
`SharedState` instance per driver JVM.

> SessionState leaks SparkListener with multiple SparkSession
> ---
>
> Key: SPARK-32165
> URL: https://issues.apache.org/jira/browse/SPARK-32165
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xianjin YE
>Priority: Major
>
> Copied from 
> [https://github.com/apache/spark/pull/28128#issuecomment-653102770]
> I'd like to point out that this pr 
> (https://github.com/apache/spark/pull/28128) doesn't fix the memory leak 
> completely. Once {{SessionState}} is touched, it will add two more listeners 
> into the SparkContext, namely {{SQLAppStatusListener}} and 
> {{ExecutionListenerBus}}
> It can be reproduced easily as
> {code:java}
>   test("SPARK-31354: SparkContext only register one SparkSession 
> ApplicationEnd listener") {
> val conf = new SparkConf()
>   .setMaster("local")
>   .setAppName("test-app-SPARK-31354-1")
> val context = new SparkContext(conf)
> SparkSession
>   .builder()
>   .sparkContext(context)
>   .master("local")
>   .getOrCreate()
>   .sessionState // this touches the sessionState
> val postFirstCreation = context.listenerBus.listeners.size()
> SparkSession.clearActiveSession()
> SparkSession.clearDefaultSession()
> SparkSession
>   .builder()
>   .sparkContext(context)
>   .master("local")
>   .getOrCreate()
>   .sessionState // this touches the sessionState
> val postSecondCreation = context.listenerBus.listeners.size()
> SparkSession.clearActiveSession()
> SparkSession.clearDefaultSession()
> assert(postFirstCreation == postSecondCreation)
>   }
> {code}
> The problem can be reproduced by the above code.






[jira] [Resolved] (SPARK-32165) SessionState leaks SparkListener with multiple SparkSession

2022-01-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32165.
-
Resolution: Not A Problem

> SessionState leaks SparkListener with multiple SparkSession
> ---
>
> Key: SPARK-32165
> URL: https://issues.apache.org/jira/browse/SPARK-32165
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xianjin YE
>Priority: Major
>
> Copied from 
> [https://github.com/apache/spark/pull/28128#issuecomment-653102770]
> I'd like to point out that this pr 
> (https://github.com/apache/spark/pull/28128) doesn't fix the memory leak 
> completely. Once {{SessionState}} is touched, it will add two more listeners 
> into the SparkContext, namely {{SQLAppStatusListener}} and 
> {{ExecutionListenerBus}}
> It can be reproduced easily as
> {code:java}
>   test("SPARK-31354: SparkContext only register one SparkSession 
> ApplicationEnd listener") {
> val conf = new SparkConf()
>   .setMaster("local")
>   .setAppName("test-app-SPARK-31354-1")
> val context = new SparkContext(conf)
> SparkSession
>   .builder()
>   .sparkContext(context)
>   .master("local")
>   .getOrCreate()
>   .sessionState // this touches the sessionState
> val postFirstCreation = context.listenerBus.listeners.size()
> SparkSession.clearActiveSession()
> SparkSession.clearDefaultSession()
> SparkSession
>   .builder()
>   .sparkContext(context)
>   .master("local")
>   .getOrCreate()
>   .sessionState // this touches the sessionState
> val postSecondCreation = context.listenerBus.listeners.size()
> SparkSession.clearActiveSession()
> SparkSession.clearDefaultSession()
> assert(postFirstCreation == postSecondCreation)
>   }
> {code}
> The problem can be reproduced by the above code.






[jira] [Commented] (SPARK-30661) KMeans blockify input vectors

2022-01-19 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478678#comment-17478678
 ] 

Sean R. Owen commented on SPARK-30661:
--

How much difference does it make? I'm weighing the cost of a new user parameter 
and more code vs benefit.

I would, I suppose, not expect clustering input to be exceptionally sparse. 
Sparse often implies high dimensional, and everything is far from everything in 
high dimensions, so clustering makes less sense. If anything that is an 
argument for your change. I am just wondering out loud whether we should even 
change the default to the blocked implementation, if this proceeds.

> KMeans blockify input vectors
> -
>
> Key: SPARK-30661
> URL: https://issues.apache.org/jira/browse/SPARK-30661
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>







[jira] [Resolved] (SPARK-37955) PartitioningAwareFileIndex->basePath incorrectly contains the partition filters

2022-01-19 Thread Andreas Chatzistergiou (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Chatzistergiou resolved SPARK-37955.

Resolution: Not A Bug

> PartitioningAwareFileIndex->basePath incorrectly contains the partition 
> filters
> ---
>
> Key: SPARK-37955
> URL: https://issues.apache.org/jira/browse/SPARK-37955
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Andreas Chatzistergiou
>Priority: Minor
>
> The PartitioningAwareFileIndex.getBasePath method returns paths that contain 
> the partitioning directories. This violates the definition of the basePath per 
> FileIndex, i.e. the parent directory of a file path with all the partitioning 
> directories stripped off. 
> This PR fixes the issue by separating the notions of partitioningPaths and 
> basePaths in the PartitioningAwareFileIndex. The basePaths are derived by 
> stripping any partitioning columns from the partitioningPaths with the aid of 
> the PartitioningSchema.






[jira] [Commented] (SPARK-37963) Need to update Partition URI after renaming table in InMemoryCatalog

2022-01-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478635#comment-17478635
 ] 

Apache Spark commented on SPARK-37963:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35251

> Need to update Partition URI after renaming table in InMemoryCatalog
> 
>
> Key: SPARK-37963
> URL: https://issues.apache.org/jira/browse/SPARK-37963
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> After renaming a partitioned table, selecting from the renamed table in 
> InMemoryCatalog returns an empty result.
> The following checkAnswer will fail as the result is empty.
> {code:java}
> sql(s"create table foo(i int, j int) using PARQUET partitioned by (j)")
> sql("insert into table foo partition(j=2) values (1)")
> sql(s"alter table foo rename to bar")
> checkAnswer(spark.table("bar"), Row(1, 2)) {code}
> To fix the bug, we need to update Partition URI after renaming a table in 
> InMemoryCatalog
>  






[jira] [Assigned] (SPARK-37963) Need to update Partition URI after renaming table in InMemoryCatalog

2022-01-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37963:


Assignee: Gengliang Wang  (was: Apache Spark)

> Need to update Partition URI after renaming table in InMemoryCatalog
> 
>
> Key: SPARK-37963
> URL: https://issues.apache.org/jira/browse/SPARK-37963
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> After renaming a partitioned table, selecting from the renamed table in 
> InMemoryCatalog returns an empty result.
> The following checkAnswer will fail as the result is empty.
> {code:java}
> sql(s"create table foo(i int, j int) using PARQUET partitioned by (j)")
> sql("insert into table foo partition(j=2) values (1)")
> sql(s"alter table foo rename to bar")
> checkAnswer(spark.table("bar"), Row(1, 2)) {code}
> To fix the bug, we need to update Partition URI after renaming a table in 
> InMemoryCatalog
>  






[jira] [Assigned] (SPARK-37963) Need to update Partition URI after renaming table in InMemoryCatalog

2022-01-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37963:


Assignee: Apache Spark  (was: Gengliang Wang)

> Need to update Partition URI after renaming table in InMemoryCatalog
> 
>
> Key: SPARK-37963
> URL: https://issues.apache.org/jira/browse/SPARK-37963
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> After renaming a partitioned table, selecting from the renamed table in 
> InMemoryCatalog returns an empty result.
> The following checkAnswer will fail as the result is empty.
> {code:java}
> sql(s"create table foo(i int, j int) using PARQUET partitioned by (j)")
> sql("insert into table foo partition(j=2) values (1)")
> sql(s"alter table foo rename to bar")
> checkAnswer(spark.table("bar"), Row(1, 2)) {code}
> To fix the bug, we need to update Partition URI after renaming a table in 
> InMemoryCatalog
>  






[jira] [Created] (SPARK-37963) Need to update Partition URI after renaming table in InMemoryCatalog

2022-01-19 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-37963:
--

 Summary: Need to update Partition URI after renaming table in 
InMemoryCatalog
 Key: SPARK-37963
 URL: https://issues.apache.org/jira/browse/SPARK-37963
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang


After renaming a partitioned table, selecting from the renamed table in 
InMemoryCatalog returns an empty result.

The following checkAnswer will fail as the result is empty.
{code:java}
sql(s"create table foo(i int, j int) using PARQUET partitioned by (j)")
sql("insert into table foo partition(j=2) values (1)")
sql(s"alter table foo rename to bar")
checkAnswer(spark.table("bar"), Row(1, 2)) {code}
To fix the bug, we need to update Partition URI after renaming a table in 
InMemoryCatalog

 






[jira] [Commented] (SPARK-24432) Add support for dynamic resource allocation

2022-01-19 Thread pralabhkumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478618#comment-17478618
 ] 

pralabhkumar commented on SPARK-24432:
--

[~dongjoon] 

One quick question:
 - The K8s dynamic allocation with storage migration between executors is 
already in `master` branch for Apache Spark 3.1.0.

If you can please provide the PR which does that, it would be really helpful.

> Add support for dynamic resource allocation
> ---
>
> Key: SPARK-24432
> URL: https://issues.apache.org/jira/browse/SPARK-24432
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Yinan Li
>Priority: Major
>
> This is an umbrella ticket for work on adding support for dynamic resource 
> allocation into the Kubernetes mode. This requires a Kubernetes-specific 
> external shuffle service. The feature is available in our fork at 
> github.com/apache-spark-on-k8s/spark.






[jira] [Commented] (SPARK-34805) PySpark loses metadata in DataFrame fields when selecting nested columns

2022-01-19 Thread Kevin Wallimann (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478610#comment-17478610
 ] 

Kevin Wallimann commented on SPARK-34805:
-

The problem happens in Scala as well. I attached a scala file 
[^nested_columns_metadata.scala] to demonstrate the issue. I tried it in the 
spark-shell of versions 2.4.7, 3.1.2 and 3.2.0, always with the same result. 
This behavior is a bug, because the documentation for {{StructField}} clearly 
says that the "metadata should be preserved during transformation if the 
content of the column is not modified, e.g, in selection"
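
A minimal sketch of the kind of Scala reproduction described above (assuming a 
SparkSession `spark` is in scope; field names and metadata values are 
illustrative):

{code:scala}
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Sketch only: build a DataFrame whose nested field carries metadata, then
// select the nested field and inspect whether the metadata survives.
val metadata = new MetadataBuilder().putString("comment", "some metadata").build()
val schema = StructType(Seq(
  StructField("Field0", StructType(Seq(
    StructField("SubField0", StringType, nullable = true, metadata = metadata))))))
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(Row("x")))), schema)

// Non-empty, as defined above:
println(df.schema.fields(0).dataType.asInstanceOf[StructType].fields(0).metadata)
// Reported to come back empty after selecting the nested column:
println(df.select(col("Field0.SubField0")).schema.fields(0).metadata)
{code}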

> PySpark loses metadata in DataFrame fields when selecting nested columns
> 
>
> Key: SPARK-34805
> URL: https://issues.apache.org/jira/browse/SPARK-34805
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.1, 3.1.1
>Reporter: Mark Ressler
>Priority: Major
> Attachments: jsonMetadataTest.py, nested_columns_metadata.scala
>
>
> For a DataFrame schema with nested StructTypes, where metadata is set for 
> fields in the schema, that metadata is lost when a DataFrame selects nested 
> fields.  For example, suppose
> {code:java}
> df.schema.fields[0].dataType.fields[0].metadata
> {code}
> returns a non-empty dictionary, then
> {code:java}
> df.select('Field0.SubField0').schema.fields[0].metadata{code}
> returns an empty dictionary, where "Field0" is the name of the first field in 
> the DataFrame and "SubField0" is the name of the first nested field under 
> "Field0".
>  






[jira] [Updated] (SPARK-34805) PySpark loses metadata in DataFrame fields when selecting nested columns

2022-01-19 Thread Kevin Wallimann (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Wallimann updated SPARK-34805:

Attachment: nested_columns_metadata.scala

> PySpark loses metadata in DataFrame fields when selecting nested columns
> 
>
> Key: SPARK-34805
> URL: https://issues.apache.org/jira/browse/SPARK-34805
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.1, 3.1.1
>Reporter: Mark Ressler
>Priority: Major
> Attachments: jsonMetadataTest.py, nested_columns_metadata.scala
>
>
> For a DataFrame schema with nested StructTypes, where metadata is set for 
> fields in the schema, that metadata is lost when a DataFrame selects nested 
> fields.  For example, suppose
> {code:java}
> df.schema.fields[0].dataType.fields[0].metadata
> {code}
> returns a non-empty dictionary, then
> {code:java}
> df.select('Field0.SubField0').schema.fields[0].metadata{code}
> returns an empty dictionary, where "Field0" is the name of the first field in 
> the DataFrame and "SubField0" is the name of the first nested field under 
> "Field0".
>  






[jira] [Commented] (SPARK-37932) Analyzer can fail when join left side and right side are the same view

2022-01-19 Thread Zhixiong Chen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478590#comment-17478590
 ] 

Zhixiong Chen commented on SPARK-37932:
---

I'm working on it.

> Analyzer can fail when join left side and right side are the same view
> --
>
> Key: SPARK-37932
> URL: https://issues.apache.org/jira/browse/SPARK-37932
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Feng Zhu
>Priority: Major
> Attachments: sql_and_exception
>
>
> See the attachment for details, including SQL and the exception information.
>  * sql1, there is a normal filter (LO_SUPPKEY > 10) in the right side 
> subquery, Analyzer works as expected;
>  * sql2, there is a HAVING filter(HAVING COUNT(DISTINCT LO_SUPPKEY) > 1) in 
> the right side subquery, Analyzer failed with "Resolved attribute(s) 
> LO_SUPPKEY#337 missing ...".
>       From the debug info, the problem seems to occur after the rule 
> DeduplicateRelations is applied.






[jira] [Comment Edited] (SPARK-37932) Analyzer can fail when join left side and right side are the same view

2022-01-19 Thread Feng Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478542#comment-17478542
 ] 

Feng Zhu edited comment on SPARK-37932 at 1/19/22, 10:31 AM:
-

test 
{code:scala}
test("SPARK-37932: view join view self with having filter") {
  withTable("t") {
withView("v1") {
  Seq((2, "test2"), (3, "test3"), (1, "test1")).toDF("id", "name")
.write.format("parquet").saveAsTable("t")
  sql("CREATE VIEW v1 (id, name) AS SELECT id, name FROM t")

  sql("""
|SELECT l1.id
| FROM v1 l1
| INNER JOIN (
| SELECT id
| FROM v1
| GROUP BY id
| HAVING COUNT(DISTINCT name) > 1
| ) l2
| ON l1.id = l2.id
| GROUP BY l1.name, l1.id;
""".stripMargin)

}
  }
}

{code}
 

exception
{code:java}
org.apache.spark.sql.AnalysisException: Resolved attribute(s) name#25 missing 
from id#29,name#30 in operator !Aggregate [id#29], [id#29, count(distinct 
name#25) AS count(distinct name#25)#31L]. Attribute(s) with the same name 
appear in the operation: name. Please check if the right attribute(s) are used.;
Aggregate [name#25, id#24], [id#24]
+- Join Inner, (id#24 = id#29)
   :- SubqueryAlias l1
   :  +- SubqueryAlias spark_catalog.default.v1
   :     +- View (`default`.`v1`, [id#24,name#25])
   :        +- Project [cast(id#20 as int) AS id#24, cast(name#21 as string) AS 
name#25]
   :           +- Project [id#20, name#21]
   :              +- SubqueryAlias spark_catalog.default.t
   :                 +- Relation default.t[id#20,name#21] parquet
   +- SubqueryAlias l2
      +- Project [id#29]
         +- Filter (count(distinct name#25)#31L > cast(1 as bigint))
            +- !Aggregate [id#29], [id#29, count(distinct name#25) AS 
count(distinct name#25)#31L]
               +- SubqueryAlias spark_catalog.default.v1
                  +- View (`default`.`v1`, [id#29,name#30])
                     +- Project [cast(id#26 as int) AS id#29, cast(name#27 as 
string) AS name#30]
                        +- Project [id#26, name#27]
                           +- SubqueryAlias spark_catalog.default.t
                              +- Relation default.t[id#26,name#27] parquet

    at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:51)
    at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:50)

{code}
 


was (Author: fishcus):
test 
{code:scala}
test("SPARK-37932: view join view self with having filter") {
  withTable("t") {
withView("v1") {
  Seq((2, "test2"), (3, "test3"), (1, "test1")).toDF("id", "name")
.write.format("parquet").saveAsTable("t")
  sql("CREATE VIEW v1 (id, name) AS SELECT id, name FROM t")

  sql("""
|SELECT l1.id
| FROM v1 l1
| INNER JOIN (
| SELECT id
| FROM v1
| GROUP BY id
| HAVING COUNT(DISTINCT name) > 1
| ) l2
| ON l1.id = l2.id
| GROUP BY l1.name, l1.id;
""".stripMargin)

  }
}
}

{code}
 

exception
{code:java}
org.apache.spark.sql.AnalysisException: Resolved attribute(s) name#25 missing 
from id#29,name#30 in operator !Aggregate [id#29], [id#29, count(distinct 
name#25) AS count(distinct name#25)#31L]. Attribute(s) with the same name 
appear in the operation: name. Please check if the right attribute(s) are used.;
Aggregate [name#25, id#24], [id#24]
+- Join Inner, (id#24 = id#29)
   :- SubqueryAlias l1
   :  +- SubqueryAlias spark_catalog.default.v1
   :     +- View (`default`.`v1`, [id#24,name#25])
   :        +- Project [cast(id#20 as int) AS id#24, cast(name#21 as string) AS 
name#25]
   :           +- Project [id#20, name#21]
   :              +- SubqueryAlias spark_catalog.default.t
   :                 +- Relation default.t[id#20,name#21] parquet
   +- SubqueryAlias l2
      +- Project [id#29]
         +- Filter (count(distinct name#25)#31L > cast(1 as bigint))
            +- !Aggregate [id#29], [id#29, count(distinct name#25) AS 
count(distinct name#25)#31L]
               +- SubqueryAlias spark_catalog.default.v1
                  +- View (`default`.`v1`, [id#29,name#30])
                     +- Project [cast(id#26 as int) AS id#29, cast(name#27 as 
string) AS name#30]
                        +- Project [id#26, name#27]
                           +- SubqueryAlias spark_catalog.default.t
                              +- Relation default.t[id#26,name#27] parquet

    at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:51)
    at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:50)

{code}
 

> Analyzer can fail when join left side and right side are the same view
> --
>
> Key: SPARK-37932
> URL: https://issues.a

[jira] [Comment Edited] (SPARK-37932) Analyzer can fail when join left side and right side are the same view

2022-01-19 Thread Feng Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478542#comment-17478542
 ] 

Feng Zhu edited comment on SPARK-37932 at 1/19/22, 10:30 AM:
-

test 
{code:scala}
test("SPARK-37932: view join view self with having filter") {
  withTable("t") {
withView("v1") {
  Seq((2, "test2"), (3, "test3"), (1, "test1")).toDF("id", "name")
.write.format("parquet").saveAsTable("t")
  sql("CREATE VIEW v1 (id, name) AS SELECT id, name FROM t")

  sql("""
|SELECT l1.id
| FROM v1 l1
| INNER JOIN (
| SELECT id
| FROM v1
| GROUP BY id
| HAVING COUNT(DISTINCT name) > 1
| ) l2
| ON l1.id = l2.id
| GROUP BY l1.name, l1.id;
""".stripMargin)

  }
}
}

{code}
 

exception
{code:java}
org.apache.spark.sql.AnalysisException: Resolved attribute(s) name#25 missing 
from id#29,name#30 in operator !Aggregate [id#29], [id#29, count(distinct 
name#25) AS count(distinct name#25)#31L]. Attribute(s) with the same name 
appear in the operation: name. Please check if the right attribute(s) are used.;
Aggregate [name#25, id#24], [id#24]
+- Join Inner, (id#24 = id#29)
   :- SubqueryAlias l1
   :  +- SubqueryAlias spark_catalog.default.v1
   :     +- View (`default`.`v1`, [id#24,name#25])
   :        +- Project [cast(id#20 as int) AS id#24, cast(name#21 as string) AS 
name#25]
   :           +- Project [id#20, name#21]
   :              +- SubqueryAlias spark_catalog.default.t
   :                 +- Relation default.t[id#20,name#21] parquet
   +- SubqueryAlias l2
      +- Project [id#29]
         +- Filter (count(distinct name#25)#31L > cast(1 as bigint))
            +- !Aggregate [id#29], [id#29, count(distinct name#25) AS 
count(distinct name#25)#31L]
               +- SubqueryAlias spark_catalog.default.v1
                  +- View (`default`.`v1`, [id#29,name#30])
                     +- Project [cast(id#26 as int) AS id#29, cast(name#27 as 
string) AS name#30]
                        +- Project [id#26, name#27]
                           +- SubqueryAlias spark_catalog.default.t
                              +- Relation default.t[id#26,name#27] parquet

    at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:51)
    at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:50)

{code}
 


was (Author: fishcus):
test 

{code:scala}

test("SPARK-37932: view join view self with having filter") {
withTable("t") {
withView("v1") {
Seq((2, "test2"), (3, "test3"), (1, "test1")).toDF("id", "name")
.write.format("parquet").saveAsTable("t")
sql("CREATE VIEW v1 (id, name) AS SELECT id, name FROM t")

sql("""
|SELECT l1.id
| FROM v1 l1
| INNER JOIN (
| SELECT id
| FROM v1
| GROUP BY id
| HAVING COUNT(DISTINCT name) > 1
| ) l2
| ON l1.id = l2.id
| GROUP BY l1.name, l1.id;
""".stripMargin)

}
}
}

{code}

 

exception

{code}

org.apache.spark.sql.AnalysisException: Resolved attribute(s) name#25 missing 
from id#29,name#30 in operator !Aggregate [id#29], [id#29, count(distinct 
name#25) AS count(distinct name#25)#31L]. Attribute(s) with the same name 
appear in the operation: name. Please check if the right attribute(s) are used.;
Aggregate [name#25, id#24], [id#24]
+- Join Inner, (id#24 = id#29)
   :- SubqueryAlias l1
   :  +- SubqueryAlias spark_catalog.default.v1
   :     +- View (`default`.`v1`, [id#24,name#25])
   :        +- Project [cast(id#20 as int) AS id#24, cast(name#21 as string) AS 
name#25]
   :           +- Project [id#20, name#21]
   :              +- SubqueryAlias spark_catalog.default.t
   :                 +- Relation default.t[id#20,name#21] parquet
   +- SubqueryAlias l2
      +- Project [id#29]
         +- Filter (count(distinct name#25)#31L > cast(1 as bigint))
            +- !Aggregate [id#29], [id#29, count(distinct name#25) AS 
count(distinct name#25)#31L]
               +- SubqueryAlias spark_catalog.default.v1
                  +- View (`default`.`v1`, [id#29,name#30])
                     +- Project [cast(id#26 as int) AS id#29, cast(name#27 as 
string) AS name#30]
                        +- Project [id#26, name#27]
                           +- SubqueryAlias spark_catalog.default.t
                              +- Relation default.t[id#26,name#27] parquet

    at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:51)
    at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:50)

{code}

 

> Analyzer can fail when join left side and right side are the same view
> --
>
> Key: SPARK-37932
> URL: https://issues.apache.org/jira/browse/SPARK-37932
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Aff

[jira] [Commented] (SPARK-37932) Analyzer can fail when join left side and right side are the same view

2022-01-19 Thread Feng Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478542#comment-17478542
 ] 

Feng Zhu commented on SPARK-37932:
--

test 

{code:scala}

test("SPARK-37932: view join view self with having filter") {
withTable("t") {
withView("v1") {
Seq((2, "test2"), (3, "test3"), (1, "test1")).toDF("id", "name")
.write.format("parquet").saveAsTable("t")
sql("CREATE VIEW v1 (id, name) AS SELECT id, name FROM t")

sql("""
|SELECT l1.id
| FROM v1 l1
| INNER JOIN (
| SELECT id
| FROM v1
| GROUP BY id
| HAVING COUNT(DISTINCT name) > 1
| ) l2
| ON l1.id = l2.id
| GROUP BY l1.name, l1.id;
""".stripMargin)

}
}
}

{code}

 

exception

{code}

org.apache.spark.sql.AnalysisException: Resolved attribute(s) name#25 missing 
from id#29,name#30 in operator !Aggregate [id#29], [id#29, count(distinct 
name#25) AS count(distinct name#25)#31L]. Attribute(s) with the same name 
appear in the operation: name. Please check if the right attribute(s) are used.;
Aggregate [name#25, id#24], [id#24]
+- Join Inner, (id#24 = id#29)
   :- SubqueryAlias l1
   :  +- SubqueryAlias spark_catalog.default.v1
   :     +- View (`default`.`v1`, [id#24,name#25])
   :        +- Project [cast(id#20 as int) AS id#24, cast(name#21 as string) AS 
name#25]
   :           +- Project [id#20, name#21]
   :              +- SubqueryAlias spark_catalog.default.t
   :                 +- Relation default.t[id#20,name#21] parquet
   +- SubqueryAlias l2
      +- Project [id#29]
         +- Filter (count(distinct name#25)#31L > cast(1 as bigint))
            +- !Aggregate [id#29], [id#29, count(distinct name#25) AS 
count(distinct name#25)#31L]
               +- SubqueryAlias spark_catalog.default.v1
                  +- View (`default`.`v1`, [id#29,name#30])
                     +- Project [cast(id#26 as int) AS id#29, cast(name#27 as 
string) AS name#30]
                        +- Project [id#26, name#27]
                           +- SubqueryAlias spark_catalog.default.t
                              +- Relation default.t[id#26,name#27] parquet

    at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:51)
    at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:50)

{code}

 

> Analyzer can fail when join left side and right side are the same view
> --
>
> Key: SPARK-37932
> URL: https://issues.apache.org/jira/browse/SPARK-37932
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Feng Zhu
>Priority: Major
> Attachments: sql_and_exception
>
>
> See the attachment for details, including SQL and the exception information.
>  * sql1, there is a normal filter (LO_SUPPKEY > 10) in the right side 
> subquery, Analyzer works as expected;
>  * sql2, there is a HAVING filter(HAVING COUNT(DISTINCT LO_SUPPKEY) > 1) in 
> the right side subquery, Analyzer failed with "Resolved attribute(s) 
> LO_SUPPKEY#337 missing ...".
>       From the debug info, the problem seems to occur after the rule 
> DeduplicateRelations is applied.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-37932) Analyzer can fail when join left side and right side are the same view

2022-01-19 Thread Feng Zhu (Jira)


[ https://issues.apache.org/jira/browse/SPARK-37932 ]


Feng Zhu deleted comment on SPARK-37932:
--

was (Author: fishcus):
{code:scala} 

{code} 

> Analyzer can fail when join left side and right side are the same view
> --
>
> Key: SPARK-37932
> URL: https://issues.apache.org/jira/browse/SPARK-37932
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Feng Zhu
>Priority: Major
> Attachments: sql_and_exception
>
>
> See the attachment for details, including SQL and the exception information.
>  * sql1, there is a normal filter (LO_SUPPKEY > 10) in the right side 
> subquery, Analyzer works as expected;
>  * sql2, there is a HAVING filter(HAVING COUNT(DISTINCT LO_SUPPKEY) > 1) in 
> the right side subquery, Analyzer failed with "Resolved attribute(s) 
> LO_SUPPKEY#337 missing ...".
>       From the debug info, the problem seems to occur after the rule 
> DeduplicateRelations is applied.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37932) Analyzer can fail when join left side and right side are the same view

2022-01-19 Thread Feng Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478537#comment-17478537
 ] 

Feng Zhu commented on SPARK-37932:
--

{code:scala} 

{code} 

> Analyzer can fail when join left side and right side are the same view
> --
>
> Key: SPARK-37932
> URL: https://issues.apache.org/jira/browse/SPARK-37932
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Feng Zhu
>Priority: Major
> Attachments: sql_and_exception
>
>
> See the attachment for details, including SQL and the exception information.
>  * sql1, there is a normal filter (LO_SUPPKEY > 10) in the right side 
> subquery, Analyzer works as expected;
>  * sql2, there is a HAVING filter(HAVING COUNT(DISTINCT LO_SUPPKEY) > 1) in 
> the right side subquery, Analyzer failed with "Resolved attribute(s) 
> LO_SUPPKEY#337 missing ...".
>       From the debug info, the problem seems to occur after the rule 
> DeduplicateRelations is applied.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37910) Spark executor self-exiting due to driver disassociated in Kubernetes with client deploy-mode

2022-01-19 Thread Petri (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478482#comment-17478482
 ] 

Petri commented on SPARK-37910:
---

Also the error message we get in the executor is pretty vague:

 Executor self-exiting due to : Driver 
192-168-39-71.mni-system.pod.cluster.local:40752 disassociated! Shutting down.

It raises questions:
 * What does the disassociation mean? Is it related to a disconnection?
 * Why must the executor self-exit? Would it be possible to retry the driver 
association?

It would be good to improve the error message and related documentation.

> Spark executor self-exiting due to driver disassociated in Kubernetes with 
> client deploy-mode
> -
>
> Key: SPARK-37910
> URL: https://issues.apache.org/jira/browse/SPARK-37910
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Petri
>Priority: Major
>
> I have Spark driver running in a Kubernetes pod with client deploy-mode and 
> it tries to start an executor.
> Executor will fail with error:
>     \{"type":"log", "level":"ERROR", "name":"STREAMING_OTHERS", 
> "time":"2022-01-14T12:29:38.318Z", "timezone":"UTC", 
> "class":"dispatcher-Executor", 
> "method":"spark.executor.CoarseGrainedExecutorBackend.logError(73)", 
> "log":"Executor self-exiting due to : Driver 
> 192-168-39-71.mni-system.pod.cluster.local:40752 disassociated! Shutting 
> down.\n"}
> Then the driver will attempt to start another executor, which fails with the 
> same error, and this goes on and on.
> In the driver pod, I see only following errors:
>     22/01/14 12:26:32 ERROR TaskSchedulerImpl: Lost executor 1 on 
> 192.168.43.250:
>     22/01/14 12:27:16 ERROR TaskSchedulerImpl: Lost executor 2 on 
> 192.168.43.233:
>     22/01/14 12:27:59 ERROR TaskSchedulerImpl: Lost executor 3 on 
> 192.168.43.221:
>     22/01/14 12:28:43 ERROR TaskSchedulerImpl: Lost executor 4 on 
> 192.168.43.217:
>     22/01/14 12:29:27 ERROR TaskSchedulerImpl: Lost executor 5 on 
> 192.168.43.197:
>     22/01/14 12:30:10 ERROR TaskSchedulerImpl: Lost executor 6 on 
> 192.168.43.237:
>     22/01/14 12:30:53 ERROR TaskSchedulerImpl: Lost executor 7 on 
> 192.168.43.196:
>     22/01/14 12:31:42 ERROR TaskSchedulerImpl: Lost executor 8 on 
> 192.168.43.228:
>     22/01/14 12:32:31 ERROR TaskSchedulerImpl: Lost executor 9 on 
> 192.168.43.254:
>     22/01/14 12:33:14 ERROR TaskSchedulerImpl: Lost executor 10 on 
> 192.168.43.204:
>     22/01/14 12:33:57 ERROR TaskSchedulerImpl: Lost executor 11 on 
> 192.168.43.231:
> What is wrong? And how can I get executors running correctly?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37910) Spark executor self-exiting due to driver disassociated in Kubernetes with client deploy-mode

2022-01-19 Thread Petri (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478478#comment-17478478
 ] 

Petri commented on SPARK-37910:
---

In deployment.yaml we have:
 * name: POD_NAME
   valueFrom:
     fieldRef:
       fieldPath: metadata.name
 * name: SPARK_DRIVER_BIND_ADDRESS
   valueFrom:
     fieldRef:
       fieldPath: status.podIP
 * name: K8S_NS
   valueFrom:
     fieldRef:
       fieldPath: metadata.namespace

 

We are setting the following confs for spark-submit:

DRIVER_HOSTNAME=$(echo $SPARK_DRIVER_BIND_ADDRESS | sed 's/\./-/g')

--conf spark.kubernetes.driver.pod.name=$POD_NAME \
--conf spark.driver.host=$DRIVER_HOSTNAME.$K8S_NS.pod.cluster.local \

 

So we are using the Pod DNS name, is that OK? Or should we use a headless 
service? The documentation is not clear about it. What we are missing in our 
confs is spark.driver.port. Is that a mandatory conf?

Can you give the exact steps for checking the pod network status?

 

We have quite a similar setup in our other microservice, which is working OK 
(Spark 3.2.0 and Java 11), but for some reason the microservice in question has 
the problem.

> Spark executor self-exiting due to driver disassociated in Kubernetes with 
> client deploy-mode
> -
>
> Key: SPARK-37910
> URL: https://issues.apache.org/jira/browse/SPARK-37910
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Petri
>Priority: Major
>
> I have Spark driver running in a Kubernetes pod with client deploy-mode and 
> it tries to start an executor.
> Executor will fail with error:
>     \{"type":"log", "level":"ERROR", "name":"STREAMING_OTHERS", 
> "time":"2022-01-14T12:29:38.318Z", "timezone":"UTC", 
> "class":"dispatcher-Executor", 
> "method":"spark.executor.CoarseGrainedExecutorBackend.logError(73)", 
> "log":"Executor self-exiting due to : Driver 
> 192-168-39-71.mni-system.pod.cluster.local:40752 disassociated! Shutting 
> down.\n"}
> Then the driver will attempt to start another executor, which fails with the 
> same error, and this goes on and on.
> In the driver pod, I see only following errors:
>     22/01/14 12:26:32 ERROR TaskSchedulerImpl: Lost executor 1 on 
> 192.168.43.250:
>     22/01/14 12:27:16 ERROR TaskSchedulerImpl: Lost executor 2 on 
> 192.168.43.233:
>     22/01/14 12:27:59 ERROR TaskSchedulerImpl: Lost executor 3 on 
> 192.168.43.221:
>     22/01/14 12:28:43 ERROR TaskSchedulerImpl: Lost executor 4 on 
> 192.168.43.217:
>     22/01/14 12:29:27 ERROR TaskSchedulerImpl: Lost executor 5 on 
> 192.168.43.197:
>     22/01/14 12:30:10 ERROR TaskSchedulerImpl: Lost executor 6 on 
> 192.168.43.237:
>     22/01/14 12:30:53 ERROR TaskSchedulerImpl: Lost executor 7 on 
> 192.168.43.196:
>     22/01/14 12:31:42 ERROR TaskSchedulerImpl: Lost executor 8 on 
> 192.168.43.228:
>     22/01/14 12:32:31 ERROR TaskSchedulerImpl: Lost executor 9 on 
> 192.168.43.254:
>     22/01/14 12:33:14 ERROR TaskSchedulerImpl: Lost executor 10 on 
> 192.168.43.204:
>     22/01/14 12:33:57 ERROR TaskSchedulerImpl: Lost executor 11 on 
> 192.168.43.231:
> What is wrong? And how can I get executors running correctly?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37962) Cannot fetch remote jar correctly

2022-01-19 Thread Jinpeng Chi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478469#comment-17478469
 ] 

Jinpeng Chi commented on SPARK-37962:
-

The root cause was that the link I had encoded was decoded before the jar was fetched.
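
A small illustration of that symptom (the file name and URL are hypothetical, and this is not the actual Spark fetch code path): decoding an already-encoded link changes the path that ends up being requested from the file server.

{code:scala}
import java.net.URLDecoder
import java.nio.charset.StandardCharsets

// The link as published, with '+' escaped as %2B.
val publishedLink = "http://files.example.com/jars/my%2Bapp.jar"

// If the fetcher decodes the link again, it asks the server for a different
// path than the one that was published, and the server answers with 404.
val requested = URLDecoder.decode(publishedLink, StandardCharsets.UTF_8.name())
// requested: http://files.example.com/jars/my+app.jar
{code}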

> Cannot fetch remote jar correctly
> -
>
> Key: SPARK-37962
> URL: https://issues.apache.org/jira/browse/SPARK-37962
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2, 3.2.0
>Reporter: Jinpeng Chi
>Priority: Major
> Attachments: image-2022-01-19-17-18-24-795.png, 
> image-2022-01-19-17-21-53-011.png
>
>
> When my Jar link address is encoded, the Jar cannot be pulled correctly
> Log:
> !image-2022-01-19-17-18-24-795.png!
>  
> My static file server(tomcat) log:
> !image-2022-01-19-17-21-53-011.png!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37962) Cannot fetch remote jar correctly

2022-01-19 Thread Jinpeng Chi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinpeng Chi updated SPARK-37962:

Description: 
When my Jar link address is encoded, the Jar cannot be pulled correctly

Log:

!image-2022-01-19-17-18-24-795.png!

 

My static file server(tomcat) log:

!image-2022-01-19-17-21-53-011.png!

> Cannot fetch remote jar correctly
> -
>
> Key: SPARK-37962
> URL: https://issues.apache.org/jira/browse/SPARK-37962
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2, 3.2.0
>Reporter: Jinpeng Chi
>Priority: Major
> Attachments: image-2022-01-19-17-18-24-795.png, 
> image-2022-01-19-17-21-53-011.png
>
>
> When my Jar link address is encoded, the Jar cannot be pulled correctly
> Log:
> !image-2022-01-19-17-18-24-795.png!
>  
> My static file server(tomcat) log:
> !image-2022-01-19-17-21-53-011.png!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37962) Cannot fetch remote jar correctly

2022-01-19 Thread Jinpeng Chi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinpeng Chi updated SPARK-37962:

Attachment: image-2022-01-19-17-21-53-011.png

> Cannot fetch remote jar correctly
> -
>
> Key: SPARK-37962
> URL: https://issues.apache.org/jira/browse/SPARK-37962
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2, 3.2.0
>Reporter: Jinpeng Chi
>Priority: Major
> Attachments: image-2022-01-19-17-18-24-795.png, 
> image-2022-01-19-17-21-53-011.png
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37962) Cannot fetch remote jar correctly

2022-01-19 Thread Jinpeng Chi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinpeng Chi updated SPARK-37962:

Attachment: image-2022-01-19-17-18-24-795.png

> Cannot fetch remote jar correctly
> -
>
> Key: SPARK-37962
> URL: https://issues.apache.org/jira/browse/SPARK-37962
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2, 3.2.0
>Reporter: Jinpeng Chi
>Priority: Major
> Attachments: image-2022-01-19-17-18-24-795.png
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37962) Cannot fetch remote jar correctly

2022-01-19 Thread Jinpeng Chi (Jira)
Jinpeng Chi created SPARK-37962:
---

 Summary: Cannot fetch remote jar correctly
 Key: SPARK-37962
 URL: https://issues.apache.org/jira/browse/SPARK-37962
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.2.0, 3.1.2
Reporter: Jinpeng Chi
 Attachments: image-2022-01-19-17-18-24-795.png





--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37961) override maxRows/maxRowsPerPartition for some logical operators

2022-01-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37961:


Assignee: (was: Apache Spark)

> override maxRows/maxRowsPerPartition for some logical operators
> ---
>
> Key: SPARK-37961
> URL: https://issues.apache.org/jira/browse/SPARK-37961
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: zhengruifeng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37915) Push down deterministic projection through SQL UNION

2022-01-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478451#comment-17478451
 ] 

Apache Spark commented on SPARK-37915:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/35249

> Push down deterministic projection through SQL UNION
> 
>
> Key: SPARK-37915
> URL: https://issues.apache.org/jira/browse/SPARK-37915
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:scala}
> spark.range(11).selectExpr("cast(id as decimal(18, 1)) as a", "id as b", "id 
> as c").write.saveAsTable("t1")
> spark.range(12).selectExpr("cast(id as decimal(18, 2)) as a", "id as b", "id 
> as c").write.saveAsTable("t2")
> spark.range(13).selectExpr("cast(id as decimal(18, 3)) as a", "id as b", "id 
> as c").write.saveAsTable("t3")
> spark.range(14).selectExpr("cast(id as decimal(18, 4)) as a", "id as b", "id 
> as c").write.saveAsTable("t4")
> spark.range(15).selectExpr("cast(id as decimal(18, 5)) as a", "id as b", "id 
> as c").write.saveAsTable("t5")
> sql("select a from t1 union select a from t2 union select a from t3 union 
> select a from t4 union select a from t5").explain(true)
> {code}
> Current:
> {noformat}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[a#76], functions=[], output=[a#76])
>+- Exchange hashpartitioning(a#76, 5), ENSURE_REQUIREMENTS, [id=#159]
>   +- HashAggregate(keys=[a#76], functions=[], output=[a#76])
>  +- Union
> :- HashAggregate(keys=[a#74], functions=[], output=[a#76])
> :  +- Exchange hashpartitioning(a#74, 5), ENSURE_REQUIREMENTS, 
> [id=#154]
> : +- HashAggregate(keys=[a#74], functions=[], output=[a#74])
> :+- Union
> :   :- HashAggregate(keys=[a#72], functions=[], 
> output=[a#74])
> :   :  +- Exchange hashpartitioning(a#72, 5), 
> ENSURE_REQUIREMENTS, [id=#149]
> :   : +- HashAggregate(keys=[a#72], functions=[], 
> output=[a#72])
> :   :+- Union
> :   :   :- HashAggregate(keys=[a#70], 
> functions=[], output=[a#72])
> :   :   :  +- Exchange hashpartitioning(a#70, 5), 
> ENSURE_REQUIREMENTS, [id=#144]
> :   :   : +- HashAggregate(keys=[a#70], 
> functions=[], output=[a#70])
> :   :   :+- Union
> :   :   :   :- Project [cast(a#55 as 
> decimal(19,2)) AS a#70]
> :   :   :   :  +- FileScan parquet 
> default.t1[a#55] Batched: true, DataFilters: [], Format: Parquet, Location: 
> InMemoryFileIndex(1 
> paths)[file:/Users/yumwang/spark/SPARK-31890/external/avro/spark-warehouse/or...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: struct
> :   :   :   +- Project [cast(a#58 as 
> decimal(19,2)) AS a#71]
> :   :   :  +- FileScan parquet 
> default.t2[a#58] Batched: true, DataFilters: [], Format: Parquet, Location: 
> InMemoryFileIndex(1 
> paths)[file:/Users/yumwang/spark/SPARK-31890/external/avro/spark-warehouse/or...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: struct
> :   :   +- Project [cast(a#61 as decimal(20,3)) 
> AS a#73]
> :   :  +- FileScan parquet default.t3[a#61] 
> Batched: true, DataFilters: [], Format: Parquet, Location: 
> InMemoryFileIndex(1 
> paths)[file:/Users/yumwang/spark/SPARK-31890/external/avro/spark-warehouse/or...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: struct
> :   +- Project [cast(a#64 as decimal(21,4)) AS a#75]
> :  +- FileScan parquet default.t4[a#64] Batched: 
> true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 
> paths)[file:/Users/yumwang/spark/SPARK-31890/external/avro/spark-warehouse/or...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: struct
> +- Project [cast(a#67 as decimal(22,5)) AS a#77]
>+- FileScan parquet default.t5[a#67] Batched: true, 
> DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 
> paths)[file:/Users/yumwang/spark/SPARK-31890/external/avro/spark-warehouse/or...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: struct
> {noformat}
> Expected:
> {noformat}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[a#76], functions=[], output=[a#76])
>+- Exchange hashpartitioning(a#76, 5), ENSURE_REQUIREMENTS, [id=#111]
>   +- HashAggregat

[jira] [Assigned] (SPARK-37961) override maxRows/maxRowsPerPartition for some logical operators

2022-01-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37961:


Assignee: Apache Spark

> override maxRows/maxRowsPerPartition for some logical operators
> ---
>
> Key: SPARK-37961
> URL: https://issues.apache.org/jira/browse/SPARK-37961
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37961) override maxRows/maxRowsPerPartition for some logical operators

2022-01-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478452#comment-17478452
 ] 

Apache Spark commented on SPARK-37961:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/35250

> override maxRows/maxRowsPerPartition for some logical operators
> ---
>
> Key: SPARK-37961
> URL: https://issues.apache.org/jira/browse/SPARK-37961
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: zhengruifeng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37960) Support aggregate push down SUM(CASE ... WHEN ... ELSE ... END)

2022-01-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37960:


Assignee: (was: Apache Spark)

> Support aggregate push down SUM(CASE ... WHEN ... ELSE ... END)
> ---
>
> Key: SPARK-37960
> URL: https://issues.apache.org/jira/browse/SPARK-37960
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, Spark supports aggregate push down SUM(column) into JDBC data 
> source.
> SUM(CASE ... WHEN ... ELSE ... END) is very useful for users.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37960) Support aggregate push down SUM(CASE ... WHEN ... ELSE ... END)

2022-01-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37960:


Assignee: Apache Spark

> Support aggregate push down SUM(CASE ... WHEN ... ELSE ... END)
> ---
>
> Key: SPARK-37960
> URL: https://issues.apache.org/jira/browse/SPARK-37960
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> Currently, Spark supports aggregate push down SUM(column) into JDBC data 
> source.
> SUM(CASE ... WHEN ... ELSE ... END) is very useful for users.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37960) Support aggregate push down SUM(CASE ... WHEN ... ELSE ... END)

2022-01-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478445#comment-17478445
 ] 

Apache Spark commented on SPARK-37960:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/35248

> Support aggregate push down SUM(CASE ... WHEN ... ELSE ... END)
> ---
>
> Key: SPARK-37960
> URL: https://issues.apache.org/jira/browse/SPARK-37960
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, Spark supports aggregate push down SUM(column) into JDBC data 
> source.
> SUM(CASE ... WHEN ... ELSE ... END) is very useful for users.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37951) Refactor ImageFileFormatSuite

2022-01-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-37951:
---

Assignee: angerszhu

> Refactor ImageFileFormatSuite
> -
>
> Key: SPARK-37951
> URL: https://issues.apache.org/jira/browse/SPARK-37951
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>
> The suite does not use the standard API and sometimes fails; optimize it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37951) Refactor ImageFileFormatSuite

2022-01-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37951.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35237
[https://github.com/apache/spark/pull/35237]

> Refactor ImageFileFormatSuite
> -
>
> Key: SPARK-37951
> URL: https://issues.apache.org/jira/browse/SPARK-37951
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.3.0
>
>
> The suite does not use the standard API and sometimes fails; optimize it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37961) override maxRows/maxRowsPerPartition for some logical operators

2022-01-19 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-37961:


 Summary: override maxRows/maxRowsPerPartition for some logical 
operators
 Key: SPARK-37961
 URL: https://issues.apache.org/jira/browse/SPARK-37961
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: zhengruifeng






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37960) Support aggregate push down SUM(CASE ... WHEN ... ELSE ... END)

2022-01-19 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-37960:
---
Description: 
Currently, Spark supports aggregate push down SUM(column) into JDBC data source.
SUM(CASE ... WHEN ... ELSE ... END) is very useful for users.

  was:
Currently, Spark supports complete push down SUM(column) into JDBC data source.
SUM(CASE ... WHEN ... ELSE ... END) is very useful for users.


> Support aggregate push down SUM(CASE ... WHEN ... ELSE ... END)
> ---
>
> Key: SPARK-37960
> URL: https://issues.apache.org/jira/browse/SPARK-37960
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, Spark supports aggregate push down SUM(column) into JDBC data 
> source.
> SUM(CASE ... WHEN ... ELSE ... END) is very useful for users.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37960) Support aggregate push down SUM(CASE ... WHEN ... ELSE ... END)

2022-01-19 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-37960:
---
Summary: Support aggregate push down SUM(CASE ... WHEN ... ELSE ... END)  
(was: Support complete push down SUM(CASE ... WHEN ... ELSE ... END))

> Support aggregate push down SUM(CASE ... WHEN ... ELSE ... END)
> ---
>
> Key: SPARK-37960
> URL: https://issues.apache.org/jira/browse/SPARK-37960
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, Spark supports complete push down SUM(column) into JDBC data 
> source.
> SUM(CASE ... WHEN ... ELSE ... END) is very useful for users.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30661) KMeans blockify input vectors

2022-01-19 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478437#comment-17478437
 ] 

zhengruifeng commented on SPARK-30661:
--

Recently, I spent some time testing blockified KMeans, applying GEMM to find 
the closest cluster.

In short:

1. For sparse datasets, blockified KMeans still causes regressions in most 
cases (the existing impl with the triangle inequality can skip some distance 
computations, but Scala-based sparse BLAS always computes all distances);

2. For dense datasets and small k, blockified KMeans (without native BLAS) is 
competitive; with native BLAS, it should be significantly faster than the 
existing impl.

So I plan to add a new parameter {{solver}} by making KMeans extend HasSolver 
and support both training impls, so that end users can switch to the blockified 
version. A rough sketch is below.

What do you think? [~srowen] [~WeichenXu123] [~mengxr] [~huaxingao] 
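
A rough sketch of the user-facing side of the proposal; the param name and the values shown below are assumptions, not a final API:

{code:scala}
import org.apache.spark.ml.clustering.KMeans

val kmeans = new KMeans()
  .setK(10)
  .setDistanceMeasure("euclidean")
// Hypothetical, if KMeans gains a HasSolver-style param as proposed:
//   .setSolver("block")  // GEMM-based path, attractive for dense data / native BLAS
//   .setSolver("row")    // existing row-wise path with triangle-inequality skipping
{code}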

 

> KMeans blockify input vectors
> -
>
> Key: SPARK-30661
> URL: https://issues.apache.org/jira/browse/SPARK-30661
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37960) Support complete push down SUM(CASE ... WHEN ... ELSE ... END)

2022-01-19 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-37960:
--

 Summary: Support complete push down SUM(CASE ... WHEN ... ELSE ... 
END)
 Key: SPARK-37960
 URL: https://issues.apache.org/jira/browse/SPARK-37960
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.3.0
Reporter: jiaan.geng


Currently, Spark supports complete push down SUM(column) into JDBC data source.
SUM(CASE ... WHEN ... ELSE ... END) is very useful for users.
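
A sketch of the query shape this targets (the catalog and table names are hypothetical); pushing the whole SUM(CASE ... END) into the JDBC source would avoid shipping raw rows back to Spark just to aggregate them:

{code:scala}
sql(
  """
    |SELECT dept,
    |       SUM(CASE WHEN salary > 10000 THEN salary ELSE 0 END) AS high_salary_total
    |FROM h2.test.employee
    |GROUP BY dept
  """.stripMargin).explain(true)
{code}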



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37959) Fix the UT of checking norm in KMeans & BiKMeans

2022-01-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37959:


Assignee: Apache Spark

> Fix the UT of checking norm in KMeans & BiKMeans
> 
>
> Key: SPARK-37959
> URL: https://issues.apache.org/jira/browse/SPARK-37959
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.3.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Minor
>
> In KMeansSuite and BisectingKMeansSuite, there are some unused lines:
>  
> {code:java}
> model1.clusterCenters.forall(Vectors.norm(_, 2) == 1.0) {code}
>  
> For cosine distance, the norm of centering vector should be 1, so the norm 
> checking is meaningful;
> For euclidean distance, the norm checking is meaningless;
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37959) Fix the UT of checking norm in KMeans & BiKMeans

2022-01-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37959:


Assignee: (was: Apache Spark)

> Fix the UT of checking norm in KMeans & BiKMeans
> 
>
> Key: SPARK-37959
> URL: https://issues.apache.org/jira/browse/SPARK-37959
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.3.0
>Reporter: zhengruifeng
>Priority: Minor
>
> In KMeansSuite and BisectingKMeansSuite, there are some unused lines:
>  
> {code:java}
> model1.clusterCenters.forall(Vectors.norm(_, 2) == 1.0) {code}
>  
> For cosine distance, the norm of centering vector should be 1, so the norm 
> checking is meaningful;
> For euclidean distance, the norm checking is meaningless;
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37959) Fix the UT of checking norm in KMeans & BiKMeans

2022-01-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37959:


Assignee: Apache Spark

> Fix the UT of checking norm in KMeans & BiKMeans
> 
>
> Key: SPARK-37959
> URL: https://issues.apache.org/jira/browse/SPARK-37959
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.3.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Minor
>
> In KMeansSuite and BisectingKMeansSuite, there are some unused lines:
>  
> {code:java}
> model1.clusterCenters.forall(Vectors.norm(_, 2) == 1.0 {code}
>  
> For cosine distance, the norm of centering vector should be 1, so the norm 
> checking is meaningful;
> For euclidean distance, the norm checking is meaningless;
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37959) Fix the UT of checking norm in KMeans & BiKMeans

2022-01-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478405#comment-17478405
 ] 

Apache Spark commented on SPARK-37959:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/35247

> Fix the UT of checking norm in KMeans & BiKMeans
> 
>
> Key: SPARK-37959
> URL: https://issues.apache.org/jira/browse/SPARK-37959
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.3.0
>Reporter: zhengruifeng
>Priority: Minor
>
> In KMeansSuite and BisectingKMeansSuite, there are some unused lines:
>  
> {code:java}
> model1.clusterCenters.forall(Vectors.norm(_, 2) == 1.0) {code}
>  
> For cosine distance, the norm of centering vector should be 1, so the norm 
> checking is meaningful;
> For euclidean distance, the norm checking is meaningless;
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37959) Fix the UT of checking norm in KMeans & BiKMeans

2022-01-19 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-37959:


 Summary: Fix the UT of checking norm in KMeans & BiKMeans
 Key: SPARK-37959
 URL: https://issues.apache.org/jira/browse/SPARK-37959
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.3.0
Reporter: zhengruifeng


In KMeansSuite and BisectingKMeansSuite, there are some unused lines:

 
{code:java}
model1.clusterCenters.forall(Vectors.norm(_, 2) == 1.0) {code}
 

For cosine distance, the norm of centering vector should be 1, so the norm 
checking is meaningful;

For euclidean distance, the norm checking is meaningless;
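
A sketch of what the fix presumably looks like (model1 and Vectors come from the surrounding suite): turn the unused boolean expression into an assertion, using a tolerance instead of an exact floating-point comparison.

{code:scala}
// Cosine distance: every cluster center should be a unit vector.
assert(model1.clusterCenters.forall(center => math.abs(Vectors.norm(center, 2) - 1.0) < 1e-6))
{code}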

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org