[jira] [Assigned] (SPARK-36892) Disable batch fetch for a shuffle when push based shuffle is enabled

2021-10-06 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-36892:
--

Assignee: Ye Zhou

> Disable batch fetch for a shuffle when push based shuffle is enabled
> 
>
> Key: SPARK-36892
> URL: https://issues.apache.org/jira/browse/SPARK-36892
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: Mridul Muralidharan
>Assignee: Ye Zhou
>Priority: Blocker
>
> When push based shuffle is enabled, efficient fetch of merged mapper shuffle 
> output happens.
> Unfortunately, this currently interacts badly with 
> spark.sql.adaptive.fetchShuffleBlocksInBatch, potentially causing shuffle 
> fetch to hang and/or duplicate data to be fetched, causing correctness issues.
> Given batch fetch does not benefit spark stages reading merged blocks when 
> push based shuffle is enabled, ShuffleBlockFetcherIterator.doBatchFetch can 
> be disabled when push based shuffle is enabled.
> Thx to [~Ngone51] for surfacing this issue.
> +CC [~Gengliang.Wang]
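For context, a minimal sketch (not the fix itself) of the two settings whose interaction the description refers to; with both enabled, a reader could hang or fetch duplicate merged blocks until doBatchFetch is disabled internally:

{code:python}
# Sketch only: the two features described above. Push based shuffle
# ("spark.shuffle.push.enabled", Spark 3.2, YARN + external shuffle service)
# and AQE batch fetch ("spark.sql.adaptive.fetchShuffleBlocksInBatch", true by default).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.shuffle.push.enabled", "true")
    .config("spark.sql.adaptive.fetchShuffleBlocksInBatch", "true")
    .getOrCreate()
)
{code}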



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36892) Disable batch fetch for a shuffle when push based shuffle is enabled

2021-10-06 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-36892.

Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 34156
[https://github.com/apache/spark/pull/34156]

> Disable batch fetch for a shuffle when push based shuffle is enabled
> 
>
> Key: SPARK-36892
> URL: https://issues.apache.org/jira/browse/SPARK-36892
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: Mridul Muralidharan
>Assignee: Ye Zhou
>Priority: Blocker
> Fix For: 3.2.0
>
>
> When push based shuffle is enabled, efficient fetch of merged mapper shuffle 
> output happens.
> Unfortunately, this currently interacts badly with 
> spark.sql.adaptive.fetchShuffleBlocksInBatch, potentially causing shuffle 
> fetch to hang and/or duplicate data to be fetched, causing correctness issues.
> Given batch fetch does not benefit spark stages reading merged blocks when 
> push based shuffle is enabled, ShuffleBlockFetcherIterator.doBatchFetch can 
> be disabled when push based shuffle is enabled.
> Thx to [~Ngone51] for surfacing this issue.
> +CC [~Gengliang.Wang]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36892) Disable batch fetch for a shuffle when push based shuffle is enabled

2021-10-06 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424825#comment-17424825
 ] 

Gengliang Wang commented on SPARK-36892:


[~mridulm80] [~mshen] [~zhouyejoe] [~apatnam] Again, thanks for testing Spark 
3.2 with real workloads. Now that all the blockers are resolved, I will have 
the new RC soon.

> Disable batch fetch for a shuffle when push based shuffle is enabled
> 
>
> Key: SPARK-36892
> URL: https://issues.apache.org/jira/browse/SPARK-36892
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: Mridul Muralidharan
>Assignee: Ye Zhou
>Priority: Blocker
> Fix For: 3.2.0
>
>
> When push based shuffle is enabled, efficient fetch of merged mapper shuffle 
> output happens.
> Unfortunately, this currently interacts badly with 
> spark.sql.adaptive.fetchShuffleBlocksInBatch, potentially causing shuffle 
> fetch to hang and/or duplicate data to be fetched, causing correctness issues.
> Given batch fetch does not benefit spark stages reading merged blocks when 
> push based shuffle is enabled, ShuffleBlockFetcherIterator.doBatchFetch can 
> be disabled when push based shuffle is enabled.
> Thx to [~Ngone51] for surfacing this issue.
> +CC [~Gengliang.Wang]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36934) Timestamp are written as array bytes.

2021-10-06 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-36934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424860#comment-17424860
 ] 

Bjørn Jørgensen commented on SPARK-36934:
-

.config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")

 

now in Apache drill it prints

year 

14230080

14571360

 

 

> Timestamp are written as array bytes.
> -
>
> Key: SPARK-36934
> URL: https://issues.apache.org/jira/browse/SPARK-36934
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> This is tested with master build 04.10.21
> {code}
> df = ps.DataFrame({'year': ['2015-2-4', '2016-3-5'],
>                    'month': [2, 3],
>                    'day': [4, 5],
>                   'test': [1, 2]})  
> df["year"] = ps.to_datetime(df["year"]) 
> df.info() 
>  Int64Index: 2 entries, 0 to 1
> Data columns (total 4 columns):
>  #   Column  Non-Null Count  Dtype
> ---  ------  --------------  -----
>  0   year    2 non-null      datetime64
>  1   month   2 non-null      int64
>  2   day     2 non-null      int64
>  3   test    2 non-null      int64
> dtypes: datetime64(1), int64(3)
> spark_df_date = df.to_spark() 
> spark_df_date.printSchema() 
> root
> |-- year: timestamp (nullable = true)
> |-- month: long (nullable = false)
> |-- day: long (nullable = false)
> |-- test: long (nullable = false)  
> spark_df_date.write.parquet("s3a://falk0509/spark_df_date.parquet")  
> {code}
> Load the files into Apache Drill (I use docker apache/drill:master-openjdk-14):
> SELECT * FROM cp.`/data/spark_df_date.*`
> It prints
> year
> {code}
> \x00\x00\x00\x00\x00\x00\x00\x00\xE2}%\x00
> \x00\x00\x00\x00\x00\x00\x00\x00m\x7F%\x00 
> {code}
>  
> The rest of the columns are ok.   
> So is this a spark problem or Apache drill? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36934) Timestamp are written as array bytes.

2021-10-06 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-36934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bjørn Jørgensen updated SPARK-36934:

Component/s: Spark Core

> Timestamp are written as array bytes.
> -
>
> Key: SPARK-36934
> URL: https://issues.apache.org/jira/browse/SPARK-36934
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.3.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> This is tested with master build 04.10.21
> {code}
> df = ps.DataFrame({'year': ['2015-2-4', '2016-3-5'],
>                    'month': [2, 3],
>                    'day': [4, 5],
>                   'test': [1, 2]})  
> df["year"] = ps.to_datetime(df["year"]) 
> df.info() 
>  Int64Index: 2 entries, 0 to 1
> Data columns (total 4 columns):
>  #   Column  Non-Null Count  Dtype
> ---  ------  --------------  -----
>  0   year    2 non-null      datetime64
>  1   month   2 non-null      int64
>  2   day     2 non-null      int64
>  3   test    2 non-null      int64
> dtypes: datetime64(1), int64(3)
> spark_df_date = df.to_spark() 
> spark_df_date.printSchema() 
> root
> |-- year: timestamp (nullable = true)
> |-- month: long (nullable = false)
> |-- day: long (nullable = false)
> |-- test: long (nullable = false)  
> spark_df_date.write.parquet("s3a://falk0509/spark_df_date.parquet")  
> {code}
> Load the files into Apache Drill (I use docker apache/drill:master-openjdk-14):
> SELECT * FROM cp.`/data/spark_df_date.*`
> It prints
> year
> {code}
> \x00\x00\x00\x00\x00\x00\x00\x00\xE2}%\x00
> \x00\x00\x00\x00\x00\x00\x00\x00m\x7F%\x00 
> {code}
>  
> The rest of the columns are ok.   
> So is this a spark problem or Apache drill? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36934) Timestamp are written as array bytes.

2021-10-06 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424885#comment-17424885
 ] 

Hyukjin Kwon commented on SPARK-36934:
--

what about TIMESTAMP_MILLIS? 

> Timestamp are written as array bytes.
> -
>
> Key: SPARK-36934
> URL: https://issues.apache.org/jira/browse/SPARK-36934
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.3.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> This is tested with master build 04.10.21
> {code}
> df = ps.DataFrame({'year': ['2015-2-4', '2016-3-5'],
>                    'month': [2, 3],
>                    'day': [4, 5],
>                   'test': [1, 2]})  
> df["year"] = ps.to_datetime(df["year"]) 
> df.info() 
>  Int64Index: 2 entries, 0 to 1
> Data columns (total 4 columns):
>  #   Column  Non-Null Count  Dtype
> ---  ------  --------------  -----
>  0   year    2 non-null      datetime64
>  1   month   2 non-null      int64
>  2   day     2 non-null      int64
>  3   test    2 non-null      int64
> dtypes: datetime64(1), int64(3)
> spark_df_date = df.to_spark() 
> spark_df_date.printSchema() 
> root
> |-- year: timestamp (nullable = true)
> |-- month: long (nullable = false)
> |-- day: long (nullable = false)
> |-- test: long (nullable = false)  
> spark_df_date.write.parquet("s3a://falk0509/spark_df_date.parquet")  
> {code}
> Load the files into Apache Drill (I use docker apache/drill:master-openjdk-14):
> SELECT * FROM cp.`/data/spark_df_date.*`
> It prints
> year
> {code}
> \x00\x00\x00\x00\x00\x00\x00\x00\xE2}%\x00
> \x00\x00\x00\x00\x00\x00\x00\x00m\x7F%\x00 
> {code}
>  
> The rest of the columns are ok.   
> So is this a spark problem or Apache drill? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36927) Inline type hints for python/pyspark/sql/window.py

2021-10-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36927.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34173
[https://github.com/apache/spark/pull/34173]

> Inline type hints for python/pyspark/sql/window.py
> --
>
> Key: SPARK-36927
> URL: https://issues.apache.org/jira/browse/SPARK-36927
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 1.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.3.0
>
>
> Inline type hints for python/pyspark/sql/window.py.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36927) Inline type hints for python/pyspark/sql/window.py

2021-10-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-36927:


Assignee: Xinrong Meng

> Inline type hints for python/pyspark/sql/window.py
> --
>
> Key: SPARK-36927
> URL: https://issues.apache.org/jira/browse/SPARK-36927
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 1.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>
> Inline type hints for python/pyspark/sql/window.py.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36919) Make BadRecordException serializable

2021-10-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36919.
--
Fix Version/s: 3.1.3
   3.2.0
   3.0.4
   Resolution: Fixed

Issue resolved by pull request 34167
[https://github.com/apache/spark/pull/34167]

> Make BadRecordException serializable
> 
>
> Key: SPARK-36919
> URL: https://issues.apache.org/jira/browse/SPARK-36919
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0, 3.3.0, 3.2.1
>Reporter: Tianhan Hu
>Assignee: Tianhan Hu
>Priority: Minor
> Fix For: 3.0.4, 3.2.0, 3.1.3
>
>
> While migrating a Spark application from 2.4.x to 3.1.x, we found a difference in 
> the exception chaining behavior. When parsing a malformed CSV, where 
> the root cause exception should be {{Caused by: java.lang.RuntimeException: 
> Malformed CSV record}}, only the top-level exception is kept, and all lower-level 
> exceptions and the root cause are lost. Thus, when we call 
> {{ExceptionUtils.getRootCause}} on the exception, we still get the exception itself.
> The reason for the difference is that {{RuntimeException}} is wrapped in 
> {{BadRecordException}}, which has unserializable fields. When we try to 
> serialize the exception from tasks and deserialize from scheduler, the 
> exception is lost.
> This PR makes unserializable fields of {{BadRecordException}} transient, so 
> the rest of the exception could be serialized and deserialized properly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36919) Make BadRecordException serializable

2021-10-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-36919:


Assignee: Tianhan Hu

> Make BadRecordException serializable
> 
>
> Key: SPARK-36919
> URL: https://issues.apache.org/jira/browse/SPARK-36919
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0, 3.3.0, 3.2.1
>Reporter: Tianhan Hu
>Assignee: Tianhan Hu
>Priority: Minor
>
> While migrating a Spark application from 2.4.x to 3.1.x, we found a difference in 
> the exception chaining behavior. When parsing a malformed CSV, where 
> the root cause exception should be {{Caused by: java.lang.RuntimeException: 
> Malformed CSV record}}, only the top-level exception is kept, and all lower-level 
> exceptions and the root cause are lost. Thus, when we call 
> {{ExceptionUtils.getRootCause}} on the exception, we still get the exception itself.
> The reason for the difference is that {{RuntimeException}} is wrapped in 
> {{BadRecordException}}, which has unserializable fields. When we try to 
> serialize the exception from tasks and deserialize from scheduler, the 
> exception is lost.
> This PR makes unserializable fields of {{BadRecordException}} transient, so 
> the rest of the exception could be serialized and deserialized properly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36934) Timestamp are written as array bytes.

2021-10-06 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-36934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424901#comment-17424901
 ] 

Bjørn Jørgensen commented on SPARK-36934:
-

With .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS") 

 

 

In Apache drill now 

year 

2015-02-04T00:00
|2016-03-05T00:00|

 

> Timestamp are written as array bytes.
> -
>
> Key: SPARK-36934
> URL: https://issues.apache.org/jira/browse/SPARK-36934
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.3.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> This is tested with master build 04.10.21
> {code}
> df = ps.DataFrame({'year': ['2015-2-4', '2016-3-5'],
>                    'month': [2, 3],
>                    'day': [4, 5],
>                   'test': [1, 2]})  
> df["year"] = ps.to_datetime(df["year"]) 
> df.info() 
>  Int64Index: 2 entries, 0 to 1
> Data columns (total 4 columns):
>  #   Column  Non-Null Count  Dtype
> ---  ------  --------------  -----
>  0   year    2 non-null      datetime64
>  1   month   2 non-null      int64
>  2   day     2 non-null      int64
>  3   test    2 non-null      int64
> dtypes: datetime64(1), int64(3)
> spark_df_date = df.to_spark() 
> spark_df_date.printSchema() 
> root
> |-- year: timestamp (nullable = true)
> |-- month: long (nullable = false)
> |-- day: long (nullable = false)
> |-- test: long (nullable = false)  
> spark_df_date.write.parquet("s3a://falk0509/spark_df_date.parquet")  
> {code}
> Load the files into Apache Drill (I use docker apache/drill:master-openjdk-14):
> SELECT * FROM cp.`/data/spark_df_date.*`
> It prints
> year
> {code}
> \x00\x00\x00\x00\x00\x00\x00\x00\xE2}%\x00
> \x00\x00\x00\x00\x00\x00\x00\x00m\x7F%\x00 
> {code}
>  
> The rest of the columns are ok.   
> So is this a spark problem or Apache drill? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36934) Timestamp are written as array bytes.

2021-10-06 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424917#comment-17424917
 ] 

Hyukjin Kwon commented on SPARK-36934:
--

It looks like Apache Drill only implements TIMESTAMP_MILLIS in Parquet. 
TIMESTAMP_MICROS is also part of the Parquet standard, but the read path for 
this type appears to be missing in Drill.

You will have to use TIMESTAMP_MILLIS for now.
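A minimal workaround sketch for the report above, assuming the same session and DataFrame (spark, spark_df_date) as in the reproducer; the destination path is simply the one from the description:

{code:python}
# Write timestamps as INT64 TIMESTAMP_MILLIS instead of the default INT96
# (which Drill displays as 12-byte arrays), so Drill can decode the column.
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")
spark_df_date.write.parquet("s3a://falk0509/spark_df_date.parquet")
{code}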

> Timestamp are written as array bytes.
> -
>
> Key: SPARK-36934
> URL: https://issues.apache.org/jira/browse/SPARK-36934
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.3.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> This is tested with master build 04.10.21
> {code}
> df = ps.DataFrame({'year': ['2015-2-4', '2016-3-5'],
>                    'month': [2, 3],
>                    'day': [4, 5],
>                   'test': [1, 2]})  
> df["year"] = ps.to_datetime(df["year"]) 
> df.info() 
>  Int64Index: 2 entries, 0 to 1
> Data columns (total 4 columns):
>  #   Column  Non-Null Count  Dtype
> ---  ------  --------------  -----
>  0   year    2 non-null      datetime64
>  1   month   2 non-null      int64
>  2   day     2 non-null      int64
>  3   test    2 non-null      int64
> dtypes: datetime64(1), int64(3)
> spark_df_date = df.to_spark() 
> spark_df_date.printSchema() 
> root
> |-- year: timestamp (nullable = true)
> |-- month: long (nullable = false)
> |-- day: long (nullable = false)
> |-- test: long (nullable = false)  
> spark_df_date.write.parquet("s3a://falk0509/spark_df_date.parquet")  
> {code}
> Load the files into Apache Drill (I use docker apache/drill:master-openjdk-14):
> SELECT * FROM cp.`/data/spark_df_date.*`
> It prints
> year
> {code}
> \x00\x00\x00\x00\x00\x00\x00\x00\xE2}%\x00
> \x00\x00\x00\x00\x00\x00\x00\x00m\x7F%\x00 
> {code}
>  
> The rest of the columns are ok.   
> So is this a spark problem or Apache drill? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36887) Inline type hints for python/pyspark/sql/conf.py

2021-10-06 Thread dch nguyen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dch nguyen resolved SPARK-36887.

Resolution: Resolved

This issue is resolved by https://issues.apache.org/jira/browse/SPARK-36906

> Inline type hints for python/pyspark/sql/conf.py
> 
>
> Key: SPARK-36887
> URL: https://issues.apache.org/jira/browse/SPARK-36887
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dgd_contributor
>Priority: Major
>
> Inline type hints for python/pyspark/sql/conf.py, migrated from the stub file 
> python/pyspark/sql/conf.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36938) Inline type hints for group.py in python/pyspark/sql

2021-10-06 Thread dch nguyen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424929#comment-17424929
 ] 

dch nguyen commented on SPARK-36938:


I am working on this.

> Inline type hints for group.py in python/pyspark/sql  
> -
>
> Key: SPARK-36938
> URL: https://issues.apache.org/jira/browse/SPARK-36938
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36938) Inline type hints for group.py in python/pyspark/sql

2021-10-06 Thread dch nguyen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dch nguyen updated SPARK-36938:
---
Summary: Inline type hints for group.py in python/pyspark/sql (was: 
nline type hints for group.py in python/pyspark/sql )

> Inline type hints for group.py in python/pyspark/sql  
> -
>
> Key: SPARK-36938
> URL: https://issues.apache.org/jira/browse/SPARK-36938
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36938) nline type hints for group.py in python/pyspark/sql

2021-10-06 Thread dch nguyen (Jira)
dch nguyen created SPARK-36938:
--

 Summary: nline type hints for group.py in python/pyspark/sql   
 Key: SPARK-36938
 URL: https://issues.apache.org/jira/browse/SPARK-36938
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: dch nguyen






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36939) Add orphan migration page into list in PySpark documentation

2021-10-06 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-36939:


 Summary: Add orphan migration page into list in PySpark 
documentation
 Key: SPARK-36939
 URL: https://issues.apache.org/jira/browse/SPARK-36939
 Project: Spark
  Issue Type: Test
  Components: docs, PySpark
Affects Versions: 3.2.0
Reporter: Hyukjin Kwon


SPARK-36618 added a new migration guide page, but it was mistakenly not added to 
{{spark/python/docs/source/migration_guide/index.rst}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36939) Add orphan migration page into list in PySpark documentation

2021-10-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-36939:
-
Issue Type: Documentation  (was: Test)

> Add orphan migration page into list in PySpark documentation
> 
>
> Key: SPARK-36939
> URL: https://issues.apache.org/jira/browse/SPARK-36939
> Project: Spark
>  Issue Type: Documentation
>  Components: docs, PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> SPARK-36618 added a new migration guide page, but it was mistakenly not added 
> to {{spark/python/docs/source/migration_guide/index.rst}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36939) Add orphan migration page into list in PySpark documentation

2021-10-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36939:


Assignee: (was: Apache Spark)

> Add orphan migration page into list in PySpark documentation
> 
>
> Key: SPARK-36939
> URL: https://issues.apache.org/jira/browse/SPARK-36939
> Project: Spark
>  Issue Type: Documentation
>  Components: docs, PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> SPARK-36618 added a new migration guide page, but it was mistakenly not added 
> to {{spark/python/docs/source/migration_guide/index.rst}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36939) Add orphan migration page into list in PySpark documentation

2021-10-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424935#comment-17424935
 ] 

Apache Spark commented on SPARK-36939:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/34195

> Add orphan migration page into list in PySpark documentation
> 
>
> Key: SPARK-36939
> URL: https://issues.apache.org/jira/browse/SPARK-36939
> Project: Spark
>  Issue Type: Documentation
>  Components: docs, PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> SPARK-36618 added a new migration guide page, but it was mistakenly not added 
> to {{spark/python/docs/source/migration_guide/index.rst}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36939) Add orphan migration page into list in PySpark documentation

2021-10-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36939:


Assignee: Apache Spark

> Add orphan migration page into list in PySpark documentation
> 
>
> Key: SPARK-36939
> URL: https://issues.apache.org/jira/browse/SPARK-36939
> Project: Spark
>  Issue Type: Documentation
>  Components: docs, PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> SPARK-36618 added a new migration guide page, but it was mistakenly not added 
> to {{spark/python/docs/source/migration_guide/index.rst}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36934) Timestamp are written as array bytes.

2021-10-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36934.
--
Resolution: Not A Problem

> Timestamp are written as array bytes.
> -
>
> Key: SPARK-36934
> URL: https://issues.apache.org/jira/browse/SPARK-36934
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.3.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> This is tested with master build 04.10.21
> {code}
> df = ps.DataFrame({'year': ['2015-2-4', '2016-3-5'],
>                    'month': [2, 3],
>                    'day': [4, 5],
>                   'test': [1, 2]})  
> df["year"] = ps.to_datetime(df["year"]) 
> df.info() 
>  Int64Index: 2 entries, 0 to 1
> Data columns (total 4 columns):
>  #   Column  Non-Null Count  Dtype
> ---  ------  --------------  -----
>  0   year    2 non-null      datetime64
>  1   month   2 non-null      int64
>  2   day     2 non-null      int64
>  3   test    2 non-null      int64
> dtypes: datetime64(1), int64(3)
> spark_df_date = df.to_spark() 
> spark_df_date.printSchema() 
> root
> |-- year: timestamp (nullable = true)
> |-- month: long (nullable = false)
> |-- day: long (nullable = false)
> |-- test: long (nullable = false)  
> spark_df_date.write.parquet("s3a://falk0509/spark_df_date.parquet")  
> {code}
> Load the files into Apache Drill (I use docker apache/drill:master-openjdk-14):
> SELECT * FROM cp.`/data/spark_df_date.*`
> It prints
> year
> {code}
> \x00\x00\x00\x00\x00\x00\x00\x00\xE2}%\x00
> \x00\x00\x00\x00\x00\x00\x00\x00m\x7F%\x00 
> {code}
>  
> The rest of the columns are ok.   
> So is this a spark problem or Apache drill? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36751) octet_length/bit_length API is not implemented on Scala/Python/R

2021-10-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424939#comment-17424939
 ] 

Apache Spark commented on SPARK-36751:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/34196

> octet_length/bit_length API is not implemented  on Scala/Python/R
> -
>
> Key: SPARK-36751
> URL: https://issues.apache.org/jira/browse/SPARK-36751
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SparkR, SQL
>Affects Versions: 3.3.0
>Reporter: Leona Yoda
>Assignee: Leona Yoda
>Priority: Major
> Fix For: 3.3.0
>
>
> * octet_length: calculate the byte length of strings
>  * bit_length: calculate the bit length of strings
> These two string-related functions are only implemented in Spark SQL, not in 
> Scala, Python, or R.
> They would be useful for users working with multi-byte characters, who mainly 
> work in those languages.
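Until dedicated wrappers exist, a hedged sketch of how to reach these functions from PySpark today through a SQL expression (the sample string and expected values are my own, not from the ticket):

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("São Paulo",)], ["city"])

# octet_length/bit_length already exist in Spark SQL, so expr() can call them
# even though there is no Scala/Python/R function wrapper yet.
df.select(
    F.expr("octet_length(city)").alias("octets"),  # 10 bytes ("ã" is 2 bytes in UTF-8)
    F.expr("bit_length(city)").alias("bits"),      # 80 bits
).show()
{code}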



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36858) Spark API to apply same function to multiple columns

2021-10-06 Thread Armand BERGES (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424988#comment-17424988
 ] 

Armand BERGES commented on SPARK-36858:
---

[~hyukjin.kwon] How would you do this?

From my point of view, if you call `df.withColumn` in a for loop, it will end up with 
the same execution plan (so probably the same problem in the end, no?)
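(For reference, a sketch of the usual single-projection alternative, written in PySpark for brevity; clean_function, col_list_to_clean, and df_to_clean are assumed analogues of the Scala names in the description below. All columns are rewritten in one select, so only one projection is added instead of one plan node per withColumn.)

{code:python}
from pyspark.sql import functions as F

def clean_function(col_name):
    # assumption: same idea as cleanFunction in the description,
    # e.g. strip non-alphanumeric characters from the column
    return F.regexp_replace(F.col(col_name), "[^a-zA-Z0-9]", "")

cols_to_clean = set(col_list_to_clean)  # the (possibly very long) list of column names
df_cleaned = df_to_clean.select(
    [clean_function(c).alias(c) if c in cols_to_clean else F.col(c)
     for c in df_to_clean.columns]
)
{code}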

> Spark API to apply same function to multiple columns
> 
>
> Key: SPARK-36858
> URL: https://issues.apache.org/jira/browse/SPARK-36858
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.8, 3.1.2
>Reporter: Armand BERGES
>Priority: Minor
>
> Hi
> My team and I regularly need to apply the same function to multiple 
> columns at once.
> For example, we want to remove all non-alphanumeric characters from each 
> column of our dataframes. 
> When we first hit this use case, some people on my team were using this kind 
> of code: 
> {code:java}
> // colListToClean: some list of column names, could be very long.
> val colListToClean: Seq[String] = ...
> // dfToClean: the dataframe we want to clean.
> val dfToClean: DataFrame = ...
> // cleanFunction: manipulates a column based on its name.
> def cleanFunction(colName: String): Column = ...
> val dfCleaned = colListToClean.foldLeft(dfToClean)((df, colName) => 
> df.withColumn(colName, cleanFunction(colName))){code}
> This kind of code, when applied to a large set of columns, overloaded our 
> driver (because a new DataFrame is generated for each column to clean).
> Based on this issue, we developed some code to add two functions: 
>  * One to apply the same function to multiple columns
>  * One to rename multiple columns based on a Map. 
>  
> I wonder if you have ever been asked to add this kind of API? If you did, did 
> you run into any issues with the implementation? If you didn't, is this an 
> idea you could add to Spark? 
> Best regards, 
>  
> LvffY
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36938) Inline type hints for group.py in python/pyspark/sql

2021-10-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36938:


Assignee: Apache Spark

> Inline type hints for group.py in python/pyspark/sql  
> -
>
> Key: SPARK-36938
> URL: https://issues.apache.org/jira/browse/SPARK-36938
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36938) Inline type hints for group.py in python/pyspark/sql

2021-10-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424989#comment-17424989
 ] 

Apache Spark commented on SPARK-36938:
--

User 'dchvn' has created a pull request for this issue:
https://github.com/apache/spark/pull/34197

> Inline type hints for group.py in python/pyspark/sql  
> -
>
> Key: SPARK-36938
> URL: https://issues.apache.org/jira/browse/SPARK-36938
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36938) Inline type hints for group.py in python/pyspark/sql

2021-10-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36938:


Assignee: (was: Apache Spark)

> Inline type hints for group.py in python/pyspark/sql  
> -
>
> Key: SPARK-36938
> URL: https://issues.apache.org/jira/browse/SPARK-36938
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36938) Inline type hints for group.py in python/pyspark/sql

2021-10-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424990#comment-17424990
 ] 

Apache Spark commented on SPARK-36938:
--

User 'dchvn' has created a pull request for this issue:
https://github.com/apache/spark/pull/34197

> Inline type hints for group.py in python/pyspark/sql  
> -
>
> Key: SPARK-36938
> URL: https://issues.apache.org/jira/browse/SPARK-36938
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36892) Disable batch fetch for a shuffle when push based shuffle is enabled

2021-10-06 Thread Mridul Muralidharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425016#comment-17425016
 ] 

Mridul Muralidharan commented on SPARK-36892:
-

Sounds good [~Gengliang.Wang], I am not aware of any other issues.
Thanks for driving the process!

> Disable batch fetch for a shuffle when push based shuffle is enabled
> 
>
> Key: SPARK-36892
> URL: https://issues.apache.org/jira/browse/SPARK-36892
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: Mridul Muralidharan
>Assignee: Ye Zhou
>Priority: Blocker
> Fix For: 3.2.0
>
>
> When push based shuffle is enabled, efficient fetch of merged mapper shuffle 
> output happens.
> Unfortunately, this currently interacts badly with 
> spark.sql.adaptive.fetchShuffleBlocksInBatch, potentially causing shuffle 
> fetch to hang and/or duplicate data to be fetched, causing correctness issues.
> Given batch fetch does not benefit spark stages reading merged blocks when 
> push based shuffle is enabled, ShuffleBlockFetcherIterator.doBatchFetch can 
> be disabled when push based shuffle is enabled.
> Thx to [~Ngone51] for surfacing this issue.
> +CC [~Gengliang.Wang]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36905) Reading Hive view without explicit column names fails in Spark

2021-10-06 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425072#comment-17425072
 ] 

Gengliang Wang commented on SPARK-36905:


[~shardulm] Thanks for reporting the issue. 
I don't think this is a release blocker. I will mention this one as a known 
issue in the release note if it is not resolved by then.

> Reading Hive view without explicit column names fails in Spark 
> ---
>
> Key: SPARK-36905
> URL: https://issues.apache.org/jira/browse/SPARK-36905
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Shardul Mahadik
>Priority: Major
>
> Consider a Hive view in which some columns are not explicitly named
> {code:sql}
> CREATE VIEW test_view AS
> SELECT 1
> FROM some_table
> {code}
> Reading this view in Spark leads to an {{AnalysisException}}
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve '`_c0`' given input 
> columns: [1]
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:188)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:185)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:340)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:340)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:337)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:406)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:242)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:404)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:357)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:337)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:337)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:406)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:242)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:404)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:357)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:337)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsUp$1(QueryPlan.scala:104)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:116)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:116)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:127)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$3(QueryPlan.scala:132)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:132)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:137)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:242)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:137)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:104)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:185)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:94)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:182)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:94)
> 
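(A hedged sketch of the workaround implied by the report: the error above shows Spark looking for the implicitly named column `_c0`, so giving the literal an explicit alias when defining the view avoids the unresolved reference. The original view is created through Hive; the snippet below only illustrates the aliased definition.)

{code:python}
# Explicitly alias the otherwise unnamed column when (re)creating the view,
# so readers resolve a real column name instead of Hive's implicit `_c0`.
spark.sql("CREATE VIEW test_view AS SELECT 1 AS one FROM some_table")
spark.table("test_view").show()
{code}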

[jira] [Assigned] (SPARK-36300) Refactor eleventh set of 20 query execution errors to use error classes

2021-10-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36300:


Assignee: (was: Apache Spark)

> Refactor eleventh set of 20 query execution errors to use error classes
> ---
>
> Key: SPARK-36300
> URL: https://issues.apache.org/jira/browse/SPARK-36300
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Karen Feng
>Priority: Major
>
> Refactor some exceptions in 
> [QueryExecutionErrors|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala]
>  to use error classes.
> There are currently ~350 exceptions in this file; so this PR only focuses on 
> the eleventh set of 20.
> {code:java}
> expressionDecodingError
> expressionEncodingError
> classHasUnexpectedSerializerError
> cannotGetOuterPointerForInnerClassError
> userDefinedTypeNotAnnotatedAndRegisteredError
> invalidInputSyntaxForBooleanError
> unsupportedOperandTypeForSizeFunctionError
> unexpectedValueForStartInFunctionError
> unexpectedValueForLengthInFunctionError
> sqlArrayIndexNotStartAtOneError
> concatArraysWithElementsExceedLimitError
> flattenArraysWithElementsExceedLimitError
> createArrayWithElementsExceedLimitError
> unionArrayWithElementsExceedLimitError
> initialTypeNotTargetDataTypeError
> initialTypeNotTargetDataTypesError
> cannotConvertColumnToJSONError
> malformedRecordsDetectedInSchemaInferenceError
> malformedJSONError
> malformedRecordsDetectedInSchemaInferenceError
> {code}
> For more detail, see the parent ticket SPARK-36094.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36300) Refactor eleventh set of 20 query execution errors to use error classes

2021-10-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36300:


Assignee: Apache Spark

> Refactor eleventh set of 20 query execution errors to use error classes
> ---
>
> Key: SPARK-36300
> URL: https://issues.apache.org/jira/browse/SPARK-36300
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Karen Feng
>Assignee: Apache Spark
>Priority: Major
>
> Refactor some exceptions in 
> [QueryExecutionErrors|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala]
>  to use error classes.
> There are currently ~350 exceptions in this file; so this PR only focuses on 
> the eleventh set of 20.
> {code:java}
> expressionDecodingError
> expressionEncodingError
> classHasUnexpectedSerializerError
> cannotGetOuterPointerForInnerClassError
> userDefinedTypeNotAnnotatedAndRegisteredError
> invalidInputSyntaxForBooleanError
> unsupportedOperandTypeForSizeFunctionError
> unexpectedValueForStartInFunctionError
> unexpectedValueForLengthInFunctionError
> sqlArrayIndexNotStartAtOneError
> concatArraysWithElementsExceedLimitError
> flattenArraysWithElementsExceedLimitError
> createArrayWithElementsExceedLimitError
> unionArrayWithElementsExceedLimitError
> initialTypeNotTargetDataTypeError
> initialTypeNotTargetDataTypesError
> cannotConvertColumnToJSONError
> malformedRecordsDetectedInSchemaInferenceError
> malformedJSONError
> malformedRecordsDetectedInSchemaInferenceError
> {code}
> For more detail, see the parent ticket SPARK-36094.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36300) Refactor eleventh set of 20 query execution errors to use error classes

2021-10-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425129#comment-17425129
 ] 

Apache Spark commented on SPARK-36300:
--

User 'changvvb' has created a pull request for this issue:
https://github.com/apache/spark/pull/34198

> Refactor eleventh set of 20 query execution errors to use error classes
> ---
>
> Key: SPARK-36300
> URL: https://issues.apache.org/jira/browse/SPARK-36300
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Karen Feng
>Priority: Major
>
> Refactor some exceptions in 
> [QueryExecutionErrors|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala]
>  to use error classes.
> There are currently ~350 exceptions in this file; so this PR only focuses on 
> the eleventh set of 20.
> {code:java}
> expressionDecodingError
> expressionEncodingError
> classHasUnexpectedSerializerError
> cannotGetOuterPointerForInnerClassError
> userDefinedTypeNotAnnotatedAndRegisteredError
> invalidInputSyntaxForBooleanError
> unsupportedOperandTypeForSizeFunctionError
> unexpectedValueForStartInFunctionError
> unexpectedValueForLengthInFunctionError
> sqlArrayIndexNotStartAtOneError
> concatArraysWithElementsExceedLimitError
> flattenArraysWithElementsExceedLimitError
> createArrayWithElementsExceedLimitError
> unionArrayWithElementsExceedLimitError
> initialTypeNotTargetDataTypeError
> initialTypeNotTargetDataTypesError
> cannotConvertColumnToJSONError
> malformedRecordsDetectedInSchemaInferenceError
> malformedJSONError
> malformedRecordsDetectedInSchemaInferenceError
> {code}
> For more detail, see the parent ticket SPARK-36094.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36940) Inline type hints for python/pyspark/sql/avro/functions.py

2021-10-06 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-36940:


 Summary: Inline type hints for python/pyspark/sql/avro/functions.py
 Key: SPARK-36940
 URL: https://issues.apache.org/jira/browse/SPARK-36940
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Xinrong Meng


Inline type hints for python/pyspark/sql/avro/functions.py



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36940) Inline type hints for python/pyspark/sql/avro/functions.py

2021-10-06 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425132#comment-17425132
 ] 

Xinrong Meng commented on SPARK-36940:
--

I'm working on this.

> Inline type hints for python/pyspark/sql/avro/functions.py
> --
>
> Key: SPARK-36940
> URL: https://issues.apache.org/jira/browse/SPARK-36940
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Inline type hints for python/pyspark/sql/avro/functions.py



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36940) Inline type hints for python/pyspark/sql/avro/functions.py

2021-10-06 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-36940:
-
Description: 
Inline type hints for python/pyspark/sql/avro/functions.py.

 

Currently, we use stub files for type annotations, which don't support type 
checks within function bodies. So we inline type hints to support that.

  was:Inline type hints for python/pyspark/sql/avro/functions.py


> Inline type hints for python/pyspark/sql/avro/functions.py
> --
>
> Key: SPARK-36940
> URL: https://issues.apache.org/jira/browse/SPARK-36940
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Inline type hints for python/pyspark/sql/avro/functions.py.
>  
> Currently, we use stub files for type annotations, which don't support type 
> checks within function bodies. So we inline type hints to support that.
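A toy sketch of what "inlining" means for this set of tickets (hypothetical function, not the real avro/functions.py contents): the annotations move out of the .pyi stub and into the .py module, so type checkers can also look inside the function body.

{code:python}
# before -- only in the stub file (functions.pyi); the body is never checked:
#     def summarize(df: DataFrame, col: str) -> DataFrame: ...

# after -- inline in functions.py; both the signature and the body are type-checked:
from pyspark.sql import DataFrame

def summarize(df: DataFrame, col: str) -> DataFrame:
    return df.select(col).summary()
{code}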



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36935) Enhance ParquetSchemaConverter to capture Parquet repetition & definition level

2021-10-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36935:


Assignee: Apache Spark

> Enhance ParquetSchemaConverter to capture Parquet repetition & definition 
> level
> ---
>
> Key: SPARK-36935
> URL: https://issues.apache.org/jira/browse/SPARK-36935
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Assignee: Apache Spark
>Priority: Major
>
> In order to support complex type for Parquet vectorized reader, we'll need to 
> capture the repetition & definition level information associated with 
> Catalyst Spark type converted from Parquet {{MessageType}}.
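For intuition, a small illustration (not the actual ParquetSchemaConverter API) of the per-leaf information the converter would need to capture:

{code:python}
# For a Parquet schema like
#   required int32 id
#   optional group address { optional binary city (UTF8) }
#   optional group tags (LIST) { repeated group list { optional binary element (UTF8) } }
# each leaf column has a max definition and max repetition level:
levels = {
    "id":                (0, 0),  # (max def, max rep): required field at the top level
    "address.city":      (2, 0),  # two optional ancestors -> two definition levels
    "tags.list.element": (3, 1),  # the repeated group adds one repetition level
}
{code}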



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36935) Enhance ParquetSchemaConverter to capture Parquet repetition & definition level

2021-10-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425140#comment-17425140
 ] 

Apache Spark commented on SPARK-36935:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/34199

> Enhance ParquetSchemaConverter to capture Parquet repetition & definition 
> level
> ---
>
> Key: SPARK-36935
> URL: https://issues.apache.org/jira/browse/SPARK-36935
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Priority: Major
>
> In order to support complex type for Parquet vectorized reader, we'll need to 
> capture the repetition & definition level information associated with 
> Catalyst Spark type converted from Parquet {{MessageType}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36935) Enhance ParquetSchemaConverter to capture Parquet repetition & definition level

2021-10-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36935:


Assignee: (was: Apache Spark)

> Enhance ParquetSchemaConverter to capture Parquet repetition & definition 
> level
> ---
>
> Key: SPARK-36935
> URL: https://issues.apache.org/jira/browse/SPARK-36935
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Priority: Major
>
> In order to support complex type for Parquet vectorized reader, we'll need to 
> capture the repetition & definition level information associated with 
> Catalyst Spark type converted from Parquet {{MessageType}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36935) Enhance ParquetSchemaConverter to capture Parquet repetition & definition level

2021-10-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425141#comment-17425141
 ] 

Apache Spark commented on SPARK-36935:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/34199

> Enhance ParquetSchemaConverter to capture Parquet repetition & definition 
> level
> ---
>
> Key: SPARK-36935
> URL: https://issues.apache.org/jira/browse/SPARK-36935
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Priority: Major
>
> In order to support complex type for Parquet vectorized reader, we'll need to 
> capture the repetition & definition level information associated with 
> Catalyst Spark type converted from Parquet {{MessageType}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36940) Inline type hints for python/pyspark/sql/avro/functions.py

2021-10-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425143#comment-17425143
 ] 

Apache Spark commented on SPARK-36940:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/34200

> Inline type hints for python/pyspark/sql/avro/functions.py
> --
>
> Key: SPARK-36940
> URL: https://issues.apache.org/jira/browse/SPARK-36940
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Inline type hints for python/pyspark/sql/avro/functions.py.
>  
> Currently, we use stub files for type annotations, which don't support type 
> checks within function bodies. So we inline type hints to support that.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36940) Inline type hints for python/pyspark/sql/avro/functions.py

2021-10-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36940:


Assignee: Apache Spark

> Inline type hints for python/pyspark/sql/avro/functions.py
> --
>
> Key: SPARK-36940
> URL: https://issues.apache.org/jira/browse/SPARK-36940
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> Inline type hints for python/pyspark/sql/avro/functions.py.
>  
> Currently, we use stub files for type annotations, which don't support type 
> checks within function bodies. So we inline type hints to support that.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36940) Inline type hints for python/pyspark/sql/avro/functions.py

2021-10-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36940:


Assignee: (was: Apache Spark)

> Inline type hints for python/pyspark/sql/avro/functions.py
> --
>
> Key: SPARK-36940
> URL: https://issues.apache.org/jira/browse/SPARK-36940
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Inline type hints for python/pyspark/sql/avro/functions.py.
>  
> Currently, we use stub files for type annotations, which don't support type 
> checks within function bodies. So we inline type hints to support that.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36937) Change OrcSourceSuite to test both V1 and V2 sources.

2021-10-06 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-36937.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34194
[https://github.com/apache/spark/pull/34194]

> Change OrcSourceSuite to test both V1 and V2 sources.
> -
>
> Key: SPARK-36937
> URL: https://issues.apache.org/jira/browse/SPARK-36937
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.3.0
>
>
> There is no V2 test for the ORC source that implements 
> CommonFileDataSourceSuite, while corresponding tests exist for all other 
> built-in file-based datasources.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36936) spark-hadoop-cloud broken on release and only published via 3rd party repositories

2021-10-06 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425162#comment-17425162
 ] 

Chao Sun commented on SPARK-36936:
--

[~colin.williams] which version of {{spark-hadoop-cloud}} were you using? I 
think the above error shouldn't happen if the version is the same as Spark's 
version.

We've already started to publish {{spark-hadoop-cloud}} as part of the Spark 
release procedure, see SPARK-35844.
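
As a concrete illustration of the version-matching point, a hypothetical 
build.sbt fragment (the exact version number is an assumption, not a verified 
release matrix; see SPARK-35844 for which releases publish the ASF artifact):

{code:scala}
// Hypothetical build.sbt fragment: spark-hadoop-cloud should come from the same
// ASF Spark release as spark-sql, rather than mixing an ASF Spark with a vendor
// build such as 3.1.1.3.1.7270.0-253.
val sparkVersion = "3.2.0" // assumed to be a release whose spark-hadoop-cloud is on Maven Central

libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion % "provided"
libraryDependencies += "org.apache.spark" %% "spark-hadoop-cloud" % sparkVersion
{code}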

> spark-hadoop-cloud broken on release and only published via 3rd party 
> repositories
> --
>
> Key: SPARK-36936
> URL: https://issues.apache.org/jira/browse/SPARK-36936
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.1.1, 3.1.2
> Environment: name:=spark-demo
> version := "0.0.1"
> scalaVersion := "2.12.12"
> lazy val app = (project in file("app")).settings(
>  assemblyPackageScala / assembleArtifact := false,
>  assembly / assemblyJarName := "uber.jar",
>  assembly / mainClass := Some("com.example.Main"),
>  // more settings here ...
>  )
> resolvers += "Cloudera" at 
> "https://repository.cloudera.com/artifactory/cloudera-repos/";
> libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.2" % 
> "provided"
> libraryDependencies += "org.apache.spark" %% "spark-hadoop-cloud" % 
> "3.1.1.3.1.7270.0-253"
> libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % 
> "3.1.1.7.2.7.0-184"
> libraryDependencies += "com.amazonaws" % "aws-java-sdk-bundle" % "1.11.901"
> libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.1" % "test"
> // test suite settings
> fork in Test := true
> javaOptions ++= Seq("-Xms512M", "-Xmx2048M", "-XX:MaxPermSize=2048M", 
> "-XX:+CMSClassUnloadingEnabled")
> // Show runtime of tests
> testOptions in Test += Tests.Argument(TestFrameworks.ScalaTest, "-oD")
> ___
>  
> import org.apache.spark.sql.SparkSession
> object SparkApp {
>  def main(args: Array[String]){
>  val spark = SparkSession.builder().master("local")
>  //.config("spark.jars.repositories", 
> "https://repository.cloudera.com/artifactory/cloudera-repos/";)
>  //.config("spark.jars.packages", 
> "org.apache.spark:spark-hadoop-cloud_2.12:3.1.1.3.1.7270.0-253")
>  .appName("spark session").getOrCreate
>  val jsonDF = spark.read.json("s3a://path-to-bucket/compact.json")
>  val csvDF = spark.read.format("csv").load("s3a://path-to-bucket/some.csv")
>  jsonDF.show()
>  csvDF.show()
>  }
> }
>Reporter: Colin Williams
>Priority: Major
>
> The Spark documentation suggests using `spark-hadoop-cloud` to read / write 
> from S3 in [https://spark.apache.org/docs/latest/cloud-integration.html]. 
> However, artifacts are currently published only via 3rd party resolvers in 
> [https://mvnrepository.com/artifact/org.apache.spark/spark-hadoop-cloud], 
> including Cloudera and Palantir.
>  
> So the Apache Spark documentation is effectively pointing users to a 3rd 
> party solution for object stores including S3. Furthermore, if you follow the 
> instructions and include one of the 3rd party jars, i.e. the Cloudera jar, 
> with the Spark 3.1.2 release and try to access an object store, the following 
> exception is returned.
>  
> ```
> Exception in thread "main" java.lang.NoSuchMethodError: 'void 
> com.google.common.base.Preconditions.checkArgument(boolean, java.lang.String, 
> java.lang.Object, java.lang.Object)'
>  at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:894)
>  at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:870)
>  at 
> org.apache.hadoop.fs.s3a.S3AUtils.getEncryptionAlgorithm(S3AUtils.java:1605)
>  at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:363)
>  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
>  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
>  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
>  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
>  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
>  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
>  at 
> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
>  at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
>  at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:519)
>  at org.apache.spark.sql.DataFrameRead

[jira] [Commented] (SPARK-32929) StreamSuite failure on IBM Z: - SPARK-20432: union one stream with itself

2021-10-06 Thread Kun Lu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425179#comment-17425179
 ] 

Kun Lu commented on SPARK-32929:


This issue has been addressed in commit 
[3a299aa|https://github.com/apache/spark/commit/3a299aa6480ac22501512cd0310d31a441d7dfdc].
It does not happen on Spark 3.1.2 on IBM Z.

> StreamSuite failure on IBM Z: - SPARK-20432: union one stream with itself
> -
>
> Key: SPARK-32929
> URL: https://issues.apache.org/jira/browse/SPARK-32929
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
> Environment: openjdk version "11.0.8" 2020-07-14
> OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.8+10)
> OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.8+10, mixed mode)
> Linux 4.15.0-117-generic #118-Ubuntu SMP Fri Sep 4 20:00:20 UTC 2020 s390x 
> s390x s390x GNU/Linux
>Reporter: Michael Munday
>Priority: Minor
>  Labels: big-endian
>
> I am getting zeros in the output of this test on IBM Z. This is a big-endian 
> system. See error below.
> I think this issue is related to the use of {{IntegerType}} in the schema for 
> {{FakeDefaultSource}}. Modifying the schema to use {{LongType}} fixes the 
> issue. Another workaround is to remove {{.select("a")}} (see patch below).
> My working theory is that long data (longs are generated by Range) is being 
> read using unsafe int operations (as specified in the schema). This would 
> 'work' on little-endian systems but not big-endian systems. I'm still working 
> to figure out what the mechanism is and I'd appreciate any hints or insights.
> The error looks like this:
> {noformat}
> - SPARK-20432: union one stream with itself *** FAILED ***
>   Decoded objects do not match expected objects:
>   expected: WrappedArray(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 1, 2, 3, 4, 5, 
> 6, 7, 8, 9, 10)
>   actual:   WrappedArray(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
> 0, 0, 0, 0, 0)
>   assertnotnull(upcast(getcolumnbyordinal(0, LongType), LongType, - root 
> class: "scala.Long"))
>   +- upcast(getcolumnbyordinal(0, LongType), LongType, - root class: 
> "scala.Long")
>  +- getcolumnbyordinal(0, LongType) (QueryTest.scala:88)
> {noformat}
> This change fixes the issue: 
> {code:java}
> --- a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala
> +++ b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala
> @@ -45,7 +45,7 @@ import org.apache.spark.sql.functions._
>  import org.apache.spark.sql.internal.SQLConf
>  import org.apache.spark.sql.sources.StreamSourceProvider
>  import org.apache.spark.sql.streaming.util.{BlockOnStopSourceProvider, 
> StreamManualClock}
> -import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
> +import org.apache.spark.sql.types.{IntegerType, LongType, StructField, 
> StructType}
>  import org.apache.spark.util.Utils
>  class StreamSuite extends StreamTest {
> @@ -1265,7 +1265,7 @@ class StreamSuite extends StreamTest {
>  }
>  abstract class FakeSource extends StreamSourceProvider {
> -  private val fakeSchema = StructType(StructField("a", IntegerType) :: Nil)
> +  private val fakeSchema = StructType(StructField("a", LongType) :: Nil)
>override def sourceSchema(
>spark: SQLContext,
> @@ -1287,7 +1287,7 @@ class FakeDefaultSource extends FakeSource {
>  new Source {
>private var offset = -1L
> -  override def schema: StructType = StructType(StructField("a", 
> IntegerType) :: Nil)
> +  override def schema: StructType = StructType(StructField("a", 
> LongType) :: Nil)
>override def getOffset: Option[Offset] = {
>  if (offset >= 10) {
> {code}
> Alternatively, this change also fixes the issue:
> {code:java}
> --- a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala
> +++ b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala
> @@ -154,7 +154,7 @@ class StreamSuite extends StreamTest {
>}
>  
>test("SPARK-20432: union one stream with itself") {
> -val df = 
> spark.readStream.format(classOf[FakeDefaultSource].getName).load().select("a")
> +val df = 
> spark.readStream.format(classOf[FakeDefaultSource].getName).load()
>  val unioned = df.union(df)
>  withTempDir { outputDir =>
>withTempDir { checkpointDir =>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32952) Test failure on IBM Z: CoalesceShufflePartitionsSuite: - determining the number of reducers: complex query 1

2021-10-06 Thread Kun Lu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425187#comment-17425187
 ] 

Kun Lu commented on SPARK-32952:


This issue still exists on Spark v3.1.2 on IBM Z. Any updates from the 
community would be greatly appreciated.

> Test failure on IBM Z: CoalesceShufflePartitionsSuite: - determining the 
> number of reducers: complex query 1
> 
>
> Key: SPARK-32952
> URL: https://issues.apache.org/jira/browse/SPARK-32952
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
> Environment: Linux on IBM Z (s390x).
>Reporter: Michael Munday
>Priority: Minor
>  Labels: big-endian
>
> I'm seeing the test 'CoalesceShufflePartitionsSuite: - determining the number 
> of reducers: complex query 1' fail on IBM Z with the wrong number of 
> partitions (1 instead of 2). It's strange because none of the other tests 
> fail.
> I'd be grateful for any hints as to how the number of partitions is 
> calculated. Could that calculation be affected by incorrect unsafe code? Is 
> there a way to trace the calculation?
> {noformat}
> CoalesceShufflePartitionsSuite:
> - determining the number of reducers: aggregate 
> operator(minNumPostShufflePartitions: 5)
> - determining the number of reducers: join 
> operator(minNumPostShufflePartitions: 5)
> - determining the number of reducers: complex query 
> 1(minNumPostShufflePartitions: 5)
> - determining the number of reducers: complex query 
> 2(minNumPostShufflePartitions: 5)
> - determining the number of reducers: plan already 
> partitioned(minNumPostShufflePartitions: 5)
> - determining the number of reducers: aggregate operator
> - determining the number of reducers: join operator
> - determining the number of reducers: complex query 1 *** FAILED ***
>  1 did not equal 2 (CoalesceShufflePartitionsSuite.scala:221)
> - determining the number of reducers: complex query 2
> - determining the number of reducers: plan already partitioned
> - SPARK-24705 adaptive query execution works correctly when exchange reuse 
> enabled
> - Do not reduce the number of shuffle partition for repartition
> - Union two datasets with different pre-shuffle partition number{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36941) Check saving of a dataframe with ANSI intervals to a Hive parquet table

2021-10-06 Thread Max Gekk (Jira)
Max Gekk created SPARK-36941:


 Summary: Check saving of a dataframe with ANSI intervals to a Hive 
parquet table
 Key: SPARK-36941
 URL: https://issues.apache.org/jira/browse/SPARK-36941
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Max Gekk
Assignee: Max Gekk


Add a test which checks saving of a dataframe with ANSI intervals to a Hive 
table using parquet datasource.
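
A rough sketch of the intended round trip (assuming Spark 3.3 behaviour, a 
Hive-enabled spark-shell session, and a made-up table name; the real test would 
assert the result with the usual test helpers rather than show()):

{code:scala}
// Sketch only: write ANSI interval columns (java.time.Duration / java.time.Period)
// into a table backed by the parquet datasource and read them back.
import java.time.{Duration, Period}
import spark.implicits._

val df = Seq((Duration.ofDays(1), Period.ofMonths(2))).toDF("dt", "ym")
df.write.format("parquet").saveAsTable("ansi_interval_hive_tbl")
spark.table("ansi_interval_hive_tbl").show()
{code}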



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36941) Check saving of a dataframe with ANSI intervals to a Hive parquet table

2021-10-06 Thread Max Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425196#comment-17425196
 ] 

Max Gekk commented on SPARK-36941:
--

I am working on this.

> Check saving of a dataframe with ANSI intervals to a Hive parquet table
> ---
>
> Key: SPARK-36941
> URL: https://issues.apache.org/jira/browse/SPARK-36941
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Add a test which checks saving of a dataframe with ANSI intervals to a Hive 
> table using parquet datasource.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36941) Check saving of a dataframe with ANSI intervals to a Hive parquet table

2021-10-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425201#comment-17425201
 ] 

Apache Spark commented on SPARK-36941:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/34201

> Check saving of a dataframe with ANSI intervals to a Hive parquet table
> ---
>
> Key: SPARK-36941
> URL: https://issues.apache.org/jira/browse/SPARK-36941
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Add a test which checks saving of a dataframe with ANSI intervals to a Hive 
> table using parquet datasource.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36941) Check saving of a dataframe with ANSI intervals to a Hive parquet table

2021-10-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425202#comment-17425202
 ] 

Apache Spark commented on SPARK-36941:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/34201

> Check saving of a dataframe with ANSI intervals to a Hive parquet table
> ---
>
> Key: SPARK-36941
> URL: https://issues.apache.org/jira/browse/SPARK-36941
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Add a test which checks saving of a dataframe with ANSI intervals to a Hive 
> table using parquet datasource.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36941) Check saving of a dataframe with ANSI intervals to a Hive parquet table

2021-10-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36941:


Assignee: Apache Spark  (was: Max Gekk)

> Check saving of a dataframe with ANSI intervals to a Hive parquet table
> ---
>
> Key: SPARK-36941
> URL: https://issues.apache.org/jira/browse/SPARK-36941
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Add a test which checks saving of a dataframe with ANSI intervals to a Hive 
> table using parquet datasource.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36941) Check saving of a dataframe with ANSI intervals to a Hive parquet table

2021-10-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36941:


Assignee: Max Gekk  (was: Apache Spark)

> Check saving of a dataframe with ANSI intervals to a Hive parquet table
> ---
>
> Key: SPARK-36941
> URL: https://issues.apache.org/jira/browse/SPARK-36941
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Add a test which checks saving of a dataframe with ANSI intervals to a Hive 
> table using parquet datasource.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36941) Check saving of a dataframe with ANSI intervals to a Hive parquet table

2021-10-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36941:


Assignee: Apache Spark  (was: Max Gekk)

> Check saving of a dataframe with ANSI intervals to a Hive parquet table
> ---
>
> Key: SPARK-36941
> URL: https://issues.apache.org/jira/browse/SPARK-36941
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Add a test which checks saving of a dataframe with ANSI intervals to a Hive 
> table using parquet datasource.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35520) Spark-SQL test fails on IBM Z for certain config combinations.

2021-10-06 Thread Kun Lu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425205#comment-17425205
 ] 

Kun Lu commented on SPARK-35520:


I've also observed this issue on Spark v3.1.2 on IBM Z. Any comments from the 
community would be greatly appreciated.

> Spark-SQL test fails on IBM Z for certain config combinations.
> --
>
> Key: SPARK-35520
> URL: https://issues.apache.org/jira/browse/SPARK-35520
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.1.1
>Reporter: Simrit Kaur
>Priority: Major
>
> Some queries in the SQL-related test cases in-joins.sql, in-order-by.sql, 
> not-in-group-by.sql and in SubquerySuite.scala are failing with specific 
> configuration combinations on IBM Z (s390x).
> For example, the query 
> sql("select * from l where a = 6 and a not in (select c from r where c is not 
> null)") from SubquerySuite.scala fails for the following config 
> combinations:
> |enableNAAJ|enableAQE|enableCodegen|
> |TRUE|FALSE|FALSE|
> |TRUE|TRUE|FALSE|
> The above combination also causes 2 other queries in in-joins.sql and 
> in-order-by.sql to fail.
> Another query: 
> SELECT Count(*)
>  FROM (SELECT *
>  FROM t2
>  WHERE t2a NOT IN (SELECT t3a
>  FROM t3
>  WHERE t3h != t2h)) t2
>  WHERE t2b NOT IN (SELECT Min(t2b)
>  FROM t2
>  WHERE t2b = t2b
>  GROUP BY t2c);
> from not-in-group-by.sql is failing for following combinations:
> |enableAQE|enableCodegen|
> |FALSE|TRUE|
> |FALSE|FALSE|
>  
> These test cases are not failing for the 3.0.1 release, and I believe the 
> issue might have been introduced with 
> [SPARK-32290|https://issues.apache.org/jira/browse/SPARK-32290]. 
> There is another strange behaviour observed: if the expected output is 1, 3, 
> I am getting 1, 3, 9. If I update the golden file to expect 1, 3, 9, the 
> output will be 1, 3.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36942) Inline type hints for python/pyspark/sql/readwriter.py

2021-10-06 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-36942:


 Summary: Inline type hints for python/pyspark/sql/readwriter.py
 Key: SPARK-36942
 URL: https://issues.apache.org/jira/browse/SPARK-36942
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Xinrong Meng


Inline type hints for python/pyspark/sql/readwriter.py.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36942) Inline type hints for python/pyspark/sql/readwriter.py

2021-10-06 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425213#comment-17425213
 ] 

Xinrong Meng commented on SPARK-36942:
--

I'm working on that.

> Inline type hints for python/pyspark/sql/readwriter.py
> --
>
> Key: SPARK-36942
> URL: https://issues.apache.org/jira/browse/SPARK-36942
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Inline type hints for python/pyspark/sql/readwriter.py.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36114) Support subqueries with correlated non-equality predicates

2021-10-06 Thread Allison Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425275#comment-17425275
 ] 

Allison Wang commented on SPARK-36114:
--

Supporting non-equality predicates can lift the restrictions added in SPARK-35080.

> Support subqueries with correlated non-equality predicates
> --
>
> Key: SPARK-36114
> URL: https://issues.apache.org/jira/browse/SPARK-36114
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Major
>
> The new decorrelation framework is able to support subqueries with 
> non-equality predicates. For example:
> SELECT * FROM t1 WHERE c1 = (SELECT SUM(c1) FROM t2 WHERE t1.c2 > t2.c2)
> The restrictions in CheckAnalysis can be removed.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36943) Improve error message for missing column

2021-10-06 Thread Karen Feng (Jira)
Karen Feng created SPARK-36943:
--

 Summary: Improve error message for missing column
 Key: SPARK-36943
 URL: https://issues.apache.org/jira/browse/SPARK-36943
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 3.3.0
Reporter: Karen Feng


Improve the error message for the case that a user asks for a column that does 
not exist.
Today, the message is "cannot resolve 'foo' given input columns [bar, baz, 
froo]".
We should sort the suggestion list by similarity and improve the grammar to 
remove lingo like "resolve."
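
One possible way to order the candidates, sketched with a hand-rolled 
Levenshtein distance (illustrative only; this is not necessarily how the 
eventual fix will rank suggestions):

{code:scala}
// Sketch: rank existing column names by edit distance to the missing name,
// so the closest candidates are suggested first.
def levenshtein(a: String, b: String): Int = {
  val dp = Array.tabulate(a.length + 1, b.length + 1)((i, j) => if (i == 0) j else if (j == 0) i else 0)
  for (i <- 1 to a.length; j <- 1 to b.length) {
    val cost = if (a(i - 1) == b(j - 1)) 0 else 1
    dp(i)(j) = math.min(math.min(dp(i - 1)(j) + 1, dp(i)(j - 1) + 1), dp(i - 1)(j - 1) + cost)
  }
  dp(a.length)(b.length)
}

val inputColumns = Seq("bar", "baz", "froo")
val missing = "foo"
val suggestions = inputColumns.sortBy(c => levenshtein(c, missing))
// suggestions == List(froo, bar, baz): the closest name comes first
{code}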



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36943) Improve error message for missing column

2021-10-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36943:


Assignee: (was: Apache Spark)

> Improve error message for missing column
> 
>
> Key: SPARK-36943
> URL: https://issues.apache.org/jira/browse/SPARK-36943
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: Karen Feng
>Priority: Major
>
> Improve the error message for the case that a user asks for a column that 
> does not exist.
> Today, the message is "cannot resolve 'foo' given input columns [bar, baz, 
> froo]".
> We should sort the suggestion list by similarity and improve the grammar to 
> remove lingo like "resolve."



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36943) Improve error message for missing column

2021-10-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425289#comment-17425289
 ] 

Apache Spark commented on SPARK-36943:
--

User 'karenfeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/34202

> Improve error message for missing column
> 
>
> Key: SPARK-36943
> URL: https://issues.apache.org/jira/browse/SPARK-36943
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: Karen Feng
>Priority: Major
>
> Improve the error message for the case that a user asks for a column that 
> does not exist.
> Today, the message is "cannot resolve 'foo' given input columns [bar, baz, 
> froo]".
> We should sort the suggestion list by similarity and improve the grammar to 
> remove lingo like "resolve."



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36943) Improve error message for missing column

2021-10-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36943:


Assignee: Apache Spark

> Improve error message for missing column
> 
>
> Key: SPARK-36943
> URL: https://issues.apache.org/jira/browse/SPARK-36943
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: Karen Feng
>Assignee: Apache Spark
>Priority: Major
>
> Improve the error message for the case that a user asks for a column that 
> does not exist.
> Today, the message is "cannot resolve 'foo' given input columns [bar, baz, 
> froo]".
> We should sort the suggestion list by similarity and improve the grammar to 
> remove lingo like "resolve."



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36943) Improve error message for missing column

2021-10-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425290#comment-17425290
 ] 

Apache Spark commented on SPARK-36943:
--

User 'karenfeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/34202

> Improve error message for missing column
> 
>
> Key: SPARK-36943
> URL: https://issues.apache.org/jira/browse/SPARK-36943
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: Karen Feng
>Priority: Major
>
> Improve the error message for the case that a user asks for a column that 
> does not exist.
> Today, the message is "cannot resolve 'foo' given input columns [bar, baz, 
> froo]".
> We should sort the suggestion list by similarity and improve the grammar to 
> remove lingo like "resolve."



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36939) Add orphan migration page into list in PySpark documentation

2021-10-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36939.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 34195
[https://github.com/apache/spark/pull/34195]

> Add orphan migration page into list in PySpark documentation
> 
>
> Key: SPARK-36939
> URL: https://issues.apache.org/jira/browse/SPARK-36939
> Project: Spark
>  Issue Type: Documentation
>  Components: docs, PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.2.0
>
>
> SPARK-36618 added a new migration guide page but that's mistakenly not added 
> to {{spark/python/docs/source/migration_guideindex.rst}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36939) Add orphan migration page into list in PySpark documentation

2021-10-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-36939:


Assignee: Hyukjin Kwon

> Add orphan migration page into list in PySpark documentation
> 
>
> Key: SPARK-36939
> URL: https://issues.apache.org/jira/browse/SPARK-36939
> Project: Spark
>  Issue Type: Documentation
>  Components: docs, PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> SPARK-36618 added a new migration guide page but that's mistakenly not added 
> to {{spark/python/docs/source/migration_guideindex.rst}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36939) Add orphan migration page into list in PySpark documentation

2021-10-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-36939:
-
Fix Version/s: (was: 3.2.0)
   3.2.1

> Add orphan migration page into list in PySpark documentation
> 
>
> Key: SPARK-36939
> URL: https://issues.apache.org/jira/browse/SPARK-36939
> Project: Spark
>  Issue Type: Documentation
>  Components: docs, PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.2.1
>
>
> SPARK-36618 added a new migration guide page but that's mistakenly not added 
> to {{spark/python/docs/source/migration_guideindex.rst}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36944) Remove unused python/pyspark/sql/__init__.pyi

2021-10-06 Thread dch nguyen (Jira)
dch nguyen created SPARK-36944:
--

 Summary: Remove unused python/pyspark/sql/__init__.pyi
 Key: SPARK-36944
 URL: https://issues.apache.org/jira/browse/SPARK-36944
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: dch nguyen






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36858) Spark API to apply same function to multiple columns

2021-10-06 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425297#comment-17425297
 ] 

Hyukjin Kwon commented on SPARK-36858:
--

You could use a var, e.g.:

{code}
var df = ...
colListToClean.foreach { c => df = df.withColumn(c, func(...)) }
{code}

Or actually, what you did with foldLeft makes sense too. What API do you have 
in mind for this?
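
For completeness, a self-contained sketch of both patterns (the cleaning 
function here is a stand-in using regexp_replace; names like colListToClean 
mirror the ticket description):

{code:scala}
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, regexp_replace}

val spark = SparkSession.builder().master("local[*]").appName("clean-columns").getOrCreate()
import spark.implicits._

val dfToClean: DataFrame = Seq(("a-1!", "b 2?"), ("c#3", "d$4")).toDF("c1", "c2")
val colListToClean = Seq("c1", "c2")
// Example cleaning function: strip non-alphanumeric characters from a column.
def cleanFunction(colName: String): Column = regexp_replace(col(colName), "[^a-zA-Z0-9]", "")

// foldLeft variant (from the ticket description).
val cleanedFold = colListToClean.foldLeft(dfToClean)((df, c) => df.withColumn(c, cleanFunction(c)))

// var + foreach variant (from the comment above); both build the same plan.
var cleanedVar = dfToClean
colListToClean.foreach(c => cleanedVar = cleanedVar.withColumn(c, cleanFunction(c)))

cleanedFold.show()
{code}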

> Spark API to apply same function to multiple columns
> 
>
> Key: SPARK-36858
> URL: https://issues.apache.org/jira/browse/SPARK-36858
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.8, 3.1.2
>Reporter: Armand BERGES
>Priority: Minor
>
> Hi
> My team and I regularly need to apply the same function to multiple columns 
> at once.
> For example, we want to remove all non-alphanumeric characters from each 
> column of our dataframes. 
> When we first hit this use case, some people on my team were using this kind 
> of code: 
> {code:java}
> val colListToClean =  ## Generate some list, could be very long.
> val dfToClean: DataFrame = ... ## This is the dataframe we want to clean
> def cleanFunction(colName: String): Column = ... ## Write some function to 
> manipulate column based on its name.
> val dfCleaned = colListToClean.foldLeft(dfToClean)((df, colName) => 
> df.withColumn(colName, cleanFunction(colName))){code}
> This kind of code, when applied to a large set of columns, overloaded our 
> driver (because a new DataFrame is generated for each column to clean).
> Based on this issue, we developed some code to add two functions : 
>  * One to apply the same function to multiple columns
>  * One to rename multiple columns based on a Map. 
>  
> I wonder if you have ever been asked to add this kind of API? If you have, 
> did you run into any issues with the implementation? If you haven't, is this 
> an idea that could be added to Spark? 
> Best regards, 
>  
> LvffY
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36944) Remove unused python/pyspark/sql/__init__.pyi

2021-10-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36944:


Assignee: Apache Spark

> Remove unused python/pyspark/sql/__init__.pyi
> -
>
> Key: SPARK-36944
> URL: https://issues.apache.org/jira/browse/SPARK-36944
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36944) Remove unused python/pyspark/sql/__init__.pyi

2021-10-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425298#comment-17425298
 ] 

Apache Spark commented on SPARK-36944:
--

User 'dchvn' has created a pull request for this issue:
https://github.com/apache/spark/pull/34203

> Remove unused python/pyspark/sql/__init__.pyi
> -
>
> Key: SPARK-36944
> URL: https://issues.apache.org/jira/browse/SPARK-36944
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36944) Remove unused python/pyspark/sql/__init__.pyi

2021-10-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36944:


Assignee: (was: Apache Spark)

> Remove unused python/pyspark/sql/__init__.pyi
> -
>
> Key: SPARK-36944
> URL: https://issues.apache.org/jira/browse/SPARK-36944
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36944) Remove unused python/pyspark/sql/__init__.pyi

2021-10-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425299#comment-17425299
 ] 

Apache Spark commented on SPARK-36944:
--

User 'dchvn' has created a pull request for this issue:
https://github.com/apache/spark/pull/34203

> Remove unused python/pyspark/sql/__init__.pyi
> -
>
> Key: SPARK-36944
> URL: https://issues.apache.org/jira/browse/SPARK-36944
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36945) Inline type hints for python/pyspark/sql/udf.py

2021-10-06 Thread dch nguyen (Jira)
dch nguyen created SPARK-36945:
--

 Summary: Inline type hints for python/pyspark/sql/udf.py
 Key: SPARK-36945
 URL: https://issues.apache.org/jira/browse/SPARK-36945
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: dch nguyen






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36945) Inline type hints for python/pyspark/sql/udf.py

2021-10-06 Thread dch nguyen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425301#comment-17425301
 ] 

dch nguyen commented on SPARK-36945:


working on this

> Inline type hints for python/pyspark/sql/udf.py
> ---
>
> Key: SPARK-36945
> URL: https://issues.apache.org/jira/browse/SPARK-36945
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36884) Inline type hints for python/pyspark/sql/session.py

2021-10-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36884.
--
Fix Version/s: 3.3.0
 Assignee: Takuya Ueshin
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/34136

> Inline type hints for python/pyspark/sql/session.py
> ---
>
> Key: SPARK-36884
> URL: https://issues.apache.org/jira/browse/SPARK-36884
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.3.0
>
>
> Inline type hints for python/pyspark/sql/session.py, migrated from the stub 
> file python/pyspark/sql/session.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36918) unionByName shouldn't consider types when comparing structs

2021-10-06 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-36918.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34166
[https://github.com/apache/spark/pull/34166]

> unionByName shouldn't consider types when comparing structs
> ---
>
> Key: SPARK-36918
> URL: https://issues.apache.org/jira/browse/SPARK-36918
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Adam Binford
>Assignee: Adam Binford
>Priority: Major
> Fix For: 3.3.0
>
>
> Improvement/follow-on of https://issues.apache.org/jira/browse/SPARK-35290.
> We use StructType.sameType to see if we need to recreate the struct, but this 
> can lead to false positives if the structure is the same but the types are 
> different, and will lead to simply creating a new struct that's exactly the 
> same as the original. This can cause significant overhead when unioning 
> multiple deeply nested nullable structs, as each time it's recreated it gets 
> wrapped in a If(IsNull()). Only comparing the field names can lead to more 
> efficient plans.
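
A small illustration of the difference (sketch only, not the change itself): two 
schemas whose field names line up but whose leaf types differ.

{code:scala}
import org.apache.spark.sql.types._

val left = StructType(Seq(
  StructField("s", StructType(Seq(StructField("a", IntegerType))))))
val right = StructType(Seq(
  StructField("s", StructType(Seq(StructField("a", LongType))))))

// sameType compares types (ignoring nullability), so it is false here, and the
// union path would rebuild the struct even though no field reordering is needed.
println(left.sameType(right))                            // false
// Comparing just the field names (here at the top level) shows the layout matches.
println(left.fieldNames.sameElements(right.fieldNames))  // true
{code}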



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36918) unionByName shouldn't consider types when comparing structs

2021-10-06 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh reassigned SPARK-36918:
---

Assignee: Adam Binford

> unionByName shouldn't consider types when comparing structs
> ---
>
> Key: SPARK-36918
> URL: https://issues.apache.org/jira/browse/SPARK-36918
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Adam Binford
>Assignee: Adam Binford
>Priority: Major
>
> Improvement/follow-on of https://issues.apache.org/jira/browse/SPARK-35290.
> We use StructType.sameType to see if we need to recreate the struct, but this 
> can lead to false positives if the structure is the same but the types are 
> different, and will lead to simply creating a new struct that's exactly the 
> same as the original. This can cause significant overhead when unioning 
> multiple deeply nested nullable structs, as each time it's recreated it gets 
> wrapped in a If(IsNull()). Only comparing the field names can lead to more 
> efficient plans.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36742) Fix ps.to_datetime with plurals of keys like years, months, days

2021-10-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36742.
--
Fix Version/s: 3.3.0
 Assignee: dch nguyen
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/34182

> Fix ps.to_datetime with plurals of keys like years, months, days
> 
>
> Key: SPARK-36742
> URL: https://issues.apache.org/jira/browse/SPARK-36742
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dgd_contributor
>Assignee: dch nguyen
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36874) Ambiguous Self-Join detected only on right dataframe

2021-10-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425329#comment-17425329
 ] 

Apache Spark commented on SPARK-36874:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/34205

> Ambiguous Self-Join detected only on right dataframe
> 
>
> Key: SPARK-36874
> URL: https://issues.apache.org/jira/browse/SPARK-36874
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Vincent Doba
>Assignee: Kousuke Saruta
>Priority: Major
>  Labels: correctness
> Fix For: 3.2.0
>
>
> When joining two dataframes, if they share the same lineage and one dataframe 
> is a transformation of the other, Ambiguous Self Join detection only works 
> when the transformed dataframe is the right dataframe. 
> For instance {{df1}} and {{df2}} where {{df2}} is a filtered {{df1}}, 
> Ambiguous Self Join detection only works when {{df2}} is the right dataframe:
> - {{df1.join(df2, ...)}} correctly fails with Ambiguous Self Join error
> - {{df2.join(df1, ...)}} returns a valid dataframe
> h1. Minimum Reproducible example
> h2. Code
> {code:scala}
> import sparkSession.implicits._
> val df1 = Seq((1, 2, "A1"),(2, 1, "A2")).toDF("key1", "key2", "value")
> val df2 = df1.filter($"value" === "A2")
> df2.join(df1, df1("key1") === df2("key2")).show()
> {code}
> h2. Expected Result
> Throw the following exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Column 
> key2#11 are ambiguous. It's probably because you joined several Datasets 
> together, and some of these Datasets are the same. This column points to one 
> of the Datasets but Spark is unable to figure out which one. Please alias the 
> Datasets with different names via `Dataset.as` before joining them, and 
> specify the column using qualified name, e.g. `df.as("a").join(df.as("b"), 
> $"a.id" > $"b.id")`. You can also set 
> spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.
>   at 
> org.apache.spark.sql.execution.analysis.DetectAmbiguousSelfJoin$.apply(DetectAmbiguousSelfJoin.scala:157)
>   at 
> org.apache.spark.sql.execution.analysis.DetectAmbiguousSelfJoin$.apply(DetectAmbiguousSelfJoin.scala:43)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:216)
>   at 
> scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
>   at 
> scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
>   at scala.collection.immutable.List.foldLeft(List.scala:91)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:213)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:205)
>   at scala.collection.immutable.List.foreach(List.scala:431)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:205)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:196)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:190)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:155)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:183)
>   at 
> org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:183)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:174)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:228)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:173)
>   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:73)
>   at 
> org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
>   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:143)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:143)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:73)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:71)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed

[jira] [Commented] (SPARK-36874) Ambiguous Self-Join detected only on right dataframe

2021-10-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425330#comment-17425330
 ] 

Apache Spark commented on SPARK-36874:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/34205

> Ambiguous Self-Join detected only on right dataframe
> 
>
> Key: SPARK-36874
> URL: https://issues.apache.org/jira/browse/SPARK-36874
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Vincent Doba
>Assignee: Kousuke Saruta
>Priority: Major
>  Labels: correctness
> Fix For: 3.2.0
>
>
> When joining two dataframes, if they share the same lineage and one dataframe 
> is a transformation of the other, Ambiguous Self Join detection only works 
> when the transformed dataframe is the right dataframe. 
> For instance {{df1}} and {{df2}} where {{df2}} is a filtered {{df1}}, 
> Ambiguous Self Join detection only works when {{df2}} is the right dataframe:
> - {{df1.join(df2, ...)}} correctly fails with Ambiguous Self Join error
> - {{df2.join(df1, ...)}} returns a valid dataframe
> h1. Minimum Reproducible example
> h2. Code
> {code:scala}
> import sparkSession.implicits._
> val df1 = Seq((1, 2, "A1"),(2, 1, "A2")).toDF("key1", "key2", "value")
> val df2 = df1.filter($"value" === "A2")
> df2.join(df1, df1("key1") === df2("key2")).show()
> {code}
> h2. Expected Result
> Throw the following exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Column 
> key2#11 are ambiguous. It's probably because you joined several Datasets 
> together, and some of these Datasets are the same. This column points to one 
> of the Datasets but Spark is unable to figure out which one. Please alias the 
> Datasets with different names via `Dataset.as` before joining them, and 
> specify the column using qualified name, e.g. `df.as("a").join(df.as("b"), 
> $"a.id" > $"b.id")`. You can also set 
> spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.
>   at 
> org.apache.spark.sql.execution.analysis.DetectAmbiguousSelfJoin$.apply(DetectAmbiguousSelfJoin.scala:157)
>   at 
> org.apache.spark.sql.execution.analysis.DetectAmbiguousSelfJoin$.apply(DetectAmbiguousSelfJoin.scala:43)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:216)
>   at 
> scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
>   at 
> scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
>   at scala.collection.immutable.List.foldLeft(List.scala:91)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:213)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:205)
>   at scala.collection.immutable.List.foreach(List.scala:431)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:205)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:196)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:190)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:155)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:183)
>   at 
> org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:183)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:174)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:228)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:173)
>   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:73)
>   at 
> org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
>   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:143)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:143)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:73)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:71)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed

[jira] [Commented] (SPARK-34634) Self-join with script transformation failed to resolve attribute correctly

2021-10-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425332#comment-17425332
 ] 

Apache Spark commented on SPARK-34634:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/34205

> Self-join with script transformation failed to resolve attribute correctly
> --
>
> Key: SPARK-34634
> URL: https://issues.apache.org/jira/browse/SPARK-34634
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, 
> 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: EdisonWang
>Assignee: EdisonWang
>Priority: Minor
> Fix For: 3.2.0
>
>
> To reproduce,
> {code:sql}
> create temporary view t as select * from values 0, 1, 2 as t(a);
> WITH temp AS (
> SELECT TRANSFORM(a) USING 'cat' AS (b string) FROM t
> )
> SELECT t1.b FROM temp t1 JOIN temp t2 ON t1.b = t2.b
> {code}
>  
> Spark will throw an AnalysisException.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34634) Self-join with script transformation failed to resolve attribute correctly

2021-10-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425331#comment-17425331
 ] 

Apache Spark commented on SPARK-34634:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/34205

> Self-join with script transformation failed to resolve attribute correctly
> --
>
> Key: SPARK-34634
> URL: https://issues.apache.org/jira/browse/SPARK-34634
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, 
> 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: EdisonWang
>Assignee: EdisonWang
>Priority: Minor
> Fix For: 3.2.0
>
>
> To reproduce,
> {code:sql}
> create temporary view t as select * from values 0, 1, 2 as t(a);
> WITH temp AS (
> SELECT TRANSFORM(a) USING 'cat' AS (b string) FROM t
> )
> SELECT t1.b FROM temp t1 JOIN temp t2 ON t1.b = t2.b
> {code}
>  
> Spark will throw an AnalysisException.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33913) Upgrade Kafka to 2.8.0

2021-10-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33913:
--
Fix Version/s: 3.2.0

> Upgrade Kafka to 2.8.0
> --
>
> Key: SPARK-33913
> URL: https://issues.apache.org/jira/browse/SPARK-33913
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, DStreams
>Affects Versions: 3.2.0
>Reporter: dengziming
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.2.0
>
>
> This issue aims to upgrade the Kafka client to 2.8.0.
> Note that Kafka 2.8.0 uses ZSTD JNI 1.4.9-1 like Apache Spark 3.2.0.
> *RELEASE NOTE*
> - https://downloads.apache.org/kafka/2.8.0/RELEASE_NOTES.html
> - https://downloads.apache.org/kafka/2.7.0/RELEASE_NOTES.html
> This will bring the latest client-side improvements and bug fixes, such as the
> following examples:
> - KAFKA-10631 ProducerFencedException is not Handled on Offest Commit
> - KAFKA-10134 High CPU issue during rebalance in Kafka consumer after 
> upgrading to 2.5
> - KAFKA-12193 Re-resolve IPs when a client is disconnected
> - KAFKA-10090 Misleading warnings: The configuration was supplied but isn't a 
> known config
> - KAFKA-9263 The new hw is added to incorrect log when  
> ReplicaAlterLogDirsThread is replacing log 
> - KAFKA-10607 Ensure the error counts contains the NONE
> - KAFKA-10458 Need a way to update quota for TokenBucket registered with 
> Sensor
> - KAFKA-10503 MockProducer doesn't throw ClassCastException when no partition 
> for topic



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36930) Support ps.MultiIndex.dtypes

2021-10-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425335#comment-17425335
 ] 

Apache Spark commented on SPARK-36930:
--

User 'dchvn' has created a pull request for this issue:
https://github.com/apache/spark/pull/34206

> Support ps.MultiIndex.dtypes
> 
>
> Key: SPARK-36930
> URL: https://issues.apache.org/jira/browse/SPARK-36930
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Assignee: dch nguyen
>Priority: Major
> Fix For: 3.3.0
>
>
> When MultiIndex.dtypes is supported, we can use:
> {code:java}
> >>> idx = pd.MultiIndex.from_arrays([[0, 1, 2, 3, 4, 5, 6, 7, 8], [1, 2, 3, 
> >>> 4, 5, 6, 7, 8, 9]], names=("zero", "one"))
> >>> pdf = pd.DataFrame(
> ... {"a": [1, 2, 3, 4, 5, 6, 7, 8, 9], "b": [4, 5, 6, 3, 2, 1, 0, 0, 0]},
> ... index=idx,
> ... )
> >>> psdf = ps.from_pandas(pdf)
> >>> ps.DataFrame[psdf.index.dtypes, psdf.dtypes]
> typing.Tuple[pyspark.pandas.typedef.typehints.IndexNameType, 
> pyspark.pandas.typedef.typehints.IndexNameType, 
> pyspark.pandas.typedef.typehints.NameType, 
> pyspark.pandas.typedef.typehints.NameType]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36711) Support multi-index in new syntax

2021-10-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425338#comment-17425338
 ] 

Apache Spark commented on SPARK-36711:
--

User 'dchvn' has created a pull request for this issue:
https://github.com/apache/spark/pull/34207

> Support multi-index in new syntax
> -
>
> Key: SPARK-36711
> URL: https://issues.apache.org/jira/browse/SPARK-36711
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: dch nguyen
>Priority: Major
> Fix For: 3.3.0
>
>
> Support multi-index in the new syntax SPARK-36709



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36304) Refactor fifteenth set of 20 query execution errors to use error classes

2021-10-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425341#comment-17425341
 ] 

Apache Spark commented on SPARK-36304:
--

User 'dchvn' has created a pull request for this issue:
https://github.com/apache/spark/pull/34208

> Refactor fifteenth set of 20 query execution errors to use error classes
> 
>
> Key: SPARK-36304
> URL: https://issues.apache.org/jira/browse/SPARK-36304
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Karen Feng
>Priority: Major
>
> Refactor some exceptions in 
> [QueryExecutionErrors|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala]
>  to use error classes.
> There are currently ~350 exceptions in this file, so this PR only focuses on 
> the fifteenth set of 20.
> {code:java}
> unsupportedOperationExceptionError
> nullLiteralsCannotBeCastedError
> notUserDefinedTypeError
> cannotLoadUserDefinedTypeError
> timeZoneIdNotSpecifiedForTimestampTypeError
> notPublicClassError
> primitiveTypesNotSupportedError
> fieldIndexOnRowWithoutSchemaError
> valueIsNullError
> onlySupportDataSourcesProvidingFileFormatError
> failToSetOriginalPermissionBackError
> failToSetOriginalACLBackError
> multiFailuresInStageMaterializationError
> unrecognizedCompressionSchemaTypeIDError
> getParentLoggerNotImplementedError
> cannotCreateParquetConverterForTypeError
> cannotCreateParquetConverterForDecimalTypeError
> cannotCreateParquetConverterForDataTypeError
> cannotAddMultiPartitionsOnNonatomicPartitionTableError
> userSpecifiedSchemaUnsupportedByDataSourceError
> {code}
> For more detail, see the parent ticket SPARK-36094.
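> As a rough illustration (hypothetical names only, not Spark's actual internal
> API), the refactor replaces hard-coded message strings at each call site with
> a lookup keyed by a stable error class plus message parameters:
> {code:scala}
> // Self-contained sketch of the error-class pattern. All names here are
> // illustrative; in Spark the real templates are defined centrally.
> object ErrorClasses {
>   private val templates = Map(
>     "UNSUPPORTED_OPERATION" -> "The operation %s is not supported.",
>     "NULL_LITERAL_CAST"     -> "Null literals cannot be cast to %s."
>   )
>   def message(errorClass: String, params: String*): String =
>     templates(errorClass).format(params: _*)
> }
>
> class SparkStyleException(val errorClass: String, params: String*)
>   extends RuntimeException(ErrorClasses.message(errorClass, params: _*))
>
> object QueryExecutionErrorsSketch {
>   // Each refactored method becomes a single named entry point for one error.
>   def unsupportedOperationExceptionError(op: String): Throwable =
>     new SparkStyleException("UNSUPPORTED_OPERATION", op)
> }
> {code}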



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36304) Refactor fifteenth set of 20 query execution errors to use error classes

2021-10-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425342#comment-17425342
 ] 

Apache Spark commented on SPARK-36304:
--

User 'dchvn' has created a pull request for this issue:
https://github.com/apache/spark/pull/34208

> Refactor fifteenth set of 20 query execution errors to use error classes
> 
>
> Key: SPARK-36304
> URL: https://issues.apache.org/jira/browse/SPARK-36304
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Karen Feng
>Priority: Major
>
> Refactor some exceptions in 
> [QueryExecutionErrors|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala]
>  to use error classes.
> There are currently ~350 exceptions in this file, so this PR only focuses on 
> the fifteenth set of 20.
> {code:java}
> unsupportedOperationExceptionError
> nullLiteralsCannotBeCastedError
> notUserDefinedTypeError
> cannotLoadUserDefinedTypeError
> timeZoneIdNotSpecifiedForTimestampTypeError
> notPublicClassError
> primitiveTypesNotSupportedError
> fieldIndexOnRowWithoutSchemaError
> valueIsNullError
> onlySupportDataSourcesProvidingFileFormatError
> failToSetOriginalPermissionBackError
> failToSetOriginalACLBackError
> multiFailuresInStageMaterializationError
> unrecognizedCompressionSchemaTypeIDError
> getParentLoggerNotImplementedError
> cannotCreateParquetConverterForTypeError
> cannotCreateParquetConverterForDecimalTypeError
> cannotCreateParquetConverterForDataTypeError
> cannotAddMultiPartitionsOnNonatomicPartitionTableError
> userSpecifiedSchemaUnsupportedByDataSourceError
> {code}
> For more detail, see the parent ticket SPARK-36094.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36713) Document new syntax for specifying index type

2021-10-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36713:


Assignee: (was: Apache Spark)

> Document new syntax for specifying index type
> -
>
> Key: SPARK-36713
> URL: https://issues.apache.org/jira/browse/SPARK-36713
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36713) Document new syntax for specifying index type

2021-10-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425346#comment-17425346
 ] 

Apache Spark commented on SPARK-36713:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/34210

> Document new syntax for specifying index type
> -
>
> Key: SPARK-36713
> URL: https://issues.apache.org/jira/browse/SPARK-36713
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36713) Document new syntax for specifying index type

2021-10-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36713:


Assignee: Apache Spark

> Document new syntax for specifying index type
> -
>
> Key: SPARK-36713
> URL: https://issues.apache.org/jira/browse/SPARK-36713
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36713) Document new syntax for specifying index type

2021-10-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425347#comment-17425347
 ] 

Apache Spark commented on SPARK-36713:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/34210

> Document new syntax for specifying index type
> -
>
> Key: SPARK-36713
> URL: https://issues.apache.org/jira/browse/SPARK-36713
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36941) Check saving of a dataframe with ANSI intervals to a Hive parquet table

2021-10-06 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-36941.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34201
[https://github.com/apache/spark/pull/34201]

> Check saving of a dataframe with ANSI intervals to a Hive parquet table
> ---
>
> Key: SPARK-36941
> URL: https://issues.apache.org/jira/browse/SPARK-36941
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.3.0
>
>
> Add a test which checks saving of a dataframe with ANSI intervals to a Hive 
> table using parquet datasource.
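> A minimal sketch of the round trip such a test could exercise (illustrative
> only; assumes a Hive-enabled SparkSession named {{spark}} is in scope):
> {code:scala}
> import java.time.{Duration, Period}
> import spark.implicits._
>
> // ANSI intervals map to DayTimeIntervalType / YearMonthIntervalType.
> val df = Seq((Duration.ofDays(1), Period.ofMonths(2))).toDF("dt", "ym")
>
> // Write through the parquet datasource into a Hive table, then read it back.
> df.write.format("parquet").saveAsTable("ansi_interval_tbl")
> assert(spark.table("ansi_interval_tbl").collect().toSeq == df.collect().toSeq)
> {code}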



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34634) Self-join with script transformation failed to resolve attribute correctly

2021-10-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34634:
-
Fix Version/s: 3.1.3

> Self-join with script transformation failed to resolve attribute correctly
> --
>
> Key: SPARK-34634
> URL: https://issues.apache.org/jira/browse/SPARK-34634
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, 
> 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: EdisonWang
>Assignee: EdisonWang
>Priority: Minor
> Fix For: 3.2.0, 3.1.3
>
>
> To reproduce,
> {code:sql}
> create temporary view t as select * from values 0, 1, 2 as t(a);
> WITH temp AS (
> SELECT TRANSFORM(a) USING 'cat' AS (b string) FROM t
> )
> SELECT t1.b FROM temp t1 JOIN temp t2 ON t1.b = t2.b
> {code}
>  
> Spark will throw an AnalysisException.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36874) Ambiguous Self-Join detected only on right dataframe

2021-10-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-36874:
-
Fix Version/s: 3.1.3

> Ambiguous Self-Join detected only on right dataframe
> 
>
> Key: SPARK-36874
> URL: https://issues.apache.org/jira/browse/SPARK-36874
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Vincent Doba
>Assignee: Kousuke Saruta
>Priority: Major
>  Labels: correctness
> Fix For: 3.2.0, 3.1.3
>
>
> When joining two dataframes, if they share the same lineage and one dataframe 
> is a transformation of the other, Ambiguous Self Join detection only works 
> when the transformed dataframe is the right dataframe.
> For instance, with {{df1}} and {{df2}} where {{df2}} is a filtered {{df1}},
> Ambiguous Self Join detection only works when {{df2}} is the right dataframe:
> - {{df1.join(df2, ...)}} correctly fails with an Ambiguous Self Join error
> - {{df2.join(df1, ...)}} returns a dataframe without raising the error
> h1. Minimum Reproducible example
> h2. Code
> {code:scala}
> import sparkSession.implicits._
> val df1 = Seq((1, 2, "A1"),(2, 1, "A2")).toDF("key1", "key2", "value")
> val df2 = df1.filter($"value" === "A2")
> df2.join(df1, df1("key1") === df2("key2")).show()
> {code}
> h2. Expected Result
> Throw the following exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Column 
> key2#11 are ambiguous. It's probably because you joined several Datasets 
> together, and some of these Datasets are the same. This column points to one 
> of the Datasets but Spark is unable to figure out which one. Please alias the 
> Datasets with different names via `Dataset.as` before joining them, and 
> specify the column using qualified name, e.g. `df.as("a").join(df.as("b"), 
> $"a.id" > $"b.id")`. You can also set 
> spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.
>   at 
> org.apache.spark.sql.execution.analysis.DetectAmbiguousSelfJoin$.apply(DetectAmbiguousSelfJoin.scala:157)
>   at 
> org.apache.spark.sql.execution.analysis.DetectAmbiguousSelfJoin$.apply(DetectAmbiguousSelfJoin.scala:43)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:216)
>   at 
> scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
>   at 
> scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
>   at scala.collection.immutable.List.foldLeft(List.scala:91)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:213)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:205)
>   at scala.collection.immutable.List.foreach(List.scala:431)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:205)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:196)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:190)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:155)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:183)
>   at 
> org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:183)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:174)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:228)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:173)
>   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:73)
>   at 
> org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
>   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:143)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:143)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:73)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:71)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:63)
>   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:90)
>   at org.apache.sp