[jira] [Assigned] (SPARK-39167) Throw an exception w/ an error class for multiple rows from a subquery used as an expression

2022-05-20 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-39167:


Assignee: panbingkun

> Throw an exception w/ an error class for multiple rows from a subquery used 
> as an expression
> 
>
> Key: SPARK-39167
> URL: https://issues.apache.org/jira/browse/SPARK-39167
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: panbingkun
>Priority: Major
>
> Users can trigger an illegal state exception with the following SQL statement:
> {code:sql}
> > select (select a from (select 1 as a union all select 2 as a) t) as b
> {code}
> {code:java}
> Caused by: java.lang.IllegalStateException: more than one row returned by a 
> subquery used as an expression:
> Subquery subquery#242, [id=#100]
> +- AdaptiveSparkPlan isFinalPlan=true
>+- == Final Plan ==
>   Union
>   :- *(1) Project [1 AS a#240]
>   :  +- *(1) Scan OneRowRelation[]
>   +- *(2) Project [2 AS a#241]
>  +- *(2) Scan OneRowRelation[]
>+- == Initial Plan ==
>   Union
>   :- Project [1 AS a#240]
>   :  +- Scan OneRowRelation[]
>   +- Project [2 AS a#241]
>  +- Scan OneRowRelation[]
>   at 
> org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:83)
> {code}
> but such exceptions are not supposed to be visible to users. We need to 
> introduce an error class (or reuse an existing one) and replace the 
> IllegalStateException.
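A minimal sketch of the kind of change being requested (illustrative only: the error-class name, the message-parameter key, and the exception constructor below are assumptions, not necessarily what the eventual fix does):
{code:scala}
// Hypothetical sketch: raise a SparkThrowable with an error class instead of
// the raw IllegalStateException in ScalarSubquery.updateResult.
// The error-class name, parameter key, and constructor shape are assumptions.
import org.apache.spark.SparkException

def assertSingleRow(rows: Seq[Any], planString: String): Unit = {
  if (rows.length > 1) {
    // The exception now carries a classified, user-facing error class.
    throw new SparkException(
      errorClass = "MULTI_VALUE_SUBQUERY_ERROR",
      messageParameters = Map("plan" -> planString),
      cause = null)
  }
}
{code}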



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39167) Throw an exception w/ an error class for multiple rows from a subquery used as an expression

2022-05-20 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-39167.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36580
[https://github.com/apache/spark/pull/36580]

> Throw an exception w/ an error class for multiple rows from a subquery used 
> as an expression
> 
>
> Key: SPARK-39167
> URL: https://issues.apache.org/jira/browse/SPARK-39167
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: panbingkun
>Priority: Major
> Fix For: 3.4.0
>
>
> Users can trigger an illegal state exception with the following SQL statement:
> {code:sql}
> > select (select a from (select 1 as a union all select 2 as a) t) as b
> {code}
> {code:java}
> Caused by: java.lang.IllegalStateException: more than one row returned by a 
> subquery used as an expression:
> Subquery subquery#242, [id=#100]
> +- AdaptiveSparkPlan isFinalPlan=true
>+- == Final Plan ==
>   Union
>   :- *(1) Project [1 AS a#240]
>   :  +- *(1) Scan OneRowRelation[]
>   +- *(2) Project [2 AS a#241]
>  +- *(2) Scan OneRowRelation[]
>+- == Initial Plan ==
>   Union
>   :- Project [1 AS a#240]
>   :  +- Scan OneRowRelation[]
>   +- Project [2 AS a#241]
>  +- Scan OneRowRelation[]
>   at 
> org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:83)
> {code}
> but such exceptions are not supposed to be visible to users. We need to 
> introduce an error class (or reuse an existing one) and replace the 
> IllegalStateException.






[jira] [Commented] (SPARK-39245) Support Avro file scans with DEFAULT values

2022-05-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17540335#comment-17540335
 ] 

Apache Spark commented on SPARK-39245:
--

User 'dtenedor' has created a pull request for this issue:
https://github.com/apache/spark/pull/36623

> Support Avro file scans with DEFAULT values
> ---
>
> Key: SPARK-39245
> URL: https://issues.apache.org/jira/browse/SPARK-39245
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Priority: Major
>







[jira] [Assigned] (SPARK-39245) Support Avro file scans with DEFAULT values

2022-05-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39245:


Assignee: Apache Spark

> Support Avro file scans with DEFAULT values
> ---
>
> Key: SPARK-39245
> URL: https://issues.apache.org/jira/browse/SPARK-39245
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Assigned] (SPARK-39245) Support Avro file scans with DEFAULT values

2022-05-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39245:


Assignee: (was: Apache Spark)

> Support Avro file scans with DEFAULT values
> ---
>
> Key: SPARK-39245
> URL: https://issues.apache.org/jira/browse/SPARK-39245
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Priority: Major
>







[jira] [Created] (SPARK-39245) Support Avro file scans with DEFAULT values

2022-05-20 Thread Daniel (Jira)
Daniel created SPARK-39245:
--

 Summary: Support Avro file scans with DEFAULT values
 Key: SPARK-39245
 URL: https://issues.apache.org/jira/browse/SPARK-39245
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Daniel









[jira] [Assigned] (SPARK-39244) Use `--no-echo` instead of `--slave` in R 4.0

2022-05-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39244:


Assignee: (was: Apache Spark)

> Use `--no-echo` instead of `--slave` in R 4.0
> -
>
> Key: SPARK-39244
> URL: https://issues.apache.org/jira/browse/SPARK-39244
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: William Hyun
>Priority: Major
>







[jira] [Assigned] (SPARK-39244) Use `--no-echo` instead of `--slave` in R 4.0

2022-05-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39244:


Assignee: Apache Spark

> Use `--no-echo` instead of `--slave` in R 4.0
> -
>
> Key: SPARK-39244
> URL: https://issues.apache.org/jira/browse/SPARK-39244
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: William Hyun
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-39244) Use `--no-echo` instead of `--slave` in R 4.0

2022-05-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17540325#comment-17540325
 ] 

Apache Spark commented on SPARK-39244:
--

User 'williamhyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36622

> Use `--no-echo` instead of `--slave` in R 4.0
> -
>
> Key: SPARK-39244
> URL: https://issues.apache.org/jira/browse/SPARK-39244
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: William Hyun
>Priority: Major
>







[jira] [Commented] (SPARK-39244) Use `--no-echo` instead of `--slave` in R 4.0

2022-05-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17540326#comment-17540326
 ] 

Apache Spark commented on SPARK-39244:
--

User 'williamhyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36622

> Use `--no-echo` instead of `--slave` in R 4.0
> -
>
> Key: SPARK-39244
> URL: https://issues.apache.org/jira/browse/SPARK-39244
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: William Hyun
>Priority: Major
>







[jira] [Created] (SPARK-39244) Use `--no-echo` instead of `--slave` in R 4.0

2022-05-20 Thread William Hyun (Jira)
William Hyun created SPARK-39244:


 Summary: Use `--no-echo` instead of `--slave` in R 4.0
 Key: SPARK-39244
 URL: https://issues.apache.org/jira/browse/SPARK-39244
 Project: Spark
  Issue Type: Task
  Components: Project Infra
Affects Versions: 3.4.0
Reporter: William Hyun









[jira] [Assigned] (SPARK-39243) Describe the rules of quoting elements in error messages

2022-05-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39243:


Assignee: (was: Apache Spark)

> Describe the rules of quoting elements in error messages
> 
>
> Key: SPARK-39243
> URL: https://issues.apache.org/jira/browse/SPARK-39243
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> Add a comment to QueryErrorsBase and describe rules of quoting 
> elements/parameters in error messages.






[jira] [Assigned] (SPARK-39243) Describe the rules of quoting elements in error messages

2022-05-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39243:


Assignee: Apache Spark

> Describe the rules of quoting elements in error messages
> 
>
> Key: SPARK-39243
> URL: https://issues.apache.org/jira/browse/SPARK-39243
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Add a comment to QueryErrorsBase and describe rules of quoting 
> elements/parameters in error messages.






[jira] [Commented] (SPARK-39243) Describe the rules of quoting elements in error messages

2022-05-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17540296#comment-17540296
 ] 

Apache Spark commented on SPARK-39243:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/36621

> Describe the rules of quoting elements in error messages
> 
>
> Key: SPARK-39243
> URL: https://issues.apache.org/jira/browse/SPARK-39243
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> Add a comment to QueryErrorsBase and describe rules of quoting 
> elements/parameters in error messages.






[jira] [Created] (SPARK-39243) Describe the rules of quoting elements in error messages

2022-05-20 Thread Max Gekk (Jira)
Max Gekk created SPARK-39243:


 Summary: Describe the rules of quoting elements in error messages
 Key: SPARK-39243
 URL: https://issues.apache.org/jira/browse/SPARK-39243
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Max Gekk


Add a comment to QueryErrorsBase and describe rules of quoting 
elements/parameters in error messages.
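For illustration only, such a comment might look roughly like the sketch below; the wording, and the helper names it mentions, are assumptions rather than the text that was eventually committed:
{code:scala}
// Illustrative sketch of the requested comment in QueryErrorsBase; the exact
// wording and the set of helpers mentioned are assumptions.

/**
 * Rules of quoting elements in error messages (sketch):
 *  - quote SQL identifiers (columns, tables, functions) with backticks,
 *    e.g. via a helper such as toSQLId;
 *  - upper-case SQL statements/keywords without quoting, e.g. toSQLStmt;
 *  - quote data types, literal values and config names with their own helpers
 *    (e.g. toSQLType, toSQLValue, toSQLConf) so all messages stay consistent.
 */
trait QueryErrorsBase
{code}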






[jira] [Assigned] (SPARK-39213) Create ANY_VALUE aggregate function

2022-05-20 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-39213:


Assignee: Vitalii Li

> Create ANY_VALUE aggregate function
> ---
>
> Key: SPARK-39213
> URL: https://issues.apache.org/jira/browse/SPARK-39213
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vitalii Li
>Assignee: Vitalii Li
>Priority: Major
>
> This is a feature request to add an {{ANY_VALUE}} aggregate function, which 
> would consume the input values and return an arbitrary element among them.
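For illustration, the kind of query this function is meant to support (a sketch only: the table and column names are made up, and the syntax follows the feature request rather than any final implementation):
{code:scala}
// Illustrative sketch: return one arbitrary representative value per group.
// The `employees` table and its columns are hypothetical.
spark.sql("""
  SELECT dept, any_value(employee_name) AS sample_employee
  FROM employees
  GROUP BY dept
""").show()
{code}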






[jira] [Resolved] (SPARK-39213) Create ANY_VALUE aggregate function

2022-05-20 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-39213.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36584
[https://github.com/apache/spark/pull/36584]

> Create ANY_VALUE aggregate function
> ---
>
> Key: SPARK-39213
> URL: https://issues.apache.org/jira/browse/SPARK-39213
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vitalii Li
>Assignee: Vitalii Li
>Priority: Major
> Fix For: 3.4.0
>
>
> This is a feature request to add an {{ANY_VALUE}} aggregate function, which 
> would consume the input values and return an arbitrary element among them.






[jira] [Assigned] (SPARK-39242) AwaitOffset does not wait correctly for at least the expected offset and RateStreamProvider test is flaky

2022-05-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39242:


Assignee: Apache Spark

> AwaitOffset does not wait correctly for at least the expected offset and 
> RateStreamProvider test is flaky
> 
>
> Key: SPARK-39242
> URL: https://issues.apache.org/jira/browse/SPARK-39242
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.2.1
>Reporter: Anish Shrigondekar
>Assignee: Apache Spark
>Priority: Major
>
> AwaitOffset does not wait correctly for at least the expected offset and 
> RateStreamProvider test is flaky






[jira] [Commented] (SPARK-39242) AwaitOffset does not wait correctly for at least the expected offset and RateStreamProvider test is flaky

2022-05-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17540250#comment-17540250
 ] 

Apache Spark commented on SPARK-39242:
--

User 'anishshri-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/36620

> AwaitOffset does not wait correctly for at least the expected offset and 
> RateStreamProvider test is flaky
> 
>
> Key: SPARK-39242
> URL: https://issues.apache.org/jira/browse/SPARK-39242
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.2.1
>Reporter: Anish Shrigondekar
>Priority: Major
>
> AwaitOffset does not wait correctly for at least the expected offset and 
> RateStreamProvider test is flaky






[jira] [Commented] (SPARK-39242) AwaitOffset does not wait correctly for at least the expected offset and RateStreamProvider test is flaky

2022-05-20 Thread Anish Shrigondekar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17540252#comment-17540252
 ] 

Anish Shrigondekar commented on SPARK-39242:


PR for the change submitted here: [https://github.com/apache/spark/pull/36620]

 

CC - [~kabhwan] - please take a look. Thanks

> AwaitOffset does not wait correctly for at least the expected offset and 
> RateStreamProvider test is flaky
> 
>
> Key: SPARK-39242
> URL: https://issues.apache.org/jira/browse/SPARK-39242
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.2.1
>Reporter: Anish Shrigondekar
>Priority: Major
>
> AwaitOffset does not wait correctly for at least the expected offset and 
> RateStreamProvider test is flaky






[jira] [Assigned] (SPARK-39242) AwaitOffset does not wait correctly for at least the expected offset and RateStreamProvider test is flaky

2022-05-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39242:


Assignee: (was: Apache Spark)

> AwaitOffset does not wait correctly for at least the expected offset and 
> RateStreamProvider test is flaky
> 
>
> Key: SPARK-39242
> URL: https://issues.apache.org/jira/browse/SPARK-39242
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.2.1
>Reporter: Anish Shrigondekar
>Priority: Major
>
> AwaitOffset does not wait correctly for at least the expected offset and 
> RateStreamProvider test is flaky






[jira] [Updated] (SPARK-39199) Implement pandas API missing parameters

2022-05-20 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39199:
-
Description: 
pandas API on Spark aims to make pandas code work on Spark clusters without any 
changes, so full API coverage has been one of our major goals. Currently, most 
pandas functions are implemented, but some of them have incomplete parameter 
support.

There are some common parameters missing (resolved):
 * How to handle NAs
 * Filter by data types
 * Control result length
 * Reindex the result

There are remaining missing parameters to implement (see doc below).

See the design and the current status at 
[https://docs.google.com/document/d/1H6RXL6oc-v8qLJbwKl6OEqBjRuMZaXcTYmrZb9yNm5I/edit?usp=sharing].

  was:
pandas API on Spark aims to achieve full pandas API coverage. Currently, most 
pandas functions are supported in pandas API on Spark, but with some parameters 
missing.

There are some common parameters missing:
- how to handle NAs: `skipna`, `dropna`
- filter by data types: `numeric_only`, `bool_only`
- filter result length: `keep`
- reindex the result: `ignore_index`

They support common use cases and should be prioritized.



> Implement pandas API missing parameters
> ---
>
> Key: SPARK-39199
> URL: https://issues.apache.org/jira/browse/SPARK-39199
> Project: Spark
>  Issue Type: Umbrella
>  Components: Pandas API on Spark, PySpark
>Affects Versions: 3.3.0, 3.4.0, 3.3.1
>Reporter: Xinrong Meng
>Priority: Major
>
> pandas API on Spark aims to make pandas code work on Spark clusters without 
> any changes, so full API coverage has been one of our major goals. Currently, 
> most pandas functions are implemented, but some of them have incomplete 
> parameter support.
> There are some common parameters missing (resolved):
>  * How to handle NAs
>  * Filter by data types
>  * Control result length
>  * Reindex the result
> There are remaining missing parameters to implement (see doc below).
> See the design and the current status at 
> [https://docs.google.com/document/d/1H6RXL6oc-v8qLJbwKl6OEqBjRuMZaXcTYmrZb9yNm5I/edit?usp=sharing].






[jira] [Commented] (SPARK-39242) AwaitOffset does not wait correctly for at least the expected offset and RateStreamProvider test is flaky

2022-05-20 Thread Anish Shrigondekar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17540231#comment-17540231
 ] 

Anish Shrigondekar commented on SPARK-39242:


I have found the root cause for the issue and will submit the PR soon.

> AwaitOffset does not wait correctly for at least the expected offset and 
> RateStreamProvider test is flaky
> 
>
> Key: SPARK-39242
> URL: https://issues.apache.org/jira/browse/SPARK-39242
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.2.1
>Reporter: Anish Shrigondekar
>Priority: Major
>
> AwaitOffset does not wait correctly for at least the expected offset and 
> RateStreamProvider test is flaky






[jira] [Created] (SPARK-39242) AwaitOffset does not wait correctly for at least the expected offset and RateStreamProvider test is flaky

2022-05-20 Thread Anish Shrigondekar (Jira)
Anish Shrigondekar created SPARK-39242:
--

 Summary: AwaitOffset does not wait correctly for at least the expected 
offset and RateStreamProvider test is flaky
 Key: SPARK-39242
 URL: https://issues.apache.org/jira/browse/SPARK-39242
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.2.1
Reporter: Anish Shrigondekar


AwaitOffset does not wait correctly for at least the expected offset and 
RateStreamProvider test is flaky






[jira] [Resolved] (SPARK-39240) Source and binary releases use different tools to generate hashes for integrity

2022-05-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-39240.
--
Fix Version/s: 3.3.1
   3.2.2
   Resolution: Fixed

Issue resolved by pull request 36619
[https://github.com/apache/spark/pull/36619]

> Source and binary releases use different tools to generate hashes for 
> integrity
> -
>
> Key: SPARK-39240
> URL: https://issues.apache.org/jira/browse/SPARK-39240
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Project Infra
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.3.1, 3.2.2
>
>
> shasum for source
> gpg for binary






[jira] [Assigned] (SPARK-39240) Source and binary releases use different tools to generate hashes for integrity

2022-05-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-39240:


Assignee: Kent Yao

> Source and binary releases use different tools to generate hashes for 
> integrity
> -
>
> Key: SPARK-39240
> URL: https://issues.apache.org/jira/browse/SPARK-39240
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Project Infra
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> shasum for source
> gpg for binary






[jira] [Updated] (SPARK-39240) Source and binary releases use different tools to generate hashes for integrity

2022-05-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-39240:
-
Issue Type: Improvement  (was: Bug)
  Priority: Trivial  (was: Major)

> Source and binary releases use different tools to generate hashes for 
> integrity
> -
>
> Key: SPARK-39240
> URL: https://issues.apache.org/jira/browse/SPARK-39240
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Trivial
> Fix For: 3.2.2, 3.3.1
>
>
> shasum for source
> gpg for binary






[jira] [Updated] (SPARK-39241) Spark SQL 'Like' operator behaves wrongly while filtering on partitioned column after Spark 3.1

2022-05-20 Thread Dmitry Gorbatsevich (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Gorbatsevich updated SPARK-39241:

Description: 
It seems that the introduction of "like any" in Spark 3.1 breaks the "like" 
behaviour when filtering on a partitioned column. Here is an example:

1. Create test table:
{code:java}
scala> spark.sql(
     | """
     | CREATE EXTERNAL TABLE tmp(
     |         f1 STRING
     |     )
     |     PARTITIONED BY (dt STRING)
     |     ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
     |     LINES TERMINATED BY '\n'
     |     STORED AS TEXTFILE
     |     LOCATION 's3://vlg-data-us-east-1/tmp/tmp/';
     | """) 
res2: org.apache.spark.sql.DataFrame = []{code}
2. insert something there:
{code:java}
scala> spark.sql(
     | """
     |     insert into table tmp partition(dt="2022051000") values("1")
     | """
     | )
res3: org.apache.spark.sql.DataFrame = [] {code}
3. Do select using 'like':
{code:java}
scala> spark.sql(
     |     """
     |         select * from tmp
     |         where dt like '202205100%'
     |     """
     |     ).show()
+---+---+
| f1| dt|
+---+---+
+---+---+ {code}
4. Do select using 'like any':
{code:java}
scala> spark.sql(
     |     """
     |         select * from tmp
     |         where dt like any ('202205100%')
     |     """
     |     ).show()
22/05/20 14:50:26 WARN HiveConf: HiveConf of name hive.server2.thrift.url does 
not exist
+---+--+
| f1|        dt|
+---+--+
|  1|2022051000|
+---+--+ {code}
The expectation is that results 3 and 4 are identical; however, this is not the 
case, and result #3 is obviously wrong.

 

*Environment: EMR*
Release label:emr-6.5.0
Hadoop distribution:Amazon 3.2.1
Applications:{*}Spark 3.1.2{*}, Hive 3.1.2, Livy 0.7.1
 

  was:
It seems that the introduction of "like any" in Spark 3.1 breaks the "like" 
behaviour when filtering on a partitioned column. Here is an example:

1. Create test table:

 
{code:java}
scala> spark.sql(
     | """
     | CREATE EXTERNAL TABLE tmp(
     |         f1 STRING
     |     )
     |     PARTITIONED BY (dt STRING)
     |     ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
     |     LINES TERMINATED BY '\n'
     |     STORED AS TEXTFILE
     |     LOCATION 's3://vlg-data-us-east-1/tmp/tmp/';
     | """) 
res2: org.apache.spark.sql.DataFrame = []{code}
2. insert something there:

 
{code:java}
scala> spark.sql(
     | """
     |     insert into table tmp partition(dt="2022051000") values("1")
     | """
     | )
res3: org.apache.spark.sql.DataFrame = [] {code}
 

3. Do select using 'like':

 

 
{code:java}
scala> spark.sql(
     |     """
     |         select * from tmp
     |         where dt like '202205100%'
     |     """
     |     ).show()
+---+---+
| f1| dt|
+---+---+
+---+---+ {code}
4. Do select using 'like any':

 

 
{code:java}
scala> spark.sql(
     |     """
     |         select * from tmp
     |         where dt like any ('202205100%')
     |     """
     |     ).show()
22/05/20 14:50:26 WARN HiveConf: HiveConf of name hive.server2.thrift.url does 
not exist
+---+--+
| f1|        dt|
+---+--+
|  1|2022051000|
+---+--+ {code}
The expectation is that results 3 and 4 are identical; however, this is not the 
case, and result #3 is obviously wrong.

 

*Environment: EMR*
Release label:emr-6.5.0
Hadoop distribution:Amazon 3.2.1
Applications:{*}Spark 3.1.2{*}, Hive 3.1.2, Livy 0.7.1
 


> Spark SQL 'Like' operator behaves wrongly while filtering on partitioned 
> column after Spark 3.1
> ---
>
> Key: SPARK-39241
> URL: https://issues.apache.org/jira/browse/SPARK-39241
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
> Environment: *Environment: EMR*
> Release label:emr-6.5.0
> Hadoop distribution:Amazon 3.2.1
> Applications:{*}Spark 3.1.2{*}, Hive 3.1.2, Livy 0.7.1
>Reporter: Dmitry Gorbatsevich
>Priority: Major
>
> It seems that the introduction of "like any" in Spark 3.1 breaks the "like" 
> behaviour when filtering on a partitioned column. Here is an example:
> 1. Create test table:
> {code:java}
> scala> spark.sql(
>      | """
>      | CREATE EXTERNAL TABLE tmp(
>      |         f1 STRING
>      |     )
>      |     PARTITIONED BY (dt STRING)
>      |     ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
>      |     LINES TERMINATED BY '\n'
>      |     STORED AS TEXTFILE
>      |     LOCATION 's3://vlg-data-us-east-1/tmp/tmp/';
>      | """) 
> res2: org.apache.spark.sql.DataFrame = []{code}
> 2. insert something there:
> {code:java}
> scala> spark.sql(
>      | """
>      |     insert into table tmp partition(dt="2022051000") values("1")
>      | """
>      | )
> res3: org.apache.spark.sql.DataFrame = [] {code}
> 3. Do select using 'like':
> {code:java

[jira] [Created] (SPARK-39241) Spark SQL 'Like' operator behaves wrongly while filtering on partitioned column after Spark 3.1

2022-05-20 Thread Dmitry Gorbatsevich (Jira)
Dmitry Gorbatsevich created SPARK-39241:
---

 Summary: Spark SQL 'Like' operator behaves wrongly while filtering 
on partitioned column after Spark 3.1
 Key: SPARK-39241
 URL: https://issues.apache.org/jira/browse/SPARK-39241
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.2
 Environment: *Environment: EMR*
Release label:emr-6.5.0
Hadoop distribution:Amazon 3.2.1
Applications:{*}Spark 3.1.2{*}, Hive 3.1.2, Livy 0.7.1
Reporter: Dmitry Gorbatsevich


It seems that the introduction of "like any" in Spark 3.1 breaks the "like" 
behaviour when filtering on a partitioned column. Here is an example:

1. Create test table:

 
{code:java}
scala> spark.sql(
     | """
     | CREATE EXTERNAL TABLE tmp(
     |         f1 STRING
     |     )
     |     PARTITIONED BY (dt STRING)
     |     ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
     |     LINES TERMINATED BY '\n'
     |     STORED AS TEXTFILE
     |     LOCATION 's3://vlg-data-us-east-1/tmp/tmp/';
     | """) 
res2: org.apache.spark.sql.DataFrame = []{code}
2. insert something there:

 
{code:java}
scala> spark.sql(
     | """
     |     insert into table tmp partition(dt="2022051000") values("1")
     | """
     | )
res3: org.apache.spark.sql.DataFrame = [] {code}
 

3. Do select using 'like':

 

 
{code:java}
scala> spark.sql(
     |     """
     |         select * from tmp
     |         where dt like '202205100%'
     |     """
     |     ).show()
+---+---+
| f1| dt|
+---+---+
+---+---+ {code}
4. Do select using 'like any':

 

 
{code:java}
scala> spark.sql(
     |     """
     |         select * from tmp
     |         where dt like any ('202205100%')
     |     """
     |     ).show()
22/05/20 14:50:26 WARN HiveConf: HiveConf of name hive.server2.thrift.url does 
not exist
+---+--+
| f1|        dt|
+---+--+
|  1|2022051000|
+---+--+ {code}
The expectation is that results 3 and 4 are identical; however, this is not the 
case, and result #3 is obviously wrong.

 

*Environment: EMR*
Release label:emr-6.5.0
Hadoop distribution:Amazon 3.2.1
Applications:{*}Spark 3.1.2{*}, Hive 3.1.2, Livy 0.7.1
 






[jira] [Commented] (SPARK-39240) Source and binary releases use different tools to generate hashes for integrity

2022-05-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17540017#comment-17540017
 ] 

Apache Spark commented on SPARK-39240:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/36619

> Source and binary releases use different tools to generate hashes for 
> integrity
> -
>
> Key: SPARK-39240
> URL: https://issues.apache.org/jira/browse/SPARK-39240
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Project Infra
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Kent Yao
>Priority: Major
>
> shasum for source
> gpg for binary






[jira] [Assigned] (SPARK-39240) Source and binary releases use different tools to generate hashes for integrity

2022-05-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39240:


Assignee: (was: Apache Spark)

> Source and binary releases use different tools to generate hashes for 
> integrity
> -
>
> Key: SPARK-39240
> URL: https://issues.apache.org/jira/browse/SPARK-39240
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Project Infra
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Kent Yao
>Priority: Major
>
> shasum for source
> gpg for binary






[jira] [Assigned] (SPARK-39240) Source and binary releases use different tools to generate hashes for integrity

2022-05-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39240:


Assignee: Apache Spark

> Source and binary releases use different tools to generate hashes for 
> integrity
> -
>
> Key: SPARK-39240
> URL: https://issues.apache.org/jira/browse/SPARK-39240
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Project Infra
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Major
>
> shasum for source
> gpg for binary






[jira] [Commented] (SPARK-39240) Source and binary releases use different tools to generate hashes for integrity

2022-05-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17540015#comment-17540015
 ] 

Apache Spark commented on SPARK-39240:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/36619

> Source and binary releases use different tools to generate hashes for 
> integrity
> -
>
> Key: SPARK-39240
> URL: https://issues.apache.org/jira/browse/SPARK-39240
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Project Infra
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Kent Yao
>Priority: Major
>
> shasum for source
> gpg for binary






[jira] [Commented] (SPARK-38687) Use error classes in the compilation errors of generators

2022-05-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17540011#comment-17540011
 ] 

Apache Spark commented on SPARK-38687:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36617

> Use error classes in the compilation errors of generators
> -
>
> Key: SPARK-38687
> URL: https://issues.apache.org/jira/browse/SPARK-38687
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryCompilationErrors:
> * nestedGeneratorError
> * moreThanOneGeneratorError
> * generatorOutsideSelectError
> * generatorNotExpectedError
> to use error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryCompilationErrorsSuite.
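For illustration, one such test might look roughly like the sketch below; the error-class name, the query, and the exact assertion are assumptions, not the contents of the linked pull request:
{code:scala}
// Hypothetical sketch of one test in QueryCompilationErrorsSuite.
// The error-class name "UNSUPPORTED_GENERATOR" and the query are assumptions.
test("nested generators are reported with an error class") {
  val e = intercept[AnalysisException] {
    sql("SELECT explode(explode(array(array(1, 2))))").collect()
  }
  // Asserting on the error class (exposed via SparkThrowable) rather than on
  // the raw message text is the point of the migration described above.
  assert(e.getErrorClass === "UNSUPPORTED_GENERATOR")
}
{code}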






[jira] [Commented] (SPARK-39237) Update the ANSI SQL mode documentation

2022-05-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17540008#comment-17540008
 ] 

Apache Spark commented on SPARK-39237:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/36618

> Update the ANSI SQL mode documentation
> --
>
> Key: SPARK-39237
> URL: https://issues.apache.org/jira/browse/SPARK-39237
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 3.3.0
>
>
> 1. Remove the Experimental notation in ANSI SQL compliance doc
> 2. Update the description of `spark.sql.ansi.enabled`, since enforcement of 
> ANSI reserved keywords is disabled by default now
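For context, a brief sketch of the two separate switches the description refers to; the second config name is my assumption based on the ANSI compliance documentation, not part of this ticket:
{code:scala}
// Illustrative sketch: ANSI mode and reserved-keyword enforcement are
// controlled separately; the second config name is an assumption.
spark.conf.set("spark.sql.ansi.enabled", "true")
// Reserved keywords remain usable as identifiers unless this is also enabled.
spark.conf.set("spark.sql.ansi.enforceReservedKeywords", "true")
{code}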






[jira] [Commented] (SPARK-39237) Update the ANSI SQL mode documentation

2022-05-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17540009#comment-17540009
 ] 

Apache Spark commented on SPARK-39237:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/36618

> Update the ANSI SQL mode documentation
> --
>
> Key: SPARK-39237
> URL: https://issues.apache.org/jira/browse/SPARK-39237
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 3.3.0
>
>
> 1. Remove the Experimental notation in ANSI SQL compliance doc
> 2. Update the description of `spark.sql.ansi.enabled`, since enforcement of 
> ANSI reserved keywords is disabled by default now






[jira] [Created] (SPARK-39240) Source and binary releases use different tools to generate hashes for integrity

2022-05-20 Thread Kent Yao (Jira)
Kent Yao created SPARK-39240:


 Summary: Source and binary releases use different tools to 
generate hashes for integrity
 Key: SPARK-39240
 URL: https://issues.apache.org/jira/browse/SPARK-39240
 Project: Spark
  Issue Type: Bug
  Components: Build, Project Infra
Affects Versions: 3.2.1, 3.3.0
Reporter: Kent Yao


shasum for source

gpg for binary






[jira] [Assigned] (SPARK-38687) Use error classes in the compilation errors of generators

2022-05-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38687:


Assignee: (was: Apache Spark)

> Use error classes in the compilation errors of generators
> -
>
> Key: SPARK-38687
> URL: https://issues.apache.org/jira/browse/SPARK-38687
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryCompilationErrors:
> * nestedGeneratorError
> * moreThanOneGeneratorError
> * generatorOutsideSelectError
> * generatorNotExpectedError
> to use error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryCompilationErrorsSuite.






[jira] [Commented] (SPARK-38687) Use error classes in the compilation errors of generators

2022-05-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17540010#comment-17540010
 ] 

Apache Spark commented on SPARK-38687:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36617

> Use error classes in the compilation errors of generators
> -
>
> Key: SPARK-38687
> URL: https://issues.apache.org/jira/browse/SPARK-38687
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryCompilationErrors:
> * nestedGeneratorError
> * moreThanOneGeneratorError
> * generatorOutsideSelectError
> * generatorNotExpectedError
> to use error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryCompilationErrorsSuite.






[jira] [Assigned] (SPARK-38687) Use error classes in the compilation errors of generators

2022-05-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38687:


Assignee: Apache Spark

> Use error classes in the compilation errors of generators
> -
>
> Key: SPARK-38687
> URL: https://issues.apache.org/jira/browse/SPARK-38687
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Migrate the following errors in QueryCompilationErrors:
> * nestedGeneratorError
> * moreThanOneGeneratorError
> * generatorOutsideSelectError
> * generatorNotExpectedError
> to use error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryCompilationErrorsSuite.






[jira] [Resolved] (SPARK-39237) Update the ANSI SQL mode documentation

2022-05-20 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-39237.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 36614
[https://github.com/apache/spark/pull/36614]

> Update the ANSI SQL mode documentation
> --
>
> Key: SPARK-39237
> URL: https://issues.apache.org/jira/browse/SPARK-39237
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 3.3.0
>
>
> 1. Remove the Experimental notation in ANSI SQL compliance doc
> 2. Update the description of `spark.sql.ansi.enabled`, since enforcement of 
> ANSI reserved keywords is disabled by default now






[jira] [Updated] (SPARK-39239) Parquet written by Spark in yarn mode cannot be read by Spark in local[2+] mode

2022-05-20 Thread kondziolka9ld (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kondziolka9ld updated SPARK-39239:
--
Description: 
Hi,
I came across a strange issue: data written by Spark in yarn mode cannot be 
read by Spark in local[2+] mode. By "cannot be read" I mean that the read 
operation hangs forever. Strangely enough, local[1] is able to read the parquet 
data. Additionally, repartitioning the data before writing works around the 
issue. I attached a thread dump; the thread does in fact wait on a latch. I am 
not sure whether this is a bug or some kind of misconfiguration or 
misunderstanding.

h4. Reproduction scenario:
h4. Writer console log:
{code:java}
user@host [] /tmp $ spark-shell --master yarn
[...]
scala> (1 to 1000).toDF.write.parquet("hdfs:///tmp/sample_1")
scala> (1 to 
1000).toDF.repartition(42).write.parquet("hdfs:///tmp/sample_2"){code}
h4. Reader console log:
{code:java}
user@host [] /tmp $ spark-shell --master local[2]
[...]
scala> spark.read.parquet("hdfs:///tmp/sample_2").count # data were 
repartitioned before write
res2: Long = 1000
scala> spark.read.parquet("hdfs:///tmp/sample_1").count # it will hang forever
 [Stage 5:=>                             (1 + 0) / 
2]

user@host [] /tmp $ spark-shell --master local[1]
[...]
scala> spark.read.parquet("hdfs:///tmp/sample_1").count
res0: Long = 1000                                                           
     {code}

h4. Thread dump of locked thread
{code:java}
"main" #1 prio=5 os_prio=0 tid=0x7f93b8054000 nid=0x6dce waiting on 
condition [0x7f93c0658000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0xeb65eab8> (a 
scala.concurrent.impl.Promise$CompletionLatch)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
        at 
scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:242)
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:258)
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:187)
        at org.apache.spark.util.ThreadUtils$.awaitReady(ThreadUtils.scala:334)
        at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:859)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2261)
        at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
        at org.apache.spark.rdd.RDD$$Lambda$2193/1084000875.apply(Unknown 
Source)
        at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
        at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:390)
        at org.apache.spark.sql.Dataset.$anonfun$count$1(Dataset.scala:3006)
        at 
org.apache.spark.sql.Dataset.$anonfun$count$1$adapted(Dataset.scala:3005)
        at org.apache.spark.sql.Dataset$$Lambda$2847/937335652.apply(Unknown 
Source)
        at 
org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687)
        at org.apache.spark.sql.Dataset$$Lambda$2848/1831604445.apply(Unknown 
Source)
        at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
        at 
org.apache.spark.sql.execution.SQLExecution$$$Lambda$2853/2038636888.apply(Unknown
 Source)
        at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
        at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
        at 
org.apache.spark.sql.execution.SQLExecution$$$Lambda$2849/1622269832.apply(Unknown
 Source)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
        at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
        at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3685)
        at org.apache.spark.sql.Dataset.count(Dataset.scala:3005)
        at $line19.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:24)
        at $line19.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:28)
     

[jira] [Updated] (SPARK-39239) Parquet written by Spark in yarn mode cannot be read by Spark in local[2+] mode

2022-05-20 Thread kondziolka9ld (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kondziolka9ld updated SPARK-39239:
--
Description: 
Hi,
I came across a strange issue: data written by Spark in yarn mode cannot be 
read by Spark in local[2+] mode. By "cannot be read" I mean that the read 
operation hangs forever. Strangely enough, local[1] is able to read the parquet 
data. Additionally, repartitioning the data before writing works around the 
issue. I attached a thread dump; the thread does in fact wait on a latch. I am 
not sure whether this is a bug or some kind of misconfiguration or 
misunderstanding.

h4. Reproduction scenario:
h4. Writer console log:
{code:java}
user@host [] /tmp $ spark-shell --master yarn
[...]
scala> (1 to 1000).toDF.write.parquet("hdfs:///tmp/sample_1")
scala> (1 to 
1000).toDF.repartition(42).write.parquet("hdfs:///tmp/sample_2"){code}
h4. Reader console log:
{code:java}
user@host [] /tmp $ spark-shell --master local[2]
[...]
scala> spark.read.parquet("hdfs:///tmp/sample_2").count
res2: Long = 1000
scala> spark.read.parquet("hdfs:///tmp/sample_1").count
[Stage 5:=>                             (1 + 0) / 2]

user@host [] /tmp $ spark-shell --master local[1]
[...]
scala> spark.read.parquet("hdfs:///tmp/sample_1").count
res0: Long = 1000                                                           
     {code}

h4. Thread dump of locked thread
{code:java}
"main" #1 prio=5 os_prio=0 tid=0x7f93b8054000 nid=0x6dce waiting on 
condition [0x7f93c0658000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0xeb65eab8> (a 
scala.concurrent.impl.Promise$CompletionLatch)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
        at 
scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:242)
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:258)
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:187)
        at org.apache.spark.util.ThreadUtils$.awaitReady(ThreadUtils.scala:334)
        at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:859)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2261)
        at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
        at org.apache.spark.rdd.RDD$$Lambda$2193/1084000875.apply(Unknown 
Source)
        at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
        at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:390)
        at org.apache.spark.sql.Dataset.$anonfun$count$1(Dataset.scala:3006)
        at 
org.apache.spark.sql.Dataset.$anonfun$count$1$adapted(Dataset.scala:3005)
        at org.apache.spark.sql.Dataset$$Lambda$2847/937335652.apply(Unknown 
Source)
        at 
org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687)
        at org.apache.spark.sql.Dataset$$Lambda$2848/1831604445.apply(Unknown 
Source)
        at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
        at 
org.apache.spark.sql.execution.SQLExecution$$$Lambda$2853/2038636888.apply(Unknown
 Source)
        at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
        at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
        at 
org.apache.spark.sql.execution.SQLExecution$$$Lambda$2849/1622269832.apply(Unknown
 Source)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
        at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
        at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3685)
        at org.apache.spark.sql.Dataset.count(Dataset.scala:3005)
        at $line19.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:24)
        at $line19.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:28)
        at $line19.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:30)
        at $line1

[jira] [Updated] (SPARK-39239) Parquet written by spark in yarn mode can not be read by spark in local[2+] mode

2022-05-20 Thread kondziolka9ld (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kondziolka9ld updated SPARK-39239:
--
Description: 
Hi,
I came across a strange issue: data written by Spark in yarn mode cannot be 
read by Spark in local[2+] mode. By "cannot be read" I mean that the read 
operation hangs forever. Strangely enough, local[1] is able to read this 
parquet data. Additionally, repartitioning the data before writing works 
around the problem. I attached a thread dump; the reading thread does indeed 
wait on a latch.
I am not sure if it is a bug or some kind of misconfiguration or 
misunderstanding.
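
For reference, a dump like the attached one can be taken from a second 
terminal while the reader shell is stuck, e.g. with the standard JDK tools. 
The exact commands below are illustrative only, not part of the original 
report, and the PID is a placeholder:
{code:java}
user@host [] /tmp $ jps -l                       # locate the spark-shell driver (SparkSubmit) JVM
user@host [] /tmp $ jstack <driver-pid> > threaddump_spark_shell
{code}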

h4. Reproduction scenario:
h4. Writer console log:
{code:java}
user@host [] /tmp $ spark-shell --master yarn
[...]
scala> (1 to 1000).toDF.write.parquet("hdfs:///tmp/sample_1")
scala> (1 to 
1000).toDF.repartition(42).write.parquet("hdfs:///tmp/sample_2"){code}
h4. Reader console log:
{code:java}
user@host [] /tmp $ spark-shell --master local[2]
[...]
scala> spark.read.parquet("hdfs:///tmp/sample_2").count
res2: Long = 1000
scala> spark.read.parquet("hdfs:///tmp/sample_1").count
[Stage 5:=>                             (1 + 0) / 2]

user@host [] /tmp $ spark-shell --master local[1]
[...]
scala> spark.read.parquet("hdfs:///tmp/sample_1").count
res0: Long = 1000                                                           
     {code}
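
Since the hang seems tied to how many files/tasks the local[N] reader has to 
process (sample_1 stalls at task 1 of 2 under local[2]), a minimal diagnostic 
sketch is to count the part files each write produced. It assumes the HDFS 
paths above and a running spark-shell; it is not part of the original report:
{code:java}
// Sketch only (not from the report): count the part files under each output
// directory, from the same spark-shell session as the repro above.
import org.apache.hadoop.fs.Path

Seq("hdfs:///tmp/sample_1", "hdfs:///tmp/sample_2").foreach { dir =>
  val p = new Path(dir)
  // resolve the filesystem for this path and count its part files
  val fs = p.getFileSystem(spark.sparkContext.hadoopConfiguration)
  val parts = fs.listStatus(p).count(_.getPath.getName.startsWith("part-"))
  println(s"$dir -> $parts part file(s)")
}
{code}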

h4. Thread dump of locked thread
{code:java}
"main" #1 prio=5 os_prio=0 tid=0x7f93b8054000 nid=0x6dce waiting on 
condition [0x7f93c0658000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0xeb65eab8> (a 
scala.concurrent.impl.Promise$CompletionLatch)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
        at 
scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:242)
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:258)
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:187)
        at org.apache.spark.util.ThreadUtils$.awaitReady(ThreadUtils.scala:334)
        at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:859)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2261)
        at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
        at org.apache.spark.rdd.RDD$$Lambda$2193/1084000875.apply(Unknown 
Source)
        at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
        at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:390)
        at org.apache.spark.sql.Dataset.$anonfun$count$1(Dataset.scala:3006)
        at 
org.apache.spark.sql.Dataset.$anonfun$count$1$adapted(Dataset.scala:3005)
        at org.apache.spark.sql.Dataset$$Lambda$2847/937335652.apply(Unknown 
Source)
        at 
org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687)
        at org.apache.spark.sql.Dataset$$Lambda$2848/1831604445.apply(Unknown 
Source)
        at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
        at 
org.apache.spark.sql.execution.SQLExecution$$$Lambda$2853/2038636888.apply(Unknown
 Source)
        at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
        at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
        at 
org.apache.spark.sql.execution.SQLExecution$$$Lambda$2849/1622269832.apply(Unknown
 Source)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
        at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
        at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3685)
        at org.apache.spark.sql.Dataset.count(Dataset.scala:3005)
        at $line19.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:24)
        at $line19.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:28)
        at $line19.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:30)
        at $line19

[jira] [Updated] (SPARK-39239) Parquet written by spark in yarn mode can not be read by spark in local[2+] mode

2022-05-20 Thread kondziolka9ld (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kondziolka9ld updated SPARK-39239:
--
Attachment: threaddump_spark_shell

> Parquet written by spark in yarn mode can not be read by spark in local[2+] 
> mode
> 
>
> Key: SPARK-39239
> URL: https://issues.apache.org/jira/browse/SPARK-39239
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: kondziolka9ld
>Priority: Minor
> Attachments: threaddump_spark_shell
>
>
> Hi,
> I came across a strange issue: data written by Spark in yarn mode cannot be 
> read by Spark in local[2+] mode. By "cannot be read" I mean that the read 
> operation hangs forever. Strangely enough, local[1] is able to read this 
> parquet data. Additionally, repartitioning the data before writing works 
> around the problem. I attached a thread dump; the reading thread does indeed 
> wait on a latch.
> I am not sure if it is a bug or some kind of misconfiguration or 
> misunderstanding.
> 
> h4. Reproduction scenario:
> h4. Writer console log:
> {code:java}
> user@host [] /tmp $ spark-shell --master yarn
> [...]
> scala> (1 to 1000).toDF.write.parquet("hdfs:///tmp/sample_1")
> scala> (1 to 
> 1000).toDF.repartition(42).write.parquet("hdfs:///tmp/sample_2"){code}
> h4. Reader console log:
> {code:java}
> user@host [] /tmp $ spark-shell --master local[2]
> [...]
> scala> spark.read.parquet("hdfs:///tmp/sample_2").count
> res2: Long = 1000
> scala> spark.read.parquet("hdfs:///tmp/sample_1").count
> [Stage 5:=>                             (1 + 0) / 
> 2]
> user@host [] /tmp $ spark-shell --master local[1]
> [...]
> scala> spark.read.parquet("hdfs:///tmp/sample_1").count
> res0: Long = 1000                                                           
>      {code}
> 
> h4. Thread dump of locked thread
> {code:java}
> "main" #1 prio=5 os_prio=0 tid=0x7f93b8054000 nid=0x6dce waiting on 
> condition [0x7f93c0658000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0xeb65eab8> (a 
> scala.concurrent.impl.Promise$CompletionLatch)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>         at 
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:242)
>         at 
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:258)
>         at 
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:187)
>         at 
> org.apache.spark.util.ThreadUtils$.awaitReady(ThreadUtils.scala:334)
>         at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:859)
>         at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
>         at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217)
>         at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
>         at org.apache.spark.SparkContext.runJob(SparkContext.scala:2261)
>         at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
>         at org.apache.spark.rdd.RDD$$Lambda$2193/1084000875.apply(Unknown 
> Source)
>         at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>         at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>         at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
>         at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
>         at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:390)
>         at org.apache.spark.sql.Dataset.$anonfun$count$1(Dataset.scala:3006)
>         at 
> org.apache.spark.sql.Dataset.$anonfun$count$1$adapted(Dataset.scala:3005)
>         at org.apache.spark.sql.Dataset$$Lambda$2847/937335652.apply(Unknown 
> Source)
>         at 
> org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687)
>         at org.apache.spark.sql.Dataset$$Lambda$2848/1831604445.apply(Unknown 
> Source)
>         at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>         at 
> org.apache.spark.sql.execution.SQLExecution$$$Lambda$2853/2038636888.apply(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>      

[jira] [Created] (SPARK-39239) Parquet written by spark in yarn mode can not be read by spark in local[2+] mode

2022-05-20 Thread kondziolka9ld (Jira)
kondziolka9ld created SPARK-39239:
-

 Summary: Parquet written by spark in yarn mode can not be read by 
spark in local[2+] mode
 Key: SPARK-39239
 URL: https://issues.apache.org/jira/browse/SPARK-39239
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.1.2
Reporter: kondziolka9ld


Hi,
I came across a strange issue: data written by Spark in yarn mode cannot be 
read by Spark in local[2+] mode. By "cannot be read" I mean that the read 
operation hangs forever. Strangely enough, local[1] is able to read this 
parquet data. Additionally, repartitioning the data before writing works 
around the problem. I attached a thread dump; the reading thread does indeed 
wait on a latch.
I am not sure if it is a bug or some kind of misconfiguration or 
misunderstanding.

h4. Reproduction scenario:
h4. Writer console log:
{code:java}
user@host [] /tmp $ spark-shell --master yarn
[...]
scala> (1 to 1000).toDF.write.parquet("hdfs:///tmp/sample_1")
scala> (1 to 
1000).toDF.repartition(42).write.parquet("hdfs:///tmp/sample_2"){code}
h4. Reader console log:
{code:java}
user@host [] /tmp $ spark-shell --master local[2]
[...]
scala> spark.read.parquet("hdfs:///tmp/sample_2").count
res2: Long = 1000
scala> spark.read.parquet("hdfs:///tmp/sample_1").count
[Stage 5:=>                             (1 + 0) / 2]

user@host [] /tmp $ spark-shell --master local[1]
[...]
scala> spark.read.parquet("hdfs:///tmp/sample_1").count
res0: Long = 1000                                                           
     {code}
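
A further hypothetical cross-check (not something done in this report) would 
be to write the same data from the local shell itself and read it back under 
local[2], to see whether only the yarn-written files trigger the hang; the 
output path below is a placeholder:
{code:java}
// Sketch only: same data shape as above, but written by the local[2] shell.
(1 to 1000).toDF.write.parquet("hdfs:///tmp/sample_local")
spark.read.parquet("hdfs:///tmp/sample_local").count
{code}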

h4. Thread dump of locked thread
{code:java}
"main" #1 prio=5 os_prio=0 tid=0x7f93b8054000 nid=0x6dce waiting on 
condition [0x7f93c0658000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0xeb65eab8> (a 
scala.concurrent.impl.Promise$CompletionLatch)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
        at 
scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:242)
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:258)
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:187)
        at org.apache.spark.util.ThreadUtils$.awaitReady(ThreadUtils.scala:334)
        at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:859)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2261)
        at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
        at org.apache.spark.rdd.RDD$$Lambda$2193/1084000875.apply(Unknown 
Source)
        at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
        at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:390)
        at org.apache.spark.sql.Dataset.$anonfun$count$1(Dataset.scala:3006)
        at 
org.apache.spark.sql.Dataset.$anonfun$count$1$adapted(Dataset.scala:3005)
        at org.apache.spark.sql.Dataset$$Lambda$2847/937335652.apply(Unknown 
Source)
        at 
org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687)
        at org.apache.spark.sql.Dataset$$Lambda$2848/1831604445.apply(Unknown 
Source)
        at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
        at 
org.apache.spark.sql.execution.SQLExecution$$$Lambda$2853/2038636888.apply(Unknown
 Source)
        at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
        at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
        at 
org.apache.spark.sql.execution.SQLExecution$$$Lambda$2849/1622269832.apply(Unknown
 Source)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
        at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
        at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3685)
        at org.apache.spark.sql.Dataset.co