[jira] [Created] (SPARK-38895) Unify the AQE shuffle read canonicalized

2022-04-13 Thread XiDuo You (Jira)
XiDuo You created SPARK-38895:
-

 Summary: Unify the AQE shuffle read canonicalized
 Key: SPARK-38895
 URL: https://issues.apache.org/jira/browse/SPARK-38895
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: XiDuo You


After canonicalization, the child of AQEShuffleReadExec will be an exchange 
instead of a shuffle query stage. For better maintenance, we can simply 
override isCanonicalizedPlan and let the framework check whether the plan can 
be executed.
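
A self-contained sketch of the proposed pattern (illustrative only, not the 
actual Spark code): canonicalization replaces the shuffle-query-stage child 
with a bare exchange, so the read node can flag itself as canonicalized and a 
single framework-level guard in execute() can reject it.

{code:scala}
sealed trait Plan {
  def canonicalize: Plan
  def isCanonicalizedPlan: Boolean = false
  final def execute(): Unit = {
    // The framework-level check: canonicalized plans must never run.
    require(!isCanonicalizedPlan, "a canonicalized plan is not supposed to be executed")
    println(s"executing $this")
  }
}

case class Exchange() extends Plan {
  def canonicalize: Plan = this
}

// At runtime, AQE wraps a materialized exchange in a query stage.
case class ShuffleQueryStage(exchange: Exchange) extends Plan {
  def canonicalize: Plan = exchange.canonicalize
}

case class AQEShuffleRead(child: Plan) extends Plan {
  def canonicalize: Plan = AQEShuffleRead(child.canonicalize)
  // After canonicalization the child is an Exchange rather than a shuffle
  // query stage, so the executability check can live right here.
  override def isCanonicalizedPlan: Boolean = !child.isInstanceOf[ShuffleQueryStage]
}

object Demo extends App {
  val runtimePlan = AQEShuffleRead(ShuffleQueryStage(Exchange()))
  runtimePlan.execute()                  // OK: the child is still a query stage
  // runtimePlan.canonicalize.execute()  // would throw: the child became an Exchange
}
{code}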






[jira] [Resolved] (SPARK-38725) Test the error class: DUPLICATE_KEY

2022-04-13 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-38725.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36188
[https://github.com/apache/spark/pull/36188]

> Test the error class: DUPLICATE_KEY
> ---
>
> Key: SPARK-38725
> URL: https://issues.apache.org/jira/browse/SPARK-38725
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: panbingkun
>Priority: Minor
>  Labels: starter
> Fix For: 3.4.0
>
>
> Add at least one test for the error class *DUPLICATE_KEY* to 
> QueryParsingErrorsSuite. The test should cover the exception thrown in 
> QueryParsingErrors:
> {code:scala}
>   def duplicateKeysError(key: String, ctx: ParserRuleContext): Throwable = {
>     // Found duplicate keys '$key'
>     new ParseException(errorClass = "DUPLICATE_KEY",
>       messageParameters = Array(key), ctx)
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must check:+
> # the entire error message
> # the sqlState, if it is defined in the error-classes.json file
> # the error class
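
For reference, a minimal sketch of what such a test might look like (the 
triggering statement and the sqlState value are assumptions based on the 
parser's duplicate-key check and error-classes.json, not the merged test):

{code:scala}
test("DUPLICATE_KEY: duplicate keys in table options") {
  val e = intercept[ParseException] {
    sql("CREATE TABLE t (i INT) USING parquet OPTIONS ('k' = 'v1', 'k' = 'v2')")
  }
  assert(e.getErrorClass === "DUPLICATE_KEY")
  assert(e.getSqlState === "23505") // assumption: as defined in error-classes.json
  assert(e.getMessage.contains("Found duplicate keys 'k'"))
}
{code}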






[jira] [Assigned] (SPARK-38725) Test the error class: DUPLICATE_KEY

2022-04-13 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-38725:


Assignee: panbingkun

> Test the error class: DUPLICATE_KEY
> ---
>
> Key: SPARK-38725
> URL: https://issues.apache.org/jira/browse/SPARK-38725
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: panbingkun
>Priority: Minor
>  Labels: starter
>
> Add at least one test for the error class *DUPLICATE_KEY* to 
> QueryParsingErrorsSuite. The test should cover the exception thrown in 
> QueryParsingErrors:
> {code:scala}
>   def duplicateKeysError(key: String, ctx: ParserRuleContext): Throwable = {
>     // Found duplicate keys '$key'
>     new ParseException(errorClass = "DUPLICATE_KEY",
>       messageParameters = Array(key), ctx)
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must check:+
> # the entire error message
> # the sqlState, if it is defined in the error-classes.json file
> # the error class






[jira] [Commented] (SPARK-38724) Test the error class: DIVIDE_BY_ZERO

2022-04-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522090#comment-17522090
 ] 

Apache Spark commented on SPARK-38724:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36193

> Test the error class: DIVIDE_BY_ZERO
> 
>
> Key: SPARK-38724
> URL: https://issues.apache.org/jira/browse/SPARK-38724
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Add at least one test for the error class *DIVIDE_BY_ZERO* to 
> QueryExecutionErrorsSuite. The test should cover the exception thrown in 
> QueryExecutionErrors:
> {code:scala}
>   def divideByZeroError(): ArithmeticException = {
>     new SparkArithmeticException(
>       errorClass = "DIVIDE_BY_ZERO",
>       messageParameters = Array(SQLConf.ANSI_ENABLED.key))
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must check:+
> # the entire error message
> # the sqlState, if it is defined in the error-classes.json file
> # the error class
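
For reference, a minimal sketch of such a test (assuming ANSI mode makes 
division by zero reach divideByZeroError, and that the sqlState comes from 
error-classes.json; not the merged test):

{code:scala}
test("DIVIDE_BY_ZERO: division by zero under ANSI mode") {
  withSQLConf(SQLConf.ANSI_ENABLED.key -> "true") {
    val e = intercept[SparkArithmeticException] {
      sql("SELECT 6 / 0").collect()
    }
    assert(e.getErrorClass === "DIVIDE_BY_ZERO")
    assert(e.getSqlState === "22012") // assumption: as defined in error-classes.json
    // The message embeds the ANSI config key passed as a message parameter.
    assert(e.getMessage.contains(SQLConf.ANSI_ENABLED.key))
  }
}
{code}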






[jira] [Assigned] (SPARK-38724) Test the error class: DIVIDE_BY_ZERO

2022-04-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38724:


Assignee: Apache Spark

> Test the error class: DIVIDE_BY_ZERO
> 
>
> Key: SPARK-38724
> URL: https://issues.apache.org/jira/browse/SPARK-38724
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Minor
>  Labels: starter
>
> Add at least one test for the error class *DIVIDE_BY_ZERO* to 
> QueryExecutionErrorsSuite. The test should cover the exception thrown in 
> QueryExecutionErrors:
> {code:scala}
>   def divideByZeroError(): ArithmeticException = {
>     new SparkArithmeticException(
>       errorClass = "DIVIDE_BY_ZERO",
>       messageParameters = Array(SQLConf.ANSI_ENABLED.key))
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must check:+
> # the entire error message
> # the sqlState, if it is defined in the error-classes.json file
> # the error class






[jira] [Assigned] (SPARK-38724) Test the error class: DIVIDE_BY_ZERO

2022-04-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38724:


Assignee: (was: Apache Spark)

> Test the error class: DIVIDE_BY_ZERO
> 
>
> Key: SPARK-38724
> URL: https://issues.apache.org/jira/browse/SPARK-38724
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Add at least one test for the error class *DIVIDE_BY_ZERO* to 
> QueryExecutionErrorsSuite. The test should cover the exception thrown in 
> QueryExecutionErrors:
> {code:scala}
>   def divideByZeroError(): ArithmeticException = {
>     new SparkArithmeticException(
>       errorClass = "DIVIDE_BY_ZERO",
>       messageParameters = Array(SQLConf.ANSI_ENABLED.key))
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must check:+
> # the entire error message
> # the sqlState, if it is defined in the error-classes.json file
> # the error class






[jira] [Commented] (SPARK-38724) Test the error class: DIVIDE_BY_ZERO

2022-04-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522089#comment-17522089
 ] 

Apache Spark commented on SPARK-38724:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36193

> Test the error class: DIVIDE_BY_ZERO
> 
>
> Key: SPARK-38724
> URL: https://issues.apache.org/jira/browse/SPARK-38724
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Add at least one test for the error class *DIVIDE_BY_ZERO* to 
> QueryExecutionErrorsSuite. The test should cover the exception thrown in 
> QueryExecutionErrors:
> {code:scala}
>   def divideByZeroError(): ArithmeticException = {
>     new SparkArithmeticException(
>       errorClass = "DIVIDE_BY_ZERO",
>       messageParameters = Array(SQLConf.ANSI_ENABLED.key))
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must check:+
> # the entire error message
> # the sqlState, if it is defined in the error-classes.json file
> # the error class






[jira] [Assigned] (SPARK-38550) Use a disk-based store to save more information in live UI to help debug

2022-04-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-38550:
---

Assignee: Linhong Liu

> Use a disk-based store to save more information in live UI to help debug
> 
>
> Key: SPARK-38550
> URL: https://issues.apache.org/jira/browse/SPARK-38550
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: Linhong Liu
>Assignee: Linhong Liu
>Priority: Major
>
> In Spark, the UI lacks troubleshooting abilities. For example:
> * AQE plan changes are not available
> * the plan description of a large plan is truncated
> This is because the live UI depends on an in-memory KV store, so we always 
> have to worry about stability when adding more information to the store.
> Therefore, it's better to add a disk-based store to save more information.
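
For context, Spark already ships a KVStore abstraction with both in-memory 
and disk-backed implementations (the history server can persist its data 
through the disk-backed one when configured). A rough sketch of the idea, 
where the factory wiring is an assumption rather than the actual patch:

{code:scala}
import java.io.File

import org.apache.spark.util.kvstore.{InMemoryStore, KVStore, LevelDB}

object LiveStoreSketch {
  // The live UI currently always uses an InMemoryStore; the proposal is to
  // optionally back the same KVStore interface with a disk-based store.
  def createLiveStore(diskBacked: Boolean, path: File): KVStore =
    if (diskBacked) new LevelDB(path) else new InMemoryStore()
}
{code}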






[jira] [Resolved] (SPARK-38550) Use a disk-based store to save more information in live UI to help debug

2022-04-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-38550.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 35856
[https://github.com/apache/spark/pull/35856]

> Use a disk-based store to save more information in live UI to help debug
> 
>
> Key: SPARK-38550
> URL: https://issues.apache.org/jira/browse/SPARK-38550
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: Linhong Liu
>Assignee: Linhong Liu
>Priority: Major
> Fix For: 3.4.0
>
>
> In Spark, the UI lacks troubleshooting abilities. For example:
> * AQE plan changes are not available
> * the plan description of a large plan is truncated
> This is because the live UI depends on an in-memory KV store, so we always 
> have to worry about stability when adding more information to the store.
> Therefore, it's better to add a disk-based store to save more information.






[jira] [Comment Edited] (SPARK-38888) Add `RocksDBProvider` similar to `LevelDBProvider`

2022-04-13 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522070#comment-17522070
 ] 

Yang Jie edited comment on SPARK-38888 at 4/14/22 5:44 AM:
---

OK ~ This may involve some preliminary refactoring work. I will put up a 
draft first so that we can determine whether the issue is really valuable.


was (Author: luciferyang):
OK ~ This may involve some preparatory refactoring. I will give a draft first 
so that we can determine whether the issue is really valuable.

> Add `RocksDBProvider` similar to `LevelDBProvider`
> --
>
> Key: SPARK-38888
> URL: https://issues.apache.org/jira/browse/SPARK-38888
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, YARN
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> `LevelDBProvider` is used by `ExternalShuffleBlockResolver` and 
> `YarnShuffleService`; a corresponding `RocksDB`-based implementation should 
> be added.
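
A hypothetical sketch of what a `RocksDBProvider` mirroring 
`LevelDBProvider`'s open-or-create-plus-version-check shape could look like 
(names and signatures are illustrative, not the eventual implementation):

{code:scala}
import java.io.{File, IOException}

import org.rocksdb.{Options, RocksDB}

object RocksDBProvider {
  RocksDB.loadLibrary()
  private val VersionKey = "StoreVersion".getBytes("UTF-8")

  def initRocksDB(dbFile: File, expectedVersion: Byte): RocksDB = {
    val options = new Options().setCreateIfMissing(true)
    val db = RocksDB.open(options, dbFile.getAbsolutePath)
    val stored = db.get(VersionKey)
    if (stored == null) {
      // Fresh store: stamp the current version.
      db.put(VersionKey, Array(expectedVersion))
    } else if (stored(0) != expectedVersion) {
      db.close()
      throw new IOException(s"cannot read state DB with version ${stored(0)}")
    }
    db
  }
}
{code}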






[jira] [Commented] (SPARK-38888) Add `RocksDBProvider` similar to `LevelDBProvider`

2022-04-13 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522070#comment-17522070
 ] 

Yang Jie commented on SPARK-38888:
--

OK ~ This may involve some preparatory refactoring. I will give a draft first 
so that we can determine whether the issue is really valuable.

> Add `RocksDBProvider` similar to `LevelDBProvider`
> --
>
> Key: SPARK-38888
> URL: https://issues.apache.org/jira/browse/SPARK-38888
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, YARN
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> `LevelDBProvider` is used by `ExternalShuffleBlockResolver` and 
> `YarnShuffleService`; a corresponding `RocksDB`-based implementation should 
> be added.






[jira] [Commented] (SPARK-38721) Test the error class: CANNOT_PARSE_DECIMAL

2022-04-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522066#comment-17522066
 ] 

Apache Spark commented on SPARK-38721:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36192

> Test the error class: CANNOT_PARSE_DECIMAL
> --
>
> Key: SPARK-38721
> URL: https://issues.apache.org/jira/browse/SPARK-38721
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Add at least one test for the error class *CANNOT_PARSE_DECIMAL* to 
> QueryExecutionErrorsSuite. The test should cover the exception thrown in 
> QueryExecutionErrors:
> {code:scala}
>   def cannotParseDecimalError(): Throwable = {
>     new SparkIllegalStateException(errorClass = "CANNOT_PARSE_DECIMAL",
>       messageParameters = Array.empty)
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must check:+
> # the entire error message
> # the sqlState, if it is defined in the error-classes.json file
> # the error class
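
For reference, a minimal sketch that exercises the error method directly 
(illustrative; a real test would trigger the error through a user-facing path 
such as malformed decimal input, and the message text is assumed from 
error-classes.json):

{code:scala}
test("CANNOT_PARSE_DECIMAL: error class metadata") {
  val e = intercept[SparkIllegalStateException] {
    throw QueryExecutionErrors.cannotParseDecimalError()
  }
  assert(e.getErrorClass === "CANNOT_PARSE_DECIMAL")
  assert(e.getMessage.contains("Cannot parse decimal"))
}
{code}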






[jira] [Commented] (SPARK-38721) Test the error class: CANNOT_PARSE_DECIMAL

2022-04-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522065#comment-17522065
 ] 

Apache Spark commented on SPARK-38721:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36192

> Test the error class: CANNOT_PARSE_DECIMAL
> --
>
> Key: SPARK-38721
> URL: https://issues.apache.org/jira/browse/SPARK-38721
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Add at least one test for the error class *CANNOT_PARSE_DECIMAL* to 
> QueryExecutionErrorsSuite. The test should cover the exception thrown in 
> QueryExecutionErrors:
> {code:scala}
>   def cannotParseDecimalError(): Throwable = {
>     new SparkIllegalStateException(errorClass = "CANNOT_PARSE_DECIMAL",
>       messageParameters = Array.empty)
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must check:+
> # the entire error message
> # the sqlState, if it is defined in the error-classes.json file
> # the error class






[jira] [Commented] (SPARK-38894) Exclude pyspark.cloudpickle in test coverage report

2022-04-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522063#comment-17522063
 ] 

Apache Spark commented on SPARK-38894:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/36191

> Exclude pyspark.cloudpickle in test coverage report
> ---
>
> Key: SPARK-38894
> URL: https://issues.apache.org/jira/browse/SPARK-38894
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> pyspark.cloudpickle is a verbatim copy of the upstream cloudpickle project, 
> so we don't need to check its test coverage again here.






[jira] [Assigned] (SPARK-38894) Exclude pyspark.cloudpickle in test coverage report

2022-04-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38894:


Assignee: (was: Apache Spark)

> Exclude pyspark.cloudpickle in test coverage report
> ---
>
> Key: SPARK-38894
> URL: https://issues.apache.org/jira/browse/SPARK-38894
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> pyspark.cloudpickle is a verbatim copy of the upstream cloudpickle project, 
> so we don't need to check its test coverage again here.






[jira] [Commented] (SPARK-38894) Exclude pyspark.cloudpickle in test coverage report

2022-04-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522062#comment-17522062
 ] 

Apache Spark commented on SPARK-38894:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/36191

> Exclude pyspark.cloudpickle in test coverage report
> ---
>
> Key: SPARK-38894
> URL: https://issues.apache.org/jira/browse/SPARK-38894
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> pyspark.cloudpickle is a verbatim copy of the upstream cloudpickle project, 
> so we don't need to check its test coverage again here.






[jira] [Assigned] (SPARK-38894) Exclude pyspark.cloudpickle in test coverage report

2022-04-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38894:


Assignee: Apache Spark

> Exclude pyspark.cloudpickle in test coverage report
> ---
>
> Key: SPARK-38894
> URL: https://issues.apache.org/jira/browse/SPARK-38894
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> pyspark.cloudpickle is a verbatim copy of the upstream cloudpickle project, 
> so we don't need to check its test coverage again here.






[jira] [Created] (SPARK-38894) Exclude pyspark.cloudpickle in test coverage report

2022-04-13 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-38894:


 Summary: Exclude pyspark.cloudpickle in test coverage report
 Key: SPARK-38894
 URL: https://issues.apache.org/jira/browse/SPARK-38894
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark, Tests
Affects Versions: 3.4.0
Reporter: Hyukjin Kwon


pyspark.cloudpickle is a verbatim copy of the upstream cloudpickle project, so 
we don't need to check its test coverage again here.






[jira] [Updated] (SPARK-38894) Exclude pyspark.cloudpickle in test coverage report

2022-04-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-38894:
-
Priority: Minor  (was: Major)

> Exclude pyspark.cloudpickle in test coverage report
> ---
>
> Key: SPARK-38894
> URL: https://issues.apache.org/jira/browse/SPARK-38894
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> pyspark.cloudpickle is a verbatim copy of the upstream cloudpickle project, 
> so we don't need to check its test coverage again here.






[jira] [Assigned] (SPARK-38893) Test SourceProgress in PySpark

2022-04-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38893:


Assignee: (was: Apache Spark)

> Test SourceProgress in PySpark
> --
>
> Key: SPARK-38893
> URL: https://issues.apache.org/jira/browse/SPARK-38893
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Due to a mistake, we are not testing SourceProgress (see 
> https://app.codecov.io/gh/apache/spark/blob/master/python/pyspark/sql/streaming/listener.py).
> We should probably test it.






[jira] [Commented] (SPARK-38893) Test SourceProgress in PySpark

2022-04-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522055#comment-17522055
 ] 

Apache Spark commented on SPARK-38893:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/36190

> Test SourceProgress in PySpark
> --
>
> Key: SPARK-38893
> URL: https://issues.apache.org/jira/browse/SPARK-38893
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Due to a mistake, we are not testing SourceProgress (see 
> https://app.codecov.io/gh/apache/spark/blob/master/python/pyspark/sql/streaming/listener.py).
> We should probably test it.






[jira] [Assigned] (SPARK-38893) Test SourceProgress in PySpark

2022-04-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38893:


Assignee: Apache Spark

> Test SourceProgress in PySpark
> --
>
> Key: SPARK-38893
> URL: https://issues.apache.org/jira/browse/SPARK-38893
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> Due to a mistake, we are not testing SourceProgress (see 
> https://app.codecov.io/gh/apache/spark/blob/master/python/pyspark/sql/streaming/listener.py).
> We should probably test it.






[jira] [Updated] (SPARK-38893) Test SourceProgress in PySpark

2022-04-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-38893:
-
Issue Type: Test  (was: Bug)

> Test SourceProgress in PySpark
> --
>
> Key: SPARK-38893
> URL: https://issues.apache.org/jira/browse/SPARK-38893
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Due to a mistake, we are not testing SourceProgress (see 
> https://app.codecov.io/gh/apache/spark/blob/master/python/pyspark/sql/streaming/listener.py).
> We should probably test it.






[jira] [Created] (SPARK-38893) Test SourceProgress in PySpark

2022-04-13 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-38893:


 Summary: Test SourceProgress in PySpark
 Key: SPARK-38893
 URL: https://issues.apache.org/jira/browse/SPARK-38893
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Structured Streaming
Affects Versions: 3.4.0
Reporter: Hyukjin Kwon


Due to a mistake, we are not testing SourceProgress (see 
https://app.codecov.io/gh/apache/spark/blob/master/python/pyspark/sql/streaming/listener.py).
We should probably test it.






[jira] [Assigned] (SPARK-38889) Invalid column name while querying bit type column in MSSQL

2022-04-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-38889:


Assignee: Allison Wang

> Invalid column name while querying bit type column in MSSQL
> ---
>
> Key: SPARK-38889
> URL: https://issues.apache.org/jira/browse/SPARK-38889
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
>
> After https://issues.apache.org/jira/browse/SPARK-36644, boolean column 
> filters can be pushed down to data sources. However, MSSQL only accepts bit 
> type columns, and the current MSSQL JDBC dialect does not compile the boolean 
> values in pushed predicates into the bit type.
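
A sketch of one possible shape of the fix (illustrative, not necessarily the 
merged change), using JdbcDialect's compileValue hook to render boolean 
literals as bit values:

{code:scala}
import org.apache.spark.sql.jdbc.JdbcDialect

// Hypothetical standalone dialect; the real fix would live in the built-in
// MsSqlServerDialect.
object BitAwareMsSqlDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")

  override def compileValue(value: Any): Any = value match {
    case b: Boolean => if (b) 1 else 0 // MSSQL bit literal
    case other => super.compileValue(other)
  }
}
{code}

For experiments, such a dialect could be registered via 
JdbcDialects.registerDialect(BitAwareMsSqlDialect).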






[jira] [Resolved] (SPARK-38889) Invalid column name while querying bit type column in MSSQL

2022-04-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38889.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 36182
[https://github.com/apache/spark/pull/36182]

> Invalid column name while querying bit type column in MSSQL
> ---
>
> Key: SPARK-38889
> URL: https://issues.apache.org/jira/browse/SPARK-38889
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
> Fix For: 3.3.0
>
>
> After https://issues.apache.org/jira/browse/SPARK-36644, boolean column 
> filters can be pushed down to data sources. However, MSSQL only accepts bit 
> type columns, and the current MSSQL JDBC dialect does not compile the boolean 
> values in pushed predicates into the bit type.






[jira] [Commented] (SPARK-38892) Fix the UT of schema equal assert

2022-04-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522027#comment-17522027
 ] 

Apache Spark commented on SPARK-38892:
--

User 'fhygh' has created a pull request for this issue:
https://github.com/apache/spark/pull/36189

> Fix the UT of schema equal assert
> -
>
> Key: SPARK-38892
> URL: https://issues.apache.org/jira/browse/SPARK-38892
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1, 3.3.0
>Reporter: YuanGuanhu
>Priority: Major
>
> In ParquetPartitionDiscoverySuite, some asserts have no practical 
> significance, for example:
> {code:java}
> assert(input.schema.sameType(input.schema)) {code}
> To make the UT meaningful, such asserts should be fixed to check the actual 
> result against the expected one.
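
A sketch of the intended shape of the fix (names such as `dir` are 
illustrative of the suite's context): compare the schema read back from disk 
against the written DataFrame's schema instead of comparing a schema with 
itself.

{code:scala}
// Before: trivially true, verifies nothing.
assert(input.schema.sameType(input.schema))

// After (sketch): read the data back and compare against the written schema.
val readBack = spark.read.parquet(dir.getCanonicalPath)
assert(readBack.schema.sameType(input.schema))
{code}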






[jira] [Assigned] (SPARK-38892) Fix the UT of schema equal assert

2022-04-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38892:


Assignee: Apache Spark

> Fix the UT of schema equal assert
> -
>
> Key: SPARK-38892
> URL: https://issues.apache.org/jira/browse/SPARK-38892
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1, 3.3.0
>Reporter: YuanGuanhu
>Assignee: Apache Spark
>Priority: Major
>
> In ParquetPartitionDiscoverySuite, some asserts have no practical 
> significance, for example:
> {code:java}
> assert(input.schema.sameType(input.schema)) {code}
> To make the UT meaningful, such asserts should be fixed to check the actual 
> result against the expected one.






[jira] [Commented] (SPARK-38892) Fix the UT of schema equal assert

2022-04-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522026#comment-17522026
 ] 

Apache Spark commented on SPARK-38892:
--

User 'fhygh' has created a pull request for this issue:
https://github.com/apache/spark/pull/36189

> Fix the UT of schema equal assert
> -
>
> Key: SPARK-38892
> URL: https://issues.apache.org/jira/browse/SPARK-38892
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1, 3.3.0
>Reporter: YuanGuanhu
>Priority: Major
>
> In ParquetPartitionDiscoverySuite, some asserts have no practical 
> significance, for example:
> {code:java}
> assert(input.schema.sameType(input.schema)) {code}
> To make the UT meaningful, such asserts should be fixed to check the actual 
> result against the expected one.






[jira] [Assigned] (SPARK-38892) Fix the UT of schema equal assert

2022-04-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38892:


Assignee: (was: Apache Spark)

> Fix the UT of schema equal assert
> -
>
> Key: SPARK-38892
> URL: https://issues.apache.org/jira/browse/SPARK-38892
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1, 3.3.0
>Reporter: YuanGuanhu
>Priority: Major
>
> In ParquetPartitionDiscoverySuite, some asserts have no practical 
> significance, for example:
> {code:java}
> assert(input.schema.sameType(input.schema)) {code}
> To make the UT meaningful, such asserts should be fixed to check the actual 
> result against the expected one.






[jira] [Created] (SPARK-38892) Fix the UT of schema equal assert

2022-04-13 Thread YuanGuanhu (Jira)
YuanGuanhu created SPARK-38892:
--

 Summary: Fix the UT of schema equal assert
 Key: SPARK-38892
 URL: https://issues.apache.org/jira/browse/SPARK-38892
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.1, 3.3.0
Reporter: YuanGuanhu


In ParquetPartitionDiscoverySuite, some asserts have no practical 
significance, for example:
{code:java}
assert(input.schema.sameType(input.schema)) {code}
To make the UT meaningful, such asserts should be fixed to check the actual 
result against the expected one.






[jira] [Commented] (SPARK-38725) Test the error class: DUPLICATE_KEY

2022-04-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522008#comment-17522008
 ] 

Apache Spark commented on SPARK-38725:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36188

> Test the error class: DUPLICATE_KEY
> ---
>
> Key: SPARK-38725
> URL: https://issues.apache.org/jira/browse/SPARK-38725
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Add at least one test for the error class *DUPLICATE_KEY* to 
> QueryParsingErrorsSuite. The test should cover the exception thrown in 
> QueryParsingErrors:
> {code:scala}
>   def duplicateKeysError(key: String, ctx: ParserRuleContext): Throwable = {
>     // Found duplicate keys '$key'
>     new ParseException(errorClass = "DUPLICATE_KEY",
>       messageParameters = Array(key), ctx)
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must check:+
> # the entire error message
> # the sqlState, if it is defined in the error-classes.json file
> # the error class






[jira] [Commented] (SPARK-38725) Test the error class: DUPLICATE_KEY

2022-04-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522007#comment-17522007
 ] 

Apache Spark commented on SPARK-38725:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36188

> Test the error class: DUPLICATE_KEY
> ---
>
> Key: SPARK-38725
> URL: https://issues.apache.org/jira/browse/SPARK-38725
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Add at least one test for the error class *DUPLICATE_KEY* to 
> QueryParsingErrorsSuite. The test should cover the exception thrown in 
> QueryParsingErrors:
> {code:scala}
>   def duplicateKeysError(key: String, ctx: ParserRuleContext): Throwable = {
>     // Found duplicate keys '$key'
>     new ParseException(errorClass = "DUPLICATE_KEY",
>       messageParameters = Array(key), ctx)
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must check:+
> # the entire error message
> # the sqlState, if it is defined in the error-classes.json file
> # the error class






[jira] [Commented] (SPARK-38884) java.util.NoSuchElementException: key not found: numPartitions

2022-04-13 Thread chopperChen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522000#comment-17522000
 ] 

chopperChen commented on SPARK-38884:
-

[~hyukjin.kwon] what's a self-contained reproducer?

Spark's configuration:
{code:java}
val DEFAULT_CONF: SparkConf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.cleaner.referenceTracking", "true")
      .set("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
      .set("spark.hive.exec.dynamic.partition", "true")
      .set("spark.hive.exec.dynamic.partition.mode", "nonstrict")
      .set("spark.hive.exec.max.dynamic.partitions", "3")
      .set("spark.hive.exec.max.created.files", "100")
      .set("spark.hive.exec.max.dynamic.partitions.pernode", "1")
      .set("spark.hive.metastore.client.capability.check", "false")
      .set("spark.sql.autoBroadcastJoinThreshold", "10485760")
      .set("spark.network.timeout", "360s")
      .set("spark.hive.exec.post.hooks", "")
      .set("spark.hive.exec.failure.hooks", "")
      .set("spark.hive.exec.pre.hooks", "")
      .set("spark.hive.execution.engine", "spark")
      .set("spark.sql.parquet.writeLegacyFormat", "true")
      .set("spark.sql.debug.maxToStringFields", "100")
      .set("spark.sql.sources.partitionOverwriteMode", "dynamic")
SparkSession.builder().config(SparkHelper.DEFAULT_CONF).enableHiveSupport().getOrCreate(){code}

> java.util.NoSuchElementException: key not found: numPartitions
> --
>
> Key: SPARK-38884
> URL: https://issues.apache.org/jira/browse/SPARK-38884
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
> Environment: hadoop 3.1.1
> spark 3.0.1
>Reporter: chopperChen
>Priority: Major
>
> When running spark.sql("sql").isEmpty, the logs print 
> _java.util.NoSuchElementException: key not found: numPartitions_.
> My SQL looks like:
>  
> {code:java}
> // hr is a partition column
> select * from (select col1, '24' as hr from table1
>union all select col1, '2' as hr from table2
>union all select col1, hr from table3) df1
> inner join (select col1, '24' as hr from table4
> union all select col1, '2' as hr from table5
> union all select col1, hr from table6) df2
> on df1.col1=df2.col1
> {code}
>  
> *exception:*
> Caused by: java.util.NoSuchElementException: key not found: numPartitions
>     at scala.collection.MapLike.default(MapLike.scala:235)
>     at scala.collection.MapLike.default$(MapLike.scala:234)
>     at scala.collection.AbstractMap.default(Map.scala:63)
>     at scala.collection.MapLike.apply(MapLike.scala:144)
>     at scala.collection.MapLike.apply$(MapLike.scala:143)
>     at scala.collection.AbstractMap.apply(Map.scala:63)
>     at 
> org.apache.spark.sql.execution.FileSourceScanExec.$anonfun$sendDriverMetrics$1(DataSourceScanExec.scala:197)
>     at 
> org.apache.spark.sql.execution.FileSourceScanExec.$anonfun$sendDriverMetrics$1$adapted(DataSourceScanExec.scala:197)
>     at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149)
>     at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237)
>     at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230)
>     at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44)
>     at scala.collection.mutable.HashMap.foreach(HashMap.scala:149)
>     at 
> org.apache.spark.sql.execution.FileSourceScanExec.sendDriverMetrics(DataSourceScanExec.scala:197)
>     at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:407)
>     at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:390)
>     at 
> org.apache.spark.sql.execution.FileSourceScanExec.doExecuteColumnar(DataSourceScanExec.scala:485)
>     at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
>     at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>     at 
> org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
>     at 
> org.apache.spark.sql.execution.InputAdapter.doExecuteColumnar(WholeStageCodegenExec.scala:519)
>     at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
>     at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at 
> org.apache.spark.sql.ex

[jira] [Commented] (SPARK-36604) timestamp type column analyze result is wrong

2022-04-13 Thread YuanGuanhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521999#comment-17521999
 ] 

YuanGuanhu commented on SPARK-36604:


[~senthh] what's the session time zone?

I tested with Spark 3.2.1 and also hit the issue. The inserted value is 
'2021-08-15 15:30:01', while the min/max values are 8 hours off.

scala>  spark.sql("insert into c select '2021-08-15 15:30:01'")
22/04/14 09:23:36 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
res3: org.apache.spark.sql.DataFrame = []

scala> spark.sql("analyze table c compute statistics for columns a")
res4: org.apache.spark.sql.DataFrame = []                                       

scala> spark.sql("desc formatted c a").show(true)
+--------------+--------------------+
|     info_name|          info_value|
+--------------+--------------------+
|      col_name|                   a|
|     data_type|           timestamp|
|       comment|                NULL|
|           min|2021-08-15 07:30:...|
|           max|2021-08-15 07:30:...|
|     num_nulls|                   0|
|distinct_count|                   1|
|   avg_col_len|                   8|
|   max_col_len|                   8|
|     histogram|                NULL|
+--------------+--------------------+


scala> sql("set spark.sql.session.timeZone").show
+--------------------+-------------+
|                 key|        value|
+--------------------+-------------+
|spark.sql.session...|Asia/Shanghai|
+--------------------+-------------+

> timestamp type column analyze result is wrong
> -
>
> Key: SPARK-36604
> URL: https://issues.apache.org/jira/browse/SPARK-36604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.1.2
> Environment: Spark 3.1.1
>Reporter: YuanGuanhu
>Priority: Major
>
> When we create a table with a timestamp-typed column, the min and max values 
> in the analyze result for that column are wrong.
> For example:
> {code}
> > select * from a;
> {code}
> {code}
> 2021-08-15 15:30:01
> Time taken: 2.789 seconds, Fetched 1 row(s)
> spark-sql> desc formatted a a;
> col_name a
> data_type timestamp
> comment NULL
> min 2021-08-15 07:30:01.00
> max 2021-08-15 07:30:01.00
> num_nulls 0
> distinct_count 1
> avg_col_len 8
> max_col_len 8
> histogram NULL
> Time taken: 0.278 seconds, Fetched 10 row(s)
> spark-sql> desc a;
> a timestamp NULL
> Time taken: 1.432 seconds, Fetched 1 row(s)
> {code}
>  
> reproduce step:
> {code}
> create table a(a timestamp);
> insert into a select '2021-08-15 15:30:01';
> analyze table a compute statistics for columns a;
> desc formatted a a;
> select * from a;
> {code}






[jira] [Assigned] (SPARK-38857) series name should be preserved in series.mode()

2022-04-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-38857:


Assignee: Yikun Jiang

> series name should be preserved in series.mode()
> 
>
> Key: SPARK-38857
> URL: https://issues.apache.org/jira/browse/SPARK-38857
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
>
> [https://github.com/pandas-dev/pandas/issues/46737]
> We might want to skip this test in 1.4.1-1.4.3 after pandas confirmed it's an 
> issue.
>  
> update:
> The pandas community confirmed it's an unexpected but correct change, so the 
> series name should be preserved in pser.mode().






[jira] [Resolved] (SPARK-38857) series name should be preserved in series.mode()

2022-04-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38857.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36159
[https://github.com/apache/spark/pull/36159]

> series name should be preserved in series.mode()
> 
>
> Key: SPARK-38857
> URL: https://issues.apache.org/jira/browse/SPARK-38857
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.4.0
>
>
> [https://github.com/pandas-dev/pandas/issues/46737]
> We might want to skip this test in 1.4.1-1.4.3 after pandas confirmed it's an 
> issue.
>  
> update:
> The pandas community confirmed it's an unexpected but correct change, so the 
> series name should be preserved in pser.mode().






[jira] [Commented] (SPARK-37643) when charVarcharAsString is true, char datatype partition table query incorrect

2022-04-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521994#comment-17521994
 ] 

Apache Spark commented on SPARK-37643:
--

User 'fhygh' has created a pull request for this issue:
https://github.com/apache/spark/pull/36187

> when charVarcharAsString is true, char datatype partition table query 
> incorrect
> ---
>
> Key: SPARK-37643
> URL: https://issues.apache.org/jira/browse/SPARK-37643
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
> Environment: spark 3.2.0
>Reporter: YuanGuanhu
>Priority: Major
>
> This ticket aims to fix a bug where right-padding is not applied to char 
> type partition columns when charVarcharAsString is true and the partition 
> value is shorter than the declared length.
> For example, the query below returns nothing on master, but the correct 
> result is `abc`.
> {code:java}
> scala> sql("set spark.sql.legacy.charVarcharAsString=true")
> scala> sql("CREATE TABLE tb01(i string, c char(5)) USING parquet partitioned 
> by (c)")
> scala> sql("INSERT INTO tb01 values(1, 'abc')")
> scala> sql("select c from tb01 where c = 'abc'").show
> +---+
> |  c|
> +---+
> +---+{code}
> This is because `ApplyCharTypePadding` right-pads the expression to the char 
> length. We should handle this by taking the spark.sql.legacy.charVarcharAsString 
> conf value into account.
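
To make the padding semantics concrete, a self-contained sketch (illustrative, 
not the patch itself) of how a partition-filter literal would be treated with 
and without the legacy flag:

{code:scala}
object CharPaddingSketch {
  // With spark.sql.legacy.charVarcharAsString=true the literal must be
  // compared as-is; otherwise ApplyCharTypePadding right-pads it to the
  // declared char length.
  def partitionFilterLiteral(
      value: String,
      charLength: Int,
      legacyCharVarcharAsString: Boolean): String = {
    if (legacyCharVarcharAsString) value // 'abc' stays 'abc'
    else value.padTo(charLength, ' ')    // 'abc' becomes 'abc  '
  }
}
{code}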






[jira] [Commented] (SPARK-37643) when charVarcharAsString is true, char datatype partition table query incorrect

2022-04-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521993#comment-17521993
 ] 

Apache Spark commented on SPARK-37643:
--

User 'fhygh' has created a pull request for this issue:
https://github.com/apache/spark/pull/36187

> when charVarcharAsString is true, char datatype partition table query 
> incorrect
> ---
>
> Key: SPARK-37643
> URL: https://issues.apache.org/jira/browse/SPARK-37643
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
> Environment: spark 3.2.0
>Reporter: YuanGuanhu
>Priority: Major
>
> This ticket aims to fix a bug where right-padding is not applied to char 
> type partition columns when charVarcharAsString is true and the partition 
> value is shorter than the declared length.
> For example, the query below returns nothing on master, but the correct 
> result is `abc`.
> {code:java}
> scala> sql("set spark.sql.legacy.charVarcharAsString=true")
> scala> sql("CREATE TABLE tb01(i string, c char(5)) USING parquet partitioned 
> by (c)")
> scala> sql("INSERT INTO tb01 values(1, 'abc')")
> scala> sql("select c from tb01 where c = 'abc'").show
> +---+
> |  c|
> +---+
> +---+{code}
> This is because `ApplyCharTypePadding` right-pads the expression to the char 
> length. We should handle this by taking the spark.sql.legacy.charVarcharAsString 
> conf value into account.






[jira] [Commented] (SPARK-38884) java.util.NoSuchElementException: key not found: numPartitions

2022-04-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521989#comment-17521989
 ] 

Hyukjin Kwon commented on SPARK-38884:
--

[~chopperChen] do you have a self-contained reproducer?

> java.util.NoSuchElementException: key not found: numPartitions
> --
>
> Key: SPARK-38884
> URL: https://issues.apache.org/jira/browse/SPARK-38884
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
> Environment: hadoop 3.1.1
> spark 3.0.1
>Reporter: chopperChen
>Priority: Major
>
> When running spark.sql("sql").isEmpty, the logs print 
> _java.util.NoSuchElementException: key not found: numPartitions_.
> My SQL looks like:
>  
> {code:java}
> // hr is a partition column
> select * from (select col1, '24' as hr from table1
>union all select col1, '2' as hr from table2
>union all select col1, hr from table3) df1
> inner join (select col1, '24' as hr from table4
> union all select col1, '2' as hr from table5
> union all select col1, hr from table6) df2
> on df1.col1=df2.col1
> {code}
>  
> *exception:*
> Caused by: java.util.NoSuchElementException: key not found: numPartitions
>     at scala.collection.MapLike.default(MapLike.scala:235)
>     at scala.collection.MapLike.default$(MapLike.scala:234)
>     at scala.collection.AbstractMap.default(Map.scala:63)
>     at scala.collection.MapLike.apply(MapLike.scala:144)
>     at scala.collection.MapLike.apply$(MapLike.scala:143)
>     at scala.collection.AbstractMap.apply(Map.scala:63)
>     at 
> org.apache.spark.sql.execution.FileSourceScanExec.$anonfun$sendDriverMetrics$1(DataSourceScanExec.scala:197)
>     at 
> org.apache.spark.sql.execution.FileSourceScanExec.$anonfun$sendDriverMetrics$1$adapted(DataSourceScanExec.scala:197)
>     at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149)
>     at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237)
>     at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230)
>     at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44)
>     at scala.collection.mutable.HashMap.foreach(HashMap.scala:149)
>     at 
> org.apache.spark.sql.execution.FileSourceScanExec.sendDriverMetrics(DataSourceScanExec.scala:197)
>     at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:407)
>     at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:390)
>     at 
> org.apache.spark.sql.execution.FileSourceScanExec.doExecuteColumnar(DataSourceScanExec.scala:485)
>     at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
>     at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>     at 
> org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
>     at 
> org.apache.spark.sql.execution.InputAdapter.doExecuteColumnar(WholeStageCodegenExec.scala:519)
>     at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:202)
>     at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>     at 
> org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:198)
>     at 
> org.apache.spark.sql.execution.ColumnarToRowExec.inputRDDs(Columnar.scala:196)
>     at 
> org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:133)
>     at 
> org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:47)
>     at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:720)
>     at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>     at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>     at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
>     at 
> org.apache.spark.sql.execution.UnionExec.$anonfun$doExecute$5(basicPhysicalOperators.scala:644)
>     at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>     at 
> scala.collection.mutable.Resizable

[jira] [Assigned] (SPARK-38890) Implement `ignore_index` of `DataFrame.sort_index`.

2022-04-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-38890:


Assignee: Xinrong Meng

> Implement `ignore_index` of `DataFrame.sort_index`.
> ---
>
> Key: SPARK-38890
> URL: https://issues.apache.org/jira/browse/SPARK-38890
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>
> Implement `ignore_index` of `DataFrame.sort_index`.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38797) Runtime Filter support pushdown through window

2022-04-13 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-38797.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36080
[https://github.com/apache/spark/pull/36080]

> Runtime Filter support pushdown through window
> --
>
> Key: SPARK-38797
> URL: https://issues.apache.org/jira/browse/SPARK-38797
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38890) Implement `ignore_index` of `DataFrame.sort_index`.

2022-04-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38890.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36184
[https://github.com/apache/spark/pull/36184]

> Implement `ignore_index` of `DataFrame.sort_index`.
> ---
>
> Key: SPARK-38890
> URL: https://issues.apache.org/jira/browse/SPARK-38890
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement `ignore_index` of `DataFrame.sort_index`.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38797) Runtime Filter support pushdown through window

2022-04-13 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-38797:
---

Assignee: Yuming Wang

> Runtime Filter support pushdown through window
> --
>
> Key: SPARK-38797
> URL: https://issues.apache.org/jira/browse/SPARK-38797
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37014) Inline type hints for python/pyspark/streaming/context.py

2022-04-13 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz resolved SPARK-37014.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34293
[https://github.com/apache/spark/pull/34293]

> Inline type hints for python/pyspark/streaming/context.py
> -
>
> Key: SPARK-37014
> URL: https://issues.apache.org/jira/browse/SPARK-37014
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Assignee: dch nguyen
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37014) Inline type hints for python/pyspark/streaming/context.py

2022-04-13 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz reassigned SPARK-37014:
--

Assignee: dch nguyen

> Inline type hints for python/pyspark/streaming/context.py
> -
>
> Key: SPARK-37014
> URL: https://issues.apache.org/jira/browse/SPARK-37014
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Assignee: dch nguyen
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36664) Log time spent waiting for cluster resources

2022-04-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521930#comment-17521930
 ] 

Apache Spark commented on SPARK-36664:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/36185

> Log time spent waiting for cluster resources
> 
>
> Key: SPARK-36664
> URL: https://issues.apache.org/jira/browse/SPARK-36664
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Holden Karau
>Priority: Major
>
> To provide better visibility into why jobs might be running slowly, it would 
> be useful to log when we are waiting for cluster resources and for how long, 
> so that the user can be aware of any underlying cluster issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38812) When I clean data, I hope one RDD splits into two RDDs according to a clean-data rule

2022-04-13 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521923#comment-17521923
 ] 

Erik Krogen commented on SPARK-38812:
-

You may want to check the discussion on SPARK-2373 and SPARK-6664

> When I clean data, I hope one RDD splits into two RDDs according to a clean-data rule
> -
>
> Key: SPARK-38812
> URL: https://issues.apache.org/jira/browse/SPARK-38812
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: gaokui
>Priority: Major
>
> When I do data cleaning, one RDD is filtered by a value (> or <) and should 
> generate two different sets: one file of error records and one file of 
> error-free records.
> Now I use filter twice, but that requires two Spark DAG jobs, which costs too 
> much.
> What I want is roughly iterator.span(predicate), returning a single 
> tuple (iter1, iter2) from one pass.
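
For reference, Spark has no single-pass primitive that returns two RDDs, but caching a 
tagged intermediate comes close. A minimal sketch under that assumption (splitByRule and 
isValid are illustrative names, not Spark APIs):
{code:scala}
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Evaluate the cleaning predicate once per record, cache the tagged result,
// and derive both outputs from the cache so the upstream lineage is
// computed only once instead of once per filter.
def splitByRule[T: ClassTag](rdd: RDD[T])(isValid: T => Boolean): (RDD[T], RDD[T]) = {
  val tagged = rdd.map(r => (isValid(r), r)).cache()
  val valid = tagged.filter(_._1).map(_._2)
  val errors = tagged.filter(t => !t._1).map(_._2)
  (valid, errors)
}
{code}
Saving both outputs still launches two jobs, but the second job reads the cached tagged 
records rather than recomputing the source.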



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38890) Implement `ignore_index` of `DataFrame.sort_index`.

2022-04-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38890:


Assignee: (was: Apache Spark)

> Implement `ignore_index` of `DataFrame.sort_index`.
> ---
>
> Key: SPARK-38890
> URL: https://issues.apache.org/jira/browse/SPARK-38890
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Implement `ignore_index` of `DataFrame.sort_index`.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38890) Implement `ignore_index` of `DataFrame.sort_index`.

2022-04-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38890:


Assignee: Apache Spark

> Implement `ignore_index` of `DataFrame.sort_index`.
> ---
>
> Key: SPARK-38890
> URL: https://issues.apache.org/jira/browse/SPARK-38890
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> Implement `ignore_index` of `DataFrame.sort_index`.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38890) Implement `ignore_index` of `DataFrame.sort_index`.

2022-04-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521913#comment-17521913
 ] 

Apache Spark commented on SPARK-38890:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/36184

> Implement `ignore_index` of `DataFrame.sort_index`.
> ---
>
> Key: SPARK-38890
> URL: https://issues.apache.org/jira/browse/SPARK-38890
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Implement `ignore_index` of `DataFrame.sort_index`.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38891) Skipping allocating vector for repetition & definition levels when possible

2022-04-13 Thread Chao Sun (Jira)
Chao Sun created SPARK-38891:


 Summary: Skipping allocating vector for repetition & definition 
levels when possible
 Key: SPARK-38891
 URL: https://issues.apache.org/jira/browse/SPARK-38891
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Chao Sun


Currently the vectorized Parquet reader allocates vectors for repetition 
and definition levels in all cases. However, in certain cases (e.g., when 
reading primitive types) this is not necessary and should be avoided.
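
A hedged illustration of that condition (plain Scala, not the reader's actual code): for 
a required, non-nested column both maximum levels are zero, so the level data is constant 
and vectors for it carry no information.
{code:scala}
// For a required top-level primitive column, Parquet's max repetition and
// max definition levels are both 0: every value is present and non-nested,
// so allocating vectors for the levels buys nothing.
def needsLevelVectors(maxRepetitionLevel: Int, maxDefinitionLevel: Int): Boolean =
  maxRepetitionLevel > 0 || maxDefinitionLevel > 0
{code}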



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38890) Implement `ignore_index` of `DataFrame.sort_index`.

2022-04-13 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-38890:


 Summary: Implement `ignore_index` of `DataFrame.sort_index`.
 Key: SPARK-38890
 URL: https://issues.apache.org/jira/browse/SPARK-38890
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Implement `ignore_index` of `DataFrame.sort_index`.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38823) Incorrect result of dataset reduceGroups in java

2022-04-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521909#comment-17521909
 ] 

Apache Spark commented on SPARK-38823:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/36183

> Incorrect result of dataset reduceGroups in java
> 
>
> Key: SPARK-38823
> URL: https://issues.apache.org/jira/browse/SPARK-38823
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.3.0, 3.4.0
>Reporter: IKozar
>Priority: Major
>  Labels: correctness
>
> {code:java}
> @Data
> @NoArgsConstructor
> @AllArgsConstructor
> public static class Item implements Serializable {
>   private String x;
>   private String y;
>   private int z;
>
>   public Item addZ(int z) {
>     return new Item(x, y, this.z + z);
>   }
> } {code}
> {code:java}
> List<Item> items = List.of(
>     new Item("X1", "Y1", 1),
>     new Item("X2", "Y1", 1),
>     new Item("X1", "Y1", 1),
>     new Item("X2", "Y1", 1),
>     new Item("X3", "Y1", 1),
>     new Item("X1", "Y1", 1),
>     new Item("X1", "Y2", 1),
>     new Item("X2", "Y1", 1));
> Dataset<Item> ds = spark.createDataFrame(items, Item.class)
>     .as(Encoders.bean(Item.class));
> ds.groupByKey(
>     (MapFunction<Item, Tuple2<String, String>>) item ->
>         Tuple2.apply(item.getX(), item.getY()),
>     Encoders.tuple(Encoders.STRING(), Encoders.STRING()))
>   .reduceGroups((ReduceFunction<Item>) (item1, item2) -> item1.addZ(item2.getZ()))
>   .show(10);
> {code}
> result is
> {noformat}
> +--------+----------------------------------------------+
> |     key|ReduceAggregator(poc.job.JavaSparkReduce$Item)|
> +--------+----------------------------------------------+
> |{X1, Y1}|                                   {X2, Y1, 2}| -- expected 3
> |{X2, Y1}|                                   {X2, Y1, 2}| -- expected 3
> |{X1, Y2}|                                   {X2, Y1, 1}|
> |{X3, Y1}|                                   {X2, Y1, 1}|
> +--------+----------------------------------------------+{noformat}
> pay attention that the key doesn't match the value



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38823) Incorrect result of dataset reduceGroups in java

2022-04-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38823:


Assignee: (was: Apache Spark)

> Incorrect result of dataset reduceGroups in java
> 
>
> Key: SPARK-38823
> URL: https://issues.apache.org/jira/browse/SPARK-38823
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.3.0, 3.4.0
>Reporter: IKozar
>Priority: Major
>  Labels: correctness
>
> {code:java}
> @Data
> @NoArgsConstructor
> @AllArgsConstructor
> public static class Item implements Serializable {
>   private String x;
>   private String y;
>   private int z;
>
>   public Item addZ(int z) {
>     return new Item(x, y, this.z + z);
>   }
> } {code}
> {code:java}
> List<Item> items = List.of(
>     new Item("X1", "Y1", 1),
>     new Item("X2", "Y1", 1),
>     new Item("X1", "Y1", 1),
>     new Item("X2", "Y1", 1),
>     new Item("X3", "Y1", 1),
>     new Item("X1", "Y1", 1),
>     new Item("X1", "Y2", 1),
>     new Item("X2", "Y1", 1));
> Dataset<Item> ds = spark.createDataFrame(items, Item.class)
>     .as(Encoders.bean(Item.class));
> ds.groupByKey(
>     (MapFunction<Item, Tuple2<String, String>>) item ->
>         Tuple2.apply(item.getX(), item.getY()),
>     Encoders.tuple(Encoders.STRING(), Encoders.STRING()))
>   .reduceGroups((ReduceFunction<Item>) (item1, item2) -> item1.addZ(item2.getZ()))
>   .show(10);
> {code}
> result is
> {noformat}
> +--------+----------------------------------------------+
> |     key|ReduceAggregator(poc.job.JavaSparkReduce$Item)|
> +--------+----------------------------------------------+
> |{X1, Y1}|                                   {X2, Y1, 2}| -- expected 3
> |{X2, Y1}|                                   {X2, Y1, 2}| -- expected 3
> |{X1, Y2}|                                   {X2, Y1, 1}|
> |{X3, Y1}|                                   {X2, Y1, 1}|
> +--------+----------------------------------------------+{noformat}
> pay attention that the key doesn't match the value



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38823) Incorrect result of dataset reduceGroups in java

2022-04-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38823:


Assignee: Apache Spark

> Incorrect result of dataset reduceGroups in java
> 
>
> Key: SPARK-38823
> URL: https://issues.apache.org/jira/browse/SPARK-38823
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.3.0, 3.4.0
>Reporter: IKozar
>Assignee: Apache Spark
>Priority: Major
>  Labels: correctness
>
> {code:java}
> @Data
> @NoArgsConstructor
> @AllArgsConstructor
> public static class Item implements Serializable {
>   private String x;
>   private String y;
>   private int z;
>
>   public Item addZ(int z) {
>     return new Item(x, y, this.z + z);
>   }
> } {code}
> {code:java}
> List<Item> items = List.of(
>     new Item("X1", "Y1", 1),
>     new Item("X2", "Y1", 1),
>     new Item("X1", "Y1", 1),
>     new Item("X2", "Y1", 1),
>     new Item("X3", "Y1", 1),
>     new Item("X1", "Y1", 1),
>     new Item("X1", "Y2", 1),
>     new Item("X2", "Y1", 1));
> Dataset<Item> ds = spark.createDataFrame(items, Item.class)
>     .as(Encoders.bean(Item.class));
> ds.groupByKey(
>     (MapFunction<Item, Tuple2<String, String>>) item ->
>         Tuple2.apply(item.getX(), item.getY()),
>     Encoders.tuple(Encoders.STRING(), Encoders.STRING()))
>   .reduceGroups((ReduceFunction<Item>) (item1, item2) -> item1.addZ(item2.getZ()))
>   .show(10);
> {code}
> result is
> {noformat}
> +--------+----------------------------------------------+
> |     key|ReduceAggregator(poc.job.JavaSparkReduce$Item)|
> +--------+----------------------------------------------+
> |{X1, Y1}|                                   {X2, Y1, 2}| -- expected 3
> |{X2, Y1}|                                   {X2, Y1, 2}| -- expected 3
> |{X1, Y2}|                                   {X2, Y1, 1}|
> |{X3, Y1}|                                   {X2, Y1, 1}|
> +--------+----------------------------------------------+{noformat}
> pay attention that the key doesn't match the value



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38823) Incorrect result of dataset reduceGroups in java

2022-04-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521907#comment-17521907
 ] 

Apache Spark commented on SPARK-38823:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/36183

> Incorrect result of dataset reduceGroups in java
> 
>
> Key: SPARK-38823
> URL: https://issues.apache.org/jira/browse/SPARK-38823
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.3.0, 3.4.0
>Reporter: IKozar
>Priority: Major
>  Labels: correctness
>
> {code:java}
> @Data
> @NoArgsConstructor
> @AllArgsConstructor
> public static class Item implements Serializable {
>   private String x;
>   private String y;
>   private int z;
>
>   public Item addZ(int z) {
>     return new Item(x, y, this.z + z);
>   }
> } {code}
> {code:java}
> List<Item> items = List.of(
>     new Item("X1", "Y1", 1),
>     new Item("X2", "Y1", 1),
>     new Item("X1", "Y1", 1),
>     new Item("X2", "Y1", 1),
>     new Item("X3", "Y1", 1),
>     new Item("X1", "Y1", 1),
>     new Item("X1", "Y2", 1),
>     new Item("X2", "Y1", 1));
> Dataset<Item> ds = spark.createDataFrame(items, Item.class)
>     .as(Encoders.bean(Item.class));
> ds.groupByKey(
>     (MapFunction<Item, Tuple2<String, String>>) item ->
>         Tuple2.apply(item.getX(), item.getY()),
>     Encoders.tuple(Encoders.STRING(), Encoders.STRING()))
>   .reduceGroups((ReduceFunction<Item>) (item1, item2) -> item1.addZ(item2.getZ()))
>   .show(10);
> {code}
> result is
> {noformat}
> +--------+----------------------------------------------+
> |     key|ReduceAggregator(poc.job.JavaSparkReduce$Item)|
> +--------+----------------------------------------------+
> |{X1, Y1}|                                   {X2, Y1, 2}| -- expected 3
> |{X2, Y1}|                                   {X2, Y1, 2}| -- expected 3
> |{X1, Y2}|                                   {X2, Y1, 1}|
> |{X3, Y1}|                                   {X2, Y1, 1}|
> +--------+----------------------------------------------+{noformat}
> pay attention that the key doesn't match the value



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38823) Incorrect result of dataset reduceGroups in java

2022-04-13 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521894#comment-17521894
 ] 

Bruce Robbins commented on SPARK-38823:
---

By the way, here is some code that demos the issue in spark-shell:
{noformat}
// repro in scala using Java APIs
import org.apache.spark.api.java.function.{MapFunction, ReduceFunction}
import org.apache.spark.sql.Encoders;
import collection.JavaConverters._

class Item (var k: String, var v: Int) extends java.io.Serializable {  
  def setK(value: String): Unit = {
k = value
  }
  def setV(value: Int): Unit = {
v = value
  }
  def getK: String = {
k
  }
  def getV: Int = {
v
  }
  def this() {
this("", 0)
  }

  def addValue(inc: Int): Item = {
new Item(k, v + inc)
  }

  override def toString: String = {
s"Item($k,$v)"
  }
}

val items = Seq(
  new Item("a", 1),
  new Item("b", 3),
  new Item("c", 2),
  new Item("a", 7)
)

val ds = spark.createDataFrame(items.asJava, 
classOf[Item]).as(Encoders.bean(classOf[Item])).coalesce(1)

val mf = new MapFunction[Item, String] {
  override def call(item: Item): String = {
println(s"Key is ${item.k} for item $item")
item.k
  }
}

val kvgd1 = ds.groupByKey(mf, Encoders.STRING)

val rf = new ReduceFunction[Item] {
  override def call(item1: Item, item2: Item): Item = {
val sameRef = item1 eq item2
val msg = s"item1 $item1; item2 $item2"
val newItem = item1.addValue(item2.v)
println(s"$msg; new item is $newItem; sameRef is $sameRef")
newItem
  }
}
 
kvgd1.reduceGroups(rf).show(10)
{noformat}
This will return
{noformat}
+---+-----------------------------------------------------------------------------+
|key|ReduceAggregator($line20.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$Item)|
+---+-----------------------------------------------------------------------------+
|  a|                                                                       {a, 7}|
|  b|                                                                       {a, 7}|
|  c|                                                                       {a, 7}|
+---+-----------------------------------------------------------------------------+
{noformat}
However, it should return
{noformat}
+---+-----------------------------------------------------------------------------+
|key|ReduceAggregator($line20.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$Item)|
+---+-----------------------------------------------------------------------------+
|  a|                                                                       {a, 8}|
|  b|                                                                       {b, 3}|
|  c|                                                                       {c, 2}|
+---+-----------------------------------------------------------------------------+
{noformat}

> Incorrect result of dataset reduceGroups in java
> 
>
> Key: SPARK-38823
> URL: https://issues.apache.org/jira/browse/SPARK-38823
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.3.0, 3.4.0
>Reporter: IKozar
>Priority: Major
>  Labels: correctness
>
> {code:java}
> @Data
> @NoArgsConstructor
> @AllArgsConstructor
> public static class Item implements Serializable {
>   private String x;
>   private String y;
>   private int z;
>
>   public Item addZ(int z) {
>     return new Item(x, y, this.z + z);
>   }
> } {code}
> {code:java}
> List<Item> items = List.of(
>     new Item("X1", "Y1", 1),
>     new Item("X2", "Y1", 1),
>     new Item("X1", "Y1", 1),
>     new Item("X2", "Y1", 1),
>     new Item("X3", "Y1", 1),
>     new Item("X1", "Y1", 1),
>     new Item("X1", "Y2", 1),
>     new Item("X2", "Y1", 1));
> Dataset<Item> ds = spark.createDataFrame(items, Item.class)
>     .as(Encoders.bean(Item.class));
> ds.groupByKey(
>     (MapFunction<Item, Tuple2<String, String>>) item ->
>         Tuple2.apply(item.getX(), item.getY()),
>     Encoders.tuple(Encoders.STRING(), Encoders.STRING()))
>   .reduceGroups((ReduceFunction<Item>) (item1, item2) -> item1.addZ(item2.getZ()))
>   .show(10);
> {code}
> result is
> {noformat}
> +--------+----------------------------------------------+
> |     key|ReduceAggregator(poc.job.JavaSparkReduce$Item)|
> +--------+----------------------------------------------+
> |{X1, Y1}|                                   {X2, Y1, 2}| -- expected 3
> |{X2, Y1}|                                   {X2, Y1, 2}| -- expected 3
> |{X1, Y2}|                                   {X2, Y1, 1}|
> |{X3, Y1}|                                   {X2, Y1, 1}|
> +--------+----------------------------------------------+{noformat}
> pay attention that the key doesn't match the value



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-38835) Refactor FsHistoryProviderSuite to test rocks db

2022-04-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-38835.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36119
[https://github.com/apache/spark/pull/36119]

> Refactor FsHistoryProviderSuite to test rocks db
> 
>
> Key: SPARK-38835
> URL: https://issues.apache.org/jira/browse/SPARK-38835
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>
> FsHistoryProviderSuite only tests the LevelDB backend now



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38835) Refactor FsHistoryProviderSuite to test rocks db

2022-04-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-38835:
-

Assignee: Yang Jie

> Refactor FsHistoryProviderSuite to test rocks db
> 
>
> Key: SPARK-38835
> URL: https://issues.apache.org/jira/browse/SPARK-38835
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> FsHistoryProviderSuite only tests the LevelDB backend now



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38889) Invalid column name while querying bit type column in MSSQL

2022-04-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38889:


Assignee: Apache Spark

> Invalid column name while querying bit type column in MSSQL
> ---
>
> Key: SPARK-38889
> URL: https://issues.apache.org/jira/browse/SPARK-38889
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Allison Wang
>Assignee: Apache Spark
>Priority: Major
>
> After https://issues.apache.org/jira/browse/SPARK-36644 boolean column 
> filters can be pushed to data sources. However, MSSQL only accepts bit type 
> columns, and the current JDBC dialect for MSSQL does not compile the boolean 
> type values in the pushed predicates into the bit type.
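
A minimal sketch of the kind of dialect change implied here, assuming 
JdbcDialect.compileValue is the hook that renders pushed-down literals (an illustration, 
not the actual fix in the linked pull request):
{code:scala}
import java.util.Locale
import org.apache.spark.sql.jdbc.JdbcDialect

// Render boolean literals as BIT values (1/0), so that a pushed predicate
// such as `WHERE flag = true` is compiled to `WHERE flag = 1` for SQL Server.
case object MsSqlServerBitDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.toLowerCase(Locale.ROOT).startsWith("jdbc:sqlserver")

  override def compileValue(value: Any): Any = value match {
    case b: Boolean => if (b) 1 else 0
    case other => super.compileValue(other)
  }
}
{code}
Registering such a dialect via JdbcDialects.registerDialect would apply it to matching 
JDBC URLs; the built-in MsSqlServerDialect would carry the real change.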



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38889) Invalid column name while querying bit type column in MSSQL

2022-04-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38889:


Assignee: (was: Apache Spark)

> Invalid column name while querying bit type column in MSSQL
> ---
>
> Key: SPARK-38889
> URL: https://issues.apache.org/jira/browse/SPARK-38889
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Allison Wang
>Priority: Major
>
> After https://issues.apache.org/jira/browse/SPARK-36644 boolean column 
> filters can be pushed to data sources. However, MSSQL only accepts bit type 
> columns, and the current JDBC dialect for MSSQL does not compile the boolean 
> type values in the pushed predicates into the bit type.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38889) Invalid column name while querying bit type column in MSSQL

2022-04-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521874#comment-17521874
 ] 

Apache Spark commented on SPARK-38889:
--

User 'allisonwang-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/36182

> Invalid column name while querying bit type column in MSSQL
> ---
>
> Key: SPARK-38889
> URL: https://issues.apache.org/jira/browse/SPARK-38889
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Allison Wang
>Priority: Major
>
> After https://issues.apache.org/jira/browse/SPARK-36644 boolean column 
> filters can be pushed to data sources. However, MSSQL only accepts bit type 
> columns, and the current JDBC dialect for MSSQL does not compile the boolean 
> type values in the pushed predicates into the bit type.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38889) Invalid column name while querying bit type column in MSSQL

2022-04-13 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-38889:
-
Description: After https://issues.apache.org/jira/browse/SPARK-36644 
boolean column filters can be pushed to data sources. However, MSSQL only 
accepts bit type columns, and the current JDBC dialect for MSSQL does not 
compile the boolean type values in the pushed predicates into the bit type.

> Invalid column name while querying bit type column in MSSQL
> ---
>
> Key: SPARK-38889
> URL: https://issues.apache.org/jira/browse/SPARK-38889
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Allison Wang
>Priority: Major
>
> After https://issues.apache.org/jira/browse/SPARK-36644 boolean column 
> filters can be pushed to data sources. However, MSSQL only accepts bit type 
> columns, and the current JDBC dialect for MSSQL does not compile the boolean 
> type values in the pushed predicates into the bit type.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38889) Invalid column name while querying bit type column in MSSQL

2022-04-13 Thread Allison Wang (Jira)
Allison Wang created SPARK-38889:


 Summary: Invalid column name while querying bit type column in 
MSSQL
 Key: SPARK-38889
 URL: https://issues.apache.org/jira/browse/SPARK-38889
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: Allison Wang






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34659) Web UI does not correctly get appId

2022-04-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-34659:
-

Assignee: Gengliang Wang

> Web UI does not correctly get appId
> ---
>
> Key: SPARK-34659
> URL: https://issues.apache.org/jira/browse/SPARK-34659
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Web UI
>Affects Versions: 2.4.7, 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: Arata Furukawa
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.4.0
>
>
> Web UI does not correctly get appId when the URL contains `proxy` or `history`.
> In my case, it happens on 
> `https://jupyterhub.hosted.us/my-name/proxy/4040/executors/`.
> The web developer console says: `jquery-3.4.1.min.js:2 GET 
> https://jupyterhub.hosted.us/user/my-name/proxy/4040/api/v1/applications/4040/allexecutors
>  404`, and it shows blank pages to me.
> There is a related issue in jupyterhub: 
> https://github.com/jupyterhub/jupyter-server-proxy/issues/57
> https://github.com/apache/spark/blob/2526fdea481b1777b2c4a2242254b72b5c49d820/core/src/main/resources/org/apache/spark/ui/static/utils.js#L93-L105
> It should not derive the appId from document.baseURI.
> An extra request will occur, but the performance impact will be small.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34659) Web UI does not correctly get appId

2022-04-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34659.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36176
[https://github.com/apache/spark/pull/36176]

> Web UI does not correctly get appId
> ---
>
> Key: SPARK-34659
> URL: https://issues.apache.org/jira/browse/SPARK-34659
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Web UI
>Affects Versions: 2.4.7, 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: Arata Furukawa
>Priority: Major
> Fix For: 3.4.0
>
>
> Web UI does not correctly get appId when the URL contains `proxy` or `history`.
> In my case, it happens on 
> `https://jupyterhub.hosted.us/my-name/proxy/4040/executors/`.
> The web developer console says: `jquery-3.4.1.min.js:2 GET 
> https://jupyterhub.hosted.us/user/my-name/proxy/4040/api/v1/applications/4040/allexecutors
>  404`, and it shows blank pages to me.
> There is a related issue in jupyterhub: 
> https://github.com/jupyterhub/jupyter-server-proxy/issues/57
> https://github.com/apache/spark/blob/2526fdea481b1777b2c4a2242254b72b5c49d820/core/src/main/resources/org/apache/spark/ui/static/utils.js#L93-L105
> It should not derive the appId from document.baseURI.
> An extra request will occur, but the performance impact will be small.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38792) Regression in time executor takes to do work sometime after v3.0.1 ?

2022-04-13 Thread Danny Guinther (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521852#comment-17521852
 ] 

Danny Guinther commented on SPARK-38792:


I'm getting the impression that the problem may be with some code that 
Databricks bolts on to Spark. I'd say ignore this ticket unless you hear 
otherwise.

> Regression in time executor takes to do work sometime after v3.0.1 ?
> 
>
> Key: SPARK-38792
> URL: https://issues.apache.org/jira/browse/SPARK-38792
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: Danny Guinther
>Priority: Major
> Attachments: dummy-job-job.jpg, dummy-job-query.png, 
> executor-timing-debug-number-2.jpg, executor-timing-debug-number-4.jpg, 
> executor-timing-debug-number-5.jpg, min-time-way-up.jpg, 
> what-is-this-code.jpg, what-s-up-with-exec-actions.jpg
>
>
> Hello!
> I'm sorry to trouble you with this, but I'm seeing a noticeable regression in 
> performance when upgrading from 3.0.1 to 3.2.1 and I can't pin down why. I 
> don't believe it is specific to my application since the upgrade to 3.0.1 to 
> 3.2.1 is purely a configuration change. I'd guess it presents itself in my 
> application due to the high volume of work my application does, but I could 
> be mistaken.
> The gist is that it seems like the executor actions I'm running suddenly 
> appear to take a lot longer on Spark 3.2.1. I don't have any ability to test 
> versions between 3.0.1 and 3.2.1 because my application was previously 
> blocked from upgrading beyond Spark 3.0.1 by 
> https://issues.apache.org/jira/browse/SPARK-37391 (which I helped to fix).
> Any ideas what might cause this or metrics I might try to gather to pinpoint 
> the problem? I've tried a bunch of the suggestions from 
> [https://spark.apache.org/docs/latest/tuning.html] to see if any of those 
> help, but none of the adjustments I've tried have been fruitful. I also tried 
> to look in [https://spark.apache.org/docs/latest/sql-migration-guide.html] 
> for ideas as to what might have changed to cause this behavior, but haven't 
> seen anything that sticks out as being a possible source of the problem.
> I have attached a graph that shows the drastic change in time taken by 
> executor actions. In the image the blue and purple lines are different kinds 
> of reads using the built-in JDBC data reader and the green line is writes 
> using a custom-built data writer. The deploy to switch from 3.0.1 to 3.2.1 
> occurred at 9AM on the graph. The graph data comes from timing blocks that 
> surround only the calls to dataframe actions, so there shouldn't be anything 
> specific to my application that is suddenly inflating these numbers. The 
> specific actions I'm invoking are: count() (but there's some transforming and 
> caching going on, so it's really more than that); first(); and write().
> The driver process does seem to be seeing more GC churn than with Spark 
> 3.0.1, but I don't think that explains this behavior. The executors don't 
> seem to have any problem with memory or GC and are not overutilized (our 
> pipeline is very read and write heavy, less heavy on transformations, so 
> executors tend to be idle while waiting for various network I/O).
>  
> Thanks in advance for any help!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-38852) Better Data Source V2 operator pushdown framework

2022-04-13 Thread Max Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521815#comment-17521815
 ] 

Max Gekk edited comment on SPARK-38852 at 4/13/22 4:22 PM:
---

SPARK-38788 was created specifically for the release note. Also it includes 
features/bug fixes that were made out of this umbrella, for example 
https://issues.apache.org/jira/browse/SPARK-36644


was (Author: maxgekk):
SPARK-38788 was created specifically for the release note. It includes 
features/bug fixes that were made out of this umbrella, for example 
https://issues.apache.org/jira/browse/SPARK-36644

> Better Data Source V2 operator pushdown framework
> -
>
> Key: SPARK-38852
> URL: https://issues.apache.org/jira/browse/SPARK-38852
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, Spark supports pushing down Filters and Aggregates to the data 
> source.
> However, the Data Source V2 operator pushdown framework has the following 
> shortcomings (a user-side illustration follows the list):
> # Only simple filters and aggregates are supported, which makes it impossible 
> to apply in most scenarios
> # The incompatibility of SQL syntax makes it impossible to apply in most 
> scenarios
> # Aggregate push down does not support multiple partitions of data sources
> # Spark's additional aggregate will cause some overhead
> # Limit push down is not supported
> # Top-n push down is not supported
> # Offset push down is not supported
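
As a concrete illustration of the last three items, a user-side sketch (the JDBC URL and 
table are assumed placeholders): with limit and top-n push down, the database itself 
could answer these queries instead of Spark fetching all rows first.
{code:scala}
import org.apache.spark.sql.functions.col

val orders = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://db.example.com/shop") // assumed URL
  .option("dbtable", "orders")                            // assumed table
  .load()

orders.limit(10).explain()                            // candidate for limit push down
orders.orderBy(col("amount").desc).limit(5).explain() // candidate for top-n push down
{code}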



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38852) Better Data Source V2 operator pushdown framework

2022-04-13 Thread Max Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521815#comment-17521815
 ] 

Max Gekk commented on SPARK-38852:
--

SPARK-38788 was created specifically for the release note. It includes 
features/bug fixes that were made out of this umbrella, for example 
https://issues.apache.org/jira/browse/SPARK-36644

> Better Data Source V2 operator pushdown framework
> -
>
> Key: SPARK-38852
> URL: https://issues.apache.org/jira/browse/SPARK-38852
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, Spark supports pushing down Filters and Aggregates to the data 
> source.
> However, the Data Source V2 operator pushdown framework has the following 
> shortcomings:
> # Only simple filters and aggregates are supported, which makes it impossible 
> to apply in most scenarios
> # The incompatibility of SQL syntax makes it impossible to apply in most 
> scenarios
> # Aggregate push down does not support multiple partitions of data sources
> # Spark's additional aggregate will cause some overhead
> # Limit push down is not supported
> # Top-n push down is not supported
> # Offset push down is not supported



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38852) Better Data Source V2 operator pushdown framework

2022-04-13 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-38852:
-
Epic Link: SPARK-38788

> Better Data Source V2 operator pushdown framework
> -
>
> Key: SPARK-38852
> URL: https://issues.apache.org/jira/browse/SPARK-38852
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, Spark supports pushing down Filters and Aggregates to the data 
> source.
> However, the Data Source V2 operator pushdown framework has the following 
> shortcomings:
> # Only simple filters and aggregates are supported, which makes it impossible 
> to apply in most scenarios
> # The incompatibility of SQL syntax makes it impossible to apply in most 
> scenarios
> # Aggregate push down does not support multiple partitions of data sources
> # Spark's additional aggregate will cause some overhead
> # Limit push down is not supported
> # Top-n push down is not supported
> # Offset push down is not supported



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38788) More comprehensive DSV2 push down capabilities

2022-04-13 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521807#comment-17521807
 ] 

Erik Krogen commented on SPARK-38788:
-

Yeah, I got that, but isn't SPARK-38852 trying to do the same thing? Or are 
they targeting different functionality? The descriptions seem the same to me.

> More comprehensive DSV2 push down capabilities
> --
>
> Key: SPARK-38788
> URL: https://issues.apache.org/jira/browse/SPARK-38788
> Project: Spark
>  Issue Type: Epic
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Get together all tickets related to push down (filters) via Datasource V2 
> APIs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38888) Add `RocksDBProvider` similar to `LevelDBProvider`

2022-04-13 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-3?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521756#comment-17521756
 ] 

Dongjoon Hyun commented on SPARK-3:
---

If you need it, please proceed. Technically, I don't think the old legacy 
Hadoop stack (YARN) will be used on Apple Silicon. So, I didn't put it into the 
subtasks of SPARK-35781. For feature parity, +1 for your suggestion. 

> Add `RocksDBProvider` similar to `LevelDBProvider`
> --
>
> Key: SPARK-3
> URL: https://issues.apache.org/jira/browse/SPARK-3
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, YARN
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> `LevelDBProvider` is used by `ExternalShuffleBlockResolver` and 
> `YarnShuffleService`; a corresponding `RocksDB` implementation should be added.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38745) Move the tests for `NON_PARTITION_COLUMN` to QueryCompilationErrorsSuite

2022-04-13 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-38745:


Assignee: Max Gekk

> Move the tests for `NON_PARTITION_COLUMN` to QueryCompilationErrorsSuite
> 
>
> Key: SPARK-38745
> URL: https://issues.apache.org/jira/browse/SPARK-38745
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Move tests for the error class *NON_PARTITION_COLUMN* from InsertIntoTests to 
> QueryCompilationErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38745) Move the tests for `NON_PARTITION_COLUMN` to QueryCompilationErrorsSuite

2022-04-13 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-38745.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36175
[https://github.com/apache/spark/pull/36175]

> Move the tests for `NON_PARTITION_COLUMN` to QueryCompilationErrorsSuite
> 
>
> Key: SPARK-38745
> URL: https://issues.apache.org/jira/browse/SPARK-38745
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>
> Move tests for the error class *NON_PARTITION_COLUMN* from InsertIntoTests to 
> QueryCompilationErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38888) Add `RocksDBProvider` similar to `LevelDBProvider`

2022-04-13 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-3?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521701#comment-17521701
 ] 

Yang Jie commented on SPARK-3:
--

cc [~dongjoon] should we do this?

 

> Add `RocksDBProvider` similar to `LevelDBProvider`
> --
>
> Key: SPARK-3
> URL: https://issues.apache.org/jira/browse/SPARK-3
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, YARN
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> `LevelDBProvider` is used by `ExternalShuffleBlockResolver` and 
> `YarnShuffleService`; a corresponding `RocksDB` implementation should be added.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38888) Add `RocksDBProvider` similar to `LevelDBProvider`

2022-04-13 Thread Yang Jie (Jira)
Yang Jie created SPARK-3:


 Summary: Add `RocksDBProvider` similar to `LevelDBProvider`
 Key: SPARK-3
 URL: https://issues.apache.org/jira/browse/SPARK-3
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, YARN
Affects Versions: 3.4.0
Reporter: Yang Jie


`LevelDBProvider` is used by `ExternalShuffleBlockResolver` and 
`YarnShuffleService`; a corresponding `RocksDB` implementation should be added.
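
A minimal sketch of the analogue, assuming the shape of `LevelDBProvider` carries over 
(names are illustrative, using the org.rocksdb API directly):
{code:scala}
import java.io.File
import org.rocksdb.{Options, RocksDB}

object RocksDBProviderSketch {
  // The native library must be loaded once per JVM before opening a store.
  RocksDB.loadLibrary()

  // Open (or create) the RocksDB store backing the shuffle service state,
  // mirroring what LevelDBProvider does for LevelDB today.
  def initRocksDB(file: File): RocksDB = {
    val options = new Options().setCreateIfMissing(true)
    RocksDB.open(options, file.getAbsolutePath)
  }
}
{code}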



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38887) Support switch inner join side for sort merge join

2022-04-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38887:


Assignee: Apache Spark

> Support switch inner join side for sort merge join
> --
>
> Key: SPARK-38887
> URL: https://issues.apache.org/jira/browse/SPARK-38887
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: Apache Spark
>Priority: Major
>
> For an inner join, SortMergeJoin always uses the left side as the 
> streamed side and the right side as the buffered side.
> According to the implementation of SortMergeJoin, we expect the buffered side 
> to:
>  * be smaller than the streamed side
>  * contain less duplicate data
> We do not know whether the join will be a SortMergeJoin at the logical phase, 
> so this selection should be done at the physical phase.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38887) Support switch inner join side for sort merge join

2022-04-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38887:


Assignee: (was: Apache Spark)

> Support switch inner join side for sort merge join
> --
>
> Key: SPARK-38887
> URL: https://issues.apache.org/jira/browse/SPARK-38887
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
>
> For an inner join, SortMergeJoin always uses the left side as the 
> streamed side and the right side as the buffered side.
> According to the implementation of SortMergeJoin, we expect the buffered side 
> to:
>  * be smaller than the streamed side
>  * contain less duplicate data
> We do not know whether the join will be a SortMergeJoin at the logical phase, 
> so this selection should be done at the physical phase.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38887) Support switch inner join side for sort merge join

2022-04-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521660#comment-17521660
 ] 

Apache Spark commented on SPARK-38887:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/36180

> Support switch inner join side for sort merge join
> --
>
> Key: SPARK-38887
> URL: https://issues.apache.org/jira/browse/SPARK-38887
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
>
> For an inner join, SortMergeJoin always uses the left side as the 
> streamed side and the right side as the buffered side.
> According to the implementation of SortMergeJoin, we expect the buffered side 
> to:
>  * be smaller than the streamed side
>  * contain less duplicate data
> We do not know whether the join will be a SortMergeJoin at the logical phase, 
> so this selection should be done at the physical phase.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38887) Support switch inner join side for sort merge join

2022-04-13 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-38887:
--
Summary: Support switch inner join side for sort merge join  (was: Support 
swtich inner join side for sort merge join)

> Support switch inner join side for sort merge join
> --
>
> Key: SPARK-38887
> URL: https://issues.apache.org/jira/browse/SPARK-38887
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
>
> For an inner join, SortMergeJoin always uses the left side as the 
> streamed side and the right side as the buffered side.
> According to the implementation of SortMergeJoin, we expect the buffered side 
> to:
>  * be smaller than the streamed side
>  * contain less duplicate data
> We do not know whether the join will be a SortMergeJoin at the logical phase, 
> so this selection should be done at the physical phase.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38887) Support swtich inner join side for sort merge join

2022-04-13 Thread XiDuo You (Jira)
XiDuo You created SPARK-38887:
-

 Summary: Support swtich inner join side for sort merge join
 Key: SPARK-38887
 URL: https://issues.apache.org/jira/browse/SPARK-38887
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: XiDuo You


For an inner join, SortMergeJoin always uses the left side as the streamed 
side and the right side as the buffered side.

According to the implementation of SortMergeJoin, we expect the buffered side to:
 * be smaller than the streamed side
 * contain less duplicate data

We do not know whether the join will be a SortMergeJoin at the logical phase, 
so this selection should be done at the physical phase. (A small illustration 
follows below.)
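
A hedged illustration of the asymmetry (the configuration key exists; the data shapes 
are assumed): with broadcast joins disabled so the planner picks SortMergeJoin, the 
buffered side is always the right child, even when writing the join the other way 
around would buffer the side with fewer rows and fewer duplicate keys.
{code:scala}
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1") // force sort merge join

// `big` has 1000 rows per join key; `small` has exactly one row per key.
val big = spark.range(1000000L).selectExpr("id % 1000 AS k", "id AS v1")
val small = spark.range(1000L).selectExpr("id AS k", "id AS v2")

// Left side is streamed, right side is buffered: here `small` is buffered.
big.join(small, "k").explain()

// Reversed, `big` (larger, with many duplicates per key) becomes the
// buffered side -- the case a physical-phase side switch would avoid.
small.join(big, "k").explain()
{code}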



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38844) impl Series.interpolate and DataFrame.interpolate

2022-04-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38844.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36127
[https://github.com/apache/spark/pull/36127]

> impl Series.interpolate and DataFrame.interpolate
> -
>
> Key: SPARK-38844
> URL: https://issues.apache.org/jira/browse/SPARK-38844
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
> Fix For: 3.4.0
>
>
> h2. Goal:
> [pandas's 
> interpolate|https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html]
>  supports many methods, _linear_ is applied by default, other methods ( _pad_ 
> _ffill_ _backfill_ _bifll_ ) can also be implemented in pandas API on spark.
> The remainder ones ( including _quadratic_ _cubic_ _spline_ ) can not be 
> implemented easily since scipy is used internally and the window frame used 
> is complex.
> Since methods ( _pad_ _ffill_ _backfill_ _bifll_ ) were already implemented 
> in pandas API on spark via {_}fillna{_}, so this work currently focus on 
> implementing the missing *linear interpolation*
> h2.  
> h2. Impl:
> To implement the linear interpolation, two extra window functions are added, 
> one ( _null_index_ ) is to compute the indices of missing values in each 
> consecutive seq, the other ({_}last_not_null{_}) is to keep the last 
> no-missing value.
> ||index||value||_null_index_forward_||_last_not_null_forward_||_null_index_backward_||_last_not_null_backward_||filled||filled
>  (limit=1)||
> |1|nan|1|nan|1|1|-|-|
> |2|1|0|1|0|1| | |
> |3|nan|1|1|3|5|2.0|2.0|
> |4|nan|2|1|2|5|3.0|-|
> |5|nan|3|1|1|5|4.0|-|
> |6|5|0|5|0|5| | |
> |7|6|0|6|0|6| | |
> |8|nan|1|6|2|nan|6.0|6.0|
> |9|nan|2|6|1|nan|6.0|-|
>  * for the NaNs at indices (3,4,5), we always compute the filled value via
> ({_}last_not_null_backward{_} - {_}last_not_null_forward{_}) / 
> ({_}null_index_forward{_} + {_}null_index_backward{_}) * _null_index_forward_ 
> + _last_not_null_forward_
>  * for the NaN at index (1), skip it due to the default *limit_direction* = 
> _forward_
>  * for the NaN at index (8), fill it like _ffill_ with value 
> _last_not_null_forward_
>  * if _limit_ is set, then NaNs with _null_index_forward_ greater than 
> _limit_ will not be interpolated
> h2. Plan
> 1. impl the basic _linear interpolation_ with param _limit_
> 2. add param _limit_direction_
> 3. add param _limit_area_
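
A minimal plain-pandas sketch of the semantics being mirrored (my check 
against the table above, assuming pandas API on Spark matches pandas here; the 
series holds the table's value column):
{code}
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, np.nan, 5.0, 6.0, np.nan, np.nan])
print(s.interpolate().tolist())
# [nan, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 6.0, 6.0] -> the "filled" column
print(s.interpolate(limit=1).tolist())
# [nan, 1.0, 2.0, nan, nan, 5.0, 6.0, 6.0, nan] -> the "filled (limit=1)" column
{code}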



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38844) impl Series.interpolate and DataFrame.interpolate

2022-04-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-38844:


Assignee: zhengruifeng

> impl Series.interpolate and DataFrame.interpolate
> -
>
> Key: SPARK-38844
> URL: https://issues.apache.org/jira/browse/SPARK-38844
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
>
> h2. Goal:
> [pandas's 
> interpolate|https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html]
>  supports many methods; _linear_ is applied by default, and other methods 
> ( _pad_ _ffill_ _backfill_ _bfill_ ) can also be implemented in pandas API on 
> Spark.
> The remaining ones ( including _quadratic_ _cubic_ _spline_ ) cannot be 
> implemented easily since scipy is used internally and the window frame used 
> is complex.
> Since the methods ( _pad_ _ffill_ _backfill_ _bfill_ ) were already 
> implemented in pandas API on Spark via {_}fillna{_}, this work currently 
> focuses on implementing the missing *linear interpolation*.
> h2. Impl:
> To implement the linear interpolation, two extra window functions are added: 
> one ( _null_index_ ) computes the indices of missing values in each 
> consecutive sequence, and the other ({_}last_not_null{_}) keeps the last 
> non-missing value.
> ||index||value||_null_index_forward_||_last_not_null_forward_||_null_index_backward_||_last_not_null_backward_||filled||filled
>  (limit=1)||
> |1|nan|1|nan|1|1|-|-|
> |2|1|0|1|0|1| | |
> |3|nan|1|1|3|5|2.0|2.0|
> |4|nan|2|1|2|5|3.0|-|
> |5|nan|3|1|1|5|4.0|-|
> |6|5|0|5|0|5| | |
> |7|6|0|6|0|6| | |
> |8|nan|1|6|2|nan|6.0|6.0|
> |9|nan|2|6|1|nan|6.0|-|
>  * for the NaNs at indices (3,4,5), we always compute the filled value via
> ({_}last_not_null_backward{_} - {_}last_not_null_forward{_}) / 
> ({_}null_index_forward{_} + {_}null_index_backward{_}) * _null_index_forward_ 
> + _last_not_null_forward_
>  * for the NaN at index (1), skip it due to the default *limit_direction* = 
> _forward_
>  * for the NaN at index (8), fill it like _ffill_ with value 
> _last_not_null_forward_
>  * if _limit_ is set, then NaNs with _null_index_forward_ greater than 
> _limit_ will not be interpolated
> h2. Plan
> 1. impl the basic _linear interpolation_ with param _limit_
> 2. add param _limit_direction_
> 3. add param _limit_area_



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38832) Remove unnecessary distinct in aggregate expression by distinctKeys

2022-04-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-38832.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36117
[https://github.com/apache/spark/pull/36117]

> Remove unnecessary distinct in aggregate expression by distinctKeys
> ---
>
> Key: SPARK-38832
> URL: https://issues.apache.org/jira/browse/SPARK-38832
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
> Fix For: 3.4.0
>
>
> We can remove the distinct in an aggregate expression if the child's output 
> is already guaranteed to be distinct.
> For example:
> {code:java}
> SELECT count(distinct c) FROM (
>   SELECT c FROM t GROUP BY c
> ){code}
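
A sketch of the resulting rewrite (my reading of the optimization: the inner 
GROUP BY c already guarantees that c is distinct, so the outer distinct is a 
no-op):
{code:java}
SELECT count(c) FROM (
  SELECT c FROM t GROUP BY c
)
{code}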



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38867) Avoid OOM when bufferedPlan has a lot of duplicate keys in SortMergeJoin codegen

2022-04-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521567#comment-17521567
 ] 

Apache Spark commented on SPARK-38867:
--

User 'mcdull-zhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/36178

> Avoid OOM when bufferedPlan has a lot of duplicate keys in SortMergeJoin 
> codegen
> 
>
> Key: SPARK-38867
> URL: https://issues.apache.org/jira/browse/SPARK-38867
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: mcdull_zhang
>Priority: Minor
>
> WholeStageCodegenExec is wrapped in BufferedRowIterator.
> BufferedRowIterator uses a LinkedList to hold the output of 
> WholeStageCodegenExec.
> When the parent of SortMergeJoin cannot codegen, SortMergeJoin needs to 
> append its output to this LinkedList.
> SortMergeJoin processes one streamedPlan record at a time. If all records in 
> bufferedPlan match this record, every record in bufferedPlan will be held in 
> the LinkedList, resulting in OOM.
> The above situation is very common in our internal use, so it is best to add 
> a configuration to the codegen code: if there are enough rows in the 
> LinkedList, pause SortMergeJoin and let the parent consume them first.
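
A toy model of the failure mode (plain Python, not Spark's generated code; 
sizes are illustrative, and memory use grows with them):
{code}
from collections import deque

streamed = [1]                  # one streamed record
buffered = [1] * 10_000_000     # many buffered records sharing its key
out = deque()                   # stands in for BufferedRowIterator's LinkedList

for s in streamed:
    for b in buffered:
        if s == b:
            out.append((s, b))  # every match is held before the parent consumes any

print(len(out))                 # large enough inputs exhaust memory here
{code}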



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38867) Avoid OOM when bufferedPlan has a lot of duplicate keys in SortMergeJoin codegen

2022-04-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38867:


Assignee: Apache Spark

> Avoid OOM when bufferedPlan has a lot of duplicate keys in SortMergeJoin 
> codegen
> 
>
> Key: SPARK-38867
> URL: https://issues.apache.org/jira/browse/SPARK-38867
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: mcdull_zhang
>Assignee: Apache Spark
>Priority: Minor
>
> WholeStageCodegenExec is wrapped in BufferedRowIterator.
> BufferedRowIterator uses a LinkedList to hold the output of 
> WholeStageCodegenExec.
> When the parent of SortMergeJoin cannot codegen, SortMergeJoin needs to 
> append its output to this LinkedList.
> SortMergeJoin processes one streamedPlan record at a time. If all records in 
> bufferedPlan match this record, every record in bufferedPlan will be held in 
> the LinkedList, resulting in OOM.
> The above situation is very common in our internal use, so it is best to add 
> a configuration to the codegen code: if there are enough rows in the 
> LinkedList, pause SortMergeJoin and let the parent consume them first.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38867) Avoid OOM when bufferedPlan has a lot of duplicate keys in SortMergeJoin codegen

2022-04-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38867:


Assignee: (was: Apache Spark)

> Avoid OOM when bufferedPlan has a lot of duplicate keys in SortMergeJoin 
> codegen
> 
>
> Key: SPARK-38867
> URL: https://issues.apache.org/jira/browse/SPARK-38867
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: mcdull_zhang
>Priority: Minor
>
> WholeStageCodegenExec is wrapped in BufferedRowIterator.
> BufferedRowIterator uses a LinkedList to hold the output of 
> WholeStageCodegenExec.
> When the parent of SortMergeJoin cannot codegen, SortMergeJoin needs to 
> append its output to this LinkedList.
> SortMergeJoin processes one streamedPlan record at a time. If all records in 
> bufferedPlan match this record, every record in bufferedPlan will be held in 
> the LinkedList, resulting in OOM.
> The above situation is very common in our internal use, so it is best to add 
> a configuration to the codegen code: if there are enough rows in the 
> LinkedList, pause SortMergeJoin and let the parent consume them first.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38774) impl Series.autocorr

2022-04-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-38774:


Assignee: zhengruifeng

> impl Series.autocorr
> 
>
> Key: SPARK-38774
> URL: https://issues.apache.org/jira/browse/SPARK-38774
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38774) impl Series.autocorr

2022-04-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38774.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36048
[https://github.com/apache/spark/pull/36048]

> impl Series.autocorr
> 
>
> Key: SPARK-38774
> URL: https://issues.apache.org/jira/browse/SPARK-38774
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38829) New configuration for controlling timestamp inference of Parquet

2022-04-13 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-38829.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 36137
[https://github.com/apache/spark/pull/36137]

> New configuration for controlling timestamp inference of Parquet
> 
>
> Key: SPARK-38829
> URL: https://issues.apache.org/jira/browse/SPARK-38829
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Ivan Sadikov
>Priority: Major
> Fix For: 3.3.0
>
>
> A new SQL conf that can fall back to the behavior of reading all Parquet 
> timestamp columns as TimestampType.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38886) Remove outer join if aggregate functions are duplicate agnostic on streamed side

2022-04-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521535#comment-17521535
 ] 

Apache Spark commented on SPARK-38886:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/36177

> Remove outer join if aggregate functions are duplicate agnostic on streamed 
> side
> 
>
> Key: SPARK-38886
> URL: https://issues.apache.org/jira/browse/SPARK-38886
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
>
> If the aggregate's child is an outer join, the aggregate references all come 
> from the streamed side, and the aggregate functions are all duplicate 
> agnostic, then we can remove the outer join.
> For example:
> {code:java}
> SELECT t1.c1, min(t1.c2) FROM t1 LEFT JOIN t2 ON t1.c1 = t2.c1 GROUP BY t1.c1
> ==>
> SELECT t1.c1, min(t1.c2) FROM t1 GROUP BY t1.c1
> {code}
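
For contrast, a hedged counter-example (my illustration, not from the ticket): 
count is not duplicate agnostic, since extra matches from t2 would inflate the 
result, so this join could not be removed:
{code:java}
SELECT t1.c1, count(t1.c2) FROM t1 LEFT JOIN t2 ON t1.c1 = t2.c1 GROUP BY t1.c1
{code}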



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38886) Remove outer join if aggregate functions are duplicate agnostic on streamed side

2022-04-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38886:


Assignee: (was: Apache Spark)

> Remove outer join if aggregate functions are duplicate agnostic on streamed 
> side
> 
>
> Key: SPARK-38886
> URL: https://issues.apache.org/jira/browse/SPARK-38886
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
>
> If the aggregate's child is an outer join, the aggregate references all come 
> from the streamed side, and the aggregate functions are all duplicate 
> agnostic, then we can remove the outer join.
> For example:
> {code:java}
> SELECT t1.c1, min(t1.c2) FROM t1 LEFT JOIN t2 ON t1.c1 = t2.c1 GROUP BY t1.c1
> ==>
> SELECT t1.c1, min(t1.c2) FROM t1 GROUP BY t1.c1
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38886) Remove outer join if aggregate functions are duplicate agnostic on streamed side

2022-04-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521536#comment-17521536
 ] 

Apache Spark commented on SPARK-38886:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/36177

> Remove outer join if aggregate functions are duplicate agnostic on streamed 
> side
> 
>
> Key: SPARK-38886
> URL: https://issues.apache.org/jira/browse/SPARK-38886
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
>
> If the aggregate's child is an outer join, the aggregate references all come 
> from the streamed side, and the aggregate functions are all duplicate 
> agnostic, then we can remove the outer join.
> For example:
> {code:java}
> SELECT t1.c1, min(t1.c2) FROM t1 LEFT JOIN t2 ON t1.c1 = t2.c1 GROUP BY t1.c1
> ==>
> SELECT t1.c1, min(t1.c2) FROM t1 GROUP BY t1.c1
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38886) Remove outer join if aggregate functions are duplicate agnostic on streamed side

2022-04-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38886:


Assignee: Apache Spark

> Remove outer join if aggregate functions are duplicate agnostic on streamed 
> side
> 
>
> Key: SPARK-38886
> URL: https://issues.apache.org/jira/browse/SPARK-38886
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: Apache Spark
>Priority: Major
>
> If the aggregate's child is an outer join, the aggregate references all come 
> from the streamed side, and the aggregate functions are all duplicate 
> agnostic, then we can remove the outer join.
> For example:
> {code:java}
> SELECT t1.c1, min(t1.c2) FROM t1 LEFT JOIN t2 ON t1.c1 = t2.c1 GROUP BY t1.c1
> ==>
> SELECT t1.c1, min(t1.c2) FROM t1 GROUP BY t1.c1
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38886) Remove outer join if aggregate functions are duplicate agnostic on streamed side

2022-04-13 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-38886:
--
Description: 
If the aggregate's child is an outer join, the aggregate references all come 
from the streamed side, and the aggregate functions are all duplicate 
agnostic, then we can remove the outer join.

For example:
{code:java}
SELECT t1.c1, min(t1.c2) FROM t1 LEFT JOIN t2 ON t1.c1 = t2.c1 GROUP BY t1.c1
==>
SELECT t1.c1, min(t1.c2) FROM t1 GROUP BY t1.c1
{code}

  was:
If the aggregate's child is an outer join, the aggregate references all come 
from the streamed side, and the aggregate functions are all duplicate 
agnostic, then we can remove the outer join.

For example:
{code:java}
SELECT t1.c1, max(t1.c2) FROM t1 LEFT JOIN t2 ON t1.c1 = t2.c1 GROUP BY t1.c1
==>
SELECT t1.c1, max(t1.c2) FROM t1 GROUP BY t1.c1
{code}



> Remove outer join if aggregate functions are duplicate agnostic on streamed 
> side
> 
>
> Key: SPARK-38886
> URL: https://issues.apache.org/jira/browse/SPARK-38886
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
>
> If the aggregate's child is an outer join, the aggregate references all come 
> from the streamed side, and the aggregate functions are all duplicate 
> agnostic, then we can remove the outer join.
> For example:
> {code:java}
> SELECT t1.c1, min(t1.c2) FROM t1 LEFT JOIN t2 ON t1.c1 = t2.c1 GROUP BY t1.c1
> ==>
> SELECT t1.c1, min(t1.c2) FROM t1 GROUP BY t1.c1
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38833) PySpark applyInPandas should allow returning an empty DataFrame without columns

2022-04-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-38833:


Assignee: Enrico Minack

> PySpark applyInPandas should allow returning an empty DataFrame without columns
> 
>
> Key: SPARK-38833
> URL: https://issues.apache.org/jira/browse/SPARK-38833
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Enrico Minack
>Assignee: Enrico Minack
>Priority: Major
>
> Currently, returning an empty Pandas DataFrame from {{applyInPandas}} raises 
> an error:
> {noformat}
> RuntimeError: Number of columns of the returned pandas.DataFrame doesn't 
> match specified schema. Expected: 2 Actual: 0
> {noformat}
> Here is an example:
> {code}
> import pandas as pd
> df = spark.createDataFrame(
> [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
> ("id", "v"))  
> def mean_func(key, pdf):
> if key == (1,):
> return pd.DataFrame([])
> else:
> return pd.DataFrame([key + (pdf.v.mean(),)])
> df.groupby('id').applyInPandas(mean_func, schema="id long, v double").show()
> {code}
> Since the schema is defined when calling {{applyInPandas()}}, it looks 
> redundant to define the columns when returning an empty {{pd.DataFrame}}. 
> Returning a non-empty DataFrame does not require defining columns, so 
> returning an empty DataFrame shouldn't require that either.
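
A hedged workaround sketch for versions without the fix (my suggestion, not 
from the ticket): naming the expected columns explicitly satisfies the 
column-count check:
{code}
def mean_func(key, pdf):
    if key == (1,):
        # empty, but with columns matching the declared schema "id long, v double"
        return pd.DataFrame(columns=["id", "v"])
    else:
        return pd.DataFrame([key + (pdf.v.mean(),)])
{code}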



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38833) PySpark applyInPandas should allow returning an empty DataFrame without columns

2022-04-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38833.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 36120
[https://github.com/apache/spark/pull/36120]

> PySpark applyInPandas should allow returning an empty DataFrame without columns
> 
>
> Key: SPARK-38833
> URL: https://issues.apache.org/jira/browse/SPARK-38833
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Enrico Minack
>Assignee: Enrico Minack
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently, returning an empty Pandas DataFrame from {{applyInPandas}} raises 
> an error:
> {noformat}
> RuntimeError: Number of columns of the returned pandas.DataFrame doesn't 
> match specified schema. Expected: 2 Actual: 0
> {noformat}
> Here is an example:
> {code}
> import pandas as pd
> df = spark.createDataFrame(
> [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
> ("id", "v"))  
> def mean_func(key, pdf):
> if key == (1,):
> return pd.DataFrame([])
> else:
> return pd.DataFrame([key + (pdf.v.mean(),)])
> df.groupby('id').applyInPandas(mean_func, schema="id long, v double").show()
> {code}
> Since the schema is defined when calling {{applyInPandas()}}, it looks 
> redundant to define the columns when returning an empty {{pd.DataFrame}}. 
> Returning a non-empty DataFrame does not require defining columns, so 
> returning an empty DataFrame shouldn't require that either.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38883) smaller pyspark install if not using streaming?

2022-04-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38883.
--
Resolution: Invalid

Let's use the Spark mailing list for questions.

> smaller pyspark install if not using streaming?
> ---
>
> Key: SPARK-38883
> URL: https://issues.apache.org/jira/browse/SPARK-38883
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: t oo
>Priority: Minor
>
> h3. Describe the feature
> I am trying to include pyspark in my docker image, but the size is around 
> 300MB.
> The largest jar is rocksdbjni-6.20.3.jar at 35MB.
> Is it safe to remove this jar if I have no need for Spark Streaming?
> Is there any advice on getting the install smaller? Perhaps a map of which 
> jars are needed for batch vs sql vs streaming?
> h3. Use Case
> A smaller python package means I can pack more concurrent pods onto my eks 
> workers.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34659) Web UI does not correctly get appId

2022-04-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521520#comment-17521520
 ] 

Apache Spark commented on SPARK-34659:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/36176

> Web UI does not correctly get appId
> ---
>
> Key: SPARK-34659
> URL: https://issues.apache.org/jira/browse/SPARK-34659
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Web UI
>Affects Versions: 2.4.7, 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: Arata Furukawa
>Priority: Major
>
> The Web UI does not correctly get the appId when the URL contains `proxy` or 
> `history`.
> In my case, it happens on 
> `https://jupyterhub.hosted.us/my-name/proxy/4040/executors/`.
> The web developer console says: `jquery-3.4.1.min.js:2 GET 
> https://jupyterhub.hosted.us/user/my-name/proxy/4040/api/v1/applications/4040/allexecutors
>  404`, and it shows blank pages to me.
> There is a related issue in jupyterhub: 
> https://github.com/jupyterhub/jupyter-server-proxy/issues/57
> https://github.com/apache/spark/blob/2526fdea481b1777b2c4a2242254b72b5c49d820/core/src/main/resources/org/apache/spark/ui/static/utils.js#L93-L105
> It should not derive the appId from document.baseURI.
> An extra request would occur, but the performance impact should be small.
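
A hedged illustration (plain Python mimicking the path parsing the ticket 
points at in utils.js): splitting the proxied URL on '/' yields the proxy 
port, not an application id:
{code}
url = "https://jupyterhub.hosted.us/user/my-name/proxy/4040/executors/"
words = url.split("/")
ind = words.index("proxy")  # the segment after "proxy" is taken as the appId
print(words[ind + 1])       # "4040" -- a proxy port, not an appId, hence the 404
{code}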



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


