[jira] [Resolved] (SPARK-36643) Add more information in ERROR log while SparkConf is modified when spark.sql.legacy.setCommandRejectsSparkCoreConfs is set
[ https://issues.apache.org/jira/browse/SPARK-36643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-36643. --- Fix Version/s: 3.3.0 Assignee: Senthil Kumar Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/33894 > Add more information in ERROR log while SparkConf is modified when > spark.sql.legacy.setCommandRejectsSparkCoreConfs is set > -- > > Key: SPARK-36643 > URL: https://issues.apache.org/jira/browse/SPARK-36643 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.1.2 >Reporter: Senthil Kumar >Assignee: Senthil Kumar >Priority: Minor > Fix For: 3.3.0 > > > Right now, spark.sql.legacy.setCommandRejectsSparkCoreConfs is set to true by default in Spark 3.* versions in order to prevent changing Spark core confs. But the current error message leaves users unsure whether Spark confs can be modified in Spark 3.* at all. > Current Error Message: > {code:java} > Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot > modify the value of a Spark config: spark.driver.host > at > org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:156) > at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:40){code} > > So adding a little more information (how to modify a Spark conf) to the ERROR log emitted when a SparkConf is modified while > spark.sql.legacy.setCommandRejectsSparkCoreConfs is 'true' would help avoid confusion. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
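The requested improvement is easiest to see in a sketch. The snippet below is hypothetical Python, not Spark's actual code (the real check lives in `RuntimeConfig.requireNonStaticConf`, in Scala); `CORE_CONFS` and `set_conf` are illustrative names. It shows the idea of extending the rejection message with guidance on how such confs can legitimately be set:

```python
# Hypothetical sketch of SPARK-36643's improved error message: when a SET
# command touches a Spark core conf, explain *how* it can be set instead of
# only rejecting it. CORE_CONFS is an illustrative subset, not Spark's list.
CORE_CONFS = {"spark.driver.host", "spark.driver.memory"}

def set_conf(key: str, value: str, session_confs: dict) -> None:
    if key in CORE_CONFS:
        raise ValueError(
            f"Cannot modify the value of a Spark config: {key}. "
            "Spark core configs can only be set at launch time, e.g. via "
            "SparkSession.builder.config(...) or spark-submit --conf."
        )
    session_confs[key] = value  # ordinary runtime confs are still mutable
```

The point of the JIRA is only the second sentence of the message: the rejection itself is unchanged.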
[jira] [Assigned] (SPARK-36666) [SQL] Regression in AQEShuffleReadExec
[ https://issues.apache.org/jira/browse/SPARK-36666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-36666: - Assignee: Andy Grove > [SQL] Regression in AQEShuffleReadExec > -- > > Key: SPARK-36666 > URL: https://issues.apache.org/jira/browse/SPARK-36666 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Blocker > > I am currently testing the RAPIDS Accelerator for Apache Spark with the Spark > 3.2 release candidate and there is a regression in AQEShuffleReadExec where > it now throws an exception if the shuffle's output partitioning does not > match a specific list of schemes. > The problem can be solved by returning UnknownPartitioning, as it already > does in some cases, rather than throwing an exception. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
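The fix direction described in the report can be sketched as follows. This is Python for illustration only; the real change is inside Spark's Scala class `AQEShuffleReadExec`, and `KNOWN_SCHEMES` is an invented stand-in for the specific list of partitioning schemes mentioned above:

```python
# Illustrative sketch (not Spark code) of returning an "unknown" partitioning
# for unrecognized schemes instead of throwing, as the report suggests.
KNOWN_SCHEMES = {"hash", "range"}  # hypothetical subset for illustration

def output_partitioning(scheme: str, num_partitions: int) -> tuple:
    if scheme in KNOWN_SCHEMES:
        # Recognized scheme: preserve the shuffle's partitioning as-is.
        return (scheme, num_partitions)
    # Proposed fix: fall back to UnknownPartitioning rather than raising,
    # so plugin-provided partitionings (e.g. from the RAPIDS Accelerator)
    # still plan successfully.
    return ("unknown", num_partitions)
```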
[jira] [Updated] (SPARK-36666) [SQL] Regression in AQEShuffleReadExec
[ https://issues.apache.org/jira/browse/SPARK-36666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-36666: -- Parent: SPARK-33828 Issue Type: Sub-task (was: Bug) > [SQL] Regression in AQEShuffleReadExec > -- > > Key: SPARK-36666 > URL: https://issues.apache.org/jira/browse/SPARK-36666 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Andy Grove >Priority: Blocker > > I am currently testing the RAPIDS Accelerator for Apache Spark with the Spark > 3.2 release candidate and there is a regression in AQEShuffleReadExec where > it now throws an exception if the shuffle's output partitioning does not > match a specific list of schemes. > The problem can be solved by returning UnknownPartitioning, as it already > does in some cases, rather than throwing an exception. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36653) Implement Series.__xor__
[ https://issues.apache.org/jira/browse/SPARK-36653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36653: Assignee: (was: Apache Spark) > Implement Series.__xor__ > > > Key: SPARK-36653 > URL: https://issues.apache.org/jira/browse/SPARK-36653 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36653) Implement Series.__xor__
[ https://issues.apache.org/jira/browse/SPARK-36653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409857#comment-17409857 ] Apache Spark commented on SPARK-36653: -- User 'dgd-contributor' has created a pull request for this issue: https://github.com/apache/spark/pull/33911 > Implement Series.__xor__ > > > Key: SPARK-36653 > URL: https://issues.apache.org/jira/browse/SPARK-36653 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36653) Implement Series.__xor__
[ https://issues.apache.org/jira/browse/SPARK-36653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36653: Assignee: Apache Spark > Implement Series.__xor__ > > > Key: SPARK-36653 > URL: https://issues.apache.org/jira/browse/SPARK-36653 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
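For context, the expected semantics of an element-wise xor with SQL-style null propagation can be sketched in plain Python. This is an illustration of the behavior `Series.__xor__` would follow, not the pandas-on-Spark implementation:

```python
# Sketch of element-wise xor with null (None) propagation, the usual
# three-valued-logic behavior for pandas-on-Spark boolean operators.
def xor_elem(a, b):
    if a is None or b is None:
        return None          # null propagates, as in SQL
    return bool(a) ^ bool(b)

def series_xor(left, right):
    """Element-wise xor of two equally-long sequences."""
    return [xor_elem(a, b) for a, b in zip(left, right)]
```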
[jira] [Commented] (SPARK-36667) Close resources properly in StateStoreSuite/RocksDBStateStoreSuite
[ https://issues.apache.org/jira/browse/SPARK-36667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409779#comment-17409779 ] Jungtaek Lim commented on SPARK-36667: -- Will submit a PR soon. > Close resources properly in StateStoreSuite/RocksDBStateStoreSuite > -- > > Key: SPARK-36667 > URL: https://issues.apache.org/jira/browse/SPARK-36667 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Jungtaek Lim >Priority: Major > > The StateStoreProvider instances created from "newStoreProvider" are NOT > automatically closed. > While this is trivial for HDFSBackedStateStoreProvider, for > RocksDBStateStoreProvider we also leak the RocksDB instance, which should have > been closed. Most tests in RocksDBStateStoreSuite initialize > RocksDBStateStoreProvider, meaning that 60+ RocksDB instances are not closed > in the suite. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36667) Close resources properly in StateStoreSuite/RocksDBStateStoreSuite
Jungtaek Lim created SPARK-36667: Summary: Close resources properly in StateStoreSuite/RocksDBStateStoreSuite Key: SPARK-36667 URL: https://issues.apache.org/jira/browse/SPARK-36667 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.2.0 Reporter: Jungtaek Lim The StateStoreProvider instances created from "newStoreProvider" are NOT automatically closed. While this is trivial for HDFSBackedStateStoreProvider, for RocksDBStateStoreProvider we also leak the RocksDB instance, which should have been closed. Most tests in RocksDBStateStoreSuite initialize RocksDBStateStoreProvider, meaning that 60+ RocksDB instances are not closed in the suite. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
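One common way to guarantee this kind of cleanup in tests is a scoped helper that always closes the provider on exit. The sketch below is plain Python with invented names (the actual suites are Scala); it shows the pattern, not the real fix:

```python
import contextlib

# Toy stand-in for a state store provider holding a native resource.
class FakeStoreProvider:
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

# Scoped factory: every provider a test obtains through this helper is
# closed when the block exits, even if the test body raises.
@contextlib.contextmanager
def new_store_provider():
    provider = FakeStoreProvider()
    try:
        yield provider
    finally:
        provider.close()
```

Scala tests achieve the same thing with a `tryWithResource`-style loan function wrapped around the test body.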
[jira] [Updated] (SPARK-36666) [SQL] Regression in AQEShuffleReadExec
[ https://issues.apache.org/jira/browse/SPARK-36666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-36666: -- Priority: Blocker (was: Major) > [SQL] Regression in AQEShuffleReadExec > -- > > Key: SPARK-36666 > URL: https://issues.apache.org/jira/browse/SPARK-36666 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Andy Grove >Priority: Blocker > > I am currently testing the RAPIDS Accelerator for Apache Spark with the Spark > 3.2 release candidate and there is a regression in AQEShuffleReadExec where > it now throws an exception if the shuffle's output partitioning does not > match a specific list of schemes. > The problem can be solved by returning UnknownPartitioning, as it already > does in some cases, rather than throwing an exception. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36666) [SQL] Regression in AQEShuffleReadExec
[ https://issues.apache.org/jira/browse/SPARK-36666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36666: Assignee: Apache Spark > [SQL] Regression in AQEShuffleReadExec > -- > > Key: SPARK-36666 > URL: https://issues.apache.org/jira/browse/SPARK-36666 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Andy Grove >Assignee: Apache Spark >Priority: Major > > I am currently testing the RAPIDS Accelerator for Apache Spark with the Spark > 3.2 release candidate and there is a regression in AQEShuffleReadExec where > it now throws an exception if the shuffle's output partitioning does not > match a specific list of schemes. > The problem can be solved by returning UnknownPartitioning, as it already > does in some cases, rather than throwing an exception. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36666) [SQL] Regression in AQEShuffleReadExec
[ https://issues.apache.org/jira/browse/SPARK-36666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409761#comment-17409761 ] Apache Spark commented on SPARK-36666: -- User 'andygrove' has created a pull request for this issue: https://github.com/apache/spark/pull/33910 > [SQL] Regression in AQEShuffleReadExec > -- > > Key: SPARK-36666 > URL: https://issues.apache.org/jira/browse/SPARK-36666 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Andy Grove >Priority: Major > > I am currently testing the RAPIDS Accelerator for Apache Spark with the Spark > 3.2 release candidate and there is a regression in AQEShuffleReadExec where > it now throws an exception if the shuffle's output partitioning does not > match a specific list of schemes. > The problem can be solved by returning UnknownPartitioning, as it already > does in some cases, rather than throwing an exception. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36666) [SQL] Regression in AQEShuffleReadExec
[ https://issues.apache.org/jira/browse/SPARK-36666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36666: Assignee: (was: Apache Spark) > [SQL] Regression in AQEShuffleReadExec > -- > > Key: SPARK-36666 > URL: https://issues.apache.org/jira/browse/SPARK-36666 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Andy Grove >Priority: Major > > I am currently testing the RAPIDS Accelerator for Apache Spark with the Spark > 3.2 release candidate and there is a regression in AQEShuffleReadExec where > it now throws an exception if the shuffle's output partitioning does not > match a specific list of schemes. > The problem can be solved by returning UnknownPartitioning, as it already > does in some cases, rather than throwing an exception. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36666) [SQL] Regression in AQEShuffleReadExec
Andy Grove created SPARK-36666: -- Summary: [SQL] Regression in AQEShuffleReadExec Key: SPARK-36666 URL: https://issues.apache.org/jira/browse/SPARK-36666 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: Andy Grove I am currently testing the RAPIDS Accelerator for Apache Spark with the Spark 3.2 release candidate and there is a regression in AQEShuffleReadExec where it now throws an exception if the shuffle's output partitioning does not match a specific list of schemes. The problem can be solved by returning UnknownPartitioning, as it does in some cases, rather than throwing an exception. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36666) [SQL] Regression in AQEShuffleReadExec
[ https://issues.apache.org/jira/browse/SPARK-36666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated SPARK-36666: --- Description: I am currently testing the RAPIDS Accelerator for Apache Spark with the Spark 3.2 release candidate and there is a regression in AQEShuffleReadExec where it now throws an exception if the shuffle's output partitioning does not match a specific list of schemes. The problem can be solved by returning UnknownPartitioning, as it already does in some cases, rather than throwing an exception. was: I am currently testing the RAPIDS Accelerator for Apache Spark with the Spark 3.2 release candidate and there is a regression in AQEShuffleReadExec where it now throws an exception if the shuffle's output partitioning does not match a specific list of schemes. The problem can be solved by returning UnknownPartitioning, as it does in some cases, rather than throwing an exception. > [SQL] Regression in AQEShuffleReadExec > -- > > Key: SPARK-36666 > URL: https://issues.apache.org/jira/browse/SPARK-36666 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Andy Grove >Priority: Major > > I am currently testing the RAPIDS Accelerator for Apache Spark with the Spark > 3.2 release candidate and there is a regression in AQEShuffleReadExec where > it now throws an exception if the shuffle's output partitioning does not > match a specific list of schemes. > The problem can be solved by returning UnknownPartitioning, as it already > does in some cases, rather than throwing an exception. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36665) Add more Not operator optimizations
[ https://issues.apache.org/jira/browse/SPARK-36665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409746#comment-17409746 ] Kazuyuki Tanimura commented on SPARK-36665: --- I am working on this > Add more Not operator optimizations > --- > > Key: SPARK-36665 > URL: https://issues.apache.org/jira/browse/SPARK-36665 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.3.0 >Reporter: Kazuyuki Tanimura >Priority: Major > > {{BooleanSimplification should be able to do more simplifications for Not > operators by applying the following rules}} > # {{Not(null) == null}} > ## {{e.g. IsNull(Not(...)) can be IsNull(...)}} > # {{(Not(a) = b) == (a = Not(b))}} > ## {{e.g. Not(...) = true can be (...) = false}} > # {{(a != b) == (a = Not(b))}} > ## {{e.g. (...) != true can be (...) = false}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36665) Add more Not operator optimizations
Kazuyuki Tanimura created SPARK-36665: - Summary: Add more Not operator optimizations Key: SPARK-36665 URL: https://issues.apache.org/jira/browse/SPARK-36665 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.2, 3.2.0, 3.3.0 Reporter: Kazuyuki Tanimura {{BooleanSimplification should be able to do more simplifications for Not operators by applying the following rules}} # {{Not(null) == null}} ## {{e.g. IsNull(Not(...)) can be IsNull(...)}} # {{(Not(a) = b) == (a = Not(b))}} ## {{e.g. Not(...) = true can be (...) = false}} # {{(a != b) == (a = Not(b))}} ## {{e.g. (...) != true can be (...) = false}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
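The three rewrites listed above can be demonstrated with a toy simplifier. This is a Python sketch over tuple-encoded expressions; Spark's BooleanSimplification is a Catalyst rule written in Scala, and the encoding here is invented for illustration:

```python
# Toy rewriter for the three Not-related rules. Expressions are tuples:
# ("not", e), ("isnull", e), ("eq", e, literal), ("neq", e, literal).
def simplify(expr):
    op = expr[0]
    # Rule 1: Not(null) == null, so IsNull(Not(e)) simplifies to IsNull(e).
    if op == "isnull" and expr[1][0] == "not":
        return ("isnull", expr[1][1])
    # Rule 2: (Not(a) = b) == (a = Not(b)); e.g. Not(e) = true -> e = false.
    if op == "eq" and expr[1][0] == "not":
        return ("eq", expr[1][1], not expr[2])
    # Rule 3: (a != b) == (a = Not(b)); e.g. e != true -> e = false.
    if op == "neq":
        return ("eq", expr[1], not expr[2])
    return expr
```

Each rewrite removes a Not (or a !=), which in turn can unlock further simplifications and predicate pushdown.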
[jira] [Resolved] (SPARK-36655) Add `versionadded` for API added in Spark 3.3.0
[ https://issues.apache.org/jira/browse/SPARK-36655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-36655. --- Fix Version/s: 3.3.0 Assignee: Xinrong Meng Resolution: Fixed Issue resolved by pull request 33901 https://github.com/apache/spark/pull/33901 > Add `versionadded` for API added in Spark 3.3.0 > --- > > Key: SPARK-36655 > URL: https://issues.apache.org/jira/browse/SPARK-36655 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36401) Implement Series.cov
[ https://issues.apache.org/jira/browse/SPARK-36401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-36401. --- Fix Version/s: 3.3.0 Assignee: dgd_contributor Resolution: Fixed Issue resolved by pull request 33752 https://github.com/apache/spark/pull/33752 > Implement Series.cov > > > Key: SPARK-36401 > URL: https://issues.apache.org/jira/browse/SPARK-36401 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: dgd_contributor >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36659) Promote spark.sql.execution.topKSortFallbackThreshold to user-faced config
[ https://issues.apache.org/jira/browse/SPARK-36659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409634#comment-17409634 ] Dongjoon Hyun commented on SPARK-36659: --- Although RC2 will fail, I set the fix version to 3.2.1 because the RC vote is still open. > Promote spark.sql.execution.topKSortFallbackThreshold to user-faced config > -- > > Key: SPARK-36659 > URL: https://issues.apache.org/jira/browse/SPARK-36659 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0, 3.3.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > Fix For: 3.2.1 > > > spark.sql.execution.topKSortFallbackThreshold is currently an internal config > hidden from users, with Integer.MAX_VALUE - 15 as its default. In many real-world > cases, if K is very big, there can be performance issues. > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
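What the threshold controls can be illustrated with a sketch (Python, illustrative only, not Spark's planner code): below the threshold, an ORDER BY ... LIMIT K can be answered with a bounded top-K pass; at or above it, Spark falls back to a full sort followed by a limit, which is much more expensive when K is huge:

```python
import heapq

# Sketch of the planning decision behind topKSortFallbackThreshold:
# small K -> top-K without a full sort; K >= threshold -> full sort + limit.
def top_k(rows, k, threshold):
    if k < threshold:
        # Bounded-heap top-K: O(n log k), no full sort of the input.
        return heapq.nsmallest(k, rows)
    # Fallback path the config guards against for very large K:
    # O(n log n) full sort, then take the first k rows.
    return sorted(rows)[:k]
```

Both branches return the same rows; the config only trades off how they are computed, which is why exposing it to users matters for tuning.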
[jira] [Updated] (SPARK-36659) Promote spark.sql.execution.topKSortFallbackThreshold to user-faced config
[ https://issues.apache.org/jira/browse/SPARK-36659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-36659: -- Fix Version/s: (was: 3.3.0) 3.2.1 > Promote spark.sql.execution.topKSortFallbackThreshold to user-faced config > -- > > Key: SPARK-36659 > URL: https://issues.apache.org/jira/browse/SPARK-36659 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0, 3.3.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > Fix For: 3.2.1 > > > spark.sql.execution.topKSortFallbackThreshold now is an internal config > hidden from users Integer.MAX_VALUE - 15 as its default. In many real-world > cases, if the K is very big, there would be performance issues. > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36659) Promote spark.sql.execution.topKSortFallbackThreshold to user-faced config
[ https://issues.apache.org/jira/browse/SPARK-36659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-36659: -- Fix Version/s: (was: 3.2.0) 3.3.0 > Promote spark.sql.execution.topKSortFallbackThreshold to user-faced config > -- > > Key: SPARK-36659 > URL: https://issues.apache.org/jira/browse/SPARK-36659 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0, 3.3.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > Fix For: 3.3.0 > > > spark.sql.execution.topKSortFallbackThreshold now is an internal config > hidden from users Integer.MAX_VALUE - 15 as its default. In many real-world > cases, if the K is very big, there would be performance issues. > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36659) Promote spark.sql.execution.topKSortFallbackThreshold to user-faced config
[ https://issues.apache.org/jira/browse/SPARK-36659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-36659: -- Fix Version/s: (was: 3.3.0) 3.2.0 > Promote spark.sql.execution.topKSortFallbackThreshold to user-faced config > -- > > Key: SPARK-36659 > URL: https://issues.apache.org/jira/browse/SPARK-36659 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0, 3.3.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > Fix For: 3.2.0 > > > spark.sql.execution.topKSortFallbackThreshold now is an internal config > hidden from users Integer.MAX_VALUE - 15 as its default. In many real-world > cases, if the K is very big, there would be performance issues. > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36639) SQL sequence function with interval returns unexpected error in latest versions
[ https://issues.apache.org/jira/browse/SPARK-36639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409620#comment-17409620 ] Apache Spark commented on SPARK-36639: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/33909 > SQL sequence function with interval returns unexpected error in latest > versions > --- > > Key: SPARK-36639 > URL: https://issues.apache.org/jira/browse/SPARK-36639 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2 >Reporter: Ignatiy Vdovichenko >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.2.0, 3.1.3 > > > For example, this returns > {color:#FF}java.lang.ArrayIndexOutOfBoundsException: 1 {color} > {code:java} > select sequence( > date_trunc('month', '2021-08-30'), > date_trunc('month', '2021-08-15'), > - interval 1 month){code} > Other similar cases are all OK: > {code:java} > select sequence( > date_trunc('month', '2021-07-15'), > date_trunc('month', '2021-08-30'), > interval 1 month) as x > , sequence( > date_trunc('month', '2021-08-30'), > date_trunc('month', '2021-07-15'), > - interval 1 month) as y > , sequence( > date_trunc('month', '2021-08-15'), > date_trunc('month', '2021-08-30'), > interval 1 month) as z{code} > In version 3.0.0 this works -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
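For reference, here is a small Python model of what `sequence` with a month interval should return, including the start == stop, negative-step case that triggered the exception (both dates truncate to 2021-08-01, so the expected result is a single-element array). This is an illustrative reimplementation, not Spark's code:

```python
from datetime import date

# Reference model of sequence(start, stop, interval N months) for
# month-granularity inputs. A zero step is rejected, and start == stop
# yields [start] regardless of the step's sign.
def month_seq(start: date, stop: date, step_months: int):
    if step_months == 0:
        raise ValueError("step must be non-zero")
    out, cur = [], start
    while (cur <= stop) if step_months > 0 else (cur >= stop):
        out.append(cur)
        # Advance by step_months using months-since-year-0 arithmetic.
        total = cur.year * 12 + (cur.month - 1) + step_months
        cur = date(total // 12, total % 12 + 1, cur.day)
    return out
```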
[jira] [Created] (SPARK-36664) Log time spent waiting for cluster resources
Holden Karau created SPARK-36664: Summary: Log time spent waiting for cluster resources Key: SPARK-36664 URL: https://issues.apache.org/jira/browse/SPARK-36664 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.2.0 Reporter: Holden Karau To provide better visibility into why jobs might be running slow it would be useful to log when we are waiting for resources and how long we are waiting for resources so if there is an underlying cluster issue the user can be aware. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
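The proposal could look roughly like the following sketch. This is a hypothetical Python helper, not Spark's API; the function name, polling approach, and log format are all assumptions used to illustrate the idea of surfacing resource-wait time:

```python
import logging
import time

# Hypothetical helper: block until the cluster reports resources ready,
# then log how long the job spent waiting so slow clusters are visible.
def wait_for_resources(ready, poll_s=0.01, logger=logging.getLogger("scheduler")):
    start = time.monotonic()
    while not ready():
        time.sleep(poll_s)
    waited = time.monotonic() - start
    logger.info("Waited %.3f s for cluster resources", waited)
    return waited
```

In Spark itself this would more likely be woven into the existing scheduler/executor-allocation logging rather than a standalone polling loop.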
[jira] [Commented] (SPARK-36622) spark.history.kerberos.principal doesn't take value _HOST
[ https://issues.apache.org/jira/browse/SPARK-36622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409617#comment-17409617 ] pralabhkumar commented on SPARK-36622: -- [~thejdeep] It's better to have _HOST; it's been common practice for HiveServer and similar projects. [~tgraves] Agreed. Please let me know if you are OK with it, and I can create the PR. > spark.history.kerberos.principal doesn't take value _HOST > - > > Key: SPARK-36622 > URL: https://issues.apache.org/jira/browse/SPARK-36622 > Project: Spark > Issue Type: Improvement > Components: Deploy, Security, Spark Core >Affects Versions: 3.0.1, 3.1.2 >Reporter: pralabhkumar >Priority: Minor > > spark.history.kerberos.principal doesn't understand the value _HOST. > It fails with: failure to login for principal : spark/_HOST@realm . > It would be helpful to accept the _HOST value from the config file and replace it with > the current hostname (similar to what Hive does). This would also help to run SHS > on multiple machines without hardcoding the principal hostname. > spark.history.kerberos.principal > > This requires a minor change in HistoryServer.scala in the initSecurity method. > > Please let me know if this request makes sense, and I'll create the PR. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
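The requested _HOST handling, modeled on what Hadoop's SecurityUtil.getServerPrincipal does for Hive and other services, can be sketched as follows (illustrative Python; the actual change would be in HistoryServer.scala, and `expand_principal` is an invented name):

```python
import socket

# Expand the _HOST token in a Kerberos principal to this machine's
# fully-qualified hostname, so one config works across SHS hosts.
def expand_principal(principal, hostname=None):
    host = (hostname or socket.getfqdn()).lower()
    return principal.replace("_HOST", host)
```

With this, `spark.history.kerberos.principal=spark/_HOST@REALM` would resolve correctly on every machine the History Server runs on.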
[jira] [Commented] (SPARK-36639) SQL sequence function with interval returns unexpected error in latest versions
[ https://issues.apache.org/jira/browse/SPARK-36639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409539#comment-17409539 ] Kousuke Saruta commented on SPARK-36639: Issue resolved in https://github.com/apache/spark/pull/33895 for 3.1 and 3.2. > SQL sequence function with interval returns unexpected error in latest > versions > --- > > Key: SPARK-36639 > URL: https://issues.apache.org/jira/browse/SPARK-36639 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2 >Reporter: Ignatiy Vdovichenko >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.2.0, 3.1.3 > > > For example this returns > {color:#FF}java.lang.ArrayIndexOutOfBoundsException: 1 {color} > {code:java} > select sequence( > date_trunc('month', '2021-08-30'), > date_trunc('month', '2021-08-15'), > - interval 1 month){code} > Another cases like - all ok > {code:java} > select sequence( > date_trunc('month', '2021-07-15'), > date_trunc('month', '2021-08-30'), > interval 1 month) as x > , sequence( > date_trunc('month', '2021-08-30'), > date_trunc('month', '2021-07-15'), > - interval 1 month) as y > , sequence( > date_trunc('month', '2021-08-15'), > date_trunc('month', '2021-08-30'), > interval 1 month) as z{code} > In version 3.0.0 this works -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36639) SQL sequence function with interval returns unexpected error in latest versions
[ https://issues.apache.org/jira/browse/SPARK-36639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta resolved SPARK-36639. Assignee: Kousuke Saruta Resolution: Fixed > SQL sequence function with interval returns unexpected error in latest > versions > --- > > Key: SPARK-36639 > URL: https://issues.apache.org/jira/browse/SPARK-36639 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2 >Reporter: Ignatiy Vdovichenko >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.2.0, 3.1.3 > > > For example this returns > {color:#FF}java.lang.ArrayIndexOutOfBoundsException: 1 {color} > {code:java} > select sequence( > date_trunc('month', '2021-08-30'), > date_trunc('month', '2021-08-15'), > - interval 1 month){code} > Another cases like - all ok > {code:java} > select sequence( > date_trunc('month', '2021-07-15'), > date_trunc('month', '2021-08-30'), > interval 1 month) as x > , sequence( > date_trunc('month', '2021-08-30'), > date_trunc('month', '2021-07-15'), > - interval 1 month) as y > , sequence( > date_trunc('month', '2021-08-15'), > date_trunc('month', '2021-08-30'), > interval 1 month) as z{code} > In version 3.0.0 this works -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36639) SQL sequence function with interval returns unexpected error in latest versions
[ https://issues.apache.org/jira/browse/SPARK-36639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-36639: --- Fix Version/s: 3.1.3 3.2.0 > SQL sequence function with interval returns unexpected error in latest > versions > --- > > Key: SPARK-36639 > URL: https://issues.apache.org/jira/browse/SPARK-36639 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2 >Reporter: Ignatiy Vdovichenko >Priority: Major > Fix For: 3.2.0, 3.1.3 > > > For example, this returns > {color:#FF}java.lang.ArrayIndexOutOfBoundsException: 1 {color} > {code:java} > select sequence( > date_trunc('month', '2021-08-30'), > date_trunc('month', '2021-08-15'), > - interval 1 month){code} > Other cases like these are all OK > {code:java} > select sequence( > date_trunc('month', '2021-07-15'), > date_trunc('month', '2021-08-30'), > interval 1 month) as x > , sequence( > date_trunc('month', '2021-08-30'), > date_trunc('month', '2021-07-15'), > - interval 1 month) as y > , sequence( > date_trunc('month', '2021-08-15'), > date_trunc('month', '2021-08-30'), > interval 1 month) as z{code} > In version 3.0.0 this works. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
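The failing boundary case in SPARK-36639 is easier to see in isolation. Below is a minimal Python sketch of inclusive month-stepped sequence semantics (the function name and the simplified month arithmetic are illustrative, not Spark's implementation): when both bounds truncate to the same month and the step is -1 month, the expected result is a one-element sequence rather than an ArrayIndexOutOfBoundsException.

```python
from datetime import date

def month_seq(start: date, stop: date, step: int) -> list[date]:
    """Inclusive sequence of dates advancing by `step` months.

    Assumes the day-of-month stays valid in every visited month, which
    holds for date_trunc('month', ...) inputs (day is always 1).
    """
    if step == 0:
        raise ValueError("step must be non-zero")
    out, cur = [], start
    while (cur <= stop) if step > 0 else (cur >= stop):
        out.append(cur)
        months = cur.month - 1 + step
        cur = date(cur.year + months // 12, months % 12 + 1, cur.day)
    return out

# The failing case: both bounds truncate to 2021-08-01 and the step is
# -1 month, so the result should be a single-element sequence.
print(month_seq(date(2021, 8, 1), date(2021, 8, 1), -1))
```
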
[jira] [Commented] (SPARK-36663) When the existing field name is a number, an error will be reported when reading the orc file
[ https://issues.apache.org/jira/browse/SPARK-36663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409492#comment-17409492 ] mcdull_zhang commented on SPARK-36663: -- cc [~hyukjin.kwon] [~cloud_fan] > When the existing field name is a number, an error will be reported when > reading the orc file > - > > Key: SPARK-36663 > URL: https://issues.apache.org/jira/browse/SPARK-36663 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2 >Reporter: mcdull_zhang >Priority: Critical > Attachments: image-2021-09-03-20-56-28-846.png > > > You can use the following methods to reproduce the problem: > {quote}val path = "file:///tmp/test_orc" > spark.range(1).withColumnRenamed("id", "100").repartition(1).write.orc(path) > spark.read.orc(path) > {quote} > The error message is like this: > {quote}org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '100' expecting {'ADD', 'AFTER' > == SQL == > struct<100:bigint> > ---^^^ > {quote} > The error is actually issued by this line of code: > {quote}CatalystSqlParser.parseDataType("100:bigint") > {quote} > > The specific background is that spark calls the above code in the process of > converting the schema of the orc file into the catalyst schema. 
> {quote}// code in OrcUtils > private def toCatalystSchema(schema: TypeDescription): StructType = > { > CharVarcharUtils.replaceCharVarcharWithStringInSchema(CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType]) > }{quote} > There are two solutions I currently think of: > # Modify the syntax analysis of SparkSQL to identify this kind of schema > # The TypeDescription.toString method should add the quote symbol to the > numeric column name, because the following syntax is supported: > {quote}CatalystSqlParser.parseDataType("`100`:bigint") > {quote} > But currently TypeDescription does not support changing the UNQUOTED_NAMES > variable; should we first submit a PR to the ORC project to support the > configuration of this variable? > !image-2021-09-03-20-56-28-846.png! > > What do Spark members think about this issue? > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36663) When the existing field name is a number, an error will be reported when reading the orc file
[ https://issues.apache.org/jira/browse/SPARK-36663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mcdull_zhang updated SPARK-36663: - Description: You can use the following methods to reproduce the problem: {quote}val path = "file:///tmp/test_orc" spark.range(1).withColumnRenamed("id", "100").repartition(1).write.orc(path) spark.read.orc(path) {quote} The error message is like this: {quote}org.apache.spark.sql.catalyst.parser.ParseException: mismatched input '100' expecting {'ADD', 'AFTER' == SQL == struct<100:bigint> ---^^^ {quote} The error is actually issued by this line of code: {quote}CatalystSqlParser.parseDataType("100:bigint") {quote} The specific background is that spark calls the above code in the process of converting the schema of the orc file into the catalyst schema. {quote}// code in OrcUtils private def toCatalystSchema(schema: TypeDescription): StructType = { CharVarcharUtils.replaceCharVarcharWithStringInSchema(CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType]) }{quote} There are two solutions I currently think of: # Modify the syntax analysis of SparkSQL to identify this kind of schema # The TypeDescription.toString method should add the quote symbol to the numeric column name, because the following syntax is supported: {quote}CatalystSqlParser.parseDataType("`100`:bigint") {quote} But currently TypeDescription does not support changing the UNQUOTED_NAMES variable; should we first submit a PR to the ORC project to support the configuration of this variable? !image-2021-09-03-20-56-28-846.png! What do Spark members think about this issue? 
was: You can use the following methods to reproduce the problem: {quote}val path = "file:///tmp/test_orc" spark.range(1).withColumnRenamed("id", "100").repartition(1).write.orc(path) spark.read.orc(path) {quote} The error message is like this: {quote}org.apache.spark.sql.catalyst.parser.ParseException: mismatched input '100' expecting {'ADD', 'AFTER' == SQL == struct<100:bigint> ---^^^ {quote} The error is actually issued by this line of code: {quote}CatalystSqlParser.parseDataType("100:bigint") {quote} The specific background is that spark calls the above code in the process of converting the schema of the orc file into the catalyst schema. {quote}// code in OrcUtils private def toCatalystSchema(schema: TypeDescription): StructType = { CharVarcharUtils.replaceCharVarcharWithStringInSchema(CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType]) }{quote} There are two solutions I currently think of: # Modify the syntax analysis of SparkSQL to identify this kind of schema # The TypeDescription.toString method should add the quote symbol to the numeric column name, because the following syntax is supported: {quote}CatalystSqlParser.parseDataType("`100`:bigint"){quote} But currently TypeDescription does not support changing the UNQUOTED_NAMES variable, should we first submit a pr to the orc project to support the configuration of this variable。 !image-2021-09-03-20-53-35-626.png! How do spark members think about this issue? 
> When the existing field name is a number, an error will be reported when > reading the orc file > - > > Key: SPARK-36663 > URL: https://issues.apache.org/jira/browse/SPARK-36663 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2 >Reporter: mcdull_zhang >Priority: Critical > Attachments: image-2021-09-03-20-56-28-846.png > > > You can use the following methods to reproduce the problem: > {quote}val path = "file:///tmp/test_orc" > spark.range(1).withColumnRenamed("id", "100").repartition(1).write.orc(path) > spark.read.orc(path) > {quote} > The error message is like this: > {quote}org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '100' expecting {'ADD', 'AFTER' > == SQL == > struct<100:bigint> > ---^^^ > {quote} > The error is actually issued by this line of code: > {quote}CatalystSqlParser.parseDataType("100:bigint") > {quote} > > The specific background is that spark calls the above code in the process of > converting the schema of the orc file into the catalyst schema. > {quote}// code in OrcUtils > private def toCatalystSchema(schema: TypeDescription): StructType = > { > CharVarcharUtils.replaceCharVarcharWithStringInSchema(CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType]) > }{quote} > There are two solutions I currently think of: > # Modify the syntax analysis of SparkSQL to identify this kind of schema > # The TypeDescription.toString method should add the quote symbol to the > numeric column name, because the following syntax is supported:
[jira] [Updated] (SPARK-36663) When the existing field name is a number, an error will be reported when reading the orc file
[ https://issues.apache.org/jira/browse/SPARK-36663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mcdull_zhang updated SPARK-36663: - Attachment: image-2021-09-03-20-56-28-846.png > When the existing field name is a number, an error will be reported when > reading the orc file > - > > Key: SPARK-36663 > URL: https://issues.apache.org/jira/browse/SPARK-36663 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2 >Reporter: mcdull_zhang >Priority: Critical > Attachments: image-2021-09-03-20-56-28-846.png > > > You can use the following methods to reproduce the problem: > {quote}val path = "file:///tmp/test_orc" > spark.range(1).withColumnRenamed("id", "100").repartition(1).write.orc(path) > spark.read.orc(path) > {quote} > The error message is like this: > {quote}org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '100' expecting {'ADD', 'AFTER' > == SQL == > struct<100:bigint> > ---^^^ > {quote} > The error is actually issued by this line of code: > {quote}CatalystSqlParser.parseDataType("100:bigint") > {quote} > > The specific background is that spark calls the above code in the process of > converting the schema of the orc file into the catalyst schema. 
> {quote}// code in OrcUtils > private def toCatalystSchema(schema: TypeDescription): StructType = { > > CharVarcharUtils.replaceCharVarcharWithStringInSchema(CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType]) > }{quote} > There are two solutions I currently think of: > # Modify the syntax analysis of SparkSQL to identify this kind of schema > # The TypeDescription.toString method should add the quote symbol to the > numeric column name, because the following syntax is supported: > {quote}CatalystSqlParser.parseDataType("`100`:bigint"){quote} > But currently TypeDescription does not support changing the UNQUOTED_NAMES > variable; should we first submit a PR to the ORC project to support the > configuration of this variable? > !image-2021-09-03-20-53-35-626.png! > > What do Spark members think about this issue? > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36663) When the existing field name is a number, an error will be reported when reading the orc file
mcdull_zhang created SPARK-36663: Summary: When the existing field name is a number, an error will be reported when reading the orc file Key: SPARK-36663 URL: https://issues.apache.org/jira/browse/SPARK-36663 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.2, 3.0.3 Reporter: mcdull_zhang You can use the following methods to reproduce the problem: {quote}val path = "file:///tmp/test_orc" spark.range(1).withColumnRenamed("id", "100").repartition(1).write.orc(path) spark.read.orc(path) {quote} The error message is like this: {quote}org.apache.spark.sql.catalyst.parser.ParseException: mismatched input '100' expecting {'ADD', 'AFTER' == SQL == struct<100:bigint> ---^^^ {quote} The error is actually issued by this line of code: {quote}CatalystSqlParser.parseDataType("100:bigint") {quote} The specific background is that spark calls the above code in the process of converting the schema of the orc file into the catalyst schema. {quote}// code in OrcUtils private def toCatalystSchema(schema: TypeDescription): StructType = { CharVarcharUtils.replaceCharVarcharWithStringInSchema(CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType]) }{quote} There are two solutions I currently think of: # Modify the syntax analysis of SparkSQL to identify this kind of schema # The TypeDescription.toString method should add the quote symbol to the numeric column name, because the following syntax is supported: {quote}CatalystSqlParser.parseDataType("`100`:bigint"){quote} But currently TypeDescription does not support changing the UNQUOTED_NAMES variable; should we first submit a PR to the ORC project to support the configuration of this variable? !image-2021-09-03-20-53-35-626.png! What do Spark members think about this issue? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
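Solution 2 above (quoting numeric column names before the schema string reaches the parser) can be sketched in a few lines of Python. The helper name and the identifier pattern below are illustrative, not ORC's actual TypeDescription logic:

```python
import re

def quote_field(name: str) -> str:
    # Backtick-quote any name that is not a plain identifier, so that
    # "100" becomes "`100`" and the schema string parses the same way
    # parseDataType("`100`:bigint") does.
    if re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
        return name
    return "`" + name + "`"

fields = {"id": "bigint", "100": "bigint"}
schema = "struct<" + ",".join(f"{quote_field(n)}:{t}" for n, t in fields.items()) + ">"
print(schema)  # struct<id:bigint,`100`:bigint>
```
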
[jira] [Resolved] (SPARK-36609) Add `errors` argument for `ps.to_numeric`.
[ https://issues.apache.org/jira/browse/SPARK-36609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-36609. -- Fix Version/s: 3.3.0 Assignee: Haejoon Lee Resolution: Fixed Fixed in https://github.com/apache/spark/pull/33882 > Add `errors` argument for `ps.to_numeric`. > -- > > Key: SPARK-36609 > URL: https://issues.apache.org/jira/browse/SPARK-36609 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.3.0 > > > To match the behavior of pandas, we should support the `errors` argument for > the `ps.to_numeric` API. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
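For reference, two of the pandas `errors` modes that `ps.to_numeric` is being aligned with can be sketched in plain Python. This is an illustrative stand-in, not the PySpark implementation: `errors="raise"` propagates the conversion error, while `errors="coerce"` turns unparsable input into NaN.

```python
import math

def to_numeric(values, errors="raise"):
    # Minimal sketch of the pandas-style `errors` contract
    # (function name and signature are illustrative):
    #   "raise"  -> invalid input raises
    #   "coerce" -> invalid input becomes NaN
    out = []
    for v in values:
        try:
            out.append(float(v))
        except (TypeError, ValueError):
            if errors == "coerce":
                out.append(math.nan)
            else:
                raise
    return out

print(to_numeric(["1", "2", "x"], errors="coerce"))  # [1.0, 2.0, nan]
```
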
[jira] [Resolved] (SPARK-36659) Promote spark.sql.execution.topKSortFallbackThreshold to user-faced config
[ https://issues.apache.org/jira/browse/SPARK-36659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-36659. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 33904 [https://github.com/apache/spark/pull/33904] > Promote spark.sql.execution.topKSortFallbackThreshold to user-faced config > -- > > Key: SPARK-36659 > URL: https://issues.apache.org/jira/browse/SPARK-36659 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0, 3.3.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > Fix For: 3.3.0 > > > spark.sql.execution.topKSortFallbackThreshold is currently an internal config > hidden from users, with Integer.MAX_VALUE - 15 as its default. In many real-world > cases, if K is very large, there can be performance issues. > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
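The rationale behind a top-K fallback threshold can be illustrated with a small Python sketch. The names and the runtime branching are illustrative only (Spark makes this choice during physical planning, not per call): a heap-based top-K is cheap when K is small, while a full sort is the safer strategy once K grows past the threshold.

```python
import heapq

def top_k(rows, k, fallback_threshold):
    # Small k: keep only k elements via a heap (ORDER BY ... LIMIT k).
    # Huge k: fall back to a full sort, mirroring the config's intent.
    if k <= fallback_threshold:
        return heapq.nsmallest(k, rows)
    return sorted(rows)[:k]

print(top_k([5, 1, 4, 2], k=2, fallback_threshold=100))  # [1, 2]
```
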
[jira] [Assigned] (SPARK-36659) Promote spark.sql.execution.topKSortFallbackThreshold to user-faced config
[ https://issues.apache.org/jira/browse/SPARK-36659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-36659: Assignee: Kent Yao > Promote spark.sql.execution.topKSortFallbackThreshold to user-faced config > -- > > Key: SPARK-36659 > URL: https://issues.apache.org/jira/browse/SPARK-36659 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0, 3.3.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > > spark.sql.execution.topKSortFallbackThreshold is currently an internal config > hidden from users, with Integer.MAX_VALUE - 15 as its default. In many real-world > cases, if K is very large, there can be performance issues. > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36661) Support TimestampNTZ in Py4J
[ https://issues.apache.org/jira/browse/SPARK-36661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36661: Assignee: (was: Apache Spark) > Support TimestampNTZ in Py4J > > > Key: SPARK-36661 > URL: https://issues.apache.org/jira/browse/SPARK-36661 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36661) Support TimestampNTZ in Py4J
[ https://issues.apache.org/jira/browse/SPARK-36661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409447#comment-17409447 ] Apache Spark commented on SPARK-36661: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/33877 > Support TimestampNTZ in Py4J > > > Key: SPARK-36661 > URL: https://issues.apache.org/jira/browse/SPARK-36661 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36661) Support TimestampNTZ in Py4J
[ https://issues.apache.org/jira/browse/SPARK-36661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36661: Assignee: Apache Spark > Support TimestampNTZ in Py4J > > > Key: SPARK-36661 > URL: https://issues.apache.org/jira/browse/SPARK-36661 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26208) Empty dataframe does not roundtrip for csv with header
[ https://issues.apache.org/jira/browse/SPARK-26208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409427#comment-17409427 ] Ranga Reddy commented on SPARK-26208: - cc [~hyukjin.kwon] > Empty dataframe does not roundtrip for csv with header > -- > > Key: SPARK-26208 > URL: https://issues.apache.org/jira/browse/SPARK-26208 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: master branch, > commit 034ae305c33b1990b3c1a284044002874c343b4d, > date: Sun Nov 18 16:02:15 2018 +0800 >Reporter: koert kuipers >Assignee: Koert Kuipers >Priority: Minor > Fix For: 3.0.0 > > > When we write an empty part file for CSV with header=true, we fail to write the > header, so the result cannot be read back in. > With header=true, a part file with zero rows should still have a header -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
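The expected behavior can be illustrated with Python's csv module as a stand-in for Spark's CSV writer: with header=true, the header row should be emitted even when there are zero data rows, so an empty file still round-trips its schema.

```python
import csv
import io

# Write a CSV with a header but no data rows.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "age"])
writer.writeheader()  # header written despite zero rows

# Reading it back recovers the column names, i.e. the schema survives.
rows = list(csv.reader(io.StringIO(buf.getvalue())))
print(rows)  # [['name', 'age']]
```
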
[jira] [Assigned] (SPARK-36662) special timestamps values support for path filters - modifiedBefore/modifiedAfter
[ https://issues.apache.org/jira/browse/SPARK-36662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36662: Assignee: Apache Spark > special timestamps values support for path filters - > modifiedBefore/modifiedAfter > - > > Key: SPARK-36662 > URL: https://issues.apache.org/jira/browse/SPARK-36662 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Major > > support today, now, tomorrow, etc in path filter modifiedBefore/modifiedAfter -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36662) special timestamps values support for path filters - modifiedBefore/modifiedAfter
[ https://issues.apache.org/jira/browse/SPARK-36662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409356#comment-17409356 ] Apache Spark commented on SPARK-36662: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/33908 > special timestamps values support for path filters - > modifiedBefore/modifiedAfter > - > > Key: SPARK-36662 > URL: https://issues.apache.org/jira/browse/SPARK-36662 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Kent Yao >Priority: Major > > support today, now, tomorrow, etc in path filter modifiedBefore/modifiedAfter -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36662) special timestamps values support for path filters - modifiedBefore/modifiedAfter
[ https://issues.apache.org/jira/browse/SPARK-36662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36662: Assignee: (was: Apache Spark) > special timestamps values support for path filters - > modifiedBefore/modifiedAfter > - > > Key: SPARK-36662 > URL: https://issues.apache.org/jira/browse/SPARK-36662 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Kent Yao >Priority: Major > > support today, now, tomorrow, etc in path filter modifiedBefore/modifiedAfter -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36644) Push down boolean column filter
[ https://issues.apache.org/jira/browse/SPARK-36644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-36644. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 33898 [https://github.com/apache/spark/pull/33898] > Push down boolean column filter > --- > > Key: SPARK-36644 > URL: https://issues.apache.org/jira/browse/SPARK-36644 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: Kazuyuki Tanimura >Assignee: Kazuyuki Tanimura >Priority: Major > Fix For: 3.3.0 > > > The following query does not push down the filter > ``` > SELECT * FROM t WHERE boolean_field > ``` > although the following query pushes down the filter as expected. > ``` > SELECT * FROM t WHERE boolean_field = true > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
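One way to picture the fix is as a normalization step that rewrites a bare boolean column reference into the equivalent equality predicate before pushdown matching. The sketch below uses a hypothetical tuple encoding for predicates, not Spark's actual Filter classes:

```python
def normalize_predicate(pred):
    # A bare string stands in for a bare boolean column reference,
    # e.g. WHERE boolean_field. Rewriting it as an explicit equality
    # (WHERE boolean_field = true) lets it match the same pushdown
    # patterns as the equality form. Encoding is illustrative.
    if isinstance(pred, str):
        return ("EqualTo", pred, True)
    return pred

print(normalize_predicate("boolean_field"))  # ('EqualTo', 'boolean_field', True)
```
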
[jira] [Assigned] (SPARK-36644) Push down boolean column filter
[ https://issues.apache.org/jira/browse/SPARK-36644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai reassigned SPARK-36644: --- Assignee: Kazuyuki Tanimura > Push down boolean column filter > --- > > Key: SPARK-36644 > URL: https://issues.apache.org/jira/browse/SPARK-36644 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: Kazuyuki Tanimura >Assignee: Kazuyuki Tanimura >Priority: Major > > The following query does not push down the filter > ``` > SELECT * FROM t WHERE boolean_field > ``` > although the following query pushes down the filter as expected. > ``` > SELECT * FROM t WHERE boolean_field = true > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36662) special timestamps values support for path filters - modifiedBefore/modifiedAfter
Kent Yao created SPARK-36662: Summary: special timestamps values support for path filters - modifiedBefore/modifiedAfter Key: SPARK-36662 URL: https://issues.apache.org/jira/browse/SPARK-36662 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Kent Yao support today, now, tomorrow, etc in path filter modifiedBefore/modifiedAfter -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
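A resolver for such special values might look like the following Python sketch. The function and the exact set of accepted strings are assumptions based on the ticket text (mirroring the special date/timestamp literals Spark SQL already accepts), not Spark's implementation:

```python
from datetime import datetime, timedelta

def resolve_special(value, now=None):
    # Hypothetical helper: map special strings to concrete timestamps,
    # pass anything else through unchanged.
    now = now or datetime.now()
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    specials = {
        "now": now,
        "today": midnight,
        "yesterday": midnight - timedelta(days=1),
        "tomorrow": midnight + timedelta(days=1),
    }
    return specials.get(value.lower(), value)

print(resolve_special("tomorrow", now=datetime(2021, 9, 3, 12, 30)))
```
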
[jira] [Assigned] (SPARK-36610) Add `thousands` argument to `ps.read_csv`.
[ https://issues.apache.org/jira/browse/SPARK-36610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36610: Assignee: Apache Spark > Add `thousands` argument to `ps.read_csv`. > -- > > Key: SPARK-36610 > URL: https://issues.apache.org/jira/browse/SPARK-36610 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Assignee: Apache Spark >Priority: Major > > When reading a CSV file in pandas, pandas automatically detects the thousands > separator if the `thousands` argument is specified. > {code:java} > >>> pd.read_csv(path, sep=";") > name agejob money > 0 Jorge 30 Developer 1,000,000 > 1Bob 32 Developer100 > >>> pd.read_csv(path, sep=";", thousands=",") > name agejobmoney > 0 Jorge 30 Developer 100 > 1Bob 32 Developer 100{code} > However, pandas-on-Spark doesn't support it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36610) Add `thousands` argument to `ps.read_csv`.
[ https://issues.apache.org/jira/browse/SPARK-36610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36610: Assignee: (was: Apache Spark) > Add `thousands` argument to `ps.read_csv`. > -- > > Key: SPARK-36610 > URL: https://issues.apache.org/jira/browse/SPARK-36610 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Priority: Major > > When reading a CSV file in pandas, pandas automatically detects the thousands > separator if the `thousands` argument is specified. > {code:java} > >>> pd.read_csv(path, sep=";") > name agejob money > 0 Jorge 30 Developer 1,000,000 > 1Bob 32 Developer100 > >>> pd.read_csv(path, sep=";", thousands=",") > name agejobmoney > 0 Jorge 30 Developer 100 > 1Bob 32 Developer 100{code} > However, pandas-on-Spark doesn't support it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36610) Add `thousands` argument to `ps.read_csv`.
[ https://issues.apache.org/jira/browse/SPARK-36610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409328#comment-17409328 ] Apache Spark commented on SPARK-36610: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/33907 > Add `thousands` argument to `ps.read_csv`. > -- > > Key: SPARK-36610 > URL: https://issues.apache.org/jira/browse/SPARK-36610 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Priority: Major > > When reading a CSV file in pandas, pandas automatically detects the thousands > separator if the `thousands` argument is specified. > {code:java} > >>> pd.read_csv(path, sep=";") > name agejob money > 0 Jorge 30 Developer 1,000,000 > 1Bob 32 Developer100 > >>> pd.read_csv(path, sep=";", thousands=",") > name agejobmoney > 0 Jorge 30 Developer 100 > 1Bob 32 Developer 100{code} > However, pandas-on-Spark doesn't support it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
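What a `thousands` argument does can be shown without pandas: strip the grouping character from numeric-looking fields before conversion. The sketch below is a simplified stand-in for pandas' `read_csv(..., thousands=",")`, with the helper name and regex being assumptions for illustration:

```python
import csv
import io
import re

def parse_with_thousands(text, sep=";", thousands=","):
    # Parse a delimited text, converting fields that look like grouped
    # integers (e.g. "1,000,000") to int; leave everything else as-is.
    rows = list(csv.reader(io.StringIO(text), delimiter=sep))
    header, data = rows[0], rows[1:]
    grouped = re.compile(r"^\d{1,3}(" + re.escape(thousands) + r"\d{3})*$")
    def convert(field):
        return int(field.replace(thousands, "")) if grouped.match(field) else field
    return header, [[convert(f) for f in row] for row in data]

header, data = parse_with_thousands("name;money\nJorge;1,000,000\nBob;100\n")
print(data)  # [['Jorge', 1000000], ['Bob', 100]]
```
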