[jira] [Updated] (SPARK-39757) Upgrade sbt from 1.7.0 to 1.7.1

2022-07-13 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-39757:

Description: 
release notes: [https://github.com/sbt/sbt/releases]

https://github.com/sbt/sbt/compare/v1.7.0...v1.7.1

sbt 1.7.1 bug fix:
 * Fixes Java incremental compilation, specifically parsing of annotations in class files, by [@SethTisue|https://github.com/SethTisue] in [sbt/zinc#|https://github.com/sbt/zinc/pull/]

  was:
release notes: [https://github.com/sbt/sbt/releases]

sbt 1.7.1 bug fix:
 * Fixes Java incremental compilation, specifically parsing of annotations in class files, by [@SethTisue|https://github.com/SethTisue] in [sbt/zinc#|https://github.com/sbt/zinc/pull/]


> Upgrade sbt from 1.7.0 to 1.7.1
> ---
>
> Key: SPARK-39757
> URL: https://issues.apache.org/jira/browse/SPARK-39757
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Major
> Fix For: 3.4.0
>
>
> release notes: [https://github.com/sbt/sbt/releases]
> https://github.com/sbt/sbt/compare/v1.7.0...v1.7.1
> sbt 1.7.1 bug fix:
>  * Fixes Java incremental compilation, specifically parsing of annotations in class files, by [@SethTisue|https://github.com/SethTisue] in [sbt/zinc#|https://github.com/sbt/zinc/pull/]
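For context, a minimal sketch of what the upgrade itself usually amounts to, assuming the sbt version is pinned in the standard project/build.properties file (the exact location in the Spark build may differ):

{code:java}
# project/build.properties (assumed standard sbt layout)
sbt.version=1.7.1
{code}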



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39749) Always use plain string representation on casting Decimal to String

2022-07-13 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-39749.

Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37160
[https://github.com/apache/spark/pull/37160]

> Always use plain string representation on casting Decimal to String
> ---
>
> Key: SPARK-39749
> URL: https://issues.apache.org/jira/browse/SPARK-39749
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, casting Decimal to String results in strings in exponential notation 
> when the adjusted exponent is less than -6. This is consistent with BigDecimal.toString: 
> [https://docs.oracle.com/javase/8/docs/api/java/math/BigDecimal.html#toString]
>  
> This differs from external databases like PostgreSQL/Oracle/MS SQL 
> Server, and it is not compliant with the ANSI SQL standard either. 
> I suggest always using the plain string representation in the cast.
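For illustration, a minimal spark-shell sketch of the behaviour being discussed (exact output depends on the Spark version; shown here is the BigDecimal-style rendering the description refers to):

{code:java}
// The adjusted exponent of 0.00000001 is -8 (< -6), so today's cast renders it
// in exponential notation rather than the plain form.
spark.sql("SELECT CAST(CAST(0.00000001 AS DECIMAL(10, 8)) AS STRING)").show(false)
// Current output:  1E-8
// Proposed output: 0.00000001
{code}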



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39758) NPE on invalid patterns from the regexp functions

2022-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566206#comment-17566206
 ] 

Apache Spark commented on SPARK-39758:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/37171

> NPE on invalid patterns from the regexp functions
> -
>
> Key: SPARK-39758
> URL: https://issues.apache.org/jira/browse/SPARK-39758
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> The example below reproduces the issue:
> {code:sql}
> spark-sql> SELECT regexp_extract('1a 2b 14m', '(?l)');
> 22/07/12 19:07:21 ERROR SparkSQLDriver: Failed in [SELECT regexp_extract('1a 
> 2b 14m', '(?l)')]
> java.lang.NullPointerException: null
>   at 
> org.apache.spark.sql.catalyst.expressions.RegExpExtractBase.getLastMatcher(regexpExpressions.scala:768)
>  ~[spark-catalyst_2.12-3.3.0.jar:3.3.0]
> {code}
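One plausible shape of a fix, sketched here for illustration only (it may not match what the linked pull request actually does), is to compile the pattern eagerly and surface a clear error instead of letting a null matcher propagate:

{code:java}
import java.util.regex.{Pattern, PatternSyntaxException}

// Hypothetical helper: validate the regexp up front and fail with a readable
// message instead of a NullPointerException later on.
def compileOrFail(regexp: String, funcName: String): Pattern =
  try {
    Pattern.compile(regexp)
  } catch {
    case e: PatternSyntaxException =>
      throw new IllegalArgumentException(
        s"Invalid pattern '$regexp' passed to $funcName: ${e.getMessage}")
  }
{code}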



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39758) NPE on invalid patterns from the regexp functions

2022-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39758:


Assignee: Apache Spark  (was: Max Gekk)

> NPE on invalid patterns from the regexp functions
> -
>
> Key: SPARK-39758
> URL: https://issues.apache.org/jira/browse/SPARK-39758
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> The example below reproduces the issue:
> {code:sql}
> spark-sql> SELECT regexp_extract('1a 2b 14m', '(?l)');
> 22/07/12 19:07:21 ERROR SparkSQLDriver: Failed in [SELECT regexp_extract('1a 
> 2b 14m', '(?l)')]
> java.lang.NullPointerException: null
>   at 
> org.apache.spark.sql.catalyst.expressions.RegExpExtractBase.getLastMatcher(regexpExpressions.scala:768)
>  ~[spark-catalyst_2.12-3.3.0.jar:3.3.0]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39758) NPE on invalid patterns from the regexp functions

2022-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39758:


Assignee: Max Gekk  (was: Apache Spark)

> NPE on invalid patterns from the regexp functions
> -
>
> Key: SPARK-39758
> URL: https://issues.apache.org/jira/browse/SPARK-39758
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> The example below reproduces the issue:
> {code:sql}
> spark-sql> SELECT regexp_extract('1a 2b 14m', '(?l)');
> 22/07/12 19:07:21 ERROR SparkSQLDriver: Failed in [SELECT regexp_extract('1a 
> 2b 14m', '(?l)')]
> java.lang.NullPointerException: null
>   at 
> org.apache.spark.sql.catalyst.expressions.RegExpExtractBase.getLastMatcher(regexpExpressions.scala:768)
>  ~[spark-catalyst_2.12-3.3.0.jar:3.3.0]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39759) Implement listIndexes in JDBC (H2 dialect)

2022-07-13 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-39759:
---

 Summary: Implement listIndexes in JDBC (H2 dialect)
 Key: SPARK-39759
 URL: https://issues.apache.org/jira/browse/SPARK-39759
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: BingKun Pan
 Fix For: 3.4.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39759) Implement listIndexes in JDBC (H2 dialect)

2022-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39759:


Assignee: (was: Apache Spark)

> Implement listIndexes in JDBC (H2 dialect)
> --
>
> Key: SPARK-39759
> URL: https://issues.apache.org/jira/browse/SPARK-39759
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39759) Implement listIndexes in JDBC (H2 dialect)

2022-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566219#comment-17566219
 ] 

Apache Spark commented on SPARK-39759:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37172

> Implement listIndexes in JDBC (H2 dialect)
> --
>
> Key: SPARK-39759
> URL: https://issues.apache.org/jira/browse/SPARK-39759
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39759) Implement listIndexes in JDBC (H2 dialect)

2022-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39759:


Assignee: Apache Spark

> Implement listIndexes in JDBC (H2 dialect)
> --
>
> Key: SPARK-39759
> URL: https://issues.apache.org/jira/browse/SPARK-39759
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18127) Add hooks and extension points to Spark

2022-07-13 Thread Badrul Chowdhury (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-18127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566221#comment-17566221
 ] 

Badrul Chowdhury commented on SPARK-18127:
--

Here is another resource where they use this feature to implement AutoExecutor 
for "serverless" Spark: [https://arxiv.org/pdf/2112.08572.pdf]

 

 

> Add hooks and extension points to Spark
> ---
>
> Key: SPARK-18127
> URL: https://issues.apache.org/jira/browse/SPARK-18127
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Srinath
>Assignee: Sameer Agarwal
>Priority: Major
> Fix For: 2.2.0
>
>
> As a Spark user I want to be able to customize my spark session. I currently 
> want to be able to do the following things:
> # I want to be able to add custom analyzer rules. This allows me to implement 
> my own logical constructs; an example of this could be a recursive operator.
> # I want to be able to add my own analysis checks. This allows me to catch 
> problems with spark plans early on. An example of this can be some datasource 
> specific checks.
> # I want to be able to add my own optimizations. This allows me to optimize 
> plans in different ways, for instance when you use a very different cluster 
> (for example a one-node X1 instance). This supersedes the current 
> {{spark.experimental}} methods
> # I want to be able to add my own planning strategies. This supersedes the 
> current {{spark.experimental}} methods. This allows me to plan my own 
> physical plan, an example of this would to plan my own heavily integrated 
> data source (CarbonData for example).
> # I want to be able to use my own customized SQL constructs. An example of 
> this would supporting my own dialect, or be able to add constructs to the 
> current SQL language. I should not have to implement a complete parser, and 
> should be able to delegate to an underlying parser.
> # I want to be able to track modifications and calls to the external catalog. 
> I want this API to be stable. This allows me to synchronize with other 
> systems.
> This API should modify the SparkSession when the session gets started, and it 
> should NOT change the session in flight.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39760) Support Varchar in PySpark

2022-07-13 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-39760:
-

 Summary: Support Varchar in PySpark
 Key: SPARK-39760
 URL: https://issues.apache.org/jira/browse/SPARK-39760
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39760) Support Varchar in PySpark

2022-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566234#comment-17566234
 ] 

Apache Spark commented on SPARK-39760:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37173

> Support Varchar in PySpark
> --
>
> Key: SPARK-39760
> URL: https://issues.apache.org/jira/browse/SPARK-39760
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39760) Support Varchar in PySpark

2022-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39760:


Assignee: Apache Spark

> Support Varchar in PySpark
> --
>
> Key: SPARK-39760
> URL: https://issues.apache.org/jira/browse/SPARK-39760
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39760) Support Varchar in PySpark

2022-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39760:


Assignee: (was: Apache Spark)

> Support Varchar in PySpark
> --
>
> Key: SPARK-39760
> URL: https://issues.apache.org/jira/browse/SPARK-39760
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39760) Support Varchar in PySpark

2022-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566235#comment-17566235
 ] 

Apache Spark commented on SPARK-39760:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37173

> Support Varchar in PySpark
> --
>
> Key: SPARK-39760
> URL: https://issues.apache.org/jira/browse/SPARK-39760
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39761) Add Apache Spark images info in running-on-kubernetes doc

2022-07-13 Thread Yikun Jiang (Jira)
Yikun Jiang created SPARK-39761:
---

 Summary: Add Apache Spark images info in running-on-kubernetes doc
 Key: SPARK-39761
 URL: https://issues.apache.org/jira/browse/SPARK-39761
 Project: Spark
  Issue Type: Bug
  Components: Documentation, Kubernetes
Affects Versions: 3.4.0
Reporter: Yikun Jiang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39761) Add Apache Spark images info in running-on-kubernetes doc

2022-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39761:


Assignee: (was: Apache Spark)

> Add Apache Spark images info in running-on-kubernetes doc
> -
>
> Key: SPARK-39761
> URL: https://issues.apache.org/jira/browse/SPARK-39761
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Kubernetes
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39761) Add Apache Spark images info in running-on-kubernetes doc

2022-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39761:


Assignee: Apache Spark

> Add Apache Spark images info in running-on-kubernetes doc
> -
>
> Key: SPARK-39761
> URL: https://issues.apache.org/jira/browse/SPARK-39761
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Kubernetes
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39761) Add Apache Spark images info in running-on-kubernetes doc

2022-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566257#comment-17566257
 ] 

Apache Spark commented on SPARK-39761:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37174

> Add Apache Spark images info in running-on-kubernetes doc
> -
>
> Key: SPARK-39761
> URL: https://issues.apache.org/jira/browse/SPARK-39761
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Kubernetes
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39761) Add Apache Spark images info in running-on-kubernetes doc

2022-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566259#comment-17566259
 ] 

Apache Spark commented on SPARK-39761:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37174

> Add Apache Spark images info in running-on-kubernetes doc
> -
>
> Key: SPARK-39761
> URL: https://issues.apache.org/jira/browse/SPARK-39761
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Kubernetes
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39757) Upgrade sbt from 1.7.0 to 1.7.1

2022-07-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-39757:


Assignee: BingKun Pan

> Upgrade sbt from 1.7.0 to 1.7.1
> ---
>
> Key: SPARK-39757
> URL: https://issues.apache.org/jira/browse/SPARK-39757
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Major
> Fix For: 3.4.0
>
>
> release notes: [https://github.com/sbt/sbt/releases]
> https://github.com/sbt/sbt/compare/v1.7.0...v1.7.1
> sbt 1.7.1 bug fix:
>  * Fixes Java incremental compilation, specifically parsing of annotations in class files, by [@SethTisue|https://github.com/SethTisue] in [sbt/zinc#|https://github.com/sbt/zinc/pull/]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39757) Upgrade sbt from 1.7.0 to 1.7.1

2022-07-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-39757.
--
Resolution: Fixed

Issue resolved by pull request 37170
[https://github.com/apache/spark/pull/37170]

> Upgrade sbt from 1.7.0 to 1.7.1
> ---
>
> Key: SPARK-39757
> URL: https://issues.apache.org/jira/browse/SPARK-39757
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Major
> Fix For: 3.4.0
>
>
> release notes: [https://github.com/sbt/sbt/releases]
> https://github.com/sbt/sbt/compare/v1.7.0...v1.7.1
> sbt 1.7.1 bug fix:
>  * Fixes Java incremental compilation, specifically parsing of annotations in class files, by [@SethTisue|https://github.com/SethTisue] in [sbt/zinc#|https://github.com/sbt/zinc/pull/]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39762) Support numpy 1.23.0 (Remove numpy<1.23.0 version limit)

2022-07-13 Thread Yikun Jiang (Jira)
Yikun Jiang created SPARK-39762:
---

 Summary: Support numpy 1.23.0 (Remove numpy<1.23.0 version limit)
 Key: SPARK-39762
 URL: https://issues.apache.org/jira/browse/SPARK-39762
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 3.4.0
Reporter: Yikun Jiang


After SPARK-39611 SPARK-39714



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39762) Support numpy 1.23.0 (Remove numpy<1.23.0 version limit)

2022-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39762:


Assignee: (was: Apache Spark)

> Support numpy 1.23.0 (Remove numpy<1.23.0 version limit)
> 
>
> Key: SPARK-39762
> URL: https://issues.apache.org/jira/browse/SPARK-39762
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>
> After SPARK-39611 SPARK-39714



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39762) Support numpy 1.23.0 (Remove numpy<1.23.0 version limit)

2022-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39762:


Assignee: Apache Spark

> Support numpy 1.23.0 (Remove numpy<1.23.0 version limit)
> 
>
> Key: SPARK-39762
> URL: https://issues.apache.org/jira/browse/SPARK-39762
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Apache Spark
>Priority: Major
>
> After SPARK-39611 SPARK-39714



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39762) Support numpy 1.23.0 (Remove numpy<1.23.0 version limit)

2022-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566270#comment-17566270
 ] 

Apache Spark commented on SPARK-39762:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37175

> Support numpy 1.23.0 (Remove numpy<1.23.0 version limit)
> 
>
> Key: SPARK-39762
> URL: https://issues.apache.org/jira/browse/SPARK-39762
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>
> After SPARK-39611 SPARK-39714



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39753) Broadcast joins should pushdown join constraints as Filter to the larger relation

2022-07-13 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-39753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566271#comment-17566271
 ] 

Victor Delépine commented on SPARK-39753:
-

[~yumwang] I'm not too familiar with the build side and Spark SQL internals, 
unfortunately :). Do you think this change is a good idea?

Let me know if there's anything I can do to help move this forward!

> Broadcast joins should pushdown join constraints as Filter to the larger 
> relation
> -
>
> Key: SPARK-39753
> URL: https://issues.apache.org/jira/browse/SPARK-39753
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0
>Reporter: Victor Delépine
>Priority: Major
>
> SPARK-19609 was bulk-closed a while ago, but not fixed. I've decided to 
> re-open it here for more visibility, since I believe this bug has a major 
> impact and that fixing it could drastically improve the performance of many 
> pipelines.
> Allow me to paste the initial description again here:
> _For broadcast inner-joins, where the smaller relation is known to be small 
> enough to materialize on a worker, the set of values for all join columns is 
> known and fits in memory. Spark should translate these values into a 
> {{Filter}} pushed down to the datasource. The common join condition of 
> equality, i.e. {{lhs.a == rhs.a}}, can be written as an {{a in ...}} 
> clause. An example of pushing such filters is already present in the form of 
> {{IsNotNull}} filters via [~sameerag]'s work on SPARK-12957 subtasks._
> _This optimization could even work when the smaller relation does not fit 
> entirely in memory. This could be done by partitioning the smaller relation 
> into N pieces, applying this predicate pushdown for each piece, and unioning 
> the results._
>  
> Essentially, when doing a Broadcast join, the smaller side can be used to 
> filter down the bigger side before performing the join. As of today, the join 
> will read all partitions of the bigger side, without pruning partitions.
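As a concrete illustration of the effect being requested, here is a manual DataFrame-level sketch of the same idea (the {{small}} and {{big}} DataFrames and the join column {{a}} are hypothetical); the issue asks the optimizer to do this automatically:

{code:java}
import org.apache.spark.sql.functions.{broadcast, col}

// Collect the join keys of the small (broadcastable) side...
val smallKeys = small.select("a").distinct().collect().map(_.get(0))

// ...and push them down as an IN-style filter on the large side before joining,
// so the datasource can prune files/partitions that cannot match.
val prunedBig = big.filter(col("a").isin(smallKeys: _*))
val joined    = prunedBig.join(broadcast(small), "a")
{code}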



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39762) Support numpy 1.23.0 (Remove numpy<1.23.0 version limit)

2022-07-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-39762:


Assignee: Yikun Jiang

> Support numpy 1.23.0 (Remove numpy<1.23.0 version limit)
> 
>
> Key: SPARK-39762
> URL: https://issues.apache.org/jira/browse/SPARK-39762
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
>
> After SPARK-39611 SPARK-39714



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39762) Support numpy 1.23.0 (Remove numpy<1.23.0 version limit)

2022-07-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-39762.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37175
[https://github.com/apache/spark/pull/37175]

> Support numpy 1.23.0 (Remove numpy<1.23.0 version limit)
> 
>
> Key: SPARK-39762
> URL: https://issues.apache.org/jira/browse/SPARK-39762
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.4.0
>
>
> After SPARK-39611 SPARK-39714



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39763) Executor memory footprint substantially increases while reading zstd compressed parquet files

2022-07-13 Thread Yeachan Park (Jira)
Yeachan Park created SPARK-39763:


 Summary: Executor memory footprint substantially increases while 
reading zstd compressed parquet files
 Key: SPARK-39763
 URL: https://issues.apache.org/jira/browse/SPARK-39763
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: Yeachan Park


Hi all,

 

While transitioning from the default snappy compression to zstd, we noticed a 
substantial increase in executor memory whilst reading and processing zstd 
compressed parquet files.

Memory footprint increased nearly 3-fold in some cases.

Reading and processing files in snappy and writing to zstd did not result in 
this behaviour.

To reproduce:
 # Set "spark.sql.parquet.compression.codec" to zstd
 # Write some parquet files, the compression will default to zstd after setting 
the option above
 # Read the compressed zstd file and run some transformations. Compare the memory usage of the executor vs running the same transformation on a parquet file with snappy compression (a minimal sketch of these steps follows below).
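A minimal sketch of the reproduction steps above (the DataFrame {{df}}, the paths and the column name are hypothetical):

{code:java}
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")

// Steps 1-2: write parquet; with the setting above it is zstd-compressed.
df.write.mode("overwrite").parquet("/tmp/parquet_zstd")

// Step 3: read it back and run a transformation, then compare executor memory
// against the same job run on a snappy-compressed copy of the data.
val zstdDf = spark.read.parquet("/tmp/parquet_zstd")
zstdDf.groupBy("some_col").count().write.mode("overwrite").parquet("/tmp/out_zstd")
{code}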



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39763) Executor memory footprint substantially increases while reading zstd compressed parquet files

2022-07-13 Thread Yeachan Park (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yeachan Park updated SPARK-39763:
-
Description: 
Hi all,

 

While transitioning from the default snappy compression to zstd, we noticed a 
substantial increase in executor memory whilst *reading* and applying 
transformations on *zstd* compressed parquet files.

Memory footprint increased 3-fold in some cases, compared to reading and applying 
the same transformations on a parquet file compressed with snappy.

This behaviour only occurs when reading zstd compressed parquet files. Writing 
a zstd parquet file does not result in this behaviour.

To reproduce:
 # Set "spark.sql.parquet.compression.codec" to zstd
 # Write some parquet files, the compression will default to zstd after setting 
the option above
 # Read the compressed zstd file and run some transformations. Compare the 
memory usage of the executor vs running the same transformation on a parquet 
file with snappy compression.

  was:
Hi all,

 

While transitioning from the default snappy compression to zstd, we noticed a 
substantial increase in executor memory whilst *reading* and applying 
transformations on *zstd* compressed parquet files.

Memory footprint increased nearly 3-fold in some cases, compared to reading and 
applying the same transformations on a parquet file compressed with snappy.

This behaviour only occurs when reading zstd compressed parquet files. Writing 
a zstd parquet file does not result in this behaviour.

To reproduce:
 # Set "spark.sql.parquet.compression.codec" to zstd
 # Write some parquet files, the compression will default to zstd after setting 
the option above
 # Read the compressed zstd file and run some transformations. Compare the 
memory usage of the executor vs running the same transformation on a parquet 
file with snappy compression.


> Executor memory footprint substantially increases while reading zstd 
> compressed parquet files
> -
>
> Key: SPARK-39763
> URL: https://issues.apache.org/jira/browse/SPARK-39763
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Yeachan Park
>Priority: Minor
>
> Hi all,
>  
> While transitioning from the default snappy compression to zstd, we noticed a 
> substantial increase in executor memory whilst *reading* and applying 
> transformations on *zstd* compressed parquet files.
> Memory footprint increased 3-fold in some cases, compared to reading and 
> applying the same transformations on a parquet file compressed with snappy.
> This behaviour only occurs when reading zstd compressed parquet files. 
> Writing a zstd parquet file does not result in this behaviour.
> To reproduce:
>  # Set "spark.sql.parquet.compression.codec" to zstd
>  # Write some parquet files, the compression will default to zstd after 
> setting the option above
>  # Read the compressed zstd file and run some transformations. Compare the 
> memory usage of the executor vs running the same transformation on a parquet 
> file with snappy compression.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39763) Executor memory footprint substantially increases while reading zstd compressed parquet files

2022-07-13 Thread Yeachan Park (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yeachan Park updated SPARK-39763:
-
Description: 
Hi all,

 

While transitioning from the default snappy compression to zstd, we noticed a 
substantial increase in executor memory whilst *reading* and applying 
transformations on *zstd* compressed parquet files.

Memory footprint increased nearly 3-fold in some cases, compared to reading and 
applying the same transformations on a parquet file compressed with snappy.

This behaviour only occurs when reading zstd compressed parquet files. Writing 
a zstd parquet file does not result in this behaviour.

To reproduce:
 # Set "spark.sql.parquet.compression.codec" to zstd
 # Write some parquet files, the compression will default to zstd after setting 
the option above
 # Read the compressed zstd file and run some transformations. Compare the 
memory usage of the executor vs running the same transformation on a parquet 
file with snappy compression.

  was:
Hi all,

 

While transitioning from the default snappy compression to zstd, we noticed a 
substantial increase in executor memory whilst reading and processing zstd 
compressed parquet files.

Memory footprint increased nearly 3-fold in some cases.

Reading and processing files in snappy and writing to zstd did not result in 
this behaviour.

To reproduce:
 # Set "spark.sql.parquet.compression.codec" to zstd
 # Write some parquet files, the compression will default to zstd after setting 
the option above
 # Read the compressed zstd file and run some transformations. Compare the 
memory usage of the executor vs running the same transformation on a parquet 
file with snappy compression.


> Executor memory footprint substantially increases while reading zstd 
> compressed parquet files
> -
>
> Key: SPARK-39763
> URL: https://issues.apache.org/jira/browse/SPARK-39763
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Yeachan Park
>Priority: Minor
>
> Hi all,
>  
> While transitioning from the default snappy compression to zstd, we noticed a 
> substantial increase in executor memory whilst *reading* and applying 
> transformations on *zstd* compressed parquet files.
> Memory footprint increased nearly 3-fold in some cases, compared to reading 
> and applying the same transformations on a parquet file compressed with snappy.
> This behaviour only occurs when reading zstd compressed parquet files. 
> Writing a zstd parquet file does not result in this behaviour.
> To reproduce:
>  # Set "spark.sql.parquet.compression.codec" to zstd
>  # Write some parquet files, the compression will default to zstd after 
> setting the option above
>  # Read the compressed zstd file and run some transformations. Compare the 
> memory usage of the executor vs running the same transformation on a parquet 
> file with snappy compression.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39764) Make PhysicalOperation the same as ScanOperation

2022-07-13 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-39764:
---

 Summary: Make PhysicalOperation the same as ScanOperation
 Key: SPARK-39764
 URL: https://issues.apache.org/jira/browse/SPARK-39764
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39764) Make PhysicalOperation the same as ScanOperation

2022-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39764:


Assignee: (was: Apache Spark)

> Make PhysicalOperation the same as ScanOperation
> 
>
> Key: SPARK-39764
> URL: https://issues.apache.org/jira/browse/SPARK-39764
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39764) Make PhysicalOperation the same as ScanOperation

2022-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39764:


Assignee: Apache Spark

> Make PhysicalOperation the same as ScanOperation
> 
>
> Key: SPARK-39764
> URL: https://issues.apache.org/jira/browse/SPARK-39764
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39764) Make PhysicalOperation the same as ScanOperation

2022-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566302#comment-17566302
 ] 

Apache Spark commented on SPARK-39764:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/37176

> Make PhysicalOperation the same as ScanOperation
> 
>
> Key: SPARK-39764
> URL: https://issues.apache.org/jira/browse/SPARK-39764
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39758) NPE on invalid patterns from the regexp functions

2022-07-13 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-39758.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37171
[https://github.com/apache/spark/pull/37171]

> NPE on invalid patterns from the regexp functions
> -
>
> Key: SPARK-39758
> URL: https://issues.apache.org/jira/browse/SPARK-39758
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>
> The example below reproduces the issue:
> {code:sql}
> spark-sql> SELECT regexp_extract('1a 2b 14m', '(?l)');
> 22/07/12 19:07:21 ERROR SparkSQLDriver: Failed in [SELECT regexp_extract('1a 
> 2b 14m', '(?l)')]
> java.lang.NullPointerException: null
>   at 
> org.apache.spark.sql.catalyst.expressions.RegExpExtractBase.getLastMatcher(regexpExpressions.scala:768)
>  ~[spark-catalyst_2.12-3.3.0.jar:3.3.0]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39765) Logging the exception of detect jdbc table exist

2022-07-13 Thread Yi kaifei (Jira)
Yi kaifei created SPARK-39765:
-

 Summary: Logging the exception of detect jdbc table exist
 Key: SPARK-39765
 URL: https://issues.apache.org/jira/browse/SPARK-39765
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Yi kaifei


The exception may be the known `TableAlreadyExist`-style exception, which is normal 
and simply means the table does not exist; but it may also be some other exception 
even though the table does in fact exist, and that will cause an error later when 
the table is recreated. We need to be able to see the exception stack.
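A simplified sketch of the idea (signatures reduced for illustration; this is not necessarily how the actual change is implemented): log the failure instead of silently treating every exception as "table does not exist". It assumes the enclosing class mixes in Spark's Logging trait for {{logWarning}}.

{code:java}
import java.sql.{Connection, SQLException}

// Sketch only: probe the table with the dialect's existence query and keep the
// exception stack when the probe fails.
def tableExists(conn: Connection, tableExistsQuery: String): Boolean = {
  try {
    val stmt = conn.prepareStatement(tableExistsQuery)
    try { stmt.executeQuery(); true } finally stmt.close()
  } catch {
    case e: SQLException =>
      // Previously swallowed: the failure may mean "table is missing", but it
      // may also be an unrelated error while the table actually exists.
      logWarning(s"Failed to check whether the table exists: ${e.getMessage}", e)
      false
  }
}
{code}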



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39765) Logging the exception of detect jdbc table exist

2022-07-13 Thread Yi kaifei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi kaifei updated SPARK-39765:
--
Description: Logging the exception of detect jdbc table exist in 
`JdbcUtils.tableExists`  (was: The exception may be the known `TableAlreadyExist`-style 
exception, which is normal and simply means the table does not exist, but it may 
also be some other exception even though the table does in fact exist, and that 
will cause an error later when the table is recreated. We need to be able to see 
the exception stack.)

> Logging the exception of detect jdbc table exist
> 
>
> Key: SPARK-39765
> URL: https://issues.apache.org/jira/browse/SPARK-39765
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yi kaifei
>Priority: Trivial
>
> Logging the exception of detect jdbc table exist in `JdbcUtils.tableExists`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39766) For the `arrayOfAnyAsSeq` scenario in `GenericArrayDataBenchmark`, using Scala 2.13 is slower than Scala 2.12

2022-07-13 Thread Yang Jie (Jira)
Yang Jie created SPARK-39766:


 Summary: For the `arrayOfAnyAsSeq` scenario in 
`GenericArrayDataBenchmark`, using Scala 2.13 is slower than Scala 2.12
 Key: SPARK-39766
 URL: https://issues.apache.org/jira/browse/SPARK-39766
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Yang Jie


Run `GenericArrayDataBenchmark` with Scala 2.13 and 2.12, for the 
`arrayOfAnyAsSeq` scenario in `GenericArrayDataBenchmark`, using Scala 2.13 is 
slower than Scala 2.12:

 

Scala 2.12

 
{code:java}
OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.13.0-1021-azure
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
constructor:                              Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
arrayOfAnyAsSeq                                      25             29           2        395.1           2.5       0.1X
{code}
 

Scala 2.13

 
{code:java}
OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
constructor:                              Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
arrayOfAnyAsSeq                                     241            243           1         41.4          24.1       0.0X
{code}
 

 

the test code as follows:

 
{code:java}
benchmark.addCase("arrayOfAnyAsSeq") { _ =>
  val arr: Seq[Any] = new Array[Any](arraySize)
  var n = 0
  while (n < valuesPerIteration) {
new GenericArrayData(arr)
n += 1
  }
} {code}
 

 

the constructor of GenericArrayData as follows:

 

 
{code:java}
def this(seq: scala.collection.Seq[Any]) = this(seq.toArray) {code}
 

The performance difference is due to the following reasons:

 

When using Scala 2.12:

The class type of `arr` is `s.c.mutable.WrappedArray$ofRef`; `toArray` returns 
`array.asInstanceOf[Array[U]]`, so there is no memory copy.

 

When using Scala 2.13:

 

The class type of `arr` is `s.c.immutable.ArraySeq$ofRef`; `toArray` calls 
`IterableOnceOps#toArray`, whose implementation copies the underlying memory.

 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39766) For the `arrayOfAnyAsSeq` scenario in `GenericArrayDataBenchmark`, using Scala 2.13 is slower than Scala 2.12

2022-07-13 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-39766:
-
Description: 
Run `GenericArrayDataBenchmark` with Scala 2.13 and 2.12, for the 
`arrayOfAnyAsSeq` scenario in `GenericArrayDataBenchmark`, using Scala 2.13 is 
slower than Scala 2.12:

*Scala 2.12*
{code:java}
OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.13.0-1021-azure
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
constructor:                              Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
arrayOfAnyAsSeq                                      25             29           2        395.1           2.5       0.1X
{code}
*Scala 2.13*
{code:java}
OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
constructor:                              Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
arrayOfAnyAsSeq                                     241            243           1         41.4          24.1       0.0X
{code}
the test code as follows:
{code:java}
benchmark.addCase("arrayOfAnyAsSeq") { _ =>
  val arr: Seq[Any] = new Array[Any](arraySize)
  var n = 0
  while (n < valuesPerIteration) {
new GenericArrayData(arr)
n += 1
  }
} {code}
the constructor of GenericArrayData as follows:
{code:java}
def this(seq: scala.collection.Seq[Any]) = this(seq.toArray) {code}
 

The performance difference is due to the following reasons:

*When using Scala 2.12:*

The class type of `arr` is `s.c.mutable.WrappedArray$ofRef`; `toArray` returns 
`array.asInstanceOf[Array[U]]`, so there is no memory copy.

*When using Scala 2.13:*

The class type of `arr` is `s.c.immutable.ArraySeq$ofRef`; `toArray` calls 
`IterableOnceOps#toArray`, whose implementation copies the underlying memory.

  was:
Run `GenericArrayDataBenchmark` with Scala 2.13 and 2.12, for the 
`arrayOfAnyAsSeq` scenario in `GenericArrayDataBenchmark`, using Scala 2.13 is 
slower than Scala 2.12:

 

Scala 2.12

 
{code:java}
OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.13.0-1021-azure
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
constructor:                              Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
arrayOfAnyAsSeq                                      25             29           2        395.1           2.5       0.1X
{code}
 

Scala 2.13

 
{code:java}
OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
constructor:                              Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
arrayOfAnyAsSeq                                     241            243           1         41.4          24.1       0.0X
{code}
 

 

the test code as follows:

 
{code:java}
benchmark.addCase("arrayOfAnyAsSeq") { _ =>
  val arr: Seq[Any] = new Array[Any](arraySize)
  var n = 0
  while (n < valuesPerIteration) {
new GenericArrayData(arr)
n += 1
  }
} {code}
 

 

the constructor of GenericArrayData as follows:

 

 
{code:java}
def this(seq: scala.collection.Seq[Any]) = this(seq.toArray) {code}
 

The performance difference is due to the following reasons:

 

When using Scala 2.12:

The class type of `arr` is `s.c.mutable.WrappedArray$ofRef`; `toArray` returns 
`array.asInstanceOf[Array[U]]`, so there is no memory copy.

 

When using Scala 2.13:

 

The class type of `arr` is `s.c.immutable.ArraySeq$ofRef`; `toArray` calls 
`IterableOnceOps#toArray`, whose implementation copies the underlying memory.

 

 

 


> For the `arrayOfAnyAsSeq` scenario in `GenericArrayDataBenchmark`, using 
> Scala 2.13 is slower than Scala 2.12
> -
>
> Key: SPARK-39766
> URL: https://issues.apache.org/jira/browse/SPARK-39766
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>
> Run `GenericArrayDataBenchmark` with Scala 2.13 and 2.12, for the 
> `arrayOfAnyAsSeq` scenario in `GenericArrayDataBenchmark`, using Scala 2.13 
> is slower than Scala 2.12:
> *Scala 2.12*
> {code:java}
> OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.13.0-1021-azure

[jira] [Commented] (SPARK-39765) Logging the exception of detect jdbc table exist

2022-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566340#comment-17566340
 ] 

Apache Spark commented on SPARK-39765:
--

User 'yikf' has created a pull request for this issue:
https://github.com/apache/spark/pull/37177

> Logging the exception of detect jdbc table exist
> 
>
> Key: SPARK-39765
> URL: https://issues.apache.org/jira/browse/SPARK-39765
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yi kaifei
>Priority: Trivial
>
> Logging the exception of detect jdbc table exist in `JdbcUtils.tableExists`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39765) Logging the exception of detect jdbc table exist

2022-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39765:


Assignee: (was: Apache Spark)

> Logging the exception of detect jdbc table exist
> 
>
> Key: SPARK-39765
> URL: https://issues.apache.org/jira/browse/SPARK-39765
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yi kaifei
>Priority: Trivial
>
> Logging the exception of detect jdbc table exist in `JdbcUtils.tableExists`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39765) Logging the exception of detect jdbc table exist

2022-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39765:


Assignee: Apache Spark

> Logging the exception of detect jdbc table exist
> 
>
> Key: SPARK-39765
> URL: https://issues.apache.org/jira/browse/SPARK-39765
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yi kaifei
>Assignee: Apache Spark
>Priority: Trivial
>
> Logging the exception of detect jdbc table exist in `JdbcUtils.tableExists`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39765) Logging the exception of detect jdbc table exist

2022-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566342#comment-17566342
 ] 

Apache Spark commented on SPARK-39765:
--

User 'yikf' has created a pull request for this issue:
https://github.com/apache/spark/pull/37177

> Logging the exception of detect jdbc table exist
> 
>
> Key: SPARK-39765
> URL: https://issues.apache.org/jira/browse/SPARK-39765
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yi kaifei
>Priority: Trivial
>
> Logging the exception of detect jdbc table exist in `JdbcUtils.tableExists`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-39766) For the `arrayOfAnyAsSeq` scenario in `GenericArrayDataBenchmark`, using Scala 2.13 is slower than Scala 2.12

2022-07-13 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566355#comment-17566355
 ] 

Yang Jie edited comment on SPARK-39766 at 7/13/22 2:44 PM:
---

Should we add a new GenericArrayData for Scala 2.13? Then we could rewrite `def 
this(seq: scala.collection.Seq[Any])` to optimize this scenario, for example:
{code:java}
def this(seq: scala.collection.Seq[Any]) = this(seq match {
  case ias: scala.collection.immutable.ArraySeq.ofRef[_] =>
ias.unsafeArray.asInstanceOf[Array[Any]]
  case mas: scala.collection.mutable.ArraySeq.ofRef[_] =>
mas.array.asInstanceOf[Array[Any]]
  case _ => seq.toArray
}) {code}
with above change, the results of  `arrayOfAnyAsSeq` scenario as follows:
{code:java}
OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
constructor:                              Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
arrayOfAnyAsSeq                                       4              4           0       2491.6           0.4       1.0X
{code}
*The rate from `41.4M/s` to `2491.6M/s`*

 


was (Author: luciferyang):
Should we add a new GenericArrayData for Scala 2.13? Then we could rewrite `def 
this(seq: scala.collection.Seq[Any])` to optimize this scenario, for example:

 
{code:java}
def this(seq: scala.collection.Seq[Any]) = this(seq match {
  case ias: scala.collection.immutable.ArraySeq.ofRef[_] =>
ias.unsafeArray.asInstanceOf[Array[Any]]
  case mas: scala.collection.mutable.ArraySeq.ofRef[_] =>
mas.array.asInstanceOf[Array[Any]]
  case _ => seq.toArray
}) {code}
 

 

with above change, the results of  `arrayOfAnyAsSeq` scenario as follows:

 

 
{code:java}
OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
constructor:                              Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
arrayOfAnyAsSeq                                       4              4           0       2491.6           0.4       1.0X
{code}
 

 

Rate from `41.4M/s` to `2491.6M/s`

 

> For the `arrayOfAnyAsSeq` scenario in `GenericArrayDataBenchmark`, using 
> Scala 2.13 is slower than Scala 2.12
> -
>
> Key: SPARK-39766
> URL: https://issues.apache.org/jira/browse/SPARK-39766
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>
> Run `GenericArrayDataBenchmark` with Scala 2.13 and 2.12, for the 
> `arrayOfAnyAsSeq` scenario in `GenericArrayDataBenchmark`, using Scala 2.13 
> is slower than Scala 2.12:
> *Scala 2.12*
> {code:java}
> OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.13.0-1021-azure
> Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
> constructor:  Best Time(ms)   Avg Time(ms)   
> Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
> 
> arrayOfAnyAsSeq  25 29
>2395.1   2.5   0.1X{code}
> *Scala 2.13*
> {code:java}
> OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
> Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
> constructor:  Best Time(ms)   Avg Time(ms)   
> Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
> 
> arrayOfAnyAsSeq 241243
>1 41.4  24.1   0.0X {code}
> the test code as follows:
> {code:java}
> benchmark.addCase("arrayOfAnyAsSeq") { _ =>
>   val arr: Seq[Any] = new Array[Any](arraySize)
>   var n = 0
>   while (n < valuesPerIteration) {
> new GenericArrayData(arr)
> n += 1
>   }
> } {code}
> the constructor of GenericArrayData as follows:
> {code:java}
> def this(seq: scala.collection.Seq[Any]) = this(seq.toArray) {code}
>  
> The performance difference is due to the following reasons:
> *When using Scala 2.12:*
> The class type of `arr` is `s.c.mutable.WrappedArray$ofRef`, 
> `toArray` return `array.asInstanceOf[Array[U]]`, there is no memory copy.
> *When using Scala 2.13:*
> The class type of `arr` is `s.c.immutable.ArraySeq$ofRef`, `toArray` will 
> call `IterableOnceOps#toArray`, the corresponding implementation uses memory 
> copy.

[jira] [Commented] (SPARK-39766) For the `arrayOfAnyAsSeq` scenario in `GenericArrayDataBenchmark`, using Scala 2.13 is slower than Scala 2.12

2022-07-13 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566355#comment-17566355
 ] 

Yang Jie commented on SPARK-39766:
--

Should we add a new GenericArrayData for Scala 2.13? Then we can rewrite `def 
this(seq: scala.collection.Seq[Any])` to optimize this scenario, for example:

 
{code:java}
def this(seq: scala.collection.Seq[Any]) = this(seq match {
  case ias: scala.collection.immutable.ArraySeq.ofRef[_] =>
ias.unsafeArray.asInstanceOf[Array[Any]]
  case mas: scala.collection.mutable.ArraySeq.ofRef[_] =>
mas.array.asInstanceOf[Array[Any]]
  case _ => seq.toArray
}) {code}
 

 

With the above change, the results of the `arrayOfAnyAsSeq` scenario are as follows:

 

 
{code:java}
OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
constructor:                              Best Time(ms)   Avg Time(ms)   
Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative

arrayOfAnyAsSeq                                       4              4          
 0       2491.6           0.4       1.0X
 {code}
 

 

The rate improves from `41.4M/s` to `2491.6M/s`.

 

> For the `arrayOfAnyAsSeq` scenario in `GenericArrayDataBenchmark`, using 
> Scala 2.13 is slower than Scala 2.12
> -
>
> Key: SPARK-39766
> URL: https://issues.apache.org/jira/browse/SPARK-39766
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>
> Run `GenericArrayDataBenchmark` with Scala 2.13 and 2.12, for the 
> `arrayOfAnyAsSeq` scenario in `GenericArrayDataBenchmark`, using Scala 2.13 
> is slower than Scala 2.12:
> *Scala 2.12*
> {code:java}
> OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.13.0-1021-azure
> Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
> constructor:  Best Time(ms)   Avg Time(ms)   
> Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
> 
> arrayOfAnyAsSeq  25 29
>2395.1   2.5   0.1X{code}
> *Scala 2.13*
> {code:java}
> OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
> Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
> constructor:  Best Time(ms)   Avg Time(ms)   
> Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
> 
> arrayOfAnyAsSeq 241243
>1 41.4  24.1   0.0X {code}
> the test code as follows:
> {code:java}
> benchmark.addCase("arrayOfAnyAsSeq") { _ =>
>   val arr: Seq[Any] = new Array[Any](arraySize)
>   var n = 0
>   while (n < valuesPerIteration) {
> new GenericArrayData(arr)
> n += 1
>   }
> } {code}
> the constructor of GenericArrayData as follows:
> {code:java}
> def this(seq: scala.collection.Seq[Any]) = this(seq.toArray) {code}
>  
> The performance difference is due to the following reasons:
> *When using Scala 2.12:*
> The class type of `arr` is `s.c.mutable.WrappedArray$ofRef`, 
> `toArray` return `array.asInstanceOf[Array[U]]`, there is no memory copy.
> *When using Scala 2.13:*
> The class type of `arr` is `s.c.immutable.ArraySeq$ofRef`, `toArray` will 
> call `IterableOnceOps#toArray`, the corresponding implementation uses memory 
> copy.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39751) Better naming for hash aggregate key probing metric

2022-07-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-39751:
---

Assignee: Cheng Su

> Better naming for hash aggregate key probing metric
> ---
>
> Key: SPARK-39751
> URL: https://issues.apache.org/jira/browse/SPARK-39751
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Trivial
>
> Hash aggregate has a SQL metric to record average probes per key, but it has 
> a very obscure name called "avg hash probe bucket list iters". We should give 
> it a better name to avoid confusing users.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39751) Better naming for hash aggregate key probing metric

2022-07-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-39751.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37164
[https://github.com/apache/spark/pull/37164]

> Better naming for hash aggregate key probing metric
> ---
>
> Key: SPARK-39751
> URL: https://issues.apache.org/jira/browse/SPARK-39751
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Trivial
> Fix For: 3.4.0
>
>
> Hash aggregate has a SQL metric to record average probes per key, but it has 
> a very obscure name called "avg hash probe bucket list iters". We should give 
> it a better name to avoid confusing users.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39766) For the `arrayOfAnyAsSeq` scenario in `GenericArrayDataBenchmark`, using Scala 2.13 is slower than Scala 2.12

2022-07-13 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566360#comment-17566360
 ] 

Yang Jie commented on SPARK-39766:
--

Or should we rewrite `toArray` method in Scala 2.13.x? [~srowen] 

> For the `arrayOfAnyAsSeq` scenario in `GenericArrayDataBenchmark`, using 
> Scala 2.13 is slower than Scala 2.12
> -
>
> Key: SPARK-39766
> URL: https://issues.apache.org/jira/browse/SPARK-39766
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>
> Run `GenericArrayDataBenchmark` with Scala 2.13 and 2.12, for the 
> `arrayOfAnyAsSeq` scenario in `GenericArrayDataBenchmark`, using Scala 2.13 
> is slower than Scala 2.12:
> *Scala 2.12*
> {code:java}
> OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.13.0-1021-azure
> Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
> constructor:  Best Time(ms)   Avg Time(ms)   
> Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
> 
> arrayOfAnyAsSeq  25 29
>2395.1   2.5   0.1X{code}
> *Scala 2.13*
> {code:java}
> OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
> Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
> constructor:  Best Time(ms)   Avg Time(ms)   
> Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
> 
> arrayOfAnyAsSeq 241243
>1 41.4  24.1   0.0X {code}
> the test code as follows:
> {code:java}
> benchmark.addCase("arrayOfAnyAsSeq") { _ =>
>   val arr: Seq[Any] = new Array[Any](arraySize)
>   var n = 0
>   while (n < valuesPerIteration) {
> new GenericArrayData(arr)
> n += 1
>   }
> } {code}
> the constructor of GenericArrayData as follows:
> {code:java}
> def this(seq: scala.collection.Seq[Any]) = this(seq.toArray) {code}
>  
> The performance difference is due to the following reasons:
> *When using Scala 2.12:*
> The class type of `arr` is `s.c.mutable.WrappedArray$ofRef`, 
> `toArray` return `array.asInstanceOf[Array[U]]`, there is no memory copy.
> *When using Scala 2.13:*
> The class type of `arr` is `s.c.immutable.ArraySeq$ofRef`, `toArray` will 
> call `IterableOnceOps#toArray`, the corresponding implementation uses memory 
> copy.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39766) For the `arrayOfAnyAsSeq` scenario in `GenericArrayDataBenchmark`, using Scala 2.13 is slower than Scala 2.12

2022-07-13 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-39766:
-
Priority: Minor  (was: Major)

> For the `arrayOfAnyAsSeq` scenario in `GenericArrayDataBenchmark`, using 
> Scala 2.13 is slower than Scala 2.12
> -
>
> Key: SPARK-39766
> URL: https://issues.apache.org/jira/browse/SPARK-39766
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> Run `GenericArrayDataBenchmark` with Scala 2.13 and 2.12, for the 
> `arrayOfAnyAsSeq` scenario in `GenericArrayDataBenchmark`, using Scala 2.13 
> is slower than Scala 2.12:
> *Scala 2.12*
> {code:java}
> OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.13.0-1021-azure
> Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
> constructor:  Best Time(ms)   Avg Time(ms)   
> Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
> 
> arrayOfAnyAsSeq  25 29
>2395.1   2.5   0.1X{code}
> *Scala 2.13*
> {code:java}
> OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
> Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
> constructor:  Best Time(ms)   Avg Time(ms)   
> Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
> 
> arrayOfAnyAsSeq 241243
>1 41.4  24.1   0.0X {code}
> the test code as follows:
> {code:java}
> benchmark.addCase("arrayOfAnyAsSeq") { _ =>
>   val arr: Seq[Any] = new Array[Any](arraySize)
>   var n = 0
>   while (n < valuesPerIteration) {
> new GenericArrayData(arr)
> n += 1
>   }
> } {code}
> the constructor of GenericArrayData as follows:
> {code:java}
> def this(seq: scala.collection.Seq[Any]) = this(seq.toArray) {code}
>  
> The performance difference is due to the following reasons:
> *When using Scala 2.12:*
> The class type of `arr` is `s.c.mutable.WrappedArray$ofRef`, 
> `toArray` return `array.asInstanceOf[Array[U]]`, there is no memory copy.
> *When using Scala 2.13:*
> The class type of `arr` is `s.c.immutable.ArraySeq$ofRef`, `toArray` will 
> call `IterableOnceOps#toArray`, the corresponding implementation uses memory 
> copy.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39766) For the `arrayOfAnyAsSeq` scenario in `GenericArrayDataBenchmark`, using Scala 2.13 is slower than Scala 2.12

2022-07-13 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566364#comment-17566364
 ] 

Sean R. Owen commented on SPARK-39766:
--

How would you rewrite Scala's toArray? You mean write a toArray method in Spark 
that's different for 2.13? Either approach is fine, whichever needs the least 
code. Yes, let's fix this.

> For the `arrayOfAnyAsSeq` scenario in `GenericArrayDataBenchmark`, using 
> Scala 2.13 is slower than Scala 2.12
> -
>
> Key: SPARK-39766
> URL: https://issues.apache.org/jira/browse/SPARK-39766
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> Run `GenericArrayDataBenchmark` with Scala 2.13 and 2.12, for the 
> `arrayOfAnyAsSeq` scenario in `GenericArrayDataBenchmark`, using Scala 2.13 
> is slower than Scala 2.12:
> *Scala 2.12*
> {code:java}
> OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.13.0-1021-azure
> Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
> constructor:  Best Time(ms)   Avg Time(ms)   
> Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
> 
> arrayOfAnyAsSeq  25 29
>2395.1   2.5   0.1X{code}
> *Scala 2.13*
> {code:java}
> OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
> Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
> constructor:  Best Time(ms)   Avg Time(ms)   
> Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
> 
> arrayOfAnyAsSeq 241243
>1 41.4  24.1   0.0X {code}
> the test code as follows:
> {code:java}
> benchmark.addCase("arrayOfAnyAsSeq") { _ =>
>   val arr: Seq[Any] = new Array[Any](arraySize)
>   var n = 0
>   while (n < valuesPerIteration) {
> new GenericArrayData(arr)
> n += 1
>   }
> } {code}
> the constructor of GenericArrayData as follows:
> {code:java}
> def this(seq: scala.collection.Seq[Any]) = this(seq.toArray) {code}
>  
> The performance difference is due to the following reasons:
> *When using Scala 2.12:*
> The class type of `arr` is `s.c.mutable.WrappedArray$ofRef`, 
> `toArray` return `array.asInstanceOf[Array[U]]`, there is no memory copy.
> *When using Scala 2.13:*
> The class type of `arr` is `s.c.immutable.ArraySeq$ofRef`, `toArray` will 
> call `IterableOnceOps#toArray`, the corresponding implementation uses memory 
> copy.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39766) For the `arrayOfAnyAsSeq` scenario in `GenericArrayDataBenchmark`, using Scala 2.13 is slower than Scala 2.12

2022-07-13 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566367#comment-17566367
 ] 

Yang Jie commented on SPARK-39766:
--

OK, let me try to fix this.

I'm also going to submit an issue to the Scala community to discuss the 
difference in `toArray` behavior between 2.12 and 2.13.

> For the `arrayOfAnyAsSeq` scenario in `GenericArrayDataBenchmark`, using 
> Scala 2.13 is slower than Scala 2.12
> -
>
> Key: SPARK-39766
> URL: https://issues.apache.org/jira/browse/SPARK-39766
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> Run `GenericArrayDataBenchmark` with Scala 2.13 and 2.12, for the 
> `arrayOfAnyAsSeq` scenario in `GenericArrayDataBenchmark`, using Scala 2.13 
> is slower than Scala 2.12:
> *Scala 2.12*
> {code:java}
> OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.13.0-1021-azure
> Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
> constructor:  Best Time(ms)   Avg Time(ms)   
> Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
> 
> arrayOfAnyAsSeq  25 29
>2395.1   2.5   0.1X{code}
> *Scala 2.13*
> {code:java}
> OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
> Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
> constructor:  Best Time(ms)   Avg Time(ms)   
> Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
> 
> arrayOfAnyAsSeq 241243
>1 41.4  24.1   0.0X {code}
> the test code as follows:
> {code:java}
> benchmark.addCase("arrayOfAnyAsSeq") { _ =>
>   val arr: Seq[Any] = new Array[Any](arraySize)
>   var n = 0
>   while (n < valuesPerIteration) {
> new GenericArrayData(arr)
> n += 1
>   }
> } {code}
> the constructor of GenericArrayData as follows:
> {code:java}
> def this(seq: scala.collection.Seq[Any]) = this(seq.toArray) {code}
>  
> The performance difference is due to the following reasons:
> *When using Scala 2.12:*
> The class type of `arr` is `s.c.mutable.WrappedArray$ofRef`, 
> `toArray` return `array.asInstanceOf[Array[U]]`, there is no memory copy.
> *When using Scala 2.13:*
> The class type of `arr` is `s.c.immutable.ArraySeq$ofRef`, `toArray` will 
> call `IterableOnceOps#toArray`, the corresponding implementation uses memory 
> copy.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39767) Remove UnresolvedDBObjectName and add UnresolvedIdentifier

2022-07-13 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-39767:
---

 Summary: Remove UnresolvedDBObjectName and add UnresolvedIdentifier
 Key: SPARK-39767
 URL: https://issues.apache.org/jira/browse/SPARK-39767
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39767) Remove UnresolvedDBObjectName and add UnresolvedIdentifier

2022-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39767:


Assignee: Apache Spark

> Remove UnresolvedDBObjectName and add UnresolvedIdentifier
> --
>
> Key: SPARK-39767
> URL: https://issues.apache.org/jira/browse/SPARK-39767
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39767) Remove UnresolvedDBObjectName and add UnresolvedIdentifier

2022-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566396#comment-17566396
 ] 

Apache Spark commented on SPARK-39767:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/37178

> Remove UnresolvedDBObjectName and add UnresolvedIdentifier
> --
>
> Key: SPARK-39767
> URL: https://issues.apache.org/jira/browse/SPARK-39767
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39767) Remove UnresolvedDBObjectName and add UnresolvedIdentifier

2022-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39767:


Assignee: (was: Apache Spark)

> Remove UnresolvedDBObjectName and add UnresolvedIdentifier
> --
>
> Key: SPARK-39767
> URL: https://issues.apache.org/jira/browse/SPARK-39767
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39768) Strip any CLRF character if lineSep is not set in CSV data source

2022-07-13 Thread Yaohua Zhao (Jira)
Yaohua Zhao created SPARK-39768:
---

 Summary: Strip any CLRF character if lineSep is not set in CSV 
data source
 Key: SPARK-39768
 URL: https://issues.apache.org/jira/browse/SPARK-39768
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Yaohua Zhao


If `lineSep` is not set, the line separator is automatically detected. To be 
safe, we should strip any CLRF character at the suffix in the column names.
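
A hedged illustration of the intended behaviour (the helper name below is made up
for this sketch and is not the actual Spark change): a header cell read as
"name\r" should yield the column name "name".
{code:java}
// Hedged sketch: drop a trailing CR/LF left over from an auto-detected line
// separator before using a header cell as a column name.
def stripTrailingCRLF(colName: String): String =
  colName.stripSuffix("\r\n").stripSuffix("\n").stripSuffix("\r")

// Seq("id", "name\r", "value\r\n") becomes Seq("id", "name", "value")
Seq("id", "name\r", "value\r\n").map(stripTrailingCRLF)
{code}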



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39768) Strip any CLRF character if lineSep is not set in CSV data source

2022-07-13 Thread Yaohua Zhao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566417#comment-17566417
 ] 

Yaohua Zhao commented on SPARK-39768:
-

cc @[~hyukjin.kwon] 

> Strip any CLRF character if lineSep is not set in CSV data source
> -
>
> Key: SPARK-39768
> URL: https://issues.apache.org/jira/browse/SPARK-39768
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yaohua Zhao
>Priority: Minor
>
> If `lineSep` is not set, the line separator is automatically detected. To be 
> safe, we should strip any CLRF character at the suffix in the column names.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39755) SPARK_LOCAL_DIRS locations are not randomized in K8s

2022-07-13 Thread pralabhkumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566430#comment-17566430
 ] 

pralabhkumar commented on SPARK-39755:
--

[~hyukjin.kwon], please let me know if the above suggestion is correct (we are 
facing a similar issue to the one mentioned in SPARK-24992) when running Spark 
on K8s. I'll implement the same.

> SPARK_LOCAL_DIRS locations are not randomized in K8s
> 
>
> Key: SPARK-39755
> URL: https://issues.apache.org/jira/browse/SPARK-39755
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.3.0
>Reporter: pralabhkumar
>Priority: Minor
>
> In org.apache.spark.util  getConfiguredLocalDirs  
>  
> {code:java}
> if (isRunningInYarnContainer(conf)) {
>   // If we are in yarn mode, systems can have different disk layouts so we 
> must set it
>   // to what Yarn on this system said was available. Note this assumes that 
> Yarn has
>   // created the directories already, and that they are secured so that only 
> the
>   // user has access to them.
>   randomizeInPlace(getYarnLocalDirs(conf).split(","))
> } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
>   conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
> } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
>   conf.getenv("SPARK_LOCAL_DIRS").split(",")
> }{code}
> randomizeInPlace is not called on conf.getenv("SPARK_LOCAL_DIRS").split(","), 
> which is what is used in the K8s case, so the shuffle locations are not 
> randomized. 
> IMHO, this should be randomized, so that all the directories have an equal 
> chance of receiving data, as is done on the YARN side.
>  
>  
>  
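
A hedged sketch of the suggestion above. Spark already has `Utils.randomizeInPlace`
(used in the YARN branch); `Random.shuffle` is used here only to keep the snippet
self-contained, and the directory names are placeholders:
{code:java}
import scala.util.Random

// Hedged sketch: shuffle the SPARK_LOCAL_DIRS entries, as the YARN branch already
// does via randomizeInPlace, so executors do not all favour the first directory.
val localDirs = sys.env.getOrElse("SPARK_LOCAL_DIRS", "/tmp/spark-a,/tmp/spark-b,/tmp/spark-c")
val randomized = Random.shuffle(localDirs.split(",").toSeq)
println(randomized.mkString(","))
{code}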



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39768) Strip any CRLF character if lineSep is not set in CSV data source

2022-07-13 Thread Yaohua Zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaohua Zhao updated SPARK-39768:

Description: If `lineSep` is not set, the line separator is automatically 
detected. To be safe, we should strip any _CRLF_ character at the suffix in the 
column names.  (was: If `lineSep` is not set, the line separator is 
automatically detected. To be safe, we should strip any CLRF character at the 
suffix in the column names.)

> Strip any CRLF character if lineSep is not set in CSV data source
> -
>
> Key: SPARK-39768
> URL: https://issues.apache.org/jira/browse/SPARK-39768
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yaohua Zhao
>Priority: Minor
>
> If `lineSep` is not set, the line separator is automatically detected. To be 
> safe, we should strip any _CRLF_ character at the suffix in the column names.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39768) Strip any CRLF character if lineSep is not set in CSV data source

2022-07-13 Thread Yaohua Zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaohua Zhao updated SPARK-39768:

Summary: Strip any CRLF character if lineSep is not set in CSV data source  
(was: Strip any CLRF character if lineSep is not set in CSV data source)

> Strip any CRLF character if lineSep is not set in CSV data source
> -
>
> Key: SPARK-39768
> URL: https://issues.apache.org/jira/browse/SPARK-39768
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yaohua Zhao
>Priority: Minor
>
> If `lineSep` is not set, the line separator is automatically detected. To be 
> safe, we should strip any CLRF character at the suffix in the column names.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39769) Rename trait Unevaluable

2022-07-13 Thread Ted Yu (Jira)
Ted Yu created SPARK-39769:
--

 Summary: Rename trait Unevaluable
 Key: SPARK-39769
 URL: https://issues.apache.org/jira/browse/SPARK-39769
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.3.0
Reporter: Ted Yu


I came upon `trait Unevaluable` which is defined in 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala

Unevaluable is not a word.

There are `valuable` and `invaluable`, but I have never seen Unevaluable.

This issue renames the trait to Unevaluatable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39768) Strip any CRLF character if lineSep is not set in CSV data source

2022-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39768:


Assignee: (was: Apache Spark)

> Strip any CRLF character if lineSep is not set in CSV data source
> -
>
> Key: SPARK-39768
> URL: https://issues.apache.org/jira/browse/SPARK-39768
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yaohua Zhao
>Priority: Minor
>
> If `lineSep` is not set, the line separator is automatically detected. To be 
> safe, we should strip any _CRLF_ character at the suffix in the column names.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39768) Strip any CRLF character if lineSep is not set in CSV data source

2022-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39768:


Assignee: Apache Spark

> Strip any CRLF character if lineSep is not set in CSV data source
> -
>
> Key: SPARK-39768
> URL: https://issues.apache.org/jira/browse/SPARK-39768
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yaohua Zhao
>Assignee: Apache Spark
>Priority: Minor
>
> If `lineSep` is not set, the line separator is automatically detected. To be 
> safe, we should strip any _CRLF_ character at the suffix in the column names.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39768) Strip any CRLF character if lineSep is not set in CSV data source

2022-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566451#comment-17566451
 ] 

Apache Spark commented on SPARK-39768:
--

User 'Yaohua628' has created a pull request for this issue:
https://github.com/apache/spark/pull/37180

> Strip any CRLF character if lineSep is not set in CSV data source
> -
>
> Key: SPARK-39768
> URL: https://issues.apache.org/jira/browse/SPARK-39768
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yaohua Zhao
>Priority: Minor
>
> If `lineSep` is not set, the line separator is automatically detected. To be 
> safe, we should strip any _CRLF_ character at the suffix in the column names.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39768) Strip any CRLF character if lineSep is not set in CSV data source

2022-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566452#comment-17566452
 ] 

Apache Spark commented on SPARK-39768:
--

User 'Yaohua628' has created a pull request for this issue:
https://github.com/apache/spark/pull/37180

> Strip any CRLF character if lineSep is not set in CSV data source
> -
>
> Key: SPARK-39768
> URL: https://issues.apache.org/jira/browse/SPARK-39768
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yaohua Zhao
>Priority: Minor
>
> If `lineSep` is not set, the line separator is automatically detected. To be 
> safe, we should strip any _CRLF_ character at the suffix in the column names.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39758) NPE on invalid patterns from the regexp functions

2022-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566482#comment-17566482
 ] 

Apache Spark commented on SPARK-39758:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/37181

> NPE on invalid patterns from the regexp functions
> -
>
> Key: SPARK-39758
> URL: https://issues.apache.org/jira/browse/SPARK-39758
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>
> The example below reproduces the issue:
> {code:sql}
> spark-sql> SELECT regexp_extract('1a 2b 14m', '(?l)');
> 22/07/12 19:07:21 ERROR SparkSQLDriver: Failed in [SELECT regexp_extract('1a 
> 2b 14m', '(?l)')]
> java.lang.NullPointerException: null
>   at 
> org.apache.spark.sql.catalyst.expressions.RegExpExtractBase.getLastMatcher(regexpExpressions.scala:768)
>  ~[spark-catalyst_2.12-3.3.0.jar:3.3.0]
> {code}
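
A hedged sketch of the general direction (not necessarily what the linked pull
requests do): compile the pattern eagerly and convert `PatternSyntaxException`
into an explanatory error instead of letting a null matcher surface later as an NPE.
{code:java}
import java.util.regex.{Pattern, PatternSyntaxException}

// Hedged sketch: an invalid pattern such as "(?l)" should fail with a clear message.
def compilePattern(regexp: String): Pattern =
  try {
    Pattern.compile(regexp)
  } catch {
    case e: PatternSyntaxException =>
      throw new IllegalArgumentException(
        s"The regexp pattern '$regexp' is invalid: ${e.getDescription}", e)
  }
{code}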



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39758) NPE on invalid patterns from the regexp functions

2022-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566483#comment-17566483
 ] 

Apache Spark commented on SPARK-39758:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/37181

> NPE on invalid patterns from the regexp functions
> -
>
> Key: SPARK-39758
> URL: https://issues.apache.org/jira/browse/SPARK-39758
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>
> The example below reproduces the issue:
> {code:sql}
> spark-sql> SELECT regexp_extract('1a 2b 14m', '(?l)');
> 22/07/12 19:07:21 ERROR SparkSQLDriver: Failed in [SELECT regexp_extract('1a 
> 2b 14m', '(?l)')]
> java.lang.NullPointerException: null
>   at 
> org.apache.spark.sql.catalyst.expressions.RegExpExtractBase.getLastMatcher(regexpExpressions.scala:768)
>  ~[spark-catalyst_2.12-3.3.0.jar:3.3.0]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39758) NPE on invalid patterns from the regexp functions

2022-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566496#comment-17566496
 ] 

Apache Spark commented on SPARK-39758:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/37182

> NPE on invalid patterns from the regexp functions
> -
>
> Key: SPARK-39758
> URL: https://issues.apache.org/jira/browse/SPARK-39758
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>
> The example below reproduces the issue:
> {code:sql}
> spark-sql> SELECT regexp_extract('1a 2b 14m', '(?l)');
> 22/07/12 19:07:21 ERROR SparkSQLDriver: Failed in [SELECT regexp_extract('1a 
> 2b 14m', '(?l)')]
> java.lang.NullPointerException: null
>   at 
> org.apache.spark.sql.catalyst.expressions.RegExpExtractBase.getLastMatcher(regexpExpressions.scala:768)
>  ~[spark-catalyst_2.12-3.3.0.jar:3.3.0]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39758) NPE on invalid patterns from the regexp functions

2022-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566497#comment-17566497
 ] 

Apache Spark commented on SPARK-39758:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/37182

> NPE on invalid patterns from the regexp functions
> -
>
> Key: SPARK-39758
> URL: https://issues.apache.org/jira/browse/SPARK-39758
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>
> The example below reproduces the issue:
> {code:sql}
> spark-sql> SELECT regexp_extract('1a 2b 14m', '(?l)');
> 22/07/12 19:07:21 ERROR SparkSQLDriver: Failed in [SELECT regexp_extract('1a 
> 2b 14m', '(?l)')]
> java.lang.NullPointerException: null
>   at 
> org.apache.spark.sql.catalyst.expressions.RegExpExtractBase.getLastMatcher(regexpExpressions.scala:768)
>  ~[spark-catalyst_2.12-3.3.0.jar:3.3.0]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39770) Support Avro schema evolution

2022-07-13 Thread koert kuipers (Jira)
koert kuipers created SPARK-39770:
-

 Summary: Support Avro schema evolution
 Key: SPARK-39770
 URL: https://issues.apache.org/jira/browse/SPARK-39770
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: koert kuipers


Currently the Avro source in connector/avro does not yet support schema 
evolution.
from source code of AvroUtils:
{code:java}
// Schema evolution is not supported yet. Here we only pick first random 
readable sample file to
// figure out the schema of the whole dataset.
 {code}
I added schema evolution for our in-house Spark version. If there is interest in 
this I could contribute it.
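
For context, a hedged illustration of today's common workaround rather than the
proposal itself: callers who need a stable reader schema across files can pass one
explicitly via the existing `avroSchema` option. This assumes a SparkSession named
`spark` with the spark-avro module on the classpath; the record definition and
path are placeholders.
{code:java}
// Hedged illustration: supply an explicit reader schema instead of relying on the
// schema sampled from a single file.
val readerSchema =
  """{"type": "record", "name": "Event", "fields": [
    |  {"name": "id", "type": "long"},
    |  {"name": "note", "type": ["null", "string"], "default": null}
    |]}""".stripMargin

val df = spark.read
  .format("avro")
  .option("avroSchema", readerSchema)
  .load("/data/events/*.avro")
{code}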

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39770) Support Avro schema evolution

2022-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566502#comment-17566502
 ] 

Apache Spark commented on SPARK-39770:
--

User 'koertkuipers' has created a pull request for this issue:
https://github.com/apache/spark/pull/37183

> Support Avro schema evolution
> -
>
> Key: SPARK-39770
> URL: https://issues.apache.org/jira/browse/SPARK-39770
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: koert kuipers
>Priority: Minor
>  Labels: avro
>
> currently the avro source in connector/avro does not yet support schema 
> evolution.
> from source code of AvroUtils:
> {code:java}
> // Schema evolution is not supported yet. Here we only pick first random 
> readable sample file to
> // figure out the schema of the whole dataset.
>  {code}
> i added schema evolution for our inhouse spark version. if there is interest 
> in this i could contribute it.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39770) Support Avro schema evolution

2022-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39770:


Assignee: (was: Apache Spark)

> Support Avro schema evolution
> -
>
> Key: SPARK-39770
> URL: https://issues.apache.org/jira/browse/SPARK-39770
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: koert kuipers
>Priority: Minor
>  Labels: avro
>
> currently the avro source in connector/avro does not yet support schema 
> evolution.
> from source code of AvroUtils:
> {code:java}
> // Schema evolution is not supported yet. Here we only pick first random 
> readable sample file to
> // figure out the schema of the whole dataset.
>  {code}
> i added schema evolution for our inhouse spark version. if there is interest 
> in this i could contribute it.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39770) Support Avro schema evolution

2022-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39770:


Assignee: Apache Spark

> Support Avro schema evolution
> -
>
> Key: SPARK-39770
> URL: https://issues.apache.org/jira/browse/SPARK-39770
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: koert kuipers
>Assignee: Apache Spark
>Priority: Minor
>  Labels: avro
>
> currently the avro source in connector/avro does not yet support schema 
> evolution.
> from source code of AvroUtils:
> {code:java}
> // Schema evolution is not supported yet. Here we only pick first random 
> readable sample file to
> // figure out the schema of the whole dataset.
>  {code}
> i added schema evolution for our inhouse spark version. if there is interest 
> in this i could contribute it.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39747) pandas and pandas on Spark API parameter naming difference

2022-07-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566562#comment-17566562
 ] 

Hyukjin Kwon commented on SPARK-39747:
--

The reason is that we don't support a buffer for now (since that strictly 
assumes that the data is local). But probably we should fall back to pandas for 
the time being.

> pandas and pandas on Spark API parameter naming difference 
> ---
>
> Key: SPARK-39747
> URL: https://issues.apache.org/jira/browse/SPARK-39747
> Project: Spark
>  Issue Type: Improvement
>  Components: Pandas API on Spark
>Affects Versions: 3.3.0
>Reporter: Chenyang Zhang
>Priority: Major
>
> I noticed there are some parameter naming differences between pandas and 
> pandas on Spark. For example, in "read_csv", the path parameter is 
> "filepath_or_buffer" for pandas and "path" for pandas on Spark. I wonder why 
> such a difference exists and may I ask to change it to match exactly the same 
> in pandas. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39771) If spark.default.parallelism is unset, RDD defaultPartitioner may pick a value that is too large to successfully run

2022-07-13 Thread Josh Rosen (Jira)
Josh Rosen created SPARK-39771:
--

 Summary: If spark.default.parallelism is unset, RDD 
defaultPartitioner may pick a value that is too large to successfully run
 Key: SPARK-39771
 URL: https://issues.apache.org/jira/browse/SPARK-39771
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Josh Rosen


[According to its 
docs|https://github.com/apache/spark/blob/899f6c90eb2de5b46a36710a131d7417010ce4b3/core/src/main/scala/org/apache/spark/Partitioner.scala#L45-L65],
 {{Partitioner.defaultPartitioner}} will use the maximum number of RDD 
partitions as its partition count when {{spark.default.parallelism}} is not 
set. If that number of upstream partitions is very large then this can result 
in shuffles where {{{}numMappers * numReducers = numMappers^2{}}}, which can 
cause various problems that prevent the job from successfully running.

To help users identify when they have run into this problem, I think we should 
add warning logs to Spark.

As an example of the problem, let's say that I have an RDD with 100,000 
partitions and then do a {{reduceByKey}} on it without specifying an explicit 
partitioner or partition count. In this case, Spark will plan a reduce stage 
with 100,000 partitions:
{code:java}
scala> sc.parallelize(1 to 100000, 100000).map(x => (x, x)).reduceByKey(_ + _).toDebugString
res7: String =
(100000) ShuffledRDD[21] at reduceByKey at <console>:25 []
 +-(100000) MapPartitionsRDD[20] at map at <console>:25 []
    |  ParallelCollectionRDD[19] at parallelize at <console>:25 []
{code}
This results in the creation of 10 billion shuffle blocks, so if this job 
_does_ run it is likely to be extremely slow. However, it's more likely that 
the driver will crash when serializing map output statuses: if we were able to 
use one bit per mapper / reducer pair (which is probably overly optimistic in 
terms of compressibility) then the map statuses would be ~1.25 gigabytes!

I don't think that users are likely to intentionally wind up in this scenario: 
it's more likely that either (a) their job depends on 
{{spark.default.parallelism}} being set but it was run on an environment 
lacking a value for that config, or (b) their input data significantly grew in 
size. These scenarios may be rare, but they can be frustrating to debug 
(especially if a failure occurs midway through a long-running job).

I think we should do something to handle this scenario.

A good starting point might be for {{Partitioner.defaultPartitioner}} to log a 
warning when the default partition size exceeds some threshold.

In addition, I think it might be a good idea to log a similar warning in 
{{MapOutputTrackerMaster}} right before we start trying to serialize map 
statuses: in a real-world situation where this problem cropped up, the map 
stage ran successfully but the driver crashed when serializing map statuses. 
Putting a warning about partition counts here makes it more likely that users 
will spot that error in the logs and be able to identify the source of the 
problem (compared to a warning that appears much earlier in the job and 
therefore much farther from the likely site of a crash).
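
A hedged sketch of the first suggestion; the helper name, threshold, and message
are assumptions rather than an actual patch. For the 100,000-partition example
above: 100,000 x 100,000 = 10^10 shuffle blocks, and at one bit per mapper/reducer
pair that is 10^10 / 8 = 1.25 x 10^9 bytes of map statuses, i.e. the ~1.25
gigabytes mentioned.
{code:java}
// Hedged sketch, not the actual Spark change: the shape of a warning that
// Partitioner.defaultPartitioner could emit when spark.default.parallelism is unset.
def warnIfDefaultPartitionsLookTooLarge(maxUpstreamPartitions: Int, threshold: Int = 10000): Unit = {
  if (maxUpstreamPartitions > threshold) {
    val blocks = maxUpstreamPartitions.toLong * maxUpstreamPartitions
    Console.err.println(
      s"WARN: spark.default.parallelism is not set; defaulting to $maxUpstreamPartitions " +
      s"reduce partitions may create up to $blocks shuffle blocks and very large map statuses.")
  }
}

warnIfDefaultPartitionsLookTooLarge(100000)  // the scenario described above
{code}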



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39771) If spark.default.parallelism is unset, RDD defaultPartitioner may pick a value that is too large to successfully run

2022-07-13 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-39771:
---
Description: 
[According to its 
docs|https://github.com/apache/spark/blob/899f6c90eb2de5b46a36710a131d7417010ce4b3/core/src/main/scala/org/apache/spark/Partitioner.scala#L45-L65],
 {{Partitioner.defaultPartitioner}} will use the maximum number of RDD 
partitions as its partition count when {{spark.default.parallelism}} is not 
set. If that number of upstream partitions is very large then this can result 
in shuffles where {{{}numMappers * numReducers = numMappers^2{}}}, which can 
cause various problems that prevent the job from successfully running.

To help users identify when they have run into this problem, I think we should 
add warning logs to Spark.

As an example of the problem, let's say that I have an RDD with 100,000 
partitions and then do a {{reduceByKey}} on it without specifying an explicit 
partitioner or partition count. In this case, Spark will plan a reduce stage 
with 100,000 partitions:
{code:java}
scala> sc.parallelize(1 to 100000, 100000).map(x => (x, x)).reduceByKey(_ + _).toDebugString
res7: String =
(100000) ShuffledRDD[21] at reduceByKey at <console>:25 []
 +-(100000) MapPartitionsRDD[20] at map at <console>:25 []
    |  ParallelCollectionRDD[19] at parallelize at <console>:25 []
{code}
This results in the creation of 10 billion shuffle blocks, so if this job 
_does_ run it is likely to be extremely slow. However, it's more likely that 
the driver will crash when serializing map output statuses: if we were able to 
use one bit per mapper / reducer pair (which is probably overly optimistic in 
terms of compressibility) then the map statuses would be ~1.25 gigabytes (and 
the actual size is probably much larger)!

I don't think that users are likely to intentionally wind up in this scenario: 
it's more likely that either (a) their job depends on 
{{spark.default.parallelism}} being set but it was run on an environment 
lacking a value for that config, or (b) their input data significantly grew in 
size. These scenarios may be rare, but they can be frustrating to debug 
(especially if a failure occurs midway through a long-running job).

I think we should do something to handle this scenario.

A good starting point might be for {{Partitioner.defaultPartitioner}} to log a 
warning when the default partition size exceeds some threshold.

In addition, I think it might be a good idea to log a similar warning in 
{{MapOutputTrackerMaster}} right before we start trying to serialize map 
statuses: in a real-world situation where this problem cropped up, the map 
stage ran successfully but the driver crashed when serializing map statuses. 
Putting a warning about partition counts here makes it more likely that users 
will spot that error in the logs and be able to identify the source of the 
problem (compared to a warning that appears much earlier in the job and 
therefore much farther from the likely site of a crash).

  was:
[According to its 
docs|https://github.com/apache/spark/blob/899f6c90eb2de5b46a36710a131d7417010ce4b3/core/src/main/scala/org/apache/spark/Partitioner.scala#L45-L65],
 {{Partitioner.defaultPartitioner}} will use the maximum number of RDD 
partitions as its partition count when {{spark.default.parallelism}} is not 
set. If that number of upstream partitions is very large then this can result 
in shuffles where {{{}numMappers * numReducers = numMappers^2{}}}, which can 
cause various problems that prevent the job from successfully running.

To help users identify when they have run into this problem, I think we should 
add warning logs to Spark.

As an example of the problem, let's say that I have an RDD with 100,000 
partitions and then do a {{reduceByKey}} on it without specifying an explicit 
partitioner or partition count. In this case, Spark will plan a reduce stage 
with 100,000 partitions:
{code:java}
scala> sc.parallelize(1 to 100000, 100000).map(x => (x, x)).reduceByKey(_ + _).toDebugString
res7: String =
(100000) ShuffledRDD[21] at reduceByKey at <console>:25 []
 +-(100000) MapPartitionsRDD[20] at map at <console>:25 []
    |  ParallelCollectionRDD[19] at parallelize at <console>:25 []
{code}
This results in the creation of 10 billion shuffle blocks, so if this job 
_does_ run it is likely to be extremely slow. However, it's more likely that 
the driver will crash when serializing map output statuses: if we were able to 
use one bit per mapper / reducer pair (which is probably overly optimistic in 
terms of compressibility) then the map statuses would be ~1.25 gigabytes!

I don't think that users are likely to intentionally wind up in this scenario: 
it's more likely that either (a) their job depends on 
{{spark.default.parallelism}} being set but it was run on an environment 
lacking a value for that config, or (b) their input data significantly grew in 
size. These scenarios may be rare, but they can be frustrating to debug 
(especially if a failure occurs midway through a long-running job).

[jira] [Created] (SPARK-39772) namespace should not contain null to avoid potential serde problem

2022-07-13 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-39772:
-

 Summary: namespace should not contain null to avoid potential 
serde problem
 Key: SPARK-39772
 URL: https://issues.apache.org/jira/browse/SPARK-39772
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39772) namespace should not contain null to avoid potential serde problem

2022-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566570#comment-17566570
 ] 

Apache Spark commented on SPARK-39772:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37184

> namespace should not contain null to avoid potential serde problem
> --
>
> Key: SPARK-39772
> URL: https://issues.apache.org/jira/browse/SPARK-39772
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39772) namespace should not contain null to avoid potential serde problem

2022-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39772:


Assignee: Apache Spark

> namespace should not contain null to avoid potential serde problem
> --
>
> Key: SPARK-39772
> URL: https://issues.apache.org/jira/browse/SPARK-39772
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39772) namespace should not contain null to avoid potential serde problem

2022-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39772:


Assignee: (was: Apache Spark)

> namespace should not contain null to avoid potential serde problem
> --
>
> Key: SPARK-39772
> URL: https://issues.apache.org/jira/browse/SPARK-39772
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39773) Update document of JDBC options for pushDownOffset

2022-07-13 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-39773:
--

 Summary: Update document of JDBC options for pushDownOffset
 Key: SPARK-39773
 URL: https://issues.apache.org/jira/browse/SPARK-39773
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: jiaan.geng


Because the DS V2 pushdown framework added a new JDBC option, pushDownOffset, 
for offset pushdown, we should update sql-data-sources-jdbc.md.
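
A hedged example of the kind of snippet the updated page could show; the URL and
table name are placeholders, and a SparkSession named `spark` is assumed.
{code:java}
// Hedged doc-style example: opt in to OFFSET pushdown for a JDBC read.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")
  .option("dbtable", "public.events")
  .option("pushDownOffset", "true")
  .load()
{code}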



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39766) For the `arrayOfAnyAsSeq` scenario in `GenericArrayDataBenchmark`, using Scala 2.13 is slower than Scala 2.12

2022-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39766:


Assignee: (was: Apache Spark)

> For the `arrayOfAnyAsSeq` scenario in `GenericArrayDataBenchmark`, using 
> Scala 2.13 is slower than Scala 2.12
> -
>
> Key: SPARK-39766
> URL: https://issues.apache.org/jira/browse/SPARK-39766
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> Run `GenericArrayDataBenchmark` with Scala 2.13 and 2.12, for the 
> `arrayOfAnyAsSeq` scenario in `GenericArrayDataBenchmark`, using Scala 2.13 
> is slower than Scala 2.12:
> *Scala 2.12*
> {code:java}
> OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.13.0-1021-azure
> Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
> constructor:  Best Time(ms)   Avg Time(ms)   
> Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
> 
> arrayOfAnyAsSeq  25 29
>2395.1   2.5   0.1X{code}
> *Scala 2.13*
> {code:java}
> OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
> Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
> constructor:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
> ----------------------------------------------------------------------------------------------------------------
> arrayOfAnyAsSeq                             241            243           1        41.4          24.1       0.0X {code}
> The test code is as follows:
> {code:java}
> benchmark.addCase("arrayOfAnyAsSeq") { _ =>
>   val arr: Seq[Any] = new Array[Any](arraySize)
>   var n = 0
>   while (n < valuesPerIteration) {
> new GenericArrayData(arr)
> n += 1
>   }
> } {code}
> The constructor of GenericArrayData is as follows:
> {code:java}
> def this(seq: scala.collection.Seq[Any]) = this(seq.toArray) {code}
>  
> The performance difference is due to the following reasons:
> *When using Scala 2.12:*
> The class type of `arr` is `s.c.mutable.WrappedArray$ofRef`, and `toArray` 
> returns `array.asInstanceOf[Array[U]]`, so there is no memory copy.
> *When using Scala 2.13:*
> The class type of `arr` is `s.c.immutable.ArraySeq$ofRef`, and `toArray` calls 
> `IterableOnceOps#toArray`, whose implementation performs a memory copy.
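
For illustration, a small self-contained sketch (not from the ticket) of the copy behavior described above on Scala 2.13: toArray on an immutable.ArraySeq copies, while unwrapping the backing array via unsafeArray does not. The helper name seqToArray is made up for this example:

{code:java}
import scala.collection.immutable.ArraySeq

object ToArrayCopySketch {
  // Made-up helper: return the backing array without a copy when the Seq is an
  // immutable.ArraySeq over references; otherwise fall back to the copying toArray.
  def seqToArray(seq: scala.collection.Seq[Any]): Array[Any] = seq match {
    case refs: ArraySeq.ofRef[_] => refs.unsafeArray.asInstanceOf[Array[Any]]
    case other                   => other.toArray
  }

  def main(args: Array[String]): Unit = {
    val backing = new Array[Any](4)
    val seq: scala.collection.Seq[Any] = ArraySeq.unsafeWrapArray(backing)

    // toArray goes through IterableOnceOps#toArray and allocates a new array.
    println(seq.toArray eq backing)      // false: the elements were copied
    // Unwrapping the ArraySeq reuses the original array, mirroring the 2.12 behavior.
    println(seqToArray(seq) eq backing)  // true: same underlying array
  }
}
{code}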



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39766) For the `arrayOfAnyAsSeq` scenario in `GenericArrayDataBenchmark`, using Scala 2.13 is slower than Scala 2.12

2022-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566591#comment-17566591
 ] 

Apache Spark commented on SPARK-39766:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37185

> For the `arrayOfAnyAsSeq` scenario in `GenericArrayDataBenchmark`, using 
> Scala 2.13 is slower than Scala 2.12
> -
>
> Key: SPARK-39766
> URL: https://issues.apache.org/jira/browse/SPARK-39766
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> Running `GenericArrayDataBenchmark` with both Scala 2.13 and Scala 2.12 shows 
> that, for the `arrayOfAnyAsSeq` scenario, Scala 2.13 is slower than Scala 2.12:
> *Scala 2.12*
> {code:java}
> OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.13.0-1021-azure
> Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
> constructor:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
> ----------------------------------------------------------------------------------------------------------------
> arrayOfAnyAsSeq                              25             29           2       395.1           2.5       0.1X{code}
> *Scala 2.13*
> {code:java}
> OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
> Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
> constructor:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
> ----------------------------------------------------------------------------------------------------------------
> arrayOfAnyAsSeq                             241            243           1        41.4          24.1       0.0X {code}
> The test code is as follows:
> {code:java}
> benchmark.addCase("arrayOfAnyAsSeq") { _ =>
>   val arr: Seq[Any] = new Array[Any](arraySize)
>   var n = 0
>   while (n < valuesPerIteration) {
> new GenericArrayData(arr)
> n += 1
>   }
> } {code}
> The constructor of GenericArrayData is as follows:
> {code:java}
> def this(seq: scala.collection.Seq[Any]) = this(seq.toArray) {code}
>  
> The performance difference is due to the following reasons:
> *When using Scala 2.12:*
> The class type of `arr` is `s.c.mutable.WrappedArray$ofRef`, and `toArray` 
> returns `array.asInstanceOf[Array[U]]`, so there is no memory copy.
> *When using Scala 2.13:*
> The class type of `arr` is `s.c.immutable.ArraySeq$ofRef`, and `toArray` calls 
> `IterableOnceOps#toArray`, whose implementation performs a memory copy.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39766) For the `arrayOfAnyAsSeq` scenario in `GenericArrayDataBenchmark`, using Scala 2.13 is slower than Scala 2.12

2022-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566592#comment-17566592
 ] 

Apache Spark commented on SPARK-39766:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37185

> For the `arrayOfAnyAsSeq` scenario in `GenericArrayDataBenchmark`, using 
> Scala 2.13 is slower than Scala 2.12
> -
>
> Key: SPARK-39766
> URL: https://issues.apache.org/jira/browse/SPARK-39766
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> Running `GenericArrayDataBenchmark` with both Scala 2.13 and Scala 2.12 shows 
> that, for the `arrayOfAnyAsSeq` scenario, Scala 2.13 is slower than Scala 2.12:
> *Scala 2.12*
> {code:java}
> OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.13.0-1021-azure
> Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
> constructor:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
> ----------------------------------------------------------------------------------------------------------------
> arrayOfAnyAsSeq                              25             29           2       395.1           2.5       0.1X{code}
> *Scala 2.13*
> {code:java}
> OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
> Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
> constructor:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
> ----------------------------------------------------------------------------------------------------------------
> arrayOfAnyAsSeq                             241            243           1        41.4          24.1       0.0X {code}
> The test code is as follows:
> {code:java}
> benchmark.addCase("arrayOfAnyAsSeq") { _ =>
>   val arr: Seq[Any] = new Array[Any](arraySize)
>   var n = 0
>   while (n < valuesPerIteration) {
> new GenericArrayData(arr)
> n += 1
>   }
> } {code}
> The constructor of GenericArrayData is as follows:
> {code:java}
> def this(seq: scala.collection.Seq[Any]) = this(seq.toArray) {code}
>  
> The performance difference is due to the following reasons:
> *When using Scala 2.12:*
> The class type of `arr` is `s.c.mutable.WrappedArray$ofRef`, and `toArray` 
> returns `array.asInstanceOf[Array[U]]`, so there is no memory copy.
> *When using Scala 2.13:*
> The class type of `arr` is `s.c.immutable.ArraySeq$ofRef`, and `toArray` calls 
> `IterableOnceOps#toArray`, whose implementation performs a memory copy.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39766) For the `arrayOfAnyAsSeq` scenario in `GenericArrayDataBenchmark`, using Scala 2.13 is slower than Scala 2.12

2022-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39766:


Assignee: Apache Spark

> For the `arrayOfAnyAsSeq` scenario in `GenericArrayDataBenchmark`, using 
> Scala 2.13 is slower than Scala 2.12
> -
>
> Key: SPARK-39766
> URL: https://issues.apache.org/jira/browse/SPARK-39766
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> Running `GenericArrayDataBenchmark` with both Scala 2.13 and Scala 2.12 shows 
> that, for the `arrayOfAnyAsSeq` scenario, Scala 2.13 is slower than Scala 2.12:
> *Scala 2.12*
> {code:java}
> OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.13.0-1021-azure
> Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
> constructor:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
> ----------------------------------------------------------------------------------------------------------------
> arrayOfAnyAsSeq                              25             29           2       395.1           2.5       0.1X{code}
> *Scala 2.13*
> {code:java}
> OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
> Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
> constructor:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
> ----------------------------------------------------------------------------------------------------------------
> arrayOfAnyAsSeq                             241            243           1        41.4          24.1       0.0X {code}
> The test code is as follows:
> {code:java}
> benchmark.addCase("arrayOfAnyAsSeq") { _ =>
>   val arr: Seq[Any] = new Array[Any](arraySize)
>   var n = 0
>   while (n < valuesPerIteration) {
> new GenericArrayData(arr)
> n += 1
>   }
> } {code}
> The constructor of GenericArrayData is as follows:
> {code:java}
> def this(seq: scala.collection.Seq[Any]) = this(seq.toArray) {code}
>  
> The performance difference is due to the following reasons:
> *When using Scala 2.12:*
> The class type of `arr` is `s.c.mutable.WrappedArray$ofRef`, and `toArray` 
> returns `array.asInstanceOf[Array[U]]`, so there is no memory copy.
> *When using Scala 2.13:*
> The class type of `arr` is `s.c.immutable.ArraySeq$ofRef`, and `toArray` calls 
> `IterableOnceOps#toArray`, whose implementation performs a memory copy.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39774) [BUG] char(n) type cannot be directly compared with strings for equality and needs to be cast to String; Hive SQL can compare them without conversion

2022-07-13 Thread melin (Jira)
melin created SPARK-39774:
-

 Summary: [BUG] char(n) type cannot be directly compared with 
strings for equality and needs to be cast to String; Hive SQL can compare them 
without conversion
 Key: SPARK-39774
 URL: https://issues.apache.org/jira/browse/SPARK-39774
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: melin






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39774) [BUG] char(n) type cannot be directly compared with strings for equality and needs to be cast to String; Hive SQL can compare them without conversion

2022-07-13 Thread melin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

melin updated SPARK-39774:
--
Attachment: image-2022-07-14-11-23-15-606.png

> [BUG] char(n) type cannot be directly compared with strings for equality and 
> needs to be cast to String; Hive SQL can compare them without conversion
> -
>
> Key: SPARK-39774
> URL: https://issues.apache.org/jira/browse/SPARK-39774
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: melin
>Priority: Major
> Attachments: image-2022-07-14-11-23-15-606.png
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39774) [BUG] char(n) type cannot be directly compared with strings for equality and needs to be cast to String; Hive SQL can compare them without conversion

2022-07-13 Thread melin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566622#comment-17566622
 ] 

melin commented on SPARK-39774:
---

{code:java}
create table tpcds_2g_parquet.reason;
CREATE TABLE `tpcds_2g_parquet`.`reason` (
  `r_reason_sk` BIGINT,
  `r_reason_id` CHAR(16),
  `r_reason_desc` CHAR(100))
USING parquet
TBLPROPERTIES (
  'transient_lastDdlTime' = '1645440621'){code}
 

!image-2022-07-14-11-24-24-247.png!!image-2022-07-14-11-23-15-606.png!
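
As an illustration only, a hedged sketch of the comparison pattern the title describes, run against the table above; the literal value is hypothetical, and whether the two queries agree is exactly what this ticket questions:

{code:java}
import org.apache.spark.sql.SparkSession

object CharComparisonSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("sketch")
      .enableHiveSupport().getOrCreate()

    // Hypothetical literal; r_reason_id is declared CHAR(16) in the DDL above.
    val direct = spark.sql(
      "SELECT count(*) FROM tpcds_2g_parquet.reason WHERE r_reason_id = 'AAAAAAAABAAAAAAA'")
    val casted = spark.sql(
      "SELECT count(*) FROM tpcds_2g_parquet.reason " +
        "WHERE CAST(r_reason_id AS STRING) = 'AAAAAAAABAAAAAAA'")

    // The ticket reports that the direct CHAR(n) comparison and the cast-to-string
    // comparison can disagree in Spark SQL, while Hive SQL needs no cast.
    direct.show()
    casted.show()

    spark.stop()
  }
}
{code}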

> [BUG] char(n) type cannot be directly compared with strings for equality and 
> needs to be cast to String; Hive SQL can compare them without conversion
> -
>
> Key: SPARK-39774
> URL: https://issues.apache.org/jira/browse/SPARK-39774
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: melin
>Priority: Major
> Attachments: image-2022-07-14-11-23-15-606.png, 
> image-2022-07-14-11-24-24-247.png
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39774) [BUG] char(n) type cannot be directly compared with strings for equality and needs to be cast to String; Hive SQL can compare them without conversion

2022-07-13 Thread melin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

melin updated SPARK-39774:
--
Attachment: image-2022-07-14-11-24-24-247.png

> [BUG] char(n) type cannot be directly compared with strings for equality and 
> needs to be cast to String; Hive SQL can compare them without conversion
> -
>
> Key: SPARK-39774
> URL: https://issues.apache.org/jira/browse/SPARK-39774
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: melin
>Priority: Major
> Attachments: image-2022-07-14-11-23-15-606.png, 
> image-2022-07-14-11-24-24-247.png
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-39774) [BUG] char(n) type cannot be directly compared with strings for equality and needs to be cast to String; Hive SQL can compare them without conversion

2022-07-13 Thread melin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566622#comment-17566622
 ] 

melin edited comment on SPARK-39774 at 7/14/22 3:26 AM:


{code:java}
create table tpcds_2g_parquet.reason;
CREATE TABLE `tpcds_2g_parquet`.`reason` (
  `r_reason_sk` BIGINT,
  `r_reason_id` CHAR(16),
  `r_reason_desc` CHAR(100))
USING parquet
TBLPROPERTIES (
  'transient_lastDdlTime' = '1645440621'){code}
 

!image-2022-07-14-11-24-24-247.png!!image-2022-07-14-11-23-15-606.png!

[~hyukjin.kwon] 


was (Author: melin):
{code:java}
create table tpcds_2g_parquet.reason;
CREATE TABLE `tpcds_2g_parquet`.`reason` (
  `r_reason_sk` BIGINT,
  `r_reason_id` CHAR(16),
  `r_reason_desc` CHAR(100))
USING parquet
TBLPROPERTIES (
  'transient_lastDdlTime' = '1645440621'){code}
 

!image-2022-07-14-11-24-24-247.png!!image-2022-07-14-11-23-15-606.png!

> [BUG] char(n) type cannot be directly compared with strings for equality and 
> needs to be cast to String; Hive SQL can compare them without conversion
> -
>
> Key: SPARK-39774
> URL: https://issues.apache.org/jira/browse/SPARK-39774
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: melin
>Priority: Major
> Attachments: image-2022-07-14-11-23-15-606.png, 
> image-2022-07-14-11-24-24-247.png
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39773) Update document of JDBC options for pushDownOffset

2022-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566623#comment-17566623
 ] 

Apache Spark commented on SPARK-39773:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/37186

> Update document of JDBC options for pushDownOffset
> --
>
> Key: SPARK-39773
> URL: https://issues.apache.org/jira/browse/SPARK-39773
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Because the DS v2 pushdown framework added a new JDBC option, pushDownOffset, 
> for offset pushdown, we should update sql-data-sources-jdbc.md.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39773) Update document of JDBC options for pushDownOffset

2022-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39773:


Assignee: Apache Spark

> Update document of JDBC options for pushDownOffset
> --
>
> Key: SPARK-39773
> URL: https://issues.apache.org/jira/browse/SPARK-39773
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> Because the DS v2 pushdown framework added a new JDBC option, pushDownOffset, 
> for offset pushdown, we should update sql-data-sources-jdbc.md.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39773) Update document of JDBC options for pushDownOffset

2022-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39773:


Assignee: (was: Apache Spark)

> Update document of JDBC options for pushDownOffset
> --
>
> Key: SPARK-39773
> URL: https://issues.apache.org/jira/browse/SPARK-39773
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Because the DS v2 pushdown framework added a new JDBC option, pushDownOffset, 
> for offset pushdown, we should update sql-data-sources-jdbc.md.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39748) Include the origin logical plan for LogicalRDD if it comes from DataFrame

2022-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566625#comment-17566625
 ] 

Apache Spark commented on SPARK-39748:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/37187

> Include the origin logical plan for LogicalRDD if it comes from DataFrame
> -
>
> Key: SPARK-39748
> URL: https://issues.apache.org/jira/browse/SPARK-39748
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.4.0
>
>
> When Spark converts the DataFrame to LogicalRDD for some reason (e.g. 
> foreachBatch sink), Spark just picks the RDD from the origin DataFrame and 
> discards the (logical/physical) plan.
> The origin logical plan can be useful for several use cases, including:
> 1. connecting the overall logical plan into one
> 2. inheriting plan stats from the origin logical plan
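
A minimal sketch, assuming a local SparkSession and the built-in rate source, of the foreachBatch case mentioned above; it only shows where a micro-batch DataFrame (internally backed by an RDD) gets reused, which is where the preserved plan and stats would matter:

{code:java}
import org.apache.spark.sql.{DataFrame, SparkSession}

object ForeachBatchSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("sketch").getOrCreate()

    // The rate source just generates (timestamp, value) rows for the demo.
    val stream = spark.readStream.format("rate").option("rowsPerSecond", "5").load()

    // Each micro-batch arrives here as a plain DataFrame; reusing it in further
    // transformations is the situation this ticket is about.
    val handle: (DataFrame, Long) => Unit = (batch, batchId) => {
      println(s"processing batch $batchId")
      batch.groupBy("value").count().show(5)
    }

    val query = stream.writeStream.foreachBatch(handle).start()
    query.awaitTermination(10000)
    query.stop()
    spark.stop()
  }
}
{code}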



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39672) NotExists subquery failed with conflicting attributes

2022-07-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-39672.
-
Fix Version/s: 3.3.1
   3.2.2
   3.1.4
   3.4.0
   Resolution: Fixed

Issue resolved by pull request 37074
[https://github.com/apache/spark/pull/37074]

> NotExists subquery failed with conflicting attributes
> -
>
> Key: SPARK-39672
> URL: https://issues.apache.org/jira/browse/SPARK-39672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.3
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Major
> Fix For: 3.3.1, 3.2.2, 3.1.4, 3.4.0
>
>
> {code:sql}
> select * from
> (
> select v1.a, v1.b, v2.c
> from v1
> inner join v2
> on v1.a=v2.a) t3
> where not exists (
>   select 1
>   from v2
>   where t3.a=v2.a and t3.b=v2.b and t3.c=v2.c
> ){code}
> This query throws an AnalysisException:
> {code:java}
> org.apache.spark.sql.AnalysisException: Found conflicting attributes a#266 in 
> the condition joining outer plan:
>   Join Inner, (a#250 = a#266)
> :- Project [_1#243 AS a#250, _2#244 AS b#251]
> :  +- LocalRelation [_1#243, _2#244, _3#245]
> +- Project [_1#259 AS a#266, _3#261 AS c#268]
>    +- LocalRelation [_1#259, _2#260, _3#261]and subplan:
>   Project [1 AS 1#273, _1#259 AS a#266, _2#260 AS b#267, _3#261 AS c#268#277]
> +- LocalRelation [_1#259, _2#260, _3#261] {code}
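
For reference, a hedged, self-contained sketch of the failing query shape, creating v1/v2 as local temp views with made-up rows; on the affected versions listed above, the final query raised the AnalysisException quoted in this ticket:

{code:java}
import org.apache.spark.sql.SparkSession

object NotExistsRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("repro").getOrCreate()
    import spark.implicits._

    // Made-up rows; only the (a, b, c) column shape matters for the analysis error.
    Seq((1, "x", 10), (2, "y", 20)).toDF("a", "b", "c").createOrReplaceTempView("v1")
    Seq((1, "x", 10), (3, "z", 30)).toDF("a", "b", "c").createOrReplaceTempView("v2")

    spark.sql(
      """select * from (
        |  select v1.a, v1.b, v2.c
        |  from v1 inner join v2 on v1.a = v2.a) t3
        |where not exists (
        |  select 1 from v2
        |  where t3.a = v2.a and t3.b = v2.b and t3.c = v2.c)
        |""".stripMargin).show()

    spark.stop()
  }
}
{code}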



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39672) NotExists subquery failed with conflicting attributes

2022-07-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-39672:
---

Assignee: Manu Zhang

> NotExists subquery failed with conflicting attributes
> -
>
> Key: SPARK-39672
> URL: https://issues.apache.org/jira/browse/SPARK-39672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.3
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Major
>
> {code:sql}
> select * from
> (
> select v1.a, v1.b, v2.c
> from v1
> inner join v2
> on v1.a=v2.a) t3
> where not exists (
>   select 1
>   from v2
>   where t3.a=v2.a and t3.b=v2.b and t3.c=v2.c
> ){code}
> This query throws an AnalysisException:
> {code:java}
> org.apache.spark.sql.AnalysisException: Found conflicting attributes a#266 in 
> the condition joining outer plan:
>   Join Inner, (a#250 = a#266)
> :- Project [_1#243 AS a#250, _2#244 AS b#251]
> :  +- LocalRelation [_1#243, _2#244, _3#245]
> +- Project [_1#259 AS a#266, _3#261 AS c#268]
>    +- LocalRelation [_1#259, _2#260, _3#261]and subplan:
>   Project [1 AS 1#273, _1#259 AS a#266, _2#260 AS b#267, _3#261 AS c#268#277]
> +- LocalRelation [_1#259, _2#260, _3#261] {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39767) Remove UnresolvedDBObjectName and add UnresolvedIdentifier

2022-07-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-39767:
---

Assignee: Wenchen Fan

> Remove UnresolvedDBObjectName and add UnresolvedIdentifier
> --
>
> Key: SPARK-39767
> URL: https://issues.apache.org/jira/browse/SPARK-39767
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39767) Remove UnresolvedDBObjectName and add UnresolvedIdentifier

2022-07-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-39767.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37178
[https://github.com/apache/spark/pull/37178]

> Remove UnresolvedDBObjectName and add UnresolvedIdentifier
> --
>
> Key: SPARK-39767
> URL: https://issues.apache.org/jira/browse/SPARK-39767
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


