[jira] [Updated] (SPARK-37835) Fix the comments on SQLQueryTestSuite.scala/ThriftServerQueryTestSuite.scala to be more explicit

2022-01-06 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-37835:

Description: 
Currently, the comments on `SQLQueryTestSuite.scala` say:
{code:java}
 * To re-generate golden files for entire suite, run:
 * {{{
 *   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *SQLQueryTestSuite" 
 * }}} {code}
But `org.apache.spark.sql.SQLQueryTestSuite` is the only suite that matches 
`*SQLQueryTestSuite`, so we had better recommend using 
`org.apache.spark.sql.SQLQueryTestSuite` explicitly.

The same applies to `ThriftServerQueryTestSuite.scala`.

  was:
Currently, the comments on `SQLQueryTestSuite.scala` say:
{code:java}
 * To re-generate golden files for entire suite, run:
 * {{{
 *   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *SQLQueryTestSuite" 
 * }}} {code}
But `org.apache.spark.sql.SQLQueryTestSuite` is the only suite that matches 
`*SQLQueryTestSuite`, so we had better recommend using 
`org.apache.spark.sql.SQLQueryTestSuite` explicitly.

The same applies to ThriftServerQueryTestSuite.scala.


> Fix the comments on SQLQueryTestSuite.scala/ThriftServerQueryTestSuite.scala 
> to be more explicit
> -
>
> Key: SPARK-37835
> URL: https://issues.apache.org/jira/browse/SPARK-37835
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Currently, the comments on `SQLQueryTestSuite.scala` say:
> {code:java}
>  * To re-generate golden files for entire suite, run:
>  * {{{
>  *   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly 
> *SQLQueryTestSuite" 
>  * }}} {code}
> But `org.apache.spark.sql.SQLQueryTestSuite` is the only suite that matches 
> `*SQLQueryTestSuite`, so we had better recommend using 
> `org.apache.spark.sql.SQLQueryTestSuite` explicitly.
> The same applies to `ThriftServerQueryTestSuite.scala`.
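
For reference, the explicit commands the updated comments would recommend look like this (a sketch: the first command follows directly from the issue, while the hive-thriftserver sbt project and the package of ThriftServerQueryTestSuite are assumptions, not quoted from it):
{code:sh}
# Re-generate golden files with the fully qualified suite name, so the
# pattern cannot match (and relabel) any other suite:
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"

# Assumed analogous command for the Thrift server suite:
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "hive-thriftserver/testOnly org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite"
{code}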



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37835) Fix the comments on SQLQueryTestSuite.scala/ThriftServerQueryTestSuite.scala to be more explicit

2022-01-06 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-37835:

Description: 
Currently, the comments on `SQLQueryTestSuite.scala` say:
{code:java}
 * To re-generate golden files for entire suite, run:
 * {{{
 *   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *SQLQueryTestSuite" 
 * }}} {code}
But `org.apache.spark.sql.SQLQueryTestSuite` is the only suite that matches 
`*SQLQueryTestSuite`, so we had better recommend using 
`org.apache.spark.sql.SQLQueryTestSuite` explicitly.

The same applies to ThriftServerQueryTestSuite.scala.

  was:
Currently, the comments on `SQLQueryTestSuite.scala` say:
{code:java}
 * To re-generate golden files for entire suite, run:
 * {{{
 *   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *SQLQueryTestSuite" 
 * }}} {code}
But `org.apache.spark.sql.SQLQueryTestSuite` is the only suite that matches 
`*SQLQueryTestSuite`, so we had better recommend using 
`org.apache.spark.sql.SQLQueryTestSuite` explicitly.


> Fix the comments on SQLQueryTestSuite.scala/ThriftServerQueryTestSuite.scala 
> to be more explicit
> -
>
> Key: SPARK-37835
> URL: https://issues.apache.org/jira/browse/SPARK-37835
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Currently, the comments on `SQLQueryTestSuite.scala` say:
> {code:java}
>  * To re-generate golden files for entire suite, run:
>  * {{{
>  *   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly 
> *SQLQueryTestSuite" 
>  * }}} {code}
> But `org.apache.spark.sql.SQLQueryTestSuite` is the only suite that matches 
> `*SQLQueryTestSuite`, so we had better recommend using 
> `org.apache.spark.sql.SQLQueryTestSuite` explicitly.
> The same applies to ThriftServerQueryTestSuite.scala.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37835) Fix the comments on SQLQueryTestSuite.scala/ThriftServerQueryTestSuite.scala to be more explicit

2022-01-06 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-37835:

Summary: Fix the comments on 
SQLQueryTestSuite.scala/ThriftServerQueryTestSuite.scala to be more explicit  
(was: Fix the comments on SQLQueryTestSuite.scala to be more explicit)

> Fix the comments on SQLQueryTestSuite.scala/ThriftServerQueryTestSuite.scala 
> to be more explicit
> -
>
> Key: SPARK-37835
> URL: https://issues.apache.org/jira/browse/SPARK-37835
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Currently, the comments on `SQLQueryTestSuite.scala` say:
> {code:java}
>  * To re-generate golden files for entire suite, run:
>  * {{{
>  *   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly 
> *SQLQueryTestSuite" 
>  * }}} {code}
> But `org.apache.spark.sql.SQLQueryTestSuite` is the only suite that matches 
> `*SQLQueryTestSuite`, so we had better recommend using 
> `org.apache.spark.sql.SQLQueryTestSuite` explicitly.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37835) Fix the comments on SQLQueryTestSuite.scala to be more explicit

2022-01-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37835:


Assignee: Apache Spark

> Fix the comments on SQLQueryTestSuite.scala to be more explicit
> 
>
> Key: SPARK-37835
> URL: https://issues.apache.org/jira/browse/SPARK-37835
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> Currently, the comments on `SQLQueryTestSuite.scala` say:
> {code:java}
>  * To re-generate golden files for entire suite, run:
>  * {{{
>  *   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly 
> *SQLQueryTestSuite" 
>  * }}} {code}
> But `org.apache.spark.sql.SQLQueryTestSuite` is the only suite that matches 
> `*SQLQueryTestSuite`, so we had better recommend using 
> `org.apache.spark.sql.SQLQueryTestSuite` explicitly.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37835) Fix the comments on SQLQueryTestSuite.scala to be more explicit

2022-01-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37835:


Assignee: (was: Apache Spark)

> Fix the comments on SQLQueryTestSuite.scala to be more explicit
> 
>
> Key: SPARK-37835
> URL: https://issues.apache.org/jira/browse/SPARK-37835
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Currently, the comments on `SQLQueryTestSuite.scala` say:
> {code:java}
>  * To re-generate golden files for entire suite, run:
>  * {{{
>  *   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly 
> *SQLQueryTestSuite" 
>  * }}} {code}
> But `org.apache.spark.sql.SQLQueryTestSuite` is the only suite that matches 
> `*SQLQueryTestSuite`, so we had better recommend using 
> `org.apache.spark.sql.SQLQueryTestSuite` explicitly.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37835) Fix the comments on SQLQueryTestSuite.scala to be more explicit

2022-01-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17470382#comment-17470382
 ] 

Apache Spark commented on SPARK-37835:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/35129

> Fix the comments on SQLQueryTestSuite.scala to be more explicit
> 
>
> Key: SPARK-37835
> URL: https://issues.apache.org/jira/browse/SPARK-37835
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Currently, the comments on `SQLQueryTestSuite.scala` say:
> {code:java}
>  * To re-generate golden files for entire suite, run:
>  * {{{
>  *   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly 
> *SQLQueryTestSuite" 
>  * }}} {code}
> But `org.apache.spark.sql.SQLQueryTestSuite` is the only suite that matches 
> `*SQLQueryTestSuite`, so we had better recommend using 
> `org.apache.spark.sql.SQLQueryTestSuite` explicitly.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37838) Upgrade scalatestplus artifacts to 3.3.0-SNAP3

2022-01-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37838:


Assignee: Apache Spark

> Upgrade scalatestplus artifacts to 3.3.0-SNAP3
> --
>
> Key: SPARK-37838
> URL: https://issues.apache.org/jira/browse/SPARK-37838
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> It always throws an exception:
> {noformat}
> [error] 
> /Users/yumwang/spark/core/src/test/scala/org/apache/spark/SparkContextInfoSuite.scala:22:8:
>  Symbol 'type org.scalactic.TripleEquals' is missing from the classpath.
> [error] This symbol is required by 'trait org.scalatest.Assertions'.
> [error] Make sure that type TripleEquals is in your classpath and check for 
> conflicting dependencies with `-Ylog-classpath`.
> [error] A full rebuild may help if 'Assertions.class' was compiled against an 
> incompatible version of org.scalactic.
> [error] import org.scalatest.Assertions
> [error]^
> [error] one error found
> [error] (core / Test / compileIncremental) Compilation failed
> {noformat}
> How to reproduce:
> {code:sh}
>  SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStability*Suite"
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37838) Upgrade scalatestplus artifacts to 3.3.0-SNAP3

2022-01-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17470376#comment-17470376
 ] 

Apache Spark commented on SPARK-37838:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/35128

> Upgrade scalatestplus artifacts to 3.3.0-SNAP3
> --
>
> Key: SPARK-37838
> URL: https://issues.apache.org/jira/browse/SPARK-37838
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>
> It always throws an exception:
> {noformat}
> [error] 
> /Users/yumwang/spark/core/src/test/scala/org/apache/spark/SparkContextInfoSuite.scala:22:8:
>  Symbol 'type org.scalactic.TripleEquals' is missing from the classpath.
> [error] This symbol is required by 'trait org.scalatest.Assertions'.
> [error] Make sure that type TripleEquals is in your classpath and check for 
> conflicting dependencies with `-Ylog-classpath`.
> [error] A full rebuild may help if 'Assertions.class' was compiled against an 
> incompatible version of org.scalactic.
> [error] import org.scalatest.Assertions
> [error]^
> [error] one error found
> [error] (core / Test / compileIncremental) Compilation failed
> {noformat}
> How to reproduce:
> {code:sh}
>  SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStability*Suite"
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37838) Upgrade scalatestplus artifacts to 3.3.0-SNAP3

2022-01-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37838:


Assignee: (was: Apache Spark)

> Upgrade scalatestplus artifacts to 3.3.0-SNAP3
> --
>
> Key: SPARK-37838
> URL: https://issues.apache.org/jira/browse/SPARK-37838
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>
> It always throws an exception:
> {noformat}
> [error] 
> /Users/yumwang/spark/core/src/test/scala/org/apache/spark/SparkContextInfoSuite.scala:22:8:
>  Symbol 'type org.scalactic.TripleEquals' is missing from the classpath.
> [error] This symbol is required by 'trait org.scalatest.Assertions'.
> [error] Make sure that type TripleEquals is in your classpath and check for 
> conflicting dependencies with `-Ylog-classpath`.
> [error] A full rebuild may help if 'Assertions.class' was compiled against an 
> incompatible version of org.scalactic.
> [error] import org.scalatest.Assertions
> [error]^
> [error] one error found
> [error] (core / Test / compileIncremental) Compilation failed
> {noformat}
> How to reproduce:
> {code:sh}
>  SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStability*Suite"
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37838) Upgrade scalatestplus artifacts to 3.3.0-SNAP3

2022-01-06 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-37838:
---

 Summary: Upgrade scalatestplus artifacts to 3.3.0-SNAP3
 Key: SPARK-37838
 URL: https://issues.apache.org/jira/browse/SPARK-37838
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 3.3.0
Reporter: Yuming Wang


It always throws an exception:
{noformat}
[error] 
/Users/yumwang/spark/core/src/test/scala/org/apache/spark/SparkContextInfoSuite.scala:22:8:
 Symbol 'type org.scalactic.TripleEquals' is missing from the classpath.
[error] This symbol is required by 'trait org.scalatest.Assertions'.
[error] Make sure that type TripleEquals is in your classpath and check for 
conflicting dependencies with `-Ylog-classpath`.
[error] A full rebuild may help if 'Assertions.class' was compiled against an 
incompatible version of org.scalactic.
[error] import org.scalatest.Assertions
[error]^
[error] one error found
[error] (core / Test / compileIncremental) Compilation failed
{noformat}

How to reproduce:

{code:sh}
 SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStability*Suite"
{code}
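
To check which scalatestplus artifacts (and the scalatest/scalactic versions they pull in) the build actually resolves before and after the bump, one option is a dependency-tree query (a sketch; the module and grep pattern are illustrative):
{code:sh}
# List resolved dependencies for the core module and filter for the test stack
build/mvn dependency:tree -pl core | grep -iE "scalatestplus|scalatest|scalactic"
{code}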




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37826) Use zstd codec name in ORC file names for hive orc impl

2022-01-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-37826.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35117
[https://github.com/apache/spark/pull/35117]

> Use zstd codec name in ORC file names for hive orc impl
> ---
>
> Key: SPARK-37826
> URL: https://issues.apache.org/jira/browse/SPARK-37826
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.3.0
>
>
> After SPARK-34954, we added the zstd codec name to ORC file names for the 
> native ORC impl, but without hive impl support
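
A quick way to observe the file-name behavior this issue extends to the hive impl (a sketch; the output path and the spark-shell pipe are illustrative):
{code:sh}
# Write an ORC file with zstd compression, then inspect the file names produced.
# With the native impl (after SPARK-34954) the codec shows up as part-*.zstd.orc.
echo 'spark.range(10).write.option("compression", "zstd").orc("/tmp/zstd_orc_demo")' | spark-shell
ls /tmp/zstd_orc_demo
{code}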



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37826) Use zstd codec name in ORC file names for hive orc impl

2022-01-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-37826:
-

Assignee: Kent Yao

> Use zstd codec name in ORC file names for hive orc impl
> ---
>
> Key: SPARK-37826
> URL: https://issues.apache.org/jira/browse/SPARK-37826
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> After SPARK-34954, we added the zstd codec name to ORC file names for the 
> native ORC impl, but without hive impl support



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37837) Enable black formatter in dev Python scripts

2022-01-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17470360#comment-17470360
 ] 

Apache Spark commented on SPARK-37837:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/35127

> Enable black formatter in dev Python scripts
> 
>
> Key: SPARK-37837
> URL: https://issues.apache.org/jira/browse/SPARK-37837
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> The black formatter is only enabled for python/pyspark to minimize 
> side effects, e.g., reformatting auto-generated or third-party Python scripts.
> This JIRA aims to enable the black formatter in the dev directory, where there 
> are no generated Python scripts to exclude.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37837) Enable black formatter in dev Python scripts

2022-01-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37837:


Assignee: (was: Apache Spark)

> Enable black formatter in dev Python scripts
> 
>
> Key: SPARK-37837
> URL: https://issues.apache.org/jira/browse/SPARK-37837
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> The black formatter is only enabled for python/pyspark to minimize 
> side effects, e.g., reformatting auto-generated or third-party Python scripts.
> This JIRA aims to enable the black formatter in the dev directory, where there 
> are no generated Python scripts to exclude.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37837) Enable black formatter in dev Python scripts

2022-01-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37837:


Assignee: Apache Spark

> Enable black formatter in dev Python scripts
> 
>
> Key: SPARK-37837
> URL: https://issues.apache.org/jira/browse/SPARK-37837
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> The black formatter is only enabled for python/pyspark to minimize 
> side effects, e.g., reformatting auto-generated or third-party Python scripts.
> This JIRA aims to enable the black formatter in the dev directory, where there 
> are no generated Python scripts to exclude.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37837) Enable black formatter in dev Python scripts

2022-01-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17470359#comment-17470359
 ] 

Apache Spark commented on SPARK-37837:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/35127

> Enable black formatter in dev Python scripts
> 
>
> Key: SPARK-37837
> URL: https://issues.apache.org/jira/browse/SPARK-37837
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> The black formatter is only enabled for python/pyspark to minimize 
> side effects, e.g., reformatting auto-generated or third-party Python scripts.
> This JIRA aims to enable the black formatter in the dev directory, where there 
> are no generated Python scripts to exclude.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37837) Enable black formatter in dev Python scripts

2022-01-06 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-37837:


 Summary: Enable black formatter in dev Python scripts
 Key: SPARK-37837
 URL: https://issues.apache.org/jira/browse/SPARK-37837
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 3.3.0
Reporter: Hyukjin Kwon


The black formatter is only enabled for python/pyspark to minimize side effects, 
e.g., reformatting auto-generated or third-party Python scripts.

This JIRA aims to enable the black formatter in the dev directory, where there 
are no generated Python scripts to exclude.
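
As a rough illustration of what enabling it means in practice (a sketch; the flags and any line-length setting pinned by Spark's config are assumptions):
{code:sh}
# Reformat the dev scripts with black (requires `pip install black`)
python -m black dev

# Or only report files that would change, without rewriting them:
python -m black --check dev
{code}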



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37836) Enable more flake8 rules for PEP 8 compliance

2022-01-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37836:


Assignee: Apache Spark

> Enable more flake8 rules for PEP 8 compliance
> -
>
> Key: SPARK-37836
> URL: https://issues.apache.org/jira/browse/SPARK-37836
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra, PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> Most of the linter rules disabled here:
> https://github.com/apache/spark/blob/master/dev/tox.ini#L19-L31
> should be enabled to comply with PEP 8.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37836) Enable more flake8 rules for PEP 8 compliance

2022-01-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17470356#comment-17470356
 ] 

Apache Spark commented on SPARK-37836:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/35126

> Enable more flake8 rules for PEP 8 compliance
> -
>
> Key: SPARK-37836
> URL: https://issues.apache.org/jira/browse/SPARK-37836
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra, PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Most of the linter rules disabled here:
> https://github.com/apache/spark/blob/master/dev/tox.ini#L19-L31
> should be enabled to comply with PEP 8.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37836) Enable more flake8 rules for PEP 8 compliance

2022-01-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37836:


Assignee: (was: Apache Spark)

> Enable more flake8 rules for PEP 8 compliance
> -
>
> Key: SPARK-37836
> URL: https://issues.apache.org/jira/browse/SPARK-37836
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra, PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Most of the linter rules disabled here:
> https://github.com/apache/spark/blob/master/dev/tox.ini#L19-L31
> should be enabled to comply with PEP 8.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37834) Reenable length check in Python linter

2022-01-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37834:


Assignee: Hyukjin Kwon

> Reenable length check in Python linter
> --
>
> Key: SPARK-37834
> URL: https://issues.apache.org/jira/browse/SPARK-37834
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra, PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> SPARK-37380 mistakenly removed the length check for PySpark in the codebase. 
> We should reenable it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37834) Reenable length check in Python linter

2022-01-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37834.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35123
[https://github.com/apache/spark/pull/35123]

> Reenable length check in Python linter
> --
>
> Key: SPARK-37834
> URL: https://issues.apache.org/jira/browse/SPARK-37834
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra, PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.3.0
>
>
> SPARK-37380 mistakenly removed the length check for PySpark in the codebase. 
> We should reenable it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37802) composite field name like `field name` doesn't work with Aggregate push down

2022-01-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17470344#comment-17470344
 ] 

Apache Spark commented on SPARK-37802:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/35125

> composite field name like `field name` doesn't work with Aggregate push down
> 
>
> Key: SPARK-37802
> URL: https://issues.apache.org/jira/browse/SPARK-37802
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.3.0
>
>
> {code:java}
> sql("SELECT SUM(`field name`) FROM h2.test.table")
> org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input 'name' expecting (line 1, pos 9)
>   at 
> org.apache.spark.sql.catalyst.parser.ParseErrorListener$.syntaxError(ParseDriver.scala:212)
>   at 
> org.antlr.v4.runtime.ProxyErrorListener.syntaxError(ProxyErrorListener.java:41)
>   at org.antlr.v4.runtime.Parser.notifyErrorListeners(Parser.java:544)
>   at 
> org.antlr.v4.runtime.DefaultErrorStrategy.reportUnwantedToken(DefaultErrorStrategy.java:377)
>   at 
> org.antlr.v4.runtime.DefaultErrorStrategy.singleTokenDeletion(DefaultErrorStrategy.java:548)
>   at 
> org.antlr.v4.runtime.DefaultErrorStrategy.recoverInline(DefaultErrorStrategy.java:467)
>   at org.antlr.v4.runtime.Parser.match(Parser.java:206)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser.singleMultipartIdentifier(SqlBaseParser.java:519)
> {code}
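
For reference, the failure can also be reproduced end to end from the SQL CLI once an h2 catalog is configured (a sketch; the catalog setup is assumed and not shown):
{code:sh}
# The backtick-quoted column name must survive into the pushed-down aggregate
spark-sql -e 'SELECT SUM(`field name`) FROM h2.test.table'
{code}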



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37835) Fix the comments on SQLQueryTestSuite.scala to be more explicit

2022-01-06 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-37835:

Description: 
Currently, the comments on `SQLQueryTestSuite.scala` say:
{code:java}
 * To re-generate golden files for entire suite, run:
 * {{{
 *   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *SQLQueryTestSuite" 
 * }}} {code}
But `org.apache.spark.sql.SQLQueryTestSuite` is the only suite that matches 
`*SQLQueryTestSuite`, so we had better recommend using 
`org.apache.spark.sql.SQLQueryTestSuite` explicitly.

  was:
Currently, the comments on `SQLQueryTestSuite.scala` say:
{code:java}
 * To re-generate golden files for entire suite, run:
 * {{{
 *   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *SQLQueryTestSuite" 
 * }}} {code}
But following this comment rewrites the "-- Automatically generated by 
SQLQueryTestSuite" header with different test suite names across all the golden 
files, even though the contents of the golden files are not modified.

So we had better recommend using `org.apache.spark.sql.SQLQueryTestSuite` 
instead of `*SQLQueryTestSuite`.


> Fix the comments on SQLQueryTestSuite.scala to be more explicit
> 
>
> Key: SPARK-37835
> URL: https://issues.apache.org/jira/browse/SPARK-37835
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Currently, the comments on `SQLQueryTestSuite.scala` say:
> {code:java}
>  * To re-generate golden files for entire suite, run:
>  * {{{
>  *   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly 
> *SQLQueryTestSuite" 
>  * }}} {code}
> But `org.apache.spark.sql.SQLQueryTestSuite` is the only suite that matches 
> `*SQLQueryTestSuite`, so we had better recommend using 
> `org.apache.spark.sql.SQLQueryTestSuite` explicitly.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37835) Fix the comments on SQLQueryTestSuite.scala to be more explicit

2022-01-06 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-37835:

Summary: Fix the comments on SQLQueryTestSuite.scala to be more explicit  
(was: Fix the comments on SQLQueryTestSuite.scala to generate the golden files 
more gracefully)

> Fix the comments on SQLQueryTestSuite.scala to be more explicit
> 
>
> Key: SPARK-37835
> URL: https://issues.apache.org/jira/browse/SPARK-37835
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Currently, the comments on `SQLQueryTestSuite.scala` say:
> {code:java}
>  * To re-generate golden files for entire suite, run:
>  * {{{
>  *   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly 
> *SQLQueryTestSuite" 
>  * }}} {code}
> But following this comment rewrites the "-- Automatically generated by 
> SQLQueryTestSuite" header with different test suite names across all the 
> golden files, even though the contents of the golden files are not modified.
> So we had better recommend using `org.apache.spark.sql.SQLQueryTestSuite` 
> instead of `*SQLQueryTestSuite`.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-37835) Fix the comments on SQLQueryTestSuite.scala to generate the golden files more gracefully

2022-01-06 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee reopened SPARK-37835:
-

> Fix the comments on SQLQueryTestSuite.scala to generate the golden files more 
> gracefully
> 
>
> Key: SPARK-37835
> URL: https://issues.apache.org/jira/browse/SPARK-37835
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Currently, the comments on `SQLQueryTestSuite.scala` say:
> {code:java}
>  * To re-generate golden files for entire suite, run:
>  * {{{
>  *   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly 
> *SQLQueryTestSuite" 
>  * }}} {code}
> But following this comment rewrites the "-- Automatically generated by 
> SQLQueryTestSuite" header with different test suite names across all the 
> golden files, even though the contents of the golden files are not modified.
> So we had better recommend using `org.apache.spark.sql.SQLQueryTestSuite` 
> instead of `*SQLQueryTestSuite`.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37836) Enable more flake8 rules for PEP 8 compliance

2022-01-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37836:
-
Summary: Enable more flake8 rules for PEP 8 compliance  (was: Enable more 
flake8 rules)

> Enable more flake8 rules for PEP 8 compliance
> -
>
> Key: SPARK-37836
> URL: https://issues.apache.org/jira/browse/SPARK-37836
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra, PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Most of the linter rules disabled here:
> https://github.com/apache/spark/blob/master/dev/tox.ini#L19-L31
> should be enabled to comply with PEP 8.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37820) Replace ApacheCommonBase64 with JavaBase64 for string functions

2022-01-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-37820:
-

Assignee: Kent Yao

> Replace ApacheCommonBase64 with JavaBase64 for string functions
> ---
>
> Key: SPARK-37820
> URL: https://issues.apache.org/jira/browse/SPARK-37820
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> Replace the dependency on third-party libraries with native support 
> (https://docs.oracle.com/javase/8/docs/api/java/util/Base64.html) for 
> Base64 encoding/decoding.
> 1. Performance gain: 
> http://java-performance.info/base64-encoding-and-decoding-performance/
> 2. Reduced dependencies afterward
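
For reference, the JDK-native codec this issue switches to has been available since Java 8; a minimal sketch of the call, runnable with any JDK 9+ jshell:
{code:sh}
# Encode a string with java.util.Base64 instead of the Apache Commons codec
echo 'System.out.println(java.util.Base64.getEncoder().encodeToString("spark".getBytes()))' | jshell -
{code}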



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37820) Replace ApacheCommonBase64 with JavaBase64 for string functions

2022-01-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-37820.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35110
[https://github.com/apache/spark/pull/35110]

> Replace ApacheCommonBase64 with JavaBase64 for string functions
> ---
>
> Key: SPARK-37820
> URL: https://issues.apache.org/jira/browse/SPARK-37820
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.3.0
>
>
> Replace the dependency on third-party libraries with native support 
> (https://docs.oracle.com/javase/8/docs/api/java/util/Base64.html) for 
> Base64 encoding/decoding.
> 1. Performance gain: 
> http://java-performance.info/base64-encoding-and-decoding-performance/
> 2. Reduced dependencies afterward



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37836) Enable more flake8 rules

2022-01-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37836:
-
Issue Type: Improvement  (was: Bug)

> Enable more flake8 rules
> 
>
> Key: SPARK-37836
> URL: https://issues.apache.org/jira/browse/SPARK-37836
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra, PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Most of the linter rules disabled here:
> https://github.com/apache/spark/blob/master/dev/tox.ini#L19-L31
> should be enabled to comply with PEP 8.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37836) Enable more flake8 rules

2022-01-06 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-37836:


 Summary: Enable more flake8 rules
 Key: SPARK-37836
 URL: https://issues.apache.org/jira/browse/SPARK-37836
 Project: Spark
  Issue Type: Bug
  Components: Project Infra, PySpark
Affects Versions: 3.3.0
Reporter: Hyukjin Kwon


Most of the linter rules disabled here:

https://github.com/apache/spark/blob/master/dev/tox.ini#L19-L31

should be enabled to comply with PEP 8.
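
One way to preview the impact of individual rules before flipping them on in tox.ini (a sketch; the selected error codes are arbitrary examples, not the actual disabled list):
{code:sh}
# Report violations for a couple of PEP 8 rules without changing any config
python -m flake8 --select=E501,E731 python/pyspark
{code}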



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37734) Upgrade h2 from 1.4.195 to 2.0.202

2022-01-06 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37734.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35013
[https://github.com/apache/spark/pull/35013]

> Upgrade h2 from 1.4.195 to 2.0.202
> --
>
> Key: SPARK-37734
> URL: https://issues.apache.org/jira/browse/SPARK-37734
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently, com.h2database has one known vulnerability; ref: 
> https://www.tenable.com/cve/CVE-2021-23463
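
To confirm which h2 version the build resolves before and after the upgrade (a sketch; h2 is assumed to be a test dependency of the sql/core module):
{code:sh}
# Inspect the resolved dependency tree and filter for h2
build/mvn dependency:tree -pl sql/core | grep -i "com.h2database"
{code}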



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37734) Upgrade h2 from 1.4.195 to 2.0.202

2022-01-06 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-37734:
---

Assignee: jiaan.geng

> Upgrade h2 from 1.4.195 to 2.0.202
> --
>
> Key: SPARK-37734
> URL: https://issues.apache.org/jira/browse/SPARK-37734
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>
> Currently, com.h2database has one known vulnerability; ref: 
> https://www.tenable.com/cve/CVE-2021-23463



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-37772) [PYSPARK] Publish ApacheSparkGitHubActionImage arm64 docker image

2022-01-06 Thread Yikun Jiang (Jira)


[ https://issues.apache.org/jira/browse/SPARK-37772 ]


Yikun Jiang deleted comment on SPARK-37772:
-

was (Author: yikunkero):
To speed up the installation on pypy (aarch64), I raised issues in the 
numpy/scipy/pandas wheels repos:

https://github.com/MacPython/numpy-wheels/issues/143
https://github.com/MacPython/scipy-wheels/issues/161
https://github.com/MacPython/pandas-wheels/issues/171

> [PYSPARK] Publish ApacheSparkGitHubActionImage arm64 docker image
> -
>
> Key: SPARK-37772
> URL: https://issues.apache.org/jira/browse/SPARK-37772
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37802) composite field name like `field name` doesn't work with Aggregate push down

2022-01-06 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37802.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35108
[https://github.com/apache/spark/pull/35108]

> composite field name like `field name` doesn't work with Aggregate push down
> 
>
> Key: SPARK-37802
> URL: https://issues.apache.org/jira/browse/SPARK-37802
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.3.0
>
>
> {code:java}
> sql("SELECT SUM(`field name`) FROM h2.test.table")
> org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input 'name' expecting (line 1, pos 9)
>   at 
> org.apache.spark.sql.catalyst.parser.ParseErrorListener$.syntaxError(ParseDriver.scala:212)
>   at 
> org.antlr.v4.runtime.ProxyErrorListener.syntaxError(ProxyErrorListener.java:41)
>   at org.antlr.v4.runtime.Parser.notifyErrorListeners(Parser.java:544)
>   at 
> org.antlr.v4.runtime.DefaultErrorStrategy.reportUnwantedToken(DefaultErrorStrategy.java:377)
>   at 
> org.antlr.v4.runtime.DefaultErrorStrategy.singleTokenDeletion(DefaultErrorStrategy.java:548)
>   at 
> org.antlr.v4.runtime.DefaultErrorStrategy.recoverInline(DefaultErrorStrategy.java:467)
>   at org.antlr.v4.runtime.Parser.match(Parser.java:206)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser.singleMultipartIdentifier(SqlBaseParser.java:519)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37835) Fix the comments on SQLQueryTestSuite.scala to generate the golden files more gracefully

2022-01-06 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee resolved SPARK-37835.
-
Resolution: Won't Do

> Fix the comments on SQLQueryTestSuite.scala to generate the golden files more 
> gracefully
> 
>
> Key: SPARK-37835
> URL: https://issues.apache.org/jira/browse/SPARK-37835
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Currently, the comments on `SQLQueryTestSuite.scala` say:
> {code:java}
>  * To re-generate golden files for entire suite, run:
>  * {{{
>  *   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly 
> *SQLQueryTestSuite" 
>  * }}} {code}
> But following this comment rewrites the "-- Automatically generated by 
> SQLQueryTestSuite" header with different test suite names across all the 
> golden files, even though the contents of the golden files are not modified.
> So we had better recommend using `org.apache.spark.sql.SQLQueryTestSuite` 
> instead of `*SQLQueryTestSuite`.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37398) Inline type hints for python/pyspark/ml/classification.py

2022-01-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17470317#comment-17470317
 ] 

Apache Spark commented on SPARK-37398:
--

User 'javierivanov' has created a pull request for this issue:
https://github.com/apache/spark/pull/35124

> Inline type hints for python/pyspark/ml/classification.py
> -
>
> Key: SPARK-37398
> URL: https://issues.apache.org/jira/browse/SPARK-37398
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Inline type hints from python/pyspark/ml/classification.pyi to 
> python/pyspark/ml/classification.py.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37398) Inline type hints for python/pyspark/ml/classification.py

2022-01-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37398:


Assignee: (was: Apache Spark)

> Inline type hints for python/pyspark/ml/classification.py
> -
>
> Key: SPARK-37398
> URL: https://issues.apache.org/jira/browse/SPARK-37398
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Inline type hints from python/pyspark/ml/classification.pyi to 
> python/pyspark/ml/classification.py.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37398) Inline type hints for python/pyspark/ml/classification.py

2022-01-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37398:


Assignee: Apache Spark

> Inline type hints for python/pyspark/ml/classification.py
> -
>
> Key: SPARK-37398
> URL: https://issues.apache.org/jira/browse/SPARK-37398
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>Priority: Major
>
> Inline type hints from python/pyspark/ml/classification.pyi to 
> python/pyspark/ml/classification.py.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37834) Reenable length check in Python linter

2022-01-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37834:


Assignee: (was: Apache Spark)

> Reenable length check in Python linter
> --
>
> Key: SPARK-37834
> URL: https://issues.apache.org/jira/browse/SPARK-37834
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra, PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> SPARK-37380 mistakenly removed the length check for PySpark in the codebase. 
> We should reenable it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37834) Reenable length check in Python linter

2022-01-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37834:


Assignee: Apache Spark

> Reenable length check in Python linter
> --
>
> Key: SPARK-37834
> URL: https://issues.apache.org/jira/browse/SPARK-37834
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra, PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> SPARK-37380 mistakenly removed the length check for PySpark in the codebase. 
> We should reenable it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37834) Reenable length check in Python linter

2022-01-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17470315#comment-17470315
 ] 

Apache Spark commented on SPARK-37834:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/35123

> Reenable length check in Python linter
> --
>
> Key: SPARK-37834
> URL: https://issues.apache.org/jira/browse/SPARK-37834
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra, PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> SPARK-37380 mistakenly removed the length check for PySpark in the codebase. 
> We should reenable it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37835) Fix the comments on SQLQueryTestSuite.scala to generate the golden files more gracefully

2022-01-06 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17470310#comment-17470310
 ] 

Haejoon Lee commented on SPARK-37835:
-

I'm working on a fix.

> Fix the comments on SQLQueryTestSuite.scala to generate the golden files more 
> gracefully
> 
>
> Key: SPARK-37835
> URL: https://issues.apache.org/jira/browse/SPARK-37835
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Currently, the comments on `SQLQueryTestSuite.scala` say:
> {code:java}
>  * To re-generate golden files for entire suite, run:
>  * {{{
>  *   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly 
> *SQLQueryTestSuite" 
>  * }}} {code}
> But following this comment rewrites the "-- Automatically generated by 
> SQLQueryTestSuite" header with different test suite names across all the 
> golden files, even though the contents of the golden files are not modified.
> So we had better recommend using `org.apache.spark.sql.SQLQueryTestSuite` 
> instead of `*SQLQueryTestSuite`.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37772) [PYSPARK] Publish ApacheSparkGitHubActionImage arm64 docker image

2022-01-06 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17470309#comment-17470309
 ] 

Yikun Jiang commented on SPARK-37772:
-

Added the PR: 
https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage/pull/6
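
For context, publishing an arm64 variant of a docker image is typically done with Docker Buildx; a hypothetical sketch (the image name, tag, and build context are placeholders, not taken from the PR):
{code:sh}
# Build and push a multi-arch image that includes linux/arm64
docker buildx build --platform linux/amd64,linux/arm64 \
  -t example/apache-spark-github-action-image:latest --push .
{code}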

> [PYSPARK] Publish ApacheSparkGitHubActionImage arm64 docker image
> -
>
> Key: SPARK-37772
> URL: https://issues.apache.org/jira/browse/SPARK-37772
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37835) Fix the comments on SQLQueryTestSuite.scala to generate the golden files more gracefully

2022-01-06 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-37835:
---

 Summary: Fix the comments on SQLQueryTestSuite.scala to generate 
the golden files more gracefully
 Key: SPARK-37835
 URL: https://issues.apache.org/jira/browse/SPARK-37835
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Haejoon Lee


Currently, the comments on `SQLQueryTestSuite.scala` say:
{code:java}
 * To re-generate golden files for entire suite, run:
 * {{{
 *   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *SQLQueryTestSuite" 
 * }}} {code}
But following this comment rewrites the "-- Automatically generated by 
SQLQueryTestSuite" header with different test suite names across all the golden 
files, even though the contents of the golden files are not modified.

So we had better recommend using `org.apache.spark.sql.SQLQueryTestSuite` 
instead of `*SQLQueryTestSuite`.
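
One way to see the side effect described above is to inspect the header line of the checked-in golden files after a regeneration run (a sketch; the results directory path is an assumption):
{code:sh}
# The first line of each golden file records which suite generated it
grep -r "Automatically generated by" sql/core/src/test/resources/sql-tests/results | head
{code}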



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37786) StreamingQueryListener should also support SQLConf

2022-01-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17470307#comment-17470307
 ] 

Apache Spark commented on SPARK-37786:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/35122

> StreamingQueryListener should also support SQLConf
> --
>
> Key: SPARK-37786
> URL: https://issues.apache.org/jira/browse/SPARK-37786
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently, QueryExecutionListener only supports Spark conf as a parameter; 
> sometimes we want to use SQLConf in it, but that is hard to do. We can 
> support this.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37780) QueryExecutionListener should also support SQLConf

2022-01-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17470306#comment-17470306
 ] 

Apache Spark commented on SPARK-37780:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/35122

> QueryExecutionListener should also support SQLConf
> --
>
> Key: SPARK-37780
> URL: https://issues.apache.org/jira/browse/SPARK-37780
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently, QueryExecutionListener only supports Spark conf as a parameter; 
> sometimes we want to use SQLConf in it, but that is hard to do. We can 
> support this.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37834) Reenable length check in Python linter

2022-01-06 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-37834:


 Summary: Reenable length check in Python linter
 Key: SPARK-37834
 URL: https://issues.apache.org/jira/browse/SPARK-37834
 Project: Spark
  Issue Type: Bug
  Components: Project Infra, PySpark
Affects Versions: 3.3.0
Reporter: Hyukjin Kwon


SPARK-37380 mistakenly removed the length check of PySpark in the codebase. We 
should reenable it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37601) Could/should sql.DataFrame.transform accept function parameters?

2022-01-06 Thread Maciej Szymkiewicz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17470304#comment-17470304
 ] 

Maciej Szymkiewicz commented on SPARK-37601:


Thanks for pointing this out [~dongjoon]!

> Could/should sql.DataFrame.transform accept function parameters?
> 
>
> Key: SPARK-37601
> URL: https://issues.apache.org/jira/browse/SPARK-37601
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Rafal Wojdyla
>Priority: Major
>
> {code:python}
> def foo(df: DataFrame, p: int) -> DataFrame:
>   ...
> # current:
> from functools import partial
> df.transform(partial(foo, p=3))
> # or:
> df.transform(lambda df: foo(df, p=3))
> # vs suggested:
> df.transform(foo, p=3)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37833) Add `precondition` jobs for skip the main GitHub Action jobs

2022-01-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17470301#comment-17470301
 ] 

Apache Spark commented on SPARK-37833:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35121

> Add `precondition` jobs for skip the main GitHub Action jobs
> 
>
> Key: SPARK-37833
> URL: https://issues.apache.org/jira/browse/SPARK-37833
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37833) Add `precondition` jobs for skip the main GitHub Action jobs

2022-01-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37833:


Assignee: Apache Spark

> Add `precondition` jobs for skip the main GitHub Action jobs
> 
>
> Key: SPARK-37833
> URL: https://issues.apache.org/jira/browse/SPARK-37833
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37833) Add `precondition` jobs for skip the main GitHub Action jobs

2022-01-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37833:


Assignee: (was: Apache Spark)

> Add `precondition` jobs for skip the main GitHub Action jobs
> 
>
> Key: SPARK-37833
> URL: https://issues.apache.org/jira/browse/SPARK-37833
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37833) Add `precondition` jobs for skip the main GitHub Action jobs

2022-01-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17470302#comment-17470302
 ] 

Apache Spark commented on SPARK-37833:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35121

> Add `precondition` jobs for skip the main GitHub Action jobs
> 
>
> Key: SPARK-37833
> URL: https://issues.apache.org/jira/browse/SPARK-37833
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37833) Add `precondition` jobs for skip the main GitHub Action jobs

2022-01-06 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-37833:
-

 Summary: Add `precondition` jobs for skip the main GitHub Action 
jobs
 Key: SPARK-37833
 URL: https://issues.apache.org/jira/browse/SPARK-37833
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra, Tests
Affects Versions: 3.3.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37832) Orc struct serializer should look up field converters in an array rather than a linked list

2022-01-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-37832.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35120
[https://github.com/apache/spark/pull/35120]

> Orc struct serializer should look up field converters in an array rather than 
> a linked list
> ---
>
> Key: SPARK-37832
> URL: https://issues.apache.org/jira/browse/SPARK-37832
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
> Fix For: 3.3.0
>
>
> The OrcSerializer's struct converter uses an index to look up a field 
> converter in a linked list, resulting in a n*(n/2) average complexity per row 
> (where n is the field count).
> Simply converting the linked list to an array brings performance gains, 
> especially for wide structs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37832) Orc struct serializer should look up field converters in an array rather than a linked list

2022-01-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-37832:
-

Assignee: Bruce Robbins

> Orc struct serializer should look up field converters in an array rather than 
> a linked list
> ---
>
> Key: SPARK-37832
> URL: https://issues.apache.org/jira/browse/SPARK-37832
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
>
> The OrcSerializer's struct converter uses an index to look up a field 
> converter in a linked list, resulting in a n*(n/2) average complexity per row 
> (where n is the field count).
> Simply converting the linked list to an array brings performance gains, 
> especially for wide structs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37772) [PYSPARK] Publish ApacheSparkGitHubActionImage arm64 docker image

2022-01-06 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17470295#comment-17470295
 ] 

Yikun Jiang commented on SPARK-37772:
-

To speed up the installation on pypy (aarch64), I raised issues in the 
numpy/scipy/pandas wheels repos:

https://github.com/MacPython/numpy-wheels/issues/143
https://github.com/MacPython/scipy-wheels/issues/161
https://github.com/MacPython/pandas-wheels/issues/171

> [PYSPARK] Publish ApacheSparkGitHubActionImage arm64 docker image
> -
>
> Key: SPARK-37772
> URL: https://issues.apache.org/jira/browse/SPARK-37772
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37821) spark thrift server RDD ID overflow lead sql execute failed

2022-01-06 Thread muhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

muhong updated SPARK-37821:
---
Description: 
This problem can happen in long-running Spark applications, such as the Thrift server.

Since there is only one SparkContext instance on the Thrift server driver side, 
if the concurrency of SQL requests is high or the SQL is very complicated 
(which creates a lot of RDDs), RDDs are generated quickly and the RDD id 
(SparkContext.scala#nextRddId: [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala]) 
is consumed fast; after a few months nextRddId overflows. The newRddId becomes 
a negative number, but an RDD's block id must be positive, so this leads to the 
exception "Failed to parse rdd_-2123452330_2 into block ID" (the block id 
format is val RDD = "rdd_([0-9]+)_([0-9]+)".r, see 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockId.scala]). 
Data can then no longer be exchanged during SQL execution, and the SQL fails.

If the RDD id overflows, the error occurs when rdd.mapPartitions executes. The 
error happens on the driver side, when the driver deserializes the block id 
from the "block message" input stream.

When an executor invokes rdd.mapPartitions, it calls the block manager to 
report block status; since the block id is negative, when the message is sent 
back to the driver, the driver's regex fails to match and throws an exception.

How can the problem be fixed?

SparkContext.scala:
{code:java}
...
...
private val nextShuffleId = new AtomicInteger(0)

private[spark] def newShuffleId(): Int = nextShuffleId.getAndIncrement()

// changed: `val` becomes `var` so the counter can be replaced after overflow
private var nextRddId = new AtomicInteger(0)

/** Register a new RDD, returning its RDD ID */
// changed: reset the counter instead of handing out negative ids
private[spark] def newRddId(): Int = {
  var id = nextRddId.getAndIncrement()
  if (id > 0) {
    return id
  }
  this.synchronized {
    id = nextRddId.getAndIncrement()
    if (id < 0) {
      nextRddId = new AtomicInteger(0)
      id = nextRddId.getAndIncrement()
    }
  }
  id
}
...
...{code}
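
For illustration, here is a minimal, self-contained sketch (hypothetical demo 
code, not part of Spark) showing why the block id regex quoted above cannot 
parse an overflowed, negative RDD id:
{code:java}
object NegativeRddIdDemo {
  // The RDD block id pattern from the description: digits only, so the minus
  // sign produced by an overflowed id can never match.
  val RDD = "rdd_([0-9]+)_([0-9]+)".r

  def main(args: Array[String]): Unit = {
    "rdd_-2123452330_2" match {
      case RDD(rddId, splitIndex) => println(s"parsed rdd $rddId, split $splitIndex")
      case _ => println("Failed to parse rdd_-2123452330_2 into block ID")
    }
  }
}
{code}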

  was:
This problem can happen in long-running Spark applications, such as the Thrift server.

Since there is only one SparkContext instance on the Thrift server driver side, 
if the concurrency of SQL requests is high or the SQL is very complicated 
(which creates a lot of RDDs), RDDs are generated quickly and the RDD id 
(SparkContext.scala#nextRddId: [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala]) 
is consumed fast; after a few months nextRddId overflows. The newRddId becomes 
a negative number, but an RDD's block id must be positive, so this leads to the 
exception "Failed to parse rdd_-2123452330_2 into block ID" (the block id 
format is val RDD = "rdd_([0-9]+)_([0-9]+)".r, see 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockId.scala]). 
Data can then no longer be exchanged during SQL execution, and the SQL fails.

If the RDD id overflows, an error occurs when rdd.mapPartitions executes.

How can the problem be fixed?

SparkContext.scala:

 
{code:java}
...
...
private val nextShuffleId = new AtomicInteger(0)

private[spark] def newShuffleId(): Int = nextShuffleId.getAndIncrement()

// changed: `val` becomes `var` so the counter can be replaced after overflow
private var nextRddId = new AtomicInteger(0)

/** Register a new RDD, returning its RDD ID */
// changed: reset the counter instead of handing out negative ids
private[spark] def newRddId(): Int = {
  var id = nextRddId.getAndIncrement()
  if (id > 0) {
    return id
  }
  this.synchronized {
    id = nextRddId.getAndIncrement()
    if (id < 0) {
      nextRddId = new AtomicInteger(0)
      id = nextRddId.getAndIncrement()
    }
  }
  id
}
...
...{code}


> spark thrift server RDD ID overflow lead sql execute failed
> ---
>
> Key: SPARK-37821
> URL: https://issues.apache.org/jira/browse/SPARK-37821
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: muhong
>Priority: Major
>
> This problem can happen in long-running Spark applications, such as the Thrift server.
> Since there is only one SparkContext instance on the Thrift server driver side, 
> if the concurrency of SQL requests is high or the SQL is very complicated 
> (which creates a lot of RDDs), RDDs are generated quickly and the RDD id 
> (SparkContext.scala#nextRddId: [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala]) 
> is consumed fast; after a few months nextRddId overflows. The newRddId 
> becomes a negative number, but an RDD's block id must be positive, so this 
> leads to the exception "Failed to parse rdd_-2123452330_2 into block 
> ID" (the block id format is val RDD = 
> 

[jira] [Updated] (SPARK-37821) spark thrift server RDD ID overflow lead sql execute failed

2022-01-06 Thread muhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

muhong updated SPARK-37821:
---
Description: 
This problem can happen in long-running Spark applications, such as the Thrift server.

Since there is only one SparkContext instance on the Thrift server driver side, 
if the concurrency of SQL requests is high or the SQL is very complicated 
(which creates a lot of RDDs), RDDs are generated quickly and the RDD id 
(SparkContext.scala#nextRddId: [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala]) 
is consumed fast; after a few months nextRddId overflows. The newRddId becomes 
a negative number, but an RDD's block id must be positive, so this leads to the 
exception "Failed to parse rdd_-2123452330_2 into block ID" (the block id 
format is val RDD = "rdd_([0-9]+)_([0-9]+)".r, see 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockId.scala]). 
Data can then no longer be exchanged during SQL execution, and the SQL fails.

If the RDD id overflows, an error occurs when rdd.mapPartitions executes.

How can the problem be fixed?

SparkContext.scala:

 
{code:java}
...
...
private val nextShuffleId = new AtomicInteger(0)

private[spark] def newShuffleId(): Int = nextShuffleId.getAndIncrement()

// changed: `val` becomes `var` so the counter can be replaced after overflow
private var nextRddId = new AtomicInteger(0)

/** Register a new RDD, returning its RDD ID */
// changed: reset the counter instead of handing out negative ids
private[spark] def newRddId(): Int = {
  var id = nextRddId.getAndIncrement()
  if (id > 0) {
    return id
  }
  this.synchronized {
    id = nextRddId.getAndIncrement()
    if (id < 0) {
      nextRddId = new AtomicInteger(0)
      id = nextRddId.getAndIncrement()
    }
  }
  id
}
...
...{code}

  was:
This problem can happen in long-running Spark applications, such as the Thrift server.

Since there is only one SparkContext instance on the Thrift server driver side, 
if the concurrency of SQL requests is high or the SQL is very complicated 
(which creates a lot of RDDs), RDDs are generated quickly and the RDD id 
(SparkContext.scala#nextRddId: [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala]) 
is consumed fast; after a few months nextRddId overflows. The newRddId becomes 
a negative number, but an RDD's block id must be positive, so this leads to the 
exception "Failed to parse rdd_-2123452330_2 into block ID" (the block id 
format is val RDD = "rdd_([0-9]+)_([0-9]+)".r, see 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockId.scala]). 
Data can then no longer be exchanged during SQL execution, and the SQL fails.

If the RDD id overflows, an error occurs when rdd.mapPartitions executes.

How can the problem be fixed?

SparkContext.scala:

 
{code:java}
...
...
private val nextShuffleId = new AtomicInteger(0)

private[spark] def newShuffleId(): Int = nextShuffleId.getAndIncrement()

private val nextRddId = new AtomicInteger(0)

/** Register a new RDD, returning its RDD ID */
private[spark] def newRddId(): Int = nextRddId.getAndIncrement()

/**
 * Registers listeners specified in spark.extraListeners, then starts the
 * listener bus. This should be called after all internal listeners have been
 * registered with the listener bus (e.g. after the web UI and event logging
 * listeners have been registered).
 */
private def setupAndStartListenerBus(): Unit = {
  try {
    conf.get(EXTRA_LISTENERS).foreach { classNames =>
      val listeners = Utils.loadExtensions(classOf[SparkListenerInterface], classNames, conf)
      listeners.foreach { listener =>
        listenerBus.addToSharedQueue(listener)
        logInfo(s"Registered listener ${listener.getClass().getName()}")
      }
    }
  } catch {
    case e: Exception =>
      try {
        stop()
      } finally {
        throw new SparkException(s"Exception when registering SparkListener", e)
      }
  }
  listenerBus.start(this, _env.metricsSystem)
  _listenerBusStarted = true
}

...
...{code}


> spark thrift server RDD ID overflow lead sql execute failed
> ---
>
> Key: SPARK-37821
> URL: https://issues.apache.org/jira/browse/SPARK-37821
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: muhong
>Priority: Major
>
> This problem can happen in long-running Spark applications, such as the Thrift server.
> Since there is only one SparkContext instance on the Thrift server driver side, 
> if the concurrency of SQL requests is high or the SQL is very complicated 
> (which creates a lot of RDDs), RDDs are generated quickly and the RDD id 
> (SparkContext.scala#nextRddId: [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala]
>  ) will be 

[jira] [Updated] (SPARK-37821) spark thrift server RDD ID overflow lead sql execute failed

2022-01-06 Thread muhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

muhong updated SPARK-37821:
---
Description: 
This problem can happen in long-running Spark applications, such as the Thrift server.

Since there is only one SparkContext instance on the Thrift server driver side, 
if the concurrency of SQL requests is high or the SQL is very complicated 
(which creates a lot of RDDs), RDDs are generated quickly and the RDD id 
(SparkContext.scala#nextRddId: [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala]) 
is consumed fast; after a few months nextRddId overflows. The newRddId becomes 
a negative number, but an RDD's block id must be positive, so this leads to the 
exception "Failed to parse rdd_-2123452330_2 into block ID" (the block id 
format is val RDD = "rdd_([0-9]+)_([0-9]+)".r, see 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockId.scala]). 
Data can then no longer be exchanged during SQL execution, and the SQL fails.

If the RDD id overflows, an error occurs when rdd.mapPartitions executes.

How can the problem be fixed?

SparkContext.scala:

 
{code:java}
...
...
private val nextShuffleId = new AtomicInteger(0)

private[spark] def newShuffleId(): Int = nextShuffleId.getAndIncrement()

private val nextRddId = new AtomicInteger(0)

/** Register a new RDD, returning its RDD ID */
private[spark] def newRddId(): Int = nextRddId.getAndIncrement()

/**
 * Registers listeners specified in spark.extraListeners, then starts the
 * listener bus. This should be called after all internal listeners have been
 * registered with the listener bus (e.g. after the web UI and event logging
 * listeners have been registered).
 */
private def setupAndStartListenerBus(): Unit = {
  try {
    conf.get(EXTRA_LISTENERS).foreach { classNames =>
      val listeners = Utils.loadExtensions(classOf[SparkListenerInterface], classNames, conf)
      listeners.foreach { listener =>
        listenerBus.addToSharedQueue(listener)
        logInfo(s"Registered listener ${listener.getClass().getName()}")
      }
    }
  } catch {
    case e: Exception =>
      try {
        stop()
      } finally {
        throw new SparkException(s"Exception when registering SparkListener", e)
      }
  }
  listenerBus.start(this, _env.metricsSystem)
  _listenerBusStarted = true
}

...
...{code}

  was:
This problem can happen in long-running Spark applications, such as the Thrift server.

Since there is only one SparkContext instance on the Thrift server driver side, 
if the concurrency of SQL requests is high or the SQL is very complicated 
(which creates a lot of RDDs), RDDs are generated quickly and the RDD id 
(SparkContext.scala#nextRddId: [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala]) 
is consumed fast; after a few months nextRddId overflows. The newRddId becomes 
a negative number, but an RDD's block id must be positive, so this leads to the 
exception "Failed to parse rdd_-2123452330_2 into block ID" (the block id 
format is val RDD = "rdd_([0-9]+)_([0-9]+)".r, see 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockId.scala]). 
Data can then no longer be exchanged during SQL execution, and the SQL fails.

If the RDD id overflows, an error occurs when rdd.mapPartitions executes.


> spark thrift server RDD ID overflow lead sql execute failed
> ---
>
> Key: SPARK-37821
> URL: https://issues.apache.org/jira/browse/SPARK-37821
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: muhong
>Priority: Major
>
> This problem can happen in long-running Spark applications, such as the Thrift server.
> Since there is only one SparkContext instance on the Thrift server driver side, 
> if the concurrency of SQL requests is high or the SQL is very complicated 
> (which creates a lot of RDDs), RDDs are generated quickly and the RDD id 
> (SparkContext.scala#nextRddId: [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala]) 
> is consumed fast; after a few months nextRddId overflows. The newRddId 
> becomes a negative number, but an RDD's block id must be positive, so this 
> leads to the exception "Failed to parse rdd_-2123452330_2 into block ID" 
> (the block id format is val RDD = 
> "rdd_([0-9]+)_([0-9]+)".r, see [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockId.scala]). 
> Data can then no longer be exchanged during SQL execution, and the SQL fails.
>  
> If the RDD id overflows, an error occurs when rdd.mapPartitions executes.
>  
> How can the problem be fixed?
> SparkContext.scala
>  
> {code:java}
> ...
> 

[jira] [Assigned] (SPARK-37823) Add `is-changed.py` dev script

2022-01-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37823:


Assignee: Dongjoon Hyun

> Add `is-changed.py` dev script
> --
>
> Key: SPARK-37823
> URL: https://issues.apache.org/jira/browse/SPARK-37823
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37823) Add `is-changed.py` dev script

2022-01-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37823.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35112
[https://github.com/apache/spark/pull/35112]

> Add `is-changed.py` dev script
> --
>
> Key: SPARK-37823
> URL: https://issues.apache.org/jira/browse/SPARK-37823
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37832) Orc struct serializer should look up field converters in an array rather than a linked list

2022-01-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37832:


Assignee: (was: Apache Spark)

> Orc struct serializer should look up field converters in an array rather than 
> a linked list
> ---
>
> Key: SPARK-37832
> URL: https://issues.apache.org/jira/browse/SPARK-37832
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bruce Robbins
>Priority: Major
>
> The OrcSerializer's struct converter uses an index to look up a field 
> converter in a linked list, resulting in a n*(n/2) average complexity per row 
> (where n is the field count).
> Simply converting the linked list to an array brings performance gains, 
> especially for wide structs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37832) Orc struct serializer should look up field converters in an array rather than a linked list

2022-01-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17470259#comment-17470259
 ] 

Apache Spark commented on SPARK-37832:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/35120

> Orc struct serializer should look up field converters in an array rather than 
> a linked list
> ---
>
> Key: SPARK-37832
> URL: https://issues.apache.org/jira/browse/SPARK-37832
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bruce Robbins
>Priority: Major
>
> The OrcSerializer's struct converter uses an index to look up a field 
> converter in a linked list, resulting in a n*(n/2) average complexity per row 
> (where n is the field count).
> Simply converting the linked list to an array brings performance gains, 
> especially for wide structs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37832) Orc struct serializer should look up field converters in an array rather than a linked list

2022-01-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17470258#comment-17470258
 ] 

Apache Spark commented on SPARK-37832:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/35120

> Orc struct serializer should look up field converters in an array rather than 
> a linked list
> ---
>
> Key: SPARK-37832
> URL: https://issues.apache.org/jira/browse/SPARK-37832
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bruce Robbins
>Priority: Major
>
> The OrcSerializer's struct converter uses an index to look up a field 
> converter in a linked list, resulting in a n*(n/2) average complexity per row 
> (where n is the field count).
> Simply converting the linked list to an array brings performance gains, 
> especially for wide structs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37832) Orc struct serializer should look up field converters in an array rather than a linked list

2022-01-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37832:


Assignee: Apache Spark

> Orc struct serializer should look up field converters in an array rather than 
> a linked list
> ---
>
> Key: SPARK-37832
> URL: https://issues.apache.org/jira/browse/SPARK-37832
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bruce Robbins
>Assignee: Apache Spark
>Priority: Major
>
> The OrcSerializer's struct converter uses an index to look up a field 
> converter in a linked list, resulting in a n*(n/2) average complexity per row 
> (where n is the field count).
> Simply converting the linked list to an array brings performance gains, 
> especially for wide structs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37832) Orc struct serializer should look up field converters in an array rather than a linked list

2022-01-06 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17470235#comment-17470235
 ] 

Bruce Robbins commented on SPARK-37832:
---

I will post a PR shortly.

> Orc struct serializer should look up field converters in an array rather than 
> a linked list
> ---
>
> Key: SPARK-37832
> URL: https://issues.apache.org/jira/browse/SPARK-37832
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bruce Robbins
>Priority: Major
>
> The OrcSerializer's struct converter uses an index to look up a field 
> converter in a linked list, resulting in a n*(n/2) average complexity per row 
> (where n is the field count).
> Simply converting the linked list to an array brings performance gains, 
> especially for wide structs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37832) Orc struct serializer should look up field converters in an array rather than a linked list

2022-01-06 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-37832:
-

 Summary: Orc struct serializer should look up field converters in 
an array rather than a linked list
 Key: SPARK-37832
 URL: https://issues.apache.org/jira/browse/SPARK-37832
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Bruce Robbins


The OrcSerializer's struct converter uses an index to look up a field converter 
in a linked list, resulting in n*(n/2) average complexity per row (where n is 
the field count).

Simply converting the linked list to an array brings performance gains, 
especially for wide structs.
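
As a rough illustration (hypothetical sketch, not the OrcSerializer code 
itself), indexed access into a linked list walks the list on every lookup, 
while an array lookup is constant time:
{code:java}
// Build the same converters as a List and as an Array.
val converters: List[String => String] = List.fill(1000)((s: String) => s)
val converterArray: Array[String => String] = converters.toArray

// Per row with n fields: converters(i) traverses i list nodes, ~n*(n/2) steps
// in total, while converterArray(i) is O(1), so a whole row costs ~n steps.
for (i <- converters.indices) {
  val convert = converterArray(i) // constant-time lookup
  convert(s"field$i")
}
{code}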



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37710) Add detailed log message for java.io.IOException occurring on Kryo flow

2022-01-06 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-37710:
---

Assignee: Eren Avsarogullari

> Add detailed log message for java.io.IOException occurring on Kryo flow
> ---
>
> Key: SPARK-37710
> URL: https://issues.apache.org/jira/browse/SPARK-37710
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: Eren Avsarogullari
>Assignee: Eren Avsarogullari
>Priority: Major
>
> *Input/output error* usually points to environmental issues such as disk 
> read/write failures due to disk corruption, network access failures, etc. This 
> PR aims to add a detailed error message to catch this kind of 
> environmental case occurring on a problematic BlockManager, and logs it with 
> *BlockManager hostname, blockId and blockPath* details.
> The following stack trace occurred on disk corruption:
> {code:java}
> com.esotericsoftware.kryo.KryoException: java.io.IOException: Input/output 
> error
> Serialization trace:
> buffers (org.apache.spark.sql.execution.columnar.DefaultCachedBatch)
>     at com.esotericsoftware.kryo.io.Input.fill(Input.java:166)
>     at com.esotericsoftware.kryo.io.Input.require(Input.java:196)
>     at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:346)
>     at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:326)
>     at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:55)
>     at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:38)
>     at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:789)
>     at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:381)
>     at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:302)
>     at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:789)
>     at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:132)
>     at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:543)
>     at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:816)
>     at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:296)
>     at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:168)
>     at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>     at 
> org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
>     at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
>     at 
> org.apache.spark.storage.BlockManager.maybeCacheDiskValuesInMemory(BlockManager.scala:1569)
>     at 
> org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:877)
>     at org.apache.spark.storage.BlockManager.get(BlockManager.scala:1163)
> ...
> Caused by: java.io.IOException: Input/output error
>     at java.io.FileInputStream.readBytes(Native Method)
>     at java.io.FileInputStream.read(FileInputStream.java:255)
>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>     at 
> net.jpountz.lz4.LZ4BlockInputStream.tryReadFully(LZ4BlockInputStream.java:269)
>     at 
> net.jpountz.lz4.LZ4BlockInputStream.readFully(LZ4BlockInputStream.java:280)
>     at 
> net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:243)
>     at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157)
>     at com.esotericsoftware.kryo.io.Input.fill(Input.java:164)
>     ... 87 more {code}
> *Proposed Error Message:*
> {code:java}
> java.io.IOException: Input/output error. BlockManagerId(driver, localhost, 
> 49455, None) - blockId: test_my-block-id - blockDiskPath: 
> /private/var/folders/kj/mccyycwn6mjdwnglw9g3k6pmgq/T/blockmgr-12dba181-771e-4ff9-a2bc-fa3ce6dbabfa/11/test_my-block-id
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37710) Add detailed log message for java.io.IOException occurring on Kryo flow

2022-01-06 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-37710.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34980
[https://github.com/apache/spark/pull/34980]

> Add detailed log message for java.io.IOException occurring on Kryo flow
> ---
>
> Key: SPARK-37710
> URL: https://issues.apache.org/jira/browse/SPARK-37710
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: Eren Avsarogullari
>Assignee: Eren Avsarogullari
>Priority: Major
> Fix For: 3.3.0
>
>
> *Input/output error* usually points to environmental issues such as disk 
> read/write failures due to disk corruption, network access failures, etc. This 
> PR aims to add a detailed error message to catch this kind of 
> environmental case occurring on a problematic BlockManager, and logs it with 
> *BlockManager hostname, blockId and blockPath* details.
> The following stack trace occurred on disk corruption:
> {code:java}
> com.esotericsoftware.kryo.KryoException: java.io.IOException: Input/output 
> error
> Serialization trace:
> buffers (org.apache.spark.sql.execution.columnar.DefaultCachedBatch)
>     at com.esotericsoftware.kryo.io.Input.fill(Input.java:166)
>     at com.esotericsoftware.kryo.io.Input.require(Input.java:196)
>     at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:346)
>     at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:326)
>     at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:55)
>     at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:38)
>     at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:789)
>     at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:381)
>     at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:302)
>     at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:789)
>     at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:132)
>     at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:543)
>     at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:816)
>     at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:296)
>     at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:168)
>     at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>     at 
> org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
>     at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
>     at 
> org.apache.spark.storage.BlockManager.maybeCacheDiskValuesInMemory(BlockManager.scala:1569)
>     at 
> org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:877)
>     at org.apache.spark.storage.BlockManager.get(BlockManager.scala:1163)
> ...
> Caused by: java.io.IOException: Input/output error
>     at java.io.FileInputStream.readBytes(Native Method)
>     at java.io.FileInputStream.read(FileInputStream.java:255)
>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>     at 
> net.jpountz.lz4.LZ4BlockInputStream.tryReadFully(LZ4BlockInputStream.java:269)
>     at 
> net.jpountz.lz4.LZ4BlockInputStream.readFully(LZ4BlockInputStream.java:280)
>     at 
> net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:243)
>     at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157)
>     at com.esotericsoftware.kryo.io.Input.fill(Input.java:164)
>     ... 87 more {code}
> *Proposed Error Message:*
> {code:java}
> java.io.IOException: Input/output error. BlockManagerId(driver, localhost, 
> 49455, None) - blockId: test_my-block-id - blockDiskPath: 
> /private/var/folders/kj/mccyycwn6mjdwnglw9g3k6pmgq/T/blockmgr-12dba181-771e-4ff9-a2bc-fa3ce6dbabfa/11/test_my-block-id
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37527) Translate more standard aggregate functions for pushdown

2022-01-06 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao resolved SPARK-37527.

Fix Version/s: 3.3.0
 Assignee: jiaan.geng
   Resolution: Fixed

> Translate more standard aggregate functions for pushdown
> 
>
> Key: SPARK-37527
> URL: https://issues.apache.org/jira/browse/SPARK-37527
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently, Spark aggregate pushdown translates some standard aggregate 
> functions so that they can be compiled for a specific database.
> After this job, users can override JdbcDialect.compileAggregate to 
> implement aggregate functions supported by a particular database.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37710) Add detailed log message for java.io.IOException occurring on Kryo flow

2022-01-06 Thread Eren Avsarogullari (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eren Avsarogullari updated SPARK-37710:
---
Summary: Add detailed log message for java.io.IOException occurring on Kryo 
flow  (was: Add detailed error message for java.io.IOException occurring on 
Kryo flow)

> Add detailed log message for java.io.IOException occurring on Kryo flow
> ---
>
> Key: SPARK-37710
> URL: https://issues.apache.org/jira/browse/SPARK-37710
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: Eren Avsarogullari
>Priority: Major
>
> *Input/output error* usually points to environmental issues such as disk 
> read/write failures due to disk corruption, network access failures, etc. This 
> PR aims to add a detailed error message to catch this kind of 
> environmental case occurring on a problematic BlockManager, and logs it with 
> *BlockManager hostname, blockId and blockPath* details.
> The following stack trace occurred on disk corruption:
> {code:java}
> com.esotericsoftware.kryo.KryoException: java.io.IOException: Input/output 
> error
> Serialization trace:
> buffers (org.apache.spark.sql.execution.columnar.DefaultCachedBatch)
>     at com.esotericsoftware.kryo.io.Input.fill(Input.java:166)
>     at com.esotericsoftware.kryo.io.Input.require(Input.java:196)
>     at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:346)
>     at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:326)
>     at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:55)
>     at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:38)
>     at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:789)
>     at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:381)
>     at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:302)
>     at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:789)
>     at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:132)
>     at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:543)
>     at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:816)
>     at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:296)
>     at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:168)
>     at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>     at 
> org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
>     at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
>     at 
> org.apache.spark.storage.BlockManager.maybeCacheDiskValuesInMemory(BlockManager.scala:1569)
>     at 
> org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:877)
>     at org.apache.spark.storage.BlockManager.get(BlockManager.scala:1163)
> ...
> Caused by: java.io.IOException: Input/output error
>     at java.io.FileInputStream.readBytes(Native Method)
>     at java.io.FileInputStream.read(FileInputStream.java:255)
>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>     at 
> net.jpountz.lz4.LZ4BlockInputStream.tryReadFully(LZ4BlockInputStream.java:269)
>     at 
> net.jpountz.lz4.LZ4BlockInputStream.readFully(LZ4BlockInputStream.java:280)
>     at 
> net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:243)
>     at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157)
>     at com.esotericsoftware.kryo.io.Input.fill(Input.java:164)
>     ... 87 more {code}
> *Proposed Error Message:*
> {code:java}
> java.io.IOException: Input/output error. BlockManagerId(driver, localhost, 
> 49455, None) - blockId: test_my-block-id - blockDiskPath: 
> /private/var/folders/kj/mccyycwn6mjdwnglw9g3k6pmgq/T/blockmgr-12dba181-771e-4ff9-a2bc-fa3ce6dbabfa/11/test_my-block-id
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37578) DSV2 is not updating Output Metrics

2022-01-06 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17470178#comment-17470178
 ] 

L. C. Hsieh commented on SPARK-37578:
-

Thank you [~dongjoon]

> DSV2 is not updating Output Metrics
> ---
>
> Key: SPARK-37578
> URL: https://issues.apache.org/jira/browse/SPARK-37578
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Sandeep Katta
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.3.0
>
>
> Repro code
> ./bin/spark-shell --master local  --jars 
> /Users/jars/iceberg-spark3-runtime-0.12.1.jar
>  
> {code:java}
> import scala.collection.mutable
> import org.apache.spark.scheduler._
> val bytesWritten = new mutable.ArrayBuffer[Long]()
> val recordsWritten = new mutable.ArrayBuffer[Long]()
> val bytesWrittenListener = new SparkListener() {
>   override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
>     bytesWritten += taskEnd.taskMetrics.outputMetrics.bytesWritten
>     recordsWritten += taskEnd.taskMetrics.outputMetrics.recordsWritten
>   }
> }
> spark.sparkContext.addSparkListener(bytesWrittenListener)
> try {
> val df = spark.range(1000).toDF("id")
>   df.write.format("iceberg").save("Users/data/dsv2_test")
>   
> assert(bytesWritten.sum > 0)
> assert(recordsWritten.sum > 0)
> } finally {
>   spark.sparkContext.removeSparkListener(bytesWrittenListener)
> } {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36405) Check that error class SQLSTATEs are valid

2022-01-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-36405:
--
Affects Version/s: 3.3.0
   (was: 3.2.0)

> Check that error class SQLSTATEs are valid
> --
>
> Key: SPARK-36405
> URL: https://issues.apache.org/jira/browse/SPARK-36405
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Karen Feng
>Assignee: Karen Feng
>Priority: Major
> Fix For: 3.3.0
>
>
> Using the SQLSTATEs in the error class README as the source of truth, we 
> should validate the SQLSTATEs in the error class JSON.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36404) Support nested columns in ORC vectorized reader for data source v2

2022-01-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-36404:
--
Affects Version/s: (was: 3.2.0)

> Support nested columns in ORC vectorized reader for data source v2
> --
>
> Key: SPARK-36404
> URL: https://issues.apache.org/jira/browse/SPARK-36404
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
> Fix For: 3.3.0
>
>
> We added support of nested columns in ORC vectorized reader for data source 
> v1. Data source v2 and v1 both use same underlying implementation for 
> vectorized reader (OrcColumnVector), so we can support data source v2 as well.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36418) Use CAST in parsing of dates/timestamps with default pattern

2022-01-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-36418:
--
Affects Version/s: 3.3.0
   (was: 3.2.0)

> Use CAST in parsing of dates/timestamps with default pattern
> 
>
> Key: SPARK-36418
> URL: https://issues.apache.org/jira/browse/SPARK-36418
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.3.0
>
>
> In functions, CSV/JSON datasources and other places, when the pattern is 
> default, use CAST logic in parsing strings to dates/timestamps.
> Currently, TimestampFormatter.getFormatter() applies the default pattern 
> *yyyy-MM-dd HH:mm:ss* when the pattern is not set, see 
> https://github.com/apache/spark/blob/f2492772baf1d00d802e704f84c22a9c410929e9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala#L344
>  . Instead of that, need to create a special formatter which invokes the cast 
> logic.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36425) PySpark: support CrossValidatorModel get standard deviation of metrics for each paramMap

2022-01-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-36425:
--
Affects Version/s: 3.3.0
   (was: 3.2.0)

> PySpark: support CrossValidatorModel get standard deviation of metrics for 
> each paramMap 
> -
>
> Key: SPARK-36425
> URL: https://issues.apache.org/jira/browse/SPARK-36425
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 3.3.0
>
>
> PySpark: support CrossValidatorModel get standard deviation of metrics for 
> each paramMap.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36451) Ivy skips looking for source and doc pom

2022-01-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-36451:
--
Affects Version/s: 3.3.0
   (was: 3.2.0)

> Ivy skips looking for source and doc pom
> 
>
> Key: SPARK-36451
> URL: https://issues.apache.org/jira/browse/SPARK-36451
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 3.3.0
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Trivial
> Fix For: 3.3.0
>
>
> Since SPARK-35863 upgraded Ivy to 2.5.0, Ivy supports skipping the lookup of 
> the source and doc pom, but at present the remote repo is still queried.
>  
> org.apache.ivy.plugins.parser.m2.PomModuleDescriptorParser#addSourcesAndJavadocArtifactsIfPresent
> {code:java}
> boolean sourcesLookup = !"false"
> .equals(ivySettings.getVariable("ivy.maven.lookup.sources"));
> boolean javadocLookup = !"false"
> .equals(ivySettings.getVariable("ivy.maven.lookup.javadoc"));
> if (!sourcesLookup && !javadocLookup) {
> Message.debug("Sources and javadocs lookup disabled");
> return;
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36420) Use `isEmpty` to improve performance in Pregel's superstep

2022-01-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-36420:
--
Affects Version/s: 3.3.0
   (was: 2.4.7)

> Use `isEmpty` to improve performance in Pregel's superstep
> --
>
> Key: SPARK-36420
> URL: https://issues.apache.org/jira/browse/SPARK-36420
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 3.3.0
>Reporter: xiepengjie
>Assignee: xiepengjie
>Priority: Minor
> Fix For: 3.3.0
>
>
> When I was running `Graphx.connectedComponents` with 20+ billion vertices and 
> edges, I found that count is very slow.
> {code:java}
> object Pregel extends Logging {
>   ...
>   def apply[VD: ClassTag, ED: ClassTag, A: ClassTag] (...): Graph[VD, ED] = {
> ...
> // Maybe messages.isEmpty() is better than messages.count()
> var activeMessages = messages.count()
> // Loop
> var prevG: Graph[VD, ED] = null
> var i = 0
> while (activeMessages > 0 && i < maxIterations) {
>   ...
>   activeMessages = messages.count()
>   ...
> }
> ...
> g
>   } // end of apply
> } // end of class Pregel
> {code}
> Maybe we only need an action operator here to check that active messages are 
> not empty, so we don’t need to use count; it’s better to use isEmpty. I 
> verified it and it worked very well.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36475) Add doc about spark.shuffle.service.fetch.rdd.enabled

2022-01-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-36475:
--
Affects Version/s: 3.3.0
   (was: 3.2.0)

> Add doc about spark.shuffle.service.fetch.rdd.enabled
> -
>
> Key: SPARK-36475
> URL: https://issues.apache.org/jira/browse/SPARK-36475
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.3.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.3.0
>
>
> Add doc about spark.shuffle.service.fetch.rdd.enabled
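> For reference, a minimal sketch of enabling the feature being documented 
> (assumes the external shuffle service is available; the config exists since 
> Spark 3.0):
> {code:java}
> import org.apache.spark.SparkConf
> 
> val conf = new SparkConf()
>   // Fetching RDD blocks from the shuffle service requires the service itself.
>   .set("spark.shuffle.service.enabled", "true")
>   .set("spark.shuffle.service.fetch.rdd.enabled", "true")
> {code}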



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36607) Support BooleanType in UnwrapCastInBinaryComparison

2022-01-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-36607:
--
Affects Version/s: (was: 3.2.0)
   (was: 3.1.2)

> Support BooleanType in UnwrapCastInBinaryComparison
> ---
>
> Key: SPARK-36607
> URL: https://issues.apache.org/jira/browse/SPARK-36607
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kazuyuki Tanimura
>Assignee: Kazuyuki Tanimura
>Priority: Major
> Fix For: 3.3.0
>
>
> Enhancing the previous works from SPARK-24994 and SPARK-32858



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36481) Expose LogisticRegression.setInitialModel

2022-01-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-36481:
--
Affects Version/s: 3.3.0
   (was: 3.2.0)

> Expose LogisticRegression.setInitialModel
> -
>
> Key: SPARK-36481
> URL: https://issues.apache.org/jira/browse/SPARK-36481
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.3.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
> Fix For: 3.3.0
>
>
> Several Spark ML components already allow setting of an initial model, 
> including KMeans, LogisticRegression, and GaussianMixture. This is useful to 
> begin training from a known reasonably good model.
> However, the method in LogisticRegression is private to Spark. I don't see a 
> good reason why it should be, as the counterparts in KMeans et al. are not.
> None of these are exposed in PySpark, which I don't want to question or deal 
> with now; there are other places one could arguably set an initial model too, 
> but here I am just interested in exposing the existing, tested functionality 
> to callers.
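> As a point of comparison, a minimal sketch of the pattern that is already 
> public in mllib's KMeans, assuming an RDD[Vector] named {{data}}:
> {code:java}
> import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
> import org.apache.spark.mllib.linalg.Vectors
> 
> // Warm-start training from a known, reasonably good set of centers.
> val initial = new KMeansModel(Array(Vectors.dense(0.0), Vectors.dense(10.0)))
> val model = new KMeans()
>   .setK(2)
>   .setInitialModel(initial)
>   .run(data)
> {code}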



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36644) Push down boolean column filter

2022-01-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-36644:
--
Affects Version/s: 3.3.0
   (was: 3.2.0)
   (was: 3.1.2)

> Push down boolean column filter
> ---
>
> Key: SPARK-36644
> URL: https://issues.apache.org/jira/browse/SPARK-36644
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: Kazuyuki Tanimura
>Assignee: Kazuyuki Tanimura
>Priority: Major
> Fix For: 3.3.0
>
>
> The following query does not push down the filter 
> ```
> SELECT * FROM t WHERE boolean_field
> ```
> although the following query pushes down the filter as expected.
> ```
> SELECT * FROM t WHERE boolean_field = true
> ```



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36654) Drop type ignores from numpy imports

2022-01-06 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17470161#comment-17470161
 ] 

Dongjoon Hyun commented on SPARK-36654:
---

I assigned this JIRA to you, [~zero323].

> Drop type ignores from numpy imports
> 
>
> Key: SPARK-36654
> URL: https://issues.apache.org/jira/browse/SPARK-36654
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
> Fix For: 3.3.0
>
>
> Currently we use {{type: ignore[import]}} on all numpy imports ‒ this was 
> necessary because numpy didn't provide annotations at the time when we added 
> stubs to PySpark.
> Since numpy 1.20 (https://github.com/numpy/numpy/releases/tag/v1.20.0) numpy 
> is PEP 561 compatible and these ignores are no longer necessary (current 
> numpy version is 1.21).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36654) Drop type ignores from numpy imports

2022-01-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-36654:
-

Assignee: Maciej Szymkiewicz

> Drop type ignores from numpy imports
> 
>
> Key: SPARK-36654
> URL: https://issues.apache.org/jira/browse/SPARK-36654
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
> Fix For: 3.3.0
>
>
> Currently we use {{type: ignore[import]}} on all numpy imports ‒ this was 
> necessary because numpy didn't provide annotations at the time when we added 
> stubs to PySpark.
> Since numpy 1.20 (https://github.com/numpy/numpy/releases/tag/v1.20.0) numpy 
> is PEP 561 compatible and these ignores are no longer necessary (current 
> numpy version is 1.21).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36654) Drop type ignores from numpy imports

2022-01-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-36654:
--
Affects Version/s: (was: 3.2.0)
   (was: 3.1.2)

> Drop type ignores from numpy imports
> 
>
> Key: SPARK-36654
> URL: https://issues.apache.org/jira/browse/SPARK-36654
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
> Fix For: 3.3.0
>
>
> Currently we use {{type: ignore[import]}} on all numpy imports ‒ this was 
> necessary because numpy didn't provide annotations at the time when we added 
> stubs to PySpark.
> Since numpy 1.20 (https://github.com/numpy/numpy/releases/tag/v1.20.0) numpy 
> is PEP 561 compatible and these ignores are no longer necessary (current 
> numpy version is 1.21).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36665) Add more Not operator optimizations

2022-01-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-36665:
--
Affects Version/s: (was: 3.2.0)
   (was: 3.1.2)

> Add more Not operator optimizations
> ---
>
> Key: SPARK-36665
> URL: https://issues.apache.org/jira/browse/SPARK-36665
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kazuyuki Tanimura
>Assignee: Kazuyuki Tanimura
>Priority: Major
> Fix For: 3.3.0
>
>
> {{BooleanSimplification}} should be able to do more simplifications for Not 
> operators by applying the following rules (see the sketch after this list):
>  # {{Not(null) == null}}
>  ## {{e.g. IsNull(Not(...)) can be IsNull(...)}}
>  # {{(Not(a) = b) == (a = Not(b))}}
>  ## {{e.g. Not(...) = true can be (...) = false}}
>  # {{(a != b) == (a = Not(b))}}
>  ## {{e.g. (...) != true can be (...) = false}}
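> A hedged illustration of rule 3 using the DataFrame API, assuming a 
> SparkSession named {{spark}}:
> {code:java}
> import spark.implicits._
> 
> val df = spark.range(10).select(($"id" % 2 === 0).as("a"))
> // Under rule 3, (a != true) can be rewritten to (a = false).
> df.filter($"a" =!= true).explain()
> {code}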



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36721) Simplify boolean equalities if one side is literal

2022-01-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-36721:
--
Affects Version/s: (was: 3.2.0)
   (was: 3.1.2)

> Simplify boolean equalities if one side is literal
> --
>
> Key: SPARK-36721
> URL: https://issues.apache.org/jira/browse/SPARK-36721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kazuyuki Tanimura
>Assignee: Kazuyuki Tanimura
>Priority: Major
> Fix For: 3.3.0
>
>
> The following query does not push down the filter 
> ```
> SELECT * FROM t WHERE (a AND b) = true
> ```
> although the following equivalent query pushes down the filter as expected.
> ```
> SELECT * FROM t WHERE (a AND b) 
> ```



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36663) When the existing field name is a number, an error will be reported when reading the orc file

2022-01-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-36663:
--
Affects Version/s: 3.3.0
   (was: 3.1.2)
   (was: 3.0.3)

> When the existing field name is a number, an error will be reported when 
> reading the orc file
> -
>
> Key: SPARK-36663
> URL: https://issues.apache.org/jira/browse/SPARK-36663
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: mcdull_zhang
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: image-2021-09-03-20-56-28-846.png
>
>
> The problem can be reproduced as follows:
> {quote}val path = "file:///tmp/test_orc"
> spark.range(1).withColumnRenamed("id", "100").repartition(1).write.orc(path)
> spark.read.orc(path)
> {quote}
> The error message is like this:
> {quote}org.apache.spark.sql.catalyst.parser.ParseException:
>  mismatched input '100' expecting {'ADD', 'AFTER'
> == SQL ==
>  struct<100:bigint>
>  ---^^^
> {quote}
> The error is actually issued by this line of code:
> {quote}CatalystSqlParser.parseDataType("100:bigint")
> {quote}
>  
> The background is that Spark calls the above code while converting the ORC 
> file's schema into a Catalyst schema.
> {quote}// code in OrcUtils
>  private def toCatalystSchema(schema: TypeDescription): StructType = {
>    CharVarcharUtils.replaceCharVarcharWithStringInSchema(
>      CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType])
>  }{quote}
> There are two solutions I can currently think of:
>  # Modify SparkSQL's syntax analysis to recognize this kind of schema.
>  # Make the TypeDescription.toString method quote numeric column names, 
> because the following syntax is supported:
> {quote}CatalystSqlParser.parseDataType("`100`:bigint")
> {quote}
> But currently TypeDescription does not support changing the UNQUOTED_NAMES 
> variable. Should we first submit a PR to the ORC project to support 
> configuring this variable?
> !image-2021-09-03-20-56-28-846.png!
>  
> What do Spark members think about this issue?
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36829) Refactor collectionOperation related Null check related code

2022-01-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-36829:
--
Affects Version/s: 3.3.0
   (was: 3.0.2)
   (was: 3.2.0)
   (was: 3.1.2)

> Refactor collectionOperation related Null check related code
> 
>
> Key: SPARK-36829
> URL: https://issues.apache.org/jira/browse/SPARK-36829
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36838) Improve InSet NaN check generated code performance

2022-01-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-36838:
--
Affects Version/s: 3.3.0
   (was: 3.2.0)
   (was: 3.1.2)
   (was: 3.0.3)

> Improve InSet NaN check generated code performance
> --
>
> Key: SPARK-36838
> URL: https://issues.apache.org/jira/browse/SPARK-36838
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.3.0
>
>
> A Set can't check whether a NaN value is contained in it. With codegen, we 
> only need to check whether the value is NaN when the literal value set itself 
> contains NaN; otherwise we just need to check whether the Set contains the 
> value.
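> A minimal sketch of the intended shape in plain Scala (not the actual 
> generated code):
> {code:java}
> val values: Set[Double] = Set(1.0, 2.0)
> 
> // Decided once, analogous to codegen time: does the literal set contain NaN?
> val setHasNaN = values.exists(_.isNaN)
> 
> def inSet(v: Double): Boolean =
>   if (setHasNaN) v.isNaN || values.contains(v)
>   else values.contains(v) // no per-row NaN check needed
> {code}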



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36876) Support Dynamic Partition pruning for HiveTableScanExec

2022-01-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-36876:
--
Affects Version/s: 3.3.0
   (was: 3.2.0)
   (was: 3.1.2)
   (was: 3.0.3)

> Support Dynamic Partition pruning for HiveTableScanExec
> ---
>
> Key: SPARK-36876
> URL: https://issues.apache.org/jira/browse/SPARK-36876
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.3.0
>
>
> Support dynamic partition pruning for hive serde scan



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36978) InferConstraints rule should create IsNotNull constraints on the nested field instead of the root nested type

2022-01-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-36978:
--
Affects Version/s: 3.3.0
   (was: 3.0.0)
   (was: 3.1.0)
   (was: 3.2.0)

> InferConstraints rule should create IsNotNull constraints on the nested field 
> instead of the root nested type 
> --
>
> Key: SPARK-36978
> URL: https://issues.apache.org/jira/browse/SPARK-36978
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Utkarsh Agarwal
>Assignee: Utkarsh Agarwal
>Priority: Major
> Fix For: 3.3.0
>
>
> [InferFiltersFromConstraints|https://github.com/apache/spark/blob/05c0fa573881b49d8ead9a5e16071190e5841e1b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1206]
>  optimization rule generates {{IsNotNull}} constraints corresponding to null 
> intolerant predicates. The {{IsNotNull}} constraints are generated on the 
> attribute inside the corresponding predicate. 
>  e.g. A predicate {{a > 0}} on an integer column {{a}} will result in a 
> constraint {{IsNotNull(a)}}. On the other hand a predicate on a nested int 
> column {{structCol.b}} where {{structCol}} is a struct column results in a 
> constraint {{IsNotNull(structCol)}}.
> This generation of constraints on the root-level nested type is extremely 
> conservative, as it could lead to materialization of the entire struct. 
> The constraint should instead be generated on the nested field being 
> referenced by the predicate. In the above example, the constraint should be 
> {{IsNotNull(structCol.b)}} instead of {{IsNotNull(structCol)}}
>  
> The new constraints also create opportunities for nested pruning. Currently 
> {{IsNotNull(structCol)}} constraint would preclude pruning of {{structCol}}. 
> However the constraint {{IsNotNull(structCol.b)}} could create opportunities 
> to prune {{structCol}}.
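> A hedged illustration, assuming a DataFrame {{df}} with a column 
> {{structCol}} of type struct<b:int> and a SparkSession named {{spark}}:
> {code:java}
> import spark.implicits._
> 
> df.filter($"structCol.b" > 0).explain(true)
> // Current:  inferred constraint isnotnull(structCol)
> // Proposed: isnotnull(structCol.b), which keeps the rest of structCol
> //           prunable by nested schema pruning.
> {code}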



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36894) RDD.toDF should be synchronized with dispatched variants of SparkSession.createDataFrame

2022-01-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-36894:
--
Affects Version/s: (was: 3.2.0)
   (was: 3.1.2)

> RDD.toDF should be synchronized with dispatched variants of 
> SparkSession.createDataFrame
> 
>
> Key: SPARK-36894
> URL: https://issues.apache.org/jira/browse/SPARK-36894
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
> Fix For: 3.3.0
>
>
> There are some variants that are supported:
>  * Providing a schema as a {{str}} object for {{RDD[RowLike]}} objects
>  * Providing a schema as a {{Tuple[str, ...]}} names
>  * Calling {{toDF}} on {{RDD}} of atomic values, when schema of {{str}} or 
> {{AtomicType}} is provided.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37044) Add Row to __all__ in pyspark.sql.types

2022-01-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37044:
--
Affects Version/s: (was: 3.1.0)
   (was: 3.2.0)

> Add Row to __all__ in pyspark.sql.types
> ---
>
> Key: SPARK-37044
> URL: https://issues.apache.org/jira/browse/SPARK-37044
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
> Fix For: 3.3.0
>
>
> Currently {{Row}}, defined in {{pyspark.sql.types}}, is exported from 
> {{pyspark.sql}} but not from {{pyspark.sql.types}} itself. This means that 
> {{from pyspark.sql.types import *}} won't import {{Row}}.
> That might be counter-intuitive, especially since we import {{Row}} from 
> {{types}} in {{examples}}.
> Should we add it to {{__all__}}?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37104) RDD and DStream should be covariant

2022-01-06 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17470157#comment-17470157
 ] 

Dongjoon Hyun commented on SPARK-37104:
---

I removed `3.1.0` and `3.2.0` from `Affects Versions` because a new feature 
cannot affect already-released versions.

> RDD and DStream should be covariant
> ---
>
> Key: SPARK-37104
> URL: https://issues.apache.org/jira/browse/SPARK-37104
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.3.0
>
>
> At the moment {{RDD}} and {{DStream}} are defined as invariant.
>  
> However, they are immutable and could be marked as covariant.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37176) JsonSource's infer should have the same exception handle logic as JacksonParser's parse logic

2022-01-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37176:
--
Affects Version/s: 3.3.0
   (was: 3.2.0)
   (was: 3.1.2)
   (was: 3.0.3)

> JsonSource's infer should have the same exception handle logic as 
> JacksonParser's parse logic
> -
>
> Key: SPARK-37176
> URL: https://issues.apache.org/jira/browse/SPARK-37176
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xianjin YE
>Assignee: Xianjin YE
>Priority: Minor
> Fix For: 3.3.0
>
>
> JacksonParser's exception-handling logic differs from the logic in 
> org.apache.spark.sql.catalyst.json.JsonInferSchema#infer; the difference 
> can be seen below:
> {code:java}
> // code JacksonParser's parse
> try {
>   Utils.tryWithResource(createParser(factory, record)) { parser =>
> // a null first token is equivalent to testing for input.trim.isEmpty
> // but it works on any token stream and not just strings
> parser.nextToken() match {
>   case null => None
>   case _ => rootConverter.apply(parser) match {
> case null => throw 
> QueryExecutionErrors.rootConverterReturnNullError()
> case rows => rows.toSeq
>   }
> }
>   }
> } catch {
>   case e: SparkUpgradeException => throw e
>   case e @ (_: RuntimeException | _: JsonProcessingException | _: 
> MalformedInputException) =>
> // JSON parser currently doesn't support partial results for 
> corrupted records.
> // For such records, all fields other than the field configured by
> // `columnNameOfCorruptRecord` are set to `null`.
> throw BadRecordException(() => recordLiteral(record), () => None, e)
>   case e: CharConversionException if options.encoding.isEmpty =>
> val msg =
>   """JSON parser cannot handle a character in its input.
> |Specifying encoding as an input option explicitly might help to 
> resolve the issue.
> |""".stripMargin + e.getMessage
> val wrappedCharException = new CharConversionException(msg)
> wrappedCharException.initCause(e)
> throw BadRecordException(() => recordLiteral(record), () => None, 
> wrappedCharException)
>   case PartialResultException(row, cause) =>
> throw BadRecordException(
>   record = () => recordLiteral(record),
>   partialResult = () => Some(row),
>   cause)
> }
> {code}
> v.s. 
> {code:java}
> // JsonInferSchema's infer logic
> val mergedTypesFromPartitions = json.mapPartitions { iter =>
>   val factory = options.buildJsonFactory()
>   iter.flatMap { row =>
> try {
>   Utils.tryWithResource(createParser(factory, row)) { parser =>
> parser.nextToken()
> Some(inferField(parser))
>   }
> } catch {
>   case  e @ (_: RuntimeException | _: JsonProcessingException) => 
> parseMode match {
> case PermissiveMode =>
>   Some(StructType(Seq(StructField(columnNameOfCorruptRecord, 
> StringType
> case DropMalformedMode =>
>   None
> case FailFastMode =>
>   throw 
> QueryExecutionErrors.malformedRecordsDetectedInSchemaInferenceError(e)
>   }
> }
>   }.reduceOption(typeMerger).toIterator
> }
> {code}
> They should have the same exception-handling logic; otherwise the 
> inconsistency may confuse users.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37104) RDD and DStream should be covariant

2022-01-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37104:
--
Affects Version/s: (was: 3.1.0)
   (was: 3.2.0)

> RDD and DStream should be covariant
> ---
>
> Key: SPARK-37104
> URL: https://issues.apache.org/jira/browse/SPARK-37104
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.3.0
>
>
> At the moment {{RDD}} and {{DStream}} are defined as invariant.
>  
> However, they are immutable and could be marked as covariant.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37462) Avoid unnecessary calculating the number of outstanding fetch requests and RPCS

2022-01-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37462:
--
Affects Version/s: 3.3.0
   (was: 3.1.0)
   (was: 3.2.0)

>  Avoid unnecessary calculating the number of outstanding fetch requests and 
> RPCS
> 
>
> Key: SPARK-37462
> URL: https://issues.apache.org/jira/browse/SPARK-37462
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.3.0
>Reporter: weixiuli
>Assignee: weixiuli
>Priority: Trivial
> Fix For: 3.3.0
>
>
> It is unnecessary to calculate the number of outstanding fetch requests and 
> RPCs when the IdleStateEvent is not IDLE or the last request has not timed 
> out.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37578) DSV2 is not updating Output Metrics

2022-01-06 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17470155#comment-17470155
 ] 

Dongjoon Hyun commented on SPARK-37578:
---

I updated `Affects Versions` to `3.3.0` because a new feature cannot affect 
previously released versions.

> DSV2 is not updating Output Metrics
> ---
>
> Key: SPARK-37578
> URL: https://issues.apache.org/jira/browse/SPARK-37578
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Sandeep Katta
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.3.0
>
>
> Repro code
> ./bin/spark-shell --master local  --jars 
> /Users/jars/iceberg-spark3-runtime-0.12.1.jar
>  
> {code:java}
> import scala.collection.mutable
> import org.apache.spark.scheduler._
> 
> val bytesWritten = new mutable.ArrayBuffer[Long]()
> val recordsWritten = new mutable.ArrayBuffer[Long]()
> val bytesWrittenListener = new SparkListener() {
>   override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
>     bytesWritten += taskEnd.taskMetrics.outputMetrics.bytesWritten
>     recordsWritten += taskEnd.taskMetrics.outputMetrics.recordsWritten
>   }
> }
> spark.sparkContext.addSparkListener(bytesWrittenListener)
> try {
>   val df = spark.range(1000).toDF("id")
>   df.write.format("iceberg").save("Users/data/dsv2_test")
> 
>   assert(bytesWritten.sum > 0)
>   assert(recordsWritten.sum > 0)
> } finally {
>   spark.sparkContext.removeSparkListener(bytesWrittenListener)
> } {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


