[jira] [Created] (SPARK-37285) Add Weight of Evidence and Information value to ml.feature

2021-11-10 Thread Simon Tao (Jira)
Simon Tao created SPARK-37285:
-

 Summary: Add Weight of Evidence and Information value to ml.feature
 Key: SPARK-37285
 URL: https://issues.apache.org/jira/browse/SPARK-37285
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 3.2.0
Reporter: Simon Tao


The weight of evidence (WOE) and information value (IV) provide a useful 
framework for exploratory analysis and variable screening for binary 
classifiers. They help in several ways, as listed below (a small computation 
sketch follows the list):

1. They help check the linear relationship of a feature with the dependent 
variable to be used in the model.

2. They are a good variable transformation method for both continuous and 
categorical features.

3. They are better than one-hot encoding, as this method of variable 
transformation does not increase the complexity of the model.

4. They detect linear and non-linear relationships.

5. They are useful in feature selection.

6. They are a good measure of the predictive power of a feature, and they also 
help point out suspicious features.
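
As a rough illustration of the definitions (a minimal PySpark sketch, not the 
proposed ml.feature API; the toy data, column names, and the lack of binning or 
smoothing are assumptions made here for brevity): for a categorical feature, 
the WOE of a category is ln(%events / %non-events), and the IV is the sum over 
categories of (%events - %non-events) * WOE.

{code}
# Hedged sketch: WOE/IV for one categorical feature against a binary label.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 1), ("a", 0), ("b", 1), ("b", 0), ("b", 1), ("c", 0), ("c", 1), ("c", 0)],
    ["feature", "label"])

total_pos = df.filter(F.col("label") == 1).count()   # events
total_neg = df.filter(F.col("label") == 0).count()   # non-events

stats = (df.groupBy("feature")
           .agg(F.sum("label").alias("pos"),
                F.sum(1 - F.col("label")).alias("neg"))
           .withColumn("pct_pos", F.col("pos") / total_pos)
           .withColumn("pct_neg", F.col("neg") / total_neg)
           # WOE = ln(%events / %non-events); empty bins would need smoothing in practice
           .withColumn("woe", F.log(F.col("pct_pos") / F.col("pct_neg")))
           .withColumn("iv_part", (F.col("pct_pos") - F.col("pct_neg")) * F.col("woe")))

stats.show()
print("IV:", stats.agg(F.sum("iv_part")).first()[0])
{code}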



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37274) When the value of this parameter is greater than the maximum value of the int type, an out-of-range error will be thrown. The documentation for this parameter should

2021-11-10 Thread hao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hao updated SPARK-37274:

Summary: When the value of this parameter is greater than the maximum value 
of the int type, an out-of-range error will be thrown. The documentation for 
this parameter should warn the user about this risk  (was: These 
parameters should be of type long, not int)

> When the value of this parameter is greater than the maximum value of the 
> int type, an out-of-range error will be thrown. The documentation for this 
> parameter should warn the user about this risk
> 
>
> Key: SPARK-37274
> URL: https://issues.apache.org/jira/browse/SPARK-37274
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: hao
>Priority: Major
>
> These parameters [spark.sql.orc.columnarReaderBatchSize], 
> [spark.sql.inMemoryColumnarStorage.batchSize], 
> [spark.sql.parquet.columnarReaderBatchSize] should be of type long, not of 
> type int. When the user sets a value greater than the maximum value of type 
> int, an error will be thrown.
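
As a rough illustration of the reported behavior (a hedged sketch, not from the 
ticket: these configs are registered as int configs, so a value above 
Int.MaxValue is rejected when the config value is validated; the exact 
exception type, message, and the point at which it is raised may vary by Spark 
version):

{code}
# Minimal PySpark sketch: set one of the batch-size configs past Int.MaxValue.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.range(10).write.mode("overwrite").parquet("/tmp/batch_size_demo")

try:
    # 2147483648 = Int.MaxValue + 1
    spark.conf.set("spark.sql.parquet.columnarReaderBatchSize", 2147483648)
    # The vectorized Parquet reader consumes this config during the scan.
    spark.read.parquet("/tmp/batch_size_demo").count()
except Exception as e:
    # Typically surfaces as a java.lang.IllegalArgumentException via Py4J.
    print(type(e).__name__, e)
{code}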



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37274) When the value of this parameter is greater than the maximum value of the int type, an out-of-range error will be thrown. The documentation for this parameter should

2021-11-10 Thread hao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hao updated SPARK-37274:

Description: When the value of this parameter is greater than the maximum 
value of the int type, an out-of-range error will be thrown. The documentation 
for this parameter should warn the user about this risk.  (was: 
These parameters [spark.sql.orc.columnarReaderBatchSize], 
[spark.sql.inMemoryColumnarStorage.batchSize], 
[spark.sql.parquet.columnarReaderBatchSize] should be of type long, not of type 
int. When the user sets a value greater than the maximum value of type 
int, an error will be thrown)

> When the value of this parameter is greater than the maximum value of the 
> int type, an out-of-range error will be thrown. The documentation for this 
> parameter should warn the user about this risk
> 
>
> Key: SPARK-37274
> URL: https://issues.apache.org/jira/browse/SPARK-37274
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: hao
>Priority: Major
>
> When the value of this parameter is greater than the maximum value of the 
> int type, an out-of-range error will be thrown. The documentation for this 
> parameter should warn the user about this risk.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37282) Add ExtendedLevelDBTest and disable LevelDB tests on Apple Silicon

2021-11-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-37282:
-

Assignee: Dongjoon Hyun

> Add ExtendedLevelDBTest and disable LevelDB tests on Apple Silicon
> --
>
> Key: SPARK-37282
> URL: https://issues.apache.org/jira/browse/SPARK-37282
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> Java 17 officially supports Apple Silicon.
> - JEP 391: macOS/AArch64 Port
> - https://bugs.openjdk.java.net/browse/JDK-8251280
> Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 support Apple Silicon 
> natively.
> {code}
> /Users/dongjoon/.jenv/versions/oracle17/bin/java: Mach-O 64-bit executable 
> arm64
> /Users/dongjoon/.jenv/versions/zulu17/bin/java: Mach-O 64-bit executable arm64
> /Users/dongjoon/.jenv/versions/temurin17/bin/java: Mach-O 64-bit executable 
> arm64
> {code}
> Since LevelDBJNI still doesn't support Apple Silicon natively, the test cases 
> fail on M1.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37282) Add ExtendedLevelDBTest and disable LevelDB tests on Apple Silicon

2021-11-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-37282.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34548
[https://github.com/apache/spark/pull/34548]

> Add ExtendedLevelDBTest and disable LevelDB tests on Apple Silicon
> --
>
> Key: SPARK-37282
> URL: https://issues.apache.org/jira/browse/SPARK-37282
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0
>
>
> Java 17 officially supports Apple Silicon.
> - JEP 391: macOS/AArch64 Port
> - https://bugs.openjdk.java.net/browse/JDK-8251280
> Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 support Apple Silicon 
> natively.
> {code}
> /Users/dongjoon/.jenv/versions/oracle17/bin/java: Mach-O 64-bit executable 
> arm64
> /Users/dongjoon/.jenv/versions/zulu17/bin/java: Mach-O 64-bit executable arm64
> /Users/dongjoon/.jenv/versions/temurin17/bin/java: Mach-O 64-bit executable 
> arm64
> {code}
> Since LevelDBJNI still doesn't support Apple Silicon natively, the test cases 
> fail on M1.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37284) Upgrade Jekyll to 4.2.1

2021-11-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37284:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Upgrade Jekyll to 4.2.1
> ---
>
> Key: SPARK-37284
> URL: https://issues.apache.org/jira/browse/SPARK-37284
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Minor
>
> Jekyll 4.2.1 was released in September, and it includes a fix for a 
> regression bug.
> https://github.com/jekyll/jekyll/releases/tag/v4.2.1



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37284) Upgrade Jekyll to 4.2.1

2021-11-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37284:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Upgrade Jekyll to 4.2.1
> ---
>
> Key: SPARK-37284
> URL: https://issues.apache.org/jira/browse/SPARK-37284
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> Jekyll 4.2.1 was released in September, and it includes a fix for a 
> regression bug.
> https://github.com/jekyll/jekyll/releases/tag/v4.2.1



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37284) Upgrade Jekyll to 4.2.1

2021-11-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17442114#comment-17442114
 ] 

Apache Spark commented on SPARK-37284:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/34552

> Upgrade Jekyll to 4.2.1
> ---
>
> Key: SPARK-37284
> URL: https://issues.apache.org/jira/browse/SPARK-37284
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> Jekyll 4.2.1 was released in September, and it includes a fix for a 
> regression bug.
> https://github.com/jekyll/jekyll/releases/tag/v4.2.1



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37284) Upgrade Jekyll to 4.2.1

2021-11-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17442113#comment-17442113
 ] 

Apache Spark commented on SPARK-37284:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/34552

> Upgrade Jekyll to 4.2.1
> ---
>
> Key: SPARK-37284
> URL: https://issues.apache.org/jira/browse/SPARK-37284
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> Jekyll 4.2.1 was released in September, and it includes a fix for a 
> regression bug.
> https://github.com/jekyll/jekyll/releases/tag/v4.2.1



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37263) Add PandasAPIOnSparkAdviceWarning class

2021-11-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37263:


Assignee: Apache Spark

> Add PandasAPIOnSparkAdviceWarning class
> ---
>
> Key: SPARK-37263
> URL: https://issues.apache.org/jira/browse/SPARK-37263
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> Raised from comment 
> [https://github.com/apache/spark/pull/34389#discussion_r741733023].
> The advice warning for pandas API on Spark for expensive APIs 
> (https://github.com/apache/spark/pull/34389#discussion_r741733023) 
> is now issuing too many warning messages, so it might be good to have a 
> pandas-on-Spark-specific warning class so that users can turn it off manually 
> using warnings.simplefilter.
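
A minimal sketch of the intended usage pattern, assuming such a dedicated 
warning class is added (the class body, the helper function, and the module 
placement below are illustrative assumptions, not the final API):

{code}
# Hedged sketch: a pandas-on-Spark-specific warning category that users can silence.
import warnings


class PandasAPIOnSparkAdviceWarning(Warning):
    """Advice about potentially expensive pandas-on-Spark operations."""


def expensive_api_with_advice(value):
    # Hypothetical stand-in for an expensive pandas-on-Spark API call.
    warnings.warn(
        "This operation collects all data into the driver's memory.",
        PandasAPIOnSparkAdviceWarning,
    )
    return value


# Users can silence just this category instead of all warnings:
warnings.simplefilter("ignore", PandasAPIOnSparkAdviceWarning)
expensive_api_with_advice(42)  # no advice message is emitted
{code}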



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37284) Upgrade Jekyll to 4.2.1

2021-11-10 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-37284:
--

 Summary: Upgrade Jekyll to 4.2.1
 Key: SPARK-37284
 URL: https://issues.apache.org/jira/browse/SPARK-37284
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.3.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


Jekyll 4.2.1 was released in September, and it includes a fix for a regression 
bug.
https://github.com/jekyll/jekyll/releases/tag/v4.2.1



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37263) Add PandasAPIOnSparkAdviceWarning class

2021-11-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17442112#comment-17442112
 ] 

Apache Spark commented on SPARK-37263:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/34550

> Add PandasAPIOnSparkAdviceWarning class
> ---
>
> Key: SPARK-37263
> URL: https://issues.apache.org/jira/browse/SPARK-37263
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Raised from comment 
> [https://github.com/apache/spark/pull/34389#discussion_r741733023].
> The advice warning for pandas API on Spark for expensive APIs 
> (https://github.com/apache/spark/pull/34389#discussion_r741733023) 
> is now issuing too many warning messages, so it might be good to have a 
> pandas-on-Spark-specific warning class so that users can turn it off manually 
> using warnings.simplefilter.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37263) Add PandasAPIOnSparkAdviceWarning class

2021-11-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37263:


Assignee: (was: Apache Spark)

> Add PandasAPIOnSparkAdviceWarning class
> ---
>
> Key: SPARK-37263
> URL: https://issues.apache.org/jira/browse/SPARK-37263
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Raised from comment 
> [https://github.com/apache/spark/pull/34389#discussion_r741733023].
> The advice warning for pandas API on Spark for expensive APIs 
> (https://github.com/apache/spark/pull/34389#discussion_r741733023) 
> is now issuing too many warning messages, so it might be good to have a 
> pandas-on-Spark-specific warning class so that users can turn it off manually 
> using warnings.simplefilter.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37283) Don't try to store a V1 table which contains ANSI intervals in Hive compatible format

2021-11-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17442111#comment-17442111
 ] 

Apache Spark commented on SPARK-37283:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/34551

> Don't try to store a V1 table which contains ANSI intervals in Hive 
> compatible format
> -
>
> Key: SPARK-37283
> URL: https://issues.apache.org/jira/browse/SPARK-37283
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> If a table being created contains a column of ANSI interval types and the 
> underlying file format has a corresponding Hive SerDe (e.g. Parquet),
> `HiveExternalCatalog` tries to store the table in a Hive-compatible format.
> But, as the ANSI interval types in Spark and the interval types in Hive are 
> not compatible (Hive only supports interval_year_month and interval_day_time), 
> the following warning with a stack trace will be logged.
> {code}
> spark-sql> CREATE TABLE tbl1(a INTERVAL YEAR TO MONTH) USING Parquet;
> 21/11/11 14:39:29 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> 21/11/11 14:39:29 WARN HiveExternalCatalog: Could not persist 
> `default`.`tbl1` in a Hive compatible way. Persisting it into Hive metastore 
> in Spark SQL specific format.
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.IllegalArgumentException: Error: type expected at the position 0 of 
> 'interval year to month' but 'interval year to month' is found.
>   at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:869)
>   at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:874)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createTable$1(HiveClientImpl.scala:553)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:303)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:234)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:233)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:283)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:551)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.saveTableIntoHive(HiveExternalCatalog.scala:499)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.createDataSourceTable(HiveExternalCatalog.scala:397)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:274)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:102)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:245)
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:376)
>   at 
> org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:120)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:97)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97)
>   at 
> 

[jira] [Assigned] (SPARK-37283) Don't try to store a V1 table which contains ANSI intervals in Hive compatible format

2021-11-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37283:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Don't try to store a V1 table which contains ANSI intervals in Hive 
> compatible format
> -
>
> Key: SPARK-37283
> URL: https://issues.apache.org/jira/browse/SPARK-37283
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Major
>
> If a table being created contains a column of ANSI interval types and the 
> underlying file format has a corresponding Hive SerDe (e.g. Parquet),
> `HiveExternalCatalog` tries to store the table in a Hive-compatible format.
> But, as the ANSI interval types in Spark and the interval types in Hive are 
> not compatible (Hive only supports interval_year_month and interval_day_time), 
> the following warning with a stack trace will be logged.
> {code}
> spark-sql> CREATE TABLE tbl1(a INTERVAL YEAR TO MONTH) USING Parquet;
> 21/11/11 14:39:29 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> 21/11/11 14:39:29 WARN HiveExternalCatalog: Could not persist 
> `default`.`tbl1` in a Hive compatible way. Persisting it into Hive metastore 
> in Spark SQL specific format.
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.IllegalArgumentException: Error: type expected at the position 0 of 
> 'interval year to month' but 'interval year to month' is found.
>   at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:869)
>   at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:874)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createTable$1(HiveClientImpl.scala:553)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:303)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:234)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:233)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:283)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:551)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.saveTableIntoHive(HiveExternalCatalog.scala:499)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.createDataSourceTable(HiveExternalCatalog.scala:397)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:274)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:102)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:245)
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:376)
>   at 
> org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:120)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:97)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:93)
>   at 
> 

[jira] [Assigned] (SPARK-37283) Don't try to store a V1 table which contains ANSI intervals in Hive compatible format

2021-11-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37283:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Don't try to store a V1 table which contains ANSI intervals in Hive 
> compatible format
> -
>
> Key: SPARK-37283
> URL: https://issues.apache.org/jira/browse/SPARK-37283
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> If a table being created contains a column of ANSI interval types and the 
> underlying file format has a corresponding Hive SerDe (e.g. Parquet),
> `HiveExternalCatalog` tries to store the table in a Hive-compatible format.
> But, as the ANSI interval types in Spark and the interval types in Hive are 
> not compatible (Hive only supports interval_year_month and interval_day_time), 
> the following warning with a stack trace will be logged.
> {code}
> spark-sql> CREATE TABLE tbl1(a INTERVAL YEAR TO MONTH) USING Parquet;
> 21/11/11 14:39:29 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> 21/11/11 14:39:29 WARN HiveExternalCatalog: Could not persist 
> `default`.`tbl1` in a Hive compatible way. Persisting it into Hive metastore 
> in Spark SQL specific format.
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.IllegalArgumentException: Error: type expected at the position 0 of 
> 'interval year to month' but 'interval year to month' is found.
>   at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:869)
>   at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:874)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createTable$1(HiveClientImpl.scala:553)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:303)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:234)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:233)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:283)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:551)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.saveTableIntoHive(HiveExternalCatalog.scala:499)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.createDataSourceTable(HiveExternalCatalog.scala:397)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:274)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:102)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:245)
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:376)
>   at 
> org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:120)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:97)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:93)
>   at 
> 

[jira] [Updated] (SPARK-37263) Add PandasAPIOnSparkAdviceWarning class

2021-11-10 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-37263:

Description: 
Raised from comment 
[https://github.com/apache/spark/pull/34389#discussion_r741733023].

The advice warning for pandas API on Spark for expensive APIs 
(https://github.com/apache/spark/pull/34389#discussion_r741733023) 
is now issuing too many warning messages, so it might be good to have a 
pandas-on-Spark-specific warning class so that users can turn it off manually 
using warnings.simplefilter.

  was:
Raised from comment 
[https://github.com/apache/spark/pull/34389#discussion_r741733023].

The advice warning for pandas API on Spark for expensive APIs 
(https://github.com/apache/spark/pull/34389#discussion_r741733023) 
is now issuing too many warning messages, so it might be good to have an 
option to turn this message on/off.


> Add PandasAPIOnSparkAdviceWarning class
> ---
>
> Key: SPARK-37263
> URL: https://issues.apache.org/jira/browse/SPARK-37263
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Raised from comment 
> [https://github.com/apache/spark/pull/34389#discussion_r741733023].
> The advice warning for pandas API on Spark for expensive APIs 
> (https://github.com/apache/spark/pull/34389#discussion_r741733023) 
> is now issuing too many warning messages, so it might be good to have a 
> pandas-on-Spark-specific warning class so that users can turn it off manually 
> using warnings.simplefilter.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37263) Add PandasAPIOnSparkAdviceWarning class

2021-11-10 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-37263:

Summary: Add PandasAPIOnSparkAdviceWarning class  (was: Add an option to 
silence advice for pandas API on Spark.)

> Add PandasAPIOnSparkAdviceWarning class
> ---
>
> Key: SPARK-37263
> URL: https://issues.apache.org/jira/browse/SPARK-37263
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Raised from comment 
> [https://github.com/apache/spark/pull/34389#discussion_r741733023].
> The advice warning for pandas API on Spark for expensive APIs 
> (https://github.com/apache/spark/pull/34389#discussion_r741733023) 
> is now issuing too many warning messages, so it might be good to have an 
> option to turn this message on/off.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37283) Don't try to store a V1 table which contains ANSI intervals in Hive compatible format

2021-11-10 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-37283:
---
Description: 
If a table being created contains a column of ANSI interval types and the 
underlying file format has a corresponding Hive SerDe (e.g. Parquet),
`HiveExternalCatalog` tries to store the table in a Hive-compatible format.
But, as the ANSI interval types in Spark and the interval types in Hive are not 
compatible (Hive only supports interval_year_month and interval_day_time), the 
following warning with a stack trace will be logged.

{code}
spark-sql> CREATE TABLE tbl1(a INTERVAL YEAR TO MONTH) USING Parquet;
21/11/11 14:39:29 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
since hive.security.authorization.manager is set to instance of 
HiveAuthorizerFactory.
21/11/11 14:39:29 WARN HiveExternalCatalog: Could not persist `default`.`tbl1` 
in a Hive compatible way. Persisting it into Hive metastore in Spark SQL 
specific format.
org.apache.hadoop.hive.ql.metadata.HiveException: 
java.lang.IllegalArgumentException: Error: type expected at the position 0 of 
'interval year to month' but 'interval year to month' is found.
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:869)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:874)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createTable$1(HiveClientImpl.scala:553)
at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:303)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:234)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:233)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:283)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:551)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.saveTableIntoHive(HiveExternalCatalog.scala:499)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.createDataSourceTable(HiveExternalCatalog.scala:397)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:274)
at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:102)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:245)
at 
org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:376)
at 
org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:120)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:97)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:93)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
at 

[jira] [Created] (SPARK-37283) Don't try to store a V1 table which contains ANSI intervals in Hive compatible format

2021-11-10 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-37283:
--

 Summary: Don't try to store a V1 table which contains ANSI 
intervals in Hive compatible format
 Key: SPARK-37283
 URL: https://issues.apache.org/jira/browse/SPARK-37283
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


If a table being created contains a column of ANSI interval types and the 
underlying file format has a corresponding Hive SerDe (e.g. Parquet),
`HiveExternalCatalog` tries to store the table in a Hive-compatible format.
But, as the ANSI interval types in Spark and the interval types in Hive are not 
compatible (Hive only supports interval_year_month and interval_day_time), the 
following warning with a stack trace will be logged.

{code}
spark-sql> CREATE TABLE tbl1(a INTERVAL YEAR TO MONTH) USING Parquet;
21/11/11 14:39:29 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
since hive.security.authorization.manager is set to instance of 
HiveAuthorizerFactory.
21/11/11 14:39:29 WARN HiveExternalCatalog: Could not persist `default`.`tbl1` 
in a Hive compatible way. Persisting it into Hive metastore in Spark SQL 
specific format.
org.apache.hadoop.hive.ql.metadata.HiveException: 
java.lang.IllegalArgumentException: Error: type expected at the position 0 of 
'interval year to month' but 'interval year to month' is found.
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:869)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:874)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createTable$1(HiveClientImpl.scala:553)
at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:303)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:234)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:233)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:283)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:551)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.saveTableIntoHive(HiveExternalCatalog.scala:499)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.createDataSourceTable(HiveExternalCatalog.scala:397)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:274)
at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:102)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:245)
at 
org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:376)
at 
org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:120)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:97)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:93)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
at 

[jira] [Assigned] (SPARK-37274) These parameters should be of type long, not int

2021-11-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37274:


Assignee: Apache Spark

> These parameters should be of type long, not int
> 
>
> Key: SPARK-37274
> URL: https://issues.apache.org/jira/browse/SPARK-37274
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: hao
>Assignee: Apache Spark
>Priority: Major
>
> These parameters [spark.sql.orc.columnarReaderBatchSize], 
> [spark.sql.inMemoryColumnarStorage.batchSize], 
> [spark.sql.parquet.columnarReaderBatchSize] should be of type long, not of 
> type int. When the user sets a value greater than the maximum value of type 
> int, an error will be thrown.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37274) These parameters should be of type long, not int

2021-11-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17442102#comment-17442102
 ] 

Apache Spark commented on SPARK-37274:
--

User 'dh20' has created a pull request for this issue:
https://github.com/apache/spark/pull/34549

> These parameters should be of type long, not int
> 
>
> Key: SPARK-37274
> URL: https://issues.apache.org/jira/browse/SPARK-37274
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: hao
>Priority: Major
>
> These parameters [spark.sql.orc.columnarReaderBatchSize], 
> [spark.sql.inMemoryColumnarStorage.batchSize], 
> [spark.sql.parquet.columnarReaderBatchSize] should be of type long, not of 
> type int. When the user sets a value greater than the maximum value of type 
> int, an error will be thrown.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37274) These parameters should be of type long, not int

2021-11-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37274:


Assignee: (was: Apache Spark)

> These parameters should be of type long, not int
> 
>
> Key: SPARK-37274
> URL: https://issues.apache.org/jira/browse/SPARK-37274
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: hao
>Priority: Major
>
> These parameters [spark.sql.orc.columnarReaderBatchSize], 
> [spark.sql.inMemoryColumnarStorage.batchSize], 
> [spark.sql.parquet.columnarReaderBatchSize] should be of type long, not of 
> type int. When the user sets a value greater than the maximum value of type 
> int, an error will be thrown.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37263) Add an option to silence advice for pandas API on Spark.

2021-11-10 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-37263:

Description: 
Raised from comment 
[https://github.com/apache/spark/pull/34389#discussion_r741733023].

The advice warning for pandas API on Spark for expensive APIs 
(https://github.com/apache/spark/pull/34389#discussion_r741733023) 
is now issuing too many warning messages, so it might be good to have an 
option to turn this message on/off.

  was:
Raised from comment 
https://github.com/apache/spark/pull/34389#discussion_r741733023.

The advice warning for pandas API on Spark for expensive APIs 
(https://github.com/apache/spark/pull/34389#discussion_r741733023) 
is now issuing too many warning messages, since it also issues the warning 
when the APIs are used internally.


> Add an option to silence advice for pandas API on Spark.
> 
>
> Key: SPARK-37263
> URL: https://issues.apache.org/jira/browse/SPARK-37263
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Raised from comment 
> [https://github.com/apache/spark/pull/34389#discussion_r741733023].
> The advice warning for pandas API on Spark for expensive APIs 
> (https://github.com/apache/spark/pull/34389#discussion_r741733023) 
> is now issuing too many warning messages, so it might be good to have an 
> option to turn this message on/off.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37263) Add an option to silence advice for pandas API on Spark.

2021-11-10 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-37263:

Summary: Add an option to silence advice for pandas API on Spark.  (was: 
Create an option to silence advice for pandas API on Spark.)

> Add an option to silence advice for pandas API on Spark.
> 
>
> Key: SPARK-37263
> URL: https://issues.apache.org/jira/browse/SPARK-37263
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Raised from comment 
> https://github.com/apache/spark/pull/34389#discussion_r741733023.
> The advice warning for pandas API on Spark for expensive APIs 
> (https://github.com/apache/spark/pull/34389#discussion_r741733023) 
> is now issuing too many warning messages, since it also issues the warning 
> when the APIs are used internally.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37263) Create an option to silence advice for pandas API on Spark.

2021-11-10 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-37263:

Summary: Create an option to silence advice for pandas API on Spark.  (was: 
Reduce pandas-on-Spark warning for internal usage.)

> Create an option to silence advice for pandas API on Spark.
> ---
>
> Key: SPARK-37263
> URL: https://issues.apache.org/jira/browse/SPARK-37263
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Raised from comment 
> https://github.com/apache/spark/pull/34389#discussion_r741733023.
> The advice warning for pandas API on Spark for expensive APIs 
> (https://github.com/apache/spark/pull/34389#discussion_r741733023) 
> is now issuing too many warning messages, since it also issues the warning 
> when the APIs are used internally.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37276) Support YearMonthIntervalType in Arrow

2021-11-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37276:
-
Description: 
Implements support for YearMonthIntervalType in the Arrow code path:
- pandas UDFs
- pandas functions APIs
- createDataFrame/toPandas w/ Arrow

  was:
Implements support for YearMonthIntervalType in the Arrow code path:
- pandas UDFs
- pandas functions APIs


> Support YearMonthIntervalType in Arrow
> --
>
> Key: SPARK-37276
> URL: https://issues.apache.org/jira/browse/SPARK-37276
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Implements support for YearMonthIntervalType in the Arrow code path:
> - pandas UDFs
> - pandas functions APIs
> - createDataFrame/toPandas w/ Arrow



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37278) Support YearMonthIntervalType in createDataFrame/toPandas and Python UDFs

2021-11-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37278:
-
Description: 
Implements support for YearMonthIntervalType in:
- Python UDFs
- createDataFrame/toPandas without Arrow

  was:
Implements support for YearMonthIntervalType in:
- Python UDFs
- createDataFrame/toPandas


> Support YearMonthIntervalType in createDataFrame/toPandas and Python UDFs
> -
>
> Key: SPARK-37278
> URL: https://issues.apache.org/jira/browse/SPARK-37278
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Implements support for YearMonthIntervalType in:
> - Python UDFs
> - createDataFrame/toPandas without Arrow



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37278) Support YearMonthIntervalType in createDataFrame/toPandas and Python UDFs

2021-11-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37278:
-
Description: 
Implements support for YearMonthIntervalType in:
- Python UDFs
- createDataFrame/toPandas

  was:
Implements support for YearMonthIntervalType in:
- Python UDFs
- createDataFrame/toPandas when Arrow is disabled


> Support YearMonthIntervalType in createDataFrame/toPandas and Python UDFs
> -
>
> Key: SPARK-37278
> URL: https://issues.apache.org/jira/browse/SPARK-37278
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Implements support for YearMonthIntervalType in:
> - Python UDFs
> - createDataFrame/toPandas



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37276) Support YearMonthIntervalType in Arrow

2021-11-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37276:
-
Description: 
Implements support for YearMonthIntervalType in the Arrow code path:
- pandas UDFs
- pandas functions APIs

  was:
Implements support for YearMonthIntervalType in the Arrow code path:
- pandas UDFs
- pandas functions APIs
- createDataFrame/toPandas when Arrow is enabled


> Support YearMonthIntervalType in Arrow
> --
>
> Key: SPARK-37276
> URL: https://issues.apache.org/jira/browse/SPARK-37276
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Implements support for YearMonthIntervalType in the Arrow code path:
> - pandas UDFs
> - pandas functions APIs



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37282) Add ExtendedLevelDBTest and disable LevelDB tests on Apple Silicon

2021-11-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37282:


Assignee: (was: Apache Spark)

> Add ExtendedLevelDBTest and disable LevelDB tests on Apple Silicon
> --
>
> Key: SPARK-37282
> URL: https://issues.apache.org/jira/browse/SPARK-37282
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Java 17 officially supports Apple Silicon.
> - JEP 391: macOS/AArch64 Port
> - https://bugs.openjdk.java.net/browse/JDK-8251280
> Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 support Apple Silicon 
> natively.
> {code}
> /Users/dongjoon/.jenv/versions/oracle17/bin/java: Mach-O 64-bit executable 
> arm64
> /Users/dongjoon/.jenv/versions/zulu17/bin/java: Mach-O 64-bit executable arm64
> /Users/dongjoon/.jenv/versions/temurin17/bin/java: Mach-O 64-bit executable 
> arm64
> {code}
> Since LevelDBJNI still doesn't support Apple Silicon natively, the test cases 
> fail on M1.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37282) Add ExtendedLevelDBTest and disable LevelDB tests on Apple Silicon

2021-11-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37282:


Assignee: Apache Spark

> Add ExtendedLevelDBTest and disable LevelDB tests on Apple Silicon
> --
>
> Key: SPARK-37282
> URL: https://issues.apache.org/jira/browse/SPARK-37282
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> Java 17 officially supports Apple Silicon.
> - JEP 391: macOS/AArch64 Port
> - https://bugs.openjdk.java.net/browse/JDK-8251280
> Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 support Apple Silicon 
> natively.
> {code}
> /Users/dongjoon/.jenv/versions/oracle17/bin/java: Mach-O 64-bit executable 
> arm64
> /Users/dongjoon/.jenv/versions/zulu17/bin/java: Mach-O 64-bit executable arm64
> /Users/dongjoon/.jenv/versions/temurin17/bin/java: Mach-O 64-bit executable 
> arm64
> {code}
> Since LevelDBJNI still doesn't support Apple Silicon natively, the test cases 
> fail on M1.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37282) Add ExtendedLevelDBTest and disable LevelDB tests on Apple Silicon

2021-11-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17442084#comment-17442084
 ] 

Apache Spark commented on SPARK-37282:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/34548

> Add ExtendedLevelDBTest and disable LevelDB tests on Apple Silicon
> --
>
> Key: SPARK-37282
> URL: https://issues.apache.org/jira/browse/SPARK-37282
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Java 17 officially supports Apple Silicon.
> - JEP 391: macOS/AArch64 Port
> - https://bugs.openjdk.java.net/browse/JDK-8251280
> Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 support Apple Silicon 
> natively.
> {code}
> /Users/dongjoon/.jenv/versions/oracle17/bin/java: Mach-O 64-bit executable 
> arm64
> /Users/dongjoon/.jenv/versions/zulu17/bin/java: Mach-O 64-bit executable arm64
> /Users/dongjoon/.jenv/versions/temurin17/bin/java: Mach-O 64-bit executable 
> arm64
> {code}
> Since LevelDBJNI still doesn't support Apple Silicon natively, the test cases 
> fail on M1.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37282) Add ExtendedLevelDBTest and disable LevelDB tests on Apple Silicon

2021-11-10 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-37282:
-

 Summary: Add ExtendedLevelDBTest and disable LevelDB tests on 
Apple Silicon
 Key: SPARK-37282
 URL: https://issues.apache.org/jira/browse/SPARK-37282
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, Tests
Affects Versions: 3.3.0
Reporter: Dongjoon Hyun


Java 17 officially supports Apple Silicon.
- JEP 391: macOS/AArch64 Port
- https://bugs.openjdk.java.net/browse/JDK-8251280

Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 support Apple Silicon 
natively.
{code}
/Users/dongjoon/.jenv/versions/oracle17/bin/java: Mach-O 64-bit executable arm64
/Users/dongjoon/.jenv/versions/zulu17/bin/java: Mach-O 64-bit executable arm64
/Users/dongjoon/.jenv/versions/temurin17/bin/java: Mach-O 64-bit executable 
arm64
{code}

Since LevelDBJNI still doesn't support Apple Silicon natively, the test cases 
fail on M1.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36073) EquivalentExpressions fixes and improvements

2021-11-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-36073:
---

Assignee: Peter Toth

> EquivalentExpressions fixes and improvements
> 
>
> Key: SPARK-36073
> URL: https://issues.apache.org/jira/browse/SPARK-36073
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Major
>
> Currently `EquivalentExpressions` has 2 issues:
> - identifying common expressions in conditional expressions is not correct in 
> all cases
> - transparently canonicalized expressions (like `PromotePrecision`) are 
> considered common subexpressions
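
For illustration, a minimal sketch of the first problem class (the DataFrame and 
expression are made up): a subexpression that repeats inside one branch of a 
conditional should only be shared within that branch, not hoisted above the 
conditional.

{code:java}
// Assumes a spark-shell / SparkSession named `spark`. 1/x appears twice, but only
// inside the THEN branch, so common-subexpression elimination must keep its
// evaluation guarded by x > 0 rather than evaluating it unconditionally.
import spark.implicits._

val df = Seq(0, 2).toDF("x")
df.selectExpr("CASE WHEN x > 0 THEN 1/x + 1/x ELSE 0 END AS y").show()
{code}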



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36073) EquivalentExpressions fixes and improvements

2021-11-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-36073.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 33281
[https://github.com/apache/spark/pull/33281]

> EquivalentExpressions fixes and improvements
> 
>
> Key: SPARK-36073
> URL: https://issues.apache.org/jira/browse/SPARK-36073
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently `EquivalentExpressions` has 2 issues:
> - identifying common expressions in conditional expressions is not correct in 
> all cases
> - transparently canonicalized expressions (like `PromotePrecision`) are 
> considered common subexpressions



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36182) Support TimestampNTZ type in Parquet file source

2021-11-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-36182.
-
Resolution: Fixed

Issue resolved by pull request 34495
[https://github.com/apache/spark/pull/34495]

> Support TimestampNTZ type in Parquet file source
> 
>
> Key: SPARK-36182
> URL: https://issues.apache.org/jira/browse/SPARK-36182
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.3.0
>
>
> As per 
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp,
>  Parquet supports both TIMESTAMP_NTZ and TIMESTAMP_LTZ (Spark's current 
> default timestamp type):
> * A TIMESTAMP with isAdjustedToUTC=true => TIMESTAMP_LTZ
> * A TIMESTAMP with isAdjustedToUTC=false => TIMESTAMP_NTZ
> In Spark 3.1 or prior, the Parquet writer follows the definition and sets 
> the field `isAdjustedToUTC` to `true`, while the Parquet reader doesn't 
> respect the `isAdjustedToUTC` flag and converts any Parquet timestamp type to 
> TIMESTAMP_LTZ.
> Since 3.2, with the support of timestamp without time zone type:
> * Parquet writer follows the definition and sets the field `isAdjustedToUTC` 
> as `false` on writing TIMESTAMP_NTZ. 
> * Parquet reader 
> ** For schema inference, Spark converts the Parquet timestamp type to the 
> corresponding catalyst timestamp type according to the timestamp annotation 
> flag `isAdjustedToUTC`.
> ** If schema merging is enabled during schema inference and some of the files are 
> inferred as TIMESTAMP_NTZ while the others are TIMESTAMP_LTZ, the result type 
> is TIMESTAMP_LTZ, which is considered the "wider" type
> ** If a column of a user-provided schema is TIMESTAMP_LTZ and the column was 
> written as TIMESTAMP_NTZ type, Spark allows the read operation.
> ** If a column of a user-provided schema is TIMESTAMP_NTZ and the column was 
> written as TIMESTAMP_LTZ type, the read operation is not allowed since 
> TIMESTAMP_NTZ is considered narrower than TIMESTAMP_LTZ.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36799) Pass queryExecution name in CLI

2021-11-10 Thread dzcxzl (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dzcxzl updated SPARK-36799:
---
Summary: Pass queryExecution name in CLI  (was: Pass queryExecution name in 
CLI when only select query)

> Pass queryExecution name in CLI
> ---
>
> Key: SPARK-36799
> URL: https://issues.apache.org/jira/browse/SPARK-36799
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Trivial
> Fix For: 3.3.0
>
>
> Currently, in the spark-sql CLI, QueryExecutionListener receives commands but 
> not SELECT queries, because the queryExecution name is not passed.
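
For reference, the listener interface involved can be registered as below. The 
registration API (spark.listenerManager.register) and the QueryExecutionListener 
trait are real; the listener body is only an illustrative sketch.

{code:java}
// Minimal QueryExecutionListener sketch; funcName is the queryExecution name that
// this change makes the spark-sql CLI pass for SELECT queries as well as commands.
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

spark.listenerManager.register(new QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    println(s"query '$funcName' finished in ${durationNs / 1e6} ms")
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
    println(s"query '$funcName' failed: ${exception.getMessage}")
})
{code}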



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36799) Pass queryExecution name in CLI when only select query

2021-11-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-36799:
---

Assignee: dzcxzl

> Pass queryExecution name in CLI when only select query
> --
>
> Key: SPARK-36799
> URL: https://issues.apache.org/jira/browse/SPARK-36799
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Trivial
>
> Currently, in the spark-sql CLI, QueryExecutionListener receives commands but 
> not SELECT queries, because the queryExecution name is not passed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36799) Pass queryExecution name in CLI when only select query

2021-11-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-36799.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34041
[https://github.com/apache/spark/pull/34041]

> Pass queryExecution name in CLI when only select query
> --
>
> Key: SPARK-36799
> URL: https://issues.apache.org/jira/browse/SPARK-36799
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Trivial
> Fix For: 3.3.0
>
>
> Currently, in the spark-sql CLI, QueryExecutionListener receives commands but 
> not SELECT queries, because the queryExecution name is not passed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37270) Incorrect result of filter using isNull condition

2021-11-10 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17442074#comment-17442074
 ] 

Hyukjin Kwon commented on SPARK-37270:
--

Hm, I can't reproduce this locally. Are you able to reproduce this when running 
locally too? e.g.:

{code}
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
val frame = Seq((false, 1)).toDF("bool", "number")
frame
  .checkpoint()
  .withColumn("conditions", when(col("bool"), "I am not null"))
  .filter(col("conditions").isNull)
  .show(false)
{code}
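
For completeness, a minimal sketch of workaround 2 from the description (adding an 
explicit .otherwise(null) to the when expression), reusing the same example data 
and assuming the usual spark-shell imports:

{code}
// Same repro data as above; per the reporter's workaround list, the explicit
// .otherwise(null) avoids the incorrect empty result after checkpoint().
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
val frame = Seq((false, 1)).toDF("bool", "number")
frame
  .checkpoint()
  .withColumn("conditions", when(col("bool"), "I am not null").otherwise(null))
  .filter(col("conditions").isNull)
  .show(false)
{code}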

> Incorrect result of filter using isNull condition
> 
>
> Key: SPARK-37270
> URL: https://issues.apache.org/jira/browse/SPARK-37270
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Tomasz Kus
>Priority: Major
>  Labels: correctness
>
> Simple code that reproduces this issue:
> {code:java}
>  val frame = Seq((false, 1)).toDF("bool", "number")
> frame
>   .checkpoint()
>   .withColumn("conditions", when(col("bool"), "I am not null"))
>   .filter(col("conditions").isNull)
>   .show(false){code}
> Although "conditions" column is null
> {code:java}
>  +-----+------+----------+
> |bool |number|conditions|
> +-----+------+----------+
> |false|1     |null      |
> +-----+------+----------+{code}
> an empty result is shown.
> Execution plans:
> {code:java}
> == Parsed Logical Plan ==
> 'Filter isnull('conditions)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END 
> AS conditions#252]
>    +- LogicalRDD [bool#124, number#125], false
> == Analyzed Logical Plan ==
> bool: boolean, number: int, conditions: string
> Filter isnull(conditions#252)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END 
> AS conditions#252]
>    +- LogicalRDD [bool#124, number#125], false
> == Optimized Logical Plan ==
> LocalRelation , [bool#124, number#125, conditions#252]
> == Physical Plan ==
> LocalTableScan , [bool#124, number#125, conditions#252]
>  {code}
> After removing the checkpoint, the proper result is returned and the execution 
> plans are as follows:
> {code:java}
> == Parsed Logical Plan ==
> 'Filter isnull('conditions)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END 
> AS conditions#256]
>    +- Project [_1#119 AS bool#124, _2#120 AS number#125]
>       +- LocalRelation [_1#119, _2#120]
> == Analyzed Logical Plan ==
> bool: boolean, number: int, conditions: string
> Filter isnull(conditions#256)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END 
> AS conditions#256]
>    +- Project [_1#119 AS bool#124, _2#120 AS number#125]
>       +- LocalRelation [_1#119, _2#120]
> == Optimized Logical Plan ==
> LocalRelation [bool#124, number#125, conditions#256]
> == Physical Plan ==
> LocalTableScan [bool#124, number#125, conditions#256]
>  {code}
> It seems that the most important difference is LogicalRDD -> LocalRelation.
> The following workarounds retrieve the correct result:
> 1) remove checkpoint
> 2) add explicit .otherwise(null) to when
> 3) add checkpoint() or cache() just before filter
> 4) downgrade to Spark 3.1.2



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37278) Support YearMonthIntervalType in createDataFrame/toPandas and Python UDFs

2021-11-10 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17442068#comment-17442068
 ] 

Hyukjin Kwon commented on SPARK-37278:
--

I am working on this.

> Support YearMonthIntervalType in createDataFrame/toPandas and Python UDFs
> -
>
> Key: SPARK-37278
> URL: https://issues.apache.org/jira/browse/SPARK-37278
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Implements the support of YearMonthIntervalType in:
> - Python UDFs
> - createDataFrame/toPandas when Arrow is disabled



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37275) Support ANSI intervals in PySpark

2021-11-10 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17442066#comment-17442066
 ] 

Hyukjin Kwon commented on SPARK-37275:
--

cc [~maxgekk] FYI

> Support ANSI intervals in PySpark
> -
>
> Key: SPARK-37275
> URL: https://issues.apache.org/jira/browse/SPARK-37275
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This JIRA aims to implement ANSI interval types in PySpark:
> - 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DayTimeIntervalType.scala
> - 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/YearMonthIntervalType.scala



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37281) Support DayTimeIntervalType in Py4J

2021-11-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37281:
-
Description: This issue adds support for DayTimeIntervalType in Py4J. For 
example, functions.lit with a DayTimeIntervalType value should work.

> Support DayTimeIntervalType in Py4J
> ---
>
> Key: SPARK-37281
> URL: https://issues.apache.org/jira/browse/SPARK-37281
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This issue adds support for DayTimeIntervalType in Py4J. For example, 
> functions.lit with a DayTimeIntervalType value should work.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37280) Support YearMonthIntervalType in Py4J

2021-11-10 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-37280:


 Summary: Support YearMonthIntervalType in Py4J
 Key: SPARK-37280
 URL: https://issues.apache.org/jira/browse/SPARK-37280
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark, SQL
Affects Versions: 3.3.0
Reporter: Hyukjin Kwon


This issue adds support for YearMonthIntervalType in Py4J. For example, 
functions.lit with a YearMonthIntervalType value should work.
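
For reference, the Scala API already accepts such literals via java.time.Period (a 
sketch, assuming Spark 3.2+ and a spark-shell session); this issue targets the 
equivalent behaviour from Python through Py4J.

{code:java}
// Scala analogue (assumption: Spark 3.2+): a java.time.Period literal maps to a
// YearMonthIntervalType column; this issue aims for the same from PySpark.
import java.time.Period
import org.apache.spark.sql.functions.lit

val df = spark.range(1).select(lit(Period.ofMonths(14)).alias("ym"))
df.printSchema()  // ym is of type "interval year to month"
{code}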



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37281) Support DayTimeIntervalType in Py4J

2021-11-10 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-37281:


 Summary: Support DayTimeIntervalType in Py4J
 Key: SPARK-37281
 URL: https://issues.apache.org/jira/browse/SPARK-37281
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark, SQL
Affects Versions: 3.3.0
Reporter: Hyukjin Kwon






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37279) Support DayTimeIntervalType in createDataFrame/toPandas and Python UDFs

2021-11-10 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-37279:


 Summary: Support DayTimeIntervalType in createDataFrame/toPandas 
and Python UDFs
 Key: SPARK-37279
 URL: https://issues.apache.org/jira/browse/SPARK-37279
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark, SQL
Affects Versions: 3.3.0
Reporter: Hyukjin Kwon


Implements the support of DayTimeIntervalType in:
- Python UDFs
- createDataFrame/toPandas when Arrow is disabled



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37278) Support YearMonthIntervalType in createDataFrame/toPandas and Python UDFs

2021-11-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37278:
-
Description: 
Implements the support of YearMonthIntervalType in:
- Python UDFs
- createDataFrame/toPandas when Arrow is disabled

  was:
Implements the support of YearMonthIntervalType in Arrow code path:
- Python UDFs
- createDataFrame/toPandas when Arrow is disabled


> Support YearMonthIntervalType in createDataFrame/toPandas and Python UDFs
> -
>
> Key: SPARK-37278
> URL: https://issues.apache.org/jira/browse/SPARK-37278
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Implements the support of YearMonthIntervalType in:
> - Python UDFs
> - createDataFrame/toPandas when Arrow is disabled



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37278) Support YearMonthIntervalType in createDataFrame/toPandas and Python UDFs

2021-11-10 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-37278:


 Summary: Support YearMonthIntervalType in createDataFrame/toPandas 
and Python UDFs
 Key: SPARK-37278
 URL: https://issues.apache.org/jira/browse/SPARK-37278
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark, SQL
Affects Versions: 3.3.0
Reporter: Hyukjin Kwon


Implements the support of YearMonthIntervalType in Arrow code path:
- Python UDFs
- createDataFrame/toPandas when Arrow is disabled



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37277) Support DayTimeIntervalType in Arrow

2021-11-10 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-37277:


 Summary: Support DayTimeIntervalType in Arrow
 Key: SPARK-37277
 URL: https://issues.apache.org/jira/browse/SPARK-37277
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark, SQL
Affects Versions: 3.3.0
Reporter: Hyukjin Kwon


Implements the support of DayTimeIntervalType in Arrow code path:
- pandas UDFs
- pandas functions APIs
- createDataFrame/toPandas when Arrow is enabled



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37276) Support YearMonthIntervalType in Arrow

2021-11-10 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-37276:


 Summary: Support YearMonthIntervalType in Arrow
 Key: SPARK-37276
 URL: https://issues.apache.org/jira/browse/SPARK-37276
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark, SQL
Affects Versions: 3.3.0
Reporter: Hyukjin Kwon


Implements the support of YearMonthIntervalType in Arrow code path:
- pandas UDFs
- pandas functions APIs
- createDataFrame/toPandas when Arrow is enabled



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37275) Support ANSI intervals in PySpark

2021-11-10 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-37275:


 Summary: Support ANSI intervals in PySpark
 Key: SPARK-37275
 URL: https://issues.apache.org/jira/browse/SPARK-37275
 Project: Spark
  Issue Type: Umbrella
  Components: PySpark, SQL
Affects Versions: 3.3.0
Reporter: Hyukjin Kwon


This JIRA aims to implement ANSI interval types in PySpark:
- 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DayTimeIntervalType.scala
- 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/YearMonthIntervalType.scala



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37274) These parameters should be of type long, not int

2021-11-10 Thread hao (Jira)
hao created SPARK-37274:
---

 Summary: These parameters should be of type long, not int
 Key: SPARK-37274
 URL: https://issues.apache.org/jira/browse/SPARK-37274
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: hao


These parameters [spark.sql.orc.columnarReaderBatchSize], 
[spark.sql.inMemoryColumnarStorage.batchSize], 
[spark.sql.parquet.columnarReaderBatchSize] should be of type long, not of type 
int. When the user sets the value to be greater than the maximum value of type 
int, an error is thrown.
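
A rough illustration of the failure mode (a sketch; the exact error message depends 
on the Spark version, and the second value below is deliberately out of range):

{code:java}
// These batch-size configs are parsed as Int, so values above Int.MaxValue
// (2147483647) cannot be represented and are rejected when the config is set.
spark.conf.set("spark.sql.parquet.columnarReaderBatchSize", "4096")        // fine
spark.conf.set("spark.sql.parquet.columnarReaderBatchSize", "3000000000")  // rejected
{code}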



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37255) When Used with PyHive (by dropbox) query timeout doesn't result in propagation to the UI

2021-11-10 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17442058#comment-17442058
 ] 

Hyukjin Kwon commented on SPARK-37255:
--

That's very likely an issue in PyHive.

> When Used with PyHive (by dropbox) query timeout doesn't result in 
> propagation to the UI
> 
>
> Key: SPARK-37255
> URL: https://issues.apache.org/jira/browse/SPARK-37255
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: ramakrishna chilaka
>Priority: Major
>
> When a large query times out on the Spark Thrift Server and is cancelled, 
> PyHive doesn't show that the query was cancelled.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37255) When Used with PyHive (by dropbox) query timeout doesn't result in propagation to the UI

2021-11-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37255.
--
Resolution: Invalid

> When Used with PyHive (by dropbox) query timeout doesn't result in 
> propagation to the UI
> 
>
> Key: SPARK-37255
> URL: https://issues.apache.org/jira/browse/SPARK-37255
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: ramakrishna chilaka
>Priority: Major
>
> When a large query times out on the Spark Thrift Server and is cancelled, 
> PyHive doesn't show that the query was cancelled.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37273) Hidden File Metadata Support for Spark SQL

2021-11-10 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17442057#comment-17442057
 ] 

Hyukjin Kwon commented on SPARK-37273:
--

Don't we already have this in DSv2? e.g.) SPARK-31255

> Hidden File Metadata Support for Spark SQL
> --
>
> Key: SPARK-37273
> URL: https://issues.apache.org/jira/browse/SPARK-37273
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yaohua Zhao
>Priority: Major
>
> Provide a new interface in Spark SQL that allows users to query the metadata 
> of the input files for all file formats, exposing it as *built-in hidden 
> columns*, meaning *users can only see them when they explicitly reference 
> them* (e.g. file path, file name).
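
For comparison, the closest existing facility in the DataFrame API is the 
input_file_name() function; a sketch is below (the path is hypothetical). The 
proposal above would generalize this into hidden metadata columns that are only 
materialized when explicitly referenced.

{code:java}
// Existing per-file metadata access today (sketch; the path is made up).
import org.apache.spark.sql.functions.input_file_name

spark.read.parquet("/tmp/some_table")
  .withColumn("file_path", input_file_name())
  .show(false)
{code}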



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37273) Hidden File Metadata Support for Spark SQL

2021-11-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37273.
--
Resolution: Duplicate

> Hidden File Metadata Support for Spark SQL
> --
>
> Key: SPARK-37273
> URL: https://issues.apache.org/jira/browse/SPARK-37273
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yaohua Zhao
>Priority: Major
>
> Provide a new interface in Spark SQL that allows users to query the metadata 
> of the input files for all file formats, exposing it as *built-in hidden 
> columns*, meaning *users can only see them when they explicitly reference 
> them* (e.g. file path, file name).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37264) Exclude hadoop-client-api transitive dependency from orc-core

2021-11-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37264:
--
Summary: Exclude hadoop-client-api transitive dependency from orc-core  
(was: [SPARK-37264][BUILD] Exclude hadoop-client-api transitive dependency from 
orc-core)

> Exclude hadoop-client-api transitive dependency from orc-core
> -
>
> Key: SPARK-37264
> URL: https://issues.apache.org/jira/browse/SPARK-37264
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.3.0
>
>
> Like hadoop-common and hadoop-hdfs, this PR proposes to exclude the 
> hadoop-client-api transitive dependency from orc-core.
> Why are the changes needed?
> Since Apache Hadoop 2.7 doesn't work on Java 17, Apache ORC has a dependency 
> on Hadoop 3.3.1.
> This causes test-dependencies.sh failure on Java 17. As a result, 
> run-tests.py also fails.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37109) Install Java 17 on all of the Jenkins workers

2021-11-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37109:
--
Parent: (was: SPARK-33772)
Issue Type: Bug  (was: Sub-task)

> Install Java 17 on all of the Jenkins workers
> -
>
> Key: SPARK-37109
> URL: https://issues.apache.org/jira/browse/SPARK-37109
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36900) "SPARK-36464: size returns correct positive number even with over 2GB data" will oom with JDK17

2021-11-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-36900:
-

Assignee: Yang Jie

> "SPARK-36464: size returns correct positive number even with over 2GB data" 
> will oom with JDK17 
> 
>
> Key: SPARK-36900
> URL: https://issues.apache.org/jira/browse/SPARK-36900
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.3.0
>
>
> Execute
>  
> {code:java}
> build/mvn clean install  -pl core -am -Dtest=none 
> -DwildcardSuites=org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite
> {code}
> with JDK 17,
> {code:java}
> ChunkedByteBufferOutputStreamSuite:
> - empty output
> - write a single byte
> - write a single near boundary
> - write a single at boundary
> - single chunk output
> - single chunk output at boundary size
> - multiple chunk output
> - multiple chunk output at boundary size
> *** RUN ABORTED ***
>   java.lang.OutOfMemoryError: Java heap space
>   at java.base/java.lang.Integer.valueOf(Integer.java:1081)
>   at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:67)
>   at 
> org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87)
>   at 
> org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75)
>   at java.base/java.io.OutputStream.write(OutputStream.java:127)
>   at 
> org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite.$anonfun$new$22(ChunkedByteBufferOutputStreamSuite.scala:127)
>   at 
> org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite$$Lambda$179/0x0008011a75d8.apply(Unknown
>  Source)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-37109) Install Java 17 on all of the Jenkins workers

2021-11-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-37109.
-

> Install Java 17 on all of the Jenkins workers
> -
>
> Key: SPARK-37109
> URL: https://issues.apache.org/jira/browse/SPARK-37109
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37272) Add `ExtendedRocksDBTest` and disable RocksDB tests on Apple Silicon

2021-11-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37272:
--
Parent: SPARK-33772
Issue Type: Sub-task  (was: Improvement)

> Add `ExtendedRocksDBTest` and disable RocksDB tests on Apple Silicon
> 
>
> Key: SPARK-37272
> URL: https://issues.apache.org/jira/browse/SPARK-37272
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0
>
>
> Javava 17 officially support Apple Silicon
> - JEP 391: macOS/AArch64 Port
> - https://bugs.openjdk.java.net/browse/JDK-8251280
> Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 supports Apple Silicon 
> natively.
> {code}
> /Users/dongjoon/.jenv/versions/oracle17/bin/java: Mach-O 64-bit executable 
> arm64
> /Users/dongjoon/.jenv/versions/zulu17/bin/java: Mach-O 64-bit executable arm64
> /Users/dongjoon/.jenv/versions/temurin17/bin/java: Mach-O 64-bit executable 
> arm64
> {code}
> Since RocksDBJNI still doesn't support Apple Silicon natively, the following 
> failures occur on M1.
> {code}
> $ build/sbt "sql/testOnly *RocksDB* *.StreamingSessionWindowSuite"
> ...
> [info] Run completed in 23 seconds, 281 milliseconds.
> [info] Total number of tests run: 32
> [info] Suites: completed 2, aborted 2
> [info] Tests: succeeded 22, failed 10, canceled 0, ignored 0, pending 0
> [info] *** 2 SUITES ABORTED ***
> [info] *** 10 TESTS FAILED ***
> [error] Failed tests:
> [error]   org.apache.spark.sql.streaming.StreamingSessionWindowSuite
> [error]   
> org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreIntegrationSuite
> [error] Error during tests:
> [error]   org.apache.spark.sql.execution.streaming.state.RocksDBSuite
> [error]   
> org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreSuite
> [error] (sql / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful
> [error] Total time: 43 s, completed Nov 10, 2021 4:29:50 PM
> {code}
> This issue aims to add ExtendedRocksDBTest to disable RocksDB selectively on 
> Apple Silicon.
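
As a side note, another common pattern for platform-dependent suites is to cancel 
tests at runtime with ScalaTest's assume; a sketch is below. This is not 
necessarily what this issue does (it adds a dedicated test tag so the build can 
exclude the suites up front), and the class name and platform check are 
illustrative only.

{code:java}
// Runtime-skip sketch for platform-dependent tests; the tag-based approach in this
// issue instead lets the build exclude such tests before they run.
import org.scalatest.funsuite.AnyFunSuite

class RocksDBPlatformSketch extends AnyFunSuite {
  private def onAppleSilicon: Boolean =
    System.getProperty("os.name").toLowerCase.contains("mac") &&
      System.getProperty("os.arch") == "aarch64"

  test("RocksDB state store round trip") {
    assume(!onAppleSilicon, "RocksDBJNI has no native Apple Silicon build yet")
    // body of a test that loads the RocksDB JNI library
  }
}
{code}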



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37272) Add `ExtendedRocksDBTest` and disable RocksDB tests on Apple Silicon

2021-11-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37272:
--
Description: 
Java 17 officially support Apple Silicon

- JEP 391: macOS/AArch64 Port
- https://bugs.openjdk.java.net/browse/JDK-8251280

Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 supports Apple Silicon 
natively.
{code}
/Users/dongjoon/.jenv/versions/oracle17/bin/java: Mach-O 64-bit executable arm64
/Users/dongjoon/.jenv/versions/zulu17/bin/java: Mach-O 64-bit executable arm64
/Users/dongjoon/.jenv/versions/temurin17/bin/java: Mach-O 64-bit executable 
arm64
{code}

Since RocksDBJNI still doesn't support Apple Silicon natively, the following 
failures occur on M1.
{code}
$ build/sbt "sql/testOnly *RocksDB* *.StreamingSessionWindowSuite"
...
[info] Run completed in 23 seconds, 281 milliseconds.
[info] Total number of tests run: 32
[info] Suites: completed 2, aborted 2
[info] Tests: succeeded 22, failed 10, canceled 0, ignored 0, pending 0
[info] *** 2 SUITES ABORTED ***
[info] *** 10 TESTS FAILED ***
[error] Failed tests:
[error] org.apache.spark.sql.streaming.StreamingSessionWindowSuite
[error] 
org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreIntegrationSuite
[error] Error during tests:
[error] org.apache.spark.sql.execution.streaming.state.RocksDBSuite
[error] 
org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreSuite
[error] (sql / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 43 s, completed Nov 10, 2021 4:29:50 PM
{code}

This issue aims to add ExtendedRocksDBTest to disable RocksDB selectively on 
Apple Silicon.

  was:
Javava 17 officially support Apple Silicon

- JEP 391: macOS/AArch64 Port
- https://bugs.openjdk.java.net/browse/JDK-8251280

Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 supports Apple Silicon 
natively.
{code}
/Users/dongjoon/.jenv/versions/oracle17/bin/java: Mach-O 64-bit executable arm64
/Users/dongjoon/.jenv/versions/zulu17/bin/java: Mach-O 64-bit executable arm64
/Users/dongjoon/.jenv/versions/temurin17/bin/java: Mach-O 64-bit executable 
arm64
{code}

Since RocksDBJNI still doesn't support Apple Silicon natively, the following 
failures occur on M1.
{code}
$ build/sbt "sql/testOnly *RocksDB* *.StreamingSessionWindowSuite"
...
[info] Run completed in 23 seconds, 281 milliseconds.
[info] Total number of tests run: 32
[info] Suites: completed 2, aborted 2
[info] Tests: succeeded 22, failed 10, canceled 0, ignored 0, pending 0
[info] *** 2 SUITES ABORTED ***
[info] *** 10 TESTS FAILED ***
[error] Failed tests:
[error] org.apache.spark.sql.streaming.StreamingSessionWindowSuite
[error] 
org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreIntegrationSuite
[error] Error during tests:
[error] org.apache.spark.sql.execution.streaming.state.RocksDBSuite
[error] 
org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreSuite
[error] (sql / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 43 s, completed Nov 10, 2021 4:29:50 PM
{code}

This issue aims to add ExtendedRocksDBTest to disable RocksDB selectively on 
Apple Silicon.


> Add `ExtendedRocksDBTest` and disable RocksDB tests on Apple Silicon
> 
>
> Key: SPARK-37272
> URL: https://issues.apache.org/jira/browse/SPARK-37272
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0
>
>
> Java 17 officially support Apple Silicon
> - JEP 391: macOS/AArch64 Port
> - https://bugs.openjdk.java.net/browse/JDK-8251280
> Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 supports Apple Silicon 
> natively.
> {code}
> /Users/dongjoon/.jenv/versions/oracle17/bin/java: Mach-O 64-bit executable 
> arm64
> /Users/dongjoon/.jenv/versions/zulu17/bin/java: Mach-O 64-bit executable arm64
> /Users/dongjoon/.jenv/versions/temurin17/bin/java: Mach-O 64-bit executable 
> arm64
> {code}
> Since RocksDBJNI still doesn't support Apple Silicon natively, the following 
> failures occur on M1.
> {code}
> $ build/sbt "sql/testOnly *RocksDB* *.StreamingSessionWindowSuite"
> ...
> [info] Run completed in 23 seconds, 281 milliseconds.
> [info] Total number of tests run: 32
> [info] Suites: completed 2, aborted 2
> [info] Tests: succeeded 22, failed 10, canceled 0, ignored 0, pending 0
> [info] *** 2 SUITES ABORTED ***
> [info] *** 10 TESTS FAILED ***
> [error] Failed tests:
> [error]   org.apache.spark.sql.streaming.StreamingSessionWindowSuite
> [error]   
> org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreIntegrationSuite
> [error] Error during tests:
> 

[jira] [Updated] (SPARK-37272) Add `ExtendedRocksDBTest` and disable RocksDB tests on Apple Silicon

2021-11-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37272:
--
Description: 
Javava 17 officially support Apple Silicon

- JEP 391: macOS/AArch64 Port
- https://bugs.openjdk.java.net/browse/JDK-8251280

Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 supports Apple Silicon 
natively.
{code}
/Users/dongjoon/.jenv/versions/oracle17/bin/java: Mach-O 64-bit executable arm64
/Users/dongjoon/.jenv/versions/zulu17/bin/java: Mach-O 64-bit executable arm64
/Users/dongjoon/.jenv/versions/temurin17/bin/java: Mach-O 64-bit executable 
arm64
{code}

Since RocksDBJNI still doesn't support Apple Silicon natively, the following 
failures occur on M1.
{code}
$ build/sbt "sql/testOnly *RocksDB* *.StreamingSessionWindowSuite"
...
[info] Run completed in 23 seconds, 281 milliseconds.
[info] Total number of tests run: 32
[info] Suites: completed 2, aborted 2
[info] Tests: succeeded 22, failed 10, canceled 0, ignored 0, pending 0
[info] *** 2 SUITES ABORTED ***
[info] *** 10 TESTS FAILED ***
[error] Failed tests:
[error] org.apache.spark.sql.streaming.StreamingSessionWindowSuite
[error] 
org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreIntegrationSuite
[error] Error during tests:
[error] org.apache.spark.sql.execution.streaming.state.RocksDBSuite
[error] 
org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreSuite
[error] (sql / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 43 s, completed Nov 10, 2021 4:29:50 PM
{code}

This issue aims to add ExtendedRocksDBTest to disable RocksDB selectively on 
Apple Silicon.

> Add `ExtendedRocksDBTest` and disable RocksDB tests on Apple Silicon
> 
>
> Key: SPARK-37272
> URL: https://issues.apache.org/jira/browse/SPARK-37272
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0
>
>
> Javava 17 officially support Apple Silicon
> - JEP 391: macOS/AArch64 Port
> - https://bugs.openjdk.java.net/browse/JDK-8251280
> Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 supports Apple Silicon 
> natively.
> {code}
> /Users/dongjoon/.jenv/versions/oracle17/bin/java: Mach-O 64-bit executable 
> arm64
> /Users/dongjoon/.jenv/versions/zulu17/bin/java: Mach-O 64-bit executable arm64
> /Users/dongjoon/.jenv/versions/temurin17/bin/java: Mach-O 64-bit executable 
> arm64
> {code}
> Since RocksDBJNI still doesn't support Apple Silicon natively, the following 
> failures occur on M1.
> {code}
> $ build/sbt "sql/testOnly *RocksDB* *.StreamingSessionWindowSuite"
> ...
> [info] Run completed in 23 seconds, 281 milliseconds.
> [info] Total number of tests run: 32
> [info] Suites: completed 2, aborted 2
> [info] Tests: succeeded 22, failed 10, canceled 0, ignored 0, pending 0
> [info] *** 2 SUITES ABORTED ***
> [info] *** 10 TESTS FAILED ***
> [error] Failed tests:
> [error]   org.apache.spark.sql.streaming.StreamingSessionWindowSuite
> [error]   
> org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreIntegrationSuite
> [error] Error during tests:
> [error]   org.apache.spark.sql.execution.streaming.state.RocksDBSuite
> [error]   
> org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreSuite
> [error] (sql / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful
> [error] Total time: 43 s, completed Nov 10, 2021 4:29:50 PM
> {code}
> This issue aims to add ExtendedRocksDBTest to disable RocksDB selectively on 
> Apple Silicon.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37272) Add `ExtendedRocksDBTest` and disable RocksDB tests on Apple Silicon

2021-11-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37272:
--
Summary: Add `ExtendedRocksDBTest` and disable RocksDB tests on Apple 
Silicon  (was: Add ExtendedRocksDBTest)

> Add `ExtendedRocksDBTest` and disable RocksDB tests on Apple Silicon
> 
>
> Key: SPARK-37272
> URL: https://issues.apache.org/jira/browse/SPARK-37272
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37272) Add ExtendedRocksDBTest

2021-11-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-37272.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34547
[https://github.com/apache/spark/pull/34547]

> Add ExtendedRocksDBTest
> ---
>
> Key: SPARK-37272
> URL: https://issues.apache.org/jira/browse/SPARK-37272
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37272) Add ExtendedRocksDBTest

2021-11-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-37272:
-

Assignee: Dongjoon Hyun

> Add ExtendedRocksDBTest
> ---
>
> Key: SPARK-37272
> URL: https://issues.apache.org/jira/browse/SPARK-37272
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37270) Incorrect result of filter using isNull condition

2021-11-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37270:
-
Labels: correctness  (was: )

> Incorrect result of filter using isNull condition
> 
>
> Key: SPARK-37270
> URL: https://issues.apache.org/jira/browse/SPARK-37270
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Tomasz Kus
>Priority: Major
>  Labels: correctness
>
> Simple code that reproduces this issue:
> {code:java}
>  val frame = Seq((false, 1)).toDF("bool", "number")
> frame
>   .checkpoint()
>   .withColumn("conditions", when(col("bool"), "I am not null"))
>   .filter(col("conditions").isNull)
>   .show(false){code}
> Although "conditions" column is null
> {code:java}
>  +-----+------+----------+
> |bool |number|conditions|
> +-----+------+----------+
> |false|1     |null      |
> +-----+------+----------+{code}
> an empty result is shown.
> Execution plans:
> {code:java}
> == Parsed Logical Plan ==
> 'Filter isnull('conditions)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END 
> AS conditions#252]
>    +- LogicalRDD [bool#124, number#125], false
> == Analyzed Logical Plan ==
> bool: boolean, number: int, conditions: string
> Filter isnull(conditions#252)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END 
> AS conditions#252]
>    +- LogicalRDD [bool#124, number#125], false
> == Optimized Logical Plan ==
> LocalRelation , [bool#124, number#125, conditions#252]
> == Physical Plan ==
> LocalTableScan , [bool#124, number#125, conditions#252]
>  {code}
> After removing the checkpoint, the proper result is returned and the execution 
> plans are as follows:
> {code:java}
> == Parsed Logical Plan ==
> 'Filter isnull('conditions)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END 
> AS conditions#256]
>    +- Project [_1#119 AS bool#124, _2#120 AS number#125]
>       +- LocalRelation [_1#119, _2#120]
> == Analyzed Logical Plan ==
> bool: boolean, number: int, conditions: string
> Filter isnull(conditions#256)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END 
> AS conditions#256]
>    +- Project [_1#119 AS bool#124, _2#120 AS number#125]
>       +- LocalRelation [_1#119, _2#120]
> == Optimized Logical Plan ==
> LocalRelation [bool#124, number#125, conditions#256]
> == Physical Plan ==
> LocalTableScan [bool#124, number#125, conditions#256]
>  {code}
> It seems that the most important difference is LogicalRDD -> LocalRelation.
> The following workarounds retrieve the correct result:
> 1) remove checkpoint
> 2) add explicit .otherwise(null) to when
> 3) add checkpoint() or cache() just before filter
> 4) downgrade to Spark 3.1.2



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37254) 100% CPU usage on Spark Thrift Server.

2021-11-10 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17442046#comment-17442046
 ] 

Hyukjin Kwon commented on SPARK-37254:
--

It would be much easier to investigate the issue if there were reproducible steps.

> 100% CPU usage on Spark Thrift Server.
> --
>
> Key: SPARK-37254
> URL: https://issues.apache.org/jira/browse/SPARK-37254
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: ramakrishna chilaka
>Priority: Major
>
> We are trying to use the Spark Thrift Server as a distributed SQL query engine. 
> Queries work when the resident memory of the Spark Thrift Server (as reported 
> by htop) is comparatively less than the configured driver memory. The same 
> queries drive CPU usage to 100% and stay there when the resident memory of the 
> Spark Thrift Server exceeds the configured driver memory. I am using 
> incremental collect = false, as I need faster responses for exploratory 
> queries. I am trying to understand the following points:
>  * Why isn't the Spark Thrift Server releasing memory back when there are no 
> queries?
>  * What causes the Spark Thrift Server to go to 100% CPU usage on all cores 
> when its resident memory is greater than the driver memory (usually by about 
> 10%), and why do queries just get stuck?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37233) Inline type hints for files in python/pyspark/mllib

2021-11-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37233:


Assignee: dch nguyen

> Inline type hints for files in python/pyspark/mllib
> ---
>
> Key: SPARK-37233
> URL: https://issues.apache.org/jira/browse/SPARK-37233
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Assignee: dch nguyen
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37260) PYSPARK Arrow 3.2.0 docs link invalid

2021-11-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37260:
-
Fix Version/s: 3.2.1

> PYSPARK Arrow 3.2.0 docs link invalid
> -
>
> Key: SPARK-37260
> URL: https://issues.apache.org/jira/browse/SPARK-37260
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.2.0
>Reporter: Thomas Graves
>Priority: Major
> Fix For: 3.2.1
>
>
> [http://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html]
> links to:
> [https://spark.apache.org/docs/latest/api/python/user_guide/arrow_pandas.html]
> which links to:
> [https://spark.apache.org/docs/latest/api/python/sql/arrow_pandas.rst]
> But that is an invalid link.
> I assume it's supposed to point to:
> https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37260) PYSPARK Arrow 3.2.0 docs link invalid

2021-11-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37260.
--
Resolution: Fixed

> PYSPARK Arrow 3.2.0 docs link invalid
> -
>
> Key: SPARK-37260
> URL: https://issues.apache.org/jira/browse/SPARK-37260
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.2.0
>Reporter: Thomas Graves
>Priority: Major
>
> [http://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html]
> links to:
> [https://spark.apache.org/docs/latest/api/python/user_guide/arrow_pandas.html]
> which links to:
> [https://spark.apache.org/docs/latest/api/python/sql/arrow_pandas.rst]
> But that is an invalid link.
> I assume it's supposed to point to:
> https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37260) PYSPARK Arrow 3.2.0 docs link invalid

2021-11-10 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17442044#comment-17442044
 ] 

Hyukjin Kwon commented on SPARK-37260:
--

Oh yeah, that's fixed via #34475. There are some more ongoing issues in the 
docs. I will fix them up, and then we could probably initiate Spark 3.2.1.

> PYSPARK Arrow 3.2.0 docs link invalid
> -
>
> Key: SPARK-37260
> URL: https://issues.apache.org/jira/browse/SPARK-37260
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.2.0
>Reporter: Thomas Graves
>Priority: Major
>
> [http://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html]
> links to:
> [https://spark.apache.org/docs/latest/api/python/user_guide/arrow_pandas.html]
> which links to:
> [https://spark.apache.org/docs/latest/api/python/sql/arrow_pandas.rst]
> But that is an invalid link.
> I assume it's supposed to point to:
> https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37272) Add ExtendedRocksDBTest

2021-11-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37272:


Assignee: Apache Spark

> Add ExtendedRocksDBTest
> ---
>
> Key: SPARK-37272
> URL: https://issues.apache.org/jira/browse/SPARK-37272
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37272) Add ExtendedRocksDBTest

2021-11-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37272:


Assignee: (was: Apache Spark)

> Add ExtendedRocksDBTest
> ---
>
> Key: SPARK-37272
> URL: https://issues.apache.org/jira/browse/SPARK-37272
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37272) Add ExtendedRocksDBTest

2021-11-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17442020#comment-17442020
 ] 

Apache Spark commented on SPARK-37272:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/34547

> Add ExtendedRocksDBTest
> ---
>
> Key: SPARK-37272
> URL: https://issues.apache.org/jira/browse/SPARK-37272
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37273) Hidden File Metadata Support for Spark SQL

2021-11-10 Thread Yaohua Zhao (Jira)
Yaohua Zhao created SPARK-37273:
---

 Summary: Hidden File Metadata Support for Spark SQL
 Key: SPARK-37273
 URL: https://issues.apache.org/jira/browse/SPARK-37273
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yaohua Zhao


Provide a new interface in Spark SQL that allows users to query the metadata of 
the input files for all file formats, exposing it as *built-in hidden columns*, 
meaning *users can only see them when they explicitly reference them* (e.g. 
file path, file name).
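
As a purely illustrative sketch of the intended user experience (the {{_metadata}} 
column name and its fields below are placeholders, not a confirmed API), such hidden 
columns would stay invisible until selected explicitly:

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hidden-metadata-sketch").getOrCreate()
import spark.implicits._

val df = spark.read.parquet("/tmp/events")

// A plain df.show() would not include the hidden columns; they only appear
// when referenced explicitly (hypothetical field names shown here).
df.select($"_metadata.file_path", $"_metadata.file_name").show(false)
{code}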



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37272) Add ExtendedRocksDBTest

2021-11-10 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-37272:
-

 Summary: Add ExtendedRocksDBTest
 Key: SPARK-37272
 URL: https://issues.apache.org/jira/browse/SPARK-37272
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Tests
Affects Versions: 3.3.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33502) Large number of SELECT columns causes StackOverflowError

2021-11-10 Thread Arwin S Tio (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17236434#comment-17236434
 ] 

Arwin S Tio edited comment on SPARK-33502 at 11/10/21, 7:22 PM:


Note: running my program with "-Xss3072k" fixed it. Giving the Spark driver a 
bigger thread stack lets it handle more columns before overflowing.
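
For reference, a minimal sketch of where such a stack-size flag can be applied in a 
Spark application (assuming the overflow happens while the driver plans and serializes 
the job, as in the trace below): the driver JVM has to receive the flag at launch, 
while executors can pick it up from configuration.

{code:java}
import org.apache.spark.sql.SparkSession

// Executor JVMs can receive the flag through this conf. The driver's own stack
// size cannot be changed after startup, so it must be passed when the driver JVM
// is launched, e.g. spark-submit --driver-java-options "-Xss3072k".
val spark = SparkSession.builder()
  .appName("wide-select")
  .config("spark.executor.extraJavaOptions", "-Xss3072k")
  .getOrCreate()
{code}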


was (Author: cozos):
Note, running my program with "-Xss3072k" fixed it

> Large number of SELECT columns causes StackOverflowError
> 
>
> Key: SPARK-33502
> URL: https://issues.apache.org/jira/browse/SPARK-33502
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7
>Reporter: Arwin S Tio
>Priority: Minor
>
> On Spark 2.4.7 Standalone Mode on my laptop (Macbook Pro 2015), I ran the 
> following:
> {code:java}
> public class TestSparkStackOverflow {
>   public static void main(String [] args) {
> SparkSession spark = SparkSession
>   .builder()
>   .config("spark.master", "local[8]")
>   .appName(TestSparkStackOverflow.class.getSimpleName())
>   .getOrCreate();
> StructType inputSchema = new StructType();
> inputSchema = inputSchema.add("foo", DataTypes.StringType);
> 
> Dataset<Row> inputDf = spark.createDataFrame(
>   Arrays.asList(
> RowFactory.create("1"),
> RowFactory.create("2"),
> RowFactory.create("3")
>   ),
>   inputSchema
> );
>  
> List<Column> lotsOfColumns = new ArrayList<>();
> for (int i = 0; i < 3000; i++) {
>   lotsOfColumns.add(lit("").as("field" + i).cast(DataTypes.StringType));
> }
> lotsOfColumns.add(new Column("foo"));
> inputDf
>   
> .select(JavaConverters.collectionAsScalaIterableConverter(lotsOfColumns).asScala().toSeq())
>   .write()
>   .format("csv")
>   .mode(SaveMode.Append)
>   .save("file:///tmp/testoutput");
>   }
> }
>  {code}
>  
> And I get a StackOverflowError:
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: Job 
> aborted.Exception in thread "main" org.apache.spark.SparkException: Job 
> aborted. at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
>  at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81) 
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
>  at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
>  at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
>  at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:696) at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:305)
>  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:291) at 
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:249) at 
> udp.task.TestSparkStackOverflow.main(TestSparkStackOverflow.java:52)Caused 
> by: java.lang.StackOverflowError at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1522) 
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) 
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) 
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) 
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) 
> at 

[jira] [Resolved] (SPARK-35557) Adapt uses of JDK 17 Internal APIs

2021-11-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-35557.
---
Resolution: Duplicate

This is superseded by SPARK-36796, which adds the required `--add-opens` options.
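
For builds that predate the SPARK-36796 defaults, a hedged workaround sketch is to pass 
the opens from the error message yourself (the exact set of packages required may be 
larger than shown here):

{code:java}
import org.apache.spark.sql.SparkSession

// Hypothetical workaround for Spark builds without the SPARK-36796 launcher defaults.
// Executor-side JVM flags go through this conf; the equivalent driver-side flag must
// be supplied when the driver JVM starts, e.g.
//   spark-submit --driver-java-options "--add-opens=java.base/java.nio=ALL-UNNAMED"
val spark = SparkSession.builder()
  .appName("jdk17-sketch")
  .config("spark.executor.extraJavaOptions",
    "--add-opens=java.base/java.nio=ALL-UNNAMED")
  .getOrCreate()
{code}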

> Adapt uses of JDK 17 Internal APIs
> --
>
> Key: SPARK-35557
> URL: https://issues.apache.org/jira/browse/SPARK-35557
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Ismaël Mejía
>Priority: Major
>
> I tried to run a Spark pipeline using the most recent 3.2.0-SNAPSHOT with 
> Scala 2.12.4 on Java 17 and I found this exception:
> {code:java}
> java.lang.ExceptionInInitializerError
>  at org.apache.spark.unsafe.array.ByteArrayMethods.<clinit> (ByteArrayMethods.java:54)
>  at org.apache.spark.internal.config.package$.<clinit> (package.scala:1149)
>  at org.apache.spark.SparkConf$.<clinit> (SparkConf.scala:654)
>  at org.apache.spark.SparkConf.contains (SparkConf.scala:455)
> ...
> Caused by: java.lang.reflect.InaccessibleObjectException: Unable to make 
> private java.nio.DirectByteBuffer(long,int) accessible: module java.base does 
> not "opens java.nio" to unnamed module @110df513
>  at java.lang.reflect.AccessibleObject.checkCanSetAccessible (AccessibleObject.java:357)
>  at java.lang.reflect.AccessibleObject.checkCanSetAccessible (AccessibleObject.java:297)
>  at java.lang.reflect.Constructor.checkCanSetAccessible (Constructor.java:188)
>  at java.lang.reflect.Constructor.setAccessible (Constructor.java:181)
>  at org.apache.spark.unsafe.Platform.<clinit> (Platform.java:56)
>  at org.apache.spark.unsafe.array.ByteArrayMethods.<clinit> (ByteArrayMethods.java:54)
>  at org.apache.spark.internal.config.package$.<clinit> (package.scala:1149)
>  at org.apache.spark.SparkConf$.<clinit> (SparkConf.scala:654)
>  at org.apache.spark.SparkConf.contains (SparkConf.scala:455)
> {code}
> It seems that Java 17 will be more strict about uses of JDK Internals 
> [https://openjdk.java.net/jeps/403]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37265) Support Java 17 in `dev/test-dependencies.sh`

2021-11-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-37265.
---
Resolution: Invalid

Let me close this as Invalid.

> Support Java 17 in `dev/test-dependencies.sh`
> -
>
> Key: SPARK-37265
> URL: https://issues.apache.org/jira/browse/SPARK-37265
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37271) Spark OOM issue

2021-11-10 Thread M Shadab (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

M Shadab resolved SPARK-37271.
--
Resolution: Fixed

done

> Spark OOM issue
> ---
>
> Key: SPARK-37271
> URL: https://issues.apache.org/jira/browse/SPARK-37271
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 3.1.0
>Reporter: M Shadab
>Priority: Critical
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37271) Spark OOM issue

2021-11-10 Thread M Shadab (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441805#comment-17441805
 ] 

M Shadab commented on SPARK-37271:
--

Memory increased for the container

> Spark OOM issue
> ---
>
> Key: SPARK-37271
> URL: https://issues.apache.org/jira/browse/SPARK-37271
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 3.1.0
>Reporter: M Shadab
>Priority: Critical
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37271) Spark OOM issue

2021-11-10 Thread M Shadab (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

M Shadab updated SPARK-37271:
-
Shepherd: M Shadab

> Spark OOM issue
> ---
>
> Key: SPARK-37271
> URL: https://issues.apache.org/jira/browse/SPARK-37271
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 3.1.0
>Reporter: M Shadab
>Priority: Critical
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37271) Spark OOM issue

2021-11-10 Thread M Shadab (Jira)
M Shadab created SPARK-37271:


 Summary: Spark OOM issue
 Key: SPARK-37271
 URL: https://issues.apache.org/jira/browse/SPARK-37271
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 3.1.0
Reporter: M Shadab






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36575) Executor lost may cause spark stage to hang

2021-11-10 Thread wuyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441796#comment-17441796
 ] 

wuyi commented on SPARK-36575:
--

FYI: the fix is reverted due to test issues.

> Executor lost may cause spark stage to hang
> ---
>
> Key: SPARK-36575
> URL: https://issues.apache.org/jira/browse/SPARK-36575
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.3.3
>Reporter: hujiahua
>Assignee: hujiahua
>Priority: Major
> Fix For: 3.3.0
>
>
> When an executor finishes a task of some stage, the driver receives a 
> `StatusUpdate` event to handle it. If the driver finds at the same time that the 
> executor's heartbeat has timed out, it also has to handle an ExecutorLost 
> event. There is a race condition here that can leave the task never 
> rescheduled and the stage hanging.
>  The problem is that `TaskResultGetter.enqueueSuccessfulTask` uses an 
> asynchronous thread to handle the successful task, which means the synchronized 
> lock of `TaskSchedulerImpl` is released prematurely midway 
> [https://github.com/apache/spark/blob/branch-2.3/core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala#L61].
>  So `TaskSchedulerImpl` may handle executorLost first, and then the asynchronous 
> thread goes on to handle the successful task. This leaves 
> `TaskSetManager.successful` and `TaskSetManager.tasksSuccessful` with wrong 
> results. 
> Then `HeartbeatReceiver.expireDeadHosts` executes `killAndReplaceExecutor`, 
> which makes `TaskSchedulerImpl.executorLost` run twice. 
> `copiesRunning(index) -= 1` is processed in `executorLost`, so running 
> `executorLost` twice drives `copiesRunning(index)` to -1, which leaves the stage 
> hanging. 
> Related log when the issue occurs: 
>  21/08/05 02:58:14,784 INFO [dispatcher-event-loop-8] TaskSetManager: 
> Starting task 4004.0 in stage 1328625.0 (TID 347212402, 10.109.89.3, executor 
> 366724, partition 4004, ANY, 7994 bytes)
>  21/08/05 03:00:24,126 ERROR [dispatcher-event-loop-4] TaskSchedulerImpl: 
> Lost executor 366724 on 10.109.89.3: Executor heartbeat timed out after 
> 140830 ms
>  21/08/05 03:00:24,218 WARN [dispatcher-event-loop-4] TaskSetManager: Lost 
> task 4004.0 in stage 1328625.0 (TID 347212402, 10.109.89.3, executor 366724): 
> ExecutorLostFailure (executor 366724 exited caused by one of the running 
> tasks) Reason: Executor heartbeat timed out after 140830 ms
>  21/08/05 03:00:24,542 INFO [task-result-getter-2] TaskSetManager: Finished 
> task 4004.0 in stage 1328625.0 (TID 347212402) in 129758 ms on 10.109.89.3 
> (executor 366724) (3047/5400)
> 21/08/05 03:00:34,621 INFO [dispatcher-event-loop-8] TaskSchedulerImpl: 
> Executor 366724 on 10.109.89.3 killed by driver.
>  21/08/05 03:00:34,771 INFO [spark-listener-group-executorManagement] 
> ExecutorMonitor: Executor 366724 removed (new total is 793)
> 21/08/05 03:00:42,360 INFO [dag-scheduler-event-loop] DAGScheduler: Executor 
> lost: 366724 (epoch 417416)
>  21/08/05 03:00:42,360 INFO [dispatcher-event-loop-14] 
> BlockManagerMasterEndpoint: Trying to remove executor 366724 from 
> BlockManagerMaster.
>  21/08/05 03:00:42,360 INFO [dispatcher-event-loop-14] 
> BlockManagerMasterEndpoint: Removing block manager BlockManagerId(366724, 
> 10.109.89.3, 43402, None)
>  21/08/05 03:00:42,360 INFO [dag-scheduler-event-loop] BlockManagerMaster: 
> Removed 366724 successfully in removeExecutor
>  21/08/05 03:00:42,360 INFO [dag-scheduler-event-loop] DAGScheduler: Shuffle 
> files lost for executor: 366724 (epoch 417416)
>  21/08/05 03:00:44,584 INFO [dag-scheduler-event-loop] DAGScheduler: Executor 
> lost: 366724 (epoch 417473)
>  21/08/05 03:00:44,584 INFO [dispatcher-event-loop-15] 
> BlockManagerMasterEndpoint: Trying to remove executor 366724 from 
> BlockManagerMaster.
>  21/08/05 03:00:44,584 INFO [dag-scheduler-event-loop] BlockManagerMaster: 
> Removed 366724 successfully in removeExecutor
>  21/08/05 03:00:44,584 INFO [dag-scheduler-event-loop] DAGScheduler: Shuffle 
> files lost for executor: 366724 (epoch 417473)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37045) Unify v1 and v2 ALTER TABLE .. ADD COLUMNS tests

2021-11-10 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-37045:


Assignee: Max Gekk

> Unify v1 and v2 ALTER TABLE .. ADD COLUMNS tests
> 
>
> Key: SPARK-37045
> URL: https://issues.apache.org/jira/browse/SPARK-37045
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Terry Kim
>Assignee: Max Gekk
>Priority: Major
>
> Extract ALTER TABLE .. ADD COLUMNS tests to a common place to run them for 
> V1 and V2 datasources. Some tests can be placed in V1- and V2-specific test 
> suites.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37045) Unify v1 and v2 ALTER TABLE .. ADD COLUMNS tests

2021-11-10 Thread Max Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441755#comment-17441755
 ] 

Max Gekk commented on SPARK-37045:
--

I am working on this.

> Unify v1 and v2 ALTER TABLE .. ADD COLUMNS tests
> 
>
> Key: SPARK-37045
> URL: https://issues.apache.org/jira/browse/SPARK-37045
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Terry Kim
>Priority: Major
>
> Extract ALTER TABLE .. ADD COLUMNS tests to a common place to run them for 
> V1 and V2 datasources. Some tests can be placed in V1- and V2-specific test 
> suites.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37236) Inline type hints for KernelDensity.pyi, test.py in python/pyspark/mllib/stat/

2021-11-10 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz resolved SPARK-37236.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34510
[https://github.com/apache/spark/pull/34510]

> Inline type hints for KernelDensity.pyi, test.py in python/pyspark/mllib/stat/
> --
>
> Key: SPARK-37236
> URL: https://issues.apache.org/jira/browse/SPARK-37236
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Assignee: dch nguyen
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37236) Inline type hints for KernelDensity.pyi, test.py in python/pyspark/mllib/stat/

2021-11-10 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz reassigned SPARK-37236:
--

Assignee: dch nguyen

> Inline type hints for KernelDensity.pyi, test.py in python/pyspark/mllib/stat/
> --
>
> Key: SPARK-37236
> URL: https://issues.apache.org/jira/browse/SPARK-37236
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Assignee: dch nguyen
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37270) Incorrect result of filter using isNull condition

2021-11-10 Thread Tomasz Kus (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomasz Kus updated SPARK-37270:
---
Component/s: SQL

> Incorrect result of filter using isNull condition
> 
>
> Key: SPARK-37270
> URL: https://issues.apache.org/jira/browse/SPARK-37270
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Tomasz Kus
>Priority: Major
>
> Simple code that reproduces this issue:
> {code:java}
>  val frame = Seq((false, 1)).toDF("bool", "number")
> frame
>   .checkpoint()
>   .withColumn("conditions", when(col("bool"), "I am not null"))
>   .filter(col("conditions").isNull)
>   .show(false){code}
> Although "conditions" column is null
> {code:java}
>  +-+--+--+
> |bool |number|conditions|
> +-+--+--+
> |false|1     |null      |
> +-+--+--+{code}
> empty result is shown.
> Execution plans:
> {code:java}
> == Parsed Logical Plan ==
> 'Filter isnull('conditions)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END 
> AS conditions#252]
>    +- LogicalRDD [bool#124, number#125], false
> == Analyzed Logical Plan ==
> bool: boolean, number: int, conditions: string
> Filter isnull(conditions#252)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END 
> AS conditions#252]
>    +- LogicalRDD [bool#124, number#125], false
> == Optimized Logical Plan ==
> LocalRelation , [bool#124, number#125, conditions#252]
> == Physical Plan ==
> LocalTableScan , [bool#124, number#125, conditions#252]
>  {code}
> After removing the checkpoint, the proper result is returned and the execution 
> plans are as follows:
> {code:java}
> == Parsed Logical Plan ==
> 'Filter isnull('conditions)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END 
> AS conditions#256]
>    +- Project [_1#119 AS bool#124, _2#120 AS number#125]
>       +- LocalRelation [_1#119, _2#120]
> == Analyzed Logical Plan ==
> bool: boolean, number: int, conditions: string
> Filter isnull(conditions#256)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END 
> AS conditions#256]
>    +- Project [_1#119 AS bool#124, _2#120 AS number#125]
>       +- LocalRelation [_1#119, _2#120]
> == Optimized Logical Plan ==
> LocalRelation [bool#124, number#125, conditions#256]
> == Physical Plan ==
> LocalTableScan [bool#124, number#125, conditions#256]
>  {code}
> It seems that the most important difference is LogicalRDD -> LocalRelation
> The following workarounds retrieve the correct result:
> 1) remove checkpoint
> 2) add explicit .otherwise(null) to when
> 3) add checkpoint() or cache() just before filter
> 4) downgrade to Spark 3.1.2



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37261) Check adding partitions with ANSI intervals

2021-11-10 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-37261.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34537
[https://github.com/apache/spark/pull/34537]

> Check adding partitions with ANSI intervals
> ---
>
> Key: SPARK-37261
> URL: https://issues.apache.org/jira/browse/SPARK-37261
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.3.0
>
>
> Add tests that should check adding partitions with ANSI intervals via the 
> ALTER TABLE command.
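
A rough sketch of the kind of scenario such tests would cover, assuming an existing 
SparkSession {{spark}} (table and partition names are illustrative, and the exact DDL 
support for interval-typed partition columns is precisely what the tests are meant to 
verify):

{code:java}
// Create a table partitioned by an ANSI year-month interval column, then add a
// partition for it via ALTER TABLE (illustrative names, not taken from the PR).
spark.sql("CREATE TABLE tbl_ym (id INT, part INTERVAL YEAR) USING parquet PARTITIONED BY (part)")
spark.sql("ALTER TABLE tbl_ym ADD PARTITION (part = INTERVAL '1' YEAR)")
spark.sql("SHOW PARTITIONS tbl_ym").show(false)
{code}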



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37270) Incorrect result of filter using isNull condition

2021-11-10 Thread Tomasz Kus (Jira)
Tomasz Kus created SPARK-37270:
--

 Summary: Incorrect result of filter using isNull condition
 Key: SPARK-37270
 URL: https://issues.apache.org/jira/browse/SPARK-37270
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: Tomasz Kus


Simple code that reproduces this issue:
{code:java}
 val frame = Seq((false, 1)).toDF("bool", "number")
frame
  .checkpoint()
  .withColumn("conditions", when(col("bool"), "I am not null"))
  .filter(col("conditions").isNull)
  .show(false){code}
Although "conditions" column is null
{code:java}
 +-+--+--+
|bool |number|conditions|
+-+--+--+
|false|1     |null      |
+-+--+--+{code}
empty result is shown.

Execution plans:
{code:java}
== Parsed Logical Plan ==
'Filter isnull('conditions)
+- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS 
conditions#252]
   +- LogicalRDD [bool#124, number#125], false

== Analyzed Logical Plan ==
bool: boolean, number: int, conditions: string
Filter isnull(conditions#252)
+- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS 
conditions#252]
   +- LogicalRDD [bool#124, number#125], false

== Optimized Logical Plan ==
LocalRelation , [bool#124, number#125, conditions#252]

== Physical Plan ==
LocalTableScan , [bool#124, number#125, conditions#252]
 {code}
After removing the checkpoint, the proper result is returned and the execution plans 
are as follows:
{code:java}
== Parsed Logical Plan ==
'Filter isnull('conditions)
+- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS 
conditions#256]
   +- Project [_1#119 AS bool#124, _2#120 AS number#125]
      +- LocalRelation [_1#119, _2#120]

== Analyzed Logical Plan ==
bool: boolean, number: int, conditions: string
Filter isnull(conditions#256)
+- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS 
conditions#256]
   +- Project [_1#119 AS bool#124, _2#120 AS number#125]
      +- LocalRelation [_1#119, _2#120]

== Optimized Logical Plan ==
LocalRelation [bool#124, number#125, conditions#256]

== Physical Plan ==
LocalTableScan [bool#124, number#125, conditions#256]
 {code}
It seems that the most important difference is LogicalRDD -> LocalRelation

The following workarounds retrieve the correct result:

1) remove the checkpoint

2) add an explicit .otherwise(null) to the when (see the sketch after this list)

3) add checkpoint() or cache() just before the filter

4) downgrade to Spark 3.1.2
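
A minimal sketch of workaround 2, in the same spark-shell context as the snippet above 
(checkpoint directory already set):

{code:java}
import org.apache.spark.sql.functions.{col, lit, when}

val frame = Seq((false, 1)).toDF("bool", "number")

frame
  .checkpoint()
  // Spell out the ELSE branch so the null is an explicit literal instead of the
  // implicit null of a CASE WHEN without an ELSE.
  .withColumn("conditions", when(col("bool"), lit("I am not null")).otherwise(lit(null)))
  .filter(col("conditions").isNull)
  .show(false)
{code}

Per this report, the explicit otherwise makes the filtered row come back as expected.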



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37269) The partitionOverwriteMode option is not respected when using insertInto

2021-11-10 Thread David Szakallas (Jira)
David Szakallas created SPARK-37269:
---

 Summary: The partitionOverwriteMode option is not respected when 
using insertInto
 Key: SPARK-37269
 URL: https://issues.apache.org/jira/browse/SPARK-37269
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: David Szakallas


From the documentation of the {{spark.sql.sources.partitionOverwriteMode}} 
configuration option:
{quote}This can also be set as an output option for a data source using key 
partitionOverwriteMode (which takes precedence over this setting), e.g. 
dataframe.write.option("partitionOverwriteMode", "dynamic").save(path).
{quote}
This is true when using .save(); however, .insertInto() does not respect the 
output option.
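
A minimal sketch of the difference, assuming an existing SparkSession {{spark}}, a 
DataFrame {{df}}, and a partitioned table {{events_by_day}} (all names are placeholders):

{code:java}
// Honored: the per-write option together with save()
df.write
  .mode("overwrite")
  .option("partitionOverwriteMode", "dynamic")
  .save("/tmp/events_by_day")

// Not honored, per this report: the same option together with insertInto()
df.write
  .mode("overwrite")
  .option("partitionOverwriteMode", "dynamic")
  .insertInto("events_by_day")

// Workaround sketch: fall back to the session-level configuration instead
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
df.write.mode("overwrite").insertInto("events_by_day")
{code}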

 

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37268) Remove unused method call in FileScanRDD

2021-11-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441669#comment-17441669
 ] 

Apache Spark commented on SPARK-37268:
--

User 'zuston' has created a pull request for this issue:
https://github.com/apache/spark/pull/34545

> Remove unused method call in FileScanRDD
> 
>
> Key: SPARK-37268
> URL: https://issues.apache.org/jira/browse/SPARK-37268
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Junfan Zhang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37268) Remove unused method call in FileScanRDD

2021-11-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37268:


Assignee: (was: Apache Spark)

> Remove unused method call in FileScanRDD
> 
>
> Key: SPARK-37268
> URL: https://issues.apache.org/jira/browse/SPARK-37268
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Junfan Zhang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37268) Remove unused method call in FileScanRDD

2021-11-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37268:


Assignee: Apache Spark

> Remove unused method call in FileScanRDD
> 
>
> Key: SPARK-37268
> URL: https://issues.apache.org/jira/browse/SPARK-37268
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Junfan Zhang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37268) Remove unused method call in FileScanRDD

2021-11-10 Thread Junfan Zhang (Jira)
Junfan Zhang created SPARK-37268:


 Summary: Remove unused method call in FileScanRDD
 Key: SPARK-37268
 URL: https://issues.apache.org/jira/browse/SPARK-37268
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: Junfan Zhang






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37022) Use black as a formatter for the whole PySpark codebase.

2021-11-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441612#comment-17441612
 ] 

Apache Spark commented on SPARK-37022:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/34544

> Use black as a formatter for the whole PySpark codebase.
> 
>
> Key: SPARK-37022
> URL: https://issues.apache.org/jira/browse/SPARK-37022
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: black-diff-stats.txt, pyproject.toml
>
>
> [{{black}}|https://github.com/psf/black] is a popular Python code formatter. 
> It is used by a number of projects, both small and large, including prominent 
> ones, like pandas, scikit-learn, Django or SQLAlchemy. Black is already used 
> to format {{pyspark.pandas}} and (though not enforced) stubs files.
> We should consider using black to enforce formatting of all PySpark files. 
> There are multiple reasons to do that:
>  - Consistency: black is already used across existing codebase and black 
> formatted chunks of code are already added to modules other than 
> pyspark.pandas as a result of type hints inlining (SPARK-36845).
>  - Lower cost of contributing and reviewing: Formatting can be automatically 
> enforced and applied.
>  - Simplify reviews: In general, black-formatted code produces small and 
> highly readable diffs.
>  - Reduce effort required to maintain patched forks: smaller diffs + 
> predictable formatting.
> Risks:
>  - Initial reformatting requires quite significant changes.
>  - Applying black will break blame in GitHub UI (for git in general see 
> [Avoiding ruining git 
> blame|https://black.readthedocs.io/en/stable/guides/introducing_black_to_your_project.html?highlight=blame#avoiding-ruining-git-blame]).
> Additional steps:
>  - To simplify backporting, black will have to be applied to all active 
> branches.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37022) Use black as a formatter for the whole PySpark codebase.

2021-11-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441610#comment-17441610
 ] 

Apache Spark commented on SPARK-37022:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/34544

> Use black as a formatter for the whole PySpark codebase.
> 
>
> Key: SPARK-37022
> URL: https://issues.apache.org/jira/browse/SPARK-37022
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: black-diff-stats.txt, pyproject.toml
>
>
> [{{black}}|https://github.com/psf/black] is a popular Python code formatter. 
> It is used by a number of projects, both small and large, including prominent 
> ones, like pandas, scikit-learn, Django or SQLAlchemy. Black is already used 
> to format {{pyspark.pandas}} and (though not enforced) stubs files.
> We should consider using black to enforce formatting of all PySpark files. 
> There are multiple reasons to do that:
>  - Consistency: black is already used across existing codebase and black 
> formatted chunks of code are already added to modules other than 
> pyspark.pandas as a result of type hints inlining (SPARK-36845).
>  - Lower cost of contributing and reviewing: Formatting can be automatically 
> enforced and applied.
>  - Simplify reviews: In general, black-formatted code produces small and 
> highly readable diffs.
>  - Reduce effort required to maintain patched forks: smaller diffs + 
> predictable formatting.
> Risks:
>  - Initial reformatting requires quite significant changes.
>  - Applying black will break blame in GitHub UI (for git in general see 
> [Avoiding ruining git 
> blame|https://black.readthedocs.io/en/stable/guides/introducing_black_to_your_project.html?highlight=blame#avoiding-ruining-git-blame]).
> Additional steps:
>  - To simplify backporting, black will have to be applied to all active 
> branches.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


