[jira] [Created] (SPARK-37285) Add Weight of Evidence and Information value to ml.feature
Simon Tao created SPARK-37285: - Summary: Add Weight of Evidence and Information value to ml.feature Key: SPARK-37285 URL: https://issues.apache.org/jira/browse/SPARK-37285 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 3.2.0 Reporter: Simon Tao The weight of evidence (WOE) and information value (IV) provide a useful framework for exploratory analysis and variable screening for binary classifiers, and help in several ways: 1. Checks the linear relationship of a feature with the dependent variable to be used in the model. 2. Is a good variable transformation method for both continuous and categorical features. 3. Is better than one-hot encoding, as this method of variable transformation does not increase the complexity of the model. 4. Detects linear and non-linear relationships. 5. Is useful in feature selection. 6. Is a good measure of the predictive power of a feature and also helps point out suspicious features. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
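For context, the statistic this feature request proposes can be sketched in plain Python. This is a minimal sketch of the math only, not a Spark API: the function name and the (event, non_event) bin representation are illustrative, sign conventions for WOE differ across references, and production code would smooth or skip empty bins to avoid division by zero.

```python
import math

def woe_iv(bins):
    """Weight of evidence per bin and total information value for a
    binary target.

    bins: list of (event_count, non_event_count) pairs, one per bin
          of the feature being screened.
    Returns (per-bin WOE list, total IV).
    """
    total_events = sum(e for e, _ in bins)
    total_non_events = sum(n for _, n in bins)
    woes, iv = [], 0.0
    for events, non_events in bins:
        # Share of all events / non-events that fall into this bin.
        pct_event = events / total_events
        pct_non_event = non_events / total_non_events
        # WOE here is ln(% non-events / % events); some references flip it.
        woe = math.log(pct_non_event / pct_event)
        woes.append(woe)
        # Each bin's IV contribution; the total IV is non-negative.
        iv += (pct_non_event - pct_event) * woe
    return woes, iv
```

A common credit-scoring rule of thumb treats IV below roughly 0.02 as unpredictive and above roughly 0.5 as suspiciously strong, which is the "suspicious feature" signal the description mentions.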
[jira] [Updated] (SPARK-37274) When the value of this parameter is greater than the maximum value of int type, the value will be thrown out of bounds. The document description of this parameter should
[ https://issues.apache.org/jira/browse/SPARK-37274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hao updated SPARK-37274: Summary: When the value of this parameter is greater than the maximum value of int type, the value will be thrown out of bounds. The document description of this parameter should remind the user of this risk point (was: These parameters should be of type long, not int) > When the value of this parameter is greater than the maximum value of int > type, the value will be thrown out of bounds. The document description of > this parameter should remind the user of this risk point > > > Key: SPARK-37274 > URL: https://issues.apache.org/jira/browse/SPARK-37274 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: hao >Priority: Major > > These parameters [spark.sql.orc.columnarReaderBatchSize], > [spark.sql.inMemoryColumnarStorage.batchSize], > [spark.sql.parquet.columnarReaderBatchSize] should be of type long, not of > type int. when the user sets the value to be greater than the maximum value > of type int, an error will be thrown -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37274) When the value of this parameter is greater than the maximum value of int type, the value will be thrown out of bounds. The document description of this parameter should
[ https://issues.apache.org/jira/browse/SPARK-37274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hao updated SPARK-37274: Description: When the value of this parameter is greater than the maximum value of int type, the value will be thrown out of bounds. The document description of this parameter should remind the user of this risk point (was: These parameters [spark.sql.orc.columnarReaderBatchSize], [spark.sql.inMemoryColumnarStorage.batchSize], [spark.sql.parquet.columnarReaderBatchSize] should be of type long, not of type int. when the user sets the value to be greater than the maximum value of type int, an error will be thrown) > When the value of this parameter is greater than the maximum value of int > type, the value will be thrown out of bounds. The document description of > this parameter should remind the user of this risk point > > > Key: SPARK-37274 > URL: https://issues.apache.org/jira/browse/SPARK-37274 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: hao >Priority: Major > > When the value of this parameter is greater than the maximum value of int > type, the value will be thrown out of bounds. The document description of > this parameter should remind the user of this risk point -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
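The overflow the reporter describes can be made concrete with a small range check. The config names come from the issue; the validation helper itself is hypothetical and not part of Spark:

```python
# Java's Integer.MAX_VALUE: these configs are parsed as 32-bit signed ints.
INT_MAX = 2**31 - 1

BATCH_SIZE_CONFS = [
    "spark.sql.orc.columnarReaderBatchSize",
    "spark.sql.inMemoryColumnarStorage.batchSize",
    "spark.sql.parquet.columnarReaderBatchSize",
]

def check_batch_size(conf_name, value):
    """Fail fast with a clear message instead of an opaque out-of-bounds
    error when the value does not fit in a 32-bit signed int."""
    if not (0 < value <= INT_MAX):
        raise ValueError(
            f"{conf_name}={value} is out of range: values above {INT_MAX} "
            f"cannot be represented as a 32-bit int."
        )
    return value
```

Documenting this limit next to each config, as the issue requests, spares users from discovering it via a runtime exception.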
[jira] [Assigned] (SPARK-37282) Add ExtendedLevelDBTest and disable LevelDB tests on Apple Silicon
[ https://issues.apache.org/jira/browse/SPARK-37282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-37282: - Assignee: Dongjoon Hyun > Add ExtendedLevelDBTest and disable LevelDB tests on Apple Silicon > -- > > Key: SPARK-37282 > URL: https://issues.apache.org/jira/browse/SPARK-37282 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > > Java 17 officially support Apple Silicon. > - JEP 391: macOS/AArch64 Port > - https://bugs.openjdk.java.net/browse/JDK-8251280 > Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 supports Apple Silicon > natively. > {code} > /Users/dongjoon/.jenv/versions/oracle17/bin/java: Mach-O 64-bit executable > arm64 > /Users/dongjoon/.jenv/versions/zulu17/bin/java: Mach-O 64-bit executable arm64 > /Users/dongjoon/.jenv/versions/temurin17/bin/java: Mach-O 64-bit executable > arm64 > {code} > Since LevelDBJNI still doesn't support Apple Silicon natively, the test cases > fail on M1. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37282) Add ExtendedLevelDBTest and disable LevelDB tests on Apple Silicon
[ https://issues.apache.org/jira/browse/SPARK-37282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-37282. --- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34548 [https://github.com/apache/spark/pull/34548] > Add ExtendedLevelDBTest and disable LevelDB tests on Apple Silicon > -- > > Key: SPARK-37282 > URL: https://issues.apache.org/jira/browse/SPARK-37282 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0 > > > Java 17 officially supports Apple Silicon. > - JEP 391: macOS/AArch64 Port > - https://bugs.openjdk.java.net/browse/JDK-8251280 > Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 support Apple Silicon > natively. > {code} > /Users/dongjoon/.jenv/versions/oracle17/bin/java: Mach-O 64-bit executable > arm64 > /Users/dongjoon/.jenv/versions/zulu17/bin/java: Mach-O 64-bit executable arm64 > /Users/dongjoon/.jenv/versions/temurin17/bin/java: Mach-O 64-bit executable > arm64 > {code} > Since LevelDBJNI still doesn't support Apple Silicon natively, the test cases > fail on M1. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
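The skip mechanism this sub-task describes amounts to a platform check. A sketch, assuming the new tag lives alongside Spark's other test tags; the helper function is illustrative, not Spark code:

```python
import platform

# Assumed fully-qualified name: SPARK-37282 adds ExtendedLevelDBTest,
# presumably next to Spark's existing test tags.
LEVELDB_EXCLUDE_TAG = "org.apache.spark.tags.ExtendedLevelDBTest"

def leveldb_test_excludes(system=None, machine=None):
    """Tags to exclude from a test run: LevelDBJNI ships no native
    Apple Silicon binary, so LevelDB-backed tests fail on macOS/arm64."""
    system = system if system is not None else platform.system()
    machine = machine if machine is not None else platform.machine()
    if system == "Darwin" and machine == "arm64":
        return [LEVELDB_EXCLUDE_TAG]
    return []
```

Tagging the tests rather than deleting them keeps them running on x86_64 CI while M1 developers get a clean local build.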
[jira] [Assigned] (SPARK-37284) Upgrade Jekyll to 4.2.1
[ https://issues.apache.org/jira/browse/SPARK-37284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37284: Assignee: Apache Spark (was: Kousuke Saruta) > Upgrade Jekyll to 4.2.1 > --- > > Key: SPARK-37284 > URL: https://issues.apache.org/jira/browse/SPARK-37284 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Minor > > Jekyll 4.2.1 was released in September, which includes the fix of a > regression bug. > https://github.com/jekyll/jekyll/releases/tag/v4.2.1 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37284) Upgrade Jekyll to 4.2.1
[ https://issues.apache.org/jira/browse/SPARK-37284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37284: Assignee: Kousuke Saruta (was: Apache Spark) > Upgrade Jekyll to 4.2.1 > --- > > Key: SPARK-37284 > URL: https://issues.apache.org/jira/browse/SPARK-37284 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > Jekyll 4.2.1 was released in September, which includes the fix of a > regression bug. > https://github.com/jekyll/jekyll/releases/tag/v4.2.1 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37284) Upgrade Jekyll to 4.2.1
[ https://issues.apache.org/jira/browse/SPARK-37284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17442114#comment-17442114 ] Apache Spark commented on SPARK-37284: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/34552 > Upgrade Jekyll to 4.2.1 > --- > > Key: SPARK-37284 > URL: https://issues.apache.org/jira/browse/SPARK-37284 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > Jekyll 4.2.1 was released in September, which includes the fix of a > regression bug. > https://github.com/jekyll/jekyll/releases/tag/v4.2.1 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37284) Upgrade Jekyll to 4.2.1
[ https://issues.apache.org/jira/browse/SPARK-37284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17442113#comment-17442113 ] Apache Spark commented on SPARK-37284: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/34552 > Upgrade Jekyll to 4.2.1 > --- > > Key: SPARK-37284 > URL: https://issues.apache.org/jira/browse/SPARK-37284 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > Jekyll 4.2.1 was released in September, which includes the fix of a > regression bug. > https://github.com/jekyll/jekyll/releases/tag/v4.2.1 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37263) Add PandasAPIOnSparkAdviceWarning class
[ https://issues.apache.org/jira/browse/SPARK-37263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37263: Assignee: Apache Spark > Add PandasAPIOnSparkAdviceWarning class > --- > > Key: SPARK-37263 > URL: https://issues.apache.org/jira/browse/SPARK-37263 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Assignee: Apache Spark >Priority: Major > > Raised from comment > [https://github.com/apache/spark/pull/34389#discussion_r741733023]. > The advice warning for pandas API on Spark for expensive APIs > ([https://github.com/apache/spark/pull/34389#discussion_r741733023)|https://github.com/apache/spark/pull/34389#discussion_r741733023).] > now issuing too much warning message, so it might be good to have > pandas-on-Spark specific warning class so that users can manually turn it off > by using warning.simplefilter. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
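The proposed class maps directly onto Python's standard warnings machinery. A minimal sketch with illustrative names (note the module is `warnings`, so the filter call is `warnings.simplefilter`):

```python
import warnings

class PandasAPIOnSparkAdviceWarning(Warning):
    """Dedicated category so advice warnings can be filtered on their own."""

def expensive_api():
    # Stand-in for a hypothetical expensive pandas-on-Spark operation
    # that emits an advice warning.
    warnings.warn(
        "This API may trigger a full shuffle; consider an alternative.",
        PandasAPIOnSparkAdviceWarning,
        stacklevel=2,
    )

# Users can silence just the advice, leaving other warnings intact:
warnings.simplefilter("ignore", PandasAPIOnSparkAdviceWarning)
```

Filtering by category leaves unrelated warnings (deprecations, user warnings) untouched, which a plain on/off option for the message text could not guarantee.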
[jira] [Created] (SPARK-37284) Upgrade Jekyll to 4.2.1
Kousuke Saruta created SPARK-37284: -- Summary: Upgrade Jekyll to 4.2.1 Key: SPARK-37284 URL: https://issues.apache.org/jira/browse/SPARK-37284 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.3.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta Jekyll 4.2.1 was released in September, which includes the fix of a regression bug. https://github.com/jekyll/jekyll/releases/tag/v4.2.1 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37263) Add PandasAPIOnSparkAdviceWarning class
[ https://issues.apache.org/jira/browse/SPARK-37263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17442112#comment-17442112 ] Apache Spark commented on SPARK-37263: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/34550 > Add PandasAPIOnSparkAdviceWarning class > --- > > Key: SPARK-37263 > URL: https://issues.apache.org/jira/browse/SPARK-37263 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > Raised from comment > [https://github.com/apache/spark/pull/34389#discussion_r741733023]. > The advice warning for pandas API on Spark for expensive APIs > ([https://github.com/apache/spark/pull/34389#discussion_r741733023)|https://github.com/apache/spark/pull/34389#discussion_r741733023).] > now issuing too much warning message, so it might be good to have > pandas-on-Spark specific warning class so that users can manually turn it off > by using warning.simplefilter. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37263) Add PandasAPIOnSparkAdviceWarning class
[ https://issues.apache.org/jira/browse/SPARK-37263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37263: Assignee: (was: Apache Spark) > Add PandasAPIOnSparkAdviceWarning class > --- > > Key: SPARK-37263 > URL: https://issues.apache.org/jira/browse/SPARK-37263 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > Raised from comment > [https://github.com/apache/spark/pull/34389#discussion_r741733023]. > The advice warning for pandas API on Spark for expensive APIs > ([https://github.com/apache/spark/pull/34389#discussion_r741733023)|https://github.com/apache/spark/pull/34389#discussion_r741733023).] > now issuing too much warning message, so it might be good to have > pandas-on-Spark specific warning class so that users can manually turn it off > by using warning.simplefilter. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37283) Don't try to store a V1 table which contains ANSI intervals in Hive compatible format
[ https://issues.apache.org/jira/browse/SPARK-37283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17442111#comment-17442111 ] Apache Spark commented on SPARK-37283: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/34551 > Don't try to store a V1 table which contains ANSI intervals in Hive > compatible format > - > > Key: SPARK-37283 > URL: https://issues.apache.org/jira/browse/SPARK-37283 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > If, a table being created contains a column of ANSI interval types and the > underlying file format has a corresponding Hive SerDe (e.g. Parquet), > `HiveExternalcatalog` tries to store the table in Hive compatible format. > But, as ANSI interval types in Spark and interval type in Hive are not > compatible (Hive only supports interval_year_month and interval_day_time), > the following warning with stack trace will be logged. > {code} > spark-sql> CREATE TABLE tbl1(a INTERVAL YEAR TO MONTH) USING Parquet; > 21/11/11 14:39:29 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, > since hive.security.authorization.manager is set to instance of > HiveAuthorizerFactory. > 21/11/11 14:39:29 WARN HiveExternalCatalog: Could not persist > `default`.`tbl1` in a Hive compatible way. Persisting it into Hive metastore > in Spark SQL specific format. > org.apache.hadoop.hive.ql.metadata.HiveException: > java.lang.IllegalArgumentException: Error: type expected at the position 0 of > 'interval year to month' but 'interval year to month' is found. 
> at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:869) > at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:874) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createTable$1(HiveClientImpl.scala:553) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:303) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:234) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:233) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:283) > at > org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:551) > at > org.apache.spark.sql.hive.HiveExternalCatalog.saveTableIntoHive(HiveExternalCatalog.scala:499) > at > org.apache.spark.sql.hive.HiveExternalCatalog.createDataSourceTable(HiveExternalCatalog.scala:397) > at > org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:274) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:102) > at > org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:245) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:376) > at > org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:120) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73) > at > 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:97) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97) > at >
[jira] [Assigned] (SPARK-37283) Don't try to store a V1 table which contains ANSI intervals in Hive compatible format
[ https://issues.apache.org/jira/browse/SPARK-37283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37283: Assignee: Apache Spark (was: Kousuke Saruta) > Don't try to store a V1 table which contains ANSI intervals in Hive > compatible format > - > > Key: SPARK-37283 > URL: https://issues.apache.org/jira/browse/SPARK-37283 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Major > > If, a table being created contains a column of ANSI interval types and the > underlying file format has a corresponding Hive SerDe (e.g. Parquet), > `HiveExternalcatalog` tries to store the table in Hive compatible format. > But, as ANSI interval types in Spark and interval type in Hive are not > compatible (Hive only supports interval_year_month and interval_day_time), > the following warning with stack trace will be logged. > {code} > spark-sql> CREATE TABLE tbl1(a INTERVAL YEAR TO MONTH) USING Parquet; > 21/11/11 14:39:29 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, > since hive.security.authorization.manager is set to instance of > HiveAuthorizerFactory. > 21/11/11 14:39:29 WARN HiveExternalCatalog: Could not persist > `default`.`tbl1` in a Hive compatible way. Persisting it into Hive metastore > in Spark SQL specific format. > org.apache.hadoop.hive.ql.metadata.HiveException: > java.lang.IllegalArgumentException: Error: type expected at the position 0 of > 'interval year to month' but 'interval year to month' is found. 
> at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:869) > at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:874) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createTable$1(HiveClientImpl.scala:553) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:303) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:234) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:233) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:283) > at > org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:551) > at > org.apache.spark.sql.hive.HiveExternalCatalog.saveTableIntoHive(HiveExternalCatalog.scala:499) > at > org.apache.spark.sql.hive.HiveExternalCatalog.createDataSourceTable(HiveExternalCatalog.scala:397) > at > org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:274) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:102) > at > org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:245) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:376) > at > org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:120) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73) > at > 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:97) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:93) > at >
[jira] [Assigned] (SPARK-37283) Don't try to store a V1 table which contains ANSI intervals in Hive compatible format
[ https://issues.apache.org/jira/browse/SPARK-37283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37283: Assignee: Kousuke Saruta (was: Apache Spark) > Don't try to store a V1 table which contains ANSI intervals in Hive > compatible format > - > > Key: SPARK-37283 > URL: https://issues.apache.org/jira/browse/SPARK-37283 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > If, a table being created contains a column of ANSI interval types and the > underlying file format has a corresponding Hive SerDe (e.g. Parquet), > `HiveExternalcatalog` tries to store the table in Hive compatible format. > But, as ANSI interval types in Spark and interval type in Hive are not > compatible (Hive only supports interval_year_month and interval_day_time), > the following warning with stack trace will be logged. > {code} > spark-sql> CREATE TABLE tbl1(a INTERVAL YEAR TO MONTH) USING Parquet; > 21/11/11 14:39:29 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, > since hive.security.authorization.manager is set to instance of > HiveAuthorizerFactory. > 21/11/11 14:39:29 WARN HiveExternalCatalog: Could not persist > `default`.`tbl1` in a Hive compatible way. Persisting it into Hive metastore > in Spark SQL specific format. > org.apache.hadoop.hive.ql.metadata.HiveException: > java.lang.IllegalArgumentException: Error: type expected at the position 0 of > 'interval year to month' but 'interval year to month' is found. 
> at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:869) > at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:874) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createTable$1(HiveClientImpl.scala:553) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:303) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:234) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:233) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:283) > at > org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:551) > at > org.apache.spark.sql.hive.HiveExternalCatalog.saveTableIntoHive(HiveExternalCatalog.scala:499) > at > org.apache.spark.sql.hive.HiveExternalCatalog.createDataSourceTable(HiveExternalCatalog.scala:397) > at > org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:274) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:102) > at > org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:245) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:376) > at > org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:120) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73) > at > 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:97) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:93) > at >
[jira] [Updated] (SPARK-37263) Add PandasAPIOnSparkAdviceWarning class
[ https://issues.apache.org/jira/browse/SPARK-37263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-37263: Description: Raised from comment [https://github.com/apache/spark/pull/34389#discussion_r741733023]. The advice warning for pandas API on Spark for expensive APIs ([https://github.com/apache/spark/pull/34389#discussion_r741733023]) is now issuing too many warning messages, so it might be good to have a pandas-on-Spark-specific warning class so that users can manually turn it off by using warnings.simplefilter. was: Raised from comment [https://github.com/apache/spark/pull/34389#discussion_r741733023]. The advice warning for pandas API on Spark for expensive APIs ([https://github.com/apache/spark/pull/34389#discussion_r741733023]) is now issuing too many warning messages, so it might be good to have an option to turn this message on/off. > Add PandasAPIOnSparkAdviceWarning class > --- > > Key: SPARK-37263 > URL: https://issues.apache.org/jira/browse/SPARK-37263 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > Raised from comment > [https://github.com/apache/spark/pull/34389#discussion_r741733023]. > The advice warning for pandas API on Spark for expensive APIs > ([https://github.com/apache/spark/pull/34389#discussion_r741733023]) > is now issuing too many warning messages, so it might be good to have a > pandas-on-Spark-specific warning class so that users can manually turn it off > by using warnings.simplefilter. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37263) Add PandasAPIOnSparkAdviceWarning class
[ https://issues.apache.org/jira/browse/SPARK-37263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-37263: Summary: Add PandasAPIOnSparkAdviceWarning class (was: Add an option to silence advice for pandas API on Spark.) > Add PandasAPIOnSparkAdviceWarning class > --- > > Key: SPARK-37263 > URL: https://issues.apache.org/jira/browse/SPARK-37263 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > Raised from comment > [https://github.com/apache/spark/pull/34389#discussion_r741733023]. > The advice warning for pandas API on Spark for expensive APIs > ([https://github.com/apache/spark/pull/34389#discussion_r741733023)|https://github.com/apache/spark/pull/34389#discussion_r741733023).] > now issuing too much warning message, so it might be good to have option to > turn this message on/off. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37283) Don't try to store a V1 table which contains ANSI intervals in Hive compatible format
[ https://issues.apache.org/jira/browse/SPARK-37283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-37283: --- Description: If a table being created contains a column of ANSI interval types and the underlying file format has a corresponding Hive SerDe (e.g. Parquet), `HiveExternalCatalog` tries to store the table in Hive compatible format. But since ANSI interval types in Spark and Hive's interval types are not compatible (Hive only supports interval_year_month and interval_day_time), the following warning with a stack trace will be logged. {code} spark-sql> CREATE TABLE tbl1(a INTERVAL YEAR TO MONTH) USING Parquet; 21/11/11 14:39:29 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory. 21/11/11 14:39:29 WARN HiveExternalCatalog: Could not persist `default`.`tbl1` in a Hive compatible way. Persisting it into Hive metastore in Spark SQL specific format. org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: Error: type expected at the position 0 of 'interval year to month' but 'interval year to month' is found. 
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:869) at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:874) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createTable$1(HiveClientImpl.scala:553) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:303) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:234) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:233) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:283) at org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:551) at org.apache.spark.sql.hive.HiveExternalCatalog.saveTableIntoHive(HiveExternalCatalog.scala:499) at org.apache.spark.sql.hive.HiveExternalCatalog.createDataSourceTable(HiveExternalCatalog.scala:397) at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:274) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:102) at org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:245) at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:376) at org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:120) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84) at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:97) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:93) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267) at
[jira] [Created] (SPARK-37283) Don't try to store a V1 table which contains ANSI intervals in Hive compatible format
Kousuke Saruta created SPARK-37283: -- Summary: Don't try to store a V1 table which contains ANSI intervals in Hive compatible format Key: SPARK-37283 URL: https://issues.apache.org/jira/browse/SPARK-37283 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta If a table being created contains a column of ANSI interval types and the underlying file format has a corresponding Hive SerDe (e.g. Parquet), `HiveExternalCatalog` tries to store the table in Hive compatible format. But since ANSI interval types in Spark and Hive's interval types are not compatible (Hive only supports interval_year_month and interval_day_time), the following warning with a stack trace will be logged. {code} spark-sql> CREATE TABLE tbl1(a INTERVAL YEAR TO MONTH) USING Parquet; 21/11/11 14:39:29 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory. 21/11/11 14:39:29 WARN HiveExternalCatalog: Could not persist `default`.`tbl1` in a Hive compatible way. Persisting it into Hive metastore in Spark SQL specific format. org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: Error: type expected at the position 0 of 'interval year to month' but 'interval year to month' is found. 
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:869) at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:874) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createTable$1(HiveClientImpl.scala:553) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:303) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:234) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:233) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:283) at org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:551) at org.apache.spark.sql.hive.HiveExternalCatalog.saveTableIntoHive(HiveExternalCatalog.scala:499) at org.apache.spark.sql.hive.HiveExternalCatalog.createDataSourceTable(HiveExternalCatalog.scala:397) at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:274) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:102) at org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:245) at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:376) at org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:120) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84) at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:97) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:93) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481) at
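A minimal sketch of the proposed behavior: before choosing the Hive-compatible storage path, check whether every column type has a Hive SerDe equivalent and silently fall back to the Spark-SQL-specific metastore format otherwise, avoiding the noisy warning above. The function and type names here are illustrative assumptions, not the actual `HiveExternalCatalog` code (which is Scala):

```python
# ANSI interval types in Spark have no Hive SerDe equivalent: Hive only
# knows interval_year_month / interval_day_time, with incompatible syntax.
HIVE_INCOMPATIBLE_TYPES = {"interval year to month", "interval day to second"}

def can_persist_hive_compatible(column_types):
    """Return True only if every column type can be written as a Hive SerDe type."""
    return all(t.lower() not in HIVE_INCOMPATIBLE_TYPES for t in column_types)

def choose_storage_format(column_types):
    """Pick the metastore format up front instead of catching a Hive exception."""
    if can_persist_hive_compatible(column_types):
        return "hive-compatible"
    return "spark-sql-specific"
```

The point of the change is exactly this ordering: deciding before calling into Hive means no exception is raised and no stack trace is logged.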
[jira] [Assigned] (SPARK-37274) These parameters should be of type long, not int
[ https://issues.apache.org/jira/browse/SPARK-37274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37274: Assignee: Apache Spark > These parameters should be of type long, not int > > > Key: SPARK-37274 > URL: https://issues.apache.org/jira/browse/SPARK-37274 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: hao >Assignee: Apache Spark >Priority: Major > > These parameters [spark.sql.orc.columnarReaderBatchSize], > [spark.sql.inMemoryColumnarStorage.batchSize], and > [spark.sql.parquet.columnarReaderBatchSize] should be of type long, not of > type int. When the user sets a value greater than the maximum value of type > int, an error is thrown. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37274) These parameters should be of type long, not int
[ https://issues.apache.org/jira/browse/SPARK-37274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17442102#comment-17442102 ] Apache Spark commented on SPARK-37274: -- User 'dh20' has created a pull request for this issue: https://github.com/apache/spark/pull/34549 > These parameters should be of type long, not int > > > Key: SPARK-37274 > URL: https://issues.apache.org/jira/browse/SPARK-37274 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: hao >Priority: Major > > These parameters [spark.sql.orc.columnarReaderBatchSize], > [spark.sql.inMemoryColumnarStorage.batchSize], and > [spark.sql.parquet.columnarReaderBatchSize] should be of type long, not of > type int. When the user sets a value greater than the maximum value of type > int, an error is thrown. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37274) These parameters should be of type long, not int
[ https://issues.apache.org/jira/browse/SPARK-37274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37274: Assignee: (was: Apache Spark) > These parameters should be of type long, not int > > > Key: SPARK-37274 > URL: https://issues.apache.org/jira/browse/SPARK-37274 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: hao >Priority: Major > > These parameters [spark.sql.orc.columnarReaderBatchSize], > [spark.sql.inMemoryColumnarStorage.batchSize], and > [spark.sql.parquet.columnarReaderBatchSize] should be of type long, not of > type int. When the user sets a value greater than the maximum value of type > int, an error is thrown. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
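The overflow can be illustrated with a small validation sketch: the JVM stores these batch-size configs as 32-bit ints, so any value above Int.MaxValue (2^31 - 1) cannot be represented. The helper below is a hypothetical guard with a clear error message, not Spark's actual config parser:

```python
INT_MAX = 2**31 - 1  # JVM Int.MaxValue

def parse_batch_size(value: str) -> int:
    """Parse a columnar batch-size config, failing clearly on int overflow."""
    n = int(value)
    if not (0 < n <= INT_MAX):
        raise ValueError(
            f"batch size must be in (0, {INT_MAX}]; got {n}. "
            "Values above Int.MaxValue overflow the JVM's int-typed config."
        )
    return n
```

Whether the right fix is widening the type to long or documenting the int bound (as the later retitle of this ticket suggests), an explicit range check like this turns an obscure out-of-bounds error into an actionable message.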
[jira] [Updated] (SPARK-37263) Add an option to silence advice for pandas API on Spark.
[ https://issues.apache.org/jira/browse/SPARK-37263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-37263: Description: Raised from comment [https://github.com/apache/spark/pull/34389#discussion_r741733023]. The advice warning for pandas API on Spark for expensive APIs is now issuing too many warning messages, so it might be good to have an option to turn this message on/off. was: Raised from comment https://github.com/apache/spark/pull/34389#discussion_r741733023. The advice warning for pandas API on Spark for expensive APIs is now issuing too many warning messages, since it also issues the warning when the APIs are used internally. > Add an option to silence advice for pandas API on Spark. > > > Key: SPARK-37263 > URL: https://issues.apache.org/jira/browse/SPARK-37263 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > Raised from comment > [https://github.com/apache/spark/pull/34389#discussion_r741733023]. > The advice warning for pandas API on Spark for expensive APIs is now issuing > too many warning messages, so it might be good to have an option to turn > this message on/off. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37263) Add an option to silence advice for pandas API on Spark.
[ https://issues.apache.org/jira/browse/SPARK-37263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-37263: Summary: Add an option to silence advice for pandas API on Spark. (was: Create an option to silence advice for pandas API on Spark.) > Add an option to silence advice for pandas API on Spark. > > > Key: SPARK-37263 > URL: https://issues.apache.org/jira/browse/SPARK-37263 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > Raised from comment > https://github.com/apache/spark/pull/34389#discussion_r741733023. > The advice warning for pandas API on Spark for expensive APIs is now issuing > too many warning messages, since it also issues the warning when > the APIs are used internally. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37263) Create an option to silence advice for pandas API on Spark.
[ https://issues.apache.org/jira/browse/SPARK-37263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-37263: Summary: Create an option to silence advice for pandas API on Spark. (was: Reduce pandas-on-Spark warning for internal usage.) > Create an option to silence advice for pandas API on Spark. > --- > > Key: SPARK-37263 > URL: https://issues.apache.org/jira/browse/SPARK-37263 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > Raised from comment > https://github.com/apache/spark/pull/34389#discussion_r741733023. > The advice warning for pandas API on Spark for expensive APIs is now issuing > too many warning messages, since it also issues the warning when > the APIs are used internally. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37276) Support YearMonthIntervalType in Arrow
[ https://issues.apache.org/jira/browse/SPARK-37276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37276: - Description: Implements the support of YearMonthIntervalType in Arrow code path: - pandas UDFs - pandas functions APIs - createDataFrame/toPandas w/ Arrow was: Implements the support of YearMonthIntervalType in Arrow code path: - pandas UDFs - pandas functions APIs > Support YearMonthIntervalType in Arrow > -- > > Key: SPARK-37276 > URL: https://issues.apache.org/jira/browse/SPARK-37276 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > Implements the support of YearMonthIntervalType in Arrow code path: > - pandas UDFs > - pandas functions APIs > - createDataFrame/toPandas w/ Arrow -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37278) Support YearMonthIntervalType in createDataFrame/toPandas and Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-37278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37278: - Description: Implements the support of YearMonthIntervalType in: - Python UDFs - createDataFrame/toPandas without Arrow was: Implements the support of YearMonthIntervalType in: - Python UDFs - createDataFrame/toPandas > Support YearMonthIntervalType in createDataFrame/toPandas and Python UDFs > - > > Key: SPARK-37278 > URL: https://issues.apache.org/jira/browse/SPARK-37278 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > Implements the support of YearMonthIntervalType in: > - Python UDFs > - createDataFrame/toPandas without Arrow -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37278) Support YearMonthIntervalType in createDataFrame/toPandas and Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-37278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37278: - Description: Implements the support of YearMonthIntervalType in: - Python UDFs - createDataFrame/toPandas was: Implements the support of YearMonthIntervalType in: - Python UDFs - createDataFrame/toPandas when Arrow is disabled > Support YearMonthIntervalType in createDataFrame/toPandas and Python UDFs > - > > Key: SPARK-37278 > URL: https://issues.apache.org/jira/browse/SPARK-37278 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > Implements the support of YearMonthIntervalType in: > - Python UDFs > - createDataFrame/toPandas -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37276) Support YearMonthIntervalType in Arrow
[ https://issues.apache.org/jira/browse/SPARK-37276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37276: - Description: Implements the support of YearMonthIntervalType in Arrow code path: - pandas UDFs - pandas functions APIs was: Implements the support of YearMonthIntervalType in Arrow code path: - pandas UDFs - pandas functions APIs - createDataFrame/toPandas when Arrow is enabled > Support YearMonthIntervalType in Arrow > -- > > Key: SPARK-37276 > URL: https://issues.apache.org/jira/browse/SPARK-37276 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > Implements the support of YearMonthIntervalType in Arrow code path: > - pandas UDFs > - pandas functions APIs -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37282) Add ExtendedLevelDBTest and disable LevelDB tests on Apple Silicon
[ https://issues.apache.org/jira/browse/SPARK-37282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37282: Assignee: (was: Apache Spark) > Add ExtendedLevelDBTest and disable LevelDB tests on Apple Silicon > -- > > Key: SPARK-37282 > URL: https://issues.apache.org/jira/browse/SPARK-37282 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > > Java 17 officially supports Apple Silicon. > - JEP 391: macOS/AArch64 Port > - https://bugs.openjdk.java.net/browse/JDK-8251280 > Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 support Apple Silicon > natively. > {code} > /Users/dongjoon/.jenv/versions/oracle17/bin/java: Mach-O 64-bit executable > arm64 > /Users/dongjoon/.jenv/versions/zulu17/bin/java: Mach-O 64-bit executable arm64 > /Users/dongjoon/.jenv/versions/temurin17/bin/java: Mach-O 64-bit executable > arm64 > {code} > Since LevelDBJNI still doesn't support Apple Silicon natively, the test cases > fail on M1. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37282) Add ExtendedLevelDBTest and disable LevelDB tests on Apple Silicon
[ https://issues.apache.org/jira/browse/SPARK-37282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37282: Assignee: Apache Spark > Add ExtendedLevelDBTest and disable LevelDB tests on Apple Silicon > -- > > Key: SPARK-37282 > URL: https://issues.apache.org/jira/browse/SPARK-37282 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > > Java 17 officially supports Apple Silicon. > - JEP 391: macOS/AArch64 Port > - https://bugs.openjdk.java.net/browse/JDK-8251280 > Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 support Apple Silicon > natively. > {code} > /Users/dongjoon/.jenv/versions/oracle17/bin/java: Mach-O 64-bit executable > arm64 > /Users/dongjoon/.jenv/versions/zulu17/bin/java: Mach-O 64-bit executable arm64 > /Users/dongjoon/.jenv/versions/temurin17/bin/java: Mach-O 64-bit executable > arm64 > {code} > Since LevelDBJNI still doesn't support Apple Silicon natively, the test cases > fail on M1. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37282) Add ExtendedLevelDBTest and disable LevelDB tests on Apple Silicon
[ https://issues.apache.org/jira/browse/SPARK-37282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17442084#comment-17442084 ] Apache Spark commented on SPARK-37282: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/34548 > Add ExtendedLevelDBTest and disable LevelDB tests on Apple Silicon > -- > > Key: SPARK-37282 > URL: https://issues.apache.org/jira/browse/SPARK-37282 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > > Java 17 officially supports Apple Silicon. > - JEP 391: macOS/AArch64 Port > - https://bugs.openjdk.java.net/browse/JDK-8251280 > Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 support Apple Silicon > natively. > {code} > /Users/dongjoon/.jenv/versions/oracle17/bin/java: Mach-O 64-bit executable > arm64 > /Users/dongjoon/.jenv/versions/zulu17/bin/java: Mach-O 64-bit executable arm64 > /Users/dongjoon/.jenv/versions/temurin17/bin/java: Mach-O 64-bit executable > arm64 > {code} > Since LevelDBJNI still doesn't support Apple Silicon natively, the test cases > fail on M1. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37282) Add ExtendedLevelDBTest and disable LevelDB tests on Apple Silicon
Dongjoon Hyun created SPARK-37282: - Summary: Add ExtendedLevelDBTest and disable LevelDB tests on Apple Silicon Key: SPARK-37282 URL: https://issues.apache.org/jira/browse/SPARK-37282 Project: Spark Issue Type: Sub-task Components: Spark Core, Tests Affects Versions: 3.3.0 Reporter: Dongjoon Hyun Java 17 officially supports Apple Silicon. - JEP 391: macOS/AArch64 Port - https://bugs.openjdk.java.net/browse/JDK-8251280 Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 support Apple Silicon natively. {code} /Users/dongjoon/.jenv/versions/oracle17/bin/java: Mach-O 64-bit executable arm64 /Users/dongjoon/.jenv/versions/zulu17/bin/java: Mach-O 64-bit executable arm64 /Users/dongjoon/.jenv/versions/temurin17/bin/java: Mach-O 64-bit executable arm64 {code} Since LevelDBJNI still doesn't support Apple Silicon natively, the test cases fail on M1. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
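The gating pattern here is to detect the platform at runtime and skip tests whose native dependency (leveldbjni) has no arm64 macOS build. Spark's real mechanism is a ScalaTest tag (`ExtendedLevelDBTest`); the Python helper below is only an analogue of the platform check:

```python
import platform

def is_apple_silicon() -> bool:
    """True when running natively on macOS with an arm64 (M1-family) CPU."""
    return platform.system() == "Darwin" and platform.machine() == "arm64"

# With pytest, the equivalent gating could look like (illustrative):
# @pytest.mark.skipif(is_apple_silicon(),
#                     reason="leveldbjni has no native arm64 macOS build")
# def test_leveldb_backed_store(): ...
```

Note that a JVM running under Rosetta reports x86_64, so such a check only fires for truly native arm64 processes, which is exactly the case where the JNI library fails to load.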
[jira] [Assigned] (SPARK-36073) EquivalentExpressions fixes and improvements
[ https://issues.apache.org/jira/browse/SPARK-36073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-36073: --- Assignee: Peter Toth > EquivalentExpressions fixes and improvements > > > Key: SPARK-36073 > URL: https://issues.apache.org/jira/browse/SPARK-36073 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Peter Toth >Assignee: Peter Toth >Priority: Major > > Currently `EquivalentExpressions` has 2 issues: > - identifying common expressions in conditional expressions is not correct in > all cases > - transparently canonicalized expressions (like `PromotePrecision`) are > considered common subexpressions -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36073) EquivalentExpressions fixes and improvements
[ https://issues.apache.org/jira/browse/SPARK-36073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-36073. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 33281 [https://github.com/apache/spark/pull/33281] > EquivalentExpressions fixes and improvements > > > Key: SPARK-36073 > URL: https://issues.apache.org/jira/browse/SPARK-36073 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Peter Toth >Assignee: Peter Toth >Priority: Major > Fix For: 3.3.0 > > > Currently `EquivalentExpressions` has 2 issues: > - identifying common expressions in conditional expressions is not correct in > all cases > - transparently canonicalized expressions (like `PromotePrecision`) are > considered common subexpressions -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
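The first issue above — identifying common expressions inside conditional expressions — comes down to evaluation guarantees: a subexpression that appears in only one branch of an IF/CASE may never be evaluated at runtime, so it is only safe to extract it when it occurs in the always-evaluated condition or in every branch. A toy sketch of that rule (illustrative only, not Spark's actual `EquivalentExpressions` code, which is Scala):

```python
def hoistable_subexpressions(condition_exprs, branch_exprs_list):
    """Subexpressions safe to hoist out of a conditional:
    those in the always-evaluated condition, plus those present in
    *every* branch (evaluated regardless of which branch is taken).
    Expressions are represented as opaque strings for illustration."""
    safe = set(condition_exprs)
    branches = [set(b) for b in branch_exprs_list]
    if branches:
        # Present in all branches => guaranteed to be evaluated.
        safe |= set.intersection(*branches)
    return safe
```

An expression appearing only in one branch (say, an expensive UDF call in the ELSE arm) must not be precomputed, both for performance and because eager evaluation could raise errors the conditional was guarding against.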
[jira] [Resolved] (SPARK-36182) Support TimestampNTZ type in Parquet file source
[ https://issues.apache.org/jira/browse/SPARK-36182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-36182. - Resolution: Fixed Issue resolved by pull request 34495 [https://github.com/apache/spark/pull/34495] > Support TimestampNTZ type in Parquet file source > > > Key: SPARK-36182 > URL: https://issues.apache.org/jira/browse/SPARK-36182 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.3.0 > > > As per > https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp, > Parquet supports both TIMESTAMP_NTZ and TIMESTAMP_LTZ (Spark's current > default timestamp type): > * A TIMESTAMP with isAdjustedToUTC=true => TIMESTAMP_LTZ > * A TIMESTAMP with isAdjustedToUTC=false => TIMESTAMP_NTZ > In Spark 3.1 or prior, the Parquet writer follows the definition and sets > the field `isAdjustedToUTC` as `true`, while the Parquet reader doesn't > respect the `isAdjustedToUTC` flag and converts any Parquet timestamp type to > TIMESTAMP_LTZ. > Since 3.2, with the support of the timestamp without time zone type: > * The Parquet writer follows the definition and sets the field > `isAdjustedToUTC` as `false` on writing TIMESTAMP_NTZ. > * The Parquet reader: > ** For schema inference, Spark converts the Parquet timestamp type to the > corresponding catalyst timestamp type according to the timestamp annotation > flag `isAdjustedToUTC`. > ** If schema merging is enabled in schema inference and some of the files are > inferred as TIMESTAMP_NTZ while the others are TIMESTAMP_LTZ, the result type > is TIMESTAMP_LTZ, which is considered the "wider" type. > ** If a column of a user-provided schema is TIMESTAMP_LTZ and the column was > written as TIMESTAMP_NTZ type, Spark allows the read operation. 
> ** If a column of a user-provided schema is TIMESTAMP_NTZ and the column was > written as TIMESTAMP_LTZ type, the read operation is not allowed, since > TIMESTAMP_NTZ is considered narrower than TIMESTAMP_LTZ. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
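The merge and read-compatibility rules described above reduce to two small functions. The type names follow the ticket; the functions themselves are an illustrative summary, not Spark's implementation:

```python
LTZ, NTZ = "TIMESTAMP_LTZ", "TIMESTAMP_NTZ"

def merge_timestamp_types(a: str, b: str) -> str:
    """Schema merging: if any file is TIMESTAMP_LTZ, the merged result is
    TIMESTAMP_LTZ, the 'wider' of the two types."""
    return LTZ if LTZ in (a, b) else NTZ

def can_read(user_type: str, file_type: str) -> bool:
    """Read compatibility: a user-provided LTZ column may read NTZ-written
    data, but an NTZ column may not read LTZ-written data (narrower)."""
    return not (user_type == NTZ and file_type == LTZ)
```

Both rules encode the same asymmetry: TIMESTAMP_LTZ carries more information (a fixed instant in UTC), so widening toward it is safe while narrowing away from it is not.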
[jira] [Updated] (SPARK-36799) Pass queryExecution name in CLI
[ https://issues.apache.org/jira/browse/SPARK-36799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-36799: --- Summary: Pass queryExecution name in CLI (was: Pass queryExecution name in CLI when only select query) > Pass queryExecution name in CLI > --- > > Key: SPARK-36799 > URL: https://issues.apache.org/jira/browse/SPARK-36799 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > Fix For: 3.3.0 > > > Currently, in the spark-sql CLI, QueryExecutionListener can receive commands > but not SELECT queries, because the queryExecution name is not passed. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36799) Pass queryExecution name in CLI when only select query
[ https://issues.apache.org/jira/browse/SPARK-36799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-36799: --- Assignee: dzcxzl > Pass queryExecution name in CLI when only select query > -- > > Key: SPARK-36799 > URL: https://issues.apache.org/jira/browse/SPARK-36799 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > > Currently, in the spark-sql CLI, QueryExecutionListener can receive commands > but not SELECT queries, because the queryExecution name is not passed. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36799) Pass queryExecution name in CLI when only select query
[ https://issues.apache.org/jira/browse/SPARK-36799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-36799. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34041 [https://github.com/apache/spark/pull/34041] > Pass queryExecution name in CLI when only select query > -- > > Key: SPARK-36799 > URL: https://issues.apache.org/jira/browse/SPARK-36799 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > Fix For: 3.3.0 > > > Currently, in the spark-sql CLI, QueryExecutionListener can receive commands but > not select queries, because the queryExecution name is not passed. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37270) Incorrect result of filter using isNull condition
[ https://issues.apache.org/jira/browse/SPARK-37270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17442074#comment-17442074 ] Hyukjin Kwon commented on SPARK-37270: -- Hm, I can't reproduce this locally. Are you able to reproduce this when running locally too? e.g.)
{code}
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
val frame = Seq((false, 1)).toDF("bool", "number")
frame
  .checkpoint()
  .withColumn("conditions", when(col("bool"), "I am not null"))
  .filter(col("conditions").isNull)
  .show(false)
{code}
> Incorrect result of filter using isNull condition
>
>                 Key: SPARK-37270
>                 URL: https://issues.apache.org/jira/browse/SPARK-37270
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 3.2.0
>            Reporter: Tomasz Kus
>            Priority: Major
>              Labels: correctness
>
> Simple code that allows to reproduce this issue:
> {code:java}
> val frame = Seq((false, 1)).toDF("bool", "number")
> frame
>   .checkpoint()
>   .withColumn("conditions", when(col("bool"), "I am not null"))
>   .filter(col("conditions").isNull)
>   .show(false){code}
> Although the "conditions" column is null:
> {code:java}
> +-----+------+----------+
> |bool |number|conditions|
> +-----+------+----------+
> |false|1     |null      |
> +-----+------+----------+{code}
> an empty result is shown.
> Execution plans:
> {code:java}
> == Parsed Logical Plan ==
> 'Filter isnull('conditions)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#252]
>    +- LogicalRDD [bool#124, number#125], false
>
> == Analyzed Logical Plan ==
> bool: boolean, number: int, conditions: string
> Filter isnull(conditions#252)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#252]
>    +- LogicalRDD [bool#124, number#125], false
>
> == Optimized Logical Plan ==
> LocalRelation , [bool#124, number#125, conditions#252]
>
> == Physical Plan ==
> LocalTableScan , [bool#124, number#125, conditions#252]
> {code}
> After removing the checkpoint, the proper result is returned and the execution plans are as follows:
> {code:java}
> == Parsed Logical Plan ==
> 'Filter isnull('conditions)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#256]
>    +- Project [_1#119 AS bool#124, _2#120 AS number#125]
>       +- LocalRelation [_1#119, _2#120]
>
> == Analyzed Logical Plan ==
> bool: boolean, number: int, conditions: string
> Filter isnull(conditions#256)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#256]
>    +- Project [_1#119 AS bool#124, _2#120 AS number#125]
>       +- LocalRelation [_1#119, _2#120]
>
> == Optimized Logical Plan ==
> LocalRelation [bool#124, number#125, conditions#256]
>
> == Physical Plan ==
> LocalTableScan [bool#124, number#125, conditions#256]
> {code}
> It seems that the most important difference is LogicalRDD -> LocalRelation.
> There are the following workarounds to retrieve the correct result:
> 1) remove the checkpoint
> 2) add an explicit .otherwise(null) to when
> 3) add checkpoint() or cache() just before the filter
> 4) downgrade to Spark 3.1.2
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
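The workaround of adding an explicit .otherwise(null) points at the semantics involved: a when with no otherwise branch produces NULL for non-matching rows, so the bool=false row should pass the isNull filter. A minimal plain-Python sketch of those CASE WHEN semantics (a simulation only, not Spark itself) shows the expected result:

```python
# Plain-Python simulation of the expected CASE WHEN semantics (no Spark
# needed): when(col("bool"), "I am not null") with no otherwise() yields
# NULL (None) for rows where the condition is false.
rows = [{"bool": False, "number": 1}]

def case_when(row):
    # Mirrors CASE WHEN bool THEN 'I am not null' END: a non-matching
    # row produces NULL, not a default value.
    return "I am not null" if row["bool"] else None

with_conditions = [{**r, "conditions": case_when(r)} for r in rows]
expected = [r for r in with_conditions if r["conditions"] is None]
print(expected)  # the single row should survive the isNull filter
```

Under these semantics the row must be kept, which is why the empty LocalTableScan in the checkpointed plan is a correctness bug rather than a valid simplification.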
[jira] [Commented] (SPARK-37278) Support YearMonthIntervalType in createDataFrame/toPandas and Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-37278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17442068#comment-17442068 ] Hyukjin Kwon commented on SPARK-37278: -- I am working on this. > Support YearMonthIntervalType in createDataFrame/toPandas and Python UDFs > - > > Key: SPARK-37278 > URL: https://issues.apache.org/jira/browse/SPARK-37278 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > Implements the support of YearMonthIntervalType in: > - Python UDFs > - createDataFrame/toPandas when Arrow is disabled -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37275) Support ANSI intervals in PySpark
[ https://issues.apache.org/jira/browse/SPARK-37275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17442066#comment-17442066 ] Hyukjin Kwon commented on SPARK-37275: -- cc [~maxgekk] FYI > Support ANSI intervals in PySpark > - > > Key: SPARK-37275 > URL: https://issues.apache.org/jira/browse/SPARK-37275 > Project: Spark > Issue Type: Umbrella > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > This JIRA targets to implement ANSI interval types in PySpark: > - > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DayTimeIntervalType.scala > - > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/YearMonthIntervalType.scala -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37281) Support DayTimeIntervalType in Py4J
[ https://issues.apache.org/jira/browse/SPARK-37281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37281: - Description: This PR adds support for DayTimeIntervalType in Py4J. For example, functions.lit(DayTimeIntervalType) should work. > Support DayTimeIntervalType in Py4J > --- > > Key: SPARK-37281 > URL: https://issues.apache.org/jira/browse/SPARK-37281 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > This PR adds support for DayTimeIntervalType in Py4J. For example, > functions.lit(DayTimeIntervalType) should work. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37280) Support YearMonthIntervalType in Py4J
Hyukjin Kwon created SPARK-37280: Summary: Support YearMonthIntervalType in Py4J Key: SPARK-37280 URL: https://issues.apache.org/jira/browse/SPARK-37280 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Affects Versions: 3.3.0 Reporter: Hyukjin Kwon This PR adds the support of YearMonthIntervalType in Py4J. For example, functions.lit(YearMonthIntervalType) should work. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37281) Support DayTimeIntervalType in Py4J
Hyukjin Kwon created SPARK-37281: Summary: Support DayTimeIntervalType in Py4J Key: SPARK-37281 URL: https://issues.apache.org/jira/browse/SPARK-37281 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Affects Versions: 3.3.0 Reporter: Hyukjin Kwon -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37279) Support DayTimeIntervalType in createDataFrame/toPandas and Python UDFs
Hyukjin Kwon created SPARK-37279: Summary: Support DayTimeIntervalType in createDataFrame/toPandas and Python UDFs Key: SPARK-37279 URL: https://issues.apache.org/jira/browse/SPARK-37279 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Affects Versions: 3.3.0 Reporter: Hyukjin Kwon Implements the support of DayTimeIntervalType in: - Python UDFs - createDataFrame/toPandas when Arrow is disabled -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37278) Support YearMonthIntervalType in createDataFrame/toPandas and Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-37278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37278: - Description: Implements the support of YearMonthIntervalType in: - Python UDFs - createDataFrame/toPandas when Arrow is disabled was: Implements the support of YearMonthIntervalType in Arrow code path: - Python UDFs - createDataFrame/toPandas when Arrow is disabled > Support YearMonthIntervalType in createDataFrame/toPandas and Python UDFs > - > > Key: SPARK-37278 > URL: https://issues.apache.org/jira/browse/SPARK-37278 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > Implements the support of YearMonthIntervalType in: > - Python UDFs > - createDataFrame/toPandas when Arrow is disabled -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37278) Support YearMonthIntervalType in createDataFrame/toPandas and Python UDFs
Hyukjin Kwon created SPARK-37278: Summary: Support YearMonthIntervalType in createDataFrame/toPandas and Python UDFs Key: SPARK-37278 URL: https://issues.apache.org/jira/browse/SPARK-37278 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Affects Versions: 3.3.0 Reporter: Hyukjin Kwon Implements the support of YearMonthIntervalType in Arrow code path: - Python UDFs - createDataFrame/toPandas when Arrow is disabled -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37277) Support DayTimeIntervalType in Arrow
Hyukjin Kwon created SPARK-37277: Summary: Support DayTimeIntervalType in Arrow Key: SPARK-37277 URL: https://issues.apache.org/jira/browse/SPARK-37277 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Affects Versions: 3.3.0 Reporter: Hyukjin Kwon Implements the support of DayTimeIntervalType in Arrow code path: - pandas UDFs - pandas functions APIs - createDataFrame/toPandas when Arrow is enabled -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37276) Support YearMonthIntervalType in Arrow
Hyukjin Kwon created SPARK-37276: Summary: Support YearMonthIntervalType in Arrow Key: SPARK-37276 URL: https://issues.apache.org/jira/browse/SPARK-37276 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Affects Versions: 3.3.0 Reporter: Hyukjin Kwon Implements the support of YearMonthIntervalType in Arrow code path: - pandas UDFs - pandas functions APIs - createDataFrame/toPandas when Arrow is enabled -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37275) Support ANSI intervals in PySpark
Hyukjin Kwon created SPARK-37275: Summary: Support ANSI intervals in PySpark Key: SPARK-37275 URL: https://issues.apache.org/jira/browse/SPARK-37275 Project: Spark Issue Type: Umbrella Components: PySpark, SQL Affects Versions: 3.3.0 Reporter: Hyukjin Kwon This JIRA targets to implement ANSI interval types in PySpark: - https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DayTimeIntervalType.scala - https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/YearMonthIntervalType.scala -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
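The day-time half of this umbrella has a natural Python-side representation. A small sketch, under the assumption that the PySpark work surfaces DayTimeIntervalType values as datetime.timedelta backed by Catalyst's microsecond count, shows the round-trip such a mapping has to preserve:

```python
from datetime import timedelta

# Assumption: DayTimeIntervalType maps to datetime.timedelta on the
# Python side, while Catalyst stores the interval as a long of
# microseconds. A faithful mapping must round-trip through that unit.
interval = timedelta(days=1, hours=2, minutes=3, seconds=4)

# Convert to the internal microsecond representation.
micros = round(interval.total_seconds() * 1_000_000)

# Reconstructing the timedelta from the microsecond count is lossless.
roundtrip = timedelta(microseconds=micros)
assert roundtrip == interval
print(micros)  # 93784000000
```

Year-month intervals are stored as a month count instead, which is why they need separate handling in each code path (Py4J, Arrow, and the plain createDataFrame/toPandas path) tracked by the sub-tasks above.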
[jira] [Created] (SPARK-37274) These parameters should be of type long, not int
hao created SPARK-37274: --- Summary: These parameters should be of type long, not int Key: SPARK-37274 URL: https://issues.apache.org/jira/browse/SPARK-37274 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: hao These parameters [spark.sql.orc.columnarReaderBatchSize], [spark.sql.inMemoryColumnarStorage.batchSize], [spark.sql.parquet.columnarReaderBatchSize] should be of type long, not of type int. When the user sets a value greater than the maximum value of type int, an out-of-bounds error is thrown. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
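Since these batch-size options are parsed as 32-bit ints, the documentation reminder the follow-up asks for amounts to a range check before use. A hypothetical sketch of that check (check_batch_size and its error message are illustrative, not Spark's actual validation code):

```python
# Hypothetical validator: Spark parses these batch-size confs as JVM
# ints, so any user-supplied value above Integer.MAX_VALUE is rejected.
INT_MAX = 2**31 - 1  # JVM Integer.MAX_VALUE

def check_batch_size(conf_key, value):
    # Returns the parsed value if it fits in a JVM int; otherwise raises
    # the out-of-bounds condition the ticket describes.
    n = int(value)
    if not (0 < n <= INT_MAX):
        raise ValueError(
            f"{conf_key}={value} exceeds the int range (max {INT_MAX})")
    return n

print(check_batch_size("spark.sql.parquet.columnarReaderBatchSize", "4096"))
try:
    check_batch_size("spark.sql.parquet.columnarReaderBatchSize",
                     str(INT_MAX + 1))
except ValueError as e:
    print("rejected:", e)
```

The same bound applies to all three confs listed; making them long-typed, as the original summary proposed, would instead raise the ceiling to Long.MaxValue.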
[jira] [Commented] (SPARK-37255) When Used with PyHive (by dropbox) query timeout doesn't result in propagation to the UI
[ https://issues.apache.org/jira/browse/SPARK-37255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17442058#comment-17442058 ] Hyukjin Kwon commented on SPARK-37255: -- That's very likely an issue in PyHive. > When Used with PyHive (by dropbox) query timeout doesn't result in > propagation to the UI > > > Key: SPARK-37255 > URL: https://issues.apache.org/jira/browse/SPARK-37255 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: ramakrishna chilaka >Priority: Major > > When a large query is timed out and cancelled by the Spark Thrift Server, > PyHive doesn't show that the query was cancelled. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37255) When Used with PyHive (by dropbox) query timeout doesn't result in propagation to the UI
[ https://issues.apache.org/jira/browse/SPARK-37255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-37255. -- Resolution: Invalid > When Used with PyHive (by dropbox) query timeout doesn't result in > propagation to the UI > > > Key: SPARK-37255 > URL: https://issues.apache.org/jira/browse/SPARK-37255 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: ramakrishna chilaka >Priority: Major > > When a large query is timed out and cancelled by the Spark Thrift Server, > PyHive doesn't show that the query was cancelled. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37273) Hidden File Metadata Support for Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-37273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17442057#comment-17442057 ] Hyukjin Kwon commented on SPARK-37273: -- Don't we already have this in DSv2? e.g.) SPARK-31255 > Hidden File Metadata Support for Spark SQL > -- > > Key: SPARK-37273 > URL: https://issues.apache.org/jira/browse/SPARK-37273 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yaohua Zhao >Priority: Major > > Provide a new interface in Spark SQL that allows users to query the metadata > of the input files for all file formats, expose them as *built-in hidden > columns* meaning *users can only see them when they explicitly reference > them* (e.g. file path, file name) -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37273) Hidden File Metadata Support for Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-37273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-37273. -- Resolution: Duplicate > Hidden File Metadata Support for Spark SQL > -- > > Key: SPARK-37273 > URL: https://issues.apache.org/jira/browse/SPARK-37273 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yaohua Zhao >Priority: Major > > Provide a new interface in Spark SQL that allows users to query the metadata > of the input files for all file formats, expose them as *built-in hidden > columns* meaning *users can only see them when they explicitly reference > them* (e.g. file path, file name) -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37264) Exclude hadoop-client-api transitive dependency from orc-core
[ https://issues.apache.org/jira/browse/SPARK-37264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37264: -- Summary: Exclude hadoop-client-api transitive dependency from orc-core (was: [SPARK-37264][BUILD] Exclude hadoop-client-api transitive dependency from orc-core) > Exclude hadoop-client-api transitive dependency from orc-core > - > > Key: SPARK-37264 > URL: https://issues.apache.org/jira/browse/SPARK-37264 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.3.0 > > > Like hadoop-common and hadoop-hdfs, this PR proposes to exclude > hadoop-client-api transitive dependency from orc-core. > Why are the changes needed? > Since Apache Hadoop 2.7 doesn't work on Java 17, Apache ORC has a dependency > on Hadoop 3.3.1. > This causes test-dependencies.sh failure on Java 17. As a result, > run-tests.py also fails. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37109) Install Java 17 on all of the Jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-37109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37109: -- Parent: (was: SPARK-33772) Issue Type: Bug (was: Sub-task) > Install Java 17 on all of the Jenkins workers > - > > Key: SPARK-37109 > URL: https://issues.apache.org/jira/browse/SPARK-37109 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36900) "SPARK-36464: size returns correct positive number even with over 2GB data" will oom with JDK17
[ https://issues.apache.org/jira/browse/SPARK-36900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-36900: - Assignee: Yang Jie > "SPARK-36464: size returns correct positive number even with over 2GB data" > will oom with JDK17 > > > Key: SPARK-36900 > URL: https://issues.apache.org/jira/browse/SPARK-36900 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.3.0 > > > Execute > > {code:java} > build/mvn clean install -pl core -am -Dtest=none > -DwildcardSuites=org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite > {code} > with JDK 17, > {code:java} > ChunkedByteBufferOutputStreamSuite: > - empty output > - write a single byte > - write a single near boundary > - write a single at boundary > - single chunk output > - single chunk output at boundary size > - multiple chunk output > - multiple chunk output at boundary size > *** RUN ABORTED *** > java.lang.OutOfMemoryError: Java heap space > at java.base/java.lang.Integer.valueOf(Integer.java:1081) > at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:67) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75) > at java.base/java.io.OutputStream.write(OutputStream.java:127) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite.$anonfun$new$22(ChunkedByteBufferOutputStreamSuite.scala:127) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite$$Lambda$179/0x0008011a75d8.apply(Unknown > Source) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > {code} > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To 
unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-37109) Install Java 17 on all of the Jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-37109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-37109. - > Install Java 17 on all of the Jenkins workers > - > > Key: SPARK-37109 > URL: https://issues.apache.org/jira/browse/SPARK-37109 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37272) Add `ExtendedRocksDBTest` and disable RocksDB tests on Apple Silicon
[ https://issues.apache.org/jira/browse/SPARK-37272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37272: -- Parent: SPARK-33772 Issue Type: Sub-task (was: Improvement) > Add `ExtendedRocksDBTest` and disable RocksDB tests on Apple Silicon > > > Key: SPARK-37272 > URL: https://issues.apache.org/jira/browse/SPARK-37272 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0 > > > Javava 17 officially support Apple Silicon > - JEP 391: macOS/AArch64 Port > - https://bugs.openjdk.java.net/browse/JDK-8251280 > Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 supports Apple Silicon > natively. > {code} > /Users/dongjoon/.jenv/versions/oracle17/bin/java: Mach-O 64-bit executable > arm64 > /Users/dongjoon/.jenv/versions/zulu17/bin/java: Mach-O 64-bit executable arm64 > /Users/dongjoon/.jenv/versions/temurin17/bin/java: Mach-O 64-bit executable > arm64 > {code} > Since RocksDBJNI still doesn't support Apple Silicon natively, the following > failures occur on M1. > {code} > $ build/sbt "sql/testOnly *RocksDB* *.StreamingSessionWindowSuite" > ... > [info] Run completed in 23 seconds, 281 milliseconds. 
> [info] Total number of tests run: 32 > [info] Suites: completed 2, aborted 2 > [info] Tests: succeeded 22, failed 10, canceled 0, ignored 0, pending 0 > [info] *** 2 SUITES ABORTED *** > [info] *** 10 TESTS FAILED *** > [error] Failed tests: > [error] org.apache.spark.sql.streaming.StreamingSessionWindowSuite > [error] > org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreIntegrationSuite > [error] Error during tests: > [error] org.apache.spark.sql.execution.streaming.state.RocksDBSuite > [error] > org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreSuite > [error] (sql / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful > [error] Total time: 43 s, completed Nov 10, 2021 4:29:50 PM > {code} > This issue aims to add ExtendedRocksDBTest to disable RocksDB selectively on > Apple Silicon. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37272) Add `ExtendedRocksDBTest` and disable RocksDB tests on Apple Silicon
[ https://issues.apache.org/jira/browse/SPARK-37272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37272: -- Description: Java 17 officially support Apple Silicon - JEP 391: macOS/AArch64 Port - https://bugs.openjdk.java.net/browse/JDK-8251280 Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 supports Apple Silicon natively. {code} /Users/dongjoon/.jenv/versions/oracle17/bin/java: Mach-O 64-bit executable arm64 /Users/dongjoon/.jenv/versions/zulu17/bin/java: Mach-O 64-bit executable arm64 /Users/dongjoon/.jenv/versions/temurin17/bin/java: Mach-O 64-bit executable arm64 {code} Since RocksDBJNI still doesn't support Apple Silicon natively, the following failures occur on M1. {code} $ build/sbt "sql/testOnly *RocksDB* *.StreamingSessionWindowSuite" ... [info] Run completed in 23 seconds, 281 milliseconds. [info] Total number of tests run: 32 [info] Suites: completed 2, aborted 2 [info] Tests: succeeded 22, failed 10, canceled 0, ignored 0, pending 0 [info] *** 2 SUITES ABORTED *** [info] *** 10 TESTS FAILED *** [error] Failed tests: [error] org.apache.spark.sql.streaming.StreamingSessionWindowSuite [error] org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreIntegrationSuite [error] Error during tests: [error] org.apache.spark.sql.execution.streaming.state.RocksDBSuite [error] org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreSuite [error] (sql / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful [error] Total time: 43 s, completed Nov 10, 2021 4:29:50 PM {code} This issue aims to add ExtendedRocksDBTest to disable RocksDB selectively on Apple Silicon. was: Javava 17 officially support Apple Silicon - JEP 391: macOS/AArch64 Port - https://bugs.openjdk.java.net/browse/JDK-8251280 Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 supports Apple Silicon natively. 
{code} /Users/dongjoon/.jenv/versions/oracle17/bin/java: Mach-O 64-bit executable arm64 /Users/dongjoon/.jenv/versions/zulu17/bin/java: Mach-O 64-bit executable arm64 /Users/dongjoon/.jenv/versions/temurin17/bin/java: Mach-O 64-bit executable arm64 {code} Since RocksDBJNI still doesn't support Apple Silicon natively, the following failures occur on M1. {code} $ build/sbt "sql/testOnly *RocksDB* *.StreamingSessionWindowSuite" ... [info] Run completed in 23 seconds, 281 milliseconds. [info] Total number of tests run: 32 [info] Suites: completed 2, aborted 2 [info] Tests: succeeded 22, failed 10, canceled 0, ignored 0, pending 0 [info] *** 2 SUITES ABORTED *** [info] *** 10 TESTS FAILED *** [error] Failed tests: [error] org.apache.spark.sql.streaming.StreamingSessionWindowSuite [error] org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreIntegrationSuite [error] Error during tests: [error] org.apache.spark.sql.execution.streaming.state.RocksDBSuite [error] org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreSuite [error] (sql / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful [error] Total time: 43 s, completed Nov 10, 2021 4:29:50 PM {code} This issue aims to add ExtendedRocksDBTest to disable RocksDB selectively on Apple Silicon. > Add `ExtendedRocksDBTest` and disable RocksDB tests on Apple Silicon > > > Key: SPARK-37272 > URL: https://issues.apache.org/jira/browse/SPARK-37272 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0 > > > Java 17 officially support Apple Silicon > - JEP 391: macOS/AArch64 Port > - https://bugs.openjdk.java.net/browse/JDK-8251280 > Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 supports Apple Silicon > natively. 
> {code} > /Users/dongjoon/.jenv/versions/oracle17/bin/java: Mach-O 64-bit executable > arm64 > /Users/dongjoon/.jenv/versions/zulu17/bin/java: Mach-O 64-bit executable arm64 > /Users/dongjoon/.jenv/versions/temurin17/bin/java: Mach-O 64-bit executable > arm64 > {code} > Since RocksDBJNI still doesn't support Apple Silicon natively, the following > failures occur on M1. > {code} > $ build/sbt "sql/testOnly *RocksDB* *.StreamingSessionWindowSuite" > ... > [info] Run completed in 23 seconds, 281 milliseconds. > [info] Total number of tests run: 32 > [info] Suites: completed 2, aborted 2 > [info] Tests: succeeded 22, failed 10, canceled 0, ignored 0, pending 0 > [info] *** 2 SUITES ABORTED *** > [info] *** 10 TESTS FAILED *** > [error] Failed tests: > [error] org.apache.spark.sql.streaming.StreamingSessionWindowSuite > [error] > org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreIntegrationSuite > [error] Error during tests: >
[jira] [Updated] (SPARK-37272) Add `ExtendedRocksDBTest` and disable RocksDB tests on Apple Silicon
[ https://issues.apache.org/jira/browse/SPARK-37272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37272: -- Description: Javava 17 officially support Apple Silicon - JEP 391: macOS/AArch64 Port - https://bugs.openjdk.java.net/browse/JDK-8251280 Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 supports Apple Silicon natively. {code} /Users/dongjoon/.jenv/versions/oracle17/bin/java: Mach-O 64-bit executable arm64 /Users/dongjoon/.jenv/versions/zulu17/bin/java: Mach-O 64-bit executable arm64 /Users/dongjoon/.jenv/versions/temurin17/bin/java: Mach-O 64-bit executable arm64 {code} Since RocksDBJNI still doesn't support Apple Silicon natively, the following failures occur on M1. {code} $ build/sbt "sql/testOnly *RocksDB* *.StreamingSessionWindowSuite" ... [info] Run completed in 23 seconds, 281 milliseconds. [info] Total number of tests run: 32 [info] Suites: completed 2, aborted 2 [info] Tests: succeeded 22, failed 10, canceled 0, ignored 0, pending 0 [info] *** 2 SUITES ABORTED *** [info] *** 10 TESTS FAILED *** [error] Failed tests: [error] org.apache.spark.sql.streaming.StreamingSessionWindowSuite [error] org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreIntegrationSuite [error] Error during tests: [error] org.apache.spark.sql.execution.streaming.state.RocksDBSuite [error] org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreSuite [error] (sql / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful [error] Total time: 43 s, completed Nov 10, 2021 4:29:50 PM {code} This issue aims to add ExtendedRocksDBTest to disable RocksDB selectively on Apple Silicon. 
> Add `ExtendedRocksDBTest` and disable RocksDB tests on Apple Silicon > > > Key: SPARK-37272 > URL: https://issues.apache.org/jira/browse/SPARK-37272 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0 > > > Javava 17 officially support Apple Silicon > - JEP 391: macOS/AArch64 Port > - https://bugs.openjdk.java.net/browse/JDK-8251280 > Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 supports Apple Silicon > natively. > {code} > /Users/dongjoon/.jenv/versions/oracle17/bin/java: Mach-O 64-bit executable > arm64 > /Users/dongjoon/.jenv/versions/zulu17/bin/java: Mach-O 64-bit executable arm64 > /Users/dongjoon/.jenv/versions/temurin17/bin/java: Mach-O 64-bit executable > arm64 > {code} > Since RocksDBJNI still doesn't support Apple Silicon natively, the following > failures occur on M1. > {code} > $ build/sbt "sql/testOnly *RocksDB* *.StreamingSessionWindowSuite" > ... > [info] Run completed in 23 seconds, 281 milliseconds. > [info] Total number of tests run: 32 > [info] Suites: completed 2, aborted 2 > [info] Tests: succeeded 22, failed 10, canceled 0, ignored 0, pending 0 > [info] *** 2 SUITES ABORTED *** > [info] *** 10 TESTS FAILED *** > [error] Failed tests: > [error] org.apache.spark.sql.streaming.StreamingSessionWindowSuite > [error] > org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreIntegrationSuite > [error] Error during tests: > [error] org.apache.spark.sql.execution.streaming.state.RocksDBSuite > [error] > org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreSuite > [error] (sql / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful > [error] Total time: 43 s, completed Nov 10, 2021 4:29:50 PM > {code} > This issue aims to add ExtendedRocksDBTest to disable RocksDB selectively on > Apple Silicon. 
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37272) Add `ExtendedRocksDBTest` and disable RocksDB tests on Apple Silicon
[ https://issues.apache.org/jira/browse/SPARK-37272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37272: -- Summary: Add `ExtendedRocksDBTest` and disable RocksDB tests on Apple Silicon (was: Add ExtendedRocksDBTest) > Add `ExtendedRocksDBTest` and disable RocksDB tests on Apple Silicon > > > Key: SPARK-37272 > URL: https://issues.apache.org/jira/browse/SPARK-37272 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
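For readers unfamiliar with the mechanism: `ExtendedRocksDBTest` acts as a test tag that lets the build exclude RocksDB-dependent suites on hosts where the native library is unavailable. A minimal ScalaTest sketch of the same idea follows; the names are illustrative only, since Spark's real tag is an annotation in its `common/tags` module, not this object.

```scala
import org.scalatest.Tag
import org.scalatest.funsuite.AnyFunSuite

// Illustrative tag; Spark's actual ExtendedRocksDBTest lives in common/tags.
object RocksDBTestTag extends Tag("org.example.tags.ExtendedRocksDBTest")

class RocksDBBackedSuite extends AnyFunSuite {
  // Tagged tests can be skipped at run time with ScalaTest's exclude flag,
  // e.g. `testOnly -- -l org.example.tags.ExtendedRocksDBTest` on Apple Silicon.
  test("state store round trip", RocksDBTestTag) {
    assert(Seq(1, 2, 3).sum == 6)
  }
}
```

Running the suite with the `-l` exclusion argument skips the tagged test while the rest of the suite still executes.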
[jira] [Resolved] (SPARK-37272) Add ExtendedRocksDBTest
[ https://issues.apache.org/jira/browse/SPARK-37272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-37272. --- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34547 [https://github.com/apache/spark/pull/34547] > Add ExtendedRocksDBTest > --- > > Key: SPARK-37272 > URL: https://issues.apache.org/jira/browse/SPARK-37272 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37272) Add ExtendedRocksDBTest
[ https://issues.apache.org/jira/browse/SPARK-37272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-37272: - Assignee: Dongjoon Hyun > Add ExtendedRocksDBTest > --- > > Key: SPARK-37272 > URL: https://issues.apache.org/jira/browse/SPARK-37272 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37270) Incorrect result of filter using isNull condition
[ https://issues.apache.org/jira/browse/SPARK-37270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37270: - Labels: correctness (was: ) > Incorrect result of filter using isNull condition > > > Key: SPARK-37270 > URL: https://issues.apache.org/jira/browse/SPARK-37270 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.2.0 >Reporter: Tomasz Kus >Priority: Major > Labels: correctness > > Simple code that reproduces this issue: > {code:java} > val frame = Seq((false, 1)).toDF("bool", "number") > frame > .checkpoint() > .withColumn("conditions", when(col("bool"), "I am not null")) > .filter(col("conditions").isNull) > .show(false){code} > Although the "conditions" column is null > {code:java} > +-----+------+----------+ > |bool |number|conditions| > +-----+------+----------+ > |false|1 |null | > +-----+------+----------+{code} > an empty result is shown. > Execution plans: > {code:java} > == Parsed Logical Plan == > 'Filter isnull('conditions) > +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END > AS conditions#252] > +- LogicalRDD [bool#124, number#125], false > == Analyzed Logical Plan == > bool: boolean, number: int, conditions: string > Filter isnull(conditions#252) > +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END > AS conditions#252] > +- LogicalRDD [bool#124, number#125], false > == Optimized Logical Plan == > LocalRelation , [bool#124, number#125, conditions#252] > == Physical Plan == > LocalTableScan , [bool#124, number#125, conditions#252] > {code} > After removing the checkpoint the proper result is returned and the execution plans are > as follows: > {code:java} > == Parsed Logical Plan == > 'Filter isnull('conditions) > +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END > AS conditions#256] > +- Project [_1#119 AS bool#124, _2#120 AS number#125] > +- LocalRelation [_1#119, _2#120] > == Analyzed Logical Plan == > bool: boolean, number: int, conditions: string 
> Filter isnull(conditions#256) > +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END > AS conditions#256] > +- Project [_1#119 AS bool#124, _2#120 AS number#125] > +- LocalRelation [_1#119, _2#120] > == Optimized Logical Plan == > LocalRelation [bool#124, number#125, conditions#256] > == Physical Plan == > LocalTableScan [bool#124, number#125, conditions#256] > {code} > It seems that the most important difference is LogicalRDD -> LocalRelation > There are the following ways (workarounds) to retrieve the correct result: > 1) remove the checkpoint > 2) add an explicit .otherwise(null) to when > 3) add checkpoint() or cache() just before the filter > 4) downgrade to Spark 3.1.2 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
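Workaround 2 above amounts to spelling out the null branch of the `when`. A minimal sketch against the same toy frame (assuming a local `SparkSession`; `checkpoint()` additionally needs a checkpoint directory configured, and the directory path here is illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, when}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._
// checkpoint() requires a checkpoint directory to be set first.
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

val frame = Seq((false, 1)).toDF("bool", "number")

// Making .otherwise(lit(null)) explicit sidesteps the misoptimized plan,
// so the isNull filter keeps the row as expected.
frame
  .checkpoint()
  .withColumn("conditions",
    when(col("bool"), "I am not null").otherwise(lit(null)))
  .filter(col("conditions").isNull)
  .show(false)
```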
[jira] [Commented] (SPARK-37254) 100% CPU usage on Spark Thrift Server.
[ https://issues.apache.org/jira/browse/SPARK-37254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17442046#comment-17442046 ] Hyukjin Kwon commented on SPARK-37254: -- it would be much easier to investigate the issue if there are reproducible steps. > 100% CPU usage on Spark Thrift Server. > -- > > Key: SPARK-37254 > URL: https://issues.apache.org/jira/browse/SPARK-37254 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: ramakrishna chilaka >Priority: Major > > We are trying to use Spark Thrift Server as a distributed SQL query engine. > The queries work when the resident memory occupied by Spark Thrift Server > (identified through htop) is comparatively less than the driver memory. The > same queries result in 100% CPU usage when the resident memory occupied by > Spark Thrift Server is greater than the configured driver memory, and they keep > running at 100% CPU usage. I am using incremental collect as false, as I need > faster responses for exploratory queries. I am trying to understand the > following points: > * Why isn't Spark Thrift Server releasing the memory back when there are no > queries? > * What is causing Spark Thrift Server to go into 100% CPU usage on all the > cores when Spark Thrift Server's memory is greater than the driver memory > (by 10% usually), and why are queries just stuck? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37233) Inline type hints for files in python/pyspark/mllib
[ https://issues.apache.org/jira/browse/SPARK-37233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-37233: Assignee: dch nguyen > Inline type hints for files in python/pyspark/mllib > --- > > Key: SPARK-37233 > URL: https://issues.apache.org/jira/browse/SPARK-37233 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Assignee: dch nguyen >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37260) PYSPARK Arrow 3.2.0 docs link invalid
[ https://issues.apache.org/jira/browse/SPARK-37260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37260: - Fix Version/s: 3.2.1 > PYSPARK Arrow 3.2.0 docs link invalid > - > > Key: SPARK-37260 > URL: https://issues.apache.org/jira/browse/SPARK-37260 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 3.2.0 >Reporter: Thomas Graves >Priority: Major > Fix For: 3.2.1 > > > [http://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html] > links to: > [https://spark.apache.org/docs/latest/api/python/user_guide/arrow_pandas.html] > which links to: > [https://spark.apache.org/docs/latest/api/python/sql/arrow_pandas.rst] > But that is an invalid link. > I assume it's supposed to point to: > https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37260) PYSPARK Arrow 3.2.0 docs link invalid
[ https://issues.apache.org/jira/browse/SPARK-37260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-37260. -- Resolution: Fixed > PYSPARK Arrow 3.2.0 docs link invalid > - > > Key: SPARK-37260 > URL: https://issues.apache.org/jira/browse/SPARK-37260 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 3.2.0 >Reporter: Thomas Graves >Priority: Major > > [http://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html] > links to: > [https://spark.apache.org/docs/latest/api/python/user_guide/arrow_pandas.html] > which links to: > [https://spark.apache.org/docs/latest/api/python/sql/arrow_pandas.rst] > But that is an invalid link. > I assume it's supposed to point to: > https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37260) PYSPARK Arrow 3.2.0 docs link invalid
[ https://issues.apache.org/jira/browse/SPARK-37260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17442044#comment-17442044 ] Hyukjin Kwon commented on SPARK-37260: -- oh yeah. that's fixed via #34475. There are some more ongoing issues on the docs. I will fix them up and probably we could initiate spark 3.2.1. > PYSPARK Arrow 3.2.0 docs link invalid > - > > Key: SPARK-37260 > URL: https://issues.apache.org/jira/browse/SPARK-37260 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 3.2.0 >Reporter: Thomas Graves >Priority: Major > > [http://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html] > links to: > [https://spark.apache.org/docs/latest/api/python/user_guide/arrow_pandas.html] > which links to: > [https://spark.apache.org/docs/latest/api/python/sql/arrow_pandas.rst] > But that is an invalid link. > I assume it's supposed to point to: > https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37272) Add ExtendedRocksDBTest
[ https://issues.apache.org/jira/browse/SPARK-37272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37272: Assignee: Apache Spark > Add ExtendedRocksDBTest > --- > > Key: SPARK-37272 > URL: https://issues.apache.org/jira/browse/SPARK-37272 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37272) Add ExtendedRocksDBTest
[ https://issues.apache.org/jira/browse/SPARK-37272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37272: Assignee: (was: Apache Spark) > Add ExtendedRocksDBTest > --- > > Key: SPARK-37272 > URL: https://issues.apache.org/jira/browse/SPARK-37272 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37272) Add ExtendedRocksDBTest
[ https://issues.apache.org/jira/browse/SPARK-37272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17442020#comment-17442020 ] Apache Spark commented on SPARK-37272: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/34547 > Add ExtendedRocksDBTest > --- > > Key: SPARK-37272 > URL: https://issues.apache.org/jira/browse/SPARK-37272 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37273) Hidden File Metadata Support for Spark SQL
Yaohua Zhao created SPARK-37273: --- Summary: Hidden File Metadata Support for Spark SQL Key: SPARK-37273 URL: https://issues.apache.org/jira/browse/SPARK-37273 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Yaohua Zhao Provide a new interface in Spark SQL that allows users to query the metadata of the input files for all file formats, exposing them as *built-in hidden columns*, meaning *users can only see them when they explicitly reference them* (e.g. file path, file name). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
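For context on how such hidden columns surface to users, here is a sketch in the shape the feature eventually took; the `_metadata` struct with `file_path` and `file_name` fields reflects the Spark 3.3 design and should be treated as an assumption relative to this issue, and the table path is illustrative.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val df = spark.read.parquet("/tmp/example-table") // illustrative path

// The metadata struct stays hidden from df.printSchema() and SELECT *;
// it only materializes when referenced explicitly by name.
df.select("*", "_metadata.file_path", "_metadata.file_name").show(false)
```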
[jira] [Created] (SPARK-37272) Add ExtendedRocksDBTest
Dongjoon Hyun created SPARK-37272: - Summary: Add ExtendedRocksDBTest Key: SPARK-37272 URL: https://issues.apache.org/jira/browse/SPARK-37272 Project: Spark Issue Type: Improvement Components: SQL, Tests Affects Versions: 3.3.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33502) Large number of SELECT columns causes StackOverflowError
[ https://issues.apache.org/jira/browse/SPARK-33502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17236434#comment-17236434 ] Arwin S Tio edited comment on SPARK-33502 at 11/10/21, 7:22 PM: Note, running my program with "-Xss3072k" fixed it. Giving Spark a bigger stack lets you hold more columns in memory. was (Author: cozos): Note, running my program with "-Xss3072k" fixed it > Large number of SELECT columns causes StackOverflowError > > > Key: SPARK-33502 > URL: https://issues.apache.org/jira/browse/SPARK-33502 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.7 >Reporter: Arwin S Tio >Priority: Minor > > On Spark 2.4.7 Standalone Mode on my laptop (Macbook Pro 2015), I ran the > following: > {code:java} > public class TestSparkStackOverflow { > public static void main(String [] args) { > SparkSession spark = SparkSession > .builder() > .config("spark.master", "local[8]") > .appName(TestSparkStackOverflow.class.getSimpleName()) > .getOrCreate(); > StructType inputSchema = new StructType(); > inputSchema = inputSchema.add("foo", DataTypes.StringType); > > Dataset<Row> inputDf = spark.createDataFrame( > Arrays.asList( > RowFactory.create("1"), > RowFactory.create("2"), > RowFactory.create("3") > ), > inputSchema > ); > > List<Column> lotsOfColumns = new ArrayList<>(); > for (int i = 0; i < 3000; i++) { > lotsOfColumns.add(lit("").as("field" + i).cast(DataTypes.StringType)); > } > lotsOfColumns.add(new Column("foo")); > inputDf > > .select(JavaConverters.collectionAsScalaIterableConverter(lotsOfColumns).asScala().toSeq()) > .write() > .format("csv") > .mode(SaveMode.Append) > .save("file:///tmp/testoutput"); > } > } > {code} > > And I get a StackOverflowError: > {code:java} > Exception in thread "main" org.apache.spark.SparkException: Job > aborted.Exception in thread "main" org.apache.spark.SparkException: Job > aborted. 
at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:696) at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:305) > at 
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:291) at > org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:249) at > udp.task.TestSparkStackOverflow.main(TestSparkStackOverflow.java:52)Caused > by: java.lang.StackOverflowError at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1522) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at
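The `-Xss3072k` fix above works because the plan serialization recurses roughly once per column, so a deeper stack accommodates more frames. The same effect can be achieved in code by running the deeply recursing work on a thread constructed with a larger stack. A self-contained Scala sketch of that technique (no Spark involved; the depth and stack size are illustrative):

```scala
// Deep, non-tail recursion overflows a default ~1 MB thread stack long
// before 200,000 frames; a dedicated thread with a 64 MB stack handles it.
// This is the in-code analogue of raising -Xss for the whole JVM.
def depth(n: Int): Int = if (n == 0) 0 else 1 + depth(n - 1)

val result = new Array[Int](1)
val t = new Thread(
  null,                            // default thread group
  () => result(0) = depth(200000), // would throw StackOverflowError on main
  "big-stack",
  64L * 1024 * 1024                // requested stack size in bytes
)
t.start()
t.join()
println(result(0))                 // prints 200000
```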
[jira] [Resolved] (SPARK-35557) Adapt uses of JDK 17 Internal APIs
[ https://issues.apache.org/jira/browse/SPARK-35557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-35557. --- Resolution: Duplicate This is superseded by SPARK-36796 via adding `--add-open` options. > Adapt uses of JDK 17 Internal APIs > -- > > Key: SPARK-35557 > URL: https://issues.apache.org/jira/browse/SPARK-35557 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Ismaël Mejía >Priority: Major > > I tried to run a Spark pipeline using the most recent 3.2.0-SNAPSHOT with > Scala 2.12.4 on Java 17 and I found this exception: > {code:java} > java.lang.ExceptionInInitializerError > at org.apache.spark.unsafe.array.ByteArrayMethods. > (ByteArrayMethods.java:54) > at org.apache.spark.internal.config.package$. (package.scala:1149) > at org.apache.spark.SparkConf$. (SparkConf.scala:654) > at org.apache.spark.SparkConf.contains (SparkConf.scala:455) > ... > Caused by: java.lang.reflect.InaccessibleObjectException: Unable to make > private java.nio.DirectByteBuffer(long,int) accessible: module java.base does > not "opens java.nio" to unnamed module @110df513 > at java.lang.reflect.AccessibleObject.checkCanSetAccessible > (AccessibleObject.java:357) > at java.lang.reflect.AccessibleObject.checkCanSetAccessible > (AccessibleObject.java:297) > at java.lang.reflect.Constructor.checkCanSetAccessible (Constructor.java:188) > at java.lang.reflect.Constructor.setAccessible (Constructor.java:181) > at org.apache.spark.unsafe.Platform. (Platform.java:56) > at org.apache.spark.unsafe.array.ByteArrayMethods. > (ByteArrayMethods.java:54) > at org.apache.spark.internal.config.package$. (package.scala:1149) > at org.apache.spark.SparkConf$. 
(SparkConf.scala:654) > at org.apache.spark.SparkConf.contains (SparkConf.scala:455)}} > {code} > It seems that Java 17 will be more strict about uses of JDK Internals > [https://openjdk.java.net/jeps/403] -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37265) Support Java 17 in `dev/test-dependencies.sh`
[ https://issues.apache.org/jira/browse/SPARK-37265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-37265. --- Resolution: Invalid Let me close this as Invalid. > Support Java 17 in `dev/test-dependencies.sh` > - > > Key: SPARK-37265 > URL: https://issues.apache.org/jira/browse/SPARK-37265 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37271) Spark OOM issue
[ https://issues.apache.org/jira/browse/SPARK-37271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] M Shadab resolved SPARK-37271. -- Resolution: Fixed done > Spark OOM issue > --- > > Key: SPARK-37271 > URL: https://issues.apache.org/jira/browse/SPARK-37271 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 3.1.0 >Reporter: M Shadab >Priority: Critical > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37271) Spark OOM issue
[ https://issues.apache.org/jira/browse/SPARK-37271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441805#comment-17441805 ] M Shadab commented on SPARK-37271: -- Memory increased for the container > Spark OOM issue > --- > > Key: SPARK-37271 > URL: https://issues.apache.org/jira/browse/SPARK-37271 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 3.1.0 >Reporter: M Shadab >Priority: Critical > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37271) Spark OOM issue
[ https://issues.apache.org/jira/browse/SPARK-37271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] M Shadab updated SPARK-37271: - Shepherd: M Shadab > Spark OOM issue > --- > > Key: SPARK-37271 > URL: https://issues.apache.org/jira/browse/SPARK-37271 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 3.1.0 >Reporter: M Shadab >Priority: Critical > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37271) Spark OOM issue
M Shadab created SPARK-37271: Summary: Spark OOM issue Key: SPARK-37271 URL: https://issues.apache.org/jira/browse/SPARK-37271 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 3.1.0 Reporter: M Shadab -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36575) Executor lost may cause spark stage to hang
[ https://issues.apache.org/jira/browse/SPARK-36575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441796#comment-17441796 ] wuyi commented on SPARK-36575: -- FYI: the fix is reverted due to test issues. > Executor lost may cause spark stage to hang > --- > > Key: SPARK-36575 > URL: https://issues.apache.org/jira/browse/SPARK-36575 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.3.3 >Reporter: hujiahua >Assignee: hujiahua >Priority: Major > Fix For: 3.3.0 > > > When an executor finished a task of some stage, the driver will receive a > `StatusUpdate` event to handle it. At the same time the driver found that the > executor heartbeat had timed out, so the driver also needed to handle the ExecutorLost > event simultaneously. There was a race condition here, which can make > the task never be rescheduled again and the stage hang. > The problem is that `TaskResultGetter.enqueueSuccessfulTask` uses an > asynchronous thread to handle the successful task, which means the synchronized > lock of `TaskSchedulerImpl` was released prematurely midway > [https://github.com/apache/spark/blob/branch-2.3/core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala#L61]. > So `TaskSchedulerImpl` may handle executorLost first, and then the asynchronous > thread will go on to handle the successful task. This leaves > `TaskSetManager.successful` and `TaskSetManager.tasksSuccessful` with wrong > results. > Then `HeartbeatReceiver.expireDeadHosts` executed `killAndReplaceExecutor`, > which made `TaskSchedulerImpl.executorLost` execute twice. > `copiesRunning(index) -= 1` is processed in `executorLost`, so two calls to > `executorLost` drove `copiesRunning(index)` to -1, which led the stage to hang. 
> related log when the issue produce: > 21/08/05 02:58:14,784 INFO [dispatcher-event-loop-8] TaskSetManager: > Starting task 4004.0 in stage 1328625.0 (TID 347212402, 10.109.89.3, executor > 366724, partition 4004, ANY, 7994 bytes) > 21/08/05 03:00:24,126 ERROR [dispatcher-event-loop-4] TaskSchedulerImpl: > Lost executor 366724 on 10.109.89.3: Executor heartbeat timed out after > 140830 ms > 21/08/05 03:00:24,218 WARN [dispatcher-event-loop-4] TaskSetManager: Lost > task 4004.0 in stage 1328625.0 (TID 347212402, 10.109.89.3, executor 366724): > ExecutorLostFailure (executor 366724 exited caused by one of the running > tasks) Reason: Executor heartbeat timed out after 140830 ms > 21/08/05 03:00:24,542 INFO [task-result-getter-2] TaskSetManager: Finished > task 4004.0 in stage 1328625.0 (TID 347212402) in 129758 ms on 10.109.89.3 > (executor 366724) (3047/5400) > 21/08/05 03:00:34,621 INFO [dispatcher-event-loop-8] TaskSchedulerImpl: > Executor 366724 on 10.109.89.3 killed by driver. > 21/08/05 03:00:34,771 INFO [spark-listener-group-executorManagement] > ExecutorMonitor: Executor 366724 removed (new total is 793) > 21/08/05 03:00:42,360 INFO [dag-scheduler-event-loop] DAGScheduler: Executor > lost: 366724 (epoch 417416) > 21/08/05 03:00:42,360 INFO [dispatcher-event-loop-14] > BlockManagerMasterEndpoint: Trying to remove executor 366724 from > BlockManagerMaster. 
> 21/08/05 03:00:42,360 INFO [dispatcher-event-loop-14] > BlockManagerMasterEndpoint: Removing block manager BlockManagerId(366724, > 10.109.89.3, 43402, None) > 21/08/05 03:00:42,360 INFO [dag-scheduler-event-loop] BlockManagerMaster: > Removed 366724 successfully in removeExecutor > 21/08/05 03:00:42,360 INFO [dag-scheduler-event-loop] DAGScheduler: Shuffle > files lost for executor: 366724 (epoch 417416) > 21/08/05 03:00:44,584 INFO [dag-scheduler-event-loop] DAGScheduler: Executor > lost: 366724 (epoch 417473) > 21/08/05 03:00:44,584 INFO [dispatcher-event-loop-15] > BlockManagerMasterEndpoint: Trying to remove executor 366724 from > BlockManagerMaster. > 21/08/05 03:00:44,584 INFO [dag-scheduler-event-loop] BlockManagerMaster: > Removed 366724 successfully in removeExecutor > 21/08/05 03:00:44,584 INFO [dag-scheduler-event-loop] DAGScheduler: Shuffle > files lost for executor: 366724 (epoch 417473) -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
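The failure mode described above, two `executorLost` deliveries driving a counter to -1, reduces to a non-idempotent decrement running twice. A toy Scala model of that hazard (the names are illustrative, not Spark's actual scheduler code):

```scala
// Toy model of the SPARK-36575 hazard: copiesRunning should never go
// below 0, but nothing guards against executorLost being handled twice
// for the same executor once the scheduler lock is released midway.
class ToyTaskSetManager {
  var copiesRunning: Int = 1

  def onExecutorLost(): Unit = synchronized {
    copiesRunning -= 1 // not idempotent: a second delivery over-decrements
  }
}

val tsm = new ToyTaskSetManager
tsm.onExecutorLost()       // heartbeat-expiry path
tsm.onExecutorLost()       // killAndReplaceExecutor path, same executor
println(tsm.copiesRunning) // prints -1: the task is never considered pending again
```

A guard such as tracking which executors were already handled (a per-taskset `Set` of lost executor IDs) would make the handler idempotent.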
[jira] [Assigned] (SPARK-37045) Unify v1 and v2 ALTER TABLE .. ADD COLUMNS tests
[ https://issues.apache.org/jira/browse/SPARK-37045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-37045: Assignee: Max Gekk > Unify v1 and v2 ALTER TABLE .. ADD COLUMNS tests > > > Key: SPARK-37045 > URL: https://issues.apache.org/jira/browse/SPARK-37045 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Terry Kim >Assignee: Max Gekk >Priority: Major > > Extract the ALTER TABLE .. ADD COLUMNS tests to a common place to run them for > v1 and v2 datasources. Some tests can be placed in v1- and v2-specific test > suites. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37045) Unify v1 and v2 ALTER TABLE .. ADD COLUMNS tests
[ https://issues.apache.org/jira/browse/SPARK-37045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441755#comment-17441755 ] Max Gekk commented on SPARK-37045: -- I am working on this. > Unify v1 and v2 ALTER TABLE .. ADD COLUMNS tests > > > Key: SPARK-37045 > URL: https://issues.apache.org/jira/browse/SPARK-37045 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Terry Kim >Priority: Major > > Extract the ALTER TABLE .. ADD COLUMNS tests to a common place to run them for > v1 and v2 datasources. Some tests can be placed in v1- and v2-specific test > suites. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37236) Inline type hints for KernelDensity.pyi, test.py in python/pyspark/mllib/stat/
[ https://issues.apache.org/jira/browse/SPARK-37236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz resolved SPARK-37236. Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34510 [https://github.com/apache/spark/pull/34510] > Inline type hints for KernelDensity.pyi, test.py in python/pyspark/mllib/stat/ > -- > > Key: SPARK-37236 > URL: https://issues.apache.org/jira/browse/SPARK-37236 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Assignee: dch nguyen >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37236) Inline type hints for KernelDensity.pyi, test.py in python/pyspark/mllib/stat/
[ https://issues.apache.org/jira/browse/SPARK-37236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz reassigned SPARK-37236: -- Assignee: dch nguyen > Inline type hints for KernelDensity.pyi, test.py in python/pyspark/mllib/stat/ > -- > > Key: SPARK-37236 > URL: https://issues.apache.org/jira/browse/SPARK-37236 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Assignee: dch nguyen >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37270) Incorrect result of filter using isNull condition
[ https://issues.apache.org/jira/browse/SPARK-37270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomasz Kus updated SPARK-37270: --- Component/s: SQL > Incorrect result of filter using isNull condition > > > Key: SPARK-37270 > URL: https://issues.apache.org/jira/browse/SPARK-37270 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.2.0 >Reporter: Tomasz Kus >Priority: Major > > Simple code that reproduces this issue: > {code:java} > val frame = Seq((false, 1)).toDF("bool", "number") > frame > .checkpoint() > .withColumn("conditions", when(col("bool"), "I am not null")) > .filter(col("conditions").isNull) > .show(false){code} > Although the "conditions" column is null > {code:java} > +-----+------+----------+ > |bool |number|conditions| > +-----+------+----------+ > |false|1     |null      | > +-----+------+----------+{code} > an empty result is shown. > Execution plans: > {code:java} > == Parsed Logical Plan == > 'Filter isnull('conditions) > +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END > AS conditions#252] > +- LogicalRDD [bool#124, number#125], false > == Analyzed Logical Plan == > bool: boolean, number: int, conditions: string > Filter isnull(conditions#252) > +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END > AS conditions#252] > +- LogicalRDD [bool#124, number#125], false > == Optimized Logical Plan == > LocalRelation , [bool#124, number#125, conditions#252] > == Physical Plan == > LocalTableScan , [bool#124, number#125, conditions#252] > {code} > After removing the checkpoint, the proper result is returned and the execution plans are > as follows: > {code:java} > == Parsed Logical Plan == > 'Filter isnull('conditions) > +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END > AS conditions#256] > +- Project [_1#119 AS bool#124, _2#120 AS number#125] > +- LocalRelation [_1#119, _2#120] > == Analyzed Logical Plan == > bool: boolean, number: int, conditions: string > Filter isnull(conditions#256) 
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END > AS conditions#256] > +- Project [_1#119 AS bool#124, _2#120 AS number#125] > +- LocalRelation [_1#119, _2#120] > == Optimized Logical Plan == > LocalRelation [bool#124, number#125, conditions#256] > == Physical Plan == > LocalTableScan [bool#124, number#125, conditions#256] > {code} > The most important difference seems to be LogicalRDD -> LocalRelation. > The following workarounds retrieve the correct result: > 1) remove the checkpoint > 2) add an explicit .otherwise(null) to the when expression > 3) add checkpoint() or cache() just before the filter > 4) downgrade to Spark 3.1.2 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
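Workaround 2 from the list above can be sketched against the reproduction code as follows (an untested sketch that reuses the frame value defined in the snippet; it mirrors the reporter's description rather than verified behavior):
{code:java}
// Workaround 2: spelling out .otherwise(lit(null)) reportedly keeps the
// isNull filter correct even with the checkpoint in place (untested sketch).
frame
  .checkpoint()
  .withColumn("conditions", when(col("bool"), "I am not null").otherwise(lit(null)))
  .filter(col("conditions").isNull)
  .show(false)
{code}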
[jira] [Resolved] (SPARK-37261) Check adding partitions with ANSI intervals
[ https://issues.apache.org/jira/browse/SPARK-37261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-37261. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34537 [https://github.com/apache/spark/pull/34537] > Check adding partitions with ANSI intervals > --- > > Key: SPARK-37261 > URL: https://issues.apache.org/jira/browse/SPARK-37261 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.3.0 > > > Add tests that should check adding partitions with ANSI intervals via the > ALTER TABLE command. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
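The behavior those tests would cover can be sketched roughly as below (an illustrative guess at the shape of such a test, not the actual test code; the table name and interval value are made up):
{code:java}
// Hypothetical sketch: ALTER TABLE .. ADD PARTITION where the partition
// column has an ANSI interval type (names and values are illustrative).
sql("CREATE TABLE t (id INT, part INTERVAL YEAR TO MONTH) USING parquet PARTITIONED BY (part)")
sql("ALTER TABLE t ADD PARTITION (part = INTERVAL '1-2' YEAR TO MONTH)")
{code}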
[jira] [Created] (SPARK-37270) Incorrect result of filter using isNull condition
Tomasz Kus created SPARK-37270: -- Summary: Incorrect result of filter using isNull condition Key: SPARK-37270 URL: https://issues.apache.org/jira/browse/SPARK-37270 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.2.0 Reporter: Tomasz Kus Simple code that reproduces this issue: {code:java} val frame = Seq((false, 1)).toDF("bool", "number") frame .checkpoint() .withColumn("conditions", when(col("bool"), "I am not null")) .filter(col("conditions").isNull) .show(false){code} Although the "conditions" column is null {code:java} +-----+------+----------+ |bool |number|conditions| +-----+------+----------+ |false|1     |null      | +-----+------+----------+{code} an empty result is shown. Execution plans: {code:java} == Parsed Logical Plan == 'Filter isnull('conditions) +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#252] +- LogicalRDD [bool#124, number#125], false == Analyzed Logical Plan == bool: boolean, number: int, conditions: string Filter isnull(conditions#252) +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#252] +- LogicalRDD [bool#124, number#125], false == Optimized Logical Plan == LocalRelation , [bool#124, number#125, conditions#252] == Physical Plan == LocalTableScan , [bool#124, number#125, conditions#252] {code} After removing the checkpoint, the proper result is returned and the execution plans are as follows: {code:java} == Parsed Logical Plan == 'Filter isnull('conditions) +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#256] +- Project [_1#119 AS bool#124, _2#120 AS number#125] +- LocalRelation [_1#119, _2#120] == Analyzed Logical Plan == bool: boolean, number: int, conditions: string Filter isnull(conditions#256) +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#256] +- Project [_1#119 AS bool#124, _2#120 AS number#125] +- LocalRelation [_1#119, _2#120] == Optimized Logical Plan == LocalRelation [bool#124, number#125, 
conditions#256] == Physical Plan == LocalTableScan [bool#124, number#125, conditions#256] {code} The most important difference seems to be LogicalRDD -> LocalRelation. The following workarounds retrieve the correct result: 1) remove the checkpoint 2) add an explicit .otherwise(null) to the when expression 3) add checkpoint() or cache() just before the filter 4) downgrade to Spark 3.1.2 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37269) The partitionOverwriteMode option is not respected when using insertInto
David Szakallas created SPARK-37269: --- Summary: The partitionOverwriteMode option is not respected when using insertInto Key: SPARK-37269 URL: https://issues.apache.org/jira/browse/SPARK-37269 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: David Szakallas From the documentation of the {{spark.sql.sources.partitionOverwriteMode}} configuration option: {quote}This can also be set as an output option for a data source using key partitionOverwriteMode (which takes precedence over this setting), e.g. dataframe.write.option("partitionOverwriteMode", "dynamic").save(path). {quote} This is true when using .save(); however, .insertInto() does not respect the output option. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
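To illustrate the report, the two write paths can be sketched as follows (a hedged sketch that mirrors the description rather than verified behavior; df, path, and target_table are placeholders):
{code:java}
// Per the quoted documentation, the per-write option is honored by save():
df.write
  .mode("overwrite")
  .option("partitionOverwriteMode", "dynamic")
  .save(path)

// ...but per this report it is ignored by insertInto(); setting the
// session-level config first is a possible workaround (untested sketch):
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
df.write.mode("overwrite").insertInto("target_table")
{code}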
[jira] [Commented] (SPARK-37268) Remove unused method call in FileScanRDD
[ https://issues.apache.org/jira/browse/SPARK-37268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441669#comment-17441669 ] Apache Spark commented on SPARK-37268: -- User 'zuston' has created a pull request for this issue: https://github.com/apache/spark/pull/34545 > Remove unused method call in FileScanRDD > > > Key: SPARK-37268 > URL: https://issues.apache.org/jira/browse/SPARK-37268 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Junfan Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37268) Remove unused method call in FileScanRDD
[ https://issues.apache.org/jira/browse/SPARK-37268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37268: Assignee: (was: Apache Spark) > Remove unused method call in FileScanRDD > > > Key: SPARK-37268 > URL: https://issues.apache.org/jira/browse/SPARK-37268 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Junfan Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37268) Remove unused method call in FileScanRDD
[ https://issues.apache.org/jira/browse/SPARK-37268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37268: Assignee: Apache Spark > Remove unused method call in FileScanRDD > > > Key: SPARK-37268 > URL: https://issues.apache.org/jira/browse/SPARK-37268 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Junfan Zhang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37268) Remove unused method call in FileScanRDD
Junfan Zhang created SPARK-37268: Summary: Remove unused method call in FileScanRDD Key: SPARK-37268 URL: https://issues.apache.org/jira/browse/SPARK-37268 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.2.0 Reporter: Junfan Zhang -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37022) Use black as a formatter for the whole PySpark codebase.
[ https://issues.apache.org/jira/browse/SPARK-37022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441612#comment-17441612 ] Apache Spark commented on SPARK-37022: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/34544 > Use black as a formatter for the whole PySpark codebase. > > > Key: SPARK-37022 > URL: https://issues.apache.org/jira/browse/SPARK-37022 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.3.0 > > Attachments: black-diff-stats.txt, pyproject.toml > > > [{{black}}|https://github.com/psf/black] is a popular Python code formatter. > It is used by a number of projects, both small and large, including prominent > ones like pandas, scikit-learn, Django or SQLAlchemy. Black is already used > to format {{pyspark.pandas}} and (though not enforced) stub files. > We should consider using black to enforce formatting of all PySpark files. > There are multiple reasons to do that: > - Consistency: black is already used across the existing codebase, and black-formatted > chunks of code have already been added to modules other than > pyspark.pandas as a result of type hints inlining (SPARK-36845). > - Lower cost of contributing and reviewing: formatting can be automatically > enforced and applied. > - Simplify reviews: in general, black-formatted code produces small and > highly readable diffs. > - Reduce effort required to maintain patched forks: smaller diffs + > predictable formatting. > Risks: > - Initial reformatting requires quite significant changes. > - Applying black will break blame in the GitHub UI (for git in general see > [Avoiding ruining git > blame|https://black.readthedocs.io/en/stable/guides/introducing_black_to_your_project.html?highlight=blame#avoiding-ruining-git-blame]). 
> Additional steps: > - To simplify backporting, black will have to be applied to all active > branches. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
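For reference, enforcing black typically comes down to a small configuration block plus a check in CI; a plausible fragment is sketched below (hypothetical: the values are illustrative and the pyproject.toml attached to this issue may differ in every setting):
{code:none}
# Hypothetical [tool.black] fragment; the values shown here are illustrative
# and may not match the pyproject.toml attached to this issue.
[tool.black]
line-length = 100
target-version = ["py37"]
{code}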
[jira] [Commented] (SPARK-37022) Use black as a formatter for the whole PySpark codebase.
[ https://issues.apache.org/jira/browse/SPARK-37022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441610#comment-17441610 ] Apache Spark commented on SPARK-37022: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/34544 > Use black as a formatter for the whole PySpark codebase. > > > Key: SPARK-37022 > URL: https://issues.apache.org/jira/browse/SPARK-37022 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.3.0 > > Attachments: black-diff-stats.txt, pyproject.toml > > > [{{black}}|https://github.com/psf/black] is a popular Python code formatter. > It is used by a number of projects, both small and large, including prominent > ones like pandas, scikit-learn, Django or SQLAlchemy. Black is already used > to format {{pyspark.pandas}} and (though not enforced) stub files. > We should consider using black to enforce formatting of all PySpark files. > There are multiple reasons to do that: > - Consistency: black is already used across the existing codebase, and black-formatted > chunks of code have already been added to modules other than > pyspark.pandas as a result of type hints inlining (SPARK-36845). > - Lower cost of contributing and reviewing: formatting can be automatically > enforced and applied. > - Simplify reviews: in general, black-formatted code produces small and > highly readable diffs. > - Reduce effort required to maintain patched forks: smaller diffs + > predictable formatting. > Risks: > - Initial reformatting requires quite significant changes. > - Applying black will break blame in the GitHub UI (for git in general see > [Avoiding ruining git > blame|https://black.readthedocs.io/en/stable/guides/introducing_black_to_your_project.html?highlight=blame#avoiding-ruining-git-blame]). 
> Additional steps: > - To simplify backporting, black will have to be applied to all active > branches. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org