[jira] [Commented] (SPARK-33858) Unify v1 and v2 ALTER TABLE .. RENAME PARTITION tests
[ https://issues.apache.org/jira/browse/SPARK-33858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17252373#comment-17252373 ] Maxim Gekk commented on SPARK-33858: I am working on this. > Unify v1 and v2 ALTER TABLE .. RENAME PARTITION tests > - > > Key: SPARK-33858 > URL: https://issues.apache.org/jira/browse/SPARK-33858 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Priority: Major > > Extract ALTER TABLE .. RENAME PARTITION tests to the common place to run them > for V1 and v2 datasources. Some tests can be places to V1 and V2 specific > test suites. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33858) Unify v1 and v2 ALTER TABLE .. RENAME PARTITION tests
[ https://issues.apache.org/jira/browse/SPARK-33858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-33858: --- Description: Extract ALTER TABLE .. RENAME PARTITION tests to the common place to run them for V1 and V2 datasources. Some tests can be placed in V1 and V2 specific test suites. > Unify v1 and v2 ALTER TABLE .. RENAME PARTITION tests > - > > Key: SPARK-33858 > URL: https://issues.apache.org/jira/browse/SPARK-33858 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Priority: Major > > Extract ALTER TABLE .. RENAME PARTITION tests to the common place to run them > for V1 and V2 datasources. Some tests can be placed in V1 and V2 specific > test suites. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33858) Unify v1 and v2 ALTER TABLE .. RENAME PARTITION tests
Maxim Gekk created SPARK-33858: -- Summary: Unify v1 and v2 ALTER TABLE .. RENAME PARTITION tests Key: SPARK-33858 URL: https://issues.apache.org/jira/browse/SPARK-33858 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Maxim Gekk -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33849) Unify v1 and v2 DROP TABLE tests
[ https://issues.apache.org/jira/browse/SPARK-33849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-33849: --- Description: Extract DROP TABLE tests to the common place to run them for V1 and V2 datasources. Some tests can be placed in V1 and V2 specific test suites. > Unify v1 and v2 DROP TABLE tests > > > Key: SPARK-33849 > URL: https://issues.apache.org/jira/browse/SPARK-33849 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Priority: Major > > Extract DROP TABLE tests to the common place to run them for V1 and V2 > datasources. Some tests can be placed in V1 and V2 specific test suites. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33849) Unify v1 and v2 DROP TABLE tests
Maxim Gekk created SPARK-33849: -- Summary: Unify v1 and v2 DROP TABLE tests Key: SPARK-33849 URL: https://issues.apache.org/jira/browse/SPARK-33849 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Maxim Gekk -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33838) Add comments about `PURGE` in DropTable and in AlterTableDropPartition
[ https://issues.apache.org/jira/browse/SPARK-33838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-33838: --- Summary: Add comments about `PURGE` in DropTable and in AlterTableDropPartition (was: Add comment about `PURGE` in DropTable and in AlterTableDropPartition) > Add comments about `PURGE` in DropTable and in AlterTableDropPartition > -- > > Key: SPARK-33838 > URL: https://issues.apache.org/jira/browse/SPARK-33838 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Priority: Major > > Update comments for the commands AlterTableDropPartition and DropTable -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33838) Add comment about `PURGE` in DropTable and in AlterTableDropPartition
Maxim Gekk created SPARK-33838: -- Summary: Add comment about `PURGE` in DropTable and in AlterTableDropPartition Key: SPARK-33838 URL: https://issues.apache.org/jira/browse/SPARK-33838 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Maxim Gekk Update comments for the commands AlterTableDropPartition and DropTable -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33830) Describe the PURGE option of ALTER TABLE .. DROP PARTITION
[ https://issues.apache.org/jira/browse/SPARK-33830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-33830: --- Summary: Describe the PURGE option of ALTER TABLE .. DROP PARTITION (was: Describe the PURGE option of ALTER TABLE .. DROP PATITION) > Describe the PURGE option of ALTER TABLE .. DROP PARTITION > -- > > Key: SPARK-33830 > URL: https://issues.apache.org/jira/browse/SPARK-33830 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33830) Describe the PURGE option of ALTER TABLE .. DROP PATITION
Maxim Gekk created SPARK-33830: -- Summary: Describe the PURGE option of ALTER TABLE .. DROP PATITION Key: SPARK-33830 URL: https://issues.apache.org/jira/browse/SPARK-33830 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Maxim Gekk -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33789) Refactor unified V1 and V2 datasource tests
Maxim Gekk created SPARK-33789: -- Summary: Refactor unified V1 and V2 datasource tests Key: SPARK-33789 URL: https://issues.apache.org/jira/browse/SPARK-33789 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Maxim Gekk Extract common utils methods and settings, and place them to common traits. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
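To make the goal of this refactoring concrete, here is a minimal sketch of the kind of shared trait the unified suites could mix in; the trait and member names are illustrative, not the final Spark test API.
{code:scala}
// Illustrative base trait holding the settings shared by the unified command
// test suites; each datasource-specific suite only overrides the settings.
trait CommandSuiteBase {
  def version: String       // e.g. "V1" or "V2", used to build test suite names
  def catalog: String       // catalog name to qualify table identifiers with
  def defaultUsing: String  // default USING clause, e.g. "USING parquet"

  // A common helper shared by all suites
  def testName(name: String): String = s"$name $version"
}

// A V2-specific flavor just plugs in its own settings.
trait CommandSuiteV2Base extends CommandSuiteBase {
  override def version: String = "V2"
  override def catalog: String = "test_catalog"
  override def defaultUsing: String = "USING parquet"
}
{code}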
[jira] [Updated] (SPARK-33788) Throw NoSuchPartitionsException from HiveExternalCatalog.dropPartitions()
[ https://issues.apache.org/jira/browse/SPARK-33788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-33788: --- Description: HiveExternalCatalog.dropPartitions, and as a consequence the V1 Hive catalog, throws AnalysisException. The behavior deviates from V1/V2 in-memory catalogs that throw NoSuchPartitionsException. (was: HiveExternalCatalog.createPartitions throws AlreadyExistsException wrapped by AnalysisException. The behavior deviates from V1/V2 in-memory catalogs that throw PartitionsAlreadyExistException.) > Throw NoSuchPartitionsException from HiveExternalCatalog.dropPartitions() > - > > Key: SPARK-33788 > URL: https://issues.apache.org/jira/browse/SPARK-33788 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.7, 3.0.1, 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 2.4.8, 3.0.2, 3.1.0 > > > HiveExternalCatalog.dropPartitions, and as a consequence the V1 Hive > catalog, throws AnalysisException. The behavior deviates from V1/V2 in-memory > catalogs that throw NoSuchPartitionsException. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33788) Throw NoSuchPartitionsException from HiveExternalCatalog.dropPartitions()
Maxim Gekk created SPARK-33788: -- Summary: Throw NoSuchPartitionsException from HiveExternalCatalog.dropPartitions() Key: SPARK-33788 URL: https://issues.apache.org/jira/browse/SPARK-33788 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.7, 3.0.1, 3.1.0 Reporter: Maxim Gekk Assignee: Maxim Gekk Fix For: 2.4.8, 3.0.2, 3.1.0 HiveExternalCatalog.createPartitions throws AlreadyExistsException wrapped by AnalysisException. The behavior deviates from V1/V2 in-memory catalogs that throw PartitionsAlreadyExistException. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33787) Add `purge` to `dropPartition` in `SupportsPartitionManagement`
Maxim Gekk created SPARK-33787: -- Summary: Add `purge` to `dropPartition` in `SupportsPartitionManagement` Key: SPARK-33787 URL: https://issues.apache.org/jira/browse/SPARK-33787 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Maxim Gekk Add the `purge` parameter to the `dropPartition` in `SupportsPartitionManagement` and to the `dropPartitions` in `SupportsAtomicPartitionManagement`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
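As a purely illustrative sketch (the real SupportsPartitionManagement is a Java interface inside Spark, and the final method shape may differ), the proposal amounts to something like the following:
{code:scala}
import org.apache.spark.sql.catalyst.InternalRow

// Hypothetical trait mirroring the proposed change: dropPartition gains a
// `purge` flag, with a default that rejects purging for catalogs that cannot
// bypass the trash/recycle location.
trait PartitionManagementSketch {
  /** Existing behavior: drop the partition (data may go to trash). */
  def dropPartition(ident: InternalRow): Boolean

  /** Proposed overload: when purge = true, partition data is deleted permanently. */
  def dropPartition(ident: InternalRow, purge: Boolean): Boolean =
    if (purge) {
      throw new UnsupportedOperationException("Partition purge is not supported")
    } else {
      dropPartition(ident)
    }
}
{code}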
[jira] [Commented] (SPARK-33777) Sort output of V2 SHOW PARTITIONS
[ https://issues.apache.org/jira/browse/SPARK-33777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249120#comment-17249120 ] Maxim Gekk commented on SPARK-33777: I am working on this. > Sort output of V2 SHOW PARTITIONS > - > > Key: SPARK-33777 > URL: https://issues.apache.org/jira/browse/SPARK-33777 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Priority: Major > > V1 SHOW PARTITIONS command sorts its results. Both V1 implementations > in-memory and Hive catalog (according to Hive docs > [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowPartitions)] > perform sorting. V2 should have the same behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33777) Sort output of SHOW PARTITIONS V2
Maxim Gekk created SPARK-33777: -- Summary: Sort output of SHOW PARTITIONS V2 Key: SPARK-33777 URL: https://issues.apache.org/jira/browse/SPARK-33777 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Maxim Gekk The V1 SHOW PARTITIONS command sorts its results. Both V1 implementations, in-memory and Hive catalog (according to the Hive docs: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowPartitions), perform sorting. V2 should have the same behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33777) Sort output of V2 SHOW PARTITIONS
[ https://issues.apache.org/jira/browse/SPARK-33777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-33777: --- Summary: Sort output of V2 SHOW PARTITIONS (was: Sort output of SHOW PARTITIONS V2) > Sort output of V2 SHOW PARTITIONS > - > > Key: SPARK-33777 > URL: https://issues.apache.org/jira/browse/SPARK-33777 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Priority: Major > > V1 SHOW PARTITIONS command sorts its results. Both V1 implementations > in-memory and Hive catalog (according to Hive docs > [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowPartitions)] > perform sorting. V2 should have the same behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33770) Test failures: ALTER TABLE .. DROP PARTITION tries to delete files out of partition path
Maxim Gekk created SPARK-33770: -- Summary: Test failures: ALTER TABLE .. DROP PARTITION tries to delete files out of partition path Key: SPARK-33770 URL: https://issues.apache.org/jira/browse/SPARK-33770 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk For example: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132719/testReport/org.apache.spark.sql.hive.execution.command/AlterTableAddPartitionSuite/ALTER_TABLEADD_PARTITION_Hive_V1__SPARK_33521__universal_type_conversions_of_partition_values/ {code:java} org.apache.spark.sql.hive.execution.command.AlterTableAddPartitionSuite.ALTER TABLE .. ADD PARTITION Hive V1: SPARK-33521: universal type conversions of partition values sbt.ForkMain$ForkError: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: File file:/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-38fe2706-33e5-469a-ba3a-682391e02179 does not exist; at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:112) at org.apache.spark.sql.hive.HiveExternalCatalog.dropPartitions(HiveExternalCatalog.scala:1014) at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.dropPartitions(ExternalCatalogWithListener.scala:211) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.dropPartitions(SessionCatalog.scala:1036) at org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand.run(ddl.scala:582) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33768) Remove unused parameter `retainData` from AlterTableDropPartition
Maxim Gekk created SPARK-33768: -- Summary: Remove unused parameter `retainData` from AlterTableDropPartition Key: SPARK-33768 URL: https://issues.apache.org/jira/browse/SPARK-33768 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Maxim Gekk The parameter is hard-coded to false while parsing in AstBuilder. The parameter can be removed from the logical node. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33767) Unify v1 and v2 ALTER TABLE .. DROP PARTITION tests
[ https://issues.apache.org/jira/browse/SPARK-33767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-33767: --- Description: Extract ALTER TABLE .. DROP PARTITION tests to the common place to run them for V1 and V2 datasources. Some tests can be placed in V1 and V2 specific test suites. (was: Extract ALTER TABLE .. ADD PARTITION tests to the common place to run them for V1 and V2 datasources. Some tests can be placed in V1 and V2 specific test suites.) > Unify v1 and v2 ALTER TABLE .. DROP PARTITION tests > --- > > Key: SPARK-33767 > URL: https://issues.apache.org/jira/browse/SPARK-33767 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.1.0 > > > Extract ALTER TABLE .. DROP PARTITION tests to the common place to run them > for V1 and V2 datasources. Some tests can be placed in V1 and V2 specific > test suites. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33767) Unify v1 and v2 ALTER TABLE .. DROP PARTITION tests
Maxim Gekk created SPARK-33767: -- Summary: Unify v1 and v2 ALTER TABLE .. DROP PARTITION tests Key: SPARK-33767 URL: https://issues.apache.org/jira/browse/SPARK-33767 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk Assignee: Maxim Gekk Fix For: 3.1.0 Extract ALTER TABLE .. ADD PARTITION tests to the common place to run them for V1 and V2 datasources. Some tests can be placed in V1 and V2 specific test suites. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33742) Throw PartitionsAlreadyExistException from HiveExternalCatalog.createPartitions()
Maxim Gekk created SPARK-33742: -- Summary: Throw PartitionsAlreadyExistException from HiveExternalCatalog.createPartitions() Key: SPARK-33742 URL: https://issues.apache.org/jira/browse/SPARK-33742 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.1, 2.4.7, 3.1.0 Reporter: Maxim Gekk HiveExternalCatalog.createPartitions throws AlreadyExistsException wrapped by AnalysisException. The behavior deviates from V1/V2 in-memory catalogs that throw PartitionsAlreadyExistException. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
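A hedged sketch of the behavior the ticket asks for (table and partition names are made up; Hive support and a local metastore are assumed to be available):
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.analysis.PartitionsAlreadyExistException

val spark = SparkSession.builder()
  .master("local[1]")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("CREATE TABLE tbl (id INT) PARTITIONED BY (part INT) STORED AS PARQUET")
spark.sql("ALTER TABLE tbl ADD PARTITION (part = 1)")

// After the fix, the duplicate ADD PARTITION should surface the same exception
// type as the V1/V2 in-memory catalogs instead of a wrapped AnalysisException.
try {
  spark.sql("ALTER TABLE tbl ADD PARTITION (part = 1)")
} catch {
  case e: PartitionsAlreadyExistException => println(s"Expected error: ${e.getMessage}")
}
{code}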
[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246717#comment-17246717 ] Maxim Gekk commented on SPARK-33571: > The behavior of the to be introduced in Spark 3.1 > `spark.sql.legacy.parquet.int96RebaseModeIn*` is the same as for > `datetimeRebaseModeIn*`? Yes. > So Spark will check the parquet metadata for Spark version and the > `datetimeRebaseModeInRead` metadata key and use the correct behavior. Correct, except for the names of the metadata keys. Spark checks the keys defined here: https://github.com/MaxGekk/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/package.scala#L58-L68 > If those are not set it will raise an exception and ask the user to define > the mode. Is that correct? Yes. Spark should raise the exception if it is not clear which calendar the writer used. > but from my testing Spark 3 does the same by default, not sure if that aligns > with your findings? Spark 3.0.0-SNAPSHOT saved timestamps as TIMESTAMP_MICROS in parquet until https://github.com/apache/spark/pull/28450 . I just wanted to say that the configs datetimeRebaseModeIn* you pointed out don't affect INT96 in Spark 3.0. > What is the expected behavior for TIMESTAMP_MICROS and TIMESTAMP_MILLIS with > regards to this? The same as for DATE type. Spark takes into account the same SQL configs and metadata keys from parquet files. > Handling of hybrid to proleptic calendar when reading and writing Parquet > data not working correctly > > > Key: SPARK-33571 > URL: https://issues.apache.org/jira/browse/SPARK-33571 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 3.0.0, 3.0.1 >Reporter: Simon >Priority: Major > Fix For: 3.1.0 > > > The handling of old dates written with older Spark versions (<2.4.6) using > the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working > correctly. > From what I understand it should work like this: > * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before > 1900-01-01T00:00:00Z > * Only applies when reading or writing parquet files > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time a > `SparkUpgradeException` should be raised informing the user to choose either > `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInRead` > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time and > `datetimeRebaseModeInRead` is set to `LEGACY` the dates and timestamps should > show the same values in Spark 3.0.1. with for example `df.show()` as they did > in Spark 2.4.5 > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time and > `datetimeRebaseModeInRead` is set to `CORRECTED` the dates and timestamps > should show different values in Spark 3.0.1. with for example `df.show()` as > they did in Spark 2.4.5 > * When writing parqet files with Spark > 3.0.0 which contain dates or > timestamps before the above mentioned moment in time a > `SparkUpgradeException` should be raised informing the user to choose either > `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInWrite` > First of all I'm not 100% sure all of this is correct. I've been unable to > find any clear documentation on the expected behavior.
The understanding I > have was pieced together from the mailing list > ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html)] > the blog post linked there and looking at the Spark code. > From our testing we're seeing several issues: > * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. > that contains fields of type `TimestampType` which contain timestamps before > the above mentioned moments in time without `datetimeRebaseModeInRead` set > doesn't raise the `SparkUpgradeException`, it succeeds without any changes to > the resulting dataframe compared to that dataframe in Spark 2.4.5 > * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. > that contains fields of type `TimestampType` or `DateType` which contain > dates or timestamps before the above mentioned moments in time with > `datetimeRebaseModeInRead` set to `LEGACY` results in the same values in the > dataframe as when using `CORRECTED`, so it seems like no rebasing is > happening. > I've made some scripts to help with testing/show the behavior, it uses > pyspark 2.4.5, 2.4.6 and 3.0.1. You can find them here > [https://github.com/simonvanderveldt/spark3-reba
[jira] [Updated] (SPARK-33558) Unify v1 and v2 ALTER TABLE .. ADD PARTITION tests
[ https://issues.apache.org/jira/browse/SPARK-33558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-33558: --- Description: Extract ALTER TABLE .. ADD PARTITION tests to the common place to run them for V1 and V2 datasources. Some tests can be placed in V1 and V2 specific test suites. (was: Extract ALTER TABLE .. PARTITION tests to the common place to run them for V1 and V2 datasources. Some tests can be placed in V1 and V2 specific test suites.) > Unify v1 and v2 ALTER TABLE .. ADD PARTITION tests > -- > > Key: SPARK-33558 > URL: https://issues.apache.org/jira/browse/SPARK-33558 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > Extract ALTER TABLE .. ADD PARTITION tests to the common place to run them > for V1 and V2 datasources. Some tests can be placed in V1 and V2 specific > test suites. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33558) Unify v1 and v2 ALTER TABLE .. ADD PARTITION tests
[ https://issues.apache.org/jira/browse/SPARK-33558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-33558: --- Summary: Unify v1 and v2 ALTER TABLE .. ADD PARTITION tests (was: Unify v1 and v2 ALTER TABLE .. PARTITION tests) > Unify v1 and v2 ALTER TABLE .. ADD PARTITION tests > -- > > Key: SPARK-33558 > URL: https://issues.apache.org/jira/browse/SPARK-33558 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > Extract ALTER TABLE .. PARTITION tests to the common place to run them for V1 > and v2 datasources. Some tests can be places to V1 and V2 specific test > suites. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33706) Require fully specified partition identifier in partitionExists()
Maxim Gekk created SPARK-33706: -- Summary: Require fully specified partition identifier in partitionExists() Key: SPARK-33706 URL: https://issues.apache.org/jira/browse/SPARK-33706 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Maxim Gekk Currently, partitionExists() from SupportsPartitionManagement accepts any partition identifier, even one that is not fully specified. This ticket aims to add a check that the partition identifier has the same length as the partition schema, i.e. to require an exact match and prohibit partially specified IDs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
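A minimal sketch of the intended check, written as a standalone helper for illustration (the actual change lives in Spark's handling of SupportsPartitionManagement; the helper name is hypothetical):
{code:scala}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.SupportsPartitionManagement

// Reject partition identifiers that do not cover every partition column
// before delegating to partitionExists().
def fullySpecifiedPartitionExists(
    table: SupportsPartitionManagement,
    ident: InternalRow): Boolean = {
  val partitionNames = table.partitionSchema().fieldNames
  require(
    ident.numFields == partitionNames.length,
    s"The identifier must be fully specified: expected ${partitionNames.length} " +
      s"values for (${partitionNames.mkString(", ")}) but got ${ident.numFields}")
  table.partitionExists(ident)
}
{code}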
[jira] [Commented] (SPARK-33441) Add "unused-import" compile arg to scalac and remove all unused imports in Scala code
[ https://issues.apache.org/jira/browse/SPARK-33441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17245723#comment-17245723 ] Maxim Gekk commented on SPARK-33441: Would it be possible to check unused imports in Java code? > Add "unused-import" compile arg to scalac and remove all unused imports in > Scala code > -- > > Key: SPARK-33441 > URL: https://issues.apache.org/jira/browse/SPARK-33441 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.1.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.1.0 > > > * Add new scala compile arg to defense against new unused imports: > ** "-Ywarn-unused-import" for Scala 2.12 > ** "-Wconf:cat=unused-imports:ws" or "-Wconf:cat=unused-imports:error" for > Scala 2.13 > * Remove all unused imports in Scala code -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
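For illustration, the equivalent options in a plain sbt build would look roughly as follows (Spark wires the flags through its own build definitions, so this is only a sketch):
{code:scala}
// build.sbt sketch: warn on unused imports for Scala 2.12, fail the build for 2.13.
scalacOptions ++= {
  CrossVersion.partialVersion(scalaVersion.value) match {
    case Some((2, 13)) => Seq("-Wconf:cat=unused-imports:error")
    case _             => Seq("-Ywarn-unused-import")
  }
}
{code}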
[jira] [Commented] (SPARK-33670) Verify the partition provider is Hive in v1 SHOW TABLE EXTENDED
[ https://issues.apache.org/jira/browse/SPARK-33670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17245174#comment-17245174 ] Maxim Gekk commented on SPARK-33670: [~hyukjin.kwon] Just in case, which "Affects Version" should be set - an already released version or the current unreleased one? For example, I set 3.0.2 but it has not been released yet. Maybe I should set 3.0.1? > Verify the partition provider is Hive in v1 SHOW TABLE EXTENDED > --- > > Key: SPARK-33670 > URL: https://issues.apache.org/jira/browse/SPARK-33670 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.2, 3.1.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > Fix For: 2.4.8, 3.0.2, 3.1.0 > > > Invoke the check verifyPartitionProviderIsHive() from v1 implementation of > SHOW TABLE EXTENDED. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33688) Migrate SHOW TABLE EXTENDED to new resolution framework
[ https://issues.apache.org/jira/browse/SPARK-33688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17245143#comment-17245143 ] Maxim Gekk commented on SPARK-33688: I am working on this. > Migrate SHOW TABLE EXTENDED to new resolution framework > --- > > Key: SPARK-33688 > URL: https://issues.apache.org/jira/browse/SPARK-33688 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Priority: Major > > # Create the Command logical node for SHOW TABLE EXTENDED > # Remove ShowTableStatement -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33688) Migrate SHOW TABLE EXTENDED to new resolution framework
Maxim Gekk created SPARK-33688: -- Summary: Migrate SHOW TABLE EXTENDED to new resolution framework Key: SPARK-33688 URL: https://issues.apache.org/jira/browse/SPARK-33688 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Maxim Gekk # Create the Command logical node for SHOW TABLE EXTENDED # Remove ShowTableStatement -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33676) Require exact matched partition spec to schema in ADD/DROP PARTITION
Maxim Gekk created SPARK-33676: -- Summary: Require exact matched partition spec to schema in ADD/DROP PARTITION Key: SPARK-33676 URL: https://issues.apache.org/jira/browse/SPARK-33676 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk The V1 implementation of ALTER TABLE .. ADD/DROP PARTITION fails when the partition spec doesn't exactly match the partition schema: {code:sql} ALTER TABLE tab1 ADD PARTITION (A='9') Partition spec is invalid. The spec (a) must match the partition spec (a, b) defined in table '`dbx`.`tab1`'; org.apache.spark.sql.AnalysisException: Partition spec is invalid. The spec (a) must match the partition spec (a, b) defined in table '`dbx`.`tab1`'; at org.apache.spark.sql.catalyst.catalog.SessionCatalog.$anonfun$requireExactMatchedPartitionSpec$1(SessionCatalog.scala:1173) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.$anonfun$requireExactMatchedPartitionSpec$1$adapted(SessionCatalog.scala:1171) at scala.collection.immutable.List.foreach(List.scala:392) {code} for a table partitioned by "a", "b", but the V2 implementation silently adds the wrong partition. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
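For comparison, a fully specified spec names every partition column of the table from the example above; a minimal sketch, assuming an active SparkSession `spark` and illustrative partition values:
{code:scala}
// Fully specified spec: both partition columns (a, b) are given, so V1 accepts it.
spark.sql("ALTER TABLE tab1 ADD PARTITION (a = '9', b = '9')")

// Partial spec: rejected by V1 today and, per this ticket, should also be
// rejected by the V2 implementation instead of silently adding a wrong partition.
// spark.sql("ALTER TABLE tab1 ADD PARTITION (a = '9')")
{code}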
[jira] [Commented] (SPARK-33672) Check SQLContext.tables() for V2 session catalog
[ https://issues.apache.org/jira/browse/SPARK-33672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17244542#comment-17244542 ] Maxim Gekk commented on SPARK-33672: [~cloud_fan] FYI > Check SQLContext.tables() for V2 session catalog > > > Key: SPARK-33672 > URL: https://issues.apache.org/jira/browse/SPARK-33672 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > V1 ShowTablesCommand is hard coded in SQLContext: > https://github.com/apache/spark/blob/a088a801ed8c17171545c196a3f26ce415de0cd1/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L671 > The ticket aims to checks tables() behavior for V2 session catalog. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33672) Check SQLContext.tables() for V2 session catalog
Maxim Gekk created SPARK-33672: -- Summary: Check SQLContext.tables() for V2 session catalog Key: SPARK-33672 URL: https://issues.apache.org/jira/browse/SPARK-33672 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk The V1 ShowTablesCommand is hard-coded in SQLContext: https://github.com/apache/spark/blob/a088a801ed8c17171545c196a3f26ce415de0cd1/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L671 The ticket aims to check the tables() behavior for the V2 session catalog. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33671) Remove VIEW checks from V1 table commands
Maxim Gekk created SPARK-33671: -- Summary: Remove VIEW checks from V1 table commands Key: SPARK-33671 URL: https://issues.apache.org/jira/browse/SPARK-33671 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.2, 3.1.0 Reporter: Maxim Gekk Checking of VIEWs is performed earlier, see https://github.com/apache/spark/pull/30461 . So, the checks can be removed from some V1 commands. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33670) Verify the partition provider is Hive in v1 SHOW TABLE EXTENDED
Maxim Gekk created SPARK-33670: -- Summary: Verify the partition provider is Hive in v1 SHOW TABLE EXTENDED Key: SPARK-33670 URL: https://issues.apache.org/jira/browse/SPARK-33670 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.2, 3.1.0 Reporter: Maxim Gekk Invoke the check verifyPartitionProviderIsHive() from v1 implementation of SHOW TABLE EXTENDED. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33667) Respect case sensitivity in V1 SHOW PARTITIONS
[ https://issues.apache.org/jira/browse/SPARK-33667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-33667: --- Description: SHOW PARTITIONS is case sensitive, and doesn't respect the SQL config *spark.sql.caseSensitive* which is false by default, for instance: {code:sql} spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int) > USING parquet > PARTITIONED BY (year, month); spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1; spark-sql> SHOW PARTITIONS tbl1 PARTITION(YEAR = 2015, Month = 1); Error in query: Non-partitioning column(s) [YEAR, Month] are specified for SHOW PARTITIONS; {code} was: SHOW PARTITIONS is case sensitive, and doesn't respect the SQL config *spark.sql.caseSensitive* which is true by default, for instance: {code:sql} spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int) > USING parquet > PARTITIONED BY (year, month); spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1; spark-sql> SHOW PARTITIONS tbl1 PARTITION(YEAR = 2015, Month = 1); Error in query: Non-partitioning column(s) [YEAR, Month] are specified for SHOW PARTITIONS; {code} > Respect case sensitivity in V1 SHOW PARTITIONS > -- > > Key: SPARK-33667 > URL: https://issues.apache.org/jira/browse/SPARK-33667 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.8, 3.0.2, 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > SHOW PARTITIONS is case sensitive, and doesn't respect the SQL config > *spark.sql.caseSensitive* which is false by default, for instance: > {code:sql} > spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int) > > USING parquet > > PARTITIONED BY (year, month); > spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1; > spark-sql> SHOW PARTITIONS tbl1 PARTITION(YEAR = 2015, Month = 1); > Error in query: Non-partitioning column(s) [YEAR, Month] are specified for > SHOW PARTITIONS; > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
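The same reproduction in Scala form, assuming an active SparkSession `spark`; with spark.sql.caseSensitive left at its default of false, the upper-cased names should resolve to the partition columns once the command respects the config:
{code:scala}
spark.sql("SET spark.sql.caseSensitive = false")
spark.sql(
  """CREATE TABLE tbl1 (price INT, qty INT, year INT, month INT)
    |USING parquet
    |PARTITIONED BY (year, month)""".stripMargin)
spark.sql("INSERT INTO tbl1 PARTITION (year = 2015, month = 1) SELECT 1, 1")

// Expected after the fix: the partition is listed; currently this fails with
// "Non-partitioning column(s) [YEAR, Month] are specified".
spark.sql("SHOW PARTITIONS tbl1 PARTITION (YEAR = 2015, Month = 1)").show()
{code}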
[jira] [Created] (SPARK-33667) Respect case sensitivity in V1 SHOW PARTITIONS
Maxim Gekk created SPARK-33667: -- Summary: Respect case sensitivity in V1 SHOW PARTITIONS Key: SPARK-33667 URL: https://issues.apache.org/jira/browse/SPARK-33667 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.8, 3.0.2, 3.1.0 Reporter: Maxim Gekk SHOW PARTITIONS is case sensitive, and doesn't respect the SQL config *spark.sql.caseSensitive* which is true by default, for instance: {code:sql} spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int) > USING parquet > PARTITIONED BY (year, month); spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1; spark-sql> SHOW PARTITIONS tbl1 PARTITION(YEAR = 2015, Month = 1); Error in query: Non-partitioning column(s) [YEAR, Month] are specified for SHOW PARTITIONS; {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243441#comment-17243441 ] Maxim Gekk commented on SPARK-33571: I opened the PR [https://github.com/apache/spark/pull/30596] with some improvements for config docs. [~hyukjin.kwon] [~cloud_fan] could you review it, please. > Handling of hybrid to proleptic calendar when reading and writing Parquet > data not working correctly > > > Key: SPARK-33571 > URL: https://issues.apache.org/jira/browse/SPARK-33571 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 3.0.0, 3.0.1 >Reporter: Simon >Priority: Major > > The handling of old dates written with older Spark versions (<2.4.6) using > the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working > correctly. > From what I understand it should work like this: > * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before > 1900-01-01T00:00:00Z > * Only applies when reading or writing parquet files > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time a > `SparkUpgradeException` should be raised informing the user to choose either > `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInRead` > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time and > `datetimeRebaseModeInRead` is set to `LEGACY` the dates and timestamps should > show the same values in Spark 3.0.1. with for example `df.show()` as they did > in Spark 2.4.5 > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time and > `datetimeRebaseModeInRead` is set to `CORRECTED` the dates and timestamps > should show different values in Spark 3.0.1. with for example `df.show()` as > they did in Spark 2.4.5 > * When writing parqet files with Spark > 3.0.0 which contain dates or > timestamps before the above mentioned moment in time a > `SparkUpgradeException` should be raised informing the user to choose either > `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInWrite` > First of all I'm not 100% sure all of this is correct. I've been unable to > find any clear documentation on the expected behavior. The understanding I > have was pieced together from the mailing list > ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html)] > the blog post linked there and looking at the Spark code. > From our testing we're seeing several issues: > * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. > that contains fields of type `TimestampType` which contain timestamps before > the above mentioned moments in time without `datetimeRebaseModeInRead` set > doesn't raise the `SparkUpgradeException`, it succeeds without any changes to > the resulting dataframe compares to that dataframe in Spark 2.4.5 > * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. > that contains fields of type `TimestampType` or `DateType` which contain > dates or timestamps before the above mentioned moments in time with > `datetimeRebaseModeInRead` set to `LEGACY` results in the same values in the > dataframe as when using `CORRECTED`, so it seems like no rebasing is > happening. > I've made some scripts to help with testing/show the behavior, it uses > pyspark 2.4.5, 2.4.6 and 3.0.1. 
You can find them here > [https://github.com/simonvanderveldt/spark3-rebasemode-issue]. I'll post the > outputs in a comment below as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
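For readers following this thread, a minimal sketch of setting the configs discussed here explicitly (an active SparkSession `spark` is assumed, the paths are illustrative, and the INT96-specific config is assumed to exist only from Spark 3.1 onwards):
{code:scala}
// Read side: decide how ancient dates/timestamps written by Spark 2.x are
// interpreted instead of relying on the SparkUpgradeException prompt.
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "LEGACY") // Spark 3.1+

val oldData = spark.read.parquet("/path/to/spark-2.4-output") // illustrative path
oldData.show(false)

// Write side counterpart, for files that legacy readers must interpret correctly.
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")
{code}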
[jira] [Created] (SPARK-33650) Misleading error from ALTER TABLE .. PARTITION for non-supported partition management table
Maxim Gekk created SPARK-33650: -- Summary: Misleading error from ALTER TABLE .. PARTITION for non-supported partition management table Key: SPARK-33650 URL: https://issues.apache.org/jira/browse/SPARK-33650 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk For a V2 table that doesn't support partition management, ALTER TABLE .. ADD/DROP PARTITION throws misleading exception: {code:java} PartitionSpecs are not resolved;; 'AlterTableAddPartition [UnresolvedPartitionSpec(Map(id -> 1),None)], false +- ResolvedTable org.apache.spark.sql.connector.InMemoryTableCatalog@2fd64b11, ns1.ns2.tbl, org.apache.spark.sql.connector.InMemoryTable@5d3ff859 org.apache.spark.sql.AnalysisException: PartitionSpecs are not resolved;; 'AlterTableAddPartition [UnresolvedPartitionSpec(Map(id -> 1),None)], false +- ResolvedTable org.apache.spark.sql.connector.InMemoryTableCatalog@2fd64b11, ns1.ns2.tbl, org.apache.spark.sql.connector.InMemoryTable@5d3ff859 at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:50) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:49) {code} The error should say that the table doesn't support partition management. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
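One possible shape of the improvement, sketched outside Spark for illustration (inside Spark the check would live in the analyzer and raise an AnalysisException; the helper name is hypothetical):
{code:scala}
import org.apache.spark.sql.connector.catalog.{SupportsPartitionManagement, Table}

// Fail early with a message that names the real problem instead of the
// generic "PartitionSpecs are not resolved" error.
def checkSupportsPartitions(table: Table): SupportsPartitionManagement = table match {
  case p: SupportsPartitionManagement => p
  case _ =>
    throw new UnsupportedOperationException(
      s"Table ${table.name} does not support partition management, " +
        "so ALTER TABLE .. ADD/DROP PARTITION cannot be applied to it")
}
{code}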
[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241408#comment-17241408 ] Maxim Gekk commented on SPARK-33571: [~simonvanderveldt] Looking at the dates, you tested, both dates 1880-10-01 and 2020-10-01 belong to the Gregorian calendar, so, should be no diffs. For the date 0220-10-01, please, have a look at the table which I built in the PR: https://github.com/apache/spark/pull/28067 . The table shows that there is no diffs between 2 calendars for the year. > Handling of hybrid to proleptic calendar when reading and writing Parquet > data not working correctly > > > Key: SPARK-33571 > URL: https://issues.apache.org/jira/browse/SPARK-33571 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 3.0.0, 3.0.1 >Reporter: Simon >Priority: Major > > The handling of old dates written with older Spark versions (<2.4.6) using > the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working > correctly. > From what I understand it should work like this: > * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before > 1900-01-01T00:00:00Z > * Only applies when reading or writing parquet files > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time a > `SparkUpgradeException` should be raised informing the user to choose either > `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInRead` > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time and > `datetimeRebaseModeInRead` is set to `LEGACY` the dates and timestamps should > show the same values in Spark 3.0.1. with for example `df.show()` as they did > in Spark 2.4.5 > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time and > `datetimeRebaseModeInRead` is set to `CORRECTED` the dates and timestamps > should show different values in Spark 3.0.1. with for example `df.show()` as > they did in Spark 2.4.5 > * When writing parqet files with Spark > 3.0.0 which contain dates or > timestamps before the above mentioned moment in time a > `SparkUpgradeException` should be raised informing the user to choose either > `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInWrite` > First of all I'm not 100% sure all of this is correct. I've been unable to > find any clear documentation on the expected behavior. The understanding I > have was pieced together from the mailing list > ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html)] > the blog post linked there and looking at the Spark code. > From our testing we're seeing several issues: > * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. > that contains fields of type `TimestampType` which contain timestamps before > the above mentioned moments in time without `datetimeRebaseModeInRead` set > doesn't raise the `SparkUpgradeException`, it succeeds without any changes to > the resulting dataframe compares to that dataframe in Spark 2.4.5 > * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. 
> that contains fields of type `TimestampType` or `DateType` which contain > dates or timestamps before the above mentioned moments in time with > `datetimeRebaseModeInRead` set to `LEGACY` results in the same values in the > dataframe as when using `CORRECTED`, so it seems like no rebasing is > happening. > I've made some scripts to help with testing/show the behavior, it uses > pyspark 2.4.5, 2.4.6 and 3.0.1. You can find them here > [https://github.com/simonvanderveldt/spark3-rebasemode-issue]. I'll post the > outputs in a comment below as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241400#comment-17241400 ] Maxim Gekk commented on SPARK-33571: Spark 3.0.1 shows different results as well: {code:scala} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.0.1 /_/ Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_275) scala> spark.read.parquet("/Users/maximgekk/proj/parquet-read-2_4_5_files/sql/core/src/test/resources/test-data/before_1582_date_v2_4_5.snappy.parquet").show(false) 20/12/01 12:31:59 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during reading. Or set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the datetime values as it is. scala> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY") scala> spark.read.parquet("/Users/maximgekk/proj/parquet-read-2_4_5_files/sql/core/src/test/resources/test-data/before_1582_date_v2_4_5.snappy.parquet").show(false) +--+--+ |dict |plain | +--+--+ |1001-01-01|1001-01-01| |1001-01-01|1001-01-02| |1001-01-01|1001-01-03| |1001-01-01|1001-01-04| |1001-01-01|1001-01-05| |1001-01-01|1001-01-06| |1001-01-01|1001-01-07| |1001-01-01|1001-01-08| +--+--+ scala> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED") scala> spark.read.parquet("/Users/maximgekk/proj/parquet-read-2_4_5_files/sql/core/src/test/resources/test-data/before_1582_date_v2_4_5.snappy.parquet").show(false) +--+--+ |dict |plain | +--+--+ |1001-01-07|1001-01-07| |1001-01-07|1001-01-08| |1001-01-07|1001-01-09| |1001-01-07|1001-01-10| |1001-01-07|1001-01-11| |1001-01-07|1001-01-12| |1001-01-07|1001-01-13| |1001-01-07|1001-01-14| +--+--+ {code} > Handling of hybrid to proleptic calendar when reading and writing Parquet > data not working correctly > > > Key: SPARK-33571 > URL: https://issues.apache.org/jira/browse/SPARK-33571 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 3.0.0, 3.0.1 >Reporter: Simon >Priority: Major > > The handling of old dates written with older Spark versions (<2.4.6) using > the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working > correctly. 
> From what I understand it should work like this: > * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before > 1900-01-01T00:00:00Z > * Only applies when reading or writing parquet files > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time a > `SparkUpgradeException` should be raised informing the user to choose either > `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInRead` > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time and > `datetimeRebaseModeInRead` is set to `LEGACY` the dates and timestamps should > show the same values in Spark 3.0.1. with for example `df.show()` as they did > in Spark 2.4.5 > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time and > `datetimeRebaseModeInRead` is set to `CORRECTED` the dates and timestamps > should show different values in Spark 3.0.1. with for example `df.show()` as > they did in Spark 2.4.5 > * When writing parqet files with Spark > 3.0.0 which contain dates or > timestamps before the above mentioned moment in time a > `SparkUpgradeException` should be raised informing the user to choose either > `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInWrite` > First of all I'm not 100% sure all of this is correct. I've been unable to > find any clear documentation on the expected behavior. The understanding I > have was pieced together from the mailing list > ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html)] > the blog po
[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241379#comment-17241379 ] Maxim Gekk commented on SPARK-33571: I have tried to reproduce the issue on the master branch by reading the file saved by Spark 2.4.5 (https://github.com/apache/spark/tree/master/sql/core/src/test/resources/test-data): {code:scala} test("SPARK-33571: read ancient dates saved by Spark 2.4.5") { withSQLConf(SQLConf.LEGACY_PARQUET_REBASE_MODE_IN_READ.key -> LEGACY.toString) { val path = getResourceParquetFilePath("test-data/before_1582_date_v2_4_5.snappy.parquet") val df = spark.read.parquet(path) df.show(false) } withSQLConf(SQLConf.LEGACY_PARQUET_REBASE_MODE_IN_READ.key -> CORRECTED.toString) { val path = getResourceParquetFilePath("test-data/before_1582_date_v2_4_5.snappy.parquet") val df = spark.read.parquet(path) df.show(false) } } {code} The results are different in LEGACY and in CORRECTED modes: {code} +--+--+ |dict |plain | +--+--+ |1001-01-01|1001-01-01| |1001-01-01|1001-01-02| |1001-01-01|1001-01-03| |1001-01-01|1001-01-04| |1001-01-01|1001-01-05| |1001-01-01|1001-01-06| |1001-01-01|1001-01-07| |1001-01-01|1001-01-08| +--+--+ +--+--+ |dict |plain | +--+--+ |1001-01-07|1001-01-07| |1001-01-07|1001-01-08| |1001-01-07|1001-01-09| |1001-01-07|1001-01-10| |1001-01-07|1001-01-11| |1001-01-07|1001-01-12| |1001-01-07|1001-01-13| |1001-01-07|1001-01-14| +--+--+ {code} > Handling of hybrid to proleptic calendar when reading and writing Parquet > data not working correctly > > > Key: SPARK-33571 > URL: https://issues.apache.org/jira/browse/SPARK-33571 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 3.0.0, 3.0.1 >Reporter: Simon >Priority: Major > > The handling of old dates written with older Spark versions (<2.4.6) using > the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working > correctly. > From what I understand it should work like this: > * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before > 1900-01-01T00:00:00Z > * Only applies when reading or writing parquet files > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time a > `SparkUpgradeException` should be raised informing the user to choose either > `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInRead` > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time and > `datetimeRebaseModeInRead` is set to `LEGACY` the dates and timestamps should > show the same values in Spark 3.0.1. with for example `df.show()` as they did > in Spark 2.4.5 > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time and > `datetimeRebaseModeInRead` is set to `CORRECTED` the dates and timestamps > should show different values in Spark 3.0.1. with for example `df.show()` as > they did in Spark 2.4.5 > * When writing parqet files with Spark > 3.0.0 which contain dates or > timestamps before the above mentioned moment in time a > `SparkUpgradeException` should be raised informing the user to choose either > `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInWrite` > First of all I'm not 100% sure all of this is correct. I've been unable to > find any clear documentation on the expected behavior. 
The understanding I > have was pieced together from the mailing list > ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html)] > the blog post linked there and looking at the Spark code. > From our testing we're seeing several issues: > * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. > that contains fields of type `TimestampType` which contain timestamps before > the above mentioned moments in time without `datetimeRebaseModeInRead` set > doesn't raise the `SparkUpgradeException`, it succeeds without any changes to > the resulting dataframe compares to that dataframe in Spark 2.4.5 > * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. > that contains fields of type `TimestampType` or `DateType` which contain > dates or timestamps before the above mentioned moments in time with > `datetimeRebaseModeInRead` set to `LEGACY` results in the same values in the > dataframe as when using `CORRECTED`, so it seems like no rebasing is > happening. > I've ma
[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241339#comment-17241339 ] Maxim Gekk commented on SPARK-33571: [~simonvanderveldt] Thank you for the detailed description and your investigation. Let me clarify a few things: > From our testing we're seeing several issues: > Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. that > contains fields of type `TimestampType` which contain timestamps before the > above mentioned moments in time without `datetimeRebaseModeInRead` set > doesn't raise the `SparkUpgradeException`, it succeeds without any changes to > the resulting dataframe compares to that dataframe in Spark 2.4.5 Spark 2.4.5 writes timestamps as parquet INT96 type. The SQL config `datetimeRebaseModeInRead` does not influence on reading such types in Spark 3.0.1, so, Spark performs rebasing always (LEGACY mode). We recently added separate configs for INT96: * https://github.com/apache/spark/pull/30056 * https://github.com/apache/spark/pull/30121 The changes will be released with Spark 3.1.0. > Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. that > contains fields of type `TimestampType` or `DateType` which contain dates or > timestamps before the above mentioned moments in time with > `datetimeRebaseModeInRead` set to `LEGACY` results in the same values in the > dataframe as when using `CORRECTED`, so it seems like no rebasing is > happening. For INT96, it seems it is correct behavior. We should observe different results for TIMESTAMP_MICROS and TIMESTAMP_MILLIS types, see the SQL config spark.sql.parquet.outputTimestampType. The DATE case is more interesting as we must see a difference in results for ancient dates. I will investigate this case. > Handling of hybrid to proleptic calendar when reading and writing Parquet > data not working correctly > > > Key: SPARK-33571 > URL: https://issues.apache.org/jira/browse/SPARK-33571 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 3.0.0, 3.0.1 >Reporter: Simon >Priority: Major > > The handling of old dates written with older Spark versions (<2.4.6) using > the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working > correctly. > From what I understand it should work like this: > * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before > 1900-01-01T00:00:00Z > * Only applies when reading or writing parquet files > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time a > `SparkUpgradeException` should be raised informing the user to choose either > `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInRead` > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time and > `datetimeRebaseModeInRead` is set to `LEGACY` the dates and timestamps should > show the same values in Spark 3.0.1. with for example `df.show()` as they did > in Spark 2.4.5 > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time and > `datetimeRebaseModeInRead` is set to `CORRECTED` the dates and timestamps > should show different values in Spark 3.0.1. 
with for example `df.show()` as > they did in Spark 2.4.5 > * When writing parqet files with Spark > 3.0.0 which contain dates or > timestamps before the above mentioned moment in time a > `SparkUpgradeException` should be raised informing the user to choose either > `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInWrite` > First of all I'm not 100% sure all of this is correct. I've been unable to > find any clear documentation on the expected behavior. The understanding I > have was pieced together from the mailing list > ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html)] > the blog post linked there and looking at the Spark code. > From our testing we're seeing several issues: > * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. > that contains fields of type `TimestampType` which contain timestamps before > the above mentioned moments in time without `datetimeRebaseModeInRead` set > doesn't raise the `SparkUpgradeException`, it succeeds without any changes to > the resulting dataframe compares to that dataframe in Spark 2.4.5 > * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. > that contains fields of type `TimestampType` or `DateType` which contain > dates or timestamps before the
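To check whether the read-side rebasing actually changes anything for a given file, the same data can be loaded under both modes and compared. A minimal sketch (the config key is assumed to be the one behind SQLConf.LEGACY_PARQUET_REBASE_MODE_IN_READ used in the test above; the path is a placeholder). As noted in the comment above, for INT96 timestamps written by Spark 2.4 this comparison shows no difference in 3.0.1, because the config does not apply to INT96 there:
{code:scala}
import org.apache.spark.sql.{Row, SparkSession}

// Illustrative helper: materialize the same parquet path under a given rebase mode.
// collect() runs while the mode is in effect, so each result reflects its own mode.
def readWithRebaseMode(spark: SparkSession, path: String, mode: String): Array[Row] = {
  spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", mode)
  spark.read.parquet(path).collect()
}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val path = "/path/to/data_written_by_spark_2_4"  // placeholder

val legacy    = readWithRebaseMode(spark, path, "LEGACY")
val corrected = readWithRebaseMode(spark, path, "CORRECTED")
// A difference here means the rebase mode matters for this file (expected for ancient
// DATE and TIMESTAMP_MICROS/MILLIS values, but not for INT96 in Spark 3.0.x).
println(s"rebase mode changes the values: ${!legacy.sameElements(corrected)}")
{code}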
[jira] [Updated] (SPARK-33591) NULL is recognized as the "null" string in partition specs
[ https://issues.apache.org/jira/browse/SPARK-33591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-33591: --- Issue Type: Bug (was: Improvement) > NULL is recognized as the "null" string in partition specs > -- > > Key: SPARK-33591 > URL: https://issues.apache.org/jira/browse/SPARK-33591 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > For example: > {code:sql} > spark-sql> CREATE TABLE tbl5 (col1 INT, p1 STRING) USING PARQUET PARTITIONED > BY (p1); > spark-sql> INSERT INTO TABLE tbl5 PARTITION (p1 = null) SELECT 0; > spark-sql> SELECT isnull(p1) FROM tbl5; > false > {code} > The *p1 = null* is not recognized as a partition with NULL value. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33591) NULL is recognized as the "null" string in partition specs
[ https://issues.apache.org/jira/browse/SPARK-33591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-33591: --- Issue Type: Improvement (was: Bug) > NULL is recognized as the "null" string in partition specs > -- > > Key: SPARK-33591 > URL: https://issues.apache.org/jira/browse/SPARK-33591 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > For example: > {code:sql} > spark-sql> CREATE TABLE tbl5 (col1 INT, p1 STRING) USING PARQUET PARTITIONED > BY (p1); > spark-sql> INSERT INTO TABLE tbl5 PARTITION (p1 = null) SELECT 0; > spark-sql> SELECT isnull(p1) FROM tbl5; > false > {code} > The *p1 = null* is not recognized as a partition with NULL value. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33591) NULL is recognized as the "null" string in partition specs
Maxim Gekk created SPARK-33591: -- Summary: NULL is recognized as the "null" string in partition specs Key: SPARK-33591 URL: https://issues.apache.org/jira/browse/SPARK-33591 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk For example: {code:sql} spark-sql> CREATE TABLE tbl5 (col1 INT, p1 STRING) USING PARQUET PARTITIONED BY (p1); spark-sql> INSERT INTO TABLE tbl5 PARTITION (p1 = null) SELECT 0; spark-sql> SELECT isnull(p1) FROM tbl5; false {code} The *p1 = null* is not recognized as a partition with NULL value. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33585) The comment for SQLContext.tables() doesn't mention the `database` column
[ https://issues.apache.org/jira/browse/SPARK-33585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-33585: --- Affects Version/s: (was: 2.4.7) (was: 3.0.1) 3.0.2 2.4.8 > The comment for SQLContext.tables() doesn't mention the `database` column > - > > Key: SPARK-33585 > URL: https://issues.apache.org/jira/browse/SPARK-33585 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 2.4.8, 3.0.2, 3.1.0 >Reporter: Maxim Gekk >Priority: Minor > > The comment says: "The returned DataFrame has two columns, tableName and > isTemporary": > https://github.com/apache/spark/blob/b26ae98407c6c017a4061c0c420f48685ddd6163/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L664 > but actually the dataframe has 3 columns: > {code:scala} > scala> spark.range(10).createOrReplaceTempView("view1") > scala> val tables = spark.sqlContext.tables() > tables: org.apache.spark.sql.DataFrame = [database: string, tableName: string > ... 1 more field] > scala> tables.printSchema > root > |-- database: string (nullable = false) > |-- tableName: string (nullable = false) > |-- isTemporary: boolean (nullable = false) > scala> tables.show > ++-+---+ > |database|tableName|isTemporary| > ++-+---+ > | default| t1| false| > | default| t2| false| > | default| ymd| false| > ||view1| true| > ++-+---+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33588) Partition spec in SHOW TABLE EXTENDED doesn't respect `spark.sql.caseSensitive`
Maxim Gekk created SPARK-33588: -- Summary: Partition spec in SHOW TABLE EXTENDED doesn't respect `spark.sql.caseSensitive` Key: SPARK-33588 URL: https://issues.apache.org/jira/browse/SPARK-33588 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.8, 3.0.2, 3.1.0 Reporter: Maxim Gekk For example: {code:sql} spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int) > USING parquet > partitioned by (year, month); spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1; spark-sql> SHOW TABLE EXTENDED LIKE 'tbl1' PARTITION(YEAR = 2015, Month = 1); Error in query: Partition spec is invalid. The spec (YEAR, Month) must match the partition spec (year, month) defined in table '`default`.`tbl1`'; {code} The spark.sql.caseSensitive flag is false by default, so the partition spec should be considered valid. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
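The same reproduction in Scala, with the case-sensitivity flag set explicitly (a sketch assuming a SparkSession named spark); the ticket's point is that with the flag off, the differently-cased spec should be accepted:
{code:scala}
// Reproduction sketch: spark.sql.caseSensitive is false, yet the partition spec
// written as (YEAR, Month) is rejected instead of being matched case-insensitively.
spark.conf.set("spark.sql.caseSensitive", "false")
spark.sql("CREATE TABLE tbl1 (price INT, qty INT, year INT, month INT) USING parquet PARTITIONED BY (year, month)")
spark.sql("INSERT INTO tbl1 PARTITION (year = 2015, month = 1) SELECT 1, 1")
spark.sql("SHOW TABLE EXTENDED LIKE 'tbl1' PARTITION (YEAR = 2015, Month = 1)").show(false)
{code}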
[jira] [Created] (SPARK-33585) The comment for SQLContext.tables() doesn't mention the `database` column
Maxim Gekk created SPARK-33585: -- Summary: The comment for SQLContext.tables() doesn't mention the `database` column Key: SPARK-33585 URL: https://issues.apache.org/jira/browse/SPARK-33585 Project: Spark Issue Type: Documentation Components: SQL Affects Versions: 3.0.1, 2.4.7, 3.1.0 Reporter: Maxim Gekk The comment says: "The returned DataFrame has two columns, tableName and isTemporary": https://github.com/apache/spark/blob/b26ae98407c6c017a4061c0c420f48685ddd6163/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L664 but actually the dataframe has 3 columns:
{code:scala}
scala> spark.range(10).createOrReplaceTempView("view1")

scala> val tables = spark.sqlContext.tables()
tables: org.apache.spark.sql.DataFrame = [database: string, tableName: string ... 1 more field]

scala> tables.printSchema
root
 |-- database: string (nullable = false)
 |-- tableName: string (nullable = false)
 |-- isTemporary: boolean (nullable = false)

scala> tables.show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| default|       t1|      false|
| default|       t2|      false|
| default|      ymd|      false|
|        |    view1|       true|
+--------+---------+-----------+
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33569) Remove getting partitions by only ident
Maxim Gekk created SPARK-33569: -- Summary: Remove getting partitions by only ident Key: SPARK-33569 URL: https://issues.apache.org/jira/browse/SPARK-33569 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk This is a follow-up of SPARK-33509, which added a function for getting partitions by names and ident. The function that gets partitions only by ident is not used anymore and can be removed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33558) Unify v1 and v2 ALTER TABLE .. PARTITION tests
Maxim Gekk created SPARK-33558: -- Summary: Unify v1 and v2 ALTER TABLE .. PARTITION tests Key: SPARK-33558 URL: https://issues.apache.org/jira/browse/SPARK-33558 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk Extract ALTER TABLE .. PARTITION tests to a common place to run them for V1 and V2 datasources. Some tests can be placed in V1- and V2-specific test suites. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33529) Resolver of V2 partition specs doesn't handle __HIVE_DEFAULT_PARTITION__
[ https://issues.apache.org/jira/browse/SPARK-33529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-33529: --- Description: The partition value '__HIVE_DEFAULT_PARTITION__' should be handled as null - the same as DSv1 does. For example in DSv1: {code:java} spark-sql> CREATE TABLE tbl11 (id int, part0 string) USING parquet PARTITIONED BY (part0); spark-sql> ALTER TABLE tbl11 ADD PARTITION (part0 = '__HIVE_DEFAULT_PARTITION__'); spark-sql> INSERT INTO tbl11 PARTITION (part0='__HIVE_DEFAULT_PARTITION__') SELECT 1; spark-sql> SELECT * FROM tbl11; 1 NULL {code} was:The partition value '__HIVE_DEFAULT_PARTITION__' should be handled as null - the same as DSv1 does. > Resolver of V2 partition specs doesn't handle __HIVE_DEFAULT_PARTITION__ > > > Key: SPARK-33529 > URL: https://issues.apache.org/jira/browse/SPARK-33529 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > The partition value '__HIVE_DEFAULT_PARTITION__' should be handled as null - > the same as DSv1 does. > For example in DSv1: > {code:java} > spark-sql> CREATE TABLE tbl11 (id int, part0 string) USING parquet > PARTITIONED BY (part0); > spark-sql> ALTER TABLE tbl11 ADD PARTITION (part0 = > '__HIVE_DEFAULT_PARTITION__'); > spark-sql> INSERT INTO tbl11 PARTITION (part0='__HIVE_DEFAULT_PARTITION__') > SELECT 1; > spark-sql> SELECT * FROM tbl11; > 1 NULL > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33529) Resolver of V2 partition specs doesn't handle __HIVE_DEFAULT_PARTITION__
Maxim Gekk created SPARK-33529: -- Summary: Resolver of V2 partition specs doesn't handle __HIVE_DEFAULT_PARTITION__ Key: SPARK-33529 URL: https://issues.apache.org/jira/browse/SPARK-33529 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk The partition value '__HIVE_DEFAULT_PARTITION__' should be handled as null - the same as DSv1 does. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33521) Universal type conversion of V2 partition values
Maxim Gekk created SPARK-33521: -- Summary: Universal type conversion of V2 partition values Key: SPARK-33521 URL: https://issues.apache.org/jira/browse/SPARK-33521 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk Support other types while resolving partition specs in https://github.com/apache/spark/blob/23e9920b3910e4f05269853429c7f1cdc7b5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolvePartitionSpec.scala#L72 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
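A minimal sketch of the idea (illustrative only, not the actual ResolvePartitionSpec change): cast each raw partition value to the type declared in the table's partition schema, instead of supporting only a couple of hard-coded types.
{code:scala}
import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
import org.apache.spark.sql.types.{DataType, DateType, IntegerType, StructType}

// Illustrative conversion of a user-supplied partition value (a string in the spec)
// to the Catalyst value of the corresponding partition column type.
def convertPartitionValue(raw: String, dataType: DataType): Any =
  Cast(Literal(raw), dataType, Some("UTC")).eval()

// e.g. for a table partitioned by (id INT, dt DATE):
val partSchema = new StructType().add("id", IntegerType).add("dt", DateType)
val converted = Seq("id" -> "1", "dt" -> "2020-01-01").map { case (name, value) =>
  name -> convertPartitionValue(value, partSchema(name).dataType)
}
{code}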
[jira] [Created] (SPARK-33511) Respect case sensitivity in resolving partition specs V2
Maxim Gekk created SPARK-33511: -- Summary: Respect case sensitivity in resolving partition specs V2 Key: SPARK-33511 URL: https://issues.apache.org/jira/browse/SPARK-33511 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk DSv1 DDL commands respect the SQL config spark.sql.caseSensitive, for example {code:java} spark-sql> CREATE TABLE tbl1 (id bigint, data string) USING parquet PARTITIONED BY (id); spark-sql> ALTER TABLE tbl1 ADD PARTITION (ID=1); spark-sql> SHOW PARTITIONS tbl1; id=1 {code} but the same ALTER TABLE command fails on DSv2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
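One possible shape of the fix, sketched below (this uses the session resolver from SQLConf and is not the actual patch): match user-supplied partition column names against the table's partition schema with the resolver instead of exact string equality.
{code:scala}
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.{IntegerType, StructType}

// Illustrative name matching that honors spark.sql.caseSensitive via SQLConf.get.resolver.
val partSchema = new StructType().add("id", IntegerType)
val resolver = SQLConf.get.resolver

def resolvePartitionColumn(userName: String): Option[String] =
  partSchema.fieldNames.find(fieldName => resolver(fieldName, userName))

resolvePartitionColumn("ID")  // Some("id") when spark.sql.caseSensitive is false
{code}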
[jira] [Created] (SPARK-33509) List partition by names from V2 tables that support partition management
Maxim Gekk created SPARK-33509: -- Summary: List partition by names from V2 tables that support partition management Key: SPARK-33509 URL: https://issues.apache.org/jira/browse/SPARK-33509 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk Currently, the SupportsPartitionManagement interface exposes only the listPartitionIdentifiers() method, which does not allow listing partitions by names. So, it is hard to implement: {code:java} SHOW PARTITIONS table PARTITION(month=2) {code} against a table like: {code:java} CREATE TABLE $table (price int, qty int, year int, month int) USING parquet partitioned by (year, month) {code} because listPartitionIdentifiers() requires specifying a value for *year* as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
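One possible shape of the missing API (the trait and method names below are illustrative assumptions, not necessarily what was merged): let the caller supply values for only a named subset of the partition columns.
{code:scala}
import org.apache.spark.sql.catalyst.InternalRow

// Hypothetical addition to partition management: list partitions that match values for
// only the named partition columns, e.g. names = Array("month"), ident = InternalRow(2)
// for `SHOW PARTITIONS table PARTITION(month=2)`.
trait SupportsPartialPartitionListing {
  def listPartitionByNames(names: Array[String], ident: InternalRow): Array[InternalRow]
}
{code}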
[jira] [Created] (SPARK-33505) Fix insert into `InMemoryPartitionTable`
Maxim Gekk created SPARK-33505: -- Summary: Fix insert into `InMemoryPartitionTable` Key: SPARK-33505 URL: https://issues.apache.org/jira/browse/SPARK-33505 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk Currently, INSERT INTO a partitioned table in the V2 in-memory catalog doesn't create partitions. The example below demonstrates the issue:
{code:scala}
test("insert into partitioned table") {
  val t = "testpart.ns1.ns2.tbl"
  withTable(t) {
    spark.sql(
      s"""
         |CREATE TABLE $t (id bigint, name string, data string)
         |USING foo
         |PARTITIONED BY (id, name)""".stripMargin)
    spark.sql(s"INSERT INTO $t PARTITION(id = 1, name = 'Max') SELECT 'abc'")

    val partTable = catalog("testpart").asTableCatalog
      .loadTable(Identifier.of(Array("ns1", "ns2"), "tbl")).asInstanceOf[InMemoryPartitionTable]
    assert(partTable.partitionExists(InternalRow.fromSeq(Seq(1, UTF8String.fromString("Max")))))
  }
}
{code}
The partitionExists() function returns false for the partitions that should have been created. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33453) Unify v1 and v2 SHOW PARTITIONS tests
Maxim Gekk created SPARK-33453: -- Summary: Unify v1 and v2 SHOW PARTITIONS tests Key: SPARK-33453 URL: https://issues.apache.org/jira/browse/SPARK-33453 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk Gather common tests for the DSv1 and DSv2 SHOW PARTITIONS command into a common test trait. Mix this trait into datasource-specific test suites. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33452) Create a V2 SHOW PARTITIONS execution node
Maxim Gekk created SPARK-33452: -- Summary: Create a V2 SHOW PARTITIONS execution node Key: SPARK-33452 URL: https://issues.apache.org/jira/browse/SPARK-33452 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk There is the V1 SHOW PARTITIONS implementation: https://github.com/apache/spark/blob/7e99fcd64efa425f3c985df4fe957a3be274a49a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L975 The ticket aims to add V2 implementation with similar behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33452) Create a V2 SHOW PARTITIONS execution node
[ https://issues.apache.org/jira/browse/SPARK-33452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232095#comment-17232095 ] Maxim Gekk commented on SPARK-33452: I plan to work on this soon. > Create a V2 SHOW PARTITIONS execution node > -- > > Key: SPARK-33452 > URL: https://issues.apache.org/jira/browse/SPARK-33452 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > There is the V1 SHOW PARTITIONS implementation: > https://github.com/apache/spark/blob/7e99fcd64efa425f3c985df4fe957a3be274a49a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L975 > The ticket aims to add V2 implementation with similar behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33393) Support SHOW TABLE EXTENDED in DSv2
[ https://issues.apache.org/jira/browse/SPARK-33393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232094#comment-17232094 ] Maxim Gekk commented on SPARK-33393: I plan to work on this soon. > Support SHOW TABLE EXTENDED in DSv2 > --- > > Key: SPARK-33393 > URL: https://issues.apache.org/jira/browse/SPARK-33393 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > Current implementation of DSv2 SHOW TABLE doesn't support the EXTENDED mode > in: > https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowTablesExec.scala#L33 > which is supported in DSv1: > https://github.com/apache/spark/blob/7e99fcd64efa425f3c985df4fe957a3be274a49a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L870 > Need to add the same functionality to ShowTablesExec. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33430) Support namespaces in JDBC v2 Table Catalog
[ https://issues.apache.org/jira/browse/SPARK-33430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17230514#comment-17230514 ] Maxim Gekk commented on SPARK-33430: [~cloud_fan] [~huaxingao] It would be nice to support namespaces, WDYT? > Support namespaces in JDBC v2 Table Catalog > --- > > Key: SPARK-33430 > URL: https://issues.apache.org/jira/browse/SPARK-33430 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > When I extend JDBCTableCatalogSuite by > org.apache.spark.sql.execution.command.v2.ShowTablesSuite, for instance: > {code:scala} > import org.apache.spark.sql.execution.command.v2.ShowTablesSuite > class JDBCTableCatalogSuite extends ShowTablesSuite { > override def version: String = "JDBC V2" > override def catalog: String = "h2" > ... > {code} > some tests from JDBCTableCatalogSuite fail with: > {code} > [info] - SHOW TABLES JDBC V2: show an existing table *** FAILED *** (2 > seconds, 502 milliseconds) > [info] org.apache.spark.sql.AnalysisException: Cannot use catalog h2: does > not support namespaces; > [info] at > org.apache.spark.sql.connector.catalog.CatalogV2Implicits$CatalogHelper.asNamespaceCatalog(CatalogV2Implicits.scala:83) > [info] at > org.apache.spark.sql.catalyst.analysis.ResolveCatalogs$$anonfun$apply$1.applyOrElse(ResolveCatalogs.scala:208) > [info] at > org.apache.spark.sql.catalyst.analysis.ResolveCatalogs$$anonfun$apply$1.applyOrElse(ResolveCatalogs.scala:34) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33430) Support namespaces in JDBC v2 Table Catalog
Maxim Gekk created SPARK-33430: -- Summary: Support namespaces in JDBC v2 Table Catalog Key: SPARK-33430 URL: https://issues.apache.org/jira/browse/SPARK-33430 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk When I extend JDBCTableCatalogSuite by org.apache.spark.sql.execution.command.v2.ShowTablesSuite, for instance: {code:scala} import org.apache.spark.sql.execution.command.v2.ShowTablesSuite class JDBCTableCatalogSuite extends ShowTablesSuite { override def version: String = "JDBC V2" override def catalog: String = "h2" ... {code} some tests from JDBCTableCatalogSuite fail with: {code} [info] - SHOW TABLES JDBC V2: show an existing table *** FAILED *** (2 seconds, 502 milliseconds) [info] org.apache.spark.sql.AnalysisException: Cannot use catalog h2: does not support namespaces; [info] at org.apache.spark.sql.connector.catalog.CatalogV2Implicits$CatalogHelper.asNamespaceCatalog(CatalogV2Implicits.scala:83) [info] at org.apache.spark.sql.catalyst.analysis.ResolveCatalogs$$anonfun$apply$1.applyOrElse(ResolveCatalogs.scala:208) [info] at org.apache.spark.sql.catalyst.analysis.ResolveCatalogs$$anonfun$apply$1.applyOrElse(ResolveCatalogs.scala:34) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
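Once the JDBC v2 catalog implements SupportsNamespaces, namespace-level commands like the following should work against it (a sketch; `h2` is the catalog name used in the test suite above):
{code:scala}
// Expected to work once the catalog supports namespaces (illustrative).
spark.sql("SHOW NAMESPACES IN h2").show(false)
spark.sql("CREATE NAMESPACE h2.test_namespace")
spark.sql("SHOW TABLES IN h2.test_namespace").show(false)
spark.sql("DROP NAMESPACE h2.test_namespace")
{code}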
[jira] [Created] (SPARK-33426) Unify Hive SHOW TABLES tests
Maxim Gekk created SPARK-33426: -- Summary: Unify Hive SHOW TABLES tests Key: SPARK-33426 URL: https://issues.apache.org/jira/browse/SPARK-33426 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk
1. Move Hive SHOW TABLES tests to a separate test suite
2. Make the new test suite extend the common SHOW TABLES trait
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33403) DSv2 SHOW TABLES doesn't show `default`
Maxim Gekk created SPARK-33403: -- Summary: DSv2 SHOW TABLES doesn't show `default` Key: SPARK-33403 URL: https://issues.apache.org/jira/browse/SPARK-33403 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk DSv1:
{code:scala}
test("namespace is not specified and the default catalog is set") {
  withSQLConf(SQLConf.DEFAULT_CATALOG.key -> catalog) {
    withTable("table") {
      spark.sql(s"CREATE TABLE table (id bigint, data string) $defaultUsing")
      sql("SHOW TABLES").show()
    }
  }
}
{code}
{code}
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| default|    table|      false|
+--------+---------+-----------+
{code}
DSv2:
{code}
+---------+---------+
|namespace|tableName|
+---------+---------+
|         |    table|
+---------+---------+
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33394) Throw `NoSuchDatabaseException` for not existing namespace in DSv2 SHOW TABLES
Maxim Gekk created SPARK-33394: -- Summary: Throw `NoSuchDatabaseException` for not existing namespace in DSv2 SHOW TABLES Key: SPARK-33394 URL: https://issues.apache.org/jira/browse/SPARK-33394 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk The current implementation of DSv2 SHOW TABLES returns an empty result for a non-existing database/namespace. This implementation should be aligned with DSv1, which throws `NoSuchDatabaseException`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
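A test-style sketch of the expected DSv1-aligned behavior (assumes a ScalaTest context and a configured v2 catalog named testcat; both names are illustrative):
{code:scala}
import org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException

// Expected once aligned with DSv1: an unknown namespace fails instead of returning 0 rows.
intercept[NoSuchDatabaseException] {
  spark.sql("SHOW TABLES IN testcat.unknown_ns").collect()
}
{code}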
[jira] [Created] (SPARK-33393) Support SHOW TABLE EXTENDED in DSv2
Maxim Gekk created SPARK-33393: -- Summary: Support SHOW TABLE EXTENDED in DSv2 Key: SPARK-33393 URL: https://issues.apache.org/jira/browse/SPARK-33393 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk Current implementation of DSv2 SHOW TABLE doesn't support the EXTENDED mode in: https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowTablesExec.scala#L33 which is supported in DSv1: https://github.com/apache/spark/blob/7e99fcd64efa425f3c985df4fe957a3be274a49a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L870 Need to add the same functionality to ShowTablesExec. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33364) Expose purge option in TableCatalog.dropTable
[ https://issues.apache.org/jira/browse/SPARK-33364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-33364: --- Parent: SPARK-33392 Issue Type: Sub-task (was: New Feature) > Expose purge option in TableCatalog.dropTable > - > > Key: SPARK-33364 > URL: https://issues.apache.org/jira/browse/SPARK-33364 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Minor > Fix For: 3.1.0 > > > TableCatalog.dropTable currently does not support the purge option. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33305) DSv2: DROP TABLE command should also invalidate cache
[ https://issues.apache.org/jira/browse/SPARK-33305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-33305: --- Parent: SPARK-33392 Issue Type: Sub-task (was: Bug) > DSv2: DROP TABLE command should also invalidate cache > - > > Key: SPARK-33305 > URL: https://issues.apache.org/jira/browse/SPARK-33305 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.1 >Reporter: Chao Sun >Priority: Major > > Different from DSv1, {{DROP TABLE}} command in DSv2 currently only drops the > table but doesn't invalidate all caches referencing the table. We should make > the behavior consistent between v1 and v2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33392) Align DSv2 commands to DSv1 implementation
Maxim Gekk created SPARK-33392: -- Summary: Align DSv2 commands to DSv1 implementation Key: SPARK-33392 URL: https://issues.apache.org/jira/browse/SPARK-33392 Project: Spark Issue Type: Umbrella Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk The purpose of this umbrella ticket is:
# Implement missing features of datasource v1 commands in DSv2
# Align behavior of DSv2 commands to the current implementation of DSv1 commands as much as possible.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33382) Unify v1 and v2 SHOW TABLES tests
Maxim Gekk created SPARK-33382: -- Summary: Unify v1 and v2 SHOW TABLES tests Key: SPARK-33382 URL: https://issues.apache.org/jira/browse/SPARK-33382 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk Gather common tests for the DSv1 and DSv2 SHOW TABLES command into a common test trait. Mix this trait into datasource-specific test suites. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33381) Unify dsv1 and dsv2 command tests
Maxim Gekk created SPARK-33381: -- Summary: Unify dsv1 and dsv2 command tests Key: SPARK-33381 URL: https://issues.apache.org/jira/browse/SPARK-33381 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk Create unified test suites for DSv1 and DSv2 commands such as CREATE TABLE, SHOW TABLES, etc. Put datasource-specific tests into separate test suites. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
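The pattern being proposed, sketched below (the trait and member names are illustrative, not the final suite names): shared checks live in a trait parameterized by the catalog and version, and the DSv1/DSv2 suites mix it in.
{code:scala}
import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.test.SharedSparkSession

// Illustrative shared trait: common SHOW TABLES checks reused by DSv1 and DSv2 suites.
trait ShowTablesSuiteBase extends QueryTest with SharedSparkSession {
  def version: String       // e.g. "V1" or "V2"
  def catalog: String       // e.g. "spark_catalog" or a configured v2 catalog
  def defaultUsing: String  // e.g. "USING parquet"

  test(s"SHOW TABLES $version: show an existing table") {
    withTable("tbl") {
      sql(s"CREATE TABLE tbl (id bigint) $defaultUsing")
      checkAnswer(sql("SHOW TABLES").where("tableName = 'tbl'").select("tableName"), Row("tbl"))
    }
  }
}

// Datasource-specific suites then only define the parameters and any v1/v2-only tests.
{code}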
[jira] [Updated] (SPARK-33381) Unify DSv1 and DSv2 command tests
[ https://issues.apache.org/jira/browse/SPARK-33381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-33381: --- Summary: Unify DSv1 and DSv2 command tests (was: Unify dsv1 and dsv2 command tests) > Unify DSv1 and DSv2 command tests > - > > Key: SPARK-33381 > URL: https://issues.apache.org/jira/browse/SPARK-33381 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > Create unified test suites for DSv1 and DSv2 commands like CREATE TABLE, SHOW > TABLES and etc. Put datasource specific tests to separate test suites. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33299) Unify schema parsing in from_json/from_csv across all APIs
Maxim Gekk created SPARK-33299: -- Summary: Unify schema parsing in from_json/from_csv across all APIs Key: SPARK-33299 URL: https://issues.apache.org/jira/browse/SPARK-33299 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk Currently, from_json() has an extra capability in the Scala API: it accepts the schema in JSON format, while the other APIs (SQL, Python, R) lack this feature. The ticket aims to unify all APIs and support schemas in JSON format everywhere. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
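To illustrate the gap (a sketch assuming a SparkSession named spark): the Scala overload of from_json that takes a schema string accepts both the DDL form and the JSON representation of a schema, while the other APIs only take the DDL form.
{code:scala}
import org.apache.spark.sql.functions.from_json
import spark.implicits._

val df = Seq("""{"a": 1, "b": "x"}""").toDF("value")

// DDL-format schema: supported in every API.
df.select(from_json($"value", "a INT, b STRING", Map.empty[String, String])).show(false)

// JSON-format schema (as produced by StructType.json): currently Scala-only.
val jsonSchema =
  """{"type":"struct","fields":[
    |{"name":"a","type":"integer","nullable":true,"metadata":{}},
    |{"name":"b","type":"string","nullable":true,"metadata":{}}]}""".stripMargin
df.select(from_json($"value", jsonSchema, Map.empty[String, String])).show(false)
{code}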
[jira] [Created] (SPARK-33286) Misleading error messages of `from_json`
Maxim Gekk created SPARK-33286: -- Summary: Misleading error messages of `from_json` Key: SPARK-33286 URL: https://issues.apache.org/jira/browse/SPARK-33286 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk Schema parsing of from_json output misleading error messages. It output the error message only from the fallback parser. And the message does not show actual root cause. For example: {code:scala} val df = Seq("""{"a":1}""").toDF("json") val invalidJsonSchema = """{"fields": [{"a":123}], "type": "struct"}""" df.select(from_json($"json", invalidJsonSchema, Map.empty[String, String])).show {code} {code} mismatched input '{' expecting {'ADD', 'AFTER', 'ALL', 'ALTER', 'ANALYZE', 'AND', 'ANTI', 'ANY', 'ARCHIVE', 'ARRAY', 'AS', 'ASC', 'AT', 'AUTHORIZATION', 'BETWEEN', 'BOTH', 'BUCKET', 'BUCKETS', 'BY', 'CACHE', 'CASCADE', 'CASE', 'CAST', 'CHANGE', 'CHECK', 'CLEAR', 'CLUSTER', 'CLUSTERED', 'CODEGEN', 'COLLATE', 'COLLECTION', 'COLUMN', 'COLUMNS', 'COMMENT', 'COMMIT', 'COMPACT', 'COMPACTIONS', 'COMPUTE', 'CONCATENATE', 'CONSTRAINT', 'COST', 'CREATE', 'CROSS', 'CUBE', 'CURRENT', 'CURRENT_DATE', 'CURRENT_TIME', 'CURRENT_TIMESTAMP', 'CURRENT_USER', 'DATA', 'DATABASE', DATABASES, 'DBPROPERTIES', 'DEFINED', 'DELETE', 'DELIMITED', 'DESC', 'DESCRIBE', 'DFS', 'DIRECTORIES', 'DIRECTORY', 'DISTINCT', 'DISTRIBUTE', 'DIV', 'DROP', 'ELSE', 'END', 'ESCAPE', 'ESCAPED', 'EXCEPT', 'EXCHANGE', 'EXISTS', 'EXPLAIN', 'EXPORT', 'EXTENDED', 'EXTERNAL', 'EXTRACT', 'FALSE', 'FETCH', 'FIELDS', 'FILTER', 'FILEFORMAT', 'FIRST', 'FOLLOWING', 'FOR', 'FOREIGN', 'FORMAT', 'FORMATTED', 'FROM', 'FULL', 'FUNCTION', 'FUNCTIONS', 'GLOBAL', 'GRANT', 'GROUP', 'GROUPING', 'HAVING', 'IF', 'IGNORE', 'IMPORT', 'IN', 'INDEX', 'INDEXES', 'INNER', 'INPATH', 'INPUTFORMAT', 'INSERT', 'INTERSECT', 'INTERVAL', 'INTO', 'IS', 'ITEMS', 'JOIN', 'KEYS', 'LAST', 'LATERAL', 'LAZY', 'LEADING', 'LEFT', 'LIKE', 'LIMIT', 'LINES', 'LIST', 'LOAD', 'LOCAL', 'LOCATION', 'LOCK', 'LOCKS', 'LOGICAL', 'MACRO', 'MAP', 'MATCHED', 'MERGE', 'MSCK', 'NAMESPACE', 'NAMESPACES', 'NATURAL', 'NO', NOT, 'NULL', 'NULLS', 'OF', 'ON', 'ONLY', 'OPTION', 'OPTIONS', 'OR', 'ORDER', 'OUT', 'OUTER', 'OUTPUTFORMAT', 'OVER', 'OVERLAPS', 'OVERLAY', 'OVERWRITE', 'PARTITION', 'PARTITIONED', 'PARTITIONS', 'PERCENT', 'PIVOT', 'PLACING', 'POSITION', 'PRECEDING', 'PRIMARY', 'PRINCIPALS', 'PROPERTIES', 'PURGE', 'QUERY', 'RANGE', 'RECORDREADER', 'RECORDWRITER', 'RECOVER', 'REDUCE', 'REFERENCES', 'REFRESH', 'RENAME', 'REPAIR', 'REPLACE', 'RESET', 'RESTRICT', 'REVOKE', 'RIGHT', RLIKE, 'ROLE', 'ROLES', 'ROLLBACK', 'ROLLUP', 'ROW', 'ROWS', 'SCHEMA', 'SELECT', 'SEMI', 'SEPARATED', 'SERDE', 'SERDEPROPERTIES', 'SESSION_USER', 'SET', 'MINUS', 'SETS', 'SHOW', 'SKEWED', 'SOME', 'SORT', 'SORTED', 'START', 'STATISTICS', 'STORED', 'STRATIFY', 'STRUCT', 'SUBSTR', 'SUBSTRING', 'TABLE', 'TABLES', 'TABLESAMPLE', 'TBLPROPERTIES', TEMPORARY, 'TERMINATED', 'THEN', 'TIME', 'TO', 'TOUCH', 'TRAILING', 'TRANSACTION', 'TRANSACTIONS', 'TRANSFORM', 'TRIM', 'TRUE', 'TRUNCATE', 'TYPE', 'UNARCHIVE', 'UNBOUNDED', 'UNCACHE', 'UNION', 'UNIQUE', 'UNKNOWN', 'UNLOCK', 'UNSET', 'UPDATE', 'USE', 'USER', 'USING', 'VALUES', 'VIEW', 'VIEWS', 'WHEN', 'WHERE', 'WINDOW', 'WITH', 'ZONE', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 0) == SQL == {"fields": [{"a":123}], "type": "struct"} ^^^ org.apache.spark.sql.catalyst.parser.ParseException: mismatched input '{' expecting {'ADD', 'AFTER', 'ALL', 'ALTER', 'ANALYZE', 'AND', 'ANTI', 'ANY', 'ARCHIVE', 'ARRAY', 'AS', 'ASC', 'AT', 
'AUTHORIZATION', 'BETWEEN', 'BOTH', 'BUCKET', 'BUCKETS', 'BY', 'CACHE', 'CASCADE', 'CASE', 'CAST', 'CHANGE', 'CHECK', 'CLEAR', 'CLUSTER', 'CLUSTERED', 'CODEGEN', 'COLLATE', 'COLLECTION', 'COLUMN', 'COLUMNS', 'COMMENT', 'COMMIT', 'COMPACT', 'COMPACTIONS', 'COMPUTE', 'CONCATENATE', 'CONSTRAINT', 'COST', 'CREATE', 'CROSS', 'CUBE', 'CURRENT', 'CURRENT_DATE', 'CURRENT_TIME', 'CURRENT_TIMESTAMP', 'CURRENT_USER', 'DATA', 'DATABASE', DATABASES, 'DBPROPERTIES', 'DEFINED', 'DELETE', 'DELIMITED', 'DESC', 'DESCRIBE', 'DFS', 'DIRECTORIES', 'DIRECTORY', 'DISTINCT', 'DISTRIBUTE', 'DIV', 'DROP', 'ELSE', 'END', 'ESCAPE', 'ESCAPED', 'EXCEPT', 'EXCHANGE', 'EXISTS', 'EXPLAIN', 'EXPORT', 'EXTENDED', 'EXTERNAL', 'EXTRACT', 'FALSE', 'FETCH', 'FIELDS', 'FILTER', 'FILEFORMAT', 'FIRST', 'FOLLOWING', 'FOR', 'FOREIGN', 'FORMAT', 'FORMATTED', 'FROM', 'FULL', 'FUNCTION', 'FUNCTIONS', 'GLOBAL', 'GRANT', 'GROUP', 'GROUPING', 'HAVING', 'IF', 'IGNORE', 'IMPORT', 'IN', 'INDEX', 'INDEXES', 'INNER', 'INPATH', 'INPUTFORMAT', 'INSERT', 'INTERSECT', 'INTERVAL', 'INTO', 'IS', 'ITEMS', 'JOIN', 'KEYS', 'LAST', 'LATERAL', 'LAZY', 'LEADING', 'LEFT', 'LIKE', 'LIMIT', 'LINES', 'LIST', 'LOAD', 'LOCAL', 'LOCATION', 'LOCK', 'LOCKS', 'LOGICAL', 'MACRO', 'MAP', 'MATCHED', 'MERGE', 'MS
[jira] [Created] (SPARK-33281) Return SQL schema instead of Catalog string from the `SchemaOfCsv` expression
Maxim Gekk created SPARK-33281: -- Summary: Return SQL schema instead of Catalog string from the `SchemaOfCsv` expression Key: SPARK-33281 URL: https://issues.apache.org/jira/browse/SPARK-33281 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk SPARK-33270 changed the schema format from catalog string to SQL format. This ticket aims to unify output of schema_of_json and schema_of_csv. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33270) from_json() cannot parse schema from schema_of_json()
Maxim Gekk created SPARK-33270: -- Summary: from_json() cannot parse schema from schema_of_json() Key: SPARK-33270 URL: https://issues.apache.org/jira/browse/SPARK-33270 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.2, 3.1.0 Reporter: Maxim Gekk For example: {code:scala} val in = Seq("""{"a b": 1}""").toDS() in.select(from_json('value, schema_of_json("""{"a b": 100}""")) as "parsed") {code} This fails with the exception: {code:java} org.apache.spark.sql.catalyst.parser.ParseException: extraneous input '<' expecting {'ADD', 'AFTER', 'ALL', 'ALTER', 'ANALYZE', 'AND', 'ANTI', 'ANY', 'ARCHIVE', 'ARRAY', 'AS', 'ASC', 'AT', 'AUTHORIZATION', 'BETWEEN', 'BOTH', 'BUCKET', 'BUCKETS', 'BY', 'CACHE', 'CASCADE', 'CASE', 'CAST', 'CHANGE', 'CHECK', 'CLEAR', 'CLUSTER', 'CLUSTERED', 'CODEGEN', 'COLLATE', 'COLLECTION', 'COLUMN', 'COLUMNS', 'COMMENT', 'COMMIT', 'COMPACT', 'COMPACTIONS', 'COMPUTE', 'CONCATENATE', 'CONSTRAINT', 'COST', 'CREATE', 'CROSS', 'CUBE', 'CURRENT', 'CURRENT_DATE', 'CURRENT_TIME', 'CURRENT_TIMESTAMP', 'CURRENT_USER', 'DATA', 'DATABASE', DATABASES, 'DBPROPERTIES', 'DEFINED', 'DELETE', 'DELIMITED', 'DESC', 'DESCRIBE', 'DFS', 'DIRECTORIES', 'DIRECTORY', 'DISTINCT', 'DISTRIBUTE', 'DIV', 'DROP', 'ELSE', 'END', 'ESCAPE', 'ESCAPED', 'EXCEPT', 'EXCHANGE', 'EXISTS', 'EXPLAIN', 'EXPORT', 'EXTENDED', 'EXTERNAL', 'EXTRACT', 'FALSE', 'FETCH', 'FIELDS', 'FILTER', 'FILEFORMAT', 'FIRST', 'FOLLOWING', 'FOR', 'FOREIGN', 'FORMAT', 'FORMATTED', 'FROM', 'FULL', 'FUNCTION', 'FUNCTIONS', 'GLOBAL', 'GRANT', 'GROUP', 'GROUPING', 'HAVING', 'IF', 'IGNORE', 'IMPORT', 'IN', 'INDEX', 'INDEXES', 'INNER', 'INPATH', 'INPUTFORMAT', 'INSERT', 'INTERSECT', 'INTERVAL', 'INTO', 'IS', 'ITEMS', 'JOIN', 'KEYS', 'LAST', 'LATERAL', 'LAZY', 'LEADING', 'LEFT', 'LIKE', 'LIMIT', 'LINES', 'LIST', 'LOAD', 'LOCAL', 'LOCATION', 'LOCK', 'LOCKS', 'LOGICAL', 'MACRO', 'MAP', 'MATCHED', 'MERGE', 'MSCK', 'NAMESPACE', 'NAMESPACES', 'NATURAL', 'NO', NOT, 'NULL', 'NULLS', 'OF', 'ON', 'ONLY', 'OPTION', 'OPTIONS', 'OR', 'ORDER', 'OUT', 'OUTER', 'OUTPUTFORMAT', 'OVER', 'OVERLAPS', 'OVERLAY', 'OVERWRITE', 'PARTITION', 'PARTITIONED', 'PARTITIONS', 'PERCENT', 'PIVOT', 'PLACING', 'POSITION', 'PRECEDING', 'PRIMARY', 'PRINCIPALS', 'PROPERTIES', 'PURGE', 'QUERY', 'RANGE', 'RECORDREADER', 'RECORDWRITER', 'RECOVER', 'REDUCE', 'REFERENCES', 'REFRESH', 'RENAME', 'REPAIR', 'REPLACE', 'RESET', 'RESTRICT', 'REVOKE', 'RIGHT', RLIKE, 'ROLE', 'ROLES', 'ROLLBACK', 'ROLLUP', 'ROW', 'ROWS', 'SCHEMA', 'SELECT', 'SEMI', 'SEPARATED', 'SERDE', 'SERDEPROPERTIES', 'SESSION_USER', 'SET', 'MINUS', 'SETS', 'SHOW', 'SKEWED', 'SOME', 'SORT', 'SORTED', 'START', 'STATISTICS', 'STORED', 'STRATIFY', 'STRUCT', 'SUBSTR', 'SUBSTRING', 'TABLE', 'TABLES', 'TABLESAMPLE', 'TBLPROPERTIES', TEMPORARY, 'TERMINATED', 'THEN', 'TIME', 'TO', 'TOUCH', 'TRAILING', 'TRANSACTION', 'TRANSACTIONS', 'TRANSFORM', 'TRIM', 'TRUE', 'TRUNCATE', 'TYPE', 'UNARCHIVE', 'UNBOUNDED', 'UNCACHE', 'UNION', 'UNIQUE', 'UNKNOWN', 'UNLOCK', 'UNSET', 'UPDATE', 'USE', 'USER', 'USING', 'VALUES', 'VIEW', 'VIEWS', 'WHEN', 'WHERE', 'WINDOW', 'WITH', 'ZONE', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 6) == SQL == struct --^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:263) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:130) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableSchema(ParseDriver.scala:76) at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:131) at 
org.apache.spark.sql.catalyst.expressions.ExprUtils$.evalTypeExpr(ExprUtils.scala:33) at org.apache.spark.sql.catalyst.expressions.JsonToStructs.(jsonExpressions.scala:537) at org.apache.spark.sql.functions$.from_json(functions.scala:4141) at org.apache.spark.sql.functions$.from_json(functions.scala:4124) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31342) Fail by default if Parquet DATE or TIMESTAMP data is before October 15, 1582
[ https://issues.apache.org/jira/browse/SPARK-31342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220274#comment-17220274 ] Maxim Gekk commented on SPARK-31342: [~cloud_fan] I think we can close this because it has been already implemented. > Fail by default if Parquet DATE or TIMESTAMP data is before October 15, 1582 > > > Key: SPARK-31342 > URL: https://issues.apache.org/jira/browse/SPARK-31342 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Bruce Robbins >Priority: Major > > Some users may not know they are creating and/or reading DATE or TIMESTAMP > data from before October 15, 1582 (because of data encoding libraries, etc.). > Therefore, it may not be clear whether they need to toggle the two > rebaseDateTime config settings. > By default, Spark should fail if it reads or writes data from October 15, > 1582 or before. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33210) Set the rebasing mode for parquet INT96 type to `EXCEPTION` by default
Maxim Gekk created SPARK-33210: -- Summary: Set the rebasing mode for parquet INT96 type to `EXCEPTION` by default Key: SPARK-33210 URL: https://issues.apache.org/jira/browse/SPARK-33210 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk The ticket aims to set the following SQL configs:
- spark.sql.legacy.parquet.int96RebaseModeInWrite
- spark.sql.legacy.parquet.int96RebaseModeInRead
to EXCEPTION by default. The reason is to let users decide whether Spark should modify loaded/saved timestamps, instead of silently shifting timestamps while rebasing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
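With EXCEPTION as the default, reads and writes of ancient INT96 timestamps would fail until a mode is chosen explicitly. A small sketch of opting in (config keys as listed above; the values are assumed to mirror the existing datetime rebase modes, and the path is a placeholder):
{code:scala}
// Choose the INT96 rebase behavior explicitly instead of relying on a silent default:
// LEGACY rebases between the hybrid Julian+Gregorian and proleptic Gregorian calendars,
// CORRECTED reads/writes the stored microseconds as-is.
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "LEGACY")
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")

spark.read.parquet("/path/to/int96_data").show(false)
{code}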
[jira] [Created] (SPARK-33169) Check propagation of datasource options to underlying file system for built-in file-based datasources
Maxim Gekk created SPARK-33169: -- Summary: Check propagation of datasource options to underlying file system for built-in file-based datasources Key: SPARK-33169 URL: https://issues.apache.org/jira/browse/SPARK-33169 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk Add a common trait with a test to check that datasource options are propagated to underlying file systems. Individual tests were already added by SPARK-33094 and SPARK-33089. The ticket aims to de-duplicate the tests. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33163) Check the metadata key 'org.apache.spark.legacyDateTime' in Avro/Parquet files
Maxim Gekk created SPARK-33163: -- Summary: Check the metadata key 'org.apache.spark.legacyDateTime' in Avro/Parquet files Key: SPARK-33163 URL: https://issues.apache.org/jira/browse/SPARK-33163 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk Add tests to check that the metadata key 'org.apache.spark.legacyDateTime' is saved correctly depending on the SQL configs: # spark.sql.legacy.avro.datetimeRebaseModeInWrite # spark.sql.legacy.parquet.datetimeRebaseModeInWrite -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33160) Allow saving/loading INT96 in parquet w/o rebasing
[ https://issues.apache.org/jira/browse/SPARK-33160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17214643#comment-17214643 ] Maxim Gekk commented on SPARK-33160: I am working on this. > Allow saving/loading INT96 in parquet w/o rebasing > -- > > Key: SPARK-33160 > URL: https://issues.apache.org/jira/browse/SPARK-33160 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > Currently, Spark always performs rebasing of INT96 columns in Parquet > datasource but this is not required by parquet spec. This tickets aims to > allow users to turn off rebasing via SQL config. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33160) Allow saving/loading INT96 in parquet w/o rebasing
Maxim Gekk created SPARK-33160: -- Summary: Allow saving/loading INT96 in parquet w/o rebasing Key: SPARK-33160 URL: https://issues.apache.org/jira/browse/SPARK-33160 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk Currently, Spark always performs rebasing of INT96 columns in the Parquet datasource, but this is not required by the Parquet spec. This ticket aims to allow users to turn off rebasing via a SQL config. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33134) Incorrect nested complex JSON fields raise an exception
[ https://issues.apache.org/jira/browse/SPARK-33134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-33134: --- Description: The code below: {code:scala} val pokerhand_raw = Seq("""[{"cards": [19], "playerId": 123456}]""").toDF("events") val event = new StructType() .add("playerId", LongType) .add("cards", ArrayType( new StructType() .add("id", LongType) .add("rank", StringType))) val pokerhand_events = pokerhand_raw .select(explode(from_json($"events", ArrayType(event))).as("event")) pokerhand_events.show {code} throw the exception in the PERMISSIVE mode (default): {code:java} Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to org.apache.spark.sql.catalyst.util.ArrayData at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48) at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48) at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195) at org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:560) at org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:597) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461) at org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313) at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108) {code} The same works in Spark 2.4: {code:scala} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.4.6 /_/ Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265) ... scala> pokerhand_events.show() +-+ |event| +-+ +-+ {code} was: The code below: {code:scala} val pokerhand_raw = Seq("""[{"cards": [11], "playerId": 583651}]""").toDF("events") val event = new StructType() .add("playerId", LongType) .add("cards", ArrayType( new StructType() .add("id", LongType) .add("rank", StringType))) val pokerhand_events = pokerhand_raw .select(explode(from_json($"events", ArrayType(event))).as("event")) pokerhand_events.show {code} throw the exception in the PERMISSIVE mode (default): {code:java} Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to org.apache.spark.sql.catalyst.util.ArrayData at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48) at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48) at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195) at org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:560) at org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:597) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461) at org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313) at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108) {code} The same works in Spark 2.4: {code:scala} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.4.6 /_/ Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265) ... 
scala> pokerhand_events.show() +-+ |event| +-+ +-+ {code} > Incorrect nested complex JSON fields raise an exception > --- > > Key: SPARK-33134 > URL: https://issues.apache.org/jira/browse/SPARK-33134 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.2, 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > The code below: > {code:scala} > val pokerhand_raw = Seq("""[{"cards": [19], "playerId": > 123456}]""").toDF("events") > val event = new StructType() > .add("playerId", LongType) > .add("cards", ArrayType( > new StructType() > .add("id", LongType) > .add("rank", StringType))) > val pokerhand_events = pokerhand_raw > .select(explode(from_json($"events", ArrayType(event))).as("event")) > pokerhand_events.show > {code} > throw the exception in the PERMISSIVE mode (default): > {code:java} > Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to > org.apache.spark.sql.catalyst.util.ArrayData > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48) > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.ge
[jira] [Updated] (SPARK-33134) Incorrect nested complex JSON fields raise an exception
[ https://issues.apache.org/jira/browse/SPARK-33134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-33134: --- Description: The code below: {code:scala} val pokerhand_raw = Seq("""[{"cards": [11], "playerId": 583651}]""").toDF("events") val event = new StructType() .add("playerId", LongType) .add("cards", ArrayType( new StructType() .add("id", LongType) .add("rank", StringType))) val pokerhand_events = pokerhand_raw .select(explode(from_json($"events", ArrayType(event))).as("event")) pokerhand_events.show {code} throw the exception in the PERMISSIVE mode (default): {code:java} Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to org.apache.spark.sql.catalyst.util.ArrayData at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48) at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48) at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195) at org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:560) at org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:597) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461) at org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313) at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108) {code} The same works in Spark 2.4: {code:scala} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.4.6 /_/ Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265) ... scala> pokerhand_events.show() +-+ |event| +-+ +-+ {code} was: The code below: {code:scala} val pokerhand_raw = Seq("""[{"cards": [11], "playerId": 583651}]""").toDF("events") val event = new StructType() .add("playerId", LongType) .add("cards", ArrayType( new StructType() .add("id", LongType) .add("rank", StringType))) val pokerhand_events = pokerhand_raw .select(explode(from_json($"events", ArrayType(event))).as("event")) pokerhand_events.show {code} throw the exception in the PERMISSIVE mode (default): {code:java} Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to org.apache.spark.sql.catalyst.util.ArrayData at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48) at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48) at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195) at org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:560) at org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:597) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461) at org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313) at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108) {code} The same works in Spark 2.4: {code:scala} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.4.6 /_/ Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265) ... 
scala> pokerhand_events.show() +-+ |event| +-+ +-+ {code} > Incorrect nested complex JSON fields raise an exception > --- > > Key: SPARK-33134 > URL: https://issues.apache.org/jira/browse/SPARK-33134 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.2, 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > The code below: > {code:scala} > val pokerhand_raw = Seq("""[{"cards": [11], "playerId": > 583651}]""").toDF("events") > val event = new StructType() > .add("playerId", LongType) > .add("cards", ArrayType( > new StructType() > .add("id", LongType) > .add("rank", StringType))) > val pokerhand_events = pokerhand_raw > .select(explode(from_json($"events", ArrayType(event))).as("event")) > pokerhand_events.show > {code} > throws the following exception in PERMISSIVE mode (the default): > {code:java} > Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to > org.apache.spark.sql.catalyst.util.ArrayData > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48) > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRo
[jira] [Created] (SPARK-33134) Incorrect nested complex JSON fields raise an exception
Maxim Gekk created SPARK-33134: -- Summary: Incorrect nested complex JSON fields raise an exception Key: SPARK-33134 URL: https://issues.apache.org/jira/browse/SPARK-33134 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.2, 3.1.0 Reporter: Maxim Gekk The code below: {code:scala} val pokerhand_raw = Seq("""[{"cards": [11], "playerId": 583651}]""").toDF("events") val event = new StructType() .add("playerId", LongType) .add("cards", ArrayType( new StructType() .add("id", LongType) .add("rank", StringType))) val pokerhand_events = pokerhand_raw .select(explode(from_json($"events", ArrayType(event))).as("event")) pokerhand_events.show {code} throws the following exception in PERMISSIVE mode (the default): {code:java} Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to org.apache.spark.sql.catalyst.util.ArrayData at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48) at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48) at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195) at org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:560) at org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:597) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461) at org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313) at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108) {code} The same code works in Spark 2.4: {code:scala} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.4.6 /_/ Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265) ... scala> pokerhand_events.show() +-+ |event| +-+ +-+ {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
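For illustration (not taken from the ticket): the exception disappears when the declared schema matches the data, which helps separate this schema-mismatch handling bug from a plain schema mistake. A minimal spark-shell sketch, assuming a SparkSession named spark with its implicits in scope:
{code:scala}
// Workaround sketch, not from the ticket: declare "cards" with the element type it
// actually has in the JSON (an array of longs), so from_json parses the row cleanly
// instead of hitting the ClassCastException described above.
import org.apache.spark.sql.functions.{explode, from_json}
import org.apache.spark.sql.types._

val pokerhand_raw = Seq("""[{"cards": [11], "playerId": 583651}]""").toDF("events")
val event = new StructType()
  .add("playerId", LongType)
  .add("cards", ArrayType(LongType))   // matches the data, unlike the struct schema above

val pokerhand_events = pokerhand_raw
  .select(explode(from_json($"events", ArrayType(event))).as("event"))
pokerhand_events.show()   // prints a single row with playerId = 583651 and cards = [11]
{code}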
[jira] [Commented] (SPARK-31294) Benchmark the performance regression
[ https://issues.apache.org/jira/browse/SPARK-31294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212199#comment-17212199 ] Maxim Gekk commented on SPARK-31294: [~cloud_fan] Could you close this ticket since the PR was merged? > Benchmark the performance regression > > > Key: SPARK-31294 > URL: https://issues.apache.org/jira/browse/SPARK-31294 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33101) LibSVM format does not propagate Hadoop config from DS options to underlying HDFS file system
[ https://issues.apache.org/jira/browse/SPARK-33101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-33101: --- Description: When running: {code:java} spark.read.format("libsvm").options(conf).load(path) {code} The underlying file system will not receive the `conf` options. was: When running: {code:java} spark.read.format("orc").options(conf).load(path) {code} The underlying file system will not receive the `conf` options. > LibSVM format does not propagate Hadoop config from DS options to underlying > HDFS file system > - > > Key: SPARK-33101 > URL: https://issues.apache.org/jira/browse/SPARK-33101 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.1.0 > > > When running: > {code:java} > spark.read.format("libsvm").options(conf).load(path) > {code} > The underlying file system will not receive the `conf` options. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33101) LibSVM format does not propagate Hadoop config from DS options to underlying HDFS file system
Maxim Gekk created SPARK-33101: -- Summary: LibSVM format does not propagate Hadoop config from DS options to underlying HDFS file system Key: SPARK-33101 URL: https://issues.apache.org/jira/browse/SPARK-33101 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk Assignee: Maxim Gekk Fix For: 3.1.0 When running: {code:java} spark.read.format("orc").options(conf).load(path) {code} The underlying file system will not receive the `conf` options. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
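For context, a sketch of what the fix should enable for this ticket; the Hadoop option key and the path below are placeholders, not from the ticket. Settings passed through .options(...) should reach the FileSystem that resolves the input path, as they already do for other file-based sources:
{code:scala}
// Illustrative only: the Hadoop key and the path are placeholders.
// After the fix, options passed here should be propagated to the Hadoop
// configuration used by the file system that reads `path`.
// Assumes a spark-shell session (SparkSession `spark`).
val hadoopOptions = Map(
  "fs.file.impl.disable.cache" -> "true"   // example Hadoop setting passed as a DS option
)
val path = "/tmp/sample_libsvm_data.txt"   // placeholder path

val df = spark.read
  .format("libsvm")
  .options(hadoopOptions)
  .load(path)
df.printSchema()
{code}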
[jira] [Created] (SPARK-33094) ORC format does not propagate Hadoop config from DS options to underlying HDFS file system
Maxim Gekk created SPARK-33094: -- Summary: ORC format does not propagate Hadoop config from DS options to underlying HDFS file system Key: SPARK-33094 URL: https://issues.apache.org/jira/browse/SPARK-33094 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk Assignee: Yuning Zhang Fix For: 3.0.2, 3.1.0 When running: {code:java} spark.read.format("avro").options(conf).load(path) {code} The underlying file system will not receive the `conf` options. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33094) ORC format does not propagate Hadoop config from DS options to underlying HDFS file system
[ https://issues.apache.org/jira/browse/SPARK-33094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-33094: --- Description: When running: {code:java} spark.read.format("orc").options(conf).load(path) {code} The underlying file system will not receive the `conf` options. was: When running: {code:java} spark.read.format("avro").options(conf).load(path) {code} The underlying file system will not receive the `conf` options. > ORC format does not propagate Hadoop config from DS options to underlying > HDFS file system > -- > > Key: SPARK-33094 > URL: https://issues.apache.org/jira/browse/SPARK-33094 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Yuning Zhang >Priority: Major > Fix For: 3.0.2, 3.1.0 > > > When running: > {code:java} > spark.read.format("orc").options(conf).load(path) > {code} > The underlying file system will not receive the `conf` options. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33074) Classify dialect exceptions in JDBC v2 Table Catalog
[ https://issues.apache.org/jira/browse/SPARK-33074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-33074: --- Parent: SPARK-24907 Issue Type: Sub-task (was: Improvement) > Classify dialect exceptions in JDBC v2 Table Catalog > > > Key: SPARK-33074 > URL: https://issues.apache.org/jira/browse/SPARK-33074 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > The current implementation of v2.jdbc.JDBCTableCatalog does not handle the > exceptions defined by org.apache.spark.sql.connector.catalog.TableCatalog, such as > * NoSuchNamespaceException > * NoSuchTableException > * TableAlreadyExistsException > Instead, it either throws a dialect-specific exception or a generic AnalysisException. > Since forming dialect-specific statements is now separated from their execution, > we should extend the dialect APIs and ask each dialect how to convert its exceptions > to TableCatalog exceptions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33074) Classify dialect exceptions in JDBC v2 Table Catalog
Maxim Gekk created SPARK-33074: -- Summary: Classify dialect exceptions in JDBC v2 Table Catalog Key: SPARK-33074 URL: https://issues.apache.org/jira/browse/SPARK-33074 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk The current implementation of v2.jdbc.JDBCTableCatalog does not handle the exceptions defined by org.apache.spark.sql.connector.catalog.TableCatalog, such as * NoSuchNamespaceException * NoSuchTableException * TableAlreadyExistsException Instead, it either throws a dialect-specific exception or a generic AnalysisException. Since forming dialect-specific statements is now separated from their execution, we should extend the dialect APIs and ask each dialect how to convert its exceptions to TableCatalog exceptions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
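A rough sketch of the kind of hook the ticket asks for; every name below is hypothetical and only stands in for the real TableCatalog exceptions and dialect API. The idea is that the catalog catches a driver-specific SQLException and asks the dialect to translate it:
{code:scala}
// Sketch only: all names are hypothetical; the error codes are for illustration.
import java.sql.SQLException

sealed trait CatalogError extends Exception
case class NoSuchTableError(table: String) extends CatalogError
case class TableAlreadyExistsError(table: String) extends CatalogError

trait DialectExceptionClassifier {
  // Translate a driver-specific SQLException into a catalog-level error,
  // or return the original exception if the dialect cannot classify it.
  def classify(table: String, cause: SQLException): Throwable
}

// Example classifier keyed on H2 error codes.
object H2Classifier extends DialectExceptionClassifier {
  def classify(table: String, cause: SQLException): Throwable = cause.getErrorCode match {
    case 42102 => NoSuchTableError(table)         // table or view not found
    case 42101 => TableAlreadyExistsError(table)  // table or view already exists
    case _     => cause
  }
}
{code}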
[jira] [Created] (SPARK-33067) Add negative checks to JDBC v2 Table Catalog tests
Maxim Gekk created SPARK-33067: -- Summary: Add negative checks to JDBC v2 Table Catalog tests Key: SPARK-33067 URL: https://issues.apache.org/jira/browse/SPARK-33067 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk Add checks for the cases when JDBC v2 commands fail. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
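One way such a negative check could look; the catalog name, table, and sql helper below are made up for illustration and are not taken from the existing suites:
{code:scala}
// Hypothetical negative test: suite wiring, catalog name, and table are illustrative.
import org.apache.spark.sql.AnalysisException
import org.scalatest.funsuite.AnyFunSuite

class JDBCTableCatalogNegativeSketch extends AnyFunSuite {
  // Assumed helper: runs a statement on a SparkSession that has a JDBC v2
  // catalog registered under the name `h2` (left unimplemented in this sketch).
  def sql(statement: String): Unit = ???

  test("CREATE TABLE fails when the table already exists") {
    sql("CREATE TABLE h2.test.people (id INT) USING _")
    val e = intercept[AnalysisException] {
      sql("CREATE TABLE h2.test.people (id INT) USING _")
    }
    assert(e.getMessage.toLowerCase.contains("already exists"))
  }
}
{code}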
[jira] [Created] (SPARK-33066) Port docker integration tests to JDBC v2
Maxim Gekk created SPARK-33066: -- Summary: Port docker integration tests to JDBC v2 Key: SPARK-33066 URL: https://issues.apache.org/jira/browse/SPARK-33066 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk Port the existing docker integration tests, such as org.apache.spark.sql.jdbc.OracleIntegrationSuite, to JDBC v2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33034) Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (Oracle dialect)
[ https://issues.apache.org/jira/browse/SPARK-33034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-33034: --- Summary: Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (Oracle dialect) (was: Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns) > Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and > nullability of columns (Oracle dialect) > -- > > Key: SPARK-33034 > URL: https://issues.apache.org/jira/browse/SPARK-33034 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > Override the default SQL strings for: > - ALTER TABLE ADD COLUMN > - ALTER TABLE UPDATE COLUMN TYPE > - ALTER TABLE UPDATE COLUMN NULLABILITY > in the Oracle JDBC dialect, according to the official documentation. > Write Oracle integration tests for JDBC. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33034) Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns
[ https://issues.apache.org/jira/browse/SPARK-33034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204672#comment-17204672 ] Maxim Gekk commented on SPARK-33034: I am working on this. > Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and > nullability of columns > - > > Key: SPARK-33034 > URL: https://issues.apache.org/jira/browse/SPARK-33034 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > Override the default SQL strings for: > - ALTER TABLE ADD COLUMN > - ALTER TABLE UPDATE COLUMN TYPE > - ALTER TABLE UPDATE COLUMN NULLABILITY > in the Oracle JDBC dialect, according to the official documentation. > Write Oracle integration tests for JDBC. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33034) Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns
Maxim Gekk created SPARK-33034: -- Summary: Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns Key: SPARK-33034 URL: https://issues.apache.org/jira/browse/SPARK-33034 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk Override the default SQL strings for: - ALTER TABLE ADD COLUMN - ALTER TABLE UPDATE COLUMN TYPE - ALTER TABLE UPDATE COLUMN NULLABILITY in the Oracle JDBC dialect, according to the official documentation. Write Oracle integration tests for JDBC. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
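For orientation, a sketch of the statements the Oracle dialect would have to emit; the object and method names are illustrative, not Spark's actual dialect API. Oracle uses MODIFY rather than ALTER COLUMN for type and nullability changes:
{code:scala}
// Illustrative only: names are not the real JdbcDialect API; the generated SQL
// follows Oracle's documented ALTER TABLE syntax (ADD / MODIFY).
object OracleAlterTableSql {
  def addColumn(table: String, col: String, dataType: String): String =
    s"ALTER TABLE $table ADD $col $dataType"

  def updateColumnType(table: String, col: String, newType: String): String =
    s"ALTER TABLE $table MODIFY $col $newType"

  def updateColumnNullability(table: String, col: String, nullable: Boolean): String = {
    val constraint = if (nullable) "NULL" else "NOT NULL"
    s"ALTER TABLE $table MODIFY $col $constraint"
  }
}

// Example: OracleAlterTableSql.updateColumnType("people", "id", "NUMBER(10)")
//          returns "ALTER TABLE people MODIFY id NUMBER(10)"
{code}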
[jira] [Created] (SPARK-33015) Compute the current date only once
Maxim Gekk created SPARK-33015: -- Summary: Compute the current date only once Key: SPARK-33015 URL: https://issues.apache.org/jira/browse/SPARK-33015 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.2, 3.1.0 Reporter: Maxim Gekk According to the doc for current_date(), it must compute the current date at the start of query evaluation: http://spark.apache.org/docs/latest/api/sql/#current_date but it can compute it multiple times: https://github.com/apache/spark/blob/0df8dd60733066076967f0525210bbdb5e12415a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala#L85 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33015) Compute the current date only once
[ https://issues.apache.org/jira/browse/SPARK-33015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17203084#comment-17203084 ] Maxim Gekk commented on SPARK-33015: I am working on this > Compute the current date only once > -- > > Key: SPARK-33015 > URL: https://issues.apache.org/jira/browse/SPARK-33015 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > According to the doc for current_date(), it must compute the current date at > the start of query evaluation: > http://spark.apache.org/docs/latest/api/sql/#current_date but it can compute > it multiple times: > https://github.com/apache/spark/blob/0df8dd60733066076967f0525210bbdb5e12415a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala#L85 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
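To make the intended semantics concrete, a small illustration (not a reproduction of the bug, and assuming a spark-shell session): all occurrences of current_date() within one query should resolve to the same value, which is why the optimizer should compute it once and reuse the literal.
{code:scala}
// Illustration of the "compute once per query" contract: both columns belong to
// the same query, so they should always match, even if evaluation happens to
// cross a midnight boundary.
val df = spark.sql("SELECT current_date() AS d1, current_date() AS d2")
df.show()
assert(df.filter($"d1" =!= $"d2").isEmpty)   // expected to hold within a single query
{code}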
[jira] [Commented] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results
[ https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199952#comment-17199952 ] Maxim Gekk commented on SPARK-32306: I opened PR https://github.com/apache/spark/pull/29835 with clarification. > `approx_percentile` in Spark SQL gives incorrect results > > > Key: SPARK-32306 > URL: https://issues.apache.org/jira/browse/SPARK-32306 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.4 >Reporter: Sean Malory >Priority: Major > > The `approx_percentile` function in Spark SQL does not give the correct > result. I'm not sure how incorrect it is; it may just be a boundary issue. > From the docs: > {quote}The accuracy parameter (default: 1) is a positive numeric literal > which controls approximation accuracy at the cost of memory. Higher value of > accuracy yields better accuracy, 1.0/accuracy is the relative error of the > approximation. > {quote} > This is not true. Here is a minimum example in `pyspark` where, essentially, > the median of 5 and 8 is being calculated as 5: > {code:python} > import pyspark.sql.functions as psf > df = spark.createDataFrame( > [('bar', 5), ('bar', 8)], ['name', 'val'] > ) > median = psf.expr('percentile_approx(val, 0.5, 2147483647)') > df.groupBy('name').agg(median.alias('median'))# gives the median as 5 > {code} > I've tested this with Spark v2.4.4, pyspark v2.4.5- although I suspect this > is an issue with the underlying algorithm. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
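A Scala counterpart of the PySpark snippet in the report (for illustration, assuming a spark-shell session): percentile_approx returns a value observed in the input (here 5) rather than interpolating to 6.5, which is the behaviour the linked PR clarifies in the documentation.
{code:scala}
// Scala version of the reported example. percentile_approx picks a value that
// occurs in the data, so the "median" of 5 and 8 comes back as 5 rather than 6.5.
import org.apache.spark.sql.functions.expr

val df = Seq(("bar", 5), ("bar", 8)).toDF("name", "val")
df.groupBy("name")
  .agg(expr("percentile_approx(val, 0.5, 2147483647)").as("median"))
  .show()   // prints one row: name = bar, median = 5
{code}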