[GitHub] [hudi] lucabem commented on issue #6278: [SUPPORT] Deltastreamer fails with data and timestamp related exception after upgrading to EMR 6.5 and spark3
lucabem commented on issue #6278: URL: https://github.com/apache/hudi/issues/6278#issuecomment-1357606131 Hi @Virmaline, I have checked other tables and it looks like it cannot read four or more parquets at once. When I add four or more files, it shows me this error. Is it a known bug?
[GitHub] [hudi] lucabem commented on issue #6278: [SUPPORT] Deltastreamer fails with data and timestamp related exception after upgrading to EMR 6.5 and spark3
lucabem commented on issue #6278: URL: https://github.com/apache/hudi/issues/6278#issuecomment-1356756712 Hi @Virmaline, it is quite strange. I have downloaded a full table on AWS that gives me 4 parquets (let's call them A, B, C, D). I have tested your configuration and it works fine with all combinations unless I read all of them at the same time.

| Combination | Result |
|-------------|--------|
| A | OK |
| B | OK |
| C | OK |
| D | OK |
| A, B | OK |
| A, C | OK |
| A, D | OK |
| B, C | OK |
| B, D | OK |
| C, D | OK |
| A, B, C | OK |
| A, B, D | OK |
| A, C, D | OK |
| B, C, D | OK |
| A, B, C, D | KO |

But if I read the first three parquets (A, B, C) and then read the last one (D), it works. It looks like it is losing the spark conf somewhere. This is my spark-submit command:

```
spark-submit \
  --jars jars/hudi-ext-0.12.1.jar,jars/avro-1.11.1.jar \
  --conf spark.driver.memory=12g \
  --conf spark.driver.maxResultSize=12g \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.parquet.datetimeRebaseModeInRead=CORRECTED \
  --conf spark.sql.parquet.datetimeRebaseModeInWrite=CORRECTED \
  --conf spark.sql.avro.datetimeRebaseModeInWrite=CORRECTED \
  --conf spark.sql.avro.datetimeRebaseModeInRead=CORRECTED \
  --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED \
  --conf spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED \
  --conf spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED \
  --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED \
  --conf spark.sql.legacy.avro.datetimeRebaseModeInWrite=CORRECTED \
  --driver-cores 8 \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer jars/hudi-utilities-bundle_2.12-0.12.1.jar \
  --table-type COPY_ON_WRITE \
  --op BULK_INSERT \
  --source-ordering-field dms_timestamp \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-base-path /home/luis/parquet/consolidation/gccc_demand_cond/ \
  --target-table gccc_demand_cond \
  --hoodie-conf hoodie.datasource.write.recordkey.field=id_demand_cond \
  --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator \
  --hoodie-conf hoodie.datasource.write.partitionpath.field= \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=/home/luis/parquet/data/gccc_demand_cond \
  --hoodie-conf hoodie.datasource.write.drop.partition.columns=true \
  --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
  --hoodie-conf hoodie.cleaner.commits.retained=1800 \
  --hoodie-conf clean.retain_commits=1800 \
  --hoodie-conf archive.min_commits=2000 \
  --hoodie-conf archive.max_commits=2010 \
  --hoodie-conf hoodie.keep.min.commits=2000 \
  --hoodie-conf hoodie.keep.max.commits=2010 \
  --transformer-class org.apache.hudi.utilities.transform.AWSDmsTransformer \
  --payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
  --enable-sync \
  --sync-tool-classes org.apache.hudi.hive.HiveSyncTool \
  --hoodie-conf hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://localhost:1 \
  --hoodie-conf hoodie.datasource.hive_sync.enable=true \
  --hoodie-conf hoodie.datasource.hive_sync.database=consolidation \
  --hoodie-conf hoodie.datasource.hive_sync.table=gccc_demand_cond \
  --hoodie-conf hoodie.datasource.hive_sync.mode=hms \
  --hoodie-conf hoodie.datasource.hive_sync.auto_create_database=true
```

And this is my parquet schema:

```
file meta data
  created_by: AWS
  num_columns: 11
  num_rows: 1011052
  num_row_groups: 2023
  format_version: 1.0
  serialized_size: 1897645

Columns
  dms_timestamp
  create_date
  update_date
  update_user
  update_program
  optimist_lock
  id_demand_cond
  ini_date
  end_date
  id_sector_supply
  cod_demand_type

Column(dms_timestamp)
  name: dms_timestamp
  path: dms_timestamp
  max_definition_level: 0
  max_repetition_level: 0
  physical_type: BYTE_ARRAY
  logical_type: String
  converted_type (legacy): UTF8

Column(create_date)
  name: create_date
  path: create_date
  max_definition_level: 1
  max_repetition_level: 0
  physical_type: INT64
  logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=true, force_set_converted_type=false)
  converted_type (legacy): TIMESTAMP_MICROS

Column(update_date)
  name: update_date
```
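For anyone who wants to check whether the submitted `--conf` values actually survive into the session doing the read, here is a minimal diagnostic sketch (mine, not from the report; the object name is hypothetical). It simply prints the effective rebase settings from inside a running Spark 3.1.2 job:

```scala
// Hypothetical diagnostic sketch: print the effective rebase settings from inside
// the running session to see whether the submitted --conf values are still present
// at the point where the parquet files are read.
import org.apache.spark.sql.SparkSession

object RebaseConfCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rebase-conf-check").getOrCreate()
    Seq(
      "spark.sql.parquet.datetimeRebaseModeInRead",
      "spark.sql.legacy.parquet.datetimeRebaseModeInRead",
      "spark.sql.legacy.parquet.int96RebaseModeInRead"
    ).foreach { key =>
      // getOption returns None when the key was never set on this session
      println(s"$key = ${spark.conf.getOption(key).getOrElse("<unset>")}")
    }
    spark.stop()
  }
}
```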
[GitHub] [hudi] lucabem commented on issue #6278: [SUPPORT] Deltastreamer fails with data and timestamp related exception after upgrading to EMR 6.5 and spark3
lucabem commented on issue #6278: URL: https://github.com/apache/hudi/issues/6278#issuecomment-1356180670 Not in my case, I'm still having this issue.
[GitHub] [hudi] lucabem commented on issue #6278: [SUPPORT] Deltastreamer fails with data and timestamp related exception after upgrading to EMR 6.5 and spark3
lucabem commented on issue #6278: URL: https://github.com/apache/hudi/issues/6278#issuecomment-1331884460 Hi @alexeykudinkin, I'm using Hudi 0.12.1 and Spark 3.1.2. I'm trying to execute this command:

```
spark-submit \
  --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED \
  --conf spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED \
  --conf spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED \
  --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED \
  --conf spark.driver.memory=12g \
  --conf spark.driver.maxResultSize=12g \
  --driver-cores 8 \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer jars/hudi-utilities-bundle_2.12-0.12.1.jar \
  --table-type COPY_ON_WRITE \
  --op INSERT \
  --source-ordering-field dms_timestamp \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-base-path /home/luis/parquet/test_table \
  --target-table gccom_demand_cond \
  --transformer-class org.apache.hudi.utilities.transform.AWSDmsTransformer \
  --payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
  --hoodie-conf hoodie.datasource.write.recordkey.field=id_key \
  --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator \
  --hoodie-conf hoodie.datasource.write.partitionpath.field= \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=/home/luis/parquet/data \
  --hoodie-conf hoodie.datasource.write.drop.partition.columns=true \
  --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
  --hoodie-conf hoodie.cleaner.commits.retained=1800 \
  --hoodie-conf clean.retain_commits=1800 \
  --hoodie-conf archive.min_commits=2000 \
  --hoodie-conf archive.max_commits=2010 \
  --hoodie-conf hoodie.keep.min.commits=2000 \
  --hoodie-conf hoodie.keep.max.commits=2010 \
  --enable-sync \
  --sync-tool-classes org.apache.hudi.hive.HiveSyncTool \
  --hoodie-conf hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://localhost:1 \
  --hoodie-conf hoodie.datasource.hive_sync.enable=true \
  --hoodie-conf hoodie.datasource.hive_sync.database=database \
  --hoodie-conf hoodie.datasource.hive_sync.table=test_table \
  --hoodie-conf hoodie.datasource.hive_sync.mode=hms \
  --hoodie-conf hoodie.datasource.hive_sync.auto_create_database=true
```

When I open the Spark UI Environment tab, the conf vars appear to be set, but when Hudi (Javalin) is executed it throws the exception:

```
Caused by: org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during reading. Or set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the datetime values as it is.
```

Right now, the only solution I have found is reading and rewriting the source parquet with Spark 3.1.2 (a plain read/write) with the spark.sql.legacy conf set, and then using that rewritten parquet output as the input of DeltaStreamer.
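For reference, a minimal sketch of that read/rewrite workaround (my own illustration, not the exact job used; the object name and the `data_rewritten` output path are placeholders), assuming Spark 3.1.2:

```scala
// Minimal sketch of the read/rewrite workaround described above.
// The rewritten copy would then be pointed at by hoodie.deltastreamer.source.dfs.root
// instead of the original DMS output.
import org.apache.spark.sql.SparkSession

object RewriteLegacyParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rewrite-legacy-parquet")
      // Same legacy rebase settings as in the spark-submit above
      .config("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
      .config("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
      .config("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
      .config("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
      .getOrCreate()

    // Read the DMS-produced parquet and write it back so the new files carry
    // Spark 3 metadata; DeltaStreamer then consumes the rewritten copy.
    spark.read.parquet("/home/luis/parquet/data")
      .write
      .mode("overwrite")
      .parquet("/home/luis/parquet/data_rewritten") // placeholder output path

    spark.stop()
  }
}
```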
[GitHub] [hudi] lucabem commented on issue #6278: [SUPPORT] Deltastreamer fails with data and timestamp related exception after upgrading to EMR 6.5 and spark3
lucabem commented on issue #6278: URL: https://github.com/apache/hudi/issues/6278#issuecomment-1330626172 Hi, in deltastreamer this issue still exists :(