lucabem commented on issue #6278: URL: https://github.com/apache/hudi/issues/6278#issuecomment-1331884460
Hi @alexeykudinkin, I'm using Hudi 0.12.1 and Spark 3.1.2. I'm trying to execute this command:

```
spark-submit \
  --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED \
  --conf spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED \
  --conf spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED \
  --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED \
  --conf spark.driver.memory=12g \
  --conf spark.driver.maxResultSize=12g \
  --driver-cores 8 \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer jars/hudi-utilities-bundle_2.12-0.12.1.jar \
  --table-type COPY_ON_WRITE \
  --op INSERT \
  --source-ordering-field dms_timestamp \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-base-path /home/luis/parquet/test_table \
  --target-table gccom_demand_cond \
  --transformer-class org.apache.hudi.utilities.transform.AWSDmsTransformer \
  --payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
  --hoodie-conf hoodie.datasource.write.recordkey.field=id_key \
  --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator \
  --hoodie-conf hoodie.datasource.write.partitionpath.field= \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=/home/luis/parquet/data \
  --hoodie-conf hoodie.datasource.write.drop.partition.columns=true \
  --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
  --hoodie-conf hoodie.cleaner.commits.retained=1800 \
  --hoodie-conf clean.retain_commits=1800 \
  --hoodie-conf archive.min_commits=2000 \
  --hoodie-conf archive.max_commits=2010 \
  --hoodie-conf hoodie.keep.min.commits=2000 \
  --hoodie-conf hoodie.keep.max.commits=2010 \
  --enable-sync \
  --sync-tool-classes org.apache.hudi.hive.HiveSyncTool \
  --hoodie-conf hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://localhost:10000 \
  --hoodie-conf hoodie.datasource.hive_sync.enable=true \
  --hoodie-conf hoodie.datasource.hive_sync.database=database \
  --hoodie-conf hoodie.datasource.hive_sync.table=test_table \
  --hoodie-conf hoodie.datasource.hive_sync.mode=hms \
  --hoodie-conf hoodie.datasource.hive_sync.auto_create_database=true
```

In the Spark UI Environment tab the conf vars appear set, but when Hudi (Javalin) is executed it throws this exception:

```
Caused by: org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0:
reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet files can be ambiguous,
as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is
different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set
spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the datetime values w.r.t. the calendar
difference during reading. Or set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the
datetime values as it is.
```

Right now, the only workaround I have found is reading and rewriting the source parquet with Spark 3.1.2 (a plain read/write job with the `spark.sql.legacy` confs set), and then using that output as the input of DeltaStreamer.
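For reference, the rewrite step of that workaround looks roughly like this (a minimal sketch: `rewrite_parquet.py` and the `/home/luis/parquet/data_rewritten` output path are hypothetical names, and the input path is the one from the DeltaStreamer command above):

```shell
# Hypothetical one-off rewrite job: read the source parquet with the legacy
# rebase confs applied, write it back out under a new path, then point
# DeltaStreamer's hoodie.deltastreamer.source.dfs.root at the rewritten copy.
cat > rewrite_parquet.py <<'EOF'
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rebase-rewrite").getOrCreate()

# Paths mirror the DeltaStreamer command above; adjust as needed.
spark.read.parquet("/home/luis/parquet/data") \
     .write.mode("overwrite").parquet("/home/luis/parquet/data_rewritten")
EOF

spark-submit \
  --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED \
  --conf spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED \
  --conf spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED \
  --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED \
  rewrite_parquet.py
```

The rewritten files are produced by Spark 3.x itself, so DeltaStreamer can read them without hitting the rebase ambiguity.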