lucabem commented on issue #6278: URL: https://github.com/apache/hudi/issues/6278#issuecomment-1331884460
Hi @alexeykudinkin, I'm using Hudi 0.12.1 and Spark 3.1.2. I'm trying to execute this command:

```
spark-submit \
  --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED \
  --conf spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED \
  --conf spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED \
  --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED \
  --conf spark.driver.memory=12g \
  --conf spark.driver.maxResultSize=12g \
  --driver-cores 8 \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer jars/hudi-utilities-bundle_2.12-0.12.1.jar \
  --table-type COPY_ON_WRITE \
  --op INSERT \
  --source-ordering-field dms_timestamp \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-base-path /home/luis/parquet/test_table \
  --target-table gccom_demand_cond \
  --transformer-class org.apache.hudi.utilities.transform.AWSDmsTransformer \
  --payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
  --hoodie-conf hoodie.datasource.write.recordkey.field=id_key \
  --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator \
  --hoodie-conf hoodie.datasource.write.partitionpath.field= \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=/home/luis/parquet/data \
  --hoodie-conf hoodie.datasource.write.drop.partition.columns=true \
  --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
  --hoodie-conf hoodie.cleaner.commits.retained=1800 \
  --hoodie-conf clean.retain_commits=1800 \
  --hoodie-conf archive.min_commits=2000 \
  --hoodie-conf archive.max_commits=2010 \
  --hoodie-conf hoodie.keep.min.commits=2000 \
  --hoodie-conf hoodie.keep.max.commits=2010 \
  --enable-sync \
  --sync-tool-classes org.apache.hudi.hive.HiveSyncTool \
  --hoodie-conf hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://localhost:10000 \
  --hoodie-conf hoodie.datasource.hive_sync.enable=true \
  --hoodie-conf hoodie.datasource.hive_sync.database=database \
  --hoodie-conf hoodie.datasource.hive_sync.table=test_table \
  --hoodie-conf hoodie.datasource.hive_sync.mode=hms \
  --hoodie-conf hoodie.datasource.hive_sync.auto_create_database=true
```

In the Spark UI Environment tab the conf vars appear set, but when Hudi (Javalin) is executed it throws this exception:

```
Caused by: org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0:
reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet files can be ambiguous,
as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is
different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set
spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the datetime values w.r.t. the calendar
difference during reading. Or set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the
datetime values as it is.
```

Right now, the only workaround I have found is reading and rewriting the source parquet with Spark 3.1.2 (a plain read/write job with the `spark.sql.legacy` confs set), and then using that output as the input of DeltaStreamer.
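For reference, the rewrite step of that workaround looks roughly like this (a minimal sketch: `rewrite_parquet.py` and the `/home/luis/parquet/data_rewritten` output path are hypothetical names, and the input path is the one from the DeltaStreamer command above):

```shell
# Hypothetical one-off rewrite job: read the source parquet with the legacy
# rebase confs applied, write it back out under a new path, then point
# DeltaStreamer's hoodie.deltastreamer.source.dfs.root at the rewritten copy.
cat > rewrite_parquet.py <<'EOF'
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rebase-rewrite").getOrCreate()

# Paths mirror the DeltaStreamer command above; adjust as needed.
spark.read.parquet("/home/luis/parquet/data") \
     .write.mode("overwrite").parquet("/home/luis/parquet/data_rewritten")
EOF

spark-submit \
  --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED \
  --conf spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED \
  --conf spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED \
  --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED \
  rewrite_parquet.py
```

The rewritten files are produced by Spark 3.x itself, so DeltaStreamer can read them without hitting the rebase ambiguity.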