[GitHub] [hudi] lucabem commented on issue #6278: [SUPPORT] Deltastreamer fails with data and timestamp related exception after upgrading to EMR 6.5 and spark3

2022-12-19 Thread GitBox


lucabem commented on issue #6278:
URL: https://github.com/apache/hudi/issues/6278#issuecomment-1357606131

   Hi @Virmaline, I have checked other tables and it looks like it cannot read more than three parquets at once. Whenever I read four or more files, it shows me this error.
   
   Is it a known bug?
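   
   For what it's worth, a minimal PySpark check (file paths hypothetical) to see whether plain Spark reproduces it outside DeltaStreamer:
   ```
   from pyspark.sql import SparkSession
   
   # Read the four files together -- the combination that fails in DeltaStreamer.
   spark = (
       SparkSession.builder
       .config("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
       .getOrCreate()
   )
   
   paths = ["/data/A.parquet", "/data/B.parquet",
            "/data/C.parquet", "/data/D.parquet"]  # hypothetical paths
   spark.read.parquet(*paths).count()  # force a full scan of all four files
   ```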





[GitHub] [hudi] lucabem commented on issue #6278: [SUPPORT] Deltastreamer fails with data and timestamp related exception after upgrading to EMR 6.5 and spark3

2022-12-18 Thread GitBox


lucabem commented on issue #6278:
URL: https://github.com/apache/hudi/issues/6278#issuecomment-1356756712

   Hi @Virmaline, it is quite strange. I have downloaded a full table from AWS that gives me 4 parquets (let's call them A, B, C, D). I have tested your configuration and it works fine with every combination unless I read all of them at the same time.
   
   | Combination | Result |
   | --- | --- |
   | A   | OK |
   | B   | OK |
   | C   | OK |
   | D   | OK |
   | A, B| OK |
   | A, C| OK |
   | A, D| OK |
   | B, C| OK |
   | B, D| OK |
   | C, D| OK |
   | A, B, C | OK |
   | A, B, D | OK |
   | A, C, D | OK |
   | B, C, D | OK |
   | A, B, C, D  | KO |
   
   But if I read the first three parquets (A, B, C) and then read the last one (D), it works. It looks like the spark-conf is getting lost somewhere (see the sketch after the command). This is my spark-submit command:
   ```
   spark-submit \
   --jars jars/hudi-ext-0.12.1.jar,jars/avro-1.11.1.jar \
   --conf spark.driver.memory=12g \
   --conf spark.driver.maxResultSize=12g \
   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
   --conf spark.sql.parquet.datetimeRebaseModeInRead=CORRECTED \
   --conf spark.sql.parquet.datetimeRebaseModeInWrite=CORRECTED \
   --conf spark.sql.avro.datetimeRebaseModeInWrite=CORRECTED \
   --conf spark.sql.avro.datetimeRebaseModeInRead=CORRECTED \
   --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED \
   --conf spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED \
   --conf spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED \
   --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED \
   --conf spark.sql.legacy.avro.datetimeRebaseModeInWrite=CORRECTED \
   --driver-cores 8 \
   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer jars/hudi-utilities-bundle_2.12-0.12.1.jar \
   --table-type COPY_ON_WRITE \
   --op BULK_INSERT \
   --source-ordering-field dms_timestamp \
   --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
   --target-base-path /home/luis/parquet/consolidation/gccc_demand_cond/ \
   --target-table gccc_demand_cond \
   --hoodie-conf hoodie.datasource.write.recordkey.field=id_demand_cond \
   --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator \
   --hoodie-conf hoodie.datasource.write.partitionpath.field= \
   --hoodie-conf hoodie.deltastreamer.source.dfs.root=/home/luis/parquet/data/gccc_demand_cond \
   --hoodie-conf hoodie.datasource.write.drop.partition.columns=true \
   --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
   --hoodie-conf hoodie.cleaner.commits.retained=1800 \
   --hoodie-conf clean.retain_commits=1800 \
   --hoodie-conf archive.min_commits=2000 \
   --hoodie-conf archive.max_commits=2010 \
   --hoodie-conf hoodie.keep.min.commits=2000 \
   --hoodie-conf hoodie.keep.max.commits=2010 \
   --transformer-class org.apache.hudi.utilities.transform.AWSDmsTransformer \
   --payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
   --enable-sync \
   --sync-tool-classes org.apache.hudi.hive.HiveSyncTool \
   --hoodie-conf hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://localhost:1 \
   --hoodie-conf hoodie.datasource.hive_sync.enable=true \
   --hoodie-conf hoodie.datasource.hive_sync.database=consolidation \
   --hoodie-conf hoodie.datasource.hive_sync.table=gccc_demand_cond \
   --hoodie-conf hoodie.datasource.hive_sync.mode=hms \
   --hoodie-conf hoodie.datasource.hive_sync.auto_create_database=true
   ```
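   
   To check whether the confs really get lost, a minimal PySpark sketch (not my actual job, just a probe using the same conf names as above) that sets the rebase confs programmatically and verifies they are visible to the running session:
   ```
   from pyspark.sql import SparkSession
   
   # Set the same rebase confs in code instead of via spark-submit.
   spark = (
       SparkSession.builder
       .config("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
       .config("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
       .config("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
       .config("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
       .getOrCreate()
   )
   
   # Confirm the values actually reached the session (what the Environment tab shows).
   for key in (
       "spark.sql.legacy.parquet.datetimeRebaseModeInRead",
       "spark.sql.legacy.parquet.int96RebaseModeInRead",
   ):
       print(key, "=", spark.conf.get(key))  # expected: CORRECTED
   ```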
   
   And this is my parquet schema:
   ```
    file meta data 
   created_by: AWS
   num_columns: 11
   num_rows: 1011052
   num_row_groups: 2023
   format_version: 1.0
   serialized_size: 1897645
   
    Columns 
   dms_timestamp
   create_date
   update_date
   update_user
   update_program
   optimist_lock
   id_demand_cond
   ini_date
   end_date
   id_sector_supply
   cod_demand_type
   
    Column(dms_timestamp) 
   name: dms_timestamp
   path: dms_timestamp
   max_definition_level: 0
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
    Column(create_date) 
   name: create_date
   path: create_date
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: INT64
   logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=true, force_set_converted_type=false)
   converted_type (legacy): TIMESTAMP_MICROS
   
    Column(update_date) 
   name: update_date
   ```
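   
   This kind of dump can be produced with pyarrow; a minimal sketch, assuming pyarrow is installed and using a hypothetical file path:
   ```
   import pyarrow.parquet as pq
   
   # Hypothetical path; point this at one of the source parquet files.
   pf = pq.ParquetFile("/home/luis/parquet/data/gccc_demand_cond/file.parquet")
   
   print(pf.metadata)  # created_by, num_columns, num_rows, num_row_groups, ...
   for i in range(pf.metadata.num_columns):
       print(pf.schema.column(i))  # name, physical_type, logical_type, converted_type
   ```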
   

[GitHub] [hudi] lucabem commented on issue #6278: [SUPPORT] Deltastreamer fails with data and timestamp related exception after upgrading to EMR 6.5 and spark3

2022-12-17 Thread GitBox


lucabem commented on issue #6278:
URL: https://github.com/apache/hudi/issues/6278#issuecomment-1356180670

   Not in my case, I'm still having this issue





[GitHub] [hudi] lucabem commented on issue #6278: [SUPPORT] Deltastreamer fails with data and timestamp related exception after upgrading to EMR 6.5 and spark3

2022-11-30 Thread GitBox


lucabem commented on issue #6278:
URL: https://github.com/apache/hudi/issues/6278#issuecomment-1331884460

   Hi @alexeykudinkin, I'm using Hudi 0.12.1 and Spark 3.1.2. I'm trying to execute this command:
   ```
   spark-submit \
   --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED \
   --conf spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED \
   --conf spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED \
   --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED \
   --conf spark.driver.memory=12g \
   --conf spark.driver.maxResultSize=12g \
   --driver-cores 8 \
   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer jars/hudi-utilities-bundle_2.12-0.12.1.jar \
   --table-type COPY_ON_WRITE \
   --op INSERT \
   --source-ordering-field dms_timestamp \
   --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
   --target-base-path /home/luis/parquet/test_table \
   --target-table gccom_demand_cond \
   --transformer-class org.apache.hudi.utilities.transform.AWSDmsTransformer \
   --payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
   --hoodie-conf hoodie.datasource.write.recordkey.field=id_key \
   --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator \
   --hoodie-conf hoodie.datasource.write.partitionpath.field= \
   --hoodie-conf hoodie.deltastreamer.source.dfs.root=/home/luis/parquet/data \
   --hoodie-conf hoodie.datasource.write.drop.partition.columns=true \
   --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
   --hoodie-conf hoodie.cleaner.commits.retained=1800 \
   --hoodie-conf clean.retain_commits=1800 \
   --hoodie-conf archive.min_commits=2000 \
   --hoodie-conf archive.max_commits=2010 \
   --hoodie-conf hoodie.keep.min.commits=2000 \
   --hoodie-conf hoodie.keep.max.commits=2010 \
   --enable-sync \
   --sync-tool-classes org.apache.hudi.hive.HiveSyncTool \
   --hoodie-conf hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://localhost:1 \
   --hoodie-conf hoodie.datasource.hive_sync.enable=true \
   --hoodie-conf hoodie.datasource.hive_sync.database=database \
   --hoodie-conf hoodie.datasource.hive_sync.table=test_table \
   --hoodie-conf hoodie.datasource.hive_sync.mode=hms \
   --hoodie-conf hoodie.datasource.hive_sync.auto_create_database=true
   ```
   and when I open the Spark UI Environment tab the conf vars appear to be set, but then when Hudi (Javalin) runs it throws this exception:
   ```
   Caused by: org.apache.spark.SparkUpgradeException: 
   You may get a different result due to the upgrading of Spark 3.0: 
   reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar.
   See more details in SPARK-31404. 
   You can set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during reading. Or set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the datetime values as it is.
   ```
   
   Right now, the only workaround I have found is to read and rewrite the source parquet with Spark 3.1.2 (just a plain read/write) with the spark.sql.legacy rebase confs set, and then use that parquet output as the input to DeltaStreamer, as in the sketch below.
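   
   A minimal sketch of that rewrite step (the output path is hypothetical; the input path matches hoodie.deltastreamer.source.dfs.root above):
   ```
   from pyspark.sql import SparkSession
   
   # Rewrite the DMS parquet output once with the rebase confs set, then point
   # hoodie.deltastreamer.source.dfs.root at the rewritten copy.
   spark = (
       SparkSession.builder
       .appName("rebase-rewrite")
       .config("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
       .config("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
       .config("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
       .config("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
       .getOrCreate()
   )
   
   df = spark.read.parquet("/home/luis/parquet/data")  # original DMS output
   df.write.mode("overwrite").parquet("/home/luis/parquet/data_rebased")  # hypothetical target
   ```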





[GitHub] [hudi] lucabem commented on issue #6278: [SUPPORT] Deltastreamer fails with data and timestamp related exception after upgrading to EMR 6.5 and spark3

2022-11-29 Thread GitBox


lucabem commented on issue #6278:
URL: https://github.com/apache/hudi/issues/6278#issuecomment-1330626172

   Hi, this issue still exists in DeltaStreamer :(
   

