[GitHub] [hudi] Virmaline commented on issue #6278: [SUPPORT] Deltastreamer fails with data and timestamp related exception after upgrading to EMR 6.5 and spark3

2022-12-19 Thread GitBox


Virmaline commented on issue #6278:
URL: https://github.com/apache/hudi/issues/6278#issuecomment-1357616082

   Hi lucabem, I haven't run into that. I'll have to test it out; maybe I'll 
get to it tomorrow and can let you know my results. I don't actually know how 
all of this works at a deeper level, though, I'm just trying to get it working 
as well.





[GitHub] [hudi] Virmaline commented on issue #6278: [SUPPORT] Deltastreamer fails with data and timestamp related exception after upgrading to EMR 6.5 and spark3

2022-12-17 Thread GitBox


Virmaline commented on issue #6278:
URL: https://github.com/apache/hudi/issues/6278#issuecomment-1356496864

   Can you post your exact spark-submit command? Do you know why it's failing, 
and what the data type and value in the column are?





[GitHub] [hudi] Virmaline commented on issue #6278: [SUPPORT] Deltastreamer fails with data and timestamp related exception after upgrading to EMR 6.5 and spark3

2022-12-15 Thread GitBox


Virmaline commented on issue #6278:
URL: https://github.com/apache/hudi/issues/6278#issuecomment-1353894927

   Never mind, I got it working.
   
   I had specified the --conf options incorrectly: they were comma-separated 
inside a single --conf instead of being passed as separate --conf statements. 
It also needs both the spark.sql.avro and spark.sql.parquet options set to 
work, like so:
   
   --conf spark.sql.avro.datetimeRebaseModeInRead=CORRECTED 
   --conf spark.sql.avro.datetimeRebaseModeInWrite=CORRECTED 
   --conf spark.sql.parquet.datetimeRebaseModeInRead=CORRECTED 
   --conf spark.sql.parquet.datetimeRebaseModeInWrite=CORRECTED
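   
   For reference, a rough sketch of what the full spark-submit can look like 
with these passed as separate --conf flags (the paths, table names, and 
Deltastreamer options below are just examples taken from my setup, adjust them 
to yours):
   
   # each rebase option goes in its own --conf; comma-separating them inside one --conf does not work
   spark-submit \
     --master yarn \
     --deploy-mode cluster \
     --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
     --conf spark.sql.avro.datetimeRebaseModeInRead=CORRECTED \
     --conf spark.sql.avro.datetimeRebaseModeInWrite=CORRECTED \
     --conf spark.sql.parquet.datetimeRebaseModeInRead=CORRECTED \
     --conf spark.sql.parquet.datetimeRebaseModeInWrite=CORRECTED \
     --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
     /usr/lib/hudi/hudi-utilities-bundle.jar \
     --table-type COPY_ON_WRITE \
     --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
     --source-ordering-field replicadmstimestamp \
     --target-base-path s3://bucket/folder/folder/table \
     --target-table table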
   
   





[GitHub] [hudi] Virmaline commented on issue #6278: [SUPPORT] Deltastreamer fails with data and timestamp related exception after upgrading to EMR 6.5 and spark3

2022-12-15 Thread GitBox


Virmaline commented on issue #6278:
URL: https://github.com/apache/hudi/issues/6278#issuecomment-1353791597

   @alexeykudinkin
   
   Hey Alexey, 
   
   I'm also still getting the same error after updating to 0.12.1.
   
   Hudi: 0.12.1-amzn-0-SNAPSHOT
   Spark: 3.3.0
   EMR: 6.9.0
   
   `spark-submit 
   --master yarn 
   --deploy-mode cluster 
   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer,spark.sql.parquet.datetimeRebaseModeInRead=CORRECTED,spark.sql.parquet.datetimeRebaseModeInWrite=CORRECTED,spark.sql.avro.datetimeRebaseModeInWrite=CORRECTED,spark.sql.avro.datetimeRebaseModeInRead=CORRECTED,spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED,spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED,spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED,spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED 
   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer /usr/lib/hudi/hudi-utilities-bundle.jar 
   --table-type COPY_ON_WRITE 
   --source-ordering-field replicadmstimestamp 
   --source-class org.apache.hudi.utilities.sources.ParquetDFSSource 
   --target-base-path s3://bucket/folder/folder/table 
   --target-table table 
   --payload-class org.apache.hudi.common.model.AWSDmsAvroPayload 
   --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator 
   --hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING 
   --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy-MM 
   --hoodie-conf "hoodie.deltastreamer.keygen.timebased.input.dateformat=yyyy-MM-dd HH:mm:ss.SS" 
   --hoodie-conf hoodie.datasource.write.recordkey.field=_id 
   --hoodie-conf hoodie.datasource.write.partitionpath.field=replicadmstimestamp 
   --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://bucket/folder/folder/table`
   
   I've tried about every combination of the datetimeRebaseMode settings I've 
managed to think of, and the result is always the same.
   
   Stacktrace included; is there any possible workaround for this? I currently 
have a separate process that converts the timestamp columns, which works, but 
it adds a bunch of overhead to the pipeline.
   
   
[stacktrace.txt](https://github.com/apache/hudi/files/10241150/stacktrace.txt)
   
   

