[GitHub] [hudi] Virmaline commented on issue #6278: [SUPPORT] Deltastreamer fails with data and timestamp related exception after upgrading to EMR 6.5 and spark3
Virmaline commented on issue #6278: URL: https://github.com/apache/hudi/issues/6278#issuecomment-1357616082

Hi lucabem, I haven't run into that. I'll have to test it out; maybe I'll get to it tomorrow and can let you know my results. I don't actually know how all of this works on a deeper level, though — I'm just trying to get it working as well.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Virmaline commented on issue #6278: URL: https://github.com/apache/hudi/issues/6278#issuecomment-1356496864

Can you post your exact spark-submit? Do you know why it's failing — what is the data type and value in the column?
Virmaline commented on issue #6278: URL: https://github.com/apache/hudi/issues/6278#issuecomment-1353894927

Never mind, I got it working. I had specified the --conf options wrong: they were comma-separated inside a single --conf instead of passed as separate --conf statements. It also needs both the spark.sql.avro and spark.sql.parquet options set, like so:

--conf spark.sql.avro.datetimeRebaseModeInRead=CORRECTED
--conf spark.sql.avro.datetimeRebaseModeInWrite=CORRECTED
--conf spark.sql.parquet.datetimeRebaseModeInRead=CORRECTED
--conf spark.sql.parquet.datetimeRebaseModeInWrite=CORRECTED
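[Editor's note] The fix above hinges on spark-submit parsing each --conf as a single PROP=VALUE pair, split on the first '='. A hypothetical, minimal re-implementation of that parsing (parse_confs is an illustrative helper, not a real Spark function) shows why the comma-joined form silently misconfigures everything:

```python
def parse_confs(args):
    """Sketch of how spark-submit-style --conf arguments are collected:
    each --conf takes exactly one PROP=VALUE, split on the first '='."""
    confs = {}
    it = iter(args)
    for arg in it:
        if arg == "--conf":
            key, _, value = next(it).partition("=")
            confs[key] = value
    return confs

# Comma-joined inside one --conf: only the first property is set,
# and its value is the entire remaining string.
bad = parse_confs(["--conf",
    "spark.serializer=org.apache.spark.serializer.KryoSerializer,"
    "spark.sql.parquet.datetimeRebaseModeInRead=CORRECTED"])
print(sorted(bad))  # → ['spark.serializer']

# Separate --conf flags: every property is set correctly.
good = parse_confs([
    "--conf", "spark.serializer=org.apache.spark.serializer.KryoSerializer",
    "--conf", "spark.sql.parquet.datetimeRebaseModeInRead=CORRECTED"])
print(good["spark.sql.parquet.datetimeRebaseModeInRead"])  # → CORRECTED
```

So in the comma-joined submit, none of the rebase-mode settings ever reached Spark — spark.serializer simply received a garbage value.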
Virmaline commented on issue #6278: URL: https://github.com/apache/hudi/issues/6278#issuecomment-1353791597

@alexeykudinkin Hey Alexey, I'm also still getting the same error after updating to 0.12.1.

Hudi: 0.12.1-amzn-0-SNAPSHOT
Spark: 3.3.0
EMR: 6.9.0

`spark-submit --master yarn --deploy-mode cluster --conf spark.serializer=org.apache.spark.serializer.KryoSerializer,spark.sql.parquet.datetimeRebaseModeInRead=CORRECTED,spark.sql.parquet.datetimeRebaseModeInWrite=CORRECTED,spark.sql.avro.datetimeRebaseModeInWrite=CORRECTED,spark.sql.avro.datetimeRebaseModeInRead=CORRECTED,spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED,spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED,spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED,spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer /usr/lib/hudi/hudi-utilities-bundle.jar --table-type COPY_ON_WRITE --source-ordering-field replicadmstimestamp --source-class org.apache.hudi.utilities.sources.ParquetDFSSource --target-base-path s3://bucket/folder/folder/table --target-table table --payload-class org.apache.hudi.common.model.AWSDmsAvroPayload --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator --hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.dateformat=-MM --hoodie-conf "hoodie.deltastreamer.keygen.timebased.input.dateformat=-MM-dd HH:mm:ss.SS" --hoodie-conf hoodie.datasource.write.recordkey.field=_id --hoodie-conf hoodie.datasource.write.partitionpath.field=replicadmstimestamp --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://bucket/folder/folder/table`

I've tried about every combination of the datetimeRebaseMode settings I've managed to think of, and the result is always the same. Stacktrace attached — is there any possible workaround for this? I currently have a separate process to change the timestamp columns, which works, but it adds a lot of overhead to the pipeline.

[stacktrace.txt](https://github.com/apache/hudi/files/10241150/stacktrace.txt)
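[Editor's note] For background on why these rebase flags exist at all: Spark 2 wrote dates and timestamps using the hybrid Julian/Gregorian calendar, while Spark 3 uses the proleptic Gregorian calendar. For dates before the 1582-10-15 cutover the two calendars disagree, so the same stored day number decodes to a different date unless it is rebased (that is what LEGACY/CORRECTED control). A minimal pure-Python sketch of the disagreement, using standard Julian Day Number formulas — the function names here are illustrative, not Spark APIs:

```python
def jdn_gregorian(y, m, d):
    """Julian Day Number of a proleptic-Gregorian calendar date (Spark 3's view)."""
    a = (14 - m) // 12
    yy = y + 4800 - a
    mm = m + 12 * a - 3
    return d + (153 * mm + 2) // 5 + 365 * yy + yy // 4 - yy // 100 + yy // 400 - 32045

def jdn_julian(y, m, d):
    """Julian Day Number of a Julian calendar date."""
    a = (14 - m) // 12
    yy = y + 4800 - a
    mm = m + 12 * a - 3
    return d + (153 * mm + 2) // 5 + 365 * yy + yy // 4 - 32083

def hybrid_jdn(y, m, d):
    """Hybrid calendar (Spark 2 / java.sql style): Julian before 1582-10-15, Gregorian after."""
    return jdn_julian(y, m, d) if (y, m, d) < (1582, 10, 15) else jdn_gregorian(y, m, d)

# An ancient date decodes 5 days apart under the two calendars —
# the kind of shift the rebase modes exist to handle.
print(hybrid_jdn(1000, 1, 1) - jdn_gregorian(1000, 1, 1))  # → 5

# A modern date is unaffected, so most pipelines never notice.
print(hybrid_jdn(2020, 1, 1) - jdn_gregorian(2020, 1, 1))  # → 0
```

This is also why the rebase flags only matter for data containing very old dates (or invalid sentinel values that fall before the cutover); for everything else the two calendars agree.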