Hi, I'd like to discuss support for the timestamp data type.
As we know, Hudi currently saves the timestamp type as long, which leads to
several problems when a table includes a timestamp column:
1) During the bootstrap operation, if the original parquet file was written by a
Spark application, Spark saves timestamps as INT96 by default (see
spark.sql.parquet.int96AsTimestamp), and bootstrap fails because Hudi cannot
read the INT96 type yet. (This can be solved by upgrading parquet to 1.12.0 and
setting parquet.avro.readInt96AsFixed=true; see
https://github.com/apache/parquet-mr/pull/831/files)
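For reference, once parquet is on 1.12.0 the flag can be passed through Spark's Hadoop configuration like the fragment below (where exactly to wire it into the bootstrap code path is the open question):

```
# assumes parquet-avro >= 1.12.0; makes parquet-avro read INT96 as a 12-byte fixed
spark.hadoop.parquet.avro.readInt96AsFixed=true
```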

2) After bootstrap, upsert fails because we use the Hudi schema to read the
original parquet file. The schemas do not match: the Hudi schema treats the
timestamp as long, while the original file stores it as INT96.
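To make the mismatch concrete, here is what the two sides look like for a hypothetical timestamp column "event_ts" (the field name is just an example, not from any real table):

```
# Hudi (Avro) schema declares a long:
{"name": "event_ts", "type": {"type": "long", "logicalType": "timestamp-micros"}}

# but the Spark-written parquet footer declares INT96:
optional int96 event_ts;
```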

3) After bootstrap, a partial update of a parquet file fails, because we copy
the old records and save them with the Hudi schema (we are missing a
convertFixedToLong operation like the one Spark performs).
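For context, the conversion Spark performs (and that a convertFixedToLong in Hudi would need to replicate) is roughly the following. This is a self-contained sketch, not Hudi or Spark code; it assumes the usual INT96 layout of 8 bytes nanos-of-day followed by 4 bytes Julian day, both little-endian:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Int96ToLong {
    // Julian day number of the Unix epoch (1970-01-01).
    static final long JULIAN_DAY_OF_EPOCH = 2440588L;
    static final long MICROS_PER_DAY = 86400L * 1000L * 1000L;

    // Convert a 12-byte INT96 timestamp (8 bytes nanos-of-day, then
    // 4 bytes Julian day, little-endian) to microseconds since the epoch.
    static long int96ToEpochMicros(byte[] int96) {
        ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
        long nanosOfDay = buf.getLong();
        long julianDay = buf.getInt();
        return (julianDay - JULIAN_DAY_OF_EPOCH) * MICROS_PER_DAY + nanosOfDay / 1000L;
    }

    public static void main(String[] args) {
        // 1970-01-01T00:00:00 -> Julian day 2440588, zero nanos-of-day.
        byte[] epoch = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN)
                .putLong(0L).putInt(2440588).array();
        System.out.println(int96ToEpochMicros(epoch)); // prints 0
    }
}
```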

4) If we set hoodie.datasource.hive_sync.support_timestamp=true, reading the rt
view throws a type-conversion exception, because HoodieRealtimeRecordReaderUtils
is missing the conversion from LongWritable to TimestampWritableV2.
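The missing step is just wrapping the stored long in a timestamp object instead of handing it back as-is. Hive's TimestampWritableV2 isn't available here, so this sketch uses java.time.Instant as a stand-in, and it assumes the long holds epoch milliseconds (adjust if Hudi stores micros):

```java
import java.time.Instant;

public class LongToTimestamp {
    // Stand-in for the conversion HoodieRealtimeRecordReaderUtils would need:
    // instead of returning the raw long, wrap it in a timestamp value
    // (with Hive on the classpath this would build a TimestampWritableV2).
    static Instant longToTimestamp(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis);
    }

    public static void main(String[] args) {
        System.out.println(longToTimestamp(0L)); // prints 1970-01-01T00:00:00Z
    }
}
```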

To solve these issues we need to upgrade the parquet version and add some
configuration. Please help to find a good solution, thank you very much!
