[ 
https://issues.apache.org/jira/browse/HUDI-7955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7955:
-----------------------------
    Fix Version/s: 0.16.0
                   1.0.0

> Account for WritableTimestampObjectInspector#getPrimitiveJavaObject Hive3 and 
> Hive2 discrepancies
> -------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-7955
>                 URL: https://issues.apache.org/jira/browse/HUDI-7955
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: voon
>            Assignee: voon
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.16.0, 1.0.0
>
>         Attachments: image-2024-07-05-18-11-33-420.png, 
> image-2024-07-05-18-13-28-135.png
>
>
> *WritableTimestampObjectInspector#getPrimitiveJavaObject* returns a different 
> timestamp implementation in Hive2 and Hive3:
>  - Hive2: *java.sql.Timestamp*
>  - Hive3: *org.apache.hadoop.hive.common.type.Timestamp*
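> Because the return type is part of the JVM method descriptor, a call site 
> compiled against the Hive2 signature cannot be resolved against the Hive3 
> class at runtime. The paraphrased signatures below (illustrative, not the 
> exact Hive source) show the mismatch:
> {code:java}
> // Hive2 (hive-serde 2.x): the signature Hudi is compiled against
> public java.sql.Timestamp getPrimitiveJavaObject(Object o);
> 
> // Hive3 (hive-serde 3.x): same name and parameter, different return type,
> // hence a different method descriptor -> NoSuchMethodError at the call site
> public org.apache.hadoop.hive.common.type.Timestamp getPrimitiveJavaObject(Object o);
> {code}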
> Hudi common is compiled against Hive2, while Trino ships Hive3, which causes 
> this compile-time/runtime mismatch. The failure is triggered when execution 
> reaches this section of the code and all of the following conditions hold:
> 1. MOR table is used
> 2. User is querying the _rt table
> 3. User's table has a *TIMESTAMP* type and query requires it
> 4. Merge is required as record is present in both Parquet and Log file
> The following error is thrown:
> {code:java}
> Query 20240704_075218_05052_yfmfc failed: 'java.sql.Timestamp 
> org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableTimestampObjectInspector.getPrimitiveJavaObject(java.lang.Object)'
> java.lang.NoSuchMethodError: 'java.sql.Timestamp 
> org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableTimestampObjectInspector.getPrimitiveJavaObject(java.lang.Object)'
>         at 
> org.apache.hudi.hadoop.utils.HiveAvroSerializer.serializePrimitive(HiveAvroSerializer.java:304)
>         at 
> org.apache.hudi.hadoop.utils.HiveAvroSerializer.serialize(HiveAvroSerializer.java:212)
>         at 
> org.apache.hudi.hadoop.utils.HiveAvroSerializer.setUpRecordFieldFromWritable(HiveAvroSerializer.java:121)
>         at 
> org.apache.hudi.hadoop.utils.HiveAvroSerializer.serialize(HiveAvroSerializer.java:108)
>         at 
> org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.convertArrayWritableToHoodieRecord(RealtimeCompactedRecordReader.java:185)
>         at 
> org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.mergeRecord(RealtimeCompactedRecordReader.java:172)
>         at 
> org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:114)
>         at 
> org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:49)
>         at 
> org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:88)
>         at 
> org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:36)
>         at 
> io.trino.plugin.hive.GenericHiveRecordCursor.advanceNextPosition(GenericHiveRecordCursor.java:215)
>         at 
> io.trino.spi.connector.RecordPageSource.getNextPage(RecordPageSource.java:88)
>         at 
> io.trino.plugin.hudi.HudiPageSource.getNextPage(HudiPageSource.java:120){code}
> h1. Hive3
> !image-2024-07-05-18-11-33-420.png|width=509,height=572!
> h1. Hive2
> !image-2024-07-05-18-13-28-135.png|width=507,height=501!
>  
> h1. How to reproduce
>  
> {*}Note{*}: Unless specified otherwise (in the comments), the SQL statements 
> below should be executed in Spark-SQL.
> {code:java}
> CREATE TABLE dev_hudi.hudi_7955__hive3_timestamp_issue (
>     id INT,
>     name STRING,
>     timestamp_col TIMESTAMP,
>     grass_region STRING
> ) USING hudi
> PARTITIONED BY (grass_region)
> tblproperties (
>     primaryKey = 'id',
>     type = 'mor',
>     precombineField = 'id',
>     hoodie.index.type = 'BUCKET',
>     hoodie.index.bucket.engine = 'CONSISTENT_HASHING',
>     hoodie.compact.inline = 'true'
> )
> LOCATION 'hdfs://path/to/hudi_tables/hudi_7955__hive3_timestamp_issue';
> -- 5 separate commits to trigger compaction
> INSERT INTO dev_hudi.hudi_7955__hive3_timestamp_issue VALUES (1, 'alex1', 
> now(), 'SG');
> -- No error here as no MERGE is required between Parquet + Log
> SELECT _hoodie_file_name, id, timestamp_col FROM 
> dev_hudi.hudi_7955__hive3_timestamp_issue_rt WHERE _hoodie_file_name NOT LIKE 
> '%parquet%';
> INSERT INTO dev_hudi.hudi_7955__hive3_timestamp_issue VALUES (2, 'alex2', 
> now(), 'SG');
> INSERT INTO dev_hudi.hudi_7955__hive3_timestamp_issue VALUES (3, 'alex3', 
> now(), 'SG');
> INSERT INTO dev_hudi.hudi_7955__hive3_timestamp_issue VALUES (4, 'alex4', 
> now(), 'SG');
> INSERT INTO dev_hudi.hudi_7955__hive3_timestamp_issue VALUES (5, 'alex5', 
> now(), 'SG');
> -- Should contain no rows here as the table is compacted and all rows are in 
> the parquet file (RUN IN PRESTO/TRINO/HIVE3)
> SELECT _hoodie_file_name, id, timestamp_col FROM 
> dev_hudi.hudi_7955__hive3_timestamp_issue_rt WHERE _hoodie_file_name NOT LIKE 
> '%parquet%';
> -- 2 separate commits which will be written to logs (update)
> INSERT INTO dev_hudi.hudi_7955__hive3_timestamp_issue VALUES (4, 'alex4_1', 
> now(), 'SG');
> INSERT INTO dev_hudi.hudi_7955__hive3_timestamp_issue VALUES (5, 'alex5_1', 
> now(), 'SG');
> -- No TIMESTAMP column is queried, so no error is thrown (RUN IN 
> PRESTO/TRINO/HIVE3)
> SELECT _hoodie_file_name, id FROM 
> dev_hudi.hudi_7955__hive3_timestamp_issue_rt WHERE _hoodie_file_name NOT LIKE 
> '%parquet%';
> -- Error should be thrown here as we are including the timestamp column (RUN 
> IN PRESTO/TRINO/HIVE3)
> SELECT _hoodie_file_name, id, timestamp_col FROM 
> dev_hudi.hudi_7955__hive3_timestamp_issue_rt WHERE _hoodie_file_name NOT LIKE 
> '%parquet%';{code}
>  
> h1. Solution
> Hive shimming should also be applied when invoking *getPrimitiveJavaObject*, 
> so that the method is resolved against the Hive version actually present at 
> runtime. A sketch of one possible approach follows.
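> The snippet below is a minimal, hypothetical sketch of a reflection-based 
> shim, not the actual Hudi patch; the class name *TimestampShim* and the 
> epoch-millis normalization are illustrative assumptions:
> {code:java}
> import java.lang.reflect.Method;
> 
> // Hypothetical helper (illustrative name): resolves getPrimitiveJavaObject
> // reflectively so the lookup happens against whichever Hive jars are on the
> // runtime classpath, then normalizes both timestamp types to epoch millis.
> public final class TimestampShim {
> 
>   private TimestampShim() {}
> 
>   public static long getTimestampMillis(Object inspector, Object writable) {
>     try {
>       // A reflective lookup matches on name and parameter types only, so it
>       // resolves on both Hive2 and Hive3, unlike a statically compiled call
>       // site whose descriptor also encodes the return type.
>       Method m = inspector.getClass()
>           .getMethod("getPrimitiveJavaObject", Object.class);
>       Object ts = m.invoke(inspector, writable);
>       if (ts instanceof java.sql.Timestamp) {
>         // Hive2 runtime
>         return ((java.sql.Timestamp) ts).getTime();
>       }
>       // Hive3 runtime: org.apache.hadoop.hive.common.type.Timestamp
>       return (long) ts.getClass().getMethod("toEpochMilli").invoke(ts);
>     } catch (ReflectiveOperationException e) {
>       throw new IllegalStateException(
>           "Failed to shim getPrimitiveJavaObject", e);
>     }
>   }
> }
> {code}
> In *HiveAvroSerializer#serializePrimitive*, routing the timestamp branch 
> through such a shim would avoid binding to a Hive-version-specific return 
> type at compile time.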
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
