Re: [I] [SUPPORT] Flink write to COW Hudi table,hive aggregate query results has duplicate data but select * did not [hudi]

2024-01-24 Thread via GitHub


CamelliaYjli commented on issue #10486:
URL: https://github.com/apache/hudi/issues/10486#issuecomment-1907801206

   > Yeah, you should use `HoodieHiveInputFormat` or `HoodieCombineHiveInputFormat`. Here is a Chinese-language doc that you can use as a reference: https://www.yuque.com/yuzhao-my9fz/kb/kgv2rb
   
   OK, thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Flink write to COW Hudi table,hive aggregate query results has duplicate data but select * did not [hudi]

2024-01-24 Thread via GitHub


danny0405 closed issue #10486: [SUPPORT] Flink write to COW Hudi table,hive 
aggregate query results has duplicate data but select * did not
URL: https://github.com/apache/hudi/issues/10486





Re: [I] [SUPPORT] Flink write to COW Hudi table,hive aggregate query results has duplicate data but select * did not [hudi]

2024-01-24 Thread via GitHub


danny0405 commented on issue #10486:
URL: https://github.com/apache/hudi/issues/10486#issuecomment-1907724040

   Yeah, you should use `HoodieHiveInputFormat` or `HoodieCombineHiveInputFormat`. Here is a Chinese-language doc that you can use as a reference: https://www.yuque.com/yuzhao-my9fz/kb/kgv2rb





Re: [I] [SUPPORT] Flink write to COW Hudi table,hive aggregate query results has duplicate data but select * did not [hudi]

2024-01-23 Thread via GitHub


CamelliaYjli commented on issue #10486:
URL: https://github.com/apache/hudi/issues/10486#issuecomment-1907220642

   > Seems like a bug.
   
   When I set `hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat`, the result is correct. Is this setting necessary before querying?
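   
   As a rough sketch of what that looks like in a Hive session (Beeline or the Hive CLI), the property can be set before the aggregate query runs. The class name and the query are taken from this thread; that a session-level `set` is enough, rather than a change in `hive-site.xml`, is an assumption based on the behavior described above.
   
   ```sql
   -- Session-level setting; it applies only to the current Hive session.
   set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;
   
   -- The aggregate query from the original report.
   select count(1) from cdc_hudi.table_test_duplicate_1 where id = 'dup_clean_1';
   ```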





Re: [I] [SUPPORT] Flink write to COW Hudi table,hive aggregate query results has duplicate data but select * did not [hudi]

2024-01-16 Thread via GitHub


xicm commented on issue #10486:
URL: https://github.com/apache/hudi/issues/10486#issuecomment-1895237009

   Seems like a bug.





Re: [I] [SUPPORT] Flink write to COW Hudi table,hive aggregate query results has duplicate data but select * did not [hudi]

2024-01-15 Thread via GitHub


danny0405 commented on issue #10486:
URL: https://github.com/apache/hudi/issues/10486#issuecomment-1892987929

   Looks good, @xicm can you help confirm this issue?





Re: [I] [SUPPORT] Flink write to COW Hudi table,hive aggregate query results has duplicate data but select * did not [hudi]

2024-01-15 Thread via GitHub


CamelliaYjli commented on issue #10486:
URL: https://github.com/apache/hudi/issues/10486#issuecomment-1892007697

   > Can you show us the create table statement from Hive?
   
   Okay, the table in Hive is an external table automatically generated during 
synchronization. The statement is as follows:
   
   CREATE EXTERNAL TABLE `cdc_hudi.table_test_duplicate_1`(
 `_hoodie_commit_time` string COMMENT '',
 `_hoodie_commit_seqno` string COMMENT '',
 `_hoodie_record_key` string COMMENT '',
 `_hoodie_partition_path` string COMMENT '',
 `_hoodie_file_name` string COMMENT '',
 `id` string COMMENT '',
 `name` string COMMENT '',
 `age` int COMMENT '')
   ROW FORMAT SERDE
 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
   WITH SERDEPROPERTIES (
 'hoodie.query.as.ro.table'='false',
 'path'='hdfs://localhost:8020/user/hive/warehouse/cdc_hudi.db/table_test_duplicate_1')
   STORED AS INPUTFORMAT
 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
   OUTPUTFORMAT
 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
   LOCATION
 'hdfs://localhost:8020/user/hive/warehouse/cdc_hudi.db/table_test_duplicate_1'
   TBLPROPERTIES (
 'last_commit_completion_time_sync'='20240112161204004',
 'last_commit_time_sync'='20240112160716028',
 'spark.sql.sources.provider'='hudi',
 'spark.sql.sources.schema.numParts'='1',
 'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"_hoodie_commit_time","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_commit_seqno","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_record_key","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_partition_path","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_file_name","type":"string","nullable":true,"metadata":{}},{"name":"id","type":"string","nullable":false,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"age","type":"integer","nullable":true,"metadata":{}}]}',
 'transient_lastDdlTime'='1704939293')





Re: [I] [SUPPORT] Flink write to COW Hudi table,hive aggregate query results has duplicate data but select * did not [hudi]

2024-01-14 Thread via GitHub


danny0405 commented on issue #10486:
URL: https://github.com/apache/hudi/issues/10486#issuecomment-1891257931

   Can you show us the create table statement from Hive?





Re: [I] [SUPPORT] Flink write to COW Hudi table,hive aggregate query results has duplicate data but select * did not [hudi]

2024-01-14 Thread via GitHub


CamelliaYjli commented on issue #10486:
URL: https://github.com/apache/hudi/issues/10486#issuecomment-1891236791

   > Is the hive table synced automatically from the ingestion job?
   
   Yes, Hive synchronization has been enabled.
   
![image](https://github.com/apache/hudi/assets/153248157/876a8566-6ca5-46ab-abb7-2f4cb03892bb)
   





Re: [I] [SUPPORT] Flink write to COW Hudi table,hive aggregate query results has duplicate data but select * did not [hudi]

2024-01-14 Thread via GitHub


danny0405 commented on issue #10486:
URL: https://github.com/apache/hudi/issues/10486#issuecomment-1891223591

   Is the hive table synced automatically from the ingestion job?





Re: [I] [SUPPORT] Flink write to COW Hudi table,hive aggregate query results has duplicate data but select * did not [hudi]

2024-01-14 Thread via GitHub


CamelliaYjli commented on issue #10486:
URL: https://github.com/apache/hudi/issues/10486#issuecomment-1890885226

   > What's your Hive execution engine? Did you update the hudi-hadoop-mr-bundle jar in hive.tar.gz or tez.tar.gz on HDFS?
   
   Sorry for the late reply. I am using Hive-on-MR, and 
hudi-hadoop-mr-bundle-0.14.0.jar has been added to ${HIVE_HOME}/auxlib.
   
   





Re: [I] [SUPPORT] Flink write to COW Hudi table,hive aggregate query results has duplicate data but select * did not [hudi]

2024-01-12 Thread via GitHub


xicm commented on issue #10486:
URL: https://github.com/apache/hudi/issues/10486#issuecomment-1888702810

   What's your Hive execution engine? Did you update the hudi-hadoop-mr-bundle jar in hive.tar.gz or tez.tar.gz on HDFS?





[I] [SUPPORT] Flink write to COW Hudi table,hive aggregate query results has duplicate data but select * did not [hudi]

2024-01-10 Thread via GitHub


CamelliaYjli opened a new issue, #10486:
URL: https://github.com/apache/hudi/issues/10486

   
   **Describe the problem you faced**
   
   I use Flink to write to a Hudi COW table and sync it to Hive, but Hive aggregate queries (e.g. count(*), row_number() over()) return duplicate rows, while select * does not.
   
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Write to a Hudi COW table with Flink SQL (upsert):
   
   ```java
   String hudiSinkDDL = "CREATE TABLE hudi_table(\n" +
   "id String,\n" +
   "name String,\n" +
   "age Int,\n" +
   "PRIMARY KEY (id) NOT ENFORCED \n" +
   ") WITH (\n" +
   // Basic configuration
   "'write.operation' = 'upsert',\n" +
   "'write.precombine' = 'true',\n" +
   "'connector' = 'hudi',\n" +
   "'path'= '${basePath}',\n" +
   "'table.type' = 'COPY_ON_WRITE',\n" +
   "'write.tasks' = '2',\n" +
   "'write.bucket_assign.tasks' = '2',\n" +
   // Hive sync configuration
   "'hive_sync.conf.dir'='/opt/apache-hive-3.1.3-bin/conf',\n" +
   "'hive_sync.enabled' = 'true',\n" + // register the dataset and sync it to the Hive metastore
   "'hive_sync.mode' = 'hms',\n" + // sync through the Hive metastore
   "'hive_sync.metastore.uris' = 'thrift://localhost:9083',\n" +
   "'hive_sync.db' = 'cdc_hudi',\n" +
   "'hive_sync.table' = '${tableName}',\n" +
   // Small-file and compression settings
   "'clean.retain_commits' = '1',\n" +
   "'metadata.compaction.delta_commits' = '5',\n" +
   "'hoodie.parquet.compression.codec' = 'gzip',\n" +
   "'hoodie.parquet.max.file.size' = '268435456'\n" +
   ")";
   ```
   
   2. Insert a row into MySQL, then update it.
   
   ```sql
   -- insert
   insert into table_test_duplicate_1(id,name,age) values('dup_clean_1','Camellia',11);
   -- update
   update table_test_duplicate_1 set age = 20 where id ='dup_clean_1';
   ```
   
   3. Run `select * from cdc_hudi.table_test_duplicate_1 where id = 'dup_clean_1';`. The result is correct.
   
   ![image](https://github.com/apache/hudi/assets/153248157/7a48b473-cc80-4adf-b0a2-f150b0b3b400)
   
   4. Run aggregate queries. The results contain duplicate rows.
   
   ```sql
   select count(1) from cdc_hudi.table_test_duplicate_1 where id = 'dup_clean_1';
   ```
   
   ![image](https://github.com/apache/hudi/assets/153248157/c9c336bf-1f5f-4a06-8966-377f0a40ddbc)
   
   ```sql
   select
   *,
   row_number() over (partition by id order by age desc) as rank
   from cdc_hudi.table_test_duplicate_1 where id = 'dup_clean_1';
   ```
   
   ![image](https://github.com/apache/hudi/assets/153248157/23e9f81b-d00e-4363-8456-4fdeebb50fe8)
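   
   One way to narrow this down, sketched here under the assumption that the extra rows come from older file versions that the aggregate query path still reads, is to group the affected key by Hudi's metadata columns. `_hoodie_file_name` and `_hoodie_commit_time` are columns of the synced Hive table; how to read the output is a guess, not a confirmed diagnosis.
   
   ```sql
   -- Count rows for the affected key per data file and commit.
   select _hoodie_file_name, _hoodie_commit_time, count(1) as cnt
   from cdc_hudi.table_test_duplicate_1
   where id = 'dup_clean_1'
   group by _hoodie_file_name, _hoodie_commit_time;
   ```
   
   If the same key shows up under more than one file name or commit time, that would suggest the query is reading more than the latest version of each file.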
   
   
   **Expected behavior**
   
   Aggregate queries and regular queries should return consistent results. Why do they differ? Any help is appreciated.
   
   **Environment Description**
   
   * Hudi version : 0.14.0
   
   * Spark version : no
   
   * Flink version : 1.17.0
   
   * Hive version : 3.1.3
   
   * Hadoop version : 3.3.6
   
   * Storage (HDFS/S3/GCS..) :HDFS
   
   * Running on Docker? (yes/no) :no
   
   
   
   

