Re: [I] [SUPPORT] Flink write to COW Hudi table,hive aggregate query results has duplicate data but select * did not [hudi]
CamelliaYjli commented on issue #10486:
URL: https://github.com/apache/hudi/issues/10486#issuecomment-1907801206

> Yeah, you should use `HoodieHiveInputFormat` or `HoodieCombineHiveInputFormat`. This is a Chinese doc that you can use as a reference: https://www.yuque.com/yuzhao-my9fz/kb/kgv2rb

OK, thanks.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Flink write to COW Hudi table,hive aggregate query results has duplicate data but select * did not [hudi]
danny0405 closed issue #10486: [SUPPORT] Flink write to COW Hudi table,hive aggregate query results has duplicate data but select * did not
URL: https://github.com/apache/hudi/issues/10486
Re: [I] [SUPPORT] Flink write to COW Hudi table,hive aggregate query results has duplicate data but select * did not [hudi]
danny0405 commented on issue #10486:
URL: https://github.com/apache/hudi/issues/10486#issuecomment-1907724040

Yeah, you should use `HoodieHiveInputFormat` or `HoodieCombineHiveInputFormat`. This is a Chinese doc that you can use as a reference: https://www.yuque.com/yuzhao-my9fz/kb/kgv2rb
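For readers landing on this thread, the workaround the reporter confirmed can be applied per Hive session, as in the sketch below. The property value is the one quoted in this thread and the table name comes from the reporter's repro. Whether to set it per session or globally in the Hive configuration is a deployment choice that the thread does not discuss.

```sql
-- Session-level setting reported in this thread to give correct aggregate results.
set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;

-- Re-run the aggregate query from the original report.
select count(1)
from cdc_hudi.table_test_duplicate_1
where id = 'dup_clean_1';
```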
Re: [I] [SUPPORT] Flink write to COW Hudi table,hive aggregate query results has duplicate data but select * did not [hudi]
CamelliaYjli commented on issue #10486:
URL: https://github.com/apache/hudi/issues/10486#issuecomment-1907220642

> Seems a bug

When I set `hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat`, the result is correct. Is this setting necessary before querying?
Re: [I] [SUPPORT] Flink write to COW Hudi table,hive aggregate query results has duplicate data but select * did not [hudi]
xicm commented on issue #10486:
URL: https://github.com/apache/hudi/issues/10486#issuecomment-1895237009

Seems a bug
Re: [I] [SUPPORT] Flink write to COW Hudi table,hive aggregate query results has duplicate data but select * did not [hudi]
danny0405 commented on issue #10486:
URL: https://github.com/apache/hudi/issues/10486#issuecomment-1892987929

Looks good, @xicm can you help confirm this issue?
Re: [I] [SUPPORT] Flink write to COW Hudi table,hive aggregate query results has duplicate data but select * did not [hudi]
CamelliaYjli commented on issue #10486:
URL: https://github.com/apache/hudi/issues/10486#issuecomment-1892007697

> can you show us the create table statement from Hive?

Okay. The table in Hive is an external table generated automatically during synchronization. The statement is as follows:

CREATE EXTERNAL TABLE `cdc_hudi.table_test_duplicate_1`(
  `_hoodie_commit_time` string COMMENT '',
  `_hoodie_commit_seqno` string COMMENT '',
  `_hoodie_record_key` string COMMENT '',
  `_hoodie_partition_path` string COMMENT '',
  `_hoodie_file_name` string COMMENT '',
  `id` string COMMENT '',
  `name` string COMMENT '',
  `age` int COMMENT '')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
  'hoodie.query.as.ro.table'='false',
  'path'='hdfs://localhost:8020/user/hive/warehouse/cdc_hudi.db/table_test_duplicate_1')
STORED AS INPUTFORMAT
  'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'hdfs://localhost:8020/user/hive/warehouse/cdc_hudi.db/table_test_duplicate_1'
TBLPROPERTIES (
  'last_commit_completion_time_sync'='20240112161204004',
  'last_commit_time_sync'='20240112160716028',
  'spark.sql.sources.provider'='hudi',
  'spark.sql.sources.schema.numParts'='1',
  'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"_hoodie_commit_time","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_commit_seqno","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_record_key","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_partition_path","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_file_name","type":"string","nullable":true,"metadata":{}},{"name":"id","type":"string","nullable":false,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"age","type":"integer","nullable":true,"metadata":{}}]}',
  'transient_lastDdlTime'='1704939293')
Re: [I] [SUPPORT] Flink write to COW Hudi table,hive aggregate query results has duplicate data but select * did not [hudi]
danny0405 commented on issue #10486:
URL: https://github.com/apache/hudi/issues/10486#issuecomment-1891257931

Can you show us the create table statement from Hive?
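One way to obtain the requested statement is Hive's `SHOW CREATE TABLE`. The table name below is taken from the reporter's repro queries. The line of interest in the output is the `STORED AS INPUTFORMAT` clause, which shows the input format registered for the synced table.

```sql
-- Print the DDL that Hive sync registered for the table,
-- including its INPUTFORMAT / OUTPUTFORMAT classes.
show create table cdc_hudi.table_test_duplicate_1;
```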
Re: [I] [SUPPORT] Flink write to COW Hudi table,hive aggregate query results has duplicate data but select * did not [hudi]
CamelliaYjli commented on issue #10486:
URL: https://github.com/apache/hudi/issues/10486#issuecomment-1891236791

> Is the hive table synced automatically from the ingestion job?

Yes, Hive synchronization has been enabled.

![image](https://github.com/apache/hudi/assets/153248157/876a8566-6ca5-46ab-abb7-2f4cb03892bb)
Re: [I] [SUPPORT] Flink write to COW Hudi table,hive aggregate query results has duplicate data but select * did not [hudi]
danny0405 commented on issue #10486:
URL: https://github.com/apache/hudi/issues/10486#issuecomment-1891223591

Is the hive table synced automatically from the ingestion job?
Re: [I] [SUPPORT] Flink write to COW Hudi table,hive aggregate query results has duplicate data but select * did not [hudi]
CamelliaYjli commented on issue #10486:
URL: https://github.com/apache/hudi/issues/10486#issuecomment-1890885226

> What's your Hive execution engine? Did you update the hudi-hadoop-mr-bundle jar in hive.tar.gz or tez.tar.gz on HDFS?

Sorry for the late reply. I am using Hive-on-MR, and hudi-hadoop-mr-bundle-0.14.0.jar has been added to ${HIVE_HOME}/auxlib.
Re: [I] [SUPPORT] Flink write to COW Hudi table,hive aggregate query results has duplicate data but select * did not [hudi]
xicm commented on issue #10486:
URL: https://github.com/apache/hudi/issues/10486#issuecomment-1888702810

What's your Hive execution engine? Did you update the hudi-hadoop-mr-bundle jar in hive.tar.gz or tez.tar.gz on HDFS?
[I] [SUPPORT] Flink write to COW Hudi table,hive aggregate query results has duplicate data but select * did not [hudi]
CamelliaYjli opened a new issue, #10486:
URL: https://github.com/apache/hudi/issues/10486

**Describe the problem you faced**

I use Flink to write to a Hudi COW table and sync it to Hive, but Hive aggregate queries (e.g. `count(*)`, `row_number() over()`) return duplicate data while `select *` does not.

**To Reproduce**

Steps to reproduce the behavior:

1. Write to a Hudi COW table with Flink SQL, using upsert.

```java
String hudiSinkDDL = "CREATE TABLE hudi_table(\n" +
        "id String,\n" +
        "name String,\n" +
        "age Int,\n" +
        "PRIMARY KEY (id) NOT ENFORCED \n" +
        ") WITH (\n" +
        // Basic configuration
        "'write.operation' = 'upsert',\n" +
        "'write.precombine' = 'true',\n" +
        "'connector' = 'hudi',\n" +
        "'path'= '${basePath}',\n" +
        "'table.type' = 'COPY_ON_WRITE',\n" +
        "'write.tasks' = '2',\n" +
        "'write.bucket_assign.tasks' = '2',\n" +
        // Hive sync configuration
        "'hive_sync.conf.dir'='/opt/apache-hive-3.1.3-bin/conf',\n" +
        "'hive_sync.enabled' = 'true',\n" +   // register the dataset and sync it to the Hive metastore
        "'hive_sync.mode' = 'hms',\n" +       // sync via the Hive metastore
        "'hive_sync.metastore.uris' = 'thrift://localhost:9083',\n" +
        "'hive_sync.db' = 'cdc_hudi',\n" +
        "'hive_sync.table' = '${tableName}',\n" +
        // Small-file and compression configuration
        "'clean.retain_commits' = '1',\n" +
        "'metadata.compaction.delta_commits' = '5',\n" +
        "'hoodie.parquet.compression.codec' = 'gzip',\n" +
        "'hoodie.parquet.max.file.size' = '268435456'\n" +
        ")";
```

2. Insert a row into MySQL and then update it.

```sql
-- insert
insert into table_test_duplicate_1(id,name,age) values('dup_clean_1','Camellia',11);
-- update
update table_test_duplicate_1 set age = 20 where id ='dup_clean_1';
```

3. Run `select * from cdc_hudi.table_test_duplicate_1 where id = 'dup_clean_1';`. The results are normal.

Screenshot: https://github.com/apache/hudi/assets/153248157/7a48b473-cc80-4adf-b0a2-f150b0b3b400

4. Run an aggregate function. The data is duplicated.

```sql
select count(1) from cdc_hudi.table_test_duplicate_1 where id = 'dup_clean_1';
```

Screenshot: https://github.com/apache/hudi/assets/153248157/c9c336bf-1f5f-4a06-8966-377f0a40ddbc

```sql
select *, row_number() over (partition by id order by age desc) as rank from cdc_hudi.table_test_duplicate_1 where id = 'dup_clean_1';
```

Screenshot: https://github.com/apache/hudi/assets/153248157/23e9f81b-d00e-4363-8456-4fdeebb50fe8

**Expected behavior**

Why do aggregate queries and regular queries return inconsistent results? Your help is appreciated.

**Environment Description**

* Hudi version : 0.14.0
* Spark version : no
* Flink version : 1.17.0
* Hive version : 3.1.3
* Hadoop version : 3.3.6
* Storage (HDFS/S3/GCS..) : HDFS
* Running on Docker? (yes/no) : no
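A possible way to narrow down where the extra rows come from is to group by the Hudi metadata columns, which appear in the synced Hive DDL posted in this thread. This is only a diagnostic sketch. It assumes the duplicates originate from more than one commit or data file, which the thread does not confirm.

```sql
-- List which commit and which data file each copy of the record is read from.
-- More than one output row for a single id would suggest that several
-- file versions are being scanned by the aggregate query.
select _hoodie_commit_time,
       _hoodie_file_name,
       count(1) as cnt
from cdc_hudi.table_test_duplicate_1
where id = 'dup_clean_1'
group by _hoodie_commit_time, _hoodie_file_name;
```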