Re: [I] [SUPPORT]Data Loss Issue with Hudi Table After 3 Days of Continuous Writes [hudi]
ad1happy2go commented on issue #11016: URL: https://github.com/apache/hudi/issues/11016#issuecomment-2087975711 @juice411 Do you need any other help on this? Please let us know if you are good. Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
danny0405 commented on issue #11016: URL: https://github.com/apache/hudi/issues/11016#issuecomment-2066279351 > but the issue is that we can't access older data. If your table is ingested in streaming `upsert` mode, then you can just specify `read.start-commit` as the first commit instant time on the timeline and skip the compaction. Only instants that have not been cleaned can be consumed. It actually depends on how you wrote the history dataset, because `bulk_insert` does not guarantee the payload sequence of one key, so if the table is bootstrapped with `bulk_insert`, the only way is to consume from `earliest`.
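As a sketch of the approach described above (the instant time shown is hypothetical; the actual value would be the first instant on your table's timeline, visible under `.hoodie/`), a streaming read that starts from the first commit and skips compaction instants could look like this, using a Flink SQL dynamic table options hint:

```sql
-- Hypothetical example: stream-read the MOR table from its first commit
-- instant while skipping compaction instants. '20240101000000000' stands
-- in for the real first instant time on the table's timeline.
SELECT * FROM ods_table_v1
/*+ OPTIONS(
  'read.streaming.enabled' = 'true',
  'read.streaming.skip_compaction' = 'true',
  'read.start-commit' = '20240101000000000'
) */;
```

Note the caveat from the comment above: this only yields the full change history if the consumed instants have not yet been cleaned.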
juice411 commented on issue #11016: URL: https://github.com/apache/hudi/issues/11016#issuecomment-2065674083 We want to start re-acquiring data from the first record of the upstream Hudi table and rebuild the downstream table, but the issue is that we can't access older data.
danny0405 commented on issue #11016: URL: https://github.com/apache/hudi/issues/11016#issuecomment-2065618201 It should work with the option `'read.start-commit'='earliest'`. What is the current behavior now: consuming from the latest commit, or from a very specific one?
juice411 commented on issue #11016: URL: https://github.com/apache/hudi/issues/11016#issuecomment-2063254935

CREATE TABLE if not exists test_simulated_data.ods_table_v1 (
  id int,
  count_field double,
  write_time timestamp(0),
  _part string,
  proc_time timestamp(3),
  WATERMARK FOR write_time AS write_time
) PARTITIONED BY (_part) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs://masters/test_simulated_data/ods_table_v1',
  'table.type' = 'MERGE_ON_READ',
  'hoodie.datasource.write.recordkey.field' = 'id',
  'hoodie.datasource.write.precombine.field' = 'write_time',
  'compaction.async.enabled' = 'true',
  'compaction.schedule.enabled' = 'true',
  'compaction.trigger.strategy' = 'time_elapsed',
  'compaction.delta_seconds' = '600',
  'compaction.delta_commits' = '1',
  'read.streaming.enabled' = 'true',
  'read.streaming.skip_compaction' = 'true',
  'read.start-commit' = 'earliest',
  'changelog.enabled' = 'true',
  'hive_sync.enable' = 'true',
  'hive_sync.mode' = 'hms',
  'hive_sync.metastore.uris' = 'thrift://h35:9083',
  'hive_sync.db' = 'test_simulated_data',
  'hive_sync.table' = 'hive_ods_table'
);
danny0405 commented on issue #11016: URL: https://github.com/apache/hudi/issues/11016#issuecomment-2061071009 Can you share your source table definitions?
juice411 commented on issue #11016: URL: https://github.com/apache/hudi/issues/11016#issuecomment-2060919550 @danny0405 We set read.start-commit to earliest, but it did not work as expected, and we have been very anxious about it. We have tried various approaches to obtain the full data set, but none of them worked. This issue has been plaguing us for more than three days. Is there any other approach we can take?
danny0405 commented on issue #11016: URL: https://github.com/apache/hudi/issues/11016#issuecomment-2060710749 > How can I obtain the full dataset from the upstream Hudi table? Specify the `read.start-commit` as `earliest`. By default the streaming source consumes from the latest commit of a Hudi table.
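For illustration, the option can also be supplied per-query through a Flink SQL dynamic table options hint, so the table definition itself does not need to change (a minimal sketch, using the table name from this thread):

```sql
-- Hypothetical example: consume the table from the earliest commit
-- instead of the default behavior of starting at the latest commit.
SELECT * FROM ods_table_v1
/*+ OPTIONS(
  'read.streaming.enabled' = 'true',
  'read.start-commit' = 'earliest'
) */;
```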
juice411 commented on issue #11016: URL: https://github.com/apache/hudi/issues/11016#issuecomment-2060565779 @danny0405 ![image](https://github.com/apache/hudi/assets/10968514/2400eb28-d8f3-471c-b32a-a625cfd5a17f) As shown in the screenshot, Flink has created a job that fetches data from an upstream Hudi table and performs a count calculation. However, after several minutes, the job has only fetched 6 records from the upstream table. This raises the question of where the other data might have gone. How can I obtain the full dataset from the upstream Hudi table?
juice411 commented on issue #11016: URL: https://github.com/apache/hudi/issues/11016#issuecomment-2060452646 ![image](https://github.com/apache/hudi/assets/10968514/9c567a2c-9237-453c-8706-af380cf28a6b) During our testing, we've encountered an unusual issue with the Hudi stream read table. When the downstream processing system fetches data from the upstream Hudi table (designed as a stream read table) and attempts to process it, it reports that it cannot find log files. Obviously, this is expected since the data has been merged into Parquet files. However, the question remains: why is Hudi still searching for these non-existent files? This issue is causing inconsistencies in the downstream processing results, leading us to believe that the downstream system might not be able to fully capture all the data from the upstream table. We're eager to understand the root cause of this behavior and if there are any recommended workarounds or configurations that we should be aware of.
juice411 commented on issue #11016: URL: https://github.com/apache/hudi/issues/11016#issuecomment-2060230825 @danny0405 The previous versions we were using were Hudi 0.14.1 and Flink 1.17.2. Also, we believe our issue is not related to the precombine field as we have a unique ID to identify each data entry.
danny0405 commented on issue #11016: URL: https://github.com/apache/hudi/issues/11016#issuecomment-2060119989 And can you also share the Hudi and Flink releases you are using here?
ad1happy2go commented on issue #11016: URL: https://github.com/apache/hudi/issues/11016#issuecomment-2059267852 @juice411 The precombine field is used as an ordering field to deduplicate. For example, if the source has two records with the same record key, Hudi will pick the record with the higher precombine value and skip the other one. This happens when we use the upsert operation type. For bulk_insert and insert, it will insert both of them.
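As a sketch of that deduplication rule against the table from this thread (the row values are hypothetical): with `id` as the record key and `write_time` as the precombine field, two upserts of the same key leave only the row with the later `write_time` visible to a snapshot read.

```sql
-- Hypothetical upserts: both rows share record key id = 1, so with
-- precombine field write_time only the second (later) row is kept
-- under upsert semantics; bulk_insert/insert would keep both.
INSERT INTO ods_table_v1 VALUES
  (1, 10.0, TIMESTAMP '2024-01-01 08:00:00', 'p1', TIMESTAMP '2024-01-01 08:00:00.000'),
  (1, 99.0, TIMESTAMP '2024-01-01 09:00:00', 'p1', TIMESTAMP '2024-01-01 09:00:00.000');
```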
juice411 commented on issue #11016: URL: https://github.com/apache/hudi/issues/11016#issuecomment-2058742164 I appreciate the clarification. While I'm not entirely sure about the significance of preCombine, I've learned from the developer that hoodie.datasource.write.precombine.field is set to write_time. The write_time is a timestamp field representing the time when data was written, formatted like '2024-01-01 18:59:25.0'. Could you elaborate on the impacts or benefits of this setting? For instance, how does it enhance data processing, query efficiency, or data consistency?
juice411 commented on issue #11016: URL: https://github.com/apache/hudi/issues/11016#issuecomment-2058358581 Upon further testing after upgrading to the new master version, we have discovered missing data. As per our testing expectations, the results for all days should be consistent and equal to the data from the first day. However, as evident from the screenshot attached, the data for subsequent days is inconsistent. I have confirmed that the entire data system has been stopped for more than half an hour, ruling out the possibility of any pending or unfinished data processing.
juice411 commented on issue #11016: URL: https://github.com/apache/hudi/issues/11016#issuecomment-2058328083 Could you please provide your contact information in China, as I have noticed you are located in Hangzhou? It would be helpful for further communication.
danny0405 commented on issue #11016: URL: https://github.com/apache/hudi/issues/11016#issuecomment-2056310014 > I can only find all the data for February 25, 2024, and cannot find any other data. By the way, we have configured metadata synchronization to Hive, and all the written data can be found from the Hive end. What engine did you use when you found the data loss?
juice411 commented on issue #11016: URL: https://github.com/apache/hudi/issues/11016#issuecomment-2056158828 The data is written through the flink-mysql-cdc method, from January 1, 2024, to March 31, 2024, with 10,000 records being written to MySQL every day. After completing one round of writing, it starts writing from the first day again and continues to cycle for several rounds. However, when I query the Hudi table, I can only find all the data for February 25, 2024, and cannot find any other data. By the way, we have configured metadata synchronization to Hive, and all the written data can be found from the Hive end.
danny0405 commented on issue #11016: URL: https://github.com/apache/hudi/issues/11016#issuecomment-2056069011 How did you write the earliest data set: did the records get updated, or were they just lost?
juice411 commented on issue #11016: URL: https://github.com/apache/hudi/issues/11016#issuecomment-2055471100 We are using Hudi version 0.14.1, and we have tried both streaming reads and batch queries, but we cannot read the earliest written data. If this is the issue you mentioned, we will try upgrading to the master branch.
danny0405 commented on issue #11016: URL: https://github.com/apache/hudi/issues/11016#issuecomment-2054894833 What Hudi release did you use then? We did find a weird data loss issue with compaction in release 0.14.0; it is fixed in master and the 1.0.x branch now. Are you talking about streaming-read data loss or batch queries?