Re: [I] [SUPPORT]Data Loss Issue with Hudi Table After 3 Days of Continuous Writes [hudi]

2024-04-30 Thread via GitHub


ad1happy2go commented on issue #11016:
URL: https://github.com/apache/hudi/issues/11016#issuecomment-2087975711

   @juice411 Do you need any other help on this? Please let us know if you are 
all set. Thanks.


Re: [I] [SUPPORT]Data Loss Issue with Hudi Table After 3 Days of Continuous Writes [hudi]

2024-04-19 Thread via GitHub


danny0405 commented on issue #11016:
URL: https://github.com/apache/hudi/issues/11016#issuecomment-2066279351

   > but the issue is that we can't access older data.
   
   If your table is ingested with streaming `upsert`, then you can just specify 
`read.start-commit` as the first commit instant time on the timeline and skip 
the compaction. Only instants that have not been cleaned can be consumed.
   
   It actually depends on how you wrote the history dataset, because 
`bulk_insert` does not guarantee the payload sequence of one key, so if the 
table was bootstrapped with `bulk_insert`, the only way is to consume from 
`earliest`.
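
   A minimal sketch of what that could look like in Flink SQL, using a dynamic 
table options hint and the poster's table from the DDL quoted later in this 
thread; the instant time `20240101000000000` is a hypothetical placeholder for 
the first commit instant shown on your timeline:
   
   -- Hypothetical sketch: stream-read the change sequence from the first
   -- commit on the timeline, skipping compaction instants.
   SELECT * FROM test_simulated_data.ods_table_v1 /*+ OPTIONS(
     'read.streaming.enabled' = 'true',
     'read.start-commit' = '20240101000000000',  -- placeholder instant time
     'read.streaming.skip_compaction' = 'true'
   ) */;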


Re: [I] [SUPPORT]Data Loss Issue with Hudi Table After 3 Days of Continuous Writes [hudi]

2024-04-18 Thread via GitHub


juice411 commented on issue #11016:
URL: https://github.com/apache/hudi/issues/11016#issuecomment-2065674083

   We want to start re-acquiring data from the first record of the upstream 
Hudi table and rebuild the downstream table, but the issue is that we can't 
access older data.


Re: [I] [SUPPORT]Data Loss Issue with Hudi Table After 3 Days of Continuous Writes [hudi]

2024-04-18 Thread via GitHub


danny0405 commented on issue #11016:
URL: https://github.com/apache/hudi/issues/11016#issuecomment-2065618201

   It should work with the option `'read.start-commit'='earliest'`. What is the 
current behavior now, consuming from the latest commit or from a very specific 
one?


Re: [I] [SUPPORT]Data Loss Issue with Hudi Table After 3 Days of Continuous Writes [hudi]

2024-04-18 Thread via GitHub


juice411 commented on issue #11016:
URL: https://github.com/apache/hudi/issues/11016#issuecomment-2063254935

   CREATE TABLE if not exists test_simulated_data.ods_table_v1(
   id int,
   count_field double,
   write_time timestamp(0),
   _part string,
   proc_time timestamp(3),
   WATERMARK FOR write_time AS write_time
   )
   PARTITIONED BY (_part)
   WITH(
   'connector'='hudi',
   'path'='hdfs://masters/test_simulated_data/ods_table_v1',
   'table.type'='MERGE_ON_READ',
   -- record key and precombine (ordering) field for upserts
   'hoodie.datasource.write.recordkey.field'='id',
   'hoodie.datasource.write.precombine.field'='write_time',
   -- async compaction, scheduled on elapsed time (every 600 seconds)
   'compaction.async.enabled'='true',
   'compaction.schedule.enabled'='true',
   'compaction.trigger.strategy'='time_elapsed',
   'compaction.delta_seconds'='600',
   'compaction.delta_commits'='1',
   -- streaming read from the earliest commit, skipping compaction instants
   'read.streaming.enabled'='true',
   'read.streaming.skip_compaction'='true',
   'read.start-commit'='earliest',
   'changelog.enabled'='true',
   -- sync table metadata to the Hive metastore
   'hive_sync.enable'='true',
   'hive_sync.mode'='hms',
   'hive_sync.metastore.uris'='thrift://h35:9083',
   'hive_sync.db'='test_simulated_data',
   'hive_sync.table'='hive_ods_table'
   );


Re: [I] [SUPPORT]Data Loss Issue with Hudi Table After 3 Days of Continuous Writes [hudi]

2024-04-17 Thread via GitHub


danny0405 commented on issue #11016:
URL: https://github.com/apache/hudi/issues/11016#issuecomment-2061071009

   Can you share your source table definition?


Re: [I] [SUPPORT]Data Loss Issue with Hudi Table After 3 Days of Continuous Writes [hudi]

2024-04-17 Thread via GitHub


juice411 commented on issue #11016:
URL: https://github.com/apache/hudi/issues/11016#issuecomment-2060919550

   @danny0405 We did set read.start-commit to earliest, but it did not work as 
expected, and we have been very anxious about it. We have tried various 
approaches to obtain the full data set, but none of them worked. This issue has 
been plaguing us for more than three days. Is there any other approach we can 
take?


Re: [I] [SUPPORT]Data Loss Issue with Hudi Table After 3 Days of Continuous Writes [hudi]

2024-04-17 Thread via GitHub


danny0405 commented on issue #11016:
URL: https://github.com/apache/hudi/issues/11016#issuecomment-2060710749

   > How can I obtain the full dataset from the upstream Hudi table?
   
   Specify `read.start-commit` as `earliest`. By default, the streaming 
source consumes from the latest commit of a Hudi table.
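
   For reference, a minimal sketch of doing this per query with a Flink SQL 
dynamic table options hint, reusing the table name from the DDL above:
   
   -- Hypothetical sketch: stream-read the table from the earliest commit.
   SELECT * FROM test_simulated_data.ods_table_v1 /*+ OPTIONS(
     'read.streaming.enabled' = 'true',
     'read.start-commit' = 'earliest'
   ) */;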


Re: [I] [SUPPORT]Data Loss Issue with Hudi Table After 3 Days of Continuous Writes [hudi]

2024-04-17 Thread via GitHub


juice411 commented on issue #11016:
URL: https://github.com/apache/hudi/issues/11016#issuecomment-2060565779


@danny0405 ![image](https://github.com/apache/hudi/assets/10968514/2400eb28-d8f3-471c-b32a-a625cfd5a17f)
   As shown in the screenshot, Flink has created a job that fetches data from 
an upstream Hudi table and performs a count calculation. However, the screenshot 
shows that after several minutes the job has fetched only 6 records from the 
upstream table, which raises the question of where the rest of the data went.
   How can I obtain the full dataset from the upstream Hudi table?


Re: [I] [SUPPORT]Data Loss Issue with Hudi Table After 3 Days of Continuous Writes [hudi]

2024-04-16 Thread via GitHub


juice411 commented on issue #11016:
URL: https://github.com/apache/hudi/issues/11016#issuecomment-2060452646

   
![image](https://github.com/apache/hudi/assets/10968514/9c567a2c-9237-453c-8706-af380cf28a6b)
   During our testing, we've encountered an unusual issue with the Hudi stream 
read table. When the downstream processing system fetches data from the 
upstream Hudi table (designed as a stream read table) and attempts to process 
it, it reports that it cannot find log files. Obviously, this is expected since 
the data has been merged into Parquet files. However, the question remains: why 
is Hudi still searching for these non-existent files?
   
   This issue is causing inconsistencies in the downstream processing results, 
leading us to believe that the downstream system might not be able to fully 
capture all the data from the upstream table. We're eager to understand the 
root cause of this behavior and to learn whether there are any recommended 
workarounds or configurations that we should be aware of.
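
   One hedged sketch of a workaround, assuming the reader lags behind compaction 
and cleaning: retain more commits on the writer so the log files survive until 
the streaming reader reaches them, and keep skipping compaction instants on the 
read side. `clean.retain_commits` is a standard Hudi Flink cleaning option; the 
value 120 is an illustrative assumption, not a recommendation, and the `ALTER 
TABLE ... SET` statement assumes a catalog that supports updating table options:
   
   -- Writer side: widen the cleaner retention window (illustrative value).
   ALTER TABLE test_simulated_data.ods_table_v1 SET (
     'clean.retain_commits' = '120'
   );
   
   -- Reader side: keep skipping compaction instants so the streaming read
   -- consumes log files rather than the compacted Parquet base files.
   SELECT * FROM test_simulated_data.ods_table_v1 /*+ OPTIONS(
     'read.streaming.enabled' = 'true',
     'read.streaming.skip_compaction' = 'true'
   ) */;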


Re: [I] [SUPPORT]Data Loss Issue with Hudi Table After 3 Days of Continuous Writes [hudi]

2024-04-16 Thread via GitHub


juice411 commented on issue #11016:
URL: https://github.com/apache/hudi/issues/11016#issuecomment-2060230825

   @danny0405 The previous versions we were using were Hudi 0.14.1 and Flink 
1.17.2. Also, we believe our issue is not related to the precombine field as we 
have a unique ID to identify each data entry.


Re: [I] [SUPPORT]Data Loss Issue with Hudi Table After 3 Days of Continuous Writes [hudi]

2024-04-16 Thread via GitHub


danny0405 commented on issue #11016:
URL: https://github.com/apache/hudi/issues/11016#issuecomment-2060119989

   And can you also share the Hudi and Flink releases you are using here?


Re: [I] [SUPPORT]Data Loss Issue with Hudi Table After 3 Days of Continuous Writes [hudi]

2024-04-16 Thread via GitHub


ad1happy2go commented on issue #11016:
URL: https://github.com/apache/hudi/issues/11016#issuecomment-2059267852

   @juice411 The precombine field is used as an ordering field for 
deduplication. For example, if there are two records in the source with the 
same record key, Hudi will pick the record with the higher precombine value and 
skip the other one. This happens when we use the upsert operation type. For 
bulk_insert and insert, it will insert both of them.
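
   A tiny illustrative sketch of that behavior in Flink SQL, using a 
hypothetical demo table (the names, path, and values are assumptions, not the 
poster's setup):
   
   -- Hypothetical demo table in a Flink + Hudi environment.
   CREATE TABLE hudi_demo (
     id INT,
     count_field DOUBLE,
     write_time TIMESTAMP(0)
   ) WITH (
     'connector' = 'hudi',
     'path' = 'hdfs://masters/tmp/hudi_demo',  -- illustrative path
     'table.type' = 'MERGE_ON_READ',
     'write.operation' = 'upsert',
     'hoodie.datasource.write.recordkey.field' = 'id',
     'hoodie.datasource.write.precombine.field' = 'write_time'
   );
   
   -- Two rows share record key id = 1; under upsert, the row with the higher
   -- precombine value (the later write_time) wins.
   INSERT INTO hudi_demo VALUES
     (1, 10.0, TIMESTAMP '2024-01-01 08:00:00'),
     (1, 20.0, TIMESTAMP '2024-01-01 09:00:00');
   
   -- A subsequent SELECT returns a single row for id = 1:
   --   (1, 20.0, 2024-01-01 09:00:00)
   -- With 'write.operation' = 'insert' or 'bulk_insert', both rows are kept.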
   


Re: [I] [SUPPORT]Data Loss Issue with Hudi Table After 3 Days of Continuous Writes [hudi]

2024-04-16 Thread via GitHub


juice411 commented on issue #11016:
URL: https://github.com/apache/hudi/issues/11016#issuecomment-2058742164

   I appreciate the clarification. While I'm not entirely sure about the 
significance of preCombine, I've learned from the developer that 
hoodie.datasource.write.precombine.field is set to write_time. The write_time 
field is a timestamp representing the time when the data was written, formatted 
like '2024-01-01 18:59:25.0'.
   
   Could you elaborate on the impacts or benefits of this setting? For 
instance, how does it affect data processing, query efficiency, or data 
consistency?


Re: [I] [SUPPORT]Data Loss Issue with Hudi Table After 3 Days of Continuous Writes [hudi]

2024-04-15 Thread via GitHub


juice411 commented on issue #11016:
URL: https://github.com/apache/hudi/issues/11016#issuecomment-2058358581

   Upon further testing after upgrading to the new master version, we have 
discovered missing data. As per our testing expectations, the results for all 
days should be consistent and equal to the data from the first day. However, as 
evident from the screenshot attached, the data for subsequent days is 
inconsistent. I have confirmed that the entire data system has been stopped for 
more than half an hour, ruling out the possibility of any pending or unfinished 
data processing.


Re: [I] [SUPPORT]Data Loss Issue with Hudi Table After 3 Days of Continuous Writes [hudi]

2024-04-15 Thread via GitHub


juice411 commented on issue #11016:
URL: https://github.com/apache/hudi/issues/11016#issuecomment-2058328083

   Could you please provide your contact information in China, as I have 
noticed you are located in Hangzhou? It would be helpful for further 
communication.


Re: [I] [SUPPORT]Data Loss Issue with Hudi Table After 3 Days of Continuous Writes [hudi]

2024-04-15 Thread via GitHub


danny0405 commented on issue #11016:
URL: https://github.com/apache/hudi/issues/11016#issuecomment-2056310014

   >  I can only find all the data for February 25, 2024, and cannot find any 
other data. By the way, we have configured metadata synchronization to Hive, 
and all the written data can be found from the Hive end.
   
   What engine did you use when you found the data loss?


Re: [I] [SUPPORT]Data Loss Issue with Hudi Table After 3 Days of Continuous Writes [hudi]

2024-04-15 Thread via GitHub


juice411 commented on issue #11016:
URL: https://github.com/apache/hudi/issues/11016#issuecomment-2056158828

   The data is written through the flink-mysql-cdc connector, covering January 
1, 2024, to March 31, 2024, with 10,000 records written to MySQL every day. 
After completing one round of writing, it starts again from the first day and 
cycles for several rounds. However, when I query the Hudi table, I can only 
find the data for February 25, 2024, and cannot find any other data. By the 
way, we have configured metadata synchronization to Hive, and all the written 
data can be found from the Hive end.


Re: [I] [SUPPORT]Data Loss Issue with Hudi Table After 3 Days of Continuous Writes [hudi]

2024-04-15 Thread via GitHub


danny0405 commented on issue #11016:
URL: https://github.com/apache/hudi/issues/11016#issuecomment-2056069011

   How did you write the earliest data set? Did those records get updated, or 
did they just get lost?


Re: [I] [SUPPORT]Data Loss Issue with Hudi Table After 3 Days of Continuous Writes [hudi]

2024-04-14 Thread via GitHub


juice411 commented on issue #11016:
URL: https://github.com/apache/hudi/issues/11016#issuecomment-2055471100

   We are using Hudi version 0.14.1, and we have tried both streaming reads and 
batch queries, but we cannot read the earliest written data. If this is the 
issue you mentioned, we will try upgrading to the master branch.


Re: [I] [SUPPORT]Data Loss Issue with Hudi Table After 3 Days of Continuous Writes [hudi]

2024-04-14 Thread via GitHub


danny0405 commented on issue #11016:
URL: https://github.com/apache/hudi/issues/11016#issuecomment-2054894833

   What Hudi release did you use then? We did find a weird data loss issue 
related to compaction in release 0.14.0; it is fixed in master and the 1.0.x 
branch now.
   
   Are you talking about streaming read data loss or batch queries?

