Re: [I] [SUPPORT]Data loss occurs when using bulkinsert [hudi]

2024-04-08 Thread via GitHub


nsivabalan commented on issue #9748:
URL: https://github.com/apache/hudi/issues/9748#issuecomment-2044042481

   Hey @ad1happy2go: any follow-up on this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT]Data loss occurs when using bulkinsert [hudi]

2023-11-01 Thread via GitHub


ad1happy2go commented on issue #9748:
URL: https://github.com/apache/hudi/issues/9748#issuecomment-1788698298

   @blackcheckren Ideally these nulls should not cause data loss, though I have
not fully understood your explanation.
   
   Is this issue reproducible? It looks to be data-related only, though.





Re: [I] [SUPPORT]Data loss occurs when using bulkinsert [hudi]

2023-10-22 Thread via GitHub


blackcheckren commented on issue #9748:
URL: https://github.com/apache/hudi/issues/9748#issuecomment-1774052369

   
![d29f8bf682be3042130b06390613b9b](https://github.com/apache/hudi/assets/88579280/b318c56e-9e55-4fba-8a17-2b911b36f7f2)
   
![fbc7345f6e288f4439554fc9a4db36d](https://github.com/apache/hudi/assets/88579280/0e892baa-3efe-4433-bfce-e012a5e8df3e)
   





Re: [I] [SUPPORT]Data loss occurs when using bulkinsert [hudi]

2023-10-22 Thread via GitHub


blackcheckren commented on issue #9748:
URL: https://github.com/apache/hudi/issues/9748#issuecomment-1774052145

   @ad1happy2go Sorry for the late reply. I read the MaxCompute tables into
memory, sorted them by primary key, and wrote them into the Hudi table. Then I
read the Hudi table back from the file system and compared it with the original
table. However, I did not find anything abnormal in the data itself. I printed
the records that exist in the original table but are missing from the Hudi
table, and for each of them I also read the record whose primary key is one
less in the original table; that data looked normal too. It's confusing to me.
   I wonder if the number of null values in the timestamp field is the cause,
because I observe that there is only one null value in the data above, with an
even number of null values above and below it.
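
   The comparison described above can be sketched in plain Python. This is only
an illustration of the check, not the code used in the report: the function
name, the `id` key, and the in-memory row dictionaries are assumptions. On a
real table the same check would more likely be a Spark `left_anti` join on the
primary key.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Row = Dict[str, object]


def find_missing_with_neighbor(
    source_rows: Iterable[Row],
    target_keys: Iterable[object],
    key: str = "id",  # hypothetical primary-key column name
) -> List[Tuple[Row, Optional[Row]]]:
    """Return (missing_row, previous_source_row) pairs.

    A row is "missing" when its key is in the source but not in the
    target. The previous row is the one just before it in primary-key
    order, so both can be inspected side by side.
    """
    present = set(target_keys)
    ordered = sorted(source_rows, key=lambda r: r[key])
    pairs: List[Tuple[Row, Optional[Row]]] = []
    previous: Optional[Row] = None
    for row in ordered:
        if row[key] not in present:
            pairs.append((row, previous))
        previous = row
    return pairs
```

   Printing each pair makes it easy to see whether a missing record differs
from its neighbor, for example in whether the timestamp field is null.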





Re: [I] [SUPPORT]Data loss occurs when using bulkinsert [hudi]

2023-10-18 Thread via GitHub


ad1happy2go commented on issue #9748:
URL: https://github.com/apache/hudi/issues/9748#issuecomment-1768467782

   Great. Thanks, @blackcheckren. Let us know your findings.





Re: [I] [SUPPORT]Data loss occurs when using bulkinsert [hudi]

2023-10-16 Thread via GitHub


blackcheckren commented on issue #9748:
URL: https://github.com/apache/hudi/issues/9748#issuecomment-1764204817

   @ad1happy2go I found that a friend in the official group ran into the same
problem: he was ingesting data from SQL Server, and whenever a datetime column
was ingested, rows came out out of order and wrong. Users of our platform kept
reporting data errors, and we also found data loss during our investigation.
After noticing that basically all of the affected tables have a datetime field,
I used SQL to convert those fields to String before writing, and the wrong rows
and the data loss disappeared.
   Tomorrow I will sort the data and write it into the table, check the data
before and after the wrong rows and the lost rows, and post the relevant
information in a follow-up comment.





Re: [I] [SUPPORT]Data loss occurs when using bulkinsert [hudi]

2023-10-16 Thread via GitHub


ad1happy2go commented on issue #9748:
URL: https://github.com/apache/hudi/issues/9748#issuecomment-1763956806

   @blackcheckren Yes, they follow different core writer paths, so they handle
the data differently. Out of curiosity, I am still concerned about why the
Spark timestamp type would cause data loss. Can you explain your findings in a
bit more detail? It may be a potential bug in the code that we want to fix.
Thanks.
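
   To compare the two writer paths on the same input, one option is to run the
same write twice and change only the operation type. The helper below is a
hypothetical sketch (its name and parameters are not from this thread); it only
builds the option map, using the standard Hudi Spark datasource keys, and
assumes a table with a single record key and a precombine field.

```python
from typing import Dict

# Operation types discussed in this thread.
SUPPORTED_OPERATIONS = ("bulk_insert", "insert")


def hudi_write_options(
    table_name: str,
    record_key: str,
    precombine_field: str,
    operation: str,
) -> Dict[str, str]:
    """Build Hudi write options that differ only in the operation type."""
    if operation not in SUPPORTED_OPERATIONS:
        raise ValueError(f"unsupported operation: {operation!r}")
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.operation": operation,
    }
```

   The resulting dictionary could be passed to a Spark writer as
`df.write.format("hudi").options(**opts)`, once with `"bulk_insert"` and once
with `"insert"`, writing to two separate paths so the row counts can be
compared.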





Re: [I] [SUPPORT]Data loss occurs when using bulkinsert [hudi]

2023-10-14 Thread via GitHub


blackcheckren commented on issue #9748:
URL: https://github.com/apache/hudi/issues/9748#issuecomment-1762783089

   @ad1happy2go The problem has been located with help from friends in the Hudi
technical communication group. It occurs when Spark TimestampType data is
written to the Hudi table's parquet files, which causes data errors and loss.
Converting those columns to string is enough to avoid the problem, and two
friends have verified that this is indeed the case. But I still have some
questions. The problem mainly occurs with the bulk_insert operation, not in
insert mode. Do these two operation types handle writing data files
differently? I am not familiar with the source code and hope to get your
reply. Thank you.
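
   A minimal sketch of the workaround described above, assuming the DataFrame
schema is available as `(column_name, type_name)` pairs such as PySpark's
`df.dtypes` returns. The helper name is hypothetical. It only generates SQL
select expressions and treats a column as a timestamp when its type name is
exactly `timestamp`.

```python
from typing import Iterable, List, Tuple


def timestamp_to_string_exprs(dtypes: Iterable[Tuple[str, str]]) -> List[str]:
    """Return select expressions that cast timestamp columns to STRING.

    Columns of any other type are passed through unchanged.
    """
    exprs: List[str] = []
    for name, dtype in dtypes:
        # Backtick-quote the column name, escaping embedded backticks.
        quoted = "`" + name.replace("`", "``") + "`"
        if dtype == "timestamp":
            exprs.append(f"CAST({quoted} AS STRING) AS {quoted}")
        else:
            exprs.append(quoted)
    return exprs
```

   With PySpark this could be applied before the write as
`df.selectExpr(*timestamp_to_string_exprs(df.dtypes))`. It changes the stored
column type, so readers of the table would need to parse the strings back into
timestamps.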





Re: [I] [SUPPORT]Data loss occurs when using bulkinsert [hudi]

2023-10-13 Thread via GitHub


ad1happy2go commented on issue #9748:
URL: https://github.com/apache/hudi/issues/9748#issuecomment-1761804648

   Thanks a lot for the details, @blackcheckren. I will work on this.





Re: [I] [SUPPORT]Data loss occurs when using bulkinsert [hudi]

2023-10-10 Thread via GitHub


ad1happy2go commented on issue #9748:
URL: https://github.com/apache/hudi/issues/9748#issuecomment-1755514440

   @blackcheckren I couldn't reproduce this issue, and I'm not sure why it
would happen. The configurations look okay.
   
   In case you still have the Spark event logs, can you check whether there
were task/stage failures during the run that created duplicates? Or are you
getting this issue consistently when you re-ingest?
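
   One way to run the check suggested above is to scan the event log for task
and stage failures. The sketch below assumes an uncompressed event log with one
JSON object per line. The field names (`Event`, `Task End Reason`, `Reason`,
`Stage Info`, `Failure Reason`) follow Spark's JSON event log format as best I
recall it, so verify them against a real log before relying on the result.

```python
import json
from typing import Dict, Iterable, List


def find_failures(event_lines: Iterable[str]) -> Dict[str, List[dict]]:
    """Collect failed tasks and failed stages from Spark event log lines."""
    failures: Dict[str, List[dict]] = {"tasks": [], "stages": []}
    for line in event_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        kind = event.get("Event")
        if kind == "SparkListenerTaskEnd":
            # Assumed layout: {"Task End Reason": {"Reason": "Success"}}
            reason = event.get("Task End Reason", {}).get("Reason")
            if reason != "Success":
                failures["tasks"].append(
                    {"stage_id": event.get("Stage ID"), "reason": reason}
                )
        elif kind == "SparkListenerStageCompleted":
            info = event.get("Stage Info", {})
            # Assumed: "Failure Reason" is present only for failed stages.
            if "Failure Reason" in info:
                failures["stages"].append(
                    {"stage_id": info.get("Stage ID"),
                     "reason": info["Failure Reason"]}
                )
    return failures
```

   For a log file on local disk, `find_failures(open(path))` would work, where
`path` is a placeholder for the event log location. Empty lists for both keys
would suggest that retried tasks are not what produced the duplicates.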





Re: [I] [SUPPORT]Data loss occurs when using bulkinsert [hudi]

2023-10-06 Thread via GitHub


blackcheckren commented on issue #9748:
URL: https://github.com/apache/hudi/issues/9748#issuecomment-1751572069

   @ad1happy2go What other information do I need to provide in order to 
troubleshoot the problem?





Re: [I] [SUPPORT]Data loss occurs when using bulkinsert [hudi]

2023-10-06 Thread via GitHub


blackcheckren commented on issue #9748:
URL: https://github.com/apache/hudi/issues/9748#issuecomment-1751562608

   @ad1happy2go Yes, I tried inserting the data with the bulk_insert operation
type many times, and each time the same fixed number of records was missing.

