nsivabalan commented on issue #5777: URL: https://github.com/apache/hudi/issues/5777#issuecomment-1524667071
Here are the reasons why you might see duplicates. So far, I could not pinpoint any of them for your use case, but if you find anything resembling your setup, let us know.

1. You are using "insert" or "bulk_insert" as the operation type.
2. You are using one of the global indexes (global_bloom or global_simple) and records migrated from one partition to another. Until compaction kicks in, you may see duplicates with a read-optimized (RO) query; once compaction completes for the older file group of interest, the duplicates should disappear.
3. Also with a global index: if records migrate from one partition to another, and then to yet another, before compaction can kick in, you may see duplicates even with a real-time (RT) query. These may not go away even after compaction.
4. Multi-writer scenarios where both writers ingested the same records concurrently for the first time.
5. If writing via Spark structured streaming with "insert" or "bulk_insert", there is also a chance of duplicates.

I will try to update this comment if I can think of any more reasons.
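To illustrate reason 1: "insert"/"bulk_insert" skip the key-based index lookup that "upsert" performs, so re-ingesting a record key appends a second copy. This is a toy Python sketch of that semantic difference, not Hudi code; the function names and record layout are made up for illustration.

```python
def insert(table, records):
    # "insert" semantics: records are appended without any key lookup,
    # so ingesting the same record key twice produces two rows.
    table.extend(records)

def upsert(table, records):
    # "upsert" semantics: an index lookup on the record key routes
    # incoming records to existing entries, replacing them in place.
    by_key = {r["key"]: r for r in table}
    for r in records:
        by_key[r["key"]] = r
    table[:] = list(by_key.values())

table = []
batch = [{"key": 1, "val": "a"}]
insert(table, batch)
insert(table, batch)                      # same key ingested twice
assert len(table) == 2                    # -> duplicate rows

table = []
upsert(table, batch)
upsert(table, [{"key": 1, "val": "b"}])   # same key ingested twice
assert len(table) == 1                    # -> single, updated row
assert table[0]["val"] == "b"
```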