nsivabalan commented on issue #5777: URL: https://github.com/apache/hudi/issues/5777#issuecomment-1524667071
Here are the reasons why you might see duplicates. So far, I could not pinpoint any of them for your use case, but if you find anything resembling your setup, let us know.

1. You are using "insert" or "bulk_insert" as the operation type.
2. You are using one of the global indexes (global_bloom or global_simple) and records migrated from one partition to another. Until compaction kicks in, you may see duplicates with a read-optimized (RO) query; once compaction completes for the older file group of interest, the duplicates should disappear.
3. Also with a global index: if records migrate from one partition to another, and then to yet another, before compaction can kick in, you may see duplicates even with a real-time (RT) query. These may not go away even after compaction.
4. Multi-writer scenarios where both writers ingested the same records concurrently for the first time.
5. If writing via Spark structured streaming with "insert" or "bulk_insert", there is also a chance of duplicates.

I will try to update this comment if I can think of any more reasons.
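To illustrate reason 1: "insert"/"bulk_insert" skip the key-based index lookup that "upsert" performs, so re-ingesting a record key appends a second copy. This is a toy Python sketch of that semantic difference, not Hudi code; the function names and record layout are made up for illustration.

```python
def insert(table, records):
    # "insert" semantics: records are appended without any key lookup,
    # so ingesting the same record key twice produces two rows.
    table.extend(records)

def upsert(table, records):
    # "upsert" semantics: an index lookup on the record key routes
    # incoming records to existing entries, replacing them in place.
    by_key = {r["key"]: r for r in table}
    for r in records:
        by_key[r["key"]] = r
    table[:] = list(by_key.values())

table = []
batch = [{"key": 1, "val": "a"}]
insert(table, batch)
insert(table, batch)                      # same key ingested twice
assert len(table) == 2                    # -> duplicate rows

table = []
upsert(table, batch)
upsert(table, [{"key": 1, "val": "b"}])   # same key ingested twice
assert len(table) == 1                    # -> single, updated row
assert table[0]["val"] == "b"
```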