[GitHub] [hudi] nsivabalan commented on issue #6606: Observing data duplication with Single Writer
nsivabalan commented on issue #6606: URL: https://github.com/apache/hudi/issues/6606#issuecomment-1287946948 I have put up a patch to auto-retry Spark datasource writes in case of conflicts: https://github.com/apache/hudi/pull/6854 Hope that helps your case. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
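To illustrate the idea of retrying a write after a conflict, here is a minimal Python sketch of a retry loop. It is not the implementation in the linked patch: `WriteConflictError`, `write_with_retry`, `max_attempts`, and `backoff_seconds` are hypothetical names introduced only for this example, and `write_fn` stands in for whatever call performs the actual write.

```python
import time


class WriteConflictError(Exception):
    """Hypothetical stand-in for the conflict failure a writer sees."""


def write_with_retry(write_fn, max_attempts=3, backoff_seconds=0.0):
    """Call write_fn until it succeeds or max_attempts is exhausted.

    Only conflict failures are retried; any other exception propagates
    immediately. The last conflict is re-raised if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return write_fn()
        except WriteConflictError:
            if attempt == max_attempts:
                raise
            # Linear backoff: wait a little longer after each failed attempt.
            time.sleep(backoff_seconds * attempt)
```

A retry only makes sense when the write is safe to repeat, for example an upsert keyed on a record key, since the retried attempt runs against the table state left by the writer that won.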
nsivabalan commented on issue #6606: URL: https://github.com/apache/hudi/issues/6606#issuecomment-1253030434 Nope, that's not how it works as of today. The 2nd writer doesn't wait for the 1st writer to complete; that's not OCC at all in my understanding. What you are suggesting is to take a global lock for each write, complete the write, release the lock, and then start the next write. In my opinion, that is just a sequential batch of writes. In the general sense, multi-writer means two writers can write to Hudi concurrently. If they don't overlap in the data they update, both should succeed; if they do overlap, one of them will fail. Let me know if you need more clarification.
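The behaviour described above can be sketched as a toy simulation in Python. This is an illustration of optimistic concurrency control in general, not Hudi's code: the `Table` class, its methods, and the file names are all made up for the example, and the sketch ignores the locking a real system needs around the commit step.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    # One entry per completed commit: the set of file IDs that commit modified.
    commits: list = field(default_factory=list)

    def begin(self):
        """Return how many commits were already complete when the writer started."""
        return len(self.commits)

    def try_commit(self, started_at, touched_files):
        """Commit unless a write completed after started_at touched the same files."""
        touched = set(touched_files)
        for files in self.commits[started_at:]:
            if files & touched:
                return False  # overlapping write finished first: this one fails
        self.commits.append(touched)
        return True


table = Table()
# Three writers start concurrently, before any of them has committed.
w1, w2, w3 = table.begin(), table.begin(), table.begin()

first = table.try_commit(w1, {"file-a"})   # nothing committed yet, so it succeeds
second = table.try_commit(w2, {"file-b"})  # no overlap with file-a, so it succeeds
third = table.try_commit(w3, {"file-a"})   # overlaps the first commit, so it fails
```

No writer waits for another here: each one does its work and only checks for overlap at commit time, which is the difference from taking a global lock per write.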
nsivabalan commented on issue #6606: URL: https://github.com/apache/hudi/issues/6606#issuecomment-1248873001 You can read about the multi-writer guarantees here: https://hudi.apache.org/docs/concurrency_control#multi-writer-guarantees
nsivabalan commented on issue #6606: URL: https://github.com/apache/hudi/issues/6606#issuecomment-1248872759 Here is what is happening. If two concurrent writers write to non-overlapping data files, Hudi lets both writes succeed. But if both modify the same data file, Hudi lets one write succeed and fails the other, which is why you are seeing the conflict resolution failure.
nsivabalan commented on issue #6606: URL: https://github.com/apache/hudi/issues/6606#issuecomment-1242774819 @koochiswathiTR: can you check my response above and update, please?
nsivabalan commented on issue #6606: URL: https://github.com/apache/hudi/issues/6606#issuecomment-1238882651 Oh, I thought both jobs were running concurrently. Are they not? Can you shed some light on the exact steps? Is it: step 1: start job1 in EMR cluster1, which consumes from source X and writes to Hudi table Y; step 2: stop job1 (it's essentially a batch job); step 3: start job2 in EMR cluster2, which again consumes from source X and writes to Hudi table Y. Now if you query Hudi, you see duplicate data? Is my understanding right? Also, can you share the write configs you used?
nsivabalan commented on issue #6606: URL: https://github.com/apache/hudi/issues/6606#issuecomment-1238880984 Unless you configure lock providers, Hudi can't guarantee this. I would suggest adding locking for both writers.
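As a rough illustration of what adding locking to a writer could look like, the Python sketch below builds a set of write options for a ZooKeeper-based lock provider. The option keys follow the Hudi concurrency control documentation as far as I know, but please verify them against the docs for your Hudi version. The helper name `occ_lock_options` and every value passed to it (host, port, lock key, base path) are placeholders assumed for this example, not settings from the issue.

```python
def occ_lock_options(zk_host, zk_port, lock_key, base_path):
    """Build write options enabling OCC with a ZooKeeper lock provider.

    All argument values are deployment-specific; the ones used below
    are placeholders for illustration only.
    """
    return {
        "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
        "hoodie.cleaner.policy.failed.writes": "LAZY",
        "hoodie.write.lock.provider": (
            "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider"
        ),
        "hoodie.write.lock.zookeeper.url": zk_host,
        "hoodie.write.lock.zookeeper.port": str(zk_port),
        "hoodie.write.lock.zookeeper.lock_key": lock_key,
        "hoodie.write.lock.zookeeper.base_path": base_path,
    }


# Placeholder values; both writers would need to point at the same lock.
lock_options = occ_lock_options("zk-host.example", 2181, "table_y", "/hudi/locks")
```

With PySpark, a dictionary like this could be merged into the existing write options, for example via `df.write.format("hudi").options(**lock_options)`, and the same lock settings would have to be applied to both writers.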