[GitHub] [hudi] nsivabalan commented on issue #6606: Observing data duplication with Single Writer

2022-10-22 Thread GitBox


nsivabalan commented on issue #6606:
URL: https://github.com/apache/hudi/issues/6606#issuecomment-1287946948

   I have put up a patch to auto-retry Spark datasource writes in case of 
conflicts: https://github.com/apache/hudi/pull/6854
   Hope that helps your case.
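
For readers wondering what retrying on conflict looks like from the caller's side, here is a minimal sketch. It is not the code from the PR above: `WriteConflictError`, `write_with_retry`, and `flaky_write` are hypothetical names invented for illustration, and the write itself is stubbed out by a plain Python callable.

```python
import time


class WriteConflictError(Exception):
    """Hypothetical stand-in for a write that lost conflict resolution."""


def write_with_retry(write_fn, max_attempts=3, backoff_seconds=0.0):
    """Call write_fn until it succeeds or max_attempts is exhausted.

    Returns the number of attempts used. Re-raises the last
    WriteConflictError if every attempt conflicts.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            write_fn()
            return attempt
        except WriteConflictError:
            if attempt == max_attempts:
                raise
            # Linear backoff before the next attempt.
            time.sleep(backoff_seconds * attempt)


# Demo: a stubbed write that conflicts twice and then succeeds.
calls = {"n": 0}


def flaky_write():
    calls["n"] += 1
    if calls["n"] < 3:
        raise WriteConflictError("conflicting commit")


attempts = write_with_retry(flaky_write)
```

A retry only helps when the conflicting batch can safely be written again against the latest table state, so the retried write should redo its work rather than replay stale results.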
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #6606: Observing data duplication with Single Writer

2022-09-20 Thread GitBox


nsivabalan commented on issue #6606:
URL: https://github.com/apache/hudi/issues/6606#issuecomment-1253030434

   Nope, that's not how it works as of today. The 2nd writer doesn't wait for 
the 1st writer to complete. That's not OCC at all, in my understanding. What 
you are suggesting is to take a global lock for each write, complete the 
write, release the lock, and then start the next write. In my opinion, this is 
just a sequential batch of writes.
   
   In a general sense, multi-writer means two writers can write to Hudi 
concurrently. If they don't overlap w.r.t. the data they update, both should 
succeed. If they do overlap, one of them will fail.
   
   Let me know if you need more clarification.
   





[GitHub] [hudi] nsivabalan commented on issue #6606: Observing data duplication with Single Writer

2022-09-15 Thread GitBox


nsivabalan commented on issue #6606:
URL: https://github.com/apache/hudi/issues/6606#issuecomment-1248873001

   You can read about multi-writer guarantees here: 
https://hudi.apache.org/docs/concurrency_control#multi-writer-guarantees
   





[GitHub] [hudi] nsivabalan commented on issue #6606: Observing data duplication with Single Writer

2022-09-15 Thread GitBox


nsivabalan commented on issue #6606:
URL: https://github.com/apache/hudi/issues/6606#issuecomment-1248872759

   Here is what is happening. 
   If there are two concurrent writers writing to non-overlapping data files, 
Hudi will let both writes succeed. But if both are modifying the same data 
file, Hudi will let one write succeed and fail the other. Hence you are seeing 
"conflict resolution failed".
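
The rule above can be illustrated with a toy model. This is only a sketch of the file-overlap idea, not Hudi's implementation: `PendingWrite`, `can_commit`, `commit_all`, and the file ids are hypothetical, and the model assumes writes reach the commit step one at a time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PendingWrite:
    writer: str
    file_ids: frozenset  # data files this write modifies


def can_commit(committed, candidate):
    """True if candidate touches no file already modified by a committed write."""
    return all(candidate.file_ids.isdisjoint(c.file_ids) for c in committed)


def commit_all(writes):
    """Try to commit writes in order; return (committed, failed) lists."""
    committed, failed = [], []
    for w in writes:
        if can_commit(committed, w):
            committed.append(w)
        else:
            failed.append(w)
    return committed, failed


a = PendingWrite("writer-1", frozenset({"file-1", "file-2"}))
b = PendingWrite("writer-2", frozenset({"file-3"}))
c = PendingWrite("writer-2", frozenset({"file-2", "file-4"}))

# Non-overlapping data files: both writes succeed.
ok_disjoint, failed_disjoint = commit_all([a, b])

# Both touch file-2: the first to commit wins, the other fails.
ok_overlap, failed_overlap = commit_all([a, c])
```

In this model the losing write is the one that reaches the commit step second, which is why rerunning the failed batch is the usual recovery.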
   
   





[GitHub] [hudi] nsivabalan commented on issue #6606: Observing data duplication with Single Writer

2022-09-10 Thread GitBox


nsivabalan commented on issue #6606:
URL: https://github.com/apache/hudi/issues/6606#issuecomment-1242774819

   @koochiswathiTR: can you check my response above and update, please?





[GitHub] [hudi] nsivabalan commented on issue #6606: Observing data duplication with Single Writer

2022-09-06 Thread GitBox


nsivabalan commented on issue #6606:
URL: https://github.com/apache/hudi/issues/6606#issuecomment-1238882651

   Oh, I thought both jobs were running concurrently. Are they not? Can you 
shed some light on the exact steps? 
   Is it: 
   step1: start job1 in EMR cluster1, which consumes from source X and writes 
to hudi table Y.
   step2: stop job1. It's essentially a batch job.
   step3: start job2 in EMR cluster2, which again consumes from source X and 
writes to hudi table Y. 
   Now, if you query hudi, you see duplicate data? 
   
   Is my understanding right? 
   
   Also, can you share the write configs you used? 
   
   





[GitHub] [hudi] nsivabalan commented on issue #6606: Observing data duplication with Single Writer

2022-09-06 Thread GitBox


nsivabalan commented on issue #6606:
URL: https://github.com/apache/hudi/issues/6606#issuecomment-1238880984

   Unless you configure lock providers, Hudi can't guarantee this. I would 
suggest adding locking for both writers. 
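
As a rough sketch of what adding locking could look like, the options below enable optimistic concurrency control with the ZooKeeper-based lock provider described in the Hudi concurrency control docs. The helper name `occ_lock_options` and the host, port, lock key, and base path values are placeholders assumed for illustration, not values from this issue. Both writers would need the same lock settings.

```python
def occ_lock_options(zk_url, zk_port, lock_key, base_path):
    """Build Hudi write options for OCC with a ZooKeeper lock provider.

    The argument values are supplied by the caller; the ones used in the
    demo below are placeholders.
    """
    return {
        "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
        "hoodie.cleaner.policy.failed.writes": "LAZY",
        "hoodie.write.lock.provider":
            "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
        "hoodie.write.lock.zookeeper.url": zk_url,
        "hoodie.write.lock.zookeeper.port": str(zk_port),
        "hoodie.write.lock.zookeeper.lock_key": lock_key,
        "hoodie.write.lock.zookeeper.base_path": base_path,
    }


# Placeholder values; replace with your own ZooKeeper endpoint and table key.
lock_opts = occ_lock_options("zk-host.example.com", 2181, "table_y", "/hudi/locks")

# With PySpark, these would be merged into the existing write options, e.g.
# df.write.format("hudi").options(**existing_opts).options(**lock_opts)...
```

Other lock providers exist; the ZooKeeper one is used here only because its option keys are documented in the concurrency control page linked earlier in this thread.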

