kuczoram commented on pull request #2231: URL: https://github.com/apache/hive/pull/2231#issuecomment-829104540
Hi Krisztian!

Thanks for this patch, it is very interesting. I have one question about using multiple reducers: do you know how it is guaranteed that all rows with the same bucket go to the same reducer and, in the end, into the same file?

I am asking because during compaction we had issues where rows with the same bucket number went to different reducers, and we ended up with corrupted files (either rows with the same bucket number were spread across different files, or a single file contained rows with different bucket numbers). I saw this when I created an unbucketed table but inserted a large amount of data, so that the table ended up containing multiple bucket files.

I know that compaction is a different story; I am just curious whether something similar could happen with deletes/updates using multiple reducers. If you know how rows are distributed between the reducers for deletes and updates, I would be really grateful if you could share some details.

Thanks and regards,
Marta
