prasanthj opened a new pull request #2057:
URL: https://github.com/apache/hive/pull/2057


   ### What changes were proposed in this pull request?
   There has been a bug lurking in alter table concatenate for ORC, which is 
typically observed when the ORC files are large and spread across different 
nodes and racks. CombineFileInputFormat groups files together based on 
node/rack locality and on a default max split size of 256MB. If an ORC file is 
larger than 256MB and spans multiple nodes/racks, CombineFileInputFormat 
splits the file and places the pieces in different splits. When these splits 
are processed by the mappers of the merge task, the first task initiates the 
concatenation and, as part of task commit, moves the file to the scratch dir. 
When the same file is later processed by a mapper holding a different split, 
the file no longer exists because the earlier mapper has already moved it. 
This can cause failures in the alter table concatenate task and can also 
result in stripes being lost because of the partial concatenation. 
   This PR addresses the issue by making the mapper that gets the split 
containing the start of the file own the entire ORC file for concatenation. 
That mapper processes all the stripes, concatenates them into the destination 
file, and moves the source file. Mappers that do not get the start of the file 
simply skip it, since the file has already been handled or will be handled by 
a different mapper.
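
The ownership rule described above can be illustrated with a minimal, self-contained sketch. This is not the code in the patch: `SplitOwnershipSketch`, `Split`, `cut`, and `ownsFile` are hypothetical names, and the rule "the piece starting at byte offset 0 owns the file" is an assumption about how "start of the split" is detected.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitOwnershipSketch {

    /** Hypothetical stand-in for one piece of a file inside a combined split. */
    static final class Split {
        final String path;
        final long start;   // byte offset of this piece within the file
        final long length;  // number of bytes in this piece

        Split(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }
    }

    /** Cuts a file into consecutive pieces of at most maxSplitSize bytes. */
    static List<Split> cut(String path, long fileLength, long maxSplitSize) {
        if (maxSplitSize <= 0) {
            throw new IllegalArgumentException("maxSplitSize must be positive");
        }
        List<Split> pieces = new ArrayList<>();
        for (long start = 0; start < fileLength; start += maxSplitSize) {
            pieces.add(new Split(path, start, Math.min(maxSplitSize, fileLength - start)));
        }
        return pieces;
    }

    /** Assumed ownership rule: only the piece that begins at offset 0 owns the file. */
    static boolean ownsFile(Split split) {
        return split.start == 0L;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 600MB file with a 256MB max split size is cut into three pieces.
        for (Split s : cut("/warehouse/t/000000_0", 600 * mb, 256 * mb)) {
            String action = ownsFile(s)
                ? "concatenate all stripes, then move the source file"
                : "skip";
            System.out.println(s.path + " @" + s.start + " -> " + action);
        }
    }
}
```

Under this rule exactly one piece of each file claims ownership, so the source file is moved only once, no matter how many mappers receive a piece of it.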
   
   ### Why are the changes needed?
   To avoid concatenation failures and stripe loss issues. 
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Tested on an internal repro cluster that had large ORC files spanning 
multiple nodes and racks. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


