pavibhai opened a new pull request #1072:
URL: https://github.com/apache/orc/pull/1072


   Optimizes the read of streams in ORC by combining multiple nearby reads into a 
single read, with the option to either retain or drop the extra bytes read.
   
   * minSeekSize: If the gap between adjacent reads is within minSeekSize, 
they are combined into a single read
   * minSeekSizeTolerance: Controls the decision of whether to retain the 
extra bytes (costing extra memory) or spend extra CPU to drop the unwanted bytes
   
   ### What changes were proposed in this pull request?
   We are introducing two new configuration parameters that control how streams 
are read in ORC:
   * minSeekSize: If the gap between adjacent reads is within minSeekSize, 
they are combined into a single read
   * minSeekSizeTolerance: Controls the decision of whether to retain the 
extra bytes (costing extra memory) or spend extra CPU to drop the unwanted bytes
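
   To illustrate how the two parameters interact, here is a minimal standalone 
sketch. It is not the code in this patch: `ReadCoalescer`, `Range`, `MergedRead`, 
`coalesce` and `retainExtraBytes` are hypothetical names, the input ranges are 
assumed to be sorted by offset and non-overlapping, and the tolerance rule 
(compare the ratio of extra bytes to requested bytes against 
minSeekSizeTolerance) is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the classes used by the patch.
final class ReadCoalescer {
    /** A requested byte range [offset, end) in the file. */
    static final class Range {
        final long offset;
        final long end;
        Range(long offset, long end) { this.offset = offset; this.end = end; }
    }

    /** One physical read that covers one or more requested ranges. */
    static final class MergedRead {
        final long offset;
        long end;
        long requestedBytes; // bytes the caller actually asked for

        MergedRead(Range first) {
            offset = first.offset;
            end = first.end;
            requestedBytes = first.end - first.offset;
        }

        void add(Range r) {
            end = Math.max(end, r.end);
            requestedBytes += r.end - r.offset;
        }

        /** Bytes read only because they sit in the gaps between requested ranges. */
        long extraBytes() { return (end - offset) - requestedBytes; }
    }

    /** Merges adjacent ranges whose gap is within minSeekSize. Input must be sorted by offset. */
    static List<MergedRead> coalesce(List<Range> sortedRanges, long minSeekSize) {
        List<MergedRead> result = new ArrayList<>();
        MergedRead current = null;
        for (Range r : sortedRanges) {
            if (current != null && r.offset - current.end <= minSeekSize) {
                current.add(r);            // close enough: extend the current read
            } else {
                current = new MergedRead(r); // too far apart: start a new read
                result.add(current);
            }
        }
        return result;
    }

    /**
     * Assumed rule: keep the extra bytes in memory when they are a small
     * fraction of the requested bytes, otherwise spend CPU to drop them.
     */
    static boolean retainExtraBytes(MergedRead read, double minSeekSizeTolerance) {
        return (double) read.extraBytes() / read.requestedBytes <= minSeekSizeTolerance;
    }
}
```

   With ranges [0, 100), [150, 250) and [10000, 10100) and a minSeekSize of 64, 
the first two ranges are merged into one read of 250 bytes with 50 extra bytes, 
while the third stays a separate read.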
   
   
   ### Why are the changes needed?
   This leads to significant time savings (and cost savings) when reading from AWS 
S3. Reads with gaps, e.g. reading alternate columns, show a significant penalty: 
5.8s without the patch vs 1.4s with it.
   
   ### How was this patch tested?
   * New Unit Tests were added
   * None of the existing tests were changed



