pavibhai opened a new pull request #1072: URL: https://github.com/apache/orc/pull/1072
Optimizes the reading of streams in ORC by combining multiple nearby reads into a single read, optionally retaining or dropping the extra bytes that fall between them.

### What changes were proposed in this pull request?
We are introducing two new configuration parameters that control how streams are read in ORC:
* minSeekSize: If the separation between multiple reads is within minSeekSize, they are combined into a single read
* minSeekSizeTolerance: Helps decide whether to retain the extra bytes (costing extra memory) or spend extra CPU to drop the unwanted bytes

### Why are the changes needed?
This leads to significant time savings (and cost savings) when dealing with AWS S3. Reads with gaps, e.g. reading alternate columns, show a significant penalty: 5.8s without the patch vs 1.4s with it.

### How was this patch tested?
* New unit tests were added
* None of the existing tests were changed
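
To make the combining rule above concrete, here is a minimal Python sketch of how nearby reads might be merged. It is not the code from this pull request: the names `CombinedRead`, `combine_reads`, and `should_retain_extra` are hypothetical, the input ranges are assumed not to overlap, and treating the tolerance as a fraction of the requested bytes is an assumption that the description above does not confirm.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CombinedRead:
    """One physical read covering one or more requested (offset, length) ranges."""
    offset: int
    length: int
    parts: List[Tuple[int, int]]

    @property
    def requested_bytes(self) -> int:
        return sum(length for _, length in self.parts)

    @property
    def extra_bytes(self) -> int:
        # Gap bytes that were read only to avoid a separate seek.
        return self.length - self.requested_bytes


def combine_reads(ranges: List[Tuple[int, int]], min_seek_size: int) -> List[CombinedRead]:
    """Merge requested ranges whose gap is at most min_seek_size bytes.

    Assumes the ranges do not overlap. Using <= for "within" is an assumption.
    """
    combined: List[CombinedRead] = []
    for offset, length in sorted(ranges):
        if combined:
            last = combined[-1]
            last_end = last.offset + last.length
            if offset - last_end <= min_seek_size:
                # Extend the previous read across the gap to cover this range.
                last.length = offset + length - last.offset
                last.parts.append((offset, length))
                continue
        combined.append(CombinedRead(offset, length, [(offset, length)]))
    return combined


def should_retain_extra(read: CombinedRead, tolerance: float) -> bool:
    """Assumed policy: keep the gap bytes in memory when they are a small
    fraction of the requested bytes; otherwise pay CPU to copy them out."""
    return read.extra_bytes <= tolerance * read.requested_bytes


reads = combine_reads([(0, 100), (150, 100), (1000, 100)], min_seek_size=64)
for r in reads:
    print(r.offset, r.length, r.extra_bytes, should_retain_extra(r, 0.5))
```

In this sketch the first two ranges are 50 bytes apart, which is within the 64-byte threshold, so they become one 250-byte read carrying 50 extra bytes; the third range is too far away and stays a separate read.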