pgaref commented on a change in pull request #996:
URL: https://github.com/apache/orc/pull/996#discussion_r784011465
##########
File path: java/core/src/java/org/apache/orc/impl/RecordReaderUtils.java
##########
@@ -182,10 +182,13 @@ public static long estimateRgEndOffset(boolean isCompressed,
                                            long streamLength) {
     // figure out the worst case last location
     // if adjacent groups have the same compressed block offset then stretch the slop
-    // by factor of 2 to safely accommodate the next compression block.
-    // One for the current compression block and another for the next compression block.
+    // by a factor to safely accommodate the next compression block.
+    // 512 is the MAX_SCOPE defined in RunLengthIntegerWriterV2.
+    // 8 is the maximum size of bytes for each value (see RunLengthIntegerWriterV2.zzBits100p).
+    // We need to calculate the maximum number of blocks by bufferSize accordingly.
+    int stretchFactor = bufferSize > 0 ? 2 + (512 * 8 - 1) / bufferSize : 2;
     long slop = isCompressed
-        ? 2 * (OutStream.HEADER_SIZE + bufferSize)
+        ? stretchFactor * (OutStream.HEADER_SIZE + bufferSize)
         : WORST_UNCOMPRESSED_SLOP;
     return isLast ? streamLength : Math.min(streamLength, nextGroupOffset + slop);
Review comment:
That's correct @hsnusonic -- we start by reading each stripe footer and the RowIndexes, and check whether any row groups match our filters. If there is no match, we proceed to the next stripe. In that sense the change to the slop above won't make any difference, as that data should already be in memory.
More details:
https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L1264
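The stretchFactor arithmetic in the patch can be sketched standalone. This is a hypothetical, self-contained illustration, not the actual ORC code: `HEADER_SIZE = 3` (the 3-byte ORC compression chunk header) and the sample buffer sizes are assumptions for the example.

```java
public class SlopSketch {
    // Assumption for illustration: ORC's compression chunk header is 3 bytes.
    static final int HEADER_SIZE = 3;

    static long compressedSlop(int bufferSize) {
        // 512 = MAX_SCOPE in RunLengthIntegerWriterV2, 8 = max bytes per value,
        // so a single RLE run can span up to 512 * 8 bytes. The ceiling division
        // (512 * 8 - 1) / bufferSize counts how many extra compression blocks
        // that run can straddle beyond the baseline factor of 2.
        int stretchFactor = bufferSize > 0 ? 2 + (512 * 8 - 1) / bufferSize : 2;
        return (long) stretchFactor * (HEADER_SIZE + bufferSize);
    }

    public static void main(String[] args) {
        // Large buffer (256 KiB): (4096 - 1) / 262144 == 0, factor stays 2.
        System.out.println(compressedSlop(256 * 1024)); // 2 * (3 + 262144) = 524294
        // Tiny buffer (1 KiB): 4095 / 1024 == 3, factor becomes 5.
        System.out.println(compressedSlop(1024));       // 5 * (3 + 1024) = 5135
    }
}
```

For the common 256 KiB buffer the behavior is unchanged from the old hard-coded factor of 2; the stretch only kicks in for small buffers, where one maximal RLE run can cross several compression blocks.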
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]