szlta commented on a change in pull request #1576:
URL: https://github.com/apache/hive/pull/1576#discussion_r503970960
##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java
##########
@@ -1156,13 +1157,36 @@ public BISplitStrategy(Context context, FileSystem fs, Path dir,
       } else {
         TreeMap<Long, BlockLocation> blockOffsets = SHIMS.getLocationsWithOffset(fs, fileStatus);
         for (Map.Entry<Long, BlockLocation> entry : blockOffsets.entrySet()) {
-          if (entry.getKey() + entry.getValue().getLength() > logicalLen) {
+          long blockOffset = entry.getKey();
+          long blockLength = entry.getValue().getLength();
+          if (blockOffset > logicalLen) {
             //don't create splits for anything past logical EOF
-            continue;
+            //map is ordered, thus any possible entry in the iteration after this is bound to be > logicalLen
+            break;
           }
-          OrcSplit orcSplit = new OrcSplit(fileStatus.getPath(), fileKey, entry.getKey(),
-              entry.getValue().getLength(), entry.getValue().getHosts(), null, isOriginal, true,
-              deltas, -1, logicalLen, dir, offsetAndBucket);
+          long splitLength = blockLength;
+
+          long blockEndOvershoot = (blockOffset + blockLength) - logicalLen;
+          if (blockEndOvershoot > 0) {
+            // if logicalLen is placed within a block, we should make (this last) split out of the part of this block
+            // -> we should read less than block end
+            splitLength -= blockEndOvershoot;
+          } else if (blockOffsets.lastKey() == blockOffset && blockEndOvershoot < 0) {
+            // This is the last block but it ends before logicalLen.
+            // This can happen with HDFS if hflush was called and blocks are not persisted to disk yet, but content
+            // is otherwise available for readers, as DNs have these buffers in memory at this time.
+            // -> we should read more than (persisted) block end, but surely not more than the whole block
+            if (fileStatus instanceof HdfsLocatedFileStatus) {
+              HdfsLocatedFileStatus hdfsFileStatus = (HdfsLocatedFileStatus) fileStatus;
+              if (hdfsFileStatus.getLocatedBlocks().isUnderConstruction()) {
+                // blockEndOvershoot is negative here...
+                splitLength = Math.min(splitLength - blockEndOvershoot, hdfsFileStatus.getBlockSize());
Review comment:
hdfsFileStatus.getBlockSize() is not the actual block length but the configured (maximum) block size (e.g. 256MB), i.e. the largest a block can ever grow to. That's why we shouldn't read past it in this split.
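
To make the distinction concrete, here is a minimal standalone sketch of the clamping (the helper clampSplitLength and all the numbers are made up for illustration; maxBlockSize stands in for hdfsFileStatus.getBlockSize(), and the last-block / isUnderConstruction checks from the patch are omitted):

public class SplitLengthSketch {
  // Hypothetical helper mirroring the patch's arithmetic: shrink the split if
  // the block sticks out past logicalLen, grow it towards logicalLen if the
  // block ends short of it, but never exceed one full (configured) block.
  static long clampSplitLength(long blockOffset, long blockLength,
                               long logicalLen, long maxBlockSize) {
    long splitLength = blockLength;
    long blockEndOvershoot = (blockOffset + blockLength) - logicalLen;
    if (blockEndOvershoot > 0) {
      // logicalLen falls inside this block: read less than the block end
      splitLength -= blockEndOvershoot;
    } else if (blockEndOvershoot < 0) {
      // hflush case: readable bytes extend past the persisted block length;
      // cap at the configured block size, not at the persisted length
      splitLength = Math.min(splitLength - blockEndOvershoot, maxBlockSize);
    }
    return splitLength;
  }

  public static void main(String[] args) {
    long mb = 1024L * 1024L;
    // Block persisted up to 100MB, 120MB readable after hflush, 256MB block size:
    // overshoot = -20MB, so min(100MB + 20MB, 256MB) = 120MB.
    System.out.println(clampSplitLength(0, 100 * mb, 120 * mb, 256 * mb) / mb); // 120
    // logicalLen inside the block: block end 384MB vs logicalLen 300MB -> 44MB split.
    System.out.println(clampSplitLength(256 * mb, 128 * mb, 300 * mb, 256 * mb) / mb); // 44
    // Even if logicalLen claimed far more than one block, the cap holds at 256MB.
    System.out.println(clampSplitLength(0, 200 * mb, 900 * mb, 256 * mb) / mb); // 256
  }
}

Note that if the cap were the persisted block length instead of the configured maximum, the first case would come out as min(120MB, 100MB) = 100MB and the 20MB of hflush'd data would be missing from the split.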