liangyu-1 commented on code in PR #7859:
URL: https://github.com/apache/hadoop/pull/7859#discussion_r2265543550


##########
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/FileInputFormat.java:
##########
@@ -369,6 +369,9 @@ public InputSplit[] getSplits(JobConf job, int numSplits)
         } else {
           blkLocations = fs.getFileBlockLocations(file, 0, length);
         }
+        if (blkLocations.length == 0){

Review Comment:
   @Hexiaoqiao thanks for the review.
   
   It is a rare corner case: it only happens when Spark Streaming monitors a path where the upstream system sometimes starts two identical tasks that attempt to create and write to the same HDFS file simultaneously. This can lead to conflicts where the file is created and written twice in quick succession.
   
   When Spark scans for files, it uses FileInputFormat.getSplits() to split each file. getSplits() first retrieves the file's length; if the length is non-zero, it then fetches the block locations array for that file. However, if the two upstream programs rapidly create and write to the same file (i.e., the file is overwritten or appended to almost simultaneously), a race condition can occur:
   
   The file's length is already non-zero, but getFileBlockLocations() returns an empty array because the file is being overwritten or has not yet been fully written.
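   
   To make the race concrete, here is a minimal, self-contained sketch (illustrative only and not part of this PR; the class name, path argument, and timing are assumptions) showing how a non-zero length observed via getFileStatus() can coexist with an empty array from getFileBlockLocations() when another writer re-creates the file between the two calls:
   
   ```java
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.BlockLocation;
   import org.apache.hadoop.fs.FileStatus;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;
   
   // Hypothetical reproduction sketch; assumes a concurrent writer
   // re-creates the file between the two metadata calls below.
   public class EmptyBlockLocationsRepro {
     public static void main(String[] args) throws Exception {
       FileSystem fs = FileSystem.get(new Configuration());
       Path path = new Path(args[0]);
   
       // First call: the file exists and already has data.
       FileStatus file = fs.getFileStatus(path);
       long length = file.getLen();
   
       // Second call: if the file was overwritten in between, the
       // NameNode may report no finalized blocks yet, even though the
       // length read above was non-zero.
       BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
   
       if (length != 0 && blkLocations.length == 0) {
         // This is the state the added check in getSplits() guards
         // against, instead of letting downstream code index into an
         // empty block-location array.
         System.out.println("Race hit: non-zero length, no block locations: " + path);
       }
     }
   }
   ```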



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

