gaoyajun02 commented on code in PR #38333: URL: https://github.com/apache/spark/pull/38333#discussion_r1012466373
########## core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala: ##########
@@ -794,7 +794,15 @@ final class ShuffleBlockFetcherIterator(
   // since the last call.
   val msg = s"Received a zero-size buffer for block $blockId from $address " +
     s"(expectedApproxSize = $size, isNetworkReqDone=$isNetworkReqDone)"
-  throwFetchFailedException(blockId, mapIndex, address, new IOException(msg))
+  if (blockId.isShuffleChunk) {
+    logWarning(msg)
+    pushBasedFetchHelper.initiateFallbackFetchForPushMergedBlock(blockId, address)

Review Comment:
Did you mean `PushMergedRemoteMetaFetchResult`? The size of the push-merged block is not zero. Since the size of each chunk cannot be obtained on the reduce side, we log zero-size chunks in the following server-side code, and we confirmed that the index file contains the same offset in consecutive entries. I actually don't understand why that happens yet...
https://github.com/apache/spark/blob/9a7596e1dde0f1dd596aa6d3b2efbcb5d1ef70ea/core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala#L500
Based on the hardware-layer error information, we have basically determined that the data loss happens while the data is being written.

-- This is an automated message from the Apache Git Service.
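To illustrate why the same offset appearing in consecutive index entries surfaces as a zero-size chunk, here is a minimal Scala sketch. It assumes the index is a flat sequence of big-endian longs where chunk i spans `offsets(i)` until `offsets(i + 1)`. The object `ZeroSizeChunkCheck` and its methods are hypothetical and are not part of Spark's `IndexShuffleBlockResolver`.

```scala
import java.nio.ByteBuffer

// Hypothetical helper, not part of Spark. Assumes the index is a sequence of
// big-endian longs (chunk offsets) and reports chunks whose length is zero.
object ZeroSizeChunkCheck {

  // Decodes every 8-byte big-endian long in the index bytes into an offset.
  def readOffsets(indexBytes: Array[Byte]): Array[Long] = {
    require(indexBytes.length % 8 == 0, "index length must be a multiple of 8 bytes")
    val buf = ByteBuffer.wrap(indexBytes) // ByteBuffer is big-endian by default
    Array.fill(indexBytes.length / 8)(buf.getLong())
  }

  // Assumed layout: chunk i spans [offsets(i), offsets(i + 1)), so its size
  // is the difference between neighbouring offsets.
  def chunkSizes(offsets: Array[Long]): Array[Long] =
    offsets.sliding(2).collect { case Array(start, end) => end - start }.toArray

  // Indices of chunks whose surrounding offsets are equal (size == 0).
  def zeroSizeChunks(offsets: Array[Long]): Seq[Int] =
    chunkSizes(offsets).zipWithIndex.collect { case (0L, i) => i }.toSeq

  // Encodes offsets as big-endian longs, e.g. to build a test index.
  def encode(offsets: Seq[Long]): Array[Byte] = {
    val buf = ByteBuffer.allocate(offsets.size * 8)
    offsets.foreach(o => buf.putLong(o))
    buf.array()
  }
}
```

For example, offsets 0, 100, 100, 250 describe three chunks, and the middle one (index 1) has size zero, which matches the pattern the server-side logging turned up. The sketch only detects the symptom; it says nothing about why the writer produced repeated offsets.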