[GitHub] [spark] gaoyajun02 commented on a diff in pull request #38333: [SPARK-40872] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size

2022-11-18 Thread GitBox


gaoyajun02 commented on code in PR #38333:
URL: https://github.com/apache/spark/pull/38333#discussion_r1026206654


##
core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala:
##
@@ -794,7 +794,18 @@ final class ShuffleBlockFetcherIterator(
         // since the last call.
         val msg = s"Received a zero-size buffer for block $blockId from $address " +
           s"(expectedApproxSize = $size, isNetworkReqDone=$isNetworkReqDone)"
-        throwFetchFailedException(blockId, mapIndex, address, new IOException(msg))
+        if (blockId.isShuffleChunk) {
+          // A zero-size block may come from a node with hardware failures. For shuffle chunks,
+          // the original shuffle blocks that belong to the zero-size shuffle chunk are still
+          // available, so we can opt to fall back immediately.
+          logWarning(msg)
+          pushBasedFetchHelper.initiateFallbackFetchForPushMergedBlock(blockId, address)
+          // Set result to null to trigger another iteration of the while loop.
+          result = null
+          null
+        } else {
+          throwFetchFailedException(blockId, mapIndex, address, new IOException(msg))
+        }
       }

       val in = try {

Review Comment:
   Oh, got it. Can you take a look again? Thanks.
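
   The branch added by the diff above can be summarized as a minimal, self-contained sketch (not the actual iterator code; names here are illustrative): a zero-size buffer triggers a fallback only for push-merged shuffle chunks, while any other zero-size block still fails the fetch.

```scala
object ZeroSizeHandlingSketch {
  sealed trait Action
  case object FallbackToOriginalBlocks extends Action
  case object ThrowFetchFailed extends Action
  case object Deliver extends Action

  // Mirrors the decision in the diff: only push-merged shuffle chunks
  // may fall back to their original shuffle blocks on a zero-size buffer.
  def onBuffer(isShuffleChunk: Boolean, size: Long): Action =
    if (size > 0) Deliver
    else if (isShuffleChunk) FallbackToOriginalBlocks
    else ThrowFetchFailed

  def main(args: Array[String]): Unit = {
    assert(onBuffer(isShuffleChunk = true, size = 0L) == FallbackToOriginalBlocks)
    assert(onBuffer(isShuffleChunk = false, size = 0L) == ThrowFetchFailed)
    assert(onBuffer(isShuffleChunk = true, size = 42L) == Deliver)
    println("ok")
  }
}
```

   In the real iterator, "fall back" means `initiateFallbackFetchForPushMergedBlock` is invoked and `result` is reset to `null` so the fetch loop takes another iteration.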



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gaoyajun02 commented on a diff in pull request #38333: [SPARK-40872] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size

2022-11-16 Thread GitBox


gaoyajun02 commented on code in PR #38333:
URL: https://github.com/apache/spark/pull/38333#discussion_r1023946842


##
core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala:
##
@@ -794,7 +794,18 @@ final class ShuffleBlockFetcherIterator(
         // since the last call.
         val msg = s"Received a zero-size buffer for block $blockId from $address " +
           s"(expectedApproxSize = $size, isNetworkReqDone=$isNetworkReqDone)"
-        throwFetchFailedException(blockId, mapIndex, address, new IOException(msg))
+        if (blockId.isShuffleChunk) {
+          // A zero-size block may come from a node with hardware failures. For shuffle chunks,
+          // the original shuffle blocks that belong to the zero-size shuffle chunk are still
+          // available, so we can opt to fall back immediately.
+          logWarning(msg)
+          pushBasedFetchHelper.initiateFallbackFetchForPushMergedBlock(blockId, address)
+          // Set result to null to trigger another iteration of the while loop.
+          result = null
+          null
+        } else {
+          throwFetchFailedException(blockId, mapIndex, address, new IOException(msg))
+        }
       }

       val in = try {

Review Comment:
   resolved






[GitHub] [spark] gaoyajun02 commented on a diff in pull request #38333: [SPARK-40872] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size

2022-11-16 Thread GitBox


gaoyajun02 commented on code in PR #38333:
URL: https://github.com/apache/spark/pull/38333#discussion_r1023946842


##
core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala:
##
@@ -794,7 +794,18 @@ final class ShuffleBlockFetcherIterator(
         // since the last call.
         val msg = s"Received a zero-size buffer for block $blockId from $address " +
           s"(expectedApproxSize = $size, isNetworkReqDone=$isNetworkReqDone)"
-        throwFetchFailedException(blockId, mapIndex, address, new IOException(msg))
+        if (blockId.isShuffleChunk) {
+          // A zero-size block may come from a node with hardware failures. For shuffle chunks,
+          // the original shuffle blocks that belong to the zero-size shuffle chunk are still
+          // available, so we can opt to fall back immediately.
+          logWarning(msg)
+          pushBasedFetchHelper.initiateFallbackFetchForPushMergedBlock(blockId, address)
+          // Set result to null to trigger another iteration of the while loop.
+          result = null
+          null
+        } else {
+          throwFetchFailedException(blockId, mapIndex, address, new IOException(msg))
+        }
       }

       val in = try {

Review Comment:
   done






[GitHub] [spark] gaoyajun02 commented on a diff in pull request #38333: [SPARK-40872] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size

2022-11-12 Thread GitBox


gaoyajun02 commented on code in PR #38333:
URL: https://github.com/apache/spark/pull/38333#discussion_r1020753726


##
core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala:
##
@@ -794,7 +794,18 @@ final class ShuffleBlockFetcherIterator(
         // since the last call.
         val msg = s"Received a zero-size buffer for block $blockId from $address " +
           s"(expectedApproxSize = $size, isNetworkReqDone=$isNetworkReqDone)"
-        throwFetchFailedException(blockId, mapIndex, address, new IOException(msg))
+        if (blockId.isShuffleChunk) {
+          // A zero-size block may come from a node with hardware failures. For shuffle chunks,
+          // the original shuffle blocks that belong to the zero-size shuffle chunk are still
+          // available, so we can opt to fall back immediately.
+          logWarning(msg)
+          pushBasedFetchHelper.initiateFallbackFetchForPushMergedBlock(blockId, address)
+          // Set result to null to trigger another iteration of the while loop.
+          result = null
+          null

Review Comment:
   Thanks for the review, updated. @mridulm






[GitHub] [spark] gaoyajun02 commented on a diff in pull request #38333: [SPARK-40872] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size

2022-11-12 Thread GitBox


gaoyajun02 commented on code in PR #38333:
URL: https://github.com/apache/spark/pull/38333#discussion_r1020753726


##
core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala:
##
@@ -794,7 +794,18 @@ final class ShuffleBlockFetcherIterator(
         // since the last call.
         val msg = s"Received a zero-size buffer for block $blockId from $address " +
           s"(expectedApproxSize = $size, isNetworkReqDone=$isNetworkReqDone)"
-        throwFetchFailedException(blockId, mapIndex, address, new IOException(msg))
+        if (blockId.isShuffleChunk) {
+          // A zero-size block may come from a node with hardware failures. For shuffle chunks,
+          // the original shuffle blocks that belong to the zero-size shuffle chunk are still
+          // available, so we can opt to fall back immediately.
+          logWarning(msg)
+          pushBasedFetchHelper.initiateFallbackFetchForPushMergedBlock(blockId, address)
+          // Set result to null to trigger another iteration of the while loop.
+          result = null
+          null

Review Comment:
   Updated, thanks.






[GitHub] [spark] gaoyajun02 commented on a diff in pull request #38333: [SPARK-40872] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size

2022-11-04 Thread GitBox


gaoyajun02 commented on code in PR #38333:
URL: https://github.com/apache/spark/pull/38333#discussion_r1013969510


##
core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala:
##
@@ -794,7 +794,15 @@ final class ShuffleBlockFetcherIterator(
         // since the last call.
         val msg = s"Received a zero-size buffer for block $blockId from $address " +
           s"(expectedApproxSize = $size, isNetworkReqDone=$isNetworkReqDone)"
-        throwFetchFailedException(blockId, mapIndex, address, new IOException(msg))
+        if (blockId.isShuffleChunk) {
+          logWarning(msg)
+          pushBasedFetchHelper.initiateFallbackFetchForPushMergedBlock(blockId, address)

Review Comment:
   OK, updated.






[GitHub] [spark] gaoyajun02 commented on a diff in pull request #38333: [SPARK-40872] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size

2022-11-02 Thread GitBox


gaoyajun02 commented on code in PR #38333:
URL: https://github.com/apache/spark/pull/38333#discussion_r1012466373


##
core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala:
##
@@ -794,7 +794,15 @@ final class ShuffleBlockFetcherIterator(
         // since the last call.
         val msg = s"Received a zero-size buffer for block $blockId from $address " +
           s"(expectedApproxSize = $size, isNetworkReqDone=$isNetworkReqDone)"
-        throwFetchFailedException(blockId, mapIndex, address, new IOException(msg))
+        if (blockId.isShuffleChunk) {
+          logWarning(msg)
+          pushBasedFetchHelper.initiateFallbackFetchForPushMergedBlock(blockId, address)

Review Comment:
   Did you mean `PushMergedRemoteMetaFetchResult`?
   The size of the push-merged block is not zero. Since the size of each chunk cannot be obtained on the reduce side, we logged the zero-size message in the following code on the server side and confirmed that the index file contains the same offset repeated consecutively, but I actually don't know why...
   
https://github.com/apache/spark/blob/9a7596e1dde0f1dd596aa6d3b2efbcb5d1ef70ea/core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala#L500
   
   Then, based on the hardware-layer error information, we basically determined that the data loss occurs while the data is being written.
   










[GitHub] [spark] gaoyajun02 commented on a diff in pull request #38333: [SPARK-40872] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size

2022-11-02 Thread GitBox


gaoyajun02 commented on code in PR #38333:
URL: https://github.com/apache/spark/pull/38333#discussion_r1012468152


##
core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala:
##
@@ -794,7 +794,15 @@ final class ShuffleBlockFetcherIterator(
         // since the last call.
         val msg = s"Received a zero-size buffer for block $blockId from $address " +
           s"(expectedApproxSize = $size, isNetworkReqDone=$isNetworkReqDone)"
-        throwFetchFailedException(blockId, mapIndex, address, new IOException(msg))
+        if (blockId.isShuffleChunk) {
+          logWarning(msg)
+          pushBasedFetchHelper.initiateFallbackFetchForPushMergedBlock(blockId, address)

Review Comment:
   Also, the size of the original shuffle blocks is not zero, because there was no FetchFailedException in the stage.





