[GitHub] [spark] otterc commented on a change in pull request #32007: [SPARK-33350][SHUFFLE] Add support to DiskBlockManager to create merge directory and to get the local shuffle merged data

GitBox Wed, 28 Apr 2021 11:19:31 -0700


otterc commented on a change in pull request #32007:
URL: https://github.com/apache/spark/pull/32007#discussion_r622429509




##########
File path: core/src/main/scala/org/apache/spark/storage/BlockId.scala
##########
@@ -87,6 +87,29 @@ case class ShufflePushBlockId(shuffleId: Int, mapIndex: Int, 
reduceId: Int) exte
   override def name: String = "shufflePush_" + shuffleId + "_" + mapIndex + 
"_" + reduceId
 }
 
+@DeveloperApi
+case class ShuffleMergedBlockId(appId: String, shuffleId: Int, reduceId: Int) 
extends BlockId {
+  override def name: String = "mergedShuffle_" + appId + "_" + shuffleId + "_" 
+ reduceId + ".data"
+}
+
+@DeveloperApi
+case class ShuffleMergedIndexBlockId(
+  appId: String,
+  shuffleId: Int,
+  reduceId: Int) extends BlockId {
+  override def name: String =
+    "mergedShuffle_" + appId + "_" + shuffleId + "_" + reduceId + ".index"

Review comment:
       Below are my comments for each solution 
   1. Create a new `RegisterExecutor` message. If we do this, we might also 
want to add the recent "merge_manager_attemptx` name. A benefit of that would 
be to remove the logic on the server which tries to find the merge_manager 
directory path by finding the parent of blockMgrs directory.
   
   2. Encoding `attemptId` in the `appId` of `RegisterExecutor` message. The 
old `ExternalShuffleBlockResolver` needs to know the `appId` (without 
attemptId) when the executors are registered and the `executors` map is 
populated, since all the `get...BlockData()` apis reference this map with only 
`appId` and `executorId`. We can't change these apis. `RemoteBlockPushResolver` 
may also need to know the `appId` and `attemptId` separately to avoid changes 
to multiple data-structures. It would be then better to handle this parsing in 
`ExternalBlockHandler`. 
   
   3. Have the `RemoteBlockPushResolver` figure out the attemptId just by 
itself by listing the directory and finding the latest attempt. With this one 
every time an executor registers, the remote shuffle has to do the following:
   - find the parent directory from blockManager dirs. 
   - list the merge_manager_* dirs  
   - figure out the latest attempt.
   All these 3 steps above are avoided in solution 2 for every executor 
registration because we know the attempt Id from the message itself and it will 
ignore the ones belonging to the current one. The additional cost that is 
incurred in solution 2 is just splitting the `appId` and `attemptId` which I 
don't think is that much.
   
   I think Solution 1 is the cleanest but I can see that it adds another 
message and attemptId may be just specific to Yarn. Between 2 and 3, I prefer 
solution 2.
   
   WDYT @tgravescs @mridulm @attilapiros @Ngone51 @Victsm @zhouyejoe 
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] otterc commented on a change in pull request #32007: [SPARK-33350][SHUFFLE] Add support to DiskBlockManager to create merge directory and to get the local shuffle merged data

Reply via email to