[GitHub] spark pull request: [SPARK-8029] Robust shuffle writer

squito Thu, 12 Nov 2015 13:35:50 -0800

Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9610#discussion_r44717955
  
    --- Diff: core/src/main/scala/org/apache/spark/shuffle/hash/HashShuffleWriter.scala ---
    @@ -106,6 +108,19 @@ private[spark] class HashShuffleWriter[K, V](
           writer.commitAndClose()
           writer.fileSegment().length
         }
    +    // rename all shuffle files to final paths
    +    shuffle.writers.zip(sizes).foreach { case (writer: DiskBlockObjectWriter, size: Long) =>
    +      if (size > 0) {
    +        val output = blockManager.diskBlockManager.getFile(writer.blockId)
    +        if (output.exists()) {
    +          writer.file.delete()
    +        } else {
    +          if (!writer.file.renameTo(output)) {
    +            throw new IOException(s"fail to rename ${writer.file} to $output")
    --- End diff --
    
    Yeah, I suppose it all depends on what the model is for non-deterministic 
data.  The reduce tasks can read data from a mix of attempts, but I guess that is 
OK (we can't completely prevent it in any case).  There is also the problem of 
returning the right `MapStatus` here, but it doesn't matter as much in this case 
-- you will at least return some set of non-empty blocks that is consistent 
with the shuffle data on disk, even if the sizes can be arbitrarily wrong.
    
    Also, I know it's super-rare, but there _is_ a race between `output.exists()` 
and `renameTo(output)`; might as well protect against that.
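    
    A minimal sketch of one way to close that window, assuming all competing 
attempts for a given map output run in the same executor JVM.  The names 
`ShuffleCommit`, `commitLock`, and `commitIfAbsent` are hypothetical and not 
part of this PR; the idea is just to make the check and the rename a single 
critical section.

```scala
import java.io.{File, IOException}

object ShuffleCommit {
  // Hypothetical lock shared by every task attempt in this JVM (an assumption,
  // not something the PR defines).
  private val commitLock = new Object

  /**
   * Moves `tmp` to `output` unless `output` already exists.
   * Returns true if this call committed the file, false if another
   * attempt got there first (in which case `tmp` is deleted).
   */
  def commitIfAbsent(tmp: File, output: File): Boolean = commitLock.synchronized {
    if (output.exists()) {
      // Another attempt already committed this block; discard our copy.
      tmp.delete()
      false
    } else if (tmp.renameTo(output)) {
      true
    } else {
      throw new IOException(s"fail to rename $tmp to $output")
    }
  }
}
```

    This only guards against threads in the same JVM; it does nothing for 
another process writing to the same directory.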
    
    I also find it weird that this is neither first-attempt-wins nor 
last-attempt-wins -- the first attempt to get to each output file wins, but the 
result can be a mix of attempts.  Again, I'd include a comment explaining the logic.


