[GitHub] spark pull request: [SPARK-3119] Re-implementation of TorrentBroad...

mridulm Tue, 19 Aug 2014 11:36:07 -0700

Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2030#discussion_r16434120
  
    --- Diff: 
core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala ---
    @@ -18,50 +18,116 @@
     package org.apache.spark.broadcast
     
     import java.io._
    +import java.nio.ByteBuffer
     
    +import scala.collection.JavaConversions.asJavaEnumeration
     import scala.reflect.ClassTag
     import scala.util.Random
     
     import org.apache.spark.{Logging, SparkConf, SparkEnv, SparkException}
     import org.apache.spark.io.CompressionCodec
     import org.apache.spark.storage.{BroadcastBlockId, StorageLevel}
    +import org.apache.spark.util.ByteBufferInputStream
     
     /**
    - *  A [[org.apache.spark.broadcast.Broadcast]] implementation that uses a 
BitTorrent-like
    - *  protocol to do a distributed transfer of the broadcasted data to the 
executors.
    - *  The mechanism is as follows. The driver divides the serializes the 
broadcasted data,
    - *  divides it into smaller chunks, and stores them in the BlockManager of 
the driver.
    - *  These chunks are reported to the BlockManagerMaster so that all the 
executors can
    - *  learn the location of those chunks. The first time the broadcast 
variable (sent as
    - *  part of task) is deserialized at a executor, all the chunks are 
fetched using
    - *  the BlockManager. When all the chunks are fetched (initially from the 
driver's
    - *  BlockManager), they are combined and deserialized to recreate the 
broadcasted data.
    - *  However, the chunks are also stored in the BlockManager and reported 
to the
    - *  BlockManagerMaster. As more executors fetch the chunks, 
BlockManagerMaster learns
    - *  multiple locations for each chunk. Hence, subsequent fetches of each 
chunk will be
    - *  made to other executors who already have those chunks, resulting in a 
distributed
    - *  fetching. This prevents the driver from being the bottleneck in 
sending out multiple
    - *  copies of the broadcast data (one per executor) as done by the
    - *  [[org.apache.spark.broadcast.HttpBroadcast]].
    + * A BitTorrent-like implementation of 
[[org.apache.spark.broadcast.Broadcast]].
    + *
    + * The mechanism is as follows:
    + *
    + * The driver divides the serialized object into small chunks and
    + * stores those chunks in the BlockManager of the driver.
    + *
    + * On each executor, the executor first attempts to fetch the object from 
its BlockManager. If
    + * it does not exist, it then uses remote fetches to fetch the small 
chunks from the driver and/or
    + * other executors if available. Once it gets the chunks, it puts the 
chunks in its own
    + * BlockManager, ready for other executors to fetch from.
    + *
    + * This prevents the driver from being the bottleneck in sending out 
multiple copies of the
    + * broadcast data (one per executor) as done by the 
[[org.apache.spark.broadcast.HttpBroadcast]].
    --- End diff --
    
    Hmm, our broadcast variables are in the 1mb - 3 mb range : so does not 
qualify I guess ... thx for the comment.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3119] Re-implementation of TorrentBroad...

Reply via email to