While Spark already supports a form of asynchronous reduce (collecting data
from workers without interrupting the execution of a parallel
transformation) through accumulators, I have made little progress on making
this process work in the opposite direction, namely broadcasting data from
the driver to the workers so that all executors can use it in the middle of
a transformation. This is primarily intended for downpour SGD/AdaGrad, a
non-blocking concurrent machine learning optimizer that performs better
than the existing synchronous GD in MLlib and has broad application in
training many kinds of models.
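
For concreteness, by "asynchronous reduce" I mean roughly the pattern
below. This is only a sketch using the plain sc.accumulator API; the toy
RDD, the polling loop and all the names are placeholders, nothing
downpour-specific yet:

import org.apache.spark.{SparkConf, SparkContext}

// Rough sketch: the driver polls partial accumulator values while a long
// transformation is still running on the workers.
object AsyncReduceSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("async-reduce"))
    val seen = sc.accumulator(0L, "records-seen")  // updated by worker tasks

    val job = new Thread(new Runnable {
      def run(): Unit = {
        sc.parallelize(1L to 1000000L, 32).map { x => seen += 1L; x * 2 }.count()
      }
    })
    job.start()

    // The driver reads partial values as individual tasks finish, without
    // interrupting the running stage.
    while (job.isAlive) {
      println(s"seen so far: ${seen.value}")
      Thread.sleep(500)
    }
    job.join()
    sc.stop()
  }
}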

My attempt so far is to stick with the out-of-the-box, immutable broadcast
and open a new thread on the driver, in which I broadcast a thin data
wrapper that, when deserialized, inserts its value into a mutable singleton
that is already replicated to all workers via the fat jar. This customized
deserialization is not hard; just override readObject like this:

import java.io.ObjectInputStream

// Mutable singleton that already ships to every worker inside the fat jar.
object WorkerContainer { @volatile var last: Int = 0 }

class AutoInsert(var value: Int) extends Serializable {

  // Runs on the driver when the wrapper is constructed.
  WorkerContainer.last = value

  // Custom deserialization: runs on a worker when the broadcast value is read.
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    WorkerContainer.last = this.value
  }
}
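
For context, this is roughly how I intend to drive it. Again just a
sketch: sc is the live SparkContext, and the RDD, the value 42 and the
thread timing are placeholders:

import org.apache.spark.SparkContext

object BroadcastHackSketch {
  def pushWhileRunning(sc: SparkContext): Unit = {
    val data = sc.parallelize(1 to 1000000, 8)

    // Long-running stage: every task reads the worker-side singleton.
    val job = new Thread(new Runnable {
      def run(): Unit = { data.map(x => x + WorkerContainer.last).count() }
    })
    job.start()

    // Meanwhile, from the driver, push a fresh value; the hope is that
    // deserializing the broadcast on each worker runs readObject above and
    // updates WorkerContainer.last for tasks that are still running.
    val bc = sc.broadcast(new AutoInsert(42))
    job.join()
  }
}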

Unfortunately, it looks like deserialization happens lazily and does
nothing until a worker actually calls .value on the Broadcast[AutoInsert],
and that cannot happen without waiting for the workers' current stage to
finish and broadcasting again. I'm wondering whether I can 'hack' this into
working, or whether I'll have to write a serious extension to the broadcast
component to make its value changeable.

I hope to find like-minded people on this forum, since ML is a selling
point of Spark.


