While Spark already supports an asynchronous reduce (collecting data from workers without interrupting a running transformation) through accumulators, I have made little progress on the reciprocal direction: broadcasting data from the driver to the workers so that all executors can pick it up in the middle of a transformation. This is primarily intended for downpour SGD/AdaGrad, a non-blocking concurrent machine learning optimizer that outperforms the existing synchronous gradient descent in MLlib and has broad application in training many kinds of models.
My attempt so far sticks to the out-of-the-box, immutable broadcast: open a new thread on the driver and broadcast a thin data wrapper which, when deserialized, inserts its value into a mutable singleton that is already replicated to all workers in the fat jar. The customized deserialization is not hard; just override readObject like this:

class AutoInsert(var value: Int) extends Serializable {
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    WorkerContainer.last = this.value
  }
}

Unfortunately, it looks like deserialization is invoked lazily and does nothing until a worker actually uses the Broadcast[AutoInsert], which cannot happen without waiting for the workers' current stage to finish and broadcasting again. I'm wondering whether I can 'hack' this into working, or whether I'll have to write a serious extension to the broadcast component to support changing the value. I hope I can find like-minded people on this forum, because ML is a selling point of Spark.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Asynchronous-Broadcast-from-driver-to-workers-is-it-possible-tp15758.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
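For what it's worth, the readObject hook itself does fire as soon as the bytes are actually deserialized; the obstacle is only that Spark defers that deserialization until an executor first touches the broadcast value. A minimal sketch outside Spark (the WorkerContainer singleton here is a hypothetical stand-in for the one shipped in the fat jar) shows the hook working eagerly under plain Java serialization:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical mutable singleton, standing in for the replica in the fat jar.
object WorkerContainer {
  @volatile var last: Int = 0
}

// Wrapper whose deserialization side-effects the singleton.
class AutoInsert(var value: Int) extends Serializable {
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    WorkerContainer.last = this.value // runs at deserialization time
  }
}

object Demo extends App {
  // Serialize the wrapper, as the driver would when broadcasting.
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(new AutoInsert(42))
  out.close()

  // Deserialize, as an executor would: the hook fires here,
  // before any code reads the returned object.
  val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
  in.readObject()

  println(WorkerContainer.last) // prints 42
}
```

So the mechanism is sound; the missing piece is a way to make Spark deserialize the broadcast block on the worker eagerly, rather than on first use.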