That's helpful, thanks. I didn't see that thread earlier. It sounds like the 
best solution is to use singletons in the executors, which I'm already doing. 
(BTW, the reason I consider that method kind of hack-ish is that it makes the 
code a bit more difficult for others to understand.)
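For anyone finding this thread later, the singleton pattern I'm referring to 
looks roughly like this (just a sketch -- SideData, loadFromExternalSource, and 
the toy lookup are placeholders for my real code):

import org.apache.spark.{SparkConf, SparkContext}

// Scala objects are initialized at most once per JVM, so the load below runs
// once per executor, the first time any task touches SideData.structure.
object SideData {
  lazy val structure: Map[String, Double] = loadFromExternalSource()

  // Stand-in for the real ~250 GiB load from the external data source.
  private def loadFromExternalSource(): Map[String, Double] =
    Map("a" -> 1.0, "b" -> 2.0)
}

object SingletonExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("singleton-example"))
    val result = sc.parallelize(Seq("a", "b", "c"))
      .mapPartitions { iter =>
        val data = SideData.structure  // same instance for every task in this executor JVM
        iter.map(k => k -> data.getOrElse(k, 0.0))
      }
      .collect()
    result.foreach(println)
    sc.stop()
  }
}

The downside is exactly what I mentioned above: the load is hidden inside a 
static object, so it isn't obvious to someone reading the job code where that 
memory comes from.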

Based on its description, I was hoping that Spark's broadcast mechanism used 
shared memory between JVMs (memory-mapped files, named pipes, etc.), in which 
case the data structure would only need to be created once per machine. I'll 
have to take a look at the code.

Most likely, I'll have to implement a service on the node and have each 
executor call it.
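
If I go that route, I'm picturing something along these lines (again just a 
sketch, using the JDK's built-in HTTP server; the port, the /lookup path, and 
the toy map are all made up):

import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import java.net.{HttpURLConnection, InetSocketAddress, URL}
import scala.io.Source

// Node side: one process per machine loads the large structure once and
// answers lookups over loopback.
object NodeService {
  def main(args: Array[String]): Unit = {
    val bigStructure: Map[String, String] = Map("a" -> "1")  // stand-in for the ~250 GiB load
    val server = HttpServer.create(new InetSocketAddress("127.0.0.1", 8099), 0)
    server.createContext("/lookup", new HttpHandler {
      override def handle(ex: HttpExchange): Unit = {
        val key = Option(ex.getRequestURI.getQuery).getOrElse("").stripPrefix("key=")
        val bytes = bigStructure.getOrElse(key, "").getBytes("UTF-8")
        ex.sendResponseHeaders(200, bytes.length.toLong)
        ex.getResponseBody.write(bytes)
        ex.close()
      }
    })
    server.start()
  }
}

// Executor side: tasks call the local service instead of holding their own copy.
object NodeServiceClient {
  def lookup(key: String): String = {
    val conn = new URL(s"http://127.0.0.1:8099/lookup?key=$key")
      .openConnection().asInstanceOf[HttpURLConnection]
    try Source.fromInputStream(conn.getInputStream).mkString
    finally conn.disconnect()
  }
}

The trade-off there is per-lookup RPC overhead versus holding a full copy of 
the structure in every executor's heap.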

> On Jun 30, 2016, at 8:45 AM, Yong Zhang <java8...@hotmail.com> wrote:
> 
> How about this old discussion related to similar problem as yours.
> 
> http://apache-spark-user-list.1001560.n3.nabble.com/Running-a-task-once-on-each-executor-td3203.html
> 
> Yong
> 
> From: aper...@timerazor.com
> Date: Wed, 29 Jun 2016 14:00:07 +0000
> Subject: Possible to broadcast a function?
> To: user@spark.apache.org
> 
> The user guide describes a broadcast as a way to move a large dataset to each 
> node:
> 
> "Broadcast variables allow the programmer to keep a read-only variable cached 
> on each machine rather than shipping a copy of it with tasks. They can be 
> used, for example, to give every node a copy of a large input dataset in an 
> efficient manner."
> 
> And the broadcast example shows it being used with a variable.
> 
> But is it somehow possible to instead broadcast a function that can be 
> executed once per node?
> 
> My use case is the following:
> 
> I have a large data structure that I currently create on each executor.  The 
> way that I create it is a hack.  That is, when the RDD function is executed 
> on the executor, I block, load a bunch of data (~250 GiB) from an external 
> data source, create the data structure as a static object in the JVM, and 
> then resume execution.  This works, but it ends up costing me a lot of extra 
> memory (i.e. a few TiB when I have a lot of executors).
> 
> What I'd like to do is use the broadcast mechanism to load the data structure 
> once per node. But I can't serialize the data structure from the driver.
> 
> Any ideas?
> 
> Thanks!
> 
> Aaron
> 
