The user guide describes a broadcast as a way to move a large dataset to
each node:

"Broadcast variables allow the programmer to keep a read-only variable
cached on each machine rather than shipping a copy of it with tasks. They
can be used, for example, to give every node a copy of a large input
dataset in an efficient manner."

And the broadcast example shows it being used with a variable.

But, is it somehow possible to instead broadcast a function that can be
executed once, per node?

My use case is the following:

I have a large data structure that I currently create on each executor.
The way that I create it is a hack.  That is, when the RDD function is
executed on the executor, I block, load a bunch of data (~250 GiB) from an
external data source, create the data structure as a static object in the
JVM, and then resume execution.  This works, but it ends up costing me a
lot of extra memory (i.e. a few TiB when I have a lot of executors).

What I'd like to do is use the broadcast mechanism to load the data
structure once, per node.  But, I can't serialize the data structure from
the driver.

Any ideas?

Thanks!

Aaron

Reply via email to