Following this suggestion, Aaron, you may take a look at Alluxio as off-heap in-memory storage for Spark job input/output, if that works for you.
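For example, once the Alluxio client jar is on Spark's classpath, a job can read and write Alluxio simply through alluxio:// paths. A minimal sketch (the master host/port and file paths below are made-up placeholders, not from this thread):

  // Read input that an earlier job or loader has already written to Alluxio.
  val input = sc.textFile("alluxio://alluxio-master:19998/input/data.txt")

  // ... transformations ...
  val result = input.map(_.toUpperCase)

  // Write output back to Alluxio so other jobs can reuse it without reloading.
  result.saveAsTextFile("alluxio://alluxio-master:19998/output/result")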
See more on how to run Spark with Alluxio as data input/output:
http://www.alluxio.org/documentation/en/Running-Spark-on-Alluxio.html

- Bin

On Wed, Jun 29, 2016 at 8:40 AM, Sonal Goyal <sonalgoy...@gmail.com> wrote:

> Have you looked at Alluxio? (formerly Tachyon)
>
> Best Regards,
> Sonal
> Founder, Nube Technologies <http://www.nubetech.co>
> Reifier at Strata Hadoop World
> <https://www.youtube.com/watch?v=eD3LkpPQIgM>
> Reifier at Spark Summit 2015
> <https://spark-summit.org/2015/events/real-time-fuzzy-matching-with-spark-and-elastic-search/>
>
> <http://in.linkedin.com/in/sonalgoyal>
>
> On Wed, Jun 29, 2016 at 7:30 PM, Aaron Perrin <aper...@timerazor.com> wrote:
>
>> The user guide describes a broadcast as a way to move a large dataset to
>> each node:
>>
>> "Broadcast variables allow the programmer to keep a read-only variable
>> cached on each machine rather than shipping a copy of it with tasks. They
>> can be used, for example, to give every node a copy of a large input
>> dataset in an efficient manner."
>>
>> And the broadcast example shows it being used with a variable.
>>
>> But is it somehow possible to instead broadcast a function that can be
>> executed once per node?
>>
>> My use case is the following:
>>
>> I have a large data structure that I currently create on each executor.
>> The way I create it is a hack: when the RDD function is executed on the
>> executor, I block, load a bunch of data (~250 GiB) from an external data
>> source, create the data structure as a static object in the JVM, and
>> then resume execution. This works, but it ends up costing me a lot of
>> extra memory (i.e. a few TiB when I have a lot of executors).
>>
>> What I'd like to do is use the broadcast mechanism to load the data
>> structure once per node. But I can't serialize the data structure from
>> the driver.
>>
>> Any ideas?
>>
>> Thanks!
>>
>> Aaron
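As for the "execute once per node" part: one common workaround is not to broadcast the big structure at all, but to initialize it lazily in a per-JVM singleton, so the first task on each executor pays the load cost and every later task reuses the same instance. A rough, untested sketch (BigStructure and the empty loader body are hypothetical placeholders for the ~250 GiB structure):

  import org.apache.spark.SparkContext

  // Hypothetical stand-in for the real data structure.
  case class BigStructure(lookup: Map[String, Long])

  // A Scala `object` is a per-JVM singleton, and a `lazy val` inside it is
  // initialized at most once per JVM, with thread-safe initialization.
  // On a Spark cluster that means once per executor process, not per task.
  object SharedData {
    lazy val bigStructure: BigStructure = {
      // The expensive load from the external source would go here; it runs
      // only when the first task on a given executor touches `bigStructure`.
      BigStructure(Map.empty)
    }
  }

  object LazyInitExample {
    def run(sc: SparkContext): Long =
      sc.parallelize(1 to 1000)
        .mapPartitions { iter =>
          val data = SharedData.bigStructure // load happens once per executor
          iter.filter(i => data.lookup.contains(i.toString))
        }
        .count()
  }

Note this still loads once per executor JVM, not once per physical node; to share one copy per node you would run a single executor per node, or keep the data off-heap in something like Alluxio as above.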