Hi Dmitry,

I am not familiar with all of the details you have just described, but I
think Tachyon should be able to help you.

If you store your resource files in HDFS and/or S3, you can run Tachyon
with those storage systems as the under storage (
http://tachyon-project.org/documentation/Mounting-and-Transparent-Naming.html).
Then Spark can read from Tachyon as an HDFS-compatible file system and do
your processing. If you ever need to write data back out, you can write
files back into Tachyon, and they can also be reflected into your original
store (HDFS, S3). Caching in Tachyon is done on demand, and you can use the
LRU eviction policy (or a custom policy:
http://tachyon-project.org/documentation/Tiered-Storage-on-Tachyon.html).
This sounds like your option #4. I hope this helps!

Thanks,
Gene

On Thu, Jan 14, 2016 at 4:54 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com
> wrote:

> OK so it looks like Tachyon is a cluster memory plugin marked as
> "experimental" in Spark.
>
> In any case, we've got a few requirements for the system we're working on
> which may drive the decision for how to implement large resource file
> management.
>
> The system is a framework of N data analyzers that take incoming
> documents as input and transform them or extract data from them.  These
> analyzers can be chained together, which makes them a great fit for
> processing with RDDs and a set of map/filter-style Spark functions.
> There's already an established framework API which we want to preserve.
> This means that, most likely, we'll create a relatively thin "binding"
> layer that exposes these analyzers as well-documented functions to the
> end users who want to use them in a Spark-based distributed computing
> environment.
>
> We also want to, ideally, hide the complexity of how these resources are
> loaded from the end-users who will be writing the actual Spark jobs that
> utilize the Spark "binding" functions that we provide.
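As a hypothetical illustration of that binding layer (the analyzer names and document shape here are invented, not part of the actual framework), a chain of analyzers can be composed into a single function that drops straight into a Spark map():

```python
# Hypothetical sketch: each analyzer is exposed as a plain function on a
# document, and chain() composes them left to right into one callable.
def tokenize(doc_text):
    # First analyzer in the chain: raw text in, document dict out.
    return {"tokens": doc_text.split()}

def count_tokens(doc):
    # Downstream analyzer: enriches the document produced upstream.
    return {**doc, "n_tokens": len(doc["tokens"])}

def chain(*analyzers):
    """Compose analyzers into a single function, usable inside rdd.map()."""
    def run(doc):
        for analyzer in analyzers:
            doc = analyzer(doc)
        return doc
    return run

pipeline = chain(tokenize, count_tokens)
result = pipeline("large resource files are shared")
# result["n_tokens"] == 5
```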
>
> So, for managing large numbers of small, medium, or large resource files,
> we're considering the options below, each with its own pros and cons, from
> the following perspectives:
>
> a) persistence - where do the resources reside initially;
> b) loading - what are the mechanics for loading of these resources;
> c) caching and sharing across worker nodes.
>
> Possible options:
>
> 1. Load each resource into a broadcast variable. Given that we have
> scores if not hundreds of these resource files, maintaining that many
> broadcast variables seems like a complexity that will be hard to manage.
> We'd also need a translation layer between the broadcast variables and the
> internal API, which wants to "speak" InputStreams rather than broadcast
> variables.
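For what it's worth, a minimal sketch of such a translation layer. The `FakeBroadcast` class below is a stand-in for Spark's Broadcast object (only its `.value` attribute, holding raw bytes, is assumed); the adapter hides the broadcast behind a stream-style interface:

```python
import io

class BroadcastResource:
    """Hypothetical adapter: exposes a broadcast resource's bytes as a
    stream, so analyzers that consume streams never see broadcast vars."""
    def __init__(self, broadcast):
        self._broadcast = broadcast

    def open(self):
        # Return a fresh stream view over the shared bytes each time.
        return io.BytesIO(self._broadcast.value)

# Stand-in for sc.broadcast(...) so the sketch runs without a cluster.
class FakeBroadcast:
    def __init__(self, value):
        self.value = value

res = BroadcastResource(FakeBroadcast(b"entry one\nentry two\n"))
with res.open() as stream:
    first_line = stream.readline()
```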
>
> 2. Load resources into RDDs and perform joins against them from our
> incoming-document RDDs, thus achieving the effect of a value lookup from
> the resources.  While this seems like a very Spark-like way of doing
> things, the lookup mechanics are non-trivial, especially because some of
> the resources aren't going to be pure dictionaries; they may be
> statistical models.  Additionally, this forces us to adopt Spark's
> semantics for handling these resources, which means a potential rewrite
> of our internal product API. That would make this a hard option to go
> with.
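To make the lookup mechanics concrete, here is the join-based lookup simulated with plain Python key/value pairs in place of RDDs (real code would use `pairRdd.join` on a cluster; the keys and values are invented for illustration):

```python
# Incoming documents and a dictionary-style resource, both keyed by term.
docs = [("apple", "doc1"), ("banana", "doc2"), ("cherry", "doc3")]
resource = [("apple", "FRUIT"), ("banana", "FRUIT")]

def join(left, right):
    # Equivalent of an RDD inner join: emit (key, (left_val, right_val)).
    rmap = {}
    for k, v in right:
        rmap.setdefault(k, []).append(v)
    return [(k, (lv, rv)) for k, lv in left for rv in rmap.get(k, [])]

annotated = join(docs, resource)
# Note that "cherry" drops out of the inner join: joins fit
# dictionary-style lookups, but not resources (e.g. statistical models)
# that must be loaded whole.
```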
>
> 3. Pre-install all the needed resources on each worker node; retrieve
> the needed resources from the file system and load them into memory as
> needed. Ideally, the resources would be installed only once, on the Spark
> driver side; we'd like to avoid pre-installing all these files on each
> node. However, we've done this as an exercise and the approach works OK.
>
> 4. Pre-load all the resources into HDFS or S3, i.e. into some distributed
> persistent store, and load them into cluster memory from there as
> necessary. Presumably this could be a pluggable store with a common API
> exposed. Since our framework is an OEM-able product, we could plug and
> play with a variety of such persistent stores via Java's FileSystem / URL
> scheme handler APIs.
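A rough sketch of such a pluggable store, using a scheme-to-loader registry in the spirit of Java's URL scheme handlers (the `mem://` scheme and the store contents are invented for illustration):

```python
from urllib.parse import urlparse

# Registry mapping URL schemes to loader callables; each OEM deployment
# registers whichever back ends (HDFS, S3, Tachyon, ...) it supports.
_LOADERS = {}

def register_scheme(scheme, loader):
    _LOADERS[scheme] = loader

def load_resource(uri):
    # Dispatch on the URI scheme, keeping callers store-agnostic.
    scheme = urlparse(uri).scheme
    if scheme not in _LOADERS:
        raise ValueError("no store registered for scheme: %s" % scheme)
    return _LOADERS[scheme](uri)

# Example back end: an in-memory "store" standing in for a real one.
_fake_store = {"mem://models/dict.txt": b"hello\nworld\n"}
register_scheme("mem", lambda uri: _fake_store[uri])

data = load_resource("mem://models/dict.txt")
```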
>
> 5. Implement a resource-management server with a RESTful interface on
> top. Under the covers, this could be a wrapper over #4.  Potentially
> unnecessary if we have a solid persistent-store API as per #4.
>
> 6. Beyond persistence, caching also has to be considered for these
> resources. We've considered Tachyon (especially since it plugs into
> Spark), Redis, and the like. Ideally, resources would be loaded into
> cluster memory as needed and paged in/out on demand in an LRU fashion.
> From this perspective, it's not yet clear to me what the best option(s)
> would be. Any thoughts or recommendations would be appreciated.
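A minimal sketch of that on-demand, LRU-paged loading (the `loader` callable is a placeholder for whatever fetches from the persistent store; this models only the paging, not the store):

```python
from collections import OrderedDict

class ResourceCache:
    """Load resources on demand; evict the least recently used when full."""
    def __init__(self, loader, capacity):
        self._loader = loader
        self._capacity = capacity
        self._cache = OrderedDict()  # insertion order == recency order

    def get(self, name):
        if name in self._cache:
            self._cache.move_to_end(name)        # mark as recently used
        else:
            self._cache[name] = self._loader(name)  # page in on demand
            if len(self._cache) > self._capacity:
                self._cache.popitem(last=False)  # evict least recently used
        return self._cache[name]

# Toy loader standing in for an HDFS/S3/Tachyon fetch.
cache = ResourceCache(loader=lambda name: name.upper(), capacity=2)
cache.get("a"); cache.get("b"); cache.get("a"); cache.get("c")
# With capacity 2, "b" was least recently used and has been evicted.
```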
>
>
>
>
>
> On Tue, Jan 12, 2016 at 3:04 PM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> Thanks, Gene.
>>
>> Does Spark use Tachyon under the covers anyway for implementing its
>> "cluster memory" support?
>>
>> It seems that the practice I hear about the most is loading resources
>> as RDDs and then doing joins against them to achieve the lookup effect.
>>
>> The other approach would be to load the resources into broadcast
>> variables, but I've heard concerns about memory.  Could we run out of
>> memory if we load too much into broadcast vars?  Is there any
>> memory-to-disk / spill-to-disk capability for broadcast variables in
>> Spark?
>>
>>
>> On Tue, Jan 12, 2016 at 11:19 AM, Gene Pang <gene.p...@gmail.com> wrote:
>>
>>> Hi Dmitry,
>>>
>>> Yes, Tachyon can help with your use case. You can read and write to
>>> Tachyon via the filesystem API (
>>> http://tachyon-project.org/documentation/File-System-API.html). There
>>> is a native Java API as well as a Hadoop-compatible API. Spark can also
>>> interact with Tachyon via the Hadoop-compatible API, so Spark jobs can
>>> read input files from Tachyon and write output files to Tachyon.
>>>
>>> I hope that helps,
>>> Gene
>>>
>>> On Tue, Jan 12, 2016 at 4:26 AM, Dmitry Goldenberg <
>>> dgoldenberg...@gmail.com> wrote:
>>>
>>>> I'd guess that if the resources are broadcast, Spark would put them
>>>> into Tachyon...
>>>>
>>>> On Jan 12, 2016, at 7:04 AM, Dmitry Goldenberg <
>>>> dgoldenberg...@gmail.com> wrote:
>>>>
>>>> Would it make sense to load them into Tachyon and read and broadcast
>>>> them from there, since Tachyon is already part of the Spark stack?
>>>>
>>>> If so, I wonder whether I could do that Tachyon read/write via a
>>>> Spark API?
>>>>
>>>>
>>>> On Jan 12, 2016, at 2:21 AM, Sabarish Sasidharan <
>>>> sabarish.sasidha...@manthan.com> wrote:
>>>>
>>>> One option could be to store them as blobs in a cache like Redis and
>>>> then read + broadcast them from the driver. Or you could store them in HDFS
>>>> and read + broadcast from the driver.
>>>>
>>>> Regards
>>>> Sab
>>>>
>>>> On Tue, Jan 12, 2016 at 1:44 AM, Dmitry Goldenberg <
>>>> dgoldenberg...@gmail.com> wrote:
>>>>
>>>>> We have a bunch of Spark jobs deployed and a few large resource
>>>>> files, such as a dictionary for lookups or a statistical model.
>>>>>
>>>>> Right now, these are deployed as part of the Spark jobs which will
>>>>> eventually make the mongo-jars too bloated for deployments.
>>>>>
>>>>> What are some of the best practices to consider for maintaining and
>>>>> sharing large resource files like these?
>>>>>
>>>>> Thanks.
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Architect - Big Data
>>>> Ph: +91 99805 99458
>>>>
>>>> Manthan Systems | *Company of the year - Analytics (2014 Frost and
>>>> Sullivan India ICT)*
>>>> +++
>>>>
>>>>
>>>
>>
>
