Thanks, Gene.

Does Spark use Tachyon under the covers anyway for implementing its
"cluster memory" support?

The practice I hear about most is loading the resources as RDDs and then
joining against them to achieve the lookup effect.
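
Roughly what I have in mind for the join approach (just a sketch; the paths
and field layout here are made up):

// sc is the SparkContext (e.g. in spark-shell).
// Load the lookup resource as a pair RDD keyed by the lookup term.
val dict = sc.textFile("hdfs:///resources/dictionary.tsv")
  .map(_.split("\t"))
  .map(fields => (fields(0), fields(1)))   // (term, canonicalForm)

// Key the data being processed by the same term and join against the resource.
val events = sc.textFile("hdfs:///input/events.tsv")
  .map(_.split("\t"))
  .map(fields => (fields(2), fields))

val enriched = events.join(dict)           // (term, (eventFields, canonicalForm))

The join shuffles, but neither side ever has to fit entirely in memory on a
single node.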

The other approach would be to load the resources into broadcast variables,
but I've heard concerns about memory. Could we run out of memory if we load
too much into broadcast vars? Is there any MEMORY_AND_DISK-style spill-to-disk
capability for broadcast variables in Spark?
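
For the broadcast route I'm picturing something like this (again just a
sketch with made-up names), which is where the memory question comes in,
since the whole map has to materialize on the driver and then on every
executor:

// Materialize the resource on the driver, then broadcast it once per job.
val dictMap: Map[String, String] = sc.textFile("hdfs:///resources/dictionary.tsv")
  .map(_.split("\t"))
  .map(fields => (fields(0), fields(1)))
  .collectAsMap()                          // pulls the whole map onto the driver
  .toMap

val dictBc = sc.broadcast(dictMap)

val enriched = sc.textFile("hdfs:///input/events.tsv").map { line =>
  val fields = line.split("\t")
  (fields(0), dictBc.value.getOrElse(fields(2), fields(2)))
}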


On Tue, Jan 12, 2016 at 11:19 AM, Gene Pang <gene.p...@gmail.com> wrote:

> Hi Dmitry,
>
> Yes, Tachyon can help with your use case. You can read and write to
> Tachyon via the filesystem API (
> http://tachyon-project.org/documentation/File-System-API.html). There is
> a native Java API as well as a Hadoop-compatible API. Spark is also able to
> interact with Tachyon via the Hadoop-compatible API, so Spark jobs can read
> input files from Tachyon and write output files to Tachyon.
>
> I hope that helps,
> Gene
>
> On Tue, Jan 12, 2016 at 4:26 AM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> I'd guess that if the resources are broadcast, Spark would put them into
>> Tachyon...
>>
>> On Jan 12, 2016, at 7:04 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com>
>> wrote:
>>
>> Would it make sense to load them into Tachyon and read and broadcast them
>> from there since Tachyon is already a part of the Spark stack?
>>
>> If so, I wonder if I could do that Tachyon read/write via a Spark API?
>>
>>
>> On Jan 12, 2016, at 2:21 AM, Sabarish Sasidharan <
>> sabarish.sasidha...@manthan.com> wrote:
>>
>> One option could be to store them as blobs in a cache like Redis and then
>> read + broadcast them from the driver. Or you could store them in HDFS and
>> read + broadcast from the driver.
>>
>> Regards
>> Sab
>>
>> On Tue, Jan 12, 2016 at 1:44 AM, Dmitry Goldenberg <
>> dgoldenberg...@gmail.com> wrote:
>>
>>> We have a bunch of Spark jobs deployed and a few large resource files
>>> such as e.g. a dictionary for lookups or a statistical model.
>>>
>>> Right now, these are deployed as part of the Spark job jars themselves, which
>>> will eventually make the jars too bloated for deployment.
>>>
>>> What are some of the best practices to consider for maintaining and
>>> sharing large resource files like these?
>>>
>>> Thanks.
>>>
>>
>>
>>
>> --
>>
>> Architect - Big Data
>> Ph: +91 99805 99458
>>
>> Manthan Systems | *Company of the year - Analytics (2014 Frost and
>> Sullivan India ICT)*
>> +++
>>
>>
>
