Hi Dmitry,

Yes, Tachyon can help with your use case. You can read and write to Tachyon via its filesystem API (http://tachyon-project.org/documentation/File-System-API.html). There is a native Java API as well as a Hadoop-compatible API. Spark can also interact with Tachyon through the Hadoop-compatible API, so Spark jobs can read input files from Tachyon and write output files to Tachyon.
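As a rough sketch of that pattern (the master address `tachyon://localhost:19998` and the paths under `/models` and `/input` are placeholders, not anything from your setup), a Spark job can treat a `tachyon://` URI like any other Hadoop-compatible filesystem, load the resource once in the driver, and broadcast it to the executors:

```scala
// Sketch only: assumes a Tachyon master on localhost:19998 and
// hypothetical file paths; adjust host, port, and paths for your cluster.
import org.apache.spark.{SparkConf, SparkContext}

object TachyonResourceExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("tachyon-example"))

    // Read the resource file from Tachyon via the Hadoop-compatible API.
    val dictionary = sc
      .textFile("tachyon://localhost:19998/models/dictionary.txt")
      .collect()
      .toSet

    // Broadcast it from the driver so each executor holds one shared copy
    // instead of deserializing it per task.
    val dictBroadcast = sc.broadcast(dictionary)

    // Use the broadcast value in a transformation.
    val input = sc.textFile("tachyon://localhost:19998/input/data.txt")
    val matched = input.filter(line => dictBroadcast.value.contains(line))

    // Write the output back to Tachyon through the same API.
    matched.saveAsTextFile("tachyon://localhost:19998/output/matched")

    sc.stop()
  }
}
```

The same code works unchanged against HDFS if you swap the `tachyon://` URIs for `hdfs://` ones, since both go through Hadoop's FileSystem abstraction.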
I hope that helps,
Gene

On Tue, Jan 12, 2016 at 4:26 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com> wrote:

> I'd guess that if the resources are broadcast Spark would put them into
> Tachyon...
>
> On Jan 12, 2016, at 7:04 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com>
> wrote:
>
> Would it make sense to load them into Tachyon and read and broadcast them
> from there since Tachyon is already a part of the Spark stack?
>
> If so I wonder if I could do that Tachyon read/write via a Spark API?
>
> On Jan 12, 2016, at 2:21 AM, Sabarish Sasidharan <
> sabarish.sasidha...@manthan.com> wrote:
>
> One option could be to store them as blobs in a cache like Redis and then
> read + broadcast them from the driver. Or you could store them in HDFS and
> read + broadcast from the driver.
>
> Regards
> Sab
>
> On Tue, Jan 12, 2016 at 1:44 AM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> We have a bunch of Spark jobs deployed and a few large resource files,
>> such as a dictionary for lookups or a statistical model.
>>
>> Right now, these are deployed as part of the Spark jobs, which will
>> eventually make the mongo-jars too bloated for deployments.
>>
>> What are some of the best practices to consider for maintaining and
>> sharing large resource files like these?
>>
>> Thanks.
>
> --
> Architect - Big Data
> Ph: +91 99805 99458
>
> Manthan Systems | *Company of the year - Analytics (2014 Frost and
> Sullivan India ICT)*