Hi Dmitry,
I am not familiar with all of the details you have just described, but I
think Tachyon should be able to help you.
If you store all of your resource files in HDFS or S3 or both, you can run
Tachyon to use those storage systems as the under storage (
OK so it looks like Tachyon is a cluster memory plugin marked as
"experimental" in Spark.
In any case, we've got a few requirements for the system we're working on
which may drive the decision for how to implement large resource file
management.
The system is a framework of N data analyzers
The other option, recommended by some folks on this list, was Apache
Ignite. Their In-Memory File System (
https://ignite.apache.org/features/igfs.html) looks quite interesting.
On Thu, Jan 14, 2016 at 7:54 AM, Dmitry Goldenberg wrote:
> OK so it looks like
Ignite can also cache RDDs
> On 12 Jan 2016, at 13:06, Dmitry Goldenberg wrote:
>
> Jörn, you said Ignite or ...? What was the second choice you were thinking
> of? It seems that got omitted.
>
>> On Jan 12, 2016, at 2:44 AM, Jörn Franke
I'd guess that if the resources are broadcast, Spark would put them into
Tachyon...
> On Jan 12, 2016, at 7:04 AM, Dmitry Goldenberg
> wrote:
>
> Would it make sense to load them into Tachyon and read and broadcast them
> from there since Tachyon is already a part of
Jörn, you said Ignite or ...? What was the second choice you were thinking of?
It seems that got omitted.
> On Jan 12, 2016, at 2:44 AM, Jörn Franke wrote:
>
> You can look at Ignite as an HDFS cache or for storing RDDs.
>
>> On 11 Jan 2016, at 21:14, Dmitry Goldenberg
Would it make sense to load them into Tachyon and read and broadcast them from
there since Tachyon is already a part of the Spark stack?
If so, I wonder if I could do that Tachyon read/write via a Spark API?
> On Jan 12, 2016, at 2:21 AM, Sabarish Sasidharan
>
Thanks, Gene.
Does Spark use Tachyon under the covers anyway for implementing its
"cluster memory" support?
It seems that the practice I hear about the most is loading resources as
RDDs and then doing joins against them to achieve the lookup effect.
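That join-as-lookup pattern can be sketched without a cluster; plain Python lists of key/value pairs stand in for keyed RDDs here, and all names and data are illustrative, not from the thread:

```python
# Plain-Python stand-in for the RDD join-as-lookup pattern:
# records keyed by a term are joined against a keyed dictionary,
# mimicking records_rdd.join(dictionary_rdd) in Spark.

records = [("apple", "doc1"), ("pear", "doc2"), ("apple", "doc3")]
dictionary = [("apple", "FRUIT"), ("pear", "FRUIT")]

# Build a lookup index from the dictionary side (roughly what a join
# shuffle would co-locate by key on a cluster).
index = {}
for key, label in dictionary:
    index[key] = label

# Inner join: keep only records whose key appears in the dictionary.
joined = [(key, (doc, index[key])) for key, doc in records if key in index]
print(joined)
# [('apple', ('doc1', 'FRUIT')), ('pear', ('doc2', 'FRUIT')), ('apple', ('doc3', 'FRUIT'))]
```

In real Spark the same shape would be `records_rdd.join(dictionary_rdd)`, with the shuffle doing the key co-location; the cost of that shuffle is what makes people consider broadcasting the dictionary instead when it is small enough.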
The other approach would be to
Hi Dmitry,
Yes, Tachyon can help with your use case. You can read and write to Tachyon
via the filesystem api (
http://tachyon-project.org/documentation/File-System-API.html). There is a
native Java API as well as a Hadoop-compatible API. Spark is also able to
interact with Tachyon via the
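One small illustration of the Hadoop-compatible side: Tachyon paths use a `tachyon://` URI scheme, so they can be handed to anything that accepts a Hadoop path. The host, port, and file path below are assumptions for illustration only (19998 was Tachyon's default master port; check your own deployment):

```python
# Assumed Tachyon master address; 19998 is the default master port,
# but your deployment may differ.
TACHYON_MASTER = "tachyon://localhost:19998"

# Hypothetical resource file stored in Tachyon.
dictionary_path = TACHYON_MASTER + "/resources/lookup-dictionary.txt"

# Inside a Spark job one would then read it with the usual Hadoop-path
# entry points (not executed here):
#   rdd = sc.textFile(dictionary_path)
print(dictionary_path)
```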
You can look at Ignite as an HDFS cache or for storing RDDs.
> On 11 Jan 2016, at 21:14, Dmitry Goldenberg wrote:
>
> We have a bunch of Spark jobs deployed and a few large resource files,
> such as a dictionary for lookups or a statistical model.
>
> Right now,
Has anyone used Ignite in a production system?
On Mon, Jan 11, 2016 at 11:44 PM, Jörn Franke wrote:
> You can look at Ignite as an HDFS cache or for storing RDDs.
>
> > On 11 Jan 2016, at 21:14, Dmitry Goldenberg
> wrote:
> >
> > We have a bunch
One option could be to store them as blobs in a cache like Redis and then
read + broadcast them from the driver. Or you could store them in HDFS and
read + broadcast from the driver.
Regards
Sab
On Tue, Jan 12, 2016 at 1:44 AM, Dmitry Goldenberg wrote:
> We have a
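Sab's suggestion (fetch the blob once on the driver from Redis or HDFS, then broadcast it) can also be sketched without a cluster. Plain Python objects stand in for the driver-side load and the broadcast handle; every name here is illustrative:

```python
# Stand-in for loading a resource blob on the driver (in practice this
# would be fetched from Redis or HDFS, not written as a literal).
model_blob = {"apple": "FRUIT", "pear": "FRUIT"}

# Stand-in for sc.broadcast(model_blob): a read-only wrapper whose
# .value each task consults, so the dict ships once per executor
# instead of once per record.
class FakeBroadcast:
    def __init__(self, value):
        self.value = value

bc = FakeBroadcast(model_blob)

# Each "task" does a cheap local lookup against the broadcast copy.
def tag(record):
    return (record, bc.value.get(record, "UNKNOWN"))

results = [tag(r) for r in ["apple", "kiwi"]]
print(results)
# [('apple', 'FRUIT'), ('kiwi', 'UNKNOWN')]
```

The point of contrast with the join approach above: the lookup happens inside a map function with no shuffle, which only stays practical while the broadcast blob fits comfortably in each executor's memory.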
We have a bunch of Spark jobs deployed and a few large resource files,
such as a dictionary for lookups or a statistical model.
Right now, these are deployed as part of the Spark jobs, which will
eventually make the uber-jars too bloated for deployments.
What are some of the best practices