Re: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Charles Earl
Would Tachyon be appropriate here?


-- 
- Charles


RE: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Evo Eftimov
Oops, @Yiannis, sorry to be a party pooper, but the Job Server is for Spark
batch jobs (besides, anyone can put something like that together in 5 minutes),
while I am under the impression that Dmitry is working on a Spark Streaming app.

 

Besides, the Job Server is essentially for sharing the Spark Context between
multiple threads.

 

Re Dmitry's initial question – you can load large data sets as a batch (static)
RDD from any Spark Streaming app and then join DStream RDDs against them to
emulate “lookups”; you can also try the “Lookup RDD” – there is a GitHub
project.
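
A minimal sketch of that pattern, assuming Spark Streaming 1.x in Scala; the
dictionary path, record format, host, and port are illustrative placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object DictionaryJoinSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("dictionary-join-sketch"), Seconds(10))

        // Load the big dictionary once as a static (batch) pair RDD and cache
        // it, since the join below reuses it in every micro-batch.
        val dictionary = ssc.sparkContext
          .textFile("hdfs:///data/dictionary.tsv")
          .map { line => val Array(k, v) = line.split("\t"); (k, v) }
          .cache()

        // A stream of "key,payload" records; a socket source keeps the sketch small.
        val events = ssc.socketTextStream("localhost", 9999)
          .map { line => val Array(k, payload) = line.split(",", 2); (k, payload) }

        // Join each incoming micro-batch RDD against the static RDD to emulate lookups.
        events.transform(rdd => rdd.join(dictionary)).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }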

 




RE: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Evo Eftimov
And RDD.lookup() cannot be invoked from transformations, e.g. map()

 

lookup() is an action, which can be invoked only from the driver – if you want
functionality like that from within transformations executed on the cluster
nodes, try IndexedRDD.

 

Another option is to load a batch/static RDD once in your Spark Streaming app
and then keep joining (and then, e.g., filtering) every incoming DStream RDD
with the (big, static) batch RDD.
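
A sketch of the IndexedRDD approach, along the lines of the project's README
(assuming its published 0.x artifact is on the classpath and sc is an existing
SparkContext):

    import edu.berkeley.cs.amplab.spark.indexedrdd.IndexedRDD
    import edu.berkeley.cs.amplab.spark.indexedrdd.IndexedRDD._

    // Build an IndexedRDD from a pair RDD with Long keys; each partition keeps
    // an index so point lookups avoid scanning the whole partition.
    val pairs = sc.parallelize((1 to 1000000).map(x => (x.toLong, 0)))
    val indexed = IndexedRDD(pairs).cache()

    indexed.get(42L)   // point lookup from within the app => Some(0)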

 




Re: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Charles Earl
Would the IndexedRDD feature provide what the Lookup RDD does?
I've been using a broadcast variable map for a similar kind of thing – it
is probably within 1 GB, but I'm interested to know whether the lookup (or
indexed) approach might be better.
C



-- 
- Charles


RE: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Evo Eftimov
It is called IndexedRDD: https://github.com/amplab/spark-indexedrdd

 




Re: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Dmitry Goldenberg
Thanks everyone. Evo, could you provide a link to the Lookup RDD project? I
can't seem to locate it exactly on GitHub. (Yes, to your point, our project
is Spark Streaming based.) Thank you.








RE: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Evo Eftimov
Spark uses Tachyon internally, i.e. all SERIALIZED IN-MEMORY RDDs are kept
there – so if you have a BATCH RDD which is SERIALIZED IN-MEMORY, then you are
using Tachyon implicitly. The only difference is that when you use Tachyon
explicitly, i.e. as a distributed, in-memory file system, you can share data
between jobs, while an RDD is visible only within jobs using the same Spark
Context.
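
As a short illustration of the distinction in Spark 1.x code (dictionaryRdd is
a hypothetical batch RDD; note it is specifically the OFF_HEAP storage level
that is backed by Tachyon, configured via the spark.tachyonStore.* / later
spark.externalBlockStore.* settings):

    import org.apache.spark.storage.StorageLevel

    // Serialized, in-memory persistence inside the executor JVMs:
    dictionaryRdd.persist(StorageLevel.MEMORY_ONLY_SER)

    // Off-heap persistence; in Spark 1.x the serialized blocks go to Tachyon:
    dictionaryRdd.persist(StorageLevel.OFF_HEAP)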

 




Re: How to share large resources like dictionaries while processing data with Spark ?

2015-06-04 Thread Olivier Girardot
You can use it as a broadcast variable, but if it's too large (more than
1 GB, I guess), you may need to share it by joining it, via some kind of key,
to the other RDDs.
But this is the kind of thing broadcast variables were designed for.
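
A minimal sketch of the broadcast approach (loadDictionary and records are
hypothetical placeholders for your own loader and input RDD):

    // Ship the read-only dictionary to every executor once.
    val dict: Map[String, String] = loadDictionary()
    val dictBc = sc.broadcast(dict)

    // Look entries up on the workers via the broadcast handle.
    val annotated = records.map { word =>
      (word, dictBc.value.getOrElse(word, "UNKNOWN"))
    }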

Regards,

Olivier.

On Thu, Jun 4, 2015 at 23:50, dgoldenberg dgoldenberg...@gmail.com
wrote:

 We have some pipelines defined where sometimes we need to load potentially
 large resources such as dictionaries.

 What would be the best strategy for sharing such resources among the
 transformations/actions within a consumer?  Can they be shared somehow
 across the RDD's?

 I'm looking for a way to load such a resource once into the cluster memory
 and have it be available throughout the lifecycle of a consumer...

 Thanks.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-share-large-resources-like-dictionaries-while-processing-data-with-Spark-tp23162.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





RE: How to share large resources like dictionaries while processing data with Spark ?

2015-06-04 Thread Huang, Roger
Is the dictionary read-only?
Did you look at 
http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables ?





Re: How to share large resources like dictionaries while processing data with Spark ?

2015-06-04 Thread Yiannis Gkoufas
Hi there,

I would recommend checking out
https://github.com/spark-jobserver/spark-jobserver which I think gives the
functionality you are looking for.
I haven't tested it though.

BR





Re: How to share large resources like dictionaries while processing data with Spark ?

2015-06-04 Thread Dmitry Goldenberg
Thanks so much, Yiannis, Olivier, Huang!
