Would the IndexedRDD feature provide what the Lookup RDD does? I've been using a broadcast variable map for a similar kind of thing. It's probably within 1 GB, but I'd be interested to know whether the lookup (or indexed) RDD might be better.

C
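A minimal sketch of the broadcast-map approach mentioned above, assuming PySpark. The toy dictionary and the names `enrich_partition` and `run_with_broadcast` are hypothetical, made up for illustration. The lookup itself is plain Python so it can be tried without a cluster; the Spark wiring is just one way this could look, not necessarily how anyone in this thread does it.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple


def enrich_partition(
    words: Iterable[str], dictionary: Dict[str, int]
) -> Iterator[Tuple[str, Optional[int]]]:
    """Look up each word in an in-memory dictionary; None marks a miss."""
    for word in words:
        yield word, dictionary.get(word)


def run_with_broadcast() -> list:
    """Wire the lookup into Spark through a broadcast variable (needs pyspark)."""
    # Imported lazily so enrich_partition can be used without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "broadcast-lookup-sketch")
    try:
        # Hypothetical toy dictionary; a real one would be loaded once on the driver.
        lookup = sc.broadcast({"spark": 1, "rdd": 2})
        words = sc.parallelize(["spark", "rdd", "unknown"])
        # lookup.value is read on the executors, so the dictionary is not
        # re-serialized into every task closure.
        return words.mapPartitions(
            lambda part: enrich_partition(part, lookup.value)
        ).collect()
    finally:
        sc.stop()
```

If the dictionary outgrows what is comfortable to broadcast, the alternative raised further down the thread is to keep it as a keyed RDD and join against it instead.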
On Friday, June 5, 2015, Dmitry Goldenberg <dgoldenberg...@gmail.com> wrote:

> Thanks everyone. Evo, could you provide a link to the Lookup RDD project?
> I can't seem to locate it exactly on Github. (Yes, to your point, our
> project is Spark streaming based). Thank you.
>
> On Fri, Jun 5, 2015 at 6:04 AM, Evo Eftimov <evo.efti...@isecc.com> wrote:
>
>> Oops, @Yiannis, sorry to be a party pooper but the Job Server is for
>> Spark Batch Jobs (besides anyone can put something like that in 5 min),
>> while I am under the impression that Dmitry is working on a Spark
>> Streaming app
>>
>> Besides the Job Server is essentially for sharing the Spark Context
>> between multiple threads
>>
>> Re Dmitry's initial question – you can load large data sets as Batch
>> (Static) RDD from any Spark Streaming App and then join DStream RDDs
>> against them to emulate “lookups”; you can also try the “Lookup RDD” –
>> there is a GitHub project
>>
>> *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com]
>> *Sent:* Friday, June 5, 2015 12:12 AM
>> *To:* Yiannis Gkoufas
>> *Cc:* Olivier Girardot; user@spark.apache.org
>> *Subject:* Re: How to share large resources like dictionaries while
>> processing data with Spark ?
>>
>> Thanks so much, Yiannis, Olivier, Huang!
>>
>> On Thu, Jun 4, 2015 at 6:44 PM, Yiannis Gkoufas <johngou...@gmail.com> wrote:
>>
>> Hi there,
>>
>> I would recommend checking out
>> https://github.com/spark-jobserver/spark-jobserver which I think gives
>> the functionality you are looking for.
>>
>> I haven't tested it though.
>>
>> BR
>>
>> On 5 June 2015 at 01:35, Olivier Girardot <ssab...@gmail.com> wrote:
>>
>> You can use it as a broadcast variable, but if it's "too" large (more
>> than 1 GB, I guess), you may need to share it by joining it to the other
>> RDDs on some kind of key.
>>
>> But this is the kind of thing broadcast variables were designed for.
>>
>> Regards,
>>
>> Olivier.
>>
>> On Thu, Jun 4, 2015 at 23:50, dgoldenberg <dgoldenberg...@gmail.com> wrote:
>>
>> We have some pipelines defined where sometimes we need to load potentially
>> large resources such as dictionaries.
>>
>> What would be the best strategy for sharing such resources among the
>> transformations/actions within a consumer? Can they be shared somehow
>> across the RDDs?
>>
>> I'm looking for a way to load such a resource once into the cluster memory
>> and have it be available throughout the lifecycle of a consumer...
>>
>> Thanks.
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-share-large-resources-like-dictionaries-while-processing-data-with-Spark-tp23162.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.

--
- Charles