Hi guys, thanks for the direction. Now I have some problems/questions:

- In local (test) mode I want to use ElasticClient.local to create the ES connection, but in production I want to use ElasticClient.remote. To do this I want to pass the ElasticClient to mapPartitions, or what is the best practice?
- My stream output is written into ElasticSearch. How can I test output.saveAsHadoopFile[ESOutputFormat]("-") in a local environment?
- After storing the enriched data into ES, I want to generate aggregated data (EsInputFormat). How can I test it locally?
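The local/remote switch could be wired up behind a small factory so the rest of the job never cares which mode it runs in. A minimal sketch, where `EsConnection`, `LocalClient`, and `RemoteClient` are hypothetical stand-ins for elastic4s's `ElasticClient.local` / `ElasticClient.remote`:

```scala
// Hypothetical sketch: pick a local (embedded) client for tests and a remote
// client for production from a single flag. LocalClient / RemoteClient stand
// in for ElasticClient.local / ElasticClient.remote(...).
sealed trait EsConnection { def describe: String }

case object LocalClient extends EsConnection {
  val describe = "local"
}

case class RemoteClient(host: String, port: Int) extends EsConnection {
  val describe = s"remote:$host:$port"
}

object EsConnectionFactory {
  // isLocal would typically come from the job's own config (e.g. SparkConf).
  def forEnv(isLocal: Boolean,
             host: String = "localhost",
             port: Int = 9300): EsConnection =
    if (isLocal) LocalClient else RemoteClient(host, port)
}
```

The point of the abstraction is that test code calls `forEnv(isLocal = true)` and production code flips the flag, with no other changes.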
Thanks guys
b0c1

----------------------------------------------------------------------------------------------------------------------------------
Skype: boci13, Hangout: boci.b...@gmail.com

On Wed, Jun 25, 2014 at 1:33 AM, Holden Karau <hol...@pigscanfly.ca> wrote:

> So I'm giving a talk at the Spark Summit on using Spark & ElasticSearch,
> but for now, if you want to see a simple demo which uses ElasticSearch for
> geo input, you can take a look at my quick & dirty implementation,
> TopTweetsInALocation (
> https://github.com/holdenk/elasticsearchspark/blob/master/src/main/scala/com/holdenkarau/esspark/TopTweetsInALocation.scala
> ). This approach uses the ESInputFormat, which avoids the difficulty of
> having to manually create ElasticSearch clients.
>
> This approach might not work for your data, e.g. if you need to create a
> query for each record in your RDD. If that is the case, you could instead
> look at using mapPartitions and setting up your ElasticSearch connection
> inside of that, so you can then re-use the client for all of the queries
> on each partition. This approach avoids having to serialize the
> ElasticSearch connection, because it is local to your function.
>
> Hope this helps!
>
> Cheers,
>
> Holden :)
>
>
> On Tue, Jun 24, 2014 at 4:28 PM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
>
>> It's not used as the default serializer because of some compatibility
>> issues and the requirement to register the classes.
>>
>> Which part are you getting as non-serializable? You need to serialize
>> that class if you are sending it to Spark workers inside a map, reduce,
>> mapPartitions, or any of the other operations on an RDD.
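The mapPartitions pattern Holden describes might be sketched like this, in plain Scala with a dummy `EsClient` standing in for a real ElasticSearch client; `enrichPartition` is the body you would hand to `rdd.mapPartitions`:

```scala
// Dummy stand-in for a real ElasticSearch client. Because it is constructed
// inside the partition function, it is never captured in a closure and so
// never needs to be serializable.
class EsClient(host: String) {
  def query(record: String): String = s"enriched($record)"
  def close(): Unit = ()
}

object EnrichDemo {
  // What you would pass as rdd.mapPartitions { iter => enrichPartition(iter) }:
  // one client per partition, re-used for every record, closed at the end.
  def enrichPartition(records: Iterator[String]): Iterator[String] = {
    val client = new EsClient("localhost")
    // Materialize before closing the client, since Iterator.map is lazy.
    val enriched = records.map(client.query).toList
    client.close()
    enriched.iterator
  }
}
```

Each Spark partition runs this function once, so the connection cost is paid once per partition rather than once per record.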
>>
>>
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>
>>
>> On Wed, Jun 25, 2014 at 4:52 AM, Peng Cheng <pc...@uow.edu.au> wrote:
>>
>>> I'm afraid persisting a connection across two tasks is a dangerous act,
>>> as they can't be guaranteed to be executed on the same machine. Your ES
>>> server may think it's a man-in-the-middle attack!
>>>
>>> I think it's possible to invoke a static method that gives you a
>>> connection from a local 'pool', so nothing will sneak into your closure,
>>> but that's too complex and there should be a better option.
>>>
>>> I've never used Kryo before; if it's that good, perhaps we should use it
>>> as the default serializer.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/ElasticSearch-enrich-tp8209p8222.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
> --
> Cell : 425-233-8271
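The "static pool" idea Peng mentions could look like the following sketch: a per-JVM, lazily created client held in a singleton object, so each executor builds the client once on first use and tasks just call `ClientPool.get`, with nothing ES-related captured in the closure. All names here are hypothetical, and the `AnyRef` is a stand-in for a real client:

```scala
// Hypothetical per-JVM client pool: the first task on an executor creates the
// client, later tasks re-use it, and the driver never serializes it.
object ClientPool {
  private var client: Option[AnyRef] = None

  def get(host: String): AnyRef = synchronized {
    client match {
      case Some(c) => c // re-use the existing client on this JVM
      case None =>
        // In a real job this would be e.g. ElasticClient.remote(host, ...).
        val c = new String(s"client-for-$host")
        client = Some(c)
        c
    }
  }
}
```

As Peng notes, persisting state across tasks this way is fragile (there is no guarantee which machine a task lands on, and cleanup on executor shutdown is your problem), so the per-partition client above is usually the simpler option.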