Hi guys, thanks for the direction. Now I have some problems/questions:

- In local (test) mode I want to use ElasticClient.local to create the ES connection, but in production I want to use ElasticClient.remote. To do this I want to pass the ElasticClient to mapPartitions, or what is the best practice?
- My stream output is written into ElasticSearch. How can I test output.saveAsHadoopFile[ESOutputFormat]("-") in a local environment?
- After storing the enriched data into ES, I want to generate aggregated data (EsInputFormat). How can I test it locally?
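The local/remote switch could be wired up behind a small factory so the rest of the job never cares which mode it runs in. A minimal sketch, where `EsConnection`, `LocalClient`, and `RemoteClient` are hypothetical stand-ins for elastic4s's `ElasticClient.local` / `ElasticClient.remote`:

```scala
// Hypothetical sketch: pick a local (embedded) client for tests and a remote
// client for production from a single flag. LocalClient / RemoteClient stand
// in for ElasticClient.local / ElasticClient.remote(...).
sealed trait EsConnection { def describe: String }

case object LocalClient extends EsConnection {
  val describe = "local"
}

case class RemoteClient(host: String, port: Int) extends EsConnection {
  val describe = s"remote:$host:$port"
}

object EsConnectionFactory {
  // isLocal would typically come from the job's own config (e.g. SparkConf).
  def forEnv(isLocal: Boolean,
             host: String = "localhost",
             port: Int = 9300): EsConnection =
    if (isLocal) LocalClient else RemoteClient(host, port)
}
```

The point of the abstraction is that test code calls `forEnv(isLocal = true)` and production code flips the flag, with no other changes.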
Thanks guys
b0c1

----------------------------------------------------------------------------------------------------------------------------------
Skype: boci13, Hangout: boci.b...@gmail.com

On Wed, Jun 25, 2014 at 1:33 AM, Holden Karau <hol...@pigscanfly.ca> wrote:

> So I'm giving a talk at the Spark Summit on using Spark & ElasticSearch,
> but for now, if you want to see a simple demo which uses ElasticSearch for
> geo input, you can take a look at my quick & dirty implementation,
> TopTweetsInALocation (
> https://github.com/holdenk/elasticsearchspark/blob/master/src/main/scala/com/holdenkarau/esspark/TopTweetsInALocation.scala
> ). This approach uses the ESInputFormat, which avoids the difficulty of
> having to manually create ElasticSearch clients.
>
> This approach might not work for your data, e.g. if you need to create a
> query for each record in your RDD. If that is the case, you could instead
> look at using mapPartitions and setting up your ElasticSearch connection
> inside of that, so you can then re-use the client for all of the queries
> on each partition. This approach avoids having to serialize the
> ElasticSearch connection, because it is local to your function.
>
> Hope this helps!
>
> Cheers,
>
> Holden :)
>
>
> On Tue, Jun 24, 2014 at 4:28 PM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
>
>> It's not used as the default serializer because of some compatibility
>> issues and the requirement to register the classes.
>>
>> Which part are you getting as non-serializable? You need to serialize
>> that class if you are sending it to Spark workers inside a map, reduce,
>> mapPartitions, or any of the other operations on an RDD.
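The mapPartitions pattern Holden describes might be sketched like this, in plain Scala with a dummy `EsClient` standing in for a real ElasticSearch client; `enrichPartition` is the body you would hand to `rdd.mapPartitions`:

```scala
// Dummy stand-in for a real ElasticSearch client. Because it is constructed
// inside the partition function, it is never captured in a closure and so
// never needs to be serializable.
class EsClient(host: String) {
  def query(record: String): String = s"enriched($record)"
  def close(): Unit = ()
}

object EnrichDemo {
  // What you would pass as rdd.mapPartitions { iter => enrichPartition(iter) }:
  // one client per partition, re-used for every record, closed at the end.
  def enrichPartition(records: Iterator[String]): Iterator[String] = {
    val client = new EsClient("localhost")
    // Materialize before closing the client, since Iterator.map is lazy.
    val enriched = records.map(client.query).toList
    client.close()
    enriched.iterator
  }
}
```

Each Spark partition runs this function once, so the connection cost is paid once per partition rather than once per record.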
>>
>>
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>
>>
>> On Wed, Jun 25, 2014 at 4:52 AM, Peng Cheng <pc...@uow.edu.au> wrote:
>>
>>> I'm afraid persisting a connection across two tasks is a dangerous act,
>>> as they can't be guaranteed to be executed on the same machine. Your ES
>>> server may think it's a man-in-the-middle attack!
>>>
>>> I think it's possible to invoke a static method that gives you a
>>> connection from a local 'pool', so nothing will sneak into your closure,
>>> but that's too complex and there should be a better option.
>>>
>>> I've never used Kryo before; if it's that good, perhaps we should use it
>>> as the default serializer.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/ElasticSearch-enrich-tp8209p8222.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
> --
> Cell : 425-233-8271
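The "static pool" idea Peng mentions could look like the following sketch: a per-JVM, lazily created client held in a singleton object, so each executor builds the client once on first use and tasks just call `ClientPool.get`, with nothing ES-related captured in the closure. All names here are hypothetical, and the `AnyRef` is a stand-in for a real client:

```scala
// Hypothetical per-JVM client pool: the first task on an executor creates the
// client, later tasks re-use it, and the driver never serializes it.
object ClientPool {
  private var client: Option[AnyRef] = None

  def get(host: String): AnyRef = synchronized {
    client match {
      case Some(c) => c // re-use the existing client on this JVM
      case None =>
        // In a real job this would be e.g. ElasticClient.remote(host, ...).
        val c = new String(s"client-for-$host")
        client = Some(c)
        c
    }
  }
}
```

As Peng notes, persisting state across tasks this way is fragile (there is no guarantee which machine a task lands on, and cleanup on executor shutdown is your problem), so the per-partition client above is usually the simpler option.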