Thanks. Without the local option I can now connect to the remote ES, but I still have two problems. First, how can I use elasticsearch-hadoop with Spark Streaming? DStream doesn't have a "saveAsHadoopFiles" method. Second, the output index depends on the input data.

Thanks

----------------------------------------------------------------------------------------------------------------------------------
Skype: boci13, Hangout: boci.b...@gmail.com
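For reference, a rough sketch of one way to bridge that gap -- untested, and assuming elasticsearch-hadoop's ESOutputFormat (spelling as used later in this thread) plus its es.nodes / es.resource settings: foreachRDD hands you each micro-batch as a plain RDD, which does have the Hadoop save methods.

    import org.apache.hadoop.io.{MapWritable, NullWritable}
    import org.apache.hadoop.mapred.JobConf
    import org.apache.spark.SparkContext._ // pair-RDD functions, e.g. saveAsHadoopDataset
    import org.apache.spark.streaming.dstream.DStream
    import org.elasticsearch.hadoop.mr.ESOutputFormat // name/package per your es-hadoop version

    def saveToES(stream: DStream[(NullWritable, MapWritable)]): Unit =
      stream.foreachRDD { rdd =>
        // Each micro-batch arrives as an ordinary RDD, so the Hadoop
        // save methods missing on DStream are available here.
        val conf = new JobConf(rdd.context.hadoopConfiguration)
        conf.set("es.nodes", "localhost:9200")   // hypothetical address
        conf.set("es.resource", "twitter/tweet") // target index/type, fixed per JobConf
        conf.setOutputFormat(classOf[ESOutputFormat])
        rdd.saveAsHadoopDataset(conf)
      }

Because es.resource is fixed per JobConf, an index that depends on the input data would mean splitting each batch by target index and issuing one save per index.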
On Thu, Jun 26, 2014 at 10:10 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

> You can just add elasticsearch-hadoop as a dependency to your project to
> use the ESInputFormat and ESOutputFormat
> (https://github.com/elasticsearch/elasticsearch-hadoop). Some other basics
> here:
> http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/spark.html
>
> For testing, yes, I think you will need to start ES in local mode (just
> ./bin/elasticsearch) and use the default config (host = localhost, port =
> 9200).
>
> On Thu, Jun 26, 2014 at 9:04 AM, boci <boci.b...@gmail.com> wrote:
>
>> That's okay, but Hadoop has ES integration. What happens if I run
>> saveAsHadoopFile without Hadoop? Or do I need to pull up Hadoop
>> programmatically (if I even can)?
>>
>> b0c1
>>
>> On Thu, Jun 26, 2014 at 1:20 AM, Holden Karau <hol...@pigscanfly.ca>
>> wrote:
>>
>>> On Wed, Jun 25, 2014 at 4:16 PM, boci <boci.b...@gmail.com> wrote:
>>>
>>>> Hi guys, thanks for the directions. Now I have some problems/questions:
>>>> - In local (test) mode I want to use ElasticClient.local to create the
>>>> ES connection, but in production I want to use ElasticClient.remote. For
>>>> this I want to pass the ElasticClient to mapPartitions -- or what is the
>>>> best practice?
>>>>
>>> In this case you probably want to make the ElasticClient inside of
>>> mapPartitions (since it isn't serializable), and if you want to use a
>>> different client in local mode, just have a flag that controls what type
>>> of client you create.
>>>
>>>> - My stream output is written into Elasticsearch. How can I test
>>>> output.saveAsHadoopFile[ESOutputFormat]("-") in a local environment?
>>>>
>>>> - After storing the enriched data in ES, I want to generate aggregated
>>>> data (EsInputFormat). How can I test that locally?
>>>>
>>> I think the simplest thing to do would be to use the same client in
>>> local mode and just start a single-node Elasticsearch cluster.
>>>
>>>> Thanks guys
>>>>
>>>> b0c1
>>>>
>>>> On Wed, Jun 25, 2014 at 1:33 AM, Holden Karau <hol...@pigscanfly.ca>
>>>> wrote:
>>>>
>>>>> So I'm giving a talk at the Spark Summit on using Spark &
>>>>> ElasticSearch, but for now, if you want to see a simple demo which uses
>>>>> Elasticsearch for geo input, you can take a look at my quick & dirty
>>>>> implementation, TopTweetsInALocation (
>>>>> https://github.com/holdenk/elasticsearchspark/blob/master/src/main/scala/com/holdenkarau/esspark/TopTweetsInALocation.scala
>>>>> ). This approach uses the ESInputFormat, which avoids the difficulty
>>>>> of having to manually create Elasticsearch clients.
>>>>>
>>>>> This approach might not work for your data, e.g. if you need to create
>>>>> a query for each record in your RDD. If this is the case, you could
>>>>> instead look at using mapPartitions and setting up your Elasticsearch
>>>>> connection inside of that, so you could then re-use the client for all
>>>>> of the queries on each partition. This approach will avoid having to
>>>>> serialize the Elasticsearch connection, because it will be local to
>>>>> your function.
>>>>>
>>>>> Hope this helps!
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Holden :)
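A minimal sketch of the mapPartitions pattern Holden describes above, assuming the elastic4s-style ElasticClient.local / ElasticClient.remote factories mentioned earlier in the thread (the host, port, and useLocal flag are hypothetical):

    import com.sksamuel.elastic4s.ElasticClient
    import org.apache.spark.rdd.RDD

    def enrich(records: RDD[String], useLocal: Boolean): RDD[String] =
      records.mapPartitions { partition =>
        // Build the client here, on the worker, so it is never serialized;
        // the flag switches between the test and production client.
        val client =
          if (useLocal) ElasticClient.local
          else ElasticClient.remote("es-host", 9300) // hypothetical address
        partition.map { record =>
          // ... issue the per-record query here, re-using `client`
          // for every record in this partition ...
          record
        }
      }

One client is created per partition rather than per record, which is the point of the pattern.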
>>>>> On Tue, Jun 24, 2014 at 4:28 PM, Mayur Rustagi <mayur.rust...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> It's not used as the default serializer because of some compatibility
>>>>>> issues and the requirement to register the classes.
>>>>>>
>>>>>> Which part are you getting as non-serializable? You need to serialize
>>>>>> a class if you are sending it to Spark workers inside a map, reduce,
>>>>>> mapPartitions or any of the other operations on an RDD.
>>>>>>
>>>>>> Mayur Rustagi
>>>>>> Ph: +1 (760) 203 3257
>>>>>> http://www.sigmoidanalytics.com
>>>>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>>>
>>>>>> On Wed, Jun 25, 2014 at 4:52 AM, Peng Cheng <pc...@uow.edu.au> wrote:
>>>>>>
>>>>>>> I'm afraid persisting a connection across two tasks is a dangerous
>>>>>>> act, as they can't be guaranteed to be executed on the same machine.
>>>>>>> Your ES server may think it's a man-in-the-middle attack!
>>>>>>>
>>>>>>> I think it's possible to invoke a static method that gives you a
>>>>>>> connection from a local 'pool', so nothing will sneak into your
>>>>>>> closure, but that's too complex and there should be a better option.
>>>>>>>
>>>>>>> I've never used Kryo before; if it's that good, perhaps we should
>>>>>>> use it as the default serializer.
>>>>>
>>>>> --
>>>>> Cell : 425-233-8271
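The static 'pool' Peng mentions could look something like the sketch below (the object, names, and elastic4s-style client are hypothetical): a singleton holds the client in a lazy val, so closures capture only the object reference and the connection itself is created on whatever worker runs the task.

    import com.sksamuel.elastic4s.ElasticClient
    import org.apache.spark.rdd.RDD

    object ESClientHolder {
      // One client per worker JVM, created lazily on first use.
      lazy val client: ElasticClient =
        ElasticClient.remote("es-host", 9300) // hypothetical address
    }

    def lookup(records: RDD[String]): RDD[String] =
      records.map { record =>
        // Nothing non-serializable sneaks into the closure: the client is
        // resolved on the worker at run time, not shipped from the driver.
        val client = ESClientHolder.client
        // ... query Elasticsearch with `client` here ...
        record
      }

Note this shares one client across all tasks in a worker JVM, which sidesteps the serialization problem, though it never closes the client explicitly.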