Another question: in foreachRDD I will initialize the JobConf, but at that point how can I get information from the items?
I have an identifier in the data which identifies the required ES index, so how can I set a dynamic index in foreachRDD?
b0c1
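For what it's worth, a rough sketch of the dynamic-index idea from the question above, using elasticsearch-hadoop's pattern-based es.resource.write inside foreachRDD. This is only a sketch of the approach, not code from the thread; the record type, field names, dependency details and ES address are assumptions:

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{MapWritable, NullWritable, Text}
    import org.apache.hadoop.mapred.{FileOutputCommitter, FileOutputFormat, JobConf}
    import org.apache.spark.SparkContext._
    import org.apache.spark.streaming.dstream.DStream
    import org.elasticsearch.hadoop.mr.EsOutputFormat

    // Hypothetical record: `indexName` carries the identifier that selects the ES index.
    case class Doc(indexName: String, id: String, body: String)

    def saveToEs(stream: DStream[Doc]): Unit = {
      stream.foreachRDD { rdd =>
        val jobConf = new JobConf(rdd.context.hadoopConfiguration)
        jobConf.setOutputFormat(classOf[EsOutputFormat])
        jobConf.setOutputCommitter(classOf[FileOutputCommitter])
        jobConf.set("es.nodes", "localhost:9200")          // assumption: local ES node
        // Dynamic resource: {indexName} is resolved per document from the field
        // of the same name, so each record can land in a different index.
        jobConf.set("es.resource.write", "{indexName}/docs")
        FileOutputFormat.setOutputPath(jobConf, new Path("-"))  // dummy path; nothing is written to disk

        rdd.map { doc =>
          val m = new MapWritable()
          m.put(new Text("indexName"), new Text(doc.indexName))
          m.put(new Text("id"), new Text(doc.id))
          m.put(new Text("body"), new Text(doc.body))
          (NullWritable.get(), m)
        }.saveAsHadoopDataset(jobConf)
      }
    }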
OK, I found the dynamic resources, but I have a frustrating problem. This is the flow:
kafka -> enrich X -> enrich Y -> enrich Z -> foreachRDD -> save
My problem is: if I do this it doesn't work, the enrich functions are not called, but if I put a print in, it does. For example if I do this:
kafka -> enrich X ->
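As an aside (my own note, not from the truncated reply below): Spark transformations are lazy, so an enrich step that is only mapped over a DStream will not run until some output operation forces it, which would explain why adding a print makes it "work". A minimal sketch with made-up enrich names:

    import org.apache.spark.streaming.dstream.DStream

    // Placeholder enrich step; the real X/Y/Z enrichers are not shown in the thread.
    def enrichX(line: String): String = line + "|X"

    def pipeline(kafkaLines: DStream[String]): Unit = {
      // map() only registers a lazy transformation; nothing executes yet.
      val enriched = kafkaLines.map(enrichX)

      // foreachRDD is an output operation, but the closure must still call an RDD
      // action (foreach, count, saveAsHadoopDataset, ...) or the upstream enrich
      // steps will never actually run.
      enriched.foreachRDD { rdd =>
        rdd.foreach(record => println(record)) // stand-in for the real save
      }
    }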
So a few quick questions:
1) What cluster are you running this against? Is it just local? Have you tried local[4]?
2) When you say breakpoint, how are you setting this breakpoint? There is a good chance your breakpoint mechanism doesn't work in a distributed environment; could you instead cause
Wow, thanks for your fast answer, it helps a lot...
b0c1
--
Skype: boci13, Hangout: boci.b...@gmail.com
On Thu, Jun 26, 2014 at 11:48
This is a simple ScalaTest. I start a SparkConf, set the master to local (set the serializer etc.), pull up the Kafka and ES connections, send a message to Kafka, and wait 30 seconds for processing.
It runs in IDEA, no magic tricks.
b0c1
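Roughly what such a test skeleton might look like; a sketch only, where the suite name, batch interval, and the explicit 30-second wait are assumptions based on the description above:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.scalatest.FunSuite

    class EnrichStreamSuite extends FunSuite {
      test("a Kafka message ends up enriched in Elasticsearch") {
        val conf = new SparkConf()
          .setMaster("local[4]")   // local[4] rather than plain "local"; see the suggestion below
          .setAppName("es-enrich-test")
          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        val ssc = new StreamingContext(conf, Seconds(1))

        // ... wire up the Kafka DStream and the ES client, publish a test message ...

        ssc.start()
        ssc.awaitTermination(30000)   // give the pipeline up to 30s to process
        ssc.stop(stopSparkContext = true)

        // ... query ES and assert the enriched document is there ...
      }
    }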
Try setting the master to local[4]
On Fri, Jun 27, 2014 at 2:17 PM, boci boci.b...@gmail.com wrote:
This is a simple ScalaTest. I start a SparkConf, set the master to local (set the serializer etc.), pull up the Kafka and ES connections, send a message to Kafka, and wait 30 seconds for processing.
It's
That's okay, but Hadoop has ES integration. What happens if I run saveAsHadoopFile without Hadoop? (Or do I need to pull up Hadoop programmatically, if I even can?)
b0c1
You can just add elasticsearch-hadoop as a dependency to your project to use the ESInputFormat and ESOutputFormat (https://github.com/elasticsearch/elasticsearch-hadoop). Some other basics here:
http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/spark.html
For testing, yes I
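A small read sketch along the lines of the basics linked above; no Hadoop cluster is needed, only the elasticsearch-hadoop jar on the classpath. The index name, node address, and dependency version are assumptions:

    // build.sbt (version is an assumption; match it to your ES cluster):
    // libraryDependencies += "org.elasticsearch" % "elasticsearch-hadoop" % "2.0.0"

    import org.apache.hadoop.io.{MapWritable, Text}
    import org.apache.hadoop.mapred.JobConf
    import org.apache.spark.SparkContext
    import org.elasticsearch.hadoop.mr.EsInputFormat

    // Reading from ES through the Map/Reduce input format.
    def readFromEs(sc: SparkContext): Unit = {
      val conf = new JobConf()
      conf.set("es.resource", "radio/artists")   // placeholder index/type
      conf.set("es.nodes", "localhost:9200")     // assumption: local ES node

      val esRDD = sc.hadoopRDD(conf,
        classOf[EsInputFormat[Text, MapWritable]],
        classOf[Text], classOf[MapWritable])

      println(s"documents read: ${esRDD.count()}")
    }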
Thanks. Without the local option I can connect to ES remotely; now I only have one problem. How can I use elasticsearch-hadoop with Spark Streaming? I mean, DStream doesn't have a saveAsHadoopFiles method, and my second problem is that the output index depends on the input data.
Thanks
Hi b0c1,
I have an example of how to do this in the repo for my talk as well; the specific example is at
specific example is at
https://github.com/holdenk/elasticsearchspark/blob/master/src/main/scala/com/holdenkarau/esspark/IndexTweetsLive.scala
. Since DStream doesn't have a saveAsHadoopDataset we use foreachRDD and
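The shape of that approach, heavily condensed; this is a sketch of the pattern described, not the actual code from the linked file, and the resource name, record preparation, and ES address are placeholders:

    import org.apache.hadoop.io.{MapWritable, NullWritable, Text}
    import org.apache.hadoop.mapred.JobConf
    import org.apache.spark.SparkContext._
    import org.apache.spark.streaming.dstream.DStream
    import org.elasticsearch.hadoop.mr.EsOutputFormat

    def indexStream(tweets: DStream[String]): Unit = {
      tweets.foreachRDD { rdd =>
        // Build the JobConf per batch from the RDD's SparkContext.
        val jobConf = new JobConf(rdd.context.hadoopConfiguration)
        jobConf.setOutputFormat(classOf[EsOutputFormat])
        jobConf.set("es.resource.write", "tweets/tweet")   // placeholder index/type
        jobConf.set("es.nodes", "localhost:9200")          // assumption: local ES node

        // Turn each record into a writable map and save it on the RDD, since
        // saveAsHadoopDataset exists on RDDs even though DStream lacks it.
        rdd.map { text =>
          val m = new MapWritable()
          m.put(new Text("text"), new Text(text))
          (NullWritable.get(), m)
        }.saveAsHadoopDataset(jobConf)
      }
    }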
Wow, thanks for your fast answer, it helps a lot...
b0c1
--
Skype: boci13, Hangout: boci.b...@gmail.com
On Thu, Jun 26, 2014 at 11:48 PM, Holden Karau
Just your luck I happened to be working on that very talk today :) Let me
know how your experiences with Elasticsearch Spark go :)
On Thu, Jun 26, 2014 at 3:17 PM, boci boci.b...@gmail.com wrote:
Wow, thanks for your fast answer, it helps a lot...
b0c1
Hi guys, thanks for the direction. Now I have some problems/questions:
- in local (test) mode I want to use ElasticClient.local to create the ES connection, but in production I want to use ElasticClient.remote; for this I want to pass the ElasticClient to mapPartitions, or what is the best practice?
- my stream
On Wed, Jun 25, 2014 at 4:16 PM, boci boci.b...@gmail.com wrote:
Hi guys, thanks for the direction. Now I have some problems/questions:
- in local (test) mode I want to use ElasticClient.local to create the ES connection, but in production I want to use ElasticClient.remote; for this I
want to pass
Make sure all queries are called through class methods and wrap your query info in a class that has only simple properties (strings, collections, etc.). If you can't find such a wrapper you can also use the SerializableWritable wrapper out of the box, but it's not recommended (it's a developer API and makes fat
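A tiny sketch of such a wrapper; the class and field names are made up, and the point is only that plain strings/numbers get serialized while the real client is rebuilt on the worker:

    // Only simple, serializable properties live in the wrapper itself.
    case class EsQueryInfo(nodes: String, port: Int, index: String, query: String)
        extends Serializable {

      // @transient + lazy: dropped when the closure is serialized, rebuilt on
      // first use inside each worker JVM.
      @transient private lazy val client: AnyRef = connect()

      private def connect(): AnyRef = {
        // build the real (non-serializable) ES client from the simple fields here
        new AnyRef
      }

      // Queries are always issued through class methods, never by shipping a client.
      def run(): Unit = {
        val c = client
        // ... execute `query` against `index` via `c` ...
      }
    }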
OK, but in this case where can I store the ES connection? Or should every document create a new ES connection inside the worker?
--
Skype: boci13, Hangout: boci.b...@gmail.com
On
Most likely the ES client is not serializable for you. There are 3 workarounds:
1. Switch to Kryo serialization and register the client in Kryo; this might solve your serialization issue.
2. Use mapPartitions for all your data and initialize your client in the mapPartitions code; this will create a client for each
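Workaround 2 sketched out with the elastic4s client mentioned earlier in the thread; the flag, host, and the close() call are assumptions about your setup:

    import com.sksamuel.elastic4s.ElasticClient
    import org.apache.spark.rdd.RDD

    // One client per partition: built inside mapPartitions on the worker, used for
    // every record in that partition, then closed. Nothing gets serialized.
    def enrichWithEs(records: RDD[String], useLocalEs: Boolean): RDD[String] =
      records.mapPartitions { iter =>
        val client =
          if (useLocalEs) ElasticClient.local                  // test mode
          else ElasticClient.remote("es-host", 9300)           // production mode (placeholder host)
        try {
          // materialize the partition so the client can be closed safely afterwards
          iter.map { record =>
            // ... enrich `record` with lookups through `client` here ...
            record
          }.toList.iterator
        } finally {
          client.close()   // assuming the client exposes close()
        }
      }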
I'm afraid persisting a connection across two tasks is a dangerous act, as they can't be guaranteed to be executed on the same machine. Your ES server may think it's a man-in-the-middle attack!
I think it's possible to invoke a static method that gives you a connection from a local 'pool', so nothing
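The "static method / local pool" idea could look roughly like this; the host and port are placeholders, and a lazy val in an object gives one connection per worker JVM:

    import com.sksamuel.elastic4s.ElasticClient

    object EsClientPool {
      // Created on first access inside each executor JVM, then reused by every
      // task scheduled there; never serialized or shipped with a closure.
      lazy val client: ElasticClient = ElasticClient.remote("es-host", 9300)
    }

    // Inside a transformation, the closure only captures the object's name:
    // rdd.map { doc => val c = EsClientPool.client; /* query via c */ doc }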
It's not used as the default serializer because of some compatibility issues and the requirement to register the classes.
Which part are you getting as non-serializable? You need to serialize that class if you are sending it to Spark workers inside a map, reduce, mapPartitions, or any of the operations on
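For completeness, registering classes with Kryo looks roughly like this; the document class is a placeholder, and whether registering a live client actually makes it behave is a separate question, as noted above:

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    // A placeholder class that ends up inside closures and needs registering.
    case class EnrichedDoc(id: String, index: String, body: String)

    // Kryo is not the default partly because classes have to be registered by hand.
    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        kryo.register(classOf[EnrichedDoc])
      }
    }

    object KryoSetup {
      def conf(): SparkConf = new SparkConf()
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.kryo.registrator", classOf[MyRegistrator].getName)
    }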