Re: Does ElasticsearchIO in the latest RC support adding document IDs?

Etienne Chauchot Wed, 15 Nov 2017 01:29:04 -0800

Yes, exactly. Actually, it arose from a discussion we had with Romain about ESIO.


On 15/11/2017 at 10:08, Jean-Baptiste Onofré wrote:

I think it's also related to the discussion Romain raised on the dev mailing list (gap between batch size, checkpointing & bundles).
Regards
JB

On 11/15/2017 09:53 AM, Etienne Chauchot wrote:
Hi Chet,
What you say is totally true: docs written using ElasticsearchIO will always have an ES-generated id. But that might change in the future; indeed, it might be a good thing to allow the user to pass an id. After just 5 seconds of thinking, I see 3 possible designs for that.
a. (simplest) Use a special JSON field for the id: if the user provides it in the input JSON, it is used; otherwise the id is auto-generated.
b. (a bit less user friendly) PCollection<KV> with K as an id. But this forces the user to do a ParDo before writing to ES to output KV pairs of <id, json>.
c. (a lot more complex) Allow the IO to serialize/deserialize Java beans and have a String id field. Matching Java types to ES types is quite tricky, so for now we just rely on the user to serialize their beans into JSON and let ES match the types automatically.
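As a rough illustration of design (b), and not the actual ElasticsearchIO code, the sketch below turns <id, json> pairs into an Elasticsearch bulk request body by putting the id into the `_id` field of each `index` action line. The class and method names (BulkPayloadSketch, toBulkBody, escape) are hypothetical, and a null key is assumed here to mean "let ES generate the id".

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: not part of Beam or ElasticsearchIO.
public class BulkPayloadSketch {

    // Escapes only backslashes and double quotes. Enough for this
    // illustration; it is not a complete JSON string escaper.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Builds a bulk body from <id, json> pairs. Each document is preceded
    // by an "index" action line; a null id leaves the action line empty
    // so that Elasticsearch generates the id itself.
    static String toBulkBody(List<Map.Entry<String, String>> docs) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> doc : docs) {
            if (doc.getKey() == null) {
                body.append("{ \"index\" : {} }\n");
            } else {
                body.append("{ \"index\" : { \"_id\" : \"")
                    .append(escape(doc.getKey()))
                    .append("\" } }\n");
            }
            body.append(doc.getValue()).append("\n");
        }
        return body.toString();
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> docs = new ArrayList<>();
        docs.add(new SimpleEntry<>("user-1", "{\"name\":\"alice\"}"));
        docs.add(new SimpleEntry<>(null, "{\"name\":\"bob\"}"));
        System.out.print(toBulkBody(docs));
    }
}
```

With an explicit `_id`, indexing the same pair twice overwrites the existing document instead of creating a second one, which is what makes the id useful for retries and re-runs.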
Related to the problems you raise below:
1. Well, the bundle is the commit entity of Beam. Consider the case of ESIO.batchSize being smaller than the bundle size. While processing records, when the number of elements reaches batchSize, an ES bulk insert will be issued but no finishBundle. If there is a problem later on in the bundle processing, before the finishBundle, the checkpoint will still be at the beginning of the bundle, so the whole bundle will be retried, leading to duplicate documents. Thanks for raising that! I'm CCing the dev list so that someone can correct me on the checkpointing mechanism if I'm missing something. Besides, I'm thinking about forcing the user to provide an id in all cases to work around this issue.
2. Correct.
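The retry scenario in point 1 can be simulated without Beam or Elasticsearch. The sketch below is only a model under stated assumptions: FakeIndex is a hypothetical in-memory stand-in for an index, the bundle has 5 elements, the batch size is 2, and the first attempt fails just before the fifth element, after two bulk flushes have already happened. The whole bundle is then retried, as described above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Hypothetical simulation: none of these types are Beam or Elasticsearch APIs.
public class BundleRetrySketch {

    // In-memory stand-in for an index: document id -> JSON source.
    static final class FakeIndex {
        final Map<String, String> docs = new HashMap<>();

        // Each element is {id, json}; a null id gets a random one,
        // mimicking an auto-generated document id.
        void bulkIndex(List<String[]> batch) {
            for (String[] idAndJson : batch) {
                String id = idAndJson[0] != null
                        ? idAndJson[0]
                        : UUID.randomUUID().toString();
                docs.put(id, idAndJson[1]);
            }
        }
    }

    // Processes one bundle, flushing every batchSize elements. Throws just
    // before the element at position failAfter (use -1 for no failure).
    static void processBundle(FakeIndex index, List<String[]> bundle,
                              int batchSize, int failAfter) {
        List<String[]> batch = new ArrayList<>();
        int seen = 0;
        for (String[] element : bundle) {
            if (seen == failAfter) {
                throw new RuntimeException("simulated failure mid-bundle");
            }
            batch.add(element);
            seen++;
            if (batch.size() >= batchSize) {
                index.bulkIndex(batch);
                batch.clear();
            }
        }
        index.bulkIndex(batch); // flush the remainder, as finishBundle would
    }

    // Runs a failing attempt followed by a full retry of the same bundle
    // and returns how many documents end up in the index.
    static int runWithRetry(boolean explicitIds) {
        FakeIndex index = new FakeIndex();
        List<String[]> bundle = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            String id = explicitIds ? "doc-" + i : null;
            bundle.add(new String[] {id, "{\"n\":" + i + "}"});
        }
        try {
            processBundle(index, bundle, 2, 4);
        } catch (RuntimeException e) {
            processBundle(index, bundle, 2, -1); // whole bundle is retried
        }
        return index.docs.size();
    }

    public static void main(String[] args) {
        System.out.println("auto-generated ids: " + runWithRetry(false) + " documents");
        System.out.println("explicit ids: " + runWithRetry(true) + " documents");
    }
}
```

In this model the auto-generated case ends with 9 documents (4 from the failed attempt plus 5 from the retry), while the explicit-id case ends with 5, because the retried writes overwrite the earlier ones.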

Best,
Etienne

On 15/11/2017 at 02:16, Chet Aldrich wrote:
Hello all!
So I’ve been using the ElasticsearchIO sink for a project (unfortunately it’s Elasticsearch 5.x, and so I’ve been messing around with the latest RC) and I’m finding that it doesn’t allow for changing the document ID, but only lets you pass in a record, which means that the document ID is auto-generated. See this line for what specifically is happening:
https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L838
Essentially the data part of the document is being placed but it doesn’t allow for other properties, such as the document ID, to be set.
This leads to two problems:
1. Beam doesn’t necessarily guarantee exactly-once execution for a given item in a PCollection, as I understand it. This means that you may get more than one record in Elastic for a given item in a PCollection that you pass in.
2. You can’t do partial updates to an index. If you run a batch job once, and then run the batch job again on the same index without clearing it, you just double everything in there.
Is there any good way around this?
I’d be happy to try writing up a PR for this in theory, but I’m not sure how best to approach it. I’d also like to figure out a way to get around this in the meantime, if anyone has any ideas.
Best,

Chet
P.S. CCed echauc...@gmail.com because it seems like he’s been doing work related to the elastic sink.
