Fwd: Re: Does ElasticsearchIO in the latest RC support adding document IDs?

Etienne Chauchot Wed, 15 Nov 2017 01:05:22 -0800

Hi all,

CCing the dev list failed, so I am forwarding the email :)




-------- Forwarded Message --------

Subject: Re: Does ElasticsearchIO in the latest RC support adding document IDs?

Date:    Wed, 15 Nov 2017 09:53:46 +0100
From:    Etienne Chauchot <echauc...@apache.org>
To:      Chet Aldrich <chet.aldr...@postmates.com>, u...@beam.apache.org
Cc:      Philip Chan <phi...@postmates.com>, echauc...@gmail.com



Hi Chet,

What you say is totally true: docs written using ElasticsearchIO will always have an ES-generated id. But that might change in the future; indeed, it might be a good thing to allow the user to pass an id. After just five seconds of thinking, I see three possible designs for that.

a. (simplest) Use a special JSON field for the id: if the user provides it in the input JSON, it is used; otherwise the id is auto-generated.

b. (a bit less user friendly) Use a PCollection<KV> with K as the id. But this forces the user to do a ParDo before writing to ES to output KV pairs of <id, json>.

c. (a lot more complex) Allow the IO to serialize/deserialize Java beans and have a String id field. Matching Java types to ES types is quite tricky, so for now we have just relied on the user to serialize their beans into JSON and let ES match the types automatically.
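
To make design (b) a bit more concrete, here is a minimal sketch of how <id, json> pairs could be turned into an Elasticsearch bulk request body, falling back to an auto-generated id when no id is given. It is plain Java with no Beam or ES client dependency; the class and method names (BulkPayloadSketch, toBulkPayload, escape) are hypothetical and are not part of ElasticsearchIO, and Map.Entry stands in for Beam's KV.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of ElasticsearchIO.
class BulkPayloadSketch {

    // Escapes backslashes and double quotes so the id can be embedded in a
    // JSON string. Control characters are not handled in this sketch.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Builds a bulk request body from <id, json> pairs (Map.Entry stands in
    // for Beam's KV). A null id means "let Elasticsearch generate the id".
    static String toBulkPayload(List<Map.Entry<String, String>> docs) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> doc : docs) {
            String id = doc.getKey();
            if (id == null) {
                sb.append("{\"index\":{}}\n");
            } else {
                sb.append("{\"index\":{\"_id\":\"")
                  .append(escape(id))
                  .append("\"}}\n");
            }
            // The document source is assumed to be a single-line JSON string.
            sb.append(doc.getValue()).append("\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> docs = new ArrayList<>();
        docs.add(new SimpleEntry<>("42", "{\"name\":\"with id\"}"));
        docs.add(new SimpleEntry<>(null, "{\"name\":\"auto id\"}"));
        System.out.print(toBulkPayload(docs));
    }
}
```

With an explicit _id in the action line, indexing the same pair twice overwrites the document instead of creating a second one.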


Regarding the problems you raise below:

1. Well, the bundle is the commit entity of Beam. Consider the case of ESIO.batchSize being smaller than the bundle size. While processing records, when the number of elements reaches batchSize, an ES bulk insert will be issued, but no finishBundle. If there is a problem later on in the bundle processing, before the finishBundle, the checkpoint will still be at the beginning of the bundle, so the whole bundle will be retried, leading to duplicate documents. Thanks for raising that! I'm CCing the dev list so that someone can correct me on the checkpointing mechanism if I'm missing something. Besides, I'm thinking about forcing the user to provide an id in all cases to work around this issue.


2. Correct.
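
The retry scenario in point 1 can be illustrated with a small simulation. This is only a sketch under stated assumptions: it is plain Java that does not use Beam or Elasticsearch, FakeIndex is a hypothetical in-memory stand-in for an ES index, processBundle mimics a writer that flushes every batchSize elements and once more at the end of the bundle, and using the document itself as its id is purely for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation; none of these names exist in Beam or Elasticsearch.
class BundleRetrySimulation {

    // In-memory stand-in for an index: id -> document.
    static final class FakeIndex {
        final Map<String, String> docs = new HashMap<>();
        private int nextAutoId = 0;

        void bulkIndex(List<String> batch, boolean useDocAsId) {
            for (String doc : batch) {
                // Auto-generated ids are always new; explicit ids overwrite.
                String id = useDocAsId ? doc : "auto-" + (nextAutoId++);
                docs.put(id, doc);
            }
        }
    }

    // Processes one bundle, flushing every batchSize elements. Throws when it
    // reaches index failAt (use a negative value for "never fail").
    static void processBundle(List<String> bundle, int batchSize, int failAt,
                              FakeIndex index, boolean useDocAsId) {
        List<String> batch = new ArrayList<>();
        for (int i = 0; i < bundle.size(); i++) {
            if (i == failAt) {
                throw new IllegalStateException("simulated failure");
            }
            batch.add(bundle.get(i));
            if (batch.size() >= batchSize) {
                index.bulkIndex(batch, useDocAsId); // mid-bundle bulk insert
                batch.clear();
            }
        }
        index.bulkIndex(batch, useDocAsId); // finishBundle-style final flush
    }

    // Runs a bundle of 5 elements with batchSize 2: the first attempt fails
    // on the 4th element, then the whole bundle is retried from the start.
    static int run(boolean useDocAsId) {
        List<String> bundle = List.of("a", "b", "c", "d", "e");
        FakeIndex index = new FakeIndex();
        try {
            processBundle(bundle, 2, 3, index, useDocAsId);
        } catch (IllegalStateException e) {
            // The runner would retry the bundle; the first flush stays written.
        }
        processBundle(bundle, 2, -1, index, useDocAsId);
        return index.docs.size();
    }

    public static void main(String[] args) {
        System.out.println("auto-generated ids: " + run(false)); // 7 documents
        System.out.println("explicit ids:       " + run(true));  // 5 documents
    }
}
```

With auto-generated ids, the two documents flushed before the failure are written again on retry, giving 7 documents for 5 inputs. With explicit ids, the retry overwrites them and the index ends up with exactly 5.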

Best,
Etienne

On 15/11/2017 at 02:16, Chet Aldrich wrote:

Hello all!
So I’ve been using the ElasticsearchIO sink for a project (unfortunately it’s Elasticsearch 5.x, and so I’ve been messing around with the latest RC) and I’m finding that it doesn’t allow for changing the document ID, but only lets you pass in a record, which means that the document ID is auto-generated. See this line for what specifically is happening:
https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L838
Essentially the data part of the document is being placed, but it doesn’t allow for other properties, such as the document ID, to be set.
This leads to two problems:
1. Beam doesn’t necessarily guarantee exactly-once execution for a given item in a PCollection, as I understand it. This means that you may get more than one record in Elastic for a given item in a PCollection that you pass in.
2. You can’t do partial updates to an index. If you run a batch job once, and then run the batch job again on the same index without clearing it, you just double everything in there.
Is there any good way around this?
I’d be happy to try writing up a PR for this in theory, but I’m not sure how best to approach it. I would also like to figure out a way to get around this in the meantime, if anyone has any ideas.
Best,

Chet
P.S. CCed echauc...@gmail.com because it seems like he’s been doing work related to the elastic sink.
