Re: Does ElasticsearchIO in the latest RC support adding document IDs?

Chet Aldrich Thu, 16 Nov 2017 10:42:24 -0800

Sure, I’d be happy to take it on. My JIRA ID is chetaldrich. We can continue 
discussion on that ticket.


Chet

> On Nov 16, 2017, at 7:57 AM, Etienne Chauchot <echauc...@apache.org> wrote:
> 
> Chet, 
> FYI, here is the ticket and the design proposal: 
> https://issues.apache.org/jira/browse/BEAM-3201 
> <https://issues.apache.org/jira/browse/BEAM-3201>. If you still want to code 
> that improvement, give me your jira id and I will assign the ticket to you. 
> Otherwise I can code it as well.
> 
> Best
> 
> Etienne
> 
> Le 16/11/2017 à 09:19, Etienne Chauchot a écrit :
>> Hi, 
>> Thanks for the offer, I'd be happy to review your PR. Just wait a bit until 
>> I have opened a proper ticket for that. I still need to think more about the 
>> design. Among other things, I have to check what ES dev team did for other 
>> big data ES IO (es_hadoop) on that particular point. Besides, I think we 
>> also need to deal with the id at read time not only at write time. I'll give 
>> some details in the ticket.
>> 
>> Le 15/11/2017 à 20:08, Chet Aldrich a écrit :
>>> Given that this seems like a change that should probably happen, and I’d 
>>> like to help contribute if possible, a few questions and my current 
>>> opinion: 
>>> 
>>> So I’m leaning towards approach B here, which is:
>>>> b. (a bit less user friendly) PCollection<KV> with K as an id. But forces 
>>>> the user to do a Pardo before writing to ES to output KV pairs of <id, 
>>>> json>
>>>> 
>>> I think that the reduction in user-friendliness may be outweighed by the 
>>> fact that this obviates some of the issues surrounding a failure when 
>>> finishing a bundle. Additionally, this forces the user to provide a 
>>> document id, which I think is probably better practice.
>>> 
>> Yes as I wrote before, I think it is better to force the user to provide an 
>> id (at least for index updates, exactly-one semantics is a larger beam 
>> subject than this IO scope). Regarding design, plan b is not the better one 
>> IMHO because it changes the IO public API. I'm more in favor of plan a with 
>> the ability for the user to tell what field is his doc id.
>>> This will also probably lead to fewer frustrations around “magic” code that 
>>> just pulls something in if it happens to be there, and doesn’t if not. 
>>> We’ll need to rely on the user catching this functionality in the docs or 
>>> the code itself to take advantage of it. 
>>> 
>>> IMHO it’d be generally better to enforce this at compile time because it 
>>> does have an effect on whether the pipeline produces duplicates on failure. 
>>> Additionally, we get the benefit of relatively intuitive behavior where if 
>>> the user passes in the same Key value, it’ll update a record in ES, and if 
>>> the key is different then it will create a new record. 
>> Totally agree, id enforcement at compile time, no auto-generation
>>> 
>>> Curious to hear thoughts on this. If this seems reasonable I’ll go ahead 
>>> and create a JIRA for tracking and start working on a PR for this. Also, if 
>>> it’d be good to loop in the dev mailing list before starting let me know, 
>>> I’m pretty new to this. 
>> I'll create the ticket and we will loop on design in the comments.
>> Best
>> Etienne
>>> 
>>> Chet 
>>> 
>>>> On Nov 15, 2017, at 12:53 AM, Etienne Chauchot <echauc...@apache.org 
>>>> <mailto:echauc...@apache.org>> wrote:
>>>> 
>>>> Hi Chet,
>>>> 
>>>> What you say is totally true, docs written using ElasticSearchIO will 
>>>> always have an ES generated id. But it might change in the future, indeed 
>>>> it might be a good thing to allow the user to pass an id. Just in 5 
>>>> seconds thinking, I see 3 possible designs for that. 
>>>> a.(simplest) use a json special field for the id, if it is provided by the 
>>>> user in the input json then it is used, auto-generated id otherwise.
>>>> 
>>>> b. (a bit less user friendly) PCollection<KV> with K as an id. But forces 
>>>> the user to do a Pardo before writing to ES to output KV pairs of <id, 
>>>> json>
>>>> 
>>>> c. (a lot more complex) Allow the IO to serialize/deserialize java beans 
>>>> and have an String id field. Matching java types to ES types is quite 
>>>> tricky, so, for now we just relied on the user to serialize his beans into 
>>>> json and let ES match the types automatically.
>>>> 
>>>> Related to the problems you raise bellow:
>>>> 
>>>> 1. Well, the bundle is the commit entity of beam. Consider the case of 
>>>> ESIO.batchSize being < to bundle size. While processing records, when the 
>>>> number of elements reaches batchSize, an ES bulk insert will be issued but 
>>>> no finishBundle. If there is a problem later on in the bundle processing 
>>>> before the finishBundle, the checkpoint will still be at the beginning of 
>>>> the bundle, so all the bundle will be retried leading to duplicate 
>>>> documents. Thanks for raising that! I'm CCing the dev list so that someone 
>>>> could correct me on the checkpointing mecanism if I'm missing something. 
>>>> Besides I'm thinking about forcing the user to provide an id in all cases 
>>>> to workaround this issue.
>>>> 2. Correct.
>>>> 
>>>> Best,
>>>> Etienne
>>>> 
>>>> Le 15/11/2017 à 02:16, Chet Aldrich a écrit :
>>>>> Hello all! 
>>>>> 
>>>>> So I’ve been using the ElasticSearchIO sink for a project (unfortunately 
>>>>> it’s Elasticsearch 5.x, and so I’ve been messing around with the latest 
>>>>> RC) and I’m finding that it doesn’t allow for changing the document ID, 
>>>>> but only lets you pass in a record, which means that the document ID is 
>>>>> auto-generated. See this line for what specifically is happening: 
>>>>> 
>>>>> https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L838
>>>>>  
>>>>> <https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L838>
>>>>>    
>>>>> 
>>>>> Essentially the data part of the document is being placed but it doesn’t 
>>>>> allow for other properties, such as the document ID, to be set. 
>>>>> 
>>>>> This leads to two problems: 
>>>>> 
>>>>> 1. Beam doesn’t necessarily guarantee exactly-once execution for a given 
>>>>> item in a PCollection, as I understand it. This means that you may get 
>>>>> more than one record in Elastic for a given item in a PCollection that 
>>>>> you pass in. 
>>>>> 
>>>>> 2. You can’t do partial updates to an index. If you run a batch job once, 
>>>>> and then run the batch job again on the same index without clearing it, 
>>>>> you just double everything in there. 
>>>>> 
>>>>> Is there any good way around this? 
>>>>> 
>>>>> I’d be happy to try writing up a PR for this in theory, but not sure how 
>>>>> to best approach it. Also would like to figure out a way to get around 
>>>>> this in the meantime, if anyone has any ideas. 
>>>>> 
>>>>> Best, 
>>>>> 
>>>>> Chet
>>>>> 
>>>>> P.S. CCed echauc...@gmail.com <mailto:echauc...@gmail.com> because it 
>>>>> seems like he’s been doing work related to the elastic sink. 
>>>>> 
>>>>> 
>>>> 
>>> 
>> 
>

Re: Does ElasticsearchIO in the latest RC support adding document IDs?

Reply via email to