Hey Tim,

Thanks for your proposal to mentor me through my first PR.
As we’re definitely planning to upgrade to ES6 when Beam supports it, we 
decided to postpone the feature (we have a fix that works for us, for now).
When Beam supports ES6, I’ll be happy to make a contribution to get bulk 
deletes working.

For reference, I opened a ticket 
(https://issues.apache.org/jira/browse/BEAM-5042).

Cheers,
Wout


From: Tim Robertson <timrobertson...@gmail.com>
Reply-To: "user@beam.apache.org" <user@beam.apache.org>
Date: Friday, 27 July 2018 at 17:43
To: "user@beam.apache.org" <user@beam.apache.org>
Subject: Re: ElasticsearchIO bulk delete

Hi Wout,

This is great, thank you. I wrote the partial update support you reference and 
I'll be happy to mentor you through your first PR - welcome aboard. Can you 
please open a Jira to reference this work and we'll assign it to you?

We discussed having the "_xxx" fields in the document and triggering actions 
based on that in the partial update jira but opted to avoid it. Based on that 
discussion the ActionFn would likely be the preferred approach.  Would that be 
possible?

It will be important to provide unit and integration tests as well.

Please be aware that there is a branch and work underway for ES6 already which 
is rather different on the write() path so this may become redundant rather 
quickly.

Thanks,
Tim

@timrobertson100 on the Beam slack channel



On Fri, Jul 27, 2018 at 2:53 PM, Wout Scheepers 
<wout.scheep...@vente-exclusive.com<mailto:wout.scheep...@vente-exclusive.com>> 
wrote:
Hey all,

A while ago, I patched ElasticsearchIO to be able to do partial updates and 
deletes.
However, I did not consider my patch pull-request-worthy as the json parsing 
was done inefficient (parsed it twice per document).

Since Beam 2.5.0 partial updates are supported, so the only thing I’m missing 
is the ability to send bulk delete requests.
We’re using entity updates for event sourcing in our data lake and need to 
persist deleted entities in elastic.
We’ve been using my patch in production for the last year, but I would like to 
contribute to get the functionality we need into one of the next releases.

I’ve created a gist that works for me, but is still inefficient (parsing twice: 
once to check the ‘_action` field, once to get the metadata).
Each document I want to delete needs an additional ‘_action’ field with the 
value ‘delete’. It doesn’t matter the document still contains the redundant 
field, as the delete action only requires the metadata.
I’ve added the method isDelete() and made some changes to the processElement() 
method.
https://gist.github.com/wscheep/26cca4bda0145ffd38faf7efaf2c21b9

I would like to make my solution more generic to fit into the current 
ElasticsearchIO and create a proper pull request.
As this would be my first pull request for beam, can anyone point me in the 
right direction before I spent too much time creating something that will be 
rejected?

Some questions on the top of my mind are:

  *   Is it a good idea it to make the ‘action’ part for the bulk api generic?
  *   Should it be even more generic? (e.g.: set an ‘ActionFn’ on the 
ElasticsearchIO)
  *   If I want to avoid parsing twice, the parsing should be done outside of 
the getDocumentMetaData() method. Would this be acceptable?
  *   Is it possible to avoid passing the action as a field in the document?
  *   Is there another or better way to get the delete functionality in general?

All feedback is more than welcome.

Cheers,
Wout



Reply via email to