Hey all,

A while ago, I patched ElasticsearchIO to be able to do partial updates and 
deletes.
However, I did not consider my patch pull-request-worthy as the json parsing 
was done inefficient (parsed it twice per document).

Since Beam 2.5.0 partial updates are supported, so the only thing I’m missing 
is the ability to send bulk delete requests.
We’re using entity updates for event sourcing in our data lake and need to 
persist deleted entities in elastic.
We’ve been using my patch in production for the last year, but I would like to 
contribute to get the functionality we need into one of the next releases.

I’ve created a gist that works for me, but is still inefficient (parsing twice: 
once to check the ‘_action` field, once to get the metadata).
Each document I want to delete needs an additional ‘_action’ field with the 
value ‘delete’. It doesn’t matter the document still contains the redundant 
field, as the delete action only requires the metadata.
I’ve added the method isDelete() and made some changes to the processElement() 
method.
https://gist.github.com/wscheep/26cca4bda0145ffd38faf7efaf2c21b9

I would like to make my solution more generic to fit into the current 
ElasticsearchIO and create a proper pull request.
As this would be my first pull request for beam, can anyone point me in the 
right direction before I spent too much time creating something that will be 
rejected?

Some questions on the top of my mind are:

  *   Is it a good idea it to make the ‘action’ part for the bulk api generic?
  *   Should it be even more generic? (e.g.: set an ‘ActionFn’ on the 
ElasticsearchIO)
  *   If I want to avoid parsing twice, the parsing should be done outside of 
the getDocumentMetaData() method. Would this be acceptable?
  *   Is it possible to avoid passing the action as a field in the document?
  *   Is there another or better way to get the delete functionality in general?

All feedback is more than welcome.

Cheers,
Wout


Reply via email to