Hello all! 

So I’ve been using the ElasticsearchIO sink for a project. Unfortunately the 
project is on Elasticsearch 5.x, so I’ve been working with the latest RC. I’m 
finding that the sink doesn’t let you set the document ID: you can only pass in 
the record itself, which means the document ID is always auto-generated. See 
this line for what specifically is happening: 

https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L838

Essentially, only the data part of the document is written; there is no way to 
set other properties, such as the document ID. 

This leads to two problems: 

1. As I understand it, Beam doesn’t necessarily guarantee exactly-once 
execution for a given item in a PCollection. Because every write gets a fresh 
auto-generated ID, you may end up with more than one document in Elasticsearch 
for a single item in the PCollection that you pass in. 

2. You can’t update documents that are already in an index. If you run a batch 
job once, and then run it again on the same index without clearing it, you just 
double everything in there. 

Is there any good way around this? 

I’d be happy to try writing up a PR for this, but I’m not sure how best to 
approach it. I’d also like to find a workaround in the meantime, if anyone has 
any ideas. 
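
To make the idea concrete: the Elasticsearch bulk API accepts an explicit 
`_id` in the action line of each entry, so writing the same ID twice overwrites 
the document instead of adding a second one. Below is a rough sketch of what I 
have in mind, not a patch against ElasticsearchIO. The class `BulkIdSketch` and 
the helpers `deterministicId` and `bulkEntry` are hypothetical names I made up 
for illustration, and hashing the document body is only one assumed way of 
getting a stable ID. 

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class BulkIdSketch {

  // Hypothetical helper: derive a stable ID from the document body, so the
  // same input always maps to the same Elasticsearch document.
  static String deterministicId(String documentJson) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      byte[] hash = digest.digest(documentJson.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder();
      for (byte b : hash) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // SHA-256 is required to be available on every Java platform.
      throw new IllegalStateException(e);
    }
  }

  // Hypothetical helper: build one bulk entry (action line + source line).
  // Assumes the ID needs no JSON escaping, which holds for a hex digest but
  // not for arbitrary strings.
  static String bulkEntry(String id, String documentJson) {
    return "{ \"index\" : { \"_id\" : \"" + id + "\" } }\n" + documentJson + "\n";
  }

  public static void main(String[] args) {
    String doc = "{\"user\":\"chet\",\"count\":1}";
    System.out.print(bulkEntry(deterministicId(doc), doc));
  }
}
```

A content hash would only cover problem 1 and re-runs that produce identical 
documents. For problem 2 the ID would have to come from a key in the record 
itself, so the sink would need some way for the user to supply it. 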

Best, 

Chet

P.S. CCed echauc...@gmail.com because it seems like he’s been doing work 
related to the elastic sink. 

