Hi all,

Currently the Elasticsearch 8 connector retries all items if the request
fails as a whole, and retries the failed items if the request has partial
failures [1]. I think this infinite retrying can be problematic when
retrying can never succeed. For example, if the response is 400 (Bad
Request) or 404 (Not Found), retries do not help. If too many failed
items are non-retriable, new requests will be processed less effectively.
In extreme cases, it may stall the pipeline if the in-flight requests are
occupied by those failed items.
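
For illustration, retriability could be decided per bulk item from its
HTTP status, roughly like the sketch below (the isRetriable helper and
the exact status sets are my assumption, not the connector's current
code):

    // Sketch: only transient errors should be handed back for retry.
    static boolean isRetriable(int status) {
        switch (status) {
            case 400: // Bad Request: the item is malformed, retrying cannot help
            case 404: // Not Found: e.g. the target index does not exist
            case 409: // Conflict: the item was superseded by a newer version
                return false;
            case 429: // Too Many Requests: back off and retry
            default:  // treat 5xx and unknown statuses as transient
                return true;
        }
    }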

FLIP-451 proposes a timeout for retrying, which helps with
un-acknowledged requests, but it does not address the case where the
request is processed and the failed items keep failing no matter how
many times we retry. Correct me if I'm wrong.

One opinionated option is to fail fast for non-retriable errors like
400 / 404 and to drop items for 409. Alternatively, we could let users
configure a "drop/fail" behavior for non-retriable errors. I prefer the
latter. I checked how Logstash ingests data into Elasticsearch, and it
takes a similar approach for non-retriable errors [2]. In my day job, we
have a dead-letter queue in AsyncSinkWriter for failed entries that
exhaust their retries, but I guess that is too specific to our setup and
seems like overkill for the Elasticsearch connector.
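
If we go with the configurable behavior, here is a rough sketch of the
two policies (all names are hypothetical, purely for illustration):

    // Hypothetical user-facing policy for non-retriable item failures.
    enum NonRetriableErrorPolicy { FAIL, DROP }

    void handleNonRetriableFailure(
            NonRetriableErrorPolicy policy, int status, String reason) {
        switch (policy) {
            case FAIL:
                // Fail fast: surface the error and let the job fail over.
                throw new IllegalStateException(
                        "Non-retriable bulk failure (HTTP " + status + "): " + reason);
            case DROP:
                // Drop the item (and e.g. log it or bump a metric) so that
                // healthy items keep flowing.
                break;
        }
    }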

Any thoughts on this?

[1]
https://github.com/apache/flink-connector-elasticsearch/blob/5d1f8d03e3cff197ed7fe30b79951e44808b48fe/flink-connector-elasticsearch8/src/main/java/org/apache/flink/connector/elasticsearch/sink/Elasticsearch8AsyncWriter.java#L152-L170
[2]
https://github.com/logstash-plugins/logstash-output-elasticsearch/blob/main/lib/logstash/plugin_mixins/elasticsearch/common.rb#L283-L304
