A new reworked Elasticsearch 7+ IO module

Ludovic Boutros Wed, 22 Jan 2020 10:24:40 -0800

Dear all,

I have written a completely reworked Elasticsearch 7+ IO module.
It can be found here:
https://github.com/ludovic-boutros/beam/tree/fresh-reworked-elasticsearch-io-v7/sdks/java/io/elasticsearch7

This is a quite advance WIP work but I'm a quite new user of Apache Beam
and I would like to get some help on this :)

I can create a JIRA issue now but I prefer to wait for your wise avises
first.

*Why a new module ?*

The current module was compliant with Elasticsearch 2.x, 5.x and 6.x. This
seems to be a good point but so many things have been changed since
Elasticsearch 2.x.
Elasticsearch 7.x is now partially supported (document type are removed,
occ, updates...).

A fresh new module, only compliant with the last version of Elasticsearch,
can easily benefit a lot from the last evolutions of Elasticsearch (Java
High Level Http Client).

It is therefore far simpler than the current one.

*Error management*

Currently, errors are caught and transformed into simple exceptions. This
is not always what is needed. If we would like to do specific processing on
these errors (send documents in error topics for instance), it is not
possible with the current module.

*Philosophy*

This module directly uses the Elasticsearch Java client classes as inputs
and outputs.

This way you can configure any options you need directly in the
`DocWriteRequest` objects.

For instance:
- If you need to use external versioning (
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#index-versioning),
you can.
- If you need to use an ingest pipelines, you can.
- If you need to configure an update document/script, you can.
- If you need to use upserts, you can.

Actually, you should be able to do everything you can do directly with
Elasticsearch.

Furthermore, it should be easier to keep updating the module with future
Elasticsearch evolutions.

*Write outputs*

Two outputs are available:
- Successful indexing output ;
- Failed indexing output.

They are available in a `WriteResult` object.

These two outputs are represented by
`PCollection<BulkItemResponseContainer>` objects.

A `BulkItemResponseContainer` contains:
- the original index request ;
- the Elasticsearch response ;
- a batch id.

You can apply any process afterwards (reprocessing, alerting, ...).

*Read input*

You can read documents from Elasticsearch with this module.
You can specify a `QueryBuilder` in order to filter the retrieved documents.
By default, it retrieves the whole document collection.

If the Elasticsearch index is sharded, multiple slices can be used during
fetch. That many bundles are created. The maximum bundle count is equal to
the index shard count.

Thank you !

Ludovic

A new reworked Elasticsearch 7+ IO module

Reply via email to