Re: A new reworked Elasticsearch 7+ IO module

Kenneth Knowles Fri, 24 Jan 2020 12:54:36 -0800

Would it make sense to have different version-specialized connectors with a
common core library and common API package?


On Fri, Jan 24, 2020 at 11:52 AM Chamikara Jayalath <[email protected]>
wrote:

> Thanks for the contribution. I agree with Alexey that we should try to add
> any new features brought in with the new PR into the existing connector instead
> of trying to maintain two implementations.
>
> Thanks,
> Cham
>
> On Fri, Jan 24, 2020 at 9:01 AM Alexey Romanenko <[email protected]>
> wrote:
>
>> Hi Ludovic,
>>
>> Thank you for working on this and sharing the details with us. This is
>> really great work!
>>
>> As I recall, we already have some support for Elasticsearch 7 in the current
>> ElasticsearchIO (afaik, at least they are compatible), thanks to Zhong Chen
>> and Etienne Chauchot, who worked on adding this [1][2], and it should
>> be released in Beam 2.19.
>>
>> Do you think you can leverage this in your work on adding new
>> Elasticsearch 7 features? IMHO, supporting two different related IOs can be
>> quite a tough task, and I'd rather vote for adding new functionality
>> to the existing IO than creating a new one, if that's possible.
>>
>> [1] https://issues.apache.org/jira/browse/BEAM-5192
>> [2] https://github.com/apache/beam/pull/10433
>>
>> On 22 Jan 2020, at 19:23, Ludovic Boutros <[email protected]> wrote:
>>
>> Dear all,
>>
>> I have written a completely reworked Elasticsearch 7+ IO module.
>> It can be found here:
>> https://github.com/ludovic-boutros/beam/tree/fresh-reworked-elasticsearch-io-v7/sdks/java/io/elasticsearch7
>>
>> This is fairly advanced WIP, but I'm quite new to Apache Beam
>> and I would like to get some help with it :)
>>
>> I can create a JIRA issue now, but I would prefer to wait for your advice
>> first.
>>
>> *Why a new module?*
>>
>> The current module was compliant with Elasticsearch 2.x, 5.x and 6.x.
>> This seems like a good thing, but so many things have changed since
>> Elasticsearch 2.x.
>>
>>
> Probably this is not correct anymore due to
> https://github.com/apache/beam/pull/10433 ?
>
>
>> Elasticsearch 7.x is now partially supported (document types are removed,
>> OCC, updates...).
>>
>> A fresh new module, compliant only with the latest version of
>> Elasticsearch, can easily benefit from the latest developments in
>> Elasticsearch (Java High Level REST Client).
>>
>> It is therefore far simpler than the current one.
>>
>> *Error management*
>>
>> Currently, errors are caught and transformed into simple exceptions. This
>> is not always what is needed. If we would like to do specific processing on
>> these errors (sending failed documents to error topics, for instance), it is
>> not possible with the current module.
>>
>>
> Seems like this is some sort of dead letter queue implementation. This
> would be a very good feature to add to the existing connector.
>
>
>>
>> *Philosophy*
>>
>> This module directly uses the Elasticsearch Java client classes as inputs
>> and outputs.
>>
>> This way you can configure any options you need directly in the
>> `DocWriteRequest` objects.
>>
>> For instance:
>> - If you need to use external versioning (
>> https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#index-versioning),
>> you can.
>> - If you need to use an ingest pipeline, you can.
>> - If you need to configure an update document/script, you can.
>> - If you need to use upserts, you can.
>>
>> Actually, you should be able to do everything you can do directly with
>> Elasticsearch.
>>
>> Furthermore, it should be easier to keep the module up to date with future
>> Elasticsearch changes.
>>
>> *Write outputs*
>>
>> Two outputs are available:
>> - Successful indexing output;
>> - Failed indexing output.
>>
>> They are available in a `WriteResult` object.
>>
>> These two outputs are represented by
>> `PCollection<BulkItemResponseContainer>` objects.
>>
>> A `BulkItemResponseContainer` contains:
>> - the original index request;
>> - the Elasticsearch response;
>> - a batch id.
>>
>> You can apply any process afterwards (reprocessing, alerting, ...).
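
A minimal sketch of the two-output idea described above, using plain Java stand-ins rather than Beam or Elasticsearch classes. The names `ResponseContainer`, `SplitResult` and `WriteOutputs`, and all field names, are hypothetical; the module's real `BulkItemResponseContainer` and `WriteResult` are not shown in this thread and may look different.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for BulkItemResponseContainer (field names are assumptions). */
final class ResponseContainer {
    final String originalRequest; // the original index request, here just a JSON string
    final String response;        // the Elasticsearch response for this item
    final boolean failed;         // true if the response reports a failure
    final String batchId;         // id of the bulk batch this item was sent in

    ResponseContainer(String originalRequest, String response, boolean failed, String batchId) {
        this.originalRequest = originalRequest;
        this.response = response;
        this.failed = failed;
        this.batchId = batchId;
    }
}

/** Hypothetical stand-in for WriteResult: one collection per output. */
final class SplitResult {
    final List<ResponseContainer> successful = new ArrayList<>();
    final List<ResponseContainer> failed = new ArrayList<>();
}

final class WriteOutputs {
    /** Routes each bulk item to the successful or the failed output. */
    static SplitResult split(List<ResponseContainer> items) {
        SplitResult result = new SplitResult();
        for (ResponseContainer item : items) {
            if (item.failed) {
                result.failed.add(item);
            } else {
                result.successful.add(item);
            }
        }
        return result;
    }
}
```

Because each failed item still carries its original request and batch id, a downstream step could resend it or publish it to an error topic, which is the dead-letter pattern mentioned earlier in the thread.
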
>>
>> *Read input*
>>
>> You can read documents from Elasticsearch with this module.
>> You can specify a `QueryBuilder` in order to filter the retrieved
>> documents.
>> By default, it retrieves the whole document collection.
>>
>> If the Elasticsearch index is sharded, multiple slices can be used during
>> the fetch, and one bundle is created per slice. The maximum bundle count is
>> equal to the index shard count.
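
The slice and bundle relationship described above can be sketched in plain Java as follows. This is an illustration only: `SliceSpec` and `SlicePlanner` are hypothetical names, and the `id`/`max` pair simply mirrors the parameters of an Elasticsearch sliced scroll; the module's actual splitting code is not shown in this thread.

```java
import java.util.ArrayList;
import java.util.List;

/** One slice of a sliced read: slice number `id` out of `max` slices in total. */
final class SliceSpec {
    final int id;
    final int max;

    SliceSpec(int id, int max) {
        this.id = id;
        this.max = max;
    }
}

final class SlicePlanner {
    /**
     * Caps the requested slice count at the index shard count and returns
     * one SliceSpec per bundle.
     */
    static List<SliceSpec> plan(int requestedSlices, int shardCount) {
        if (requestedSlices < 1 || shardCount < 1) {
            throw new IllegalArgumentException("slice and shard counts must be >= 1");
        }
        int max = Math.min(requestedSlices, shardCount);
        List<SliceSpec> slices = new ArrayList<>();
        for (int id = 0; id < max; id++) {
            slices.add(new SliceSpec(id, max));
        }
        return slices;
    }
}
```

For example, asking for 8 slices on a 5-shard index would give 5 bundles with slice ids 0 to 4. When the plan contains a single slice, a real reader would presumably issue an unsliced query instead.
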
>>
>> Thank you!
>>
>> Ludovic
>>
>>
>>
