Hi all,
First, thank you for your great answers.
I thank Zhong Chen and Etienne Chauchot for their great work on this too!
Alexey and Chamikara, I understand your point of view.
Actually, I share it as much as possible.
But in this case, my goal was to be able to do all the following
things in a Beam pipeline with Elasticsearch:
- be compliant with Elasticsearch 7.x (as the current one is now);
- be able to retrieve errors in order to do some processing on them;
- be able to (atomically) update (with scripts) or delete documents;
- be able to manage Optimistic Concurrency Control;
- be able to manage document versioning;
- be able to test the module easily, even with SSL and so on (I'm
using Testcontainers; is the MIT License compatible with the Apache
one?);
- and even more.
I can assure you that I first tried to implement all these functions
in the current module, but, because it tries to be compliant with
Elasticsearch 2.x-7.x (already a big challenge ;)), the result
would have been like a plate of spaghetti with quite a lot of
"if-then-else" blocks everywhere.
And finally, I really don't think this is the way to go.
I'm playing a lot with another Apache project, Apache Camel,
and I think Apache Camel is the Master of Component Management :)
What they are doing in this case is exactly what I would like to
achieve here.
They keep the old module and implement a new one which supports newer
versions, precisely in order to keep the module maintainable.
Both modules support Elasticsearch 7.x, but Elasticsearch 8 will be
released in a few months with document type removal and more breaking
changes. It will become harder and harder to maintain.
What I would like to propose is:
- as Kenneth said, we can keep both modules (and I think Elasticsearch
deserves it ;));
- current users can keep using the old one and migrate (or not) to
the new one;
- evolutions and migrations should be done on the new one, which
directly uses the Elasticsearch classes;
- only bug fixes are done on the old one.
Apache Camel only switches modules much later on, basically on major
releases (you can check the release notes of the 3.0 release, see
the MongoDB part).
And again, this is a work in progress; I will gladly keep improving
it (more documentation, for instance).
I'm really happy to share this with you and I will take each remark
into account.
Thank you again and have a good weekend,
Ludovic.
On Fri, Jan 24, 2020 at 9:54 PM Kenneth Knowles <k...@apache.org>
wrote:
Would it make sense to have different version-specialized
connectors with a common core library and common API package?
On Fri, Jan 24, 2020 at 11:52 AM Chamikara Jayalath
<chamik...@google.com> wrote:
Thanks for the contribution. I agree with Alexey that we
should try to add any new features brought in with the new PR
into existing connector instead of trying to maintain two
implementations.
Thanks,
Cham
On Fri, Jan 24, 2020 at 9:01 AM Alexey Romanenko
<aromanenko....@gmail.com> wrote:
Hi Ludovic,
Thank you for working on this and sharing the details with
us. This is really a great job!
As I recall, we already have some support
of Elasticsearch 7 in the current ElasticsearchIO (afaik, at
least they are compatible), thanks to Zhong Chen and
Etienne Chauchot, who were working on adding this [1][2],
and it should be released in Beam 2.19.
Do you think you can leverage this in your work on
adding new Elasticsearch 7 features? IMHO, supporting two
different related IOs can be quite a tough task, and I'd
rather raise my hand to add new functionality to the
existing IO than create a new one, if possible.
[1] https://issues.apache.org/jira/browse/BEAM-5192
[2] https://github.com/apache/beam/pull/10433
On 22 Jan 2020, at 19:23, Ludovic Boutros
<boutr...@gmail.com> wrote:
Dear all,
I have written a completely reworked Elasticsearch 7+ IO
module.
It can be found here:
https://github.com/ludovic-boutros/beam/tree/fresh-reworked-elasticsearch-io-v7/sdks/java/io/elasticsearch7
This is quite advanced WIP work, but I'm quite a new user
of Apache Beam and I would like to get some help on this :)
I can create a JIRA issue now, but I prefer to wait for
your wise advice first.
_Why a new module?_
The current module was compliant with Elasticsearch 2.x,
5.x and 6.x. This seems to be a good point, but so many
things have changed since Elasticsearch 2.x.
(This is probably no longer accurate due to
https://github.com/apache/beam/pull/10433?)
Elasticsearch 7.x is now only partially supported (document
types are removed, OCC, updates...).
A fresh new module, compliant only with the latest version
of Elasticsearch, can easily benefit a lot from the latest
evolutions of Elasticsearch (the Java High Level REST Client).
It is therefore far simpler than the current one.
_Error management_
Currently, errors are caught and transformed into simple
exceptions. This is not always what is needed. If we
want to do specific processing on these errors
(send documents to error topics, for instance), it is not
possible with the current module.
Seems like this is some sort of a dead letter queue
implementation. This would be a very good feature to add to
the existing connector.
_Philosophy_
This module directly uses the Elasticsearch Java client
classes as inputs and outputs.
This way you can configure any options you need directly
in the `DocWriteRequest` objects.
For instance:
- If you need to use external versioning
(https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#index-versioning),
you can.
- If you need to use an ingest pipeline, you can.
- If you need to configure an update document/script, you
can.
- If you need to use upserts, you can.
Actually, you should be able to do everything you can do
directly with Elasticsearch.
Furthermore, it should be easier to keep updating the
module with future Elasticsearch evolutions.
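To make one of these request-level options concrete, this is what
external versioning looks like at the Elasticsearch REST level, per
the reference page linked above (the index name, id and version are
illustrative):

```
PUT /my-index/_doc/1?version=42&version_type=external
{
  "user": "ludovic"
}
```

With this module, the same option would simply be set on the
`DocWriteRequest`/`IndexRequest` object before handing it to the
pipeline, instead of being hidden behind the connector's own API.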
_Write outputs_
Two outputs are available:
- the successful indexing output;
- the failed indexing output.
They are available in a `WriteResult` object.
These two outputs are represented by
`PCollection<BulkItemResponseContainer>` objects.
A `BulkItemResponseContainer` contains:
- the original index request;
- the Elasticsearch response;
- a batch id.
You can apply any process afterwards (reprocessing,
alerting, ...).
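To illustrate the two-output idea, here is a minimal, self-contained
Java sketch of the routing logic. The `ItemResult` class and method
names are illustrative stand-ins, not the module's actual
`BulkItemResponseContainer`/`WriteResult` API:

```java
import java.util.ArrayList;
import java.util.List;

public class WriteResultSketch {

    // Illustrative stand-in for a single bulk item outcome (the real
    // container also carries the original request, the Elasticsearch
    // response and a batch id).
    public static class ItemResult {
        public final String docId;
        public final String error; // null when indexing succeeded

        public ItemResult(String docId, String error) {
            this.docId = docId;
            this.error = error;
        }
    }

    // Route bulk outcomes into a success list and a failure list,
    // mirroring the two PCollections exposed by the WriteResult.
    public static List<List<ItemResult>> split(List<ItemResult> results) {
        List<ItemResult> succeeded = new ArrayList<>();
        List<ItemResult> failed = new ArrayList<>();
        for (ItemResult r : results) {
            (r.error == null ? succeeded : failed).add(r);
        }
        return List.of(succeeded, failed);
    }

    public static void main(String[] args) {
        List<ItemResult> results = List.of(
            new ItemResult("doc-1", null),
            new ItemResult("doc-2", "version_conflict_engine_exception"));
        List<List<ItemResult>> split = split(results);
        System.out.println(split.get(0).size() + " succeeded, "
            + split.get(1).size() + " failed");
        // prints: 1 succeeded, 1 failed
    }
}
```

In the real pipeline the failed output would then feed a reprocessing
or alerting transform rather than a plain list.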
_Read input_
You can read documents from Elasticsearch with this module.
You can specify a `QueryBuilder` in order to filter the
retrieved documents.
By default, it retrieves the whole document collection.
If the Elasticsearch index is sharded, multiple slices
can be used during the fetch, and that many bundles are created.
The maximum bundle count is equal to the index shard count.
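The slicing mentioned above corresponds to Elasticsearch's sliced
scroll feature: each slice can be fetched independently, which is
what allows one bundle per slice. A standard sliced scroll request
looks like this (index name and slice count are illustrative):

```
GET /my-index/_search?scroll=1m
{
  "slice": { "id": 0, "max": 2 },
  "query": { "match_all": {} }
}
```

One such request is issued per slice id (0 to max - 1), each
producing an independent stream of documents.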
Thank you!
Ludovic