Re: A new reworked Elasticsearch 7+ IO module

Etienne Chauchot Wed, 05 Feb 2020 06:35:15 -0800

Still, there is one thing I don't agree with: that IOs can be tested on mocks. We don't really test IO behavior with mocks: there are always special behaviors that cannot be reproduced in mocks (split, load, with corner cases, etc.). In the past there were IOs that were tested using mocks and that turned out to be nonfunctional.

Regarding ITests, we have very few compared to UTests, and they are not as closely observed as UTests.


Etienne

On 05/02/2020 11:32, Jean-Baptiste Onofre wrote:

Hi,

We talked in the past about multiple/single module.
IMHO the preferred goal is always to have a single module. However, it’s tricky when we have such differences, including in the user-facing API. So, I would go with a module per version, or use a specified version for a target Beam release.
For the tests, we should distinguish utests from itests. Utests can be done with mocks; the purpose is really to test the IO behavior. Then we can have itests using a concrete ES instance.
Anyway, I’m OK with the proposal and I would like to work on this IO (I have other improvements coming on other IOs anyway) with you guys (and Ludovic especially).
Regards
JB
On 5 Feb 2020, at 10:44, Etienne Chauchot <[email protected]<mailto:[email protected]>> wrote:
Hi all,
We had a long discussion with Ludovic about this IO. I'd like to share it with you to keep you informed and also gather your opinions:
1. regarding version support: ES v2 has not been maintained by Elastic since 2018/02, so we plan to remove it from the IO. In the past we already retired versions (like Spark 1.6 for instance).
2. regarding the user: the aim is to unlock some new features (listed by Ludovic) and give users more flexibility on their requests. That requires using the high level Java ES client in place of the low level REST client (which was used because it is the only one compatible with all ES versions). We plan to replace the API (JSON document in and out) with more complete standard ES objects that contain the request logic (insert/update, doc routing, etc.) and the data. There are already IOs like SpannerIO that use similar objects in the input PCollection rather than pure POJOs.
3. regarding multiple/single module: the aim is to have only one production code base to ease maintenance. The problem is that using the high level client makes the code dependent on an ES lib version. We would like to make this invisible to users: they should select only one jar and the IO should decide which lib to use behind the scenes. We are thinking about using one module with sub-modules per version, and using relocation, wrappers and a factory that detects the version the IO actually points to in order to instantiate the correct client version. It would also require having DTOs in the IO because the high level ES Java objects are not exactly the same across ES versions.
4. regarding tests: the aim is always to target real ES backends to have relevant tests (for reasons I already explained in another thread). The problem is that the es-test-framework used today is version dependent and is a pain to use. We plan on using test containers per version (validated by an ES dev advocate) and launching them as part of the UTests. Obviously we will launch only one container at a time per version and run all the tests against it to avoid paying the launch cost too often. And the tests will be shipped in per-version sub-modules and not in dedicated test modules like now.
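To make point 3 a bit more concrete, here is a minimal sketch of what the version-detecting factory could look like. Everything in it is an assumption made for illustration: `EsClientWrapper` and `EsClientFactory` are hypothetical names, the wrappers are empty placeholders rather than real relocated clients, and the version string is assumed to have the `major.minor.patch` form that a cluster reports in its `version.number` field.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical wrapper hiding the version-specific ES client classes from the IO. */
interface EsClientWrapper {
  int majorVersion();
}

/** Hypothetical factory: picks a wrapper from the version reported by the cluster. */
final class EsClientFactory {
  private static final Map<Integer, Supplier<EsClientWrapper>> WRAPPERS = new HashMap<>();

  static {
    // Placeholder wrappers. In the proposal, each entry would come from its own
    // per-version sub-module and delegate to the relocated client for that version.
    WRAPPERS.put(5, () -> () -> 5);
    WRAPPERS.put(6, () -> () -> 6);
    WRAPPERS.put(7, () -> () -> 7);
  }

  private EsClientFactory() {}

  /** Extracts the major version from a string such as "7.5.2". */
  static int parseMajor(String versionNumber) {
    int dot = versionNumber.indexOf('.');
    String major = dot < 0 ? versionNumber : versionNumber.substring(0, dot);
    return Integer.parseInt(major.trim());
  }

  /** Returns the wrapper for the detected version, or fails for unsupported ones. */
  static EsClientWrapper forVersion(String versionNumber) {
    Supplier<EsClientWrapper> supplier = WRAPPERS.get(parseMajor(versionNumber));
    if (supplier == null) {
      throw new IllegalArgumentException(
          "Unsupported Elasticsearch version: " + versionNumber);
    }
    return supplier.get();
  }
}
```

With this shape, `EsClientFactory.forVersion("7.5.2")` returns the 7.x wrapper, while a 2.x version string is rejected, in line with point 1.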
WDYT ?

Best !

Etienne

On 30/01/2020 17:55, Alexey Romanenko wrote:
I second this question. We have a similar (maybe a bit less painful) issue for KafkaIO and it would be useful to have a general strategy for how to deal with such cases.
On 24 Jan 2020, at 21:54, Kenneth Knowles <[email protected]<mailto:[email protected]>> wrote:
Would it make sense to have different version-specialized connectors with a common core library and common API package?
On Fri, Jan 24, 2020 at 11:52 AM Chamikara Jayalath <[email protected] <mailto:[email protected]>> wrote:
    Thanks for the contribution. I agree with Alexey that we should
    try to add any new features brought in with the new PR into the
    existing connector instead of trying to maintain two
    implementations.

    Thanks,
    Cham

    On Fri, Jan 24, 2020 at 9:01 AM Alexey Romanenko
    <[email protected] <mailto:[email protected]>> wrote:

        Hi Ludovic,

        Thank you for working on this and sharing the details with
        us. This is a really great job!

        As I recall, we already have some support for Elasticsearch 7
        in the current ElasticsearchIO (afaik, at least they are
        compatible), thanks to Zhong Chen and Etienne Chauchot, who
        were working on adding this [1][2], and it should be
        released in Beam 2.19.

        Do you think you could leverage this in your work on
        adding new Elasticsearch 7 features? IMHO, supporting two
        different related IOs can be quite a tough task and I'd
        rather raise my hand to add new functionality to the
        existing IO than create a new one, if it’s possible.

        [1] https://issues.apache.org/jira/browse/BEAM-5192
        [2] https://github.com/apache/beam/pull/10433
        On 22 Jan 2020, at 19:23, Ludovic Boutros
        <[email protected] <mailto:[email protected]>> wrote:

        Dear all,

        I have written a completely reworked Elasticsearch 7+ IO
        module.
        It can be found here:
        
https://github.com/ludovic-boutros/beam/tree/fresh-reworked-elasticsearch-io-v7/sdks/java/io/elasticsearch7

        This is fairly advanced WIP, but I'm quite a new user
        of Apache Beam and I would like to get some help on this :)

        I can create a JIRA issue now but I prefer to wait for
        your wise advice first.

        _Why a new module ?_

        The current module was compliant with Elasticsearch 2.x,
        5.x and 6.x. This seems to be a good point, but so many
        things have changed since Elasticsearch 2.x.
    Probably this is not correct anymore due to
    https://github.com/apache/beam/pull/10433 ?
        Elasticsearch 7.x is now partially supported (document
        types are removed, OCC, updates...).

        A fresh new module, only compliant with the latest version
        of Elasticsearch, can easily benefit a lot from the latest
        evolutions of Elasticsearch (Java High Level REST Client).

        It is therefore far simpler than the current one.

        _Error management_

        Currently, errors are caught and transformed into simple
        exceptions. This is not always what is needed. If we would
        like to do specific processing on these errors (send
        failed documents to error topics for instance), it is not
        possible with the current module.
    Seems like this is some sort of a dead letter queue
    implementation. This will be a very good feature to add to the
    existing connector.
        _Philosophy_

        This module directly uses the Elasticsearch Java client
        classes as inputs and outputs.

        This way you can configure any options you need directly
        in the `DocWriteRequest` objects.

        For instance:
        - If you need to use external versioning
        
(https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#index-versioning),
        you can.
        - If you need to use an ingest pipeline, you can.
        - If you need to configure an update document/script, you can.
        - If you need to use upserts, you can.

        Actually, you should be able to do everything you can do
        directly with Elasticsearch.

        Furthermore, it should be easier to keep updating the
        module with future Elasticsearch evolutions.

        _Write outputs_

        Two outputs are available:
        - Successful indexing output;
        - Failed indexing output.

        They are available in a `WriteResult` object.

        These two outputs are represented by
        `PCollection<BulkItemResponseContainer>` objects.

        A `BulkItemResponseContainer` contains:
        - the original index request;
        - the Elasticsearch response;
        - a batch id.

        You can apply any process afterwards (reprocessing,
        alerting, ...).
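
        As a rough illustration of the two outputs described above,
        here is a minimal sketch in plain Java. It is not the module's
        code: `ItemResult` is a simplified stand-in for
        `BulkItemResponseContainer` (strings replace the Elasticsearch
        request and response classes), and `SplitResult` is a
        hypothetical holder playing the role of `WriteResult` with
        in-memory lists instead of `PCollection`s.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for the container: strings replace the ES classes. */
final class ItemResult {
  final String request;  // the original index request
  final String response; // the Elasticsearch response for that item
  final boolean failed;  // whether Elasticsearch reported a failure for that item
  final String batchId;  // id of the bulk batch the item was sent in

  ItemResult(String request, String response, boolean failed, String batchId) {
    this.request = request;
    this.response = response;
    this.failed = failed;
    this.batchId = batchId;
  }
}

/** Hypothetical holder for the two outputs (lists here, PCollections in the IO). */
final class SplitResult {
  final List<ItemResult> successful = new ArrayList<>();
  final List<ItemResult> failed = new ArrayList<>();

  /** Routes each bulk item to the successful or the failed output. */
  static SplitResult of(List<ItemResult> items) {
    SplitResult result = new SplitResult();
    for (ItemResult item : items) {
      (item.failed ? result.failed : result.successful).add(item);
    }
    return result;
  }
}
```

        Because each failed item still carries its original request
        and batch id, a downstream step could resend it or publish it
        to an error topic.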

        _Read input_

        You can read documents from Elasticsearch with this module.
        You can specify a `QueryBuilder` in order to filter the
        retrieved documents.
        By default, it retrieves the whole document collection.

        If the Elasticsearch index is sharded, multiple slices can
        be used during fetch. One bundle is created per slice, so the
        maximum bundle count is equal to the index shard count.
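
        The relation between slices, bundles and shards can be
        sketched as follows. This is only an assumption about how
        the slice count might be derived: `SlicePlanner` and its
        `desiredBundles` parameter are hypothetical, and each slice
        is represented as an `{id, max}` pair, the two values an
        Elasticsearch sliced scroll request takes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: plans one slice per bundle, capped by the shard count. */
final class SlicePlanner {
  private SlicePlanner() {}

  /** Returns {id, max} pairs, with max = min(desiredBundles, shardCount), at least 1. */
  static List<int[]> plan(int desiredBundles, int shardCount) {
    int max = Math.max(1, Math.min(desiredBundles, shardCount));
    List<int[]> slices = new ArrayList<>();
    for (int id = 0; id < max; id++) {
      slices.add(new int[] {id, max});
    }
    return slices;
  }
}
```

        For an index with 3 shards, asking for 8 bundles yields only
        3 slices; asking for 2 bundles on a 5-shard index yields 2.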

        Thank you !

        Ludovic
