Re: A new reworked Elasticsearch 7+ IO module

Jean-Baptiste Onofre Thu, 06 Feb 2020 02:55:13 -0800

Hi,

Let’s sync together about this IO.


Regarding mock and IOs, and Etienne’s comment, there are two things:

1. Of course, it’s always preferable to use concrete backend, but several times 
it’s not possible. It’s there mock is required.
2. The mock can be smart enough to cover core IO behavior

So, I still think that mock is interesting as core first level test.

In the case of ES, we can easily bootstrap instance, but I think we can move 
forward without a mock.

Regards
JB

> Le 6 févr. 2020 à 09:47, Ludovic Boutros <boutr...@gmail.com> a écrit :
> 
> Hi all,
> 
> First, thank you all for your answers and especially, Etienne for your time, 
> advises and kindness :)
> @Jean-Baptiste, any help on this module is welcome of course.
> 
> @Chamikara Jayalath, my aswers are inline.
> 
> Have a good day !
> 
> Ludovic
> 
> Le mer. 5 févr. 2020 à 20:15, Chamikara Jayalath <chamik...@google.com 
> <mailto:chamik...@google.com>> a écrit :
> 
> 
> On Wed, Feb 5, 2020 at 6:35 AM Etienne Chauchot <echauc...@apache.org 
> <mailto:echauc...@apache.org>> wrote:
> Still there is something I don't agree with is that IOs can be tested on 
> mock. We don't really test IO behavior with mocks: there is always special 
> behaviors that cannot be reproduced in mocks (split, load, with corner cases 
> etc...). There was in the past IOs that were tested using mocks and that 
> happened to be nonfunctional.
> 
> Regarding ITests we have very few comparing to UTests and they are not as 
> closely observed as UTests.
> 
> Etienne
> 
> On 05/02/2020 11:32, Jean-Baptiste Onofre wrote:
>> Hi,
>> 
>> We talked in the past about multiple/single module.
>> 
>> IMHO the always preferred goal is to have a single module. However, it’s 
>> tricky when we have such difference, including on the user facing API. So, I 
>> would go with module per version, or use a specified version for a target 
>> Beam release.
>> 
>> For the test, we should distinguish utest from itest. Utest can be done with 
>> mock, the purpose is really to test the IO behavior. Then we can have itest 
>> using concrete ES instance.
>> 
>> Anyway, I’m OK with the proposal and I would like to work on this IO (I have 
>> other improvements coming on other IOs anyway) with you guys (and Ludovic 
>> especially).
>> 
>> Regards
>> JB
>> 
>>> Le 5 févr. 2020 à 10:44, Etienne Chauchot <echauc...@apache.org 
>>> <mailto:echauc...@apache.org>> a écrit :
>>> 
>>> Hi all, 
>>> 
>>> We had a long discussion with Ludovic about this IO. I'd like to share with 
>>> you to keep you informed and also gather your opinions
>>> 
>>> 1. regarding version support: ES v2 is no more maintained by Elastic since 
>>> 2018/02 so we plan to remove it from the IO. In the past we already retired 
>>> versions (like spark 1.6 for instance).
>>> 
> 
> 
> My only concern here is that there might be users who use the existing module 
> who might not be able to easily upgrade the Beam version if we remove it. But 
> given that V2 is 5 versions behind the latest release this might be OK.
> 
> It seems we have a consensus on this.
> I think there should be another general discussion on the long term support 
> of our prefered tool IO modules.
>  
>  
>>> 
>>> 2. regarding the user: the aim is to unlock some new features (listed by 
>>> Ludovic) and give the user more flexibility on his request. For that, it 
>>> requires to use high level java ES client in place of the low level REST 
>>> client (that was used because it is the only one compatible with all ES 
>>> versions). We plan to replace the API (json document in and out) by more 
>>> complete standard ES objects that contain de request logic (insert/update, 
>>> doc routing etc...) and the data. There are already IOs like SpannerIO that 
>>> use similar objects in input PCollection rather than pure POJOs. 
>>> 
> 
> 
> Won't this be a breaking change for all users ? IMO using POJOs in 
> PCollections is safer since we have to worry about changes to the underlying 
> client library API. Exception would be when underlying client library offers 
> a backwards compatibility guarantee that we can rely on for the foreseeable 
> future (for example, BQ TableRow).
> 
> Agreed but actually, there will be POJOs in order to abstract Elasticsearch's 
> version support. The following third point explains this.
>  
>  
>>> 
>>> 3. regarding multiple/single module: the aim is to have only one production 
>>> code to ease the maintenance.  The problem is that using high level client 
>>> makes the code dependent to an ES lib version. We would like to make it 
>>> invisible to the user. He should select only one jar and the IO should 
>>> decide the lib to use behind the scene. We are thinking about using one 
>>> module and sub-modules per version and use relocation, wrappers and a 
>>> factory that detects the version the IO actually points to to instantiate 
>>> the correct client version. It would also require to have DTOs in the IO 
>>> because the high level ES java objects are not exactly the same among the 
>>> ES versions.
>>> 
> 
> +1 for adding a level of indirection to make this easy for users. 
>>> 4. regarding tests: the aim is always to target real ES backends to have 
>>> relevant tests (for reasons I already explained in another thread). The 
>>> problem is that es-test-framework used today is version dependent and is a 
>>> pain to use. We plan on using test containers per version (validated by ES 
>>> dev advocate) and launching them as part of the UTests. Obviously we will 
>>> launch only one container at the time per version and do all the test with 
>>> it to avoid paying the cost of launch too much. And the tests will be 
>>> shipped in per-version sub-modules and not in test dedicated modules like 
>>> it is now.
>>> 
> 
> 
> Using a real ES backend for unit tests can be expensive ? Ideally we should 
> use a Fake (if one available) or mocking (test test out functionality) and 
> use real backend for IT tests that can be expensive. If this is a local 
> container that can be shared between unit tests with reasonable efficiency 
> that is OK. I'm mainly worried about introducing flakes into unit tests due 
> to network errors or slowness.
> 
> On this point I understand it and I agree as well, but I think Etienne is 
> right, IO modules should be exceptions to this rule.
> It is really difficult to really test an IO without a real backend and Docker 
> helps a lot here.
> But I take the point, these tests must be as small as posssible (one per 
> Elasticsearch version).
> 
> 
>  
> 
> Thanks,
> Cham
>  
>>> WDYT ?
>>> 
>>> Best !
>>> 
>>> Etienne
>>> 
>>> On 30/01/2020 17:55, Alexey Romanenko wrote:
>>>> I’m second for this question. We have a similar (maybe a bit less painful) 
>>>> issue for KafkaIO and it would be useful to have a general strategy for 
>>>> such cases about how to deal with that.
>>>> 
>>>>> On 24 Jan 2020, at 21:54, Kenneth Knowles <k...@apache.org 
>>>>> <mailto:k...@apache.org>> wrote:
>>>>> 
>>>>> Would it make sense to have different version-specialized connectors with 
>>>>> a common core library and common API package?
>>>>> 
>>>>> On Fri, Jan 24, 2020 at 11:52 AM Chamikara Jayalath <chamik...@google.com 
>>>>> <mailto:chamik...@google.com>> wrote:
>>>>> Thanks for the contribution. I agree with Alexey that we should try to 
>>>>> add any new features brought in with the new PR into existing connector 
>>>>> instead of trying to maintain two implementations.  
>>>>> 
>>>>> Thanks,
>>>>> Cham
>>>>> 
>>>>> On Fri, Jan 24, 2020 at 9:01 AM Alexey Romanenko 
>>>>> <aromanenko....@gmail.com <mailto:aromanenko....@gmail.com>> wrote:
>>>>> Hi Ludovic,
>>>>> 
>>>>> Thank you for working on this and sharing the details with us. This is 
>>>>> really great job!
>>>>> 
>>>>> As I recall, we already have some support of Elasticsearch7 in current 
>>>>> ElasticsearchIO (afaik, at least they are compatible), thanks to Zhong 
>>>>> Chen and Etienne Chauchot, who were working on adding this [1][2] and it 
>>>>> should be released in Beam 2.19.
>>>>> 
>>>>> Would you think you can leverage this in your work on adding new 
>>>>> Elasticsearch7 features? IMHO, supporting two different related IOs can 
>>>>> be quite tough task and I‘d rather raise my hand to add a new 
>>>>> functionality into existing IO than creating a new one, if it’s possible.
>>>>> 
>>>>> [1] https://issues.apache.org/jira/browse/BEAM-5192 
>>>>> <https://issues.apache.org/jira/browse/BEAM-5192>
>>>>> [2] https://github.com/apache/beam/pull/10433 
>>>>> <https://github.com/apache/beam/pull/10433>
>>>>> 
>>>>>> On 22 Jan 2020, at 19:23, Ludovic Boutros <boutr...@gmail.com 
>>>>>> <mailto:boutr...@gmail.com>> wrote:
>>>>>> 
>>>>>> Dear all,
>>>>>> 
>>>>>> I have written a completely reworked Elasticsearch 7+ IO module.
>>>>>> It can be found here: 
>>>>>> https://github.com/ludovic-boutros/beam/tree/fresh-reworked-elasticsearch-io-v7/sdks/java/io/elasticsearch7
>>>>>>  
>>>>>> <https://github.com/ludovic-boutros/beam/tree/fresh-reworked-elasticsearch-io-v7/sdks/java/io/elasticsearch7>
>>>>>> 
>>>>>> This is a quite advance WIP work but I'm a quite new user of Apache Beam 
>>>>>> and I would like to get some help on this :)
>>>>>> 
>>>>>> I can create a JIRA issue now but I prefer to wait for your wise avises 
>>>>>> first.
>>>>>> 
>>>>>> Why a new module ?
>>>>>> 
>>>>>> The current module was compliant with Elasticsearch 2.x, 5.x and 6.x. 
>>>>>> This seems to be a good point but so many things have been changed since 
>>>>>> Elasticsearch 2.x.
>>>>> 
>>>>> 
>>>>> Probably this is not correct anymore due to 
>>>>> https://github.com/apache/beam/pull/10433 
>>>>> <https://github.com/apache/beam/pull/10433> ?
>>>>>  
>>>>>> Elasticsearch 7.x is now partially supported (document type are removed, 
>>>>>> occ, updates...).
>>>>>> 
>>>>>> A fresh new module, only compliant with the last version of 
>>>>>> Elasticsearch, can easily benefit a lot from the last evolutions of 
>>>>>> Elasticsearch (Java High Level Http Client).
>>>>>> 
>>>>>> It is therefore far simpler than the current one.
>>>>>> 
>>>>>> Error management
>>>>>> 
>>>>>> Currently, errors are caught and transformed into simple exceptions. 
>>>>>> This is not always what is needed. If we would like to do specific 
>>>>>> processing on these errors (send documents in error topics for 
>>>>>> instance), it is not possible with the current module.
>>>>> 
>>>>> 
>>>>> Seems like this is some sort of a dead letter queue implementation.. This 
>>>>> will be a very good feature to add to the existing connector.
>>>>>  
>>>>>> 
>>>>>> Philosophy
>>>>>> 
>>>>>> This module directly uses the Elasticsearch Java client classes as 
>>>>>> inputs and outputs. 
>>>>>> 
>>>>>> This way you can configure any options you need directly in the 
>>>>>> `DocWriteRequest` objects.
>>>>>> 
>>>>>> For instance: 
>>>>>> - If you need to use external versioning 
>>>>>> (https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#index-versioning
>>>>>>  
>>>>>> <https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#index-versioning>),
>>>>>>  you can.
>>>>>> - If you need to use an ingest pipelines, you can.
>>>>>> - If you need to configure an update document/script, you can.
>>>>>> - If you need to use upserts, you can.
>>>>>> 
>>>>>> Actually, you should be able to do everything you can do directly with 
>>>>>> Elasticsearch.
>>>>>> 
>>>>>> Furthermore, it should be easier to keep updating the module with future 
>>>>>> Elasticsearch evolutions.
>>>>>> 
>>>>>> Write outputs
>>>>>> 
>>>>>> Two outputs are available:
>>>>>> - Successful indexing output ;
>>>>>> - Failed indexing output.
>>>>>> 
>>>>>> They are available in a `WriteResult` object.
>>>>>> 
>>>>>> These two outputs are represented by 
>>>>>> `PCollection<BulkItemResponseContainer>` objects.
>>>>>> 
>>>>>> A `BulkItemResponseContainer` contains:
>>>>>> - the original index request ;
>>>>>> - the Elasticsearch response ;
>>>>>> - a batch id.
>>>>>> 
>>>>>> You can apply any process afterwards (reprocessing, alerting, ...).
>>>>>> 
>>>>>> Read input
>>>>>> 
>>>>>> You can read documents from Elasticsearch with this module.
>>>>>> You can specify a `QueryBuilder` in order to filter the retrieved 
>>>>>> documents.
>>>>>> By default, it retrieves the whole document collection.
>>>>>> 
>>>>>> If the Elasticsearch index is sharded, multiple slices can be used 
>>>>>> during fetch. That many bundles are created. The maximum bundle count is 
>>>>>> equal to the index shard count.
>>>>>> 
>>>>>> Thank you !
>>>>>> 
>>>>>> Ludovic
>>>>> 
>>>> 
>>

Re: A new reworked Elasticsearch 7+ IO module

Reply via email to