Hi, Let’s sync together about this IO.
Regarding mock and IOs, and Etienne’s comment, there are two things: 1. Of course, it’s always preferable to use concrete backend, but several times it’s not possible. It’s there mock is required. 2. The mock can be smart enough to cover core IO behavior So, I still think that mock is interesting as core first level test. In the case of ES, we can easily bootstrap instance, but I think we can move forward without a mock. Regards JB > Le 6 févr. 2020 à 09:47, Ludovic Boutros <boutr...@gmail.com> a écrit : > > Hi all, > > First, thank you all for your answers and especially, Etienne for your time, > advises and kindness :) > @Jean-Baptiste, any help on this module is welcome of course. > > @Chamikara Jayalath, my aswers are inline. > > Have a good day ! > > Ludovic > > Le mer. 5 févr. 2020 à 20:15, Chamikara Jayalath <chamik...@google.com > <mailto:chamik...@google.com>> a écrit : > > > On Wed, Feb 5, 2020 at 6:35 AM Etienne Chauchot <echauc...@apache.org > <mailto:echauc...@apache.org>> wrote: > Still there is something I don't agree with is that IOs can be tested on > mock. We don't really test IO behavior with mocks: there is always special > behaviors that cannot be reproduced in mocks (split, load, with corner cases > etc...). There was in the past IOs that were tested using mocks and that > happened to be nonfunctional. > > Regarding ITests we have very few comparing to UTests and they are not as > closely observed as UTests. > > Etienne > > On 05/02/2020 11:32, Jean-Baptiste Onofre wrote: >> Hi, >> >> We talked in the past about multiple/single module. >> >> IMHO the always preferred goal is to have a single module. However, it’s >> tricky when we have such difference, including on the user facing API. So, I >> would go with module per version, or use a specified version for a target >> Beam release. >> >> For the test, we should distinguish utest from itest. Utest can be done with >> mock, the purpose is really to test the IO behavior. Then we can have itest >> using concrete ES instance. >> >> Anyway, I’m OK with the proposal and I would like to work on this IO (I have >> other improvements coming on other IOs anyway) with you guys (and Ludovic >> especially). >> >> Regards >> JB >> >>> Le 5 févr. 2020 à 10:44, Etienne Chauchot <echauc...@apache.org >>> <mailto:echauc...@apache.org>> a écrit : >>> >>> Hi all, >>> >>> We had a long discussion with Ludovic about this IO. I'd like to share with >>> you to keep you informed and also gather your opinions >>> >>> 1. regarding version support: ES v2 is no more maintained by Elastic since >>> 2018/02 so we plan to remove it from the IO. In the past we already retired >>> versions (like spark 1.6 for instance). >>> > > > My only concern here is that there might be users who use the existing module > who might not be able to easily upgrade the Beam version if we remove it. But > given that V2 is 5 versions behind the latest release this might be OK. > > It seems we have a consensus on this. > I think there should be another general discussion on the long term support > of our prefered tool IO modules. > > >>> >>> 2. regarding the user: the aim is to unlock some new features (listed by >>> Ludovic) and give the user more flexibility on his request. For that, it >>> requires to use high level java ES client in place of the low level REST >>> client (that was used because it is the only one compatible with all ES >>> versions). We plan to replace the API (json document in and out) by more >>> complete standard ES objects that contain de request logic (insert/update, >>> doc routing etc...) and the data. There are already IOs like SpannerIO that >>> use similar objects in input PCollection rather than pure POJOs. >>> > > > Won't this be a breaking change for all users ? IMO using POJOs in > PCollections is safer since we have to worry about changes to the underlying > client library API. Exception would be when underlying client library offers > a backwards compatibility guarantee that we can rely on for the foreseeable > future (for example, BQ TableRow). > > Agreed but actually, there will be POJOs in order to abstract Elasticsearch's > version support. The following third point explains this. > > >>> >>> 3. regarding multiple/single module: the aim is to have only one production >>> code to ease the maintenance. The problem is that using high level client >>> makes the code dependent to an ES lib version. We would like to make it >>> invisible to the user. He should select only one jar and the IO should >>> decide the lib to use behind the scene. We are thinking about using one >>> module and sub-modules per version and use relocation, wrappers and a >>> factory that detects the version the IO actually points to to instantiate >>> the correct client version. It would also require to have DTOs in the IO >>> because the high level ES java objects are not exactly the same among the >>> ES versions. >>> > > +1 for adding a level of indirection to make this easy for users. >>> 4. regarding tests: the aim is always to target real ES backends to have >>> relevant tests (for reasons I already explained in another thread). The >>> problem is that es-test-framework used today is version dependent and is a >>> pain to use. We plan on using test containers per version (validated by ES >>> dev advocate) and launching them as part of the UTests. Obviously we will >>> launch only one container at the time per version and do all the test with >>> it to avoid paying the cost of launch too much. And the tests will be >>> shipped in per-version sub-modules and not in test dedicated modules like >>> it is now. >>> > > > Using a real ES backend for unit tests can be expensive ? Ideally we should > use a Fake (if one available) or mocking (test test out functionality) and > use real backend for IT tests that can be expensive. If this is a local > container that can be shared between unit tests with reasonable efficiency > that is OK. I'm mainly worried about introducing flakes into unit tests due > to network errors or slowness. > > On this point I understand it and I agree as well, but I think Etienne is > right, IO modules should be exceptions to this rule. > It is really difficult to really test an IO without a real backend and Docker > helps a lot here. > But I take the point, these tests must be as small as posssible (one per > Elasticsearch version). > > > > > Thanks, > Cham > >>> WDYT ? >>> >>> Best ! >>> >>> Etienne >>> >>> On 30/01/2020 17:55, Alexey Romanenko wrote: >>>> I’m second for this question. We have a similar (maybe a bit less painful) >>>> issue for KafkaIO and it would be useful to have a general strategy for >>>> such cases about how to deal with that. >>>> >>>>> On 24 Jan 2020, at 21:54, Kenneth Knowles <k...@apache.org >>>>> <mailto:k...@apache.org>> wrote: >>>>> >>>>> Would it make sense to have different version-specialized connectors with >>>>> a common core library and common API package? >>>>> >>>>> On Fri, Jan 24, 2020 at 11:52 AM Chamikara Jayalath <chamik...@google.com >>>>> <mailto:chamik...@google.com>> wrote: >>>>> Thanks for the contribution. I agree with Alexey that we should try to >>>>> add any new features brought in with the new PR into existing connector >>>>> instead of trying to maintain two implementations. >>>>> >>>>> Thanks, >>>>> Cham >>>>> >>>>> On Fri, Jan 24, 2020 at 9:01 AM Alexey Romanenko >>>>> <aromanenko....@gmail.com <mailto:aromanenko....@gmail.com>> wrote: >>>>> Hi Ludovic, >>>>> >>>>> Thank you for working on this and sharing the details with us. This is >>>>> really great job! >>>>> >>>>> As I recall, we already have some support of Elasticsearch7 in current >>>>> ElasticsearchIO (afaik, at least they are compatible), thanks to Zhong >>>>> Chen and Etienne Chauchot, who were working on adding this [1][2] and it >>>>> should be released in Beam 2.19. >>>>> >>>>> Would you think you can leverage this in your work on adding new >>>>> Elasticsearch7 features? IMHO, supporting two different related IOs can >>>>> be quite tough task and I‘d rather raise my hand to add a new >>>>> functionality into existing IO than creating a new one, if it’s possible. >>>>> >>>>> [1] https://issues.apache.org/jira/browse/BEAM-5192 >>>>> <https://issues.apache.org/jira/browse/BEAM-5192> >>>>> [2] https://github.com/apache/beam/pull/10433 >>>>> <https://github.com/apache/beam/pull/10433> >>>>> >>>>>> On 22 Jan 2020, at 19:23, Ludovic Boutros <boutr...@gmail.com >>>>>> <mailto:boutr...@gmail.com>> wrote: >>>>>> >>>>>> Dear all, >>>>>> >>>>>> I have written a completely reworked Elasticsearch 7+ IO module. >>>>>> It can be found here: >>>>>> https://github.com/ludovic-boutros/beam/tree/fresh-reworked-elasticsearch-io-v7/sdks/java/io/elasticsearch7 >>>>>> >>>>>> <https://github.com/ludovic-boutros/beam/tree/fresh-reworked-elasticsearch-io-v7/sdks/java/io/elasticsearch7> >>>>>> >>>>>> This is a quite advance WIP work but I'm a quite new user of Apache Beam >>>>>> and I would like to get some help on this :) >>>>>> >>>>>> I can create a JIRA issue now but I prefer to wait for your wise avises >>>>>> first. >>>>>> >>>>>> Why a new module ? >>>>>> >>>>>> The current module was compliant with Elasticsearch 2.x, 5.x and 6.x. >>>>>> This seems to be a good point but so many things have been changed since >>>>>> Elasticsearch 2.x. >>>>> >>>>> >>>>> Probably this is not correct anymore due to >>>>> https://github.com/apache/beam/pull/10433 >>>>> <https://github.com/apache/beam/pull/10433> ? >>>>> >>>>>> Elasticsearch 7.x is now partially supported (document type are removed, >>>>>> occ, updates...). >>>>>> >>>>>> A fresh new module, only compliant with the last version of >>>>>> Elasticsearch, can easily benefit a lot from the last evolutions of >>>>>> Elasticsearch (Java High Level Http Client). >>>>>> >>>>>> It is therefore far simpler than the current one. >>>>>> >>>>>> Error management >>>>>> >>>>>> Currently, errors are caught and transformed into simple exceptions. >>>>>> This is not always what is needed. If we would like to do specific >>>>>> processing on these errors (send documents in error topics for >>>>>> instance), it is not possible with the current module. >>>>> >>>>> >>>>> Seems like this is some sort of a dead letter queue implementation.. This >>>>> will be a very good feature to add to the existing connector. >>>>> >>>>>> >>>>>> Philosophy >>>>>> >>>>>> This module directly uses the Elasticsearch Java client classes as >>>>>> inputs and outputs. >>>>>> >>>>>> This way you can configure any options you need directly in the >>>>>> `DocWriteRequest` objects. >>>>>> >>>>>> For instance: >>>>>> - If you need to use external versioning >>>>>> (https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#index-versioning >>>>>> >>>>>> <https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#index-versioning>), >>>>>> you can. >>>>>> - If you need to use an ingest pipelines, you can. >>>>>> - If you need to configure an update document/script, you can. >>>>>> - If you need to use upserts, you can. >>>>>> >>>>>> Actually, you should be able to do everything you can do directly with >>>>>> Elasticsearch. >>>>>> >>>>>> Furthermore, it should be easier to keep updating the module with future >>>>>> Elasticsearch evolutions. >>>>>> >>>>>> Write outputs >>>>>> >>>>>> Two outputs are available: >>>>>> - Successful indexing output ; >>>>>> - Failed indexing output. >>>>>> >>>>>> They are available in a `WriteResult` object. >>>>>> >>>>>> These two outputs are represented by >>>>>> `PCollection<BulkItemResponseContainer>` objects. >>>>>> >>>>>> A `BulkItemResponseContainer` contains: >>>>>> - the original index request ; >>>>>> - the Elasticsearch response ; >>>>>> - a batch id. >>>>>> >>>>>> You can apply any process afterwards (reprocessing, alerting, ...). >>>>>> >>>>>> Read input >>>>>> >>>>>> You can read documents from Elasticsearch with this module. >>>>>> You can specify a `QueryBuilder` in order to filter the retrieved >>>>>> documents. >>>>>> By default, it retrieves the whole document collection. >>>>>> >>>>>> If the Elasticsearch index is sharded, multiple slices can be used >>>>>> during fetch. That many bundles are created. The maximum bundle count is >>>>>> equal to the index shard count. >>>>>> >>>>>> Thank you ! >>>>>> >>>>>> Ludovic >>>>> >>>> >>