Re: A new reworked Elasticsearch 7+ IO module

2020-04-09 Thread Etienne Chauchot

Hi Kenn,

The user does not specify the targeted backendVersion (at least in the
current version of the IO); it is transparent to them: the IO detects the
version with a REST call and adapts its behavior. Anyway, I agree that
we need to log at least a WARN if the detected version is 2. As the new
IO will not be compatible with ESv2 (because the ES classes differ too
much to share a common production code base), IMHO the only option for
the new IO is to reject completely if the version is 2.
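
To illustrate the detect-then-reject behavior described above, here is a minimal Java sketch. It is not the ElasticsearchIO code: the class and method names (`BackendVersionCheck`, `parseMajorVersion`, `checkSupported`) are hypothetical, and it assumes the caller has already fetched the JSON body of the cluster root endpoint, which carries the version under `version.number`. A regex stands in for a real JSON parser only to keep the example dependency-free.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; the names are illustrative and not part of ElasticsearchIO. */
class BackendVersionCheck {

  // Matches the "number" field of the root-endpoint response, e.g. "number" : "7.6.1",
  // and captures only the major component.
  private static final Pattern VERSION_NUMBER =
      Pattern.compile("\"number\"\\s*:\\s*\"(\\d+)\\.[^\"]*\"");

  /** Extracts the major version from the JSON body returned by the cluster root endpoint. */
  static int parseMajorVersion(String rootResponseJson) {
    Matcher m = VERSION_NUMBER.matcher(rootResponseJson);
    if (!m.find()) {
      throw new IllegalArgumentException("Cannot find version.number in response");
    }
    return Integer.parseInt(m.group(1));
  }

  /** Rejects clusters older than v5 (assumption: the new module targets v5 to v7). */
  static void checkSupported(int majorVersion) {
    if (majorVersion < 5) {
      throw new IllegalArgumentException(
          "Elasticsearch " + majorVersion + ".x is not supported by this module");
    }
  }
}
```

Failing fast like this at pipeline construction time would surface the incompatibility before any data is read or written.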


Best

Etienne

On 06/03/2020 18:49, Kenneth Knowles wrote:
> Since the user provides backendVersion, here are some possible levels
> of things to add in expand() based on that (these are extra niceties
> beyond the agreed number of releases to remove)
>
>  - WARN for backendVersion < n
>  - reject for backendVersion < n with opt-in pipeline option to keep
>    it working one more version (gets their attention and indicates
>    urgency)
>  - reject completely
>
> Kenn

Re: A new reworked Elasticsearch 7+ IO module

2020-03-31 Thread Etienne Chauchot

Hi all,

The survey regarding Elasticsearch support in Beam is now closed.

Here are the results after 38 days (number of users per version):

ESv2: 0
ESv5: 1
ESv6: 5
ESv7: 8

So the new version of ElasticsearchIO, after the refactoring discussed
in this thread, will no longer support Elasticsearch v2.


Regards

Etienne Chauchot.



Re: A new reworked Elasticsearch 7+ IO module

2020-03-06 Thread Jean-Baptiste Onofre
Hi,

I think WARN makes sense and is the safest approach. It notifies users and
lets them eventually either update or fall back to a previous Beam IO version.

Regards
JB

> On 6 March 2020 at 18:49, Kenneth Knowles wrote:
> 
> Since the user provides backendVersion, here are some possible levels of 
> things to add in expand() based on that (these are extra niceties beyond the 
> agreed number of releases to remove)
> 
>  - WARN for backendVersion < n
>  - reject for backendVersion < n with opt-in pipeline option to keep it 
> working one more version (gets their attention and indicates urgency)
>  - reject completely
> 
> Kenn

Re: A new reworked Elasticsearch 7+ IO module

2020-03-06 Thread Kenneth Knowles
Since the user provides backendVersion, here are some possible levels of
things to add in expand() based on that (these are extra niceties beyond
the agreed number of releases to remove)

 - WARN for backendVersion < n
 - reject for backendVersion < n with opt-in pipeline option to keep it
working one more version (gets their attention and indicates urgency)
 - reject completely
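
The three levels above could be modelled roughly as follows. This is only a sketch under stated assumptions: `BackendVersionPolicy`, `Level`, `Outcome` and `evaluate` are hypothetical names that exist neither in Beam nor in ElasticsearchIO, and the opt-in pipeline option is represented by a plain boolean.

```java
/** Hypothetical sketch of the three levels; none of these names exist in Beam. */
class BackendVersionPolicy {

  /** Which of the three levels is in force for a given release. */
  enum Level { WARN, REJECT_UNLESS_OPTED_IN, REJECT }

  /** What expand() would do as a result. */
  enum Outcome { ACCEPT, ACCEPT_WITH_WARNING, REJECT }

  /**
   * Decides the outcome for a given backendVersion.
   *
   * @param backendVersion major version of the target cluster
   * @param minSupported the "n" from the list above
   * @param level which of the three levels is in force
   * @param optedIn whether the user set the (hypothetical) opt-in pipeline option
   */
  static Outcome evaluate(int backendVersion, int minSupported, Level level, boolean optedIn) {
    if (backendVersion >= minSupported) {
      return Outcome.ACCEPT;
    }
    switch (level) {
      case WARN:
        return Outcome.ACCEPT_WITH_WARNING;
      case REJECT_UNLESS_OPTED_IN:
        // The opt-in keeps old versions working, but still warns to signal urgency.
        return optedIn ? Outcome.ACCEPT_WITH_WARNING : Outcome.REJECT;
      default:
        return Outcome.REJECT;
    }
  }
}
```

Moving from one level to the next across releases would then only require changing the `Level` value in force.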

Kenn

On Fri, Mar 6, 2020 at 2:26 AM Etienne Chauchot wrote:

> Hi all,
>
> It's been 3 weeks since the survey on the ES versions that users use was
> sent.
>
> The survey has received very few responses: only 9 so far (multiple
> versions possible, of course). The responses are as follows:
>
> ES2: 0 clients, ES5: 1, ES6: 5, ES7: 8
>
> The results point toward dropping ES2 support, but for now the sample is
> still not very representative.
>
> I'm cross-posting to @users to let you know that I'm closing the survey
> within 1 or 2 weeks. So please respond if you're using ESIO.
>
> Best
>
> Etienne

Re: A new reworked Elasticsearch 7+ IO module

2020-03-06 Thread Etienne Chauchot

Hi all,

It's been 3 weeks since the survey on the ES versions that users use was
sent.

The survey has received very few responses: only 9 so far (multiple
versions possible, of course). The responses are as follows:

ES2: 0 clients, ES5: 1, ES6: 5, ES7: 8

The results point toward dropping ES2 support, but for now the sample is
still not very representative.

I'm cross-posting to @users to let you know that I'm closing the survey
within 1 or 2 weeks. So please respond if you're using ESIO.


Best

Etienne

Re: A new reworked Elasticsearch 7+ IO module

2020-02-13 Thread Etienne Chauchot

Hi Cham, thanks for your comments!

I just sent an email to the user ML with a survey link to count ES usage
per version:


https://lists.apache.org/thread.html/rc8185afb8af86a2a032909c13f569e18bd89e75a5839894d5b5d4082%40%3Cuser.beam.apache.org%3E

Best

Etienne


Re: A new reworked Elasticsearch 7+ IO module

2020-02-10 Thread Chamikara Jayalath
On Thu, Feb 6, 2020 at 8:13 AM Etienne Chauchot wrote:

> Hi,
>
> please see my comments inline
> On 06/02/2020 16:24, Alexey Romanenko wrote:
>
> Please, see my comments inline.
>
> On 6 Feb 2020, at 10:50, Etienne Chauchot wrote:
>
>>> 1. regarding version support: ES v2 has not been maintained by Elastic
>>> since 2018/02, so we plan to remove it from the IO. In the past we
>>> already retired versions (like Spark 1.6, for instance).
>>>
>>>
>> My only concern here is that there might be users who use the existing
>> module who might not be able to easily upgrade the Beam version if we
>> remove it. But given that V2 is 5 versions behind the latest release this
>> might be OK.
>>
>
> It seems we have a consensus on this.
> I think there should be another general discussion on the long-term
> support of our preferred tool IO modules.
>
> => yes, consensus, let's drop ESV2
>
> We had (and still have) a similar problem with KafkaIO in supporting
> different versions of Kafka, especially the very old version 0.9. We raised
> this question on user@, and it appears that there are users who, for some
> reason, still use old Kafka versions. So, before dropping support for any
> ES version, I'd suggest asking on user@ to see whether anyone will be
> affected by this.
>
> Yes, we can do a survey among users, but the question is: should we support
> an ES version that is no longer supported by Elastic themselves?
>

+1 for asking on the user list. I guess this is more about whether users
need the specific version that we hope to drop support for. Whether we
need to support unsupported versions is a more generic question that should
probably be addressed on the dev list (and I personally don't think we
should unless there's a large enough user base for a given version).

>>> 2. regarding the user: the aim is to unlock some new features (listed by
>>> Ludovic) and give the user more flexibility on their requests. For that,
>>> we need to use the high-level Java ES client in place of the low-level
>>> REST client (which was used because it is the only one compatible with
>>> all ES versions). We plan to replace the API (JSON document in and out)
>>> with more complete standard ES objects that contain the request logic
>>> (insert/update, doc routing, etc.) and the data. There are already IOs
>>> like SpannerIO that use similar objects in the input PCollection rather
>>> than pure POJOs.
>>>
>>>
>> Won't this be a breaking change for all users? IMO using POJOs in
>> PCollections is safer since otherwise we would have to worry about
>> changes to the underlying client library API. The exception would be when
>> the underlying client library offers a backwards compatibility guarantee
>> that we can rely on for the foreseeable future (for example, BQ TableRow).
>>
>
> Agreed, but actually there will be POJOs in order to abstract
> Elasticsearch's version support. The third point below explains this.
>
> => indeed it will be a breaking change, hence this email to get a
> consensus on that. Also, I think our wrappers of ES request objects will
> be as backward compatible as the underlying objects.
>
> I just want to remind everyone that, according to what we agreed some
> time ago on dev@ (at least for IOs), all breaking user API changes have
> to be added along with a deprecation of the old API, which can be removed
> after 3 consecutive Beam releases. That way, users will have time to move
> to the new API smoothly.
>
> We are mostly discussing the target architecture of the new module here,
> but the deprecation process is important to recall, I agree. When I say
> the DTOs are backward compatible above, I mean between the per-version
> sub-modules inside the new module. Anyway, sure, for some time both
> modules (the old REST-based one that supports v2-7 and the new one that
> supports v5-7) will coexist, and the old one will receive the deprecation
> annotations.
>

+1 for supporting both versions for at least three minor versions to give
users time to migrate. Also, we should try to produce a warning for users
who use the deprecated versions.

Thanks,
Cham


> Best
>
> Etienne
>
>
>
>


Re: A new reworked Elasticsearch 7+ IO module

2020-02-06 Thread Etienne Chauchot

Hi,

please see my comments inline

On 06/02/2020 16:24, Alexey Romanenko wrote:

Please, see my comments inline.

On 6 Feb 2020, at 10:50, Etienne Chauchot wrote:



1. regarding version support: ES v2 is no longer maintained
by Elastic since 2018/02 so we plan to remove it from the
IO. In the past we already retired versions (like spark
1.6 for instance).



My only concern here is that there might be users who use the
existing module who might not be able to easily upgrade the Beam
version if we remove it. But given that V2 is 5 versions behind
the latest release this might be OK.


It seems we have a consensus on this.
I think there should be another general discussion on the long term 
support of our preferred tool IO modules.


=> yes, consensus, let's drop ESV2

We had (and still have) a similar problem with KafkaIO to support 
different versions of Kafka, especially very old version 0.9. We 
raised this question on user@ and it appears that there are users who 
for some reasons still use old Kafka versions. So, before dropping a 
support of any ES versions, I’d suggest to ask it user@ and see if any 
people will be affected by this.
Yes we can do a survey among users but the question is, should we 
support an ES version that is no longer supported by Elastic themselves?



2. regarding the user: the aim is to unlock some new
features (listed by Ludovic) and give the user more
flexibility on his request. For that, it requires to use
high level java ES client in place of the low level REST
client (that was used because it is the only one
compatible with all ES versions). We plan to replace the
API (json document in and out) by more complete standard
ES objects that contain the request logic (insert/update,
doc routing etc...) and the data. There are already IOs
like SpannerIO that use similar objects in input
PCollection rather than pure POJOs.



Won't this be a breaking change for all users ? IMO using POJOs
in PCollections is safer since we have to worry about changes to
the underlying client library API. Exception would be when
underlying client library offers a backwards
compatibility guarantee that we can rely on for the
foreseeable future (for example, BQ TableRow).


Agreed but actually, there will be POJOs in order to abstract 
Elasticsearch's version support. The following third point explains 
this.


=> indeed it will be a breaking change, hence this email to get a 
consensus on that. Also I think our wrappers of ES request objects 
will offer backward compatibility, as the underlying objects do.


I just want to remind that according to what we agreed some time ago 
on dev@ (at least, for IOs), all breaking user API changes have to be 
added along with deprecation of old API that could be removed after 3 
consecutive Beam releases. In this case, users will have time to 
move to new API smoothly.


We are more discussing the target architecture of the new module here 
but the process of deprecation is important to recall, I agree. When I 
say DTOs backward compatible above I mean between per-version 
sub-modules inside the new module. Anyway, sure, for some time, both 
modules (the old REST-based that supports v2-7 and the new that supports 
v5-7) will cohabit and the old one will receive the deprecation 
annotations.


Best

Etienne






Re: A new reworked Elasticsearch 7+ IO module

2020-02-06 Thread Alexey Romanenko
Please, see my comments inline.

> On 6 Feb 2020, at 10:50, Etienne Chauchot  wrote:
>>> 1. regarding version support: ES v2 is no longer maintained by Elastic since 
>>> 2018/02 so we plan to remove it from the IO. In the past we already 
>>> retired versions (like spark 1.6 for instance).
>> 
>> 
>> My only concern here is that there might be users who use the existing 
>> module who might not be able to easily upgrade the Beam version if we remove 
>> it. But given that V2 is 5 versions behind the latest release this might be 
>> OK.
>> 
>> It seems we have a consensus on this.
>> I think there should be another general discussion on the long term support 
>> of our preferred tool IO modules.
> => yes, consensus, let's drop ESV2
> 
We had (and still have) a similar problem with KafkaIO to support different 
versions of Kafka, especially very old version 0.9. We raised this question on 
user@ and it appears that there are users who for some reasons still use old 
Kafka versions. So, before dropping a support of any ES versions, I’d suggest 
to ask it user@ and see if any people will be affected by this.
>>> 2. regarding the user: the aim is to unlock some new features (listed by 
>>> Ludovic) and give the user more flexibility on his request. For that, it 
>>> requires to use high level java ES client in place of the low level REST 
>>> client (that was used because it is the only one compatible with all ES 
>>> versions). We plan to replace the API (json document in and out) by more 
>>> complete standard ES objects that contain the request logic (insert/update, 
>>> doc routing etc...) and the data. There are already IOs like SpannerIO 
>>> that use similar objects in input PCollection rather than pure POJOs. 
>> 
>> 
>> Won't this be a breaking change for all users ? IMO using POJOs in 
>> PCollections is safer since we have to worry about changes to the underlying 
>> client library API. Exception would be when underlying client library offers 
>> a backwards compatibility guarantee that we can rely on for the foreseeable 
>> future (for example, BQ TableRow).
>> 
>> Agreed but actually, there will be POJOs in order to abstract 
>> Elasticsearch's version support. The following third point explains this.
> => indeed it will be a breaking change, hence this email to get a consensus 
> on that. Also I think our wrappers of ES request objects will offer 
> backward compatibility, as the underlying objects do.
> 
I just want to remind that according to what we agreed some time ago on dev@ 
(at least, for IOs), all breaking user API changes have to be added along with 
deprecation of old API that could be removed after 3 consecutive Beam releases. 
In this case, users will have time to move to the new API smoothly. 




Re: A new reworked Elasticsearch 7+ IO module

2020-02-06 Thread Jean-Baptiste Onofre
Hi,

Let’s sync together about this IO.

Regarding mock and IOs, and Etienne’s comment, there are two things:

1. Of course, it’s always preferable to use a concrete backend, but several times 
it’s not possible. That is where a mock is required.
2. The mock can be smart enough to cover core IO behavior

So, I still think that a mock is interesting as a core first-level test.

In the case of ES, we can easily bootstrap an instance, but I think we can move 
forward without a mock.

Regards
JB

> On 6 Feb 2020, at 09:47, Ludovic Boutros wrote:
> 
> Hi all,
> 
> First, thank you all for your answers, and especially Etienne for your time, 
> advice and kindness :)
> @Jean-Baptiste, any help on this module is welcome of course.
> 
> @Chamikara Jayalath, my answers are inline.
> 
> Have a good day !
> 
> Ludovic
> 
> On Wed, 5 Feb 2020 at 20:15, Chamikara Jayalath wrote:
> 
> 
> On Wed, Feb 5, 2020 at 6:35 AM Etienne Chauchot wrote:
> Still, one thing I don't agree with is that IOs can be tested on mocks. 
> We don't really test IO behavior with mocks: there are always special 
> behaviors that cannot be reproduced in mocks (split, load, corner cases, 
> etc.). In the past there were IOs that were tested using mocks and that 
> turned out to be nonfunctional.
> 
> Regarding ITests, we have very few compared to UTests, and they are not as 
> closely observed as UTests.
> 
> Etienne
> 
> On 05/02/2020 11:32, Jean-Baptiste Onofre wrote:
>> Hi,
>> 
>> We talked in the past about multiple/single module.
>> 
>> IMHO the always preferred goal is to have a single module. However, it’s 
>> tricky when we have such difference, including on the user facing API. So, I 
>> would go with module per version, or use a specified version for a target 
>> Beam release.
>> 
>> For the test, we should distinguish utest from itest. Utest can be done with 
>> mock, the purpose is really to test the IO behavior. Then we can have itest 
>> using concrete ES instance.
>> 
>> Anyway, I’m OK with the proposal and I would like to work on this IO (I have 
>> other improvements coming on other IOs anyway) with you guys (and Ludovic 
>> especially).
>> 
>> Regards
>> JB
>> 
>>> On 5 Feb 2020, at 10:44, Etienne Chauchot wrote:
>>> 
>>> Hi all, 
>>> 
>>> We had a long discussion with Ludovic about this IO. I'd like to share with 
>>> you to keep you informed and also gather your opinions
>>> 
>>> 1. regarding version support: ES v2 is no longer maintained by Elastic since 
>>> 2018/02 so we plan to remove it from the IO. In the past we already retired 
>>> versions (like spark 1.6 for instance).
>>> 
> 
> 
> My only concern here is that there might be users who use the existing module 
> who might not be able to easily upgrade the Beam version if we remove it. But 
> given that V2 is 5 versions behind the latest release this might be OK.
> 
> It seems we have a consensus on this.
> I think there should be another general discussion on the long term support 
> of our preferred tool IO modules.
>  
>  
>>> 
>>> 2. regarding the user: the aim is to unlock some new features (listed by 
>>> Ludovic) and give the user more flexibility on his request. For that, it 
>>> requires to use high level java ES client in place of the low level REST 
>>> client (that was used because it is the only one compatible with all ES 
>>> versions). We plan to replace the API (json document in and out) by more 
>>> complete standard ES objects that contain the request logic (insert/update, 
>>> doc routing etc...) and the data. There are already IOs like SpannerIO that 
>>> use similar objects in input PCollection rather than pure POJOs. 
>>> 
> 
> 
> Won't this be a breaking change for all users ? IMO using POJOs in 
> PCollections is safer since we have to worry about changes to the underlying 
> client library API. Exception would be when underlying client library offers 
> a backwards compatibility guarantee that we can rely on for the foreseeable 
> future (for example, BQ TableRow).
> 
> Agreed but actually, there will be POJOs in order to abstract Elasticsearch's 
> version support. The following third point explains this.
>  
>  
>>> 
>>> 3. regarding multiple/single module: the aim is to have only one production 
>>> code to ease the maintenance.  The problem is that using high level client 
>>> makes the code dependent to an ES lib version. We would like to make it 
>>> invisible to the user. He should select only one jar and the IO should 
>>> decide the lib to use behind the scene. We are thinking about using one 
>>> module and sub-modules per version and use relocation, wrappers and a 
>>> factory that detects the version the IO actually points to to instantiate 
>>> the correct client version. It would also require to have DTOs in the IO 
>>> because the high level ES java objects are not exactly the same among the 
>>> ES versions.
>>> 
> 
> +1 for adding 

Re: A new reworked Elasticsearch 7+ IO module

2020-02-06 Thread Etienne Chauchot

Hi,

Thanks all for your comments, my comments are inline

On 06/02/2020 09:47, Ludovic Boutros wrote:

Hi all,

First, thank you all for your answers, and especially Etienne for your 
time, advice and kindness :)

@Jean-Baptiste, any help on this module is welcome of course.

@Chamikara Jayalath, my answers are inline.

Have a good day !

Ludovic

On Wed, 5 Feb 2020 at 20:15, Chamikara Jayalath wrote:




On Wed, Feb 5, 2020 at 6:35 AM Etienne Chauchot wrote:

Still, one thing I don't agree with is that IOs can be
tested on mocks. We don't really test IO behavior with mocks:
there are always special behaviors that cannot be reproduced in
mocks (split, load, corner cases, etc.). In the past there were
IOs that were tested using mocks and that turned out to
be nonfunctional.

Regarding ITests, we have very few compared to UTests, and they
are not as closely observed as UTests.

Etienne

On 05/02/2020 11:32, Jean-Baptiste Onofre wrote:

Hi,

We talked in the past about multiple/single module.

IMHO the always preferred goal is to have a single module.
However, it’s tricky when we have such difference, including
on the user facing API. So, I would go with module per
version, or use a specified version for a target Beam release.

For the test, we should distinguish utest from itest. Utest
can be done with mock, the purpose is really to test the IO
behavior. Then we can have itest using concrete ES instance.

Anyway, I’m OK with the proposal and I would like to work on
this IO (I have other improvements coming on other IOs
anyway) with you guys (and Ludovic especially).

Regards
JB


On 5 Feb 2020, at 10:44, Etienne Chauchot wrote:

Hi all,

We had a long discussion with Ludovic about this IO. I'd
like to share with you to keep you informed and also gather
your opinions

1. regarding version support: ES v2 is no longer maintained by
Elastic since 2018/02 so we plan to remove it from the IO.
In the past we already retired versions (like spark 1.6 for
instance).



My only concern here is that there might be users who use the
existing module who might not be able to easily upgrade the Beam
version if we remove it. But given that V2 is 5 versions behind
the latest release this might be OK.


It seems we have a consensus on this.
I think there should be another general discussion on the long term 
support of our preferred tool IO modules.


=> yes, consensus, let's drop ESV2


2. regarding the user: the aim is to unlock some new
features (listed by Ludovic) and give the user more
flexibility on his request. For that, it requires to use
high level java ES client in place of the low level REST
client (that was used because it is the only one compatible
with all ES versions). We plan to replace the API (json
document in and out) by more complete standard ES objects
that contain the request logic (insert/update, doc routing
etc...) and the data. There are already IOs like SpannerIO
that use similar objects in input PCollection rather than
pure POJOs.



Won't this be a breaking change for all users ? IMO using POJOs in
PCollections is safer since we have to worry about changes to the
underlying client library API. Exception would be when underlying
client library offers a backwards compatibility guarantee that we
can rely on for the foreseeable future (for example, BQ TableRow).


Agreed but actually, there will be POJOs in order to abstract 
Elasticsearch's version support. The following third point explains this.



=> indeed it will be a breaking change, hence this email to get a 
consensus on that. Also I think our wrappers of ES request objects will 
offer backward compatibility, as the underlying objects do.




3. regarding multiple/single module: the aim is to have only
one production code to ease the maintenance.  The problem is
that using high level client makes the code dependent to an
ES lib version. We would like to make it invisible to the
user. He should select only one jar and the IO should decide
the lib to use behind the scene. We are thinking about using
one module and sub-modules per version and use relocation,
wrappers and a factory that detects the version the IO
actually points to to instantiate the correct client
version. It would also require to have DTOs in the IO
because the high level ES java objects are not exactly the
same among the ES versions.


+1 for adding a level of indirection to make this 

Re: A new reworked Elasticsearch 7+ IO module

2020-02-06 Thread Ludovic Boutros
Hi all,

First, thank you all for your answers, and especially Etienne for your
time, advice and kindness :)
@Jean-Baptiste, any help on this module is welcome of course.

@Chamikara Jayalath, my answers are inline.

Have a good day !

Ludovic

On Wed, 5 Feb 2020 at 20:15, Chamikara Jayalath wrote:

>
>
> On Wed, Feb 5, 2020 at 6:35 AM Etienne Chauchot 
> wrote:
>
>> Still, one thing I don't agree with is that IOs can be tested on mocks.
>> We don't really test IO behavior with mocks: there are always special
>> behaviors that cannot be reproduced in mocks (split, load, corner cases,
>> etc.). In the past there were IOs that were tested using mocks and
>> that turned out to be nonfunctional.
>>
>> Regarding ITests, we have very few compared to UTests, and they are not as
>> closely observed as UTests.
>>
>> Etienne
>> On 05/02/2020 11:32, Jean-Baptiste Onofre wrote:
>>
>> Hi,
>>
>> We talked in the past about multiple/single module.
>>
>> IMHO the always preferred goal is to have a single module. However, it’s
>> tricky when we have such difference, including on the user facing API. So,
>> I would go with module per version, or use a specified version for a target
>> Beam release.
>>
>> For the test, we should distinguish utest from itest. Utest can be done
>> with mock, the purpose is really to test the IO behavior. Then we can have
>> itest using concrete ES instance.
>>
>> Anyway, I’m OK with the proposal and I would like to work on this IO (I
>> have other improvements coming on other IOs anyway) with you guys (and
>> Ludovic especially).
>>
>> Regards
>> JB
>>
>> On 5 Feb 2020, at 10:44, Etienne Chauchot wrote:
>>
>> Hi all,
>>
>> We had a long discussion with Ludovic about this IO. I'd like to share
>> with you to keep you informed and also gather your opinions
>>
>> 1. regarding version support: ES v2 is no longer maintained by Elastic
>> since 2018/02 so we plan to remove it from the IO. In the past we already
>> retired versions (like spark 1.6 for instance).
>>
>>
> My only concern here is that there might be users who use the existing
> module who might not be able to easily upgrade the Beam version if we
> remove it. But given that V2 is 5 versions behind the latest release this
> might be OK.
>

It seems we have a consensus on this.
I think there should be another general discussion on the long term support
of our preferred tool IO modules.


>
>
>> 2. regarding the user: the aim is to unlock some new features (listed by
>> Ludovic) and give the user more flexibility on his request. For that, it
>> requires to use high level java ES client in place of the low level REST
>> client (that was used because it is the only one compatible with all ES
>> versions). We plan to replace the API (json document in and out) by more
>> complete standard ES objects that contain the request logic (insert/update,
>> doc routing etc...) and the data. There are already IOs like SpannerIO that
>> use similar objects in input PCollection rather than pure POJOs.
>>
>>
> Won't this be a breaking change for all users ? IMO using POJOs in
> PCollections is safer since we have to worry about changes to the
> underlying client library API. Exception would be when underlying client
> library offers a backwards compatibility guarantee that we can rely on for
> the foreseeable future (for example, BQ TableRow).
>

Agreed but actually, there will be POJOs in order to abstract
Elasticsearch's version support. The following third point explains this.
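
To make the idea a bit more concrete, here is a rough sketch of what such a version-neutral POJO could look like. Everything in it (the class name `EsWriteRequest`, the fields, the factory methods) is an assumption made for illustration and is not the API of the PR.

```java
import java.io.Serializable;
import java.util.Objects;

/** Hypothetical version-neutral write request; all names are illustrative only. */
final class EsWriteRequest implements Serializable {
  private static final long serialVersionUID = 1L;

  enum Operation { INDEX, UPDATE, DELETE }

  private final Operation operation;
  private final String index;
  private final String id;
  private final String routing;    // optional, null when no routing is requested
  private final String jsonSource; // document body, null for DELETE

  private EsWriteRequest(
      Operation operation, String index, String id, String routing, String jsonSource) {
    this.operation = Objects.requireNonNull(operation, "operation");
    this.index = Objects.requireNonNull(index, "index");
    this.id = Objects.requireNonNull(id, "id");
    this.routing = routing;
    this.jsonSource = jsonSource;
  }

  static EsWriteRequest index(String index, String id, String jsonSource) {
    return new EsWriteRequest(
        Operation.INDEX, index, id, null, Objects.requireNonNull(jsonSource, "jsonSource"));
  }

  static EsWriteRequest delete(String index, String id) {
    return new EsWriteRequest(Operation.DELETE, index, id, null, null);
  }

  /** Returns a copy carrying the routing value; the receiver is left unchanged. */
  EsWriteRequest withRouting(String routing) {
    return new EsWriteRequest(operation, index, id, routing, jsonSource);
  }

  Operation getOperation() { return operation; }
  String getIndex() { return index; }
  String getId() { return id; }
  String getRouting() { return routing; }
  String getJsonSource() { return jsonSource; }
}
```

Each per-version sub-module would then translate this object into the request class of its own ES client, so that no ES class appears in the user-facing PCollection.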


>
>
>> 3. regarding multiple/single module: the aim is to have only one
>> production code to ease the maintenance.  The problem is that using high
>> level client makes the code dependent to an ES lib version. We would like
>> to make it invisible to the user. He should select only one jar and the IO
>> should decide the lib to use behind the scene. We are thinking about using
>> one module and sub-modules per version and use relocation, wrappers and a
>> factory that detects the version the IO actually points to to instantiate
>> the correct client version. It would also require to have DTOs in the IO
>> because the high level ES java objects are not exactly the same among the
>> ES versions.
>>
>> +1 for adding a level of indirection to make this easy for users.
>
>> 4. regarding tests: the aim is always to target real ES backends to have
>> relevant tests (for reasons I already explained in another thread). The
>> problem is that es-test-framework used today is version dependent and is a
>> pain to use. We plan on using test containers per version (validated by ES
>> dev advocate) and launching them as part of the UTests. Obviously we will
>> launch only one container at the time per version and do all the test with
>> it to avoid paying the cost of launch too much. And the tests will be
>> shipped in per-version sub-modules and not in test dedicated modules like
>> it is now.
>>
>>
> Using a real ES backend for unit tests can be expensive ? Ideally we
> 

Re: A new reworked Elasticsearch 7+ IO module

2020-02-05 Thread Chamikara Jayalath
On Wed, Feb 5, 2020 at 6:35 AM Etienne Chauchot 
wrote:

> Still, one thing I don't agree with is that IOs can be tested on mocks.
> We don't really test IO behavior with mocks: there are always special
> behaviors that cannot be reproduced in mocks (split, load, corner cases,
> etc.). In the past there were IOs that were tested using mocks and
> that turned out to be nonfunctional.
>
> Regarding ITests, we have very few compared to UTests, and they are not as
> closely observed as UTests.
>
> Etienne
> On 05/02/2020 11:32, Jean-Baptiste Onofre wrote:
>
> Hi,
>
> We talked in the past about multiple/single module.
>
> IMHO the always preferred goal is to have a single module. However, it’s
> tricky when we have such difference, including on the user facing API. So,
> I would go with module per version, or use a specified version for a target
> Beam release.
>
> For the test, we should distinguish utest from itest. Utest can be done
> with mock, the purpose is really to test the IO behavior. Then we can have
> itest using concrete ES instance.
>
> Anyway, I’m OK with the proposal and I would like to work on this IO (I
> have other improvements coming on other IOs anyway) with you guys (and
> Ludovic especially).
>
> Regards
> JB
>
> On 5 Feb 2020, at 10:44, Etienne Chauchot wrote:
>
> Hi all,
>
> We had a long discussion with Ludovic about this IO. I'd like to share
> with you to keep you informed and also gather your opinions
>
> 1. regarding version support: ES v2 is no longer maintained by Elastic since
> 2018/02 so we plan to remove it from the IO. In the past we already retired
> versions (like spark 1.6 for instance).
>
>
My only concern here is that there might be users who use the existing
module who might not be able to easily upgrade the Beam version if we
remove it. But given that V2 is 5 versions behind the latest release this
might be OK.


> 2. regarding the user: the aim is to unlock some new features (listed by
> Ludovic) and give the user more flexibility on his request. For that, it
> requires to use high level java ES client in place of the low level REST
> client (that was used because it is the only one compatible with all ES
> versions). We plan to replace the API (json document in and out) by more
> complete standard ES objects that contain the request logic (insert/update,
> doc routing etc...) and the data. There are already IOs like SpannerIO that
> use similar objects in input PCollection rather than pure POJOs.
>
>
Won't this be a breaking change for all users ? IMO using POJOs in
PCollections is safer since we have to worry about changes to the
underlying client library API. Exception would be when underlying client
library offers a backwards compatibility guarantee that we can rely on for
the foreseeable future (for example, BQ TableRow).


> 3. regarding multiple/single module: the aim is to have only one
> production code to ease the maintenance.  The problem is that using high
> level client makes the code dependent to an ES lib version. We would like
> to make it invisible to the user. He should select only one jar and the IO
> should decide the lib to use behind the scene. We are thinking about using
> one module and sub-modules per version and use relocation, wrappers and a
> factory that detects the version the IO actually points to to instantiate
> the correct client version. It would also require to have DTOs in the IO
> because the high level ES java objects are not exactly the same among the
> ES versions.
>
> +1 for adding a level of indirection to make this easy for users.

> 4. regarding tests: the aim is always to target real ES backends to have
> relevant tests (for reasons I already explained in another thread). The
> problem is that es-test-framework used today is version dependent and is a
> pain to use. We plan on using test containers per version (validated by ES
> dev advocate) and launching them as part of the UTests. Obviously we will
> launch only one container at the time per version and do all the test with
> it to avoid paying the cost of launch too much. And the tests will be
> shipped in per-version sub-modules and not in test dedicated modules like
> it is now.
>
>
Could using a real ES backend for unit tests be expensive? Ideally we should
use a fake (if one is available) or mocking (to test out functionality) and
use a real backend for IT tests, which can be expensive. If this is a local
container that can be shared between unit tests with reasonable efficiency,
that is OK. I'm mainly worried about introducing flakes into unit tests due
to network errors or slowness.
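
As a rough sketch of the "start once, share between tests" idea, the bookkeeping could look like the following. It uses a plain in-memory stand-in and does not start any container or Elasticsearch node; `SharedBackendRegistry` and `Backend` are hypothetical names, and a real setup would put the container start where the stand-in object is created.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical registry that starts at most one backend per major version. */
final class SharedBackendRegistry {

  /** Stand-in for a running backend; a real test would hold a container handle here. */
  static final class Backend {
    final int majorVersion;

    Backend(int majorVersion) {
      this.majorVersion = majorVersion;
    }
  }

  private final Map<Integer, Backend> running = new ConcurrentHashMap<>();
  private final AtomicInteger starts = new AtomicInteger();

  /** Returns the shared backend for this version, starting it on first use only. */
  Backend forVersion(int majorVersion) {
    return running.computeIfAbsent(majorVersion, version -> {
      starts.incrementAndGet(); // the expensive start would happen here, once per version
      return new Backend(version);
    });
  }

  /** Number of backends started so far, useful to check that sharing really happens. */
  int startCount() {
    return starts.get();
  }
}
```

All unit tests of one per-version sub-module would ask the registry for their backend, so the launch cost is paid once per version rather than once per test.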

Thanks,
Cham


> WDYT ?
>
> Best !
>
> Etienne
> On 30/01/2020 17:55, Alexey Romanenko wrote:
>
> I second this question. We have a similar (maybe a bit less painful)
> issue for KafkaIO and it would be useful to have a general strategy for
> such cases about how to deal with that.
>
> On 24 Jan 2020, at 21:54, Kenneth Knowles  wrote:
>
> Would it make sense to have 

Re: A new reworked Elasticsearch 7+ IO module

2020-02-05 Thread Etienne Chauchot
Still, one thing I don't agree with is that IOs can be tested on 
mocks. We don't really test IO behavior with mocks: there are always 
special behaviors that cannot be reproduced in mocks (split, load, 
corner cases, etc.). In the past there were IOs that were tested using 
mocks and that turned out to be nonfunctional.


Regarding ITests, we have very few compared to UTests, and they are not 
as closely observed as UTests.


Etienne

On 05/02/2020 11:32, Jean-Baptiste Onofre wrote:

Hi,

We talked in the past about multiple/single module.

IMHO the always preferred goal is to have a single module. However, 
it’s tricky when we have such difference, including on the user facing 
API. So, I would go with module per version, or use a specified 
version for a target Beam release.


For the test, we should distinguish utest from itest. Utest can be 
done with mock, the purpose is really to test the IO behavior. Then we 
can have itest using concrete ES instance.


Anyway, I’m OK with the proposal and I would like to work on this IO 
(I have other improvements coming on other IOs anyway) with you guys 
(and Ludovic especially).


Regards
JB

On 5 Feb 2020, at 10:44, Etienne Chauchot wrote:


Hi all,

We had a long discussion with Ludovic about this IO. I'd like to 
share with you to keep you informed and also gather your opinions


1. regarding version support: ES v2 is no longer maintained by Elastic 
since 2018/02 so we plan to remove it from the IO. In the past we 
already retired versions (like spark 1.6 for instance).


2. regarding the user: the aim is to unlock some new features (listed 
by Ludovic) and give the user more flexibility on his request. For 
that, it requires to use high level java ES client in place of the 
low level REST client (that was used because it is the only one 
compatible with all ES versions). We plan to replace the API (json 
document in and out) by more complete standard ES objects that 
contain the request logic (insert/update, doc routing etc...) and the 
data. There are already IOs like SpannerIO that use similar objects 
in input PCollection rather than pure POJOs.


3. regarding multiple/single module: the aim is to have a single 
production code base to ease maintenance. The problem is that using the 
high level client makes the code dependent on an ES lib version. We 
would like to make this invisible to the user: he should select only 
one jar and the IO should decide which lib to use behind the scenes. We 
are thinking about using one module with sub-modules per version, plus 
relocation, wrappers and a factory that detects the version the IO 
actually points to and instantiates the correct client version. It 
would also require DTOs in the IO because the high level ES 
java objects are not exactly the same across ES versions.


4. regarding tests: the aim is always to target real ES backends to 
have relevant tests (for reasons I already explained in another 
thread). The problem is that the es-test-framework used today is version 
dependent and is a pain to use. We plan on using test containers per 
version (validated by an ES dev advocate) and launching them as part of 
the UTests. Obviously we will launch only one container at a time 
per version and run all the tests against it to avoid paying the launch 
cost too often. And the tests will be shipped in the per-version 
sub-modules and not in dedicated test modules as is the case now.
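
To illustrate point 3, a minimal sketch of the factory could look like this. It assumes the version string has already been obtained from the cluster (the detection call itself is left out), and `VersionedClient` and `VersionedClientFactory` are hypothetical names standing in for the per-version wrappers.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical wrapper interface implemented once per ES major version. */
interface VersionedClient {
  int majorVersion();
}

/** Hypothetical factory picking the wrapper that matches the detected backend version. */
final class VersionedClientFactory {
  private final Map<Integer, Supplier<VersionedClient>> suppliers = new HashMap<>();

  /** Registers the wrapper to use for one major version; returns this for chaining. */
  VersionedClientFactory register(int majorVersion, Supplier<VersionedClient> supplier) {
    suppliers.put(majorVersion, supplier);
    return this;
  }

  /** Extracts the major version from a version number such as "7.5.2". */
  static int majorOf(String versionNumber) {
    String trimmed = versionNumber.trim();
    int dot = trimmed.indexOf('.');
    return Integer.parseInt(dot < 0 ? trimmed : trimmed.substring(0, dot));
  }

  /** Instantiates the wrapper for the detected version, or fails if none is registered. */
  VersionedClient create(String detectedVersion) {
    int major = majorOf(detectedVersion);
    Supplier<VersionedClient> supplier = suppliers.get(major);
    if (supplier == null) {
      throw new UnsupportedOperationException(
          "No client wrapper registered for Elasticsearch " + major);
    }
    return supplier.get();
  }
}
```

In the real module each supplier would live in its per-version sub-module and build the relocated high level client; the user-facing IO would only ever see the wrapper interface.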


WDYT ?

Best !

Etienne

On 30/01/2020 17:55, Alexey Romanenko wrote:
I second this question. We have a similar (maybe a bit less 
painful) issue for KafkaIO and it would be useful to have a general 
strategy for such cases about how to deal with that.


On 24 Jan 2020, at 21:54, Kenneth Knowles wrote:


Would it make sense to have different version-specialized 
connectors with a common core library and common API package?


On Fri, Jan 24, 2020 at 11:52 AM Chamikara Jayalath wrote:


Thanks for the contribution. I agree with Alexey that we should
try to add any new features brought in with the new PR into
existing connector instead of trying to maintain two
implementations.

Thanks,
Cham

On Fri, Jan 24, 2020 at 9:01 AM Alexey Romanenko wrote:

Hi Ludovic,

Thank you for working on this and sharing the details with
us. This is a really great job!

As I recall, we already have some support of Elasticsearch7
in current ElasticsearchIO (afaik, at least they are
compatible), thanks to Zhong Chen and Etienne Chauchot, who
were working on adding this [1][2] and it should be
released in Beam 2.19.

Do you think you can leverage this in your work on
adding new Elasticsearch7 features? IMHO, supporting two
different related IOs can be quite a tough task and I’d
rather raise my hand to add a new 

Re: A new reworked Elasticsearch 7+ IO module

2020-02-05 Thread Jean-Baptiste Onofre
Hi,

We talked in the past about multiple/single module.

IMHO the always preferred goal is to have a single module. However, it’s tricky 
when we have such differences, including on the user facing API. So, I would go 
with a module per version, or use a specified version for a target Beam release.

For the tests, we should distinguish utests from itests. Utests can be done with 
mocks, the purpose is really to test the IO behavior. Then we can have itests 
using a concrete ES instance.

Anyway, I’m OK with the proposal and I would like to work on this IO (I have 
other improvements coming on other IOs anyway) with you guys (and Ludovic 
especially).

Regards
JB

> On 5 Feb 2020, at 10:44, Etienne Chauchot wrote:
> 
> Hi all, 
> 
> We had a long discussion with Ludovic about this IO. I'd like to share with 
> you to keep you informed and also gather your opinions
> 
> 1. regarding version support: ES v2 is no longer maintained by Elastic since 
> 2018/02 so we plan to remove it from the IO. In the past we already retired 
> versions (like spark 1.6 for instance).
> 
> 2. regarding the user: the aim is to unlock some new features (listed by 
> Ludovic) and give the user more flexibility on his request. For that, it 
> requires to use high level java ES client in place of the low level REST 
> client (that was used because it is the only one compatible with all ES 
> versions). We plan to replace the API (json document in and out) by more 
> complete standard ES objects that contain de request logic (insert/update, 
> doc routing etc...) and the data. There are already IOs like SpannerIO that 
> use similar objects in input PCollection rather than pure POJOs. 
> 
> 3. regarding multiple/single module: the aim is to have only one production 
> code to ease the maintenance.  The problem is that using high level client 
> makes the code dependent to an ES lib version. We would like to make it 
> invisible to the user. He should select only one jar and the IO should decide 
> the lib to use behind the scene. We are thinking about using one module and 
> sub-modules per version and use relocation, wrappers and a factory that 
> detects the version the IO actually points to to instantiate the correct 
> client version. It would also require to have DTOs in the IO because the high 
> level ES java objects are not exactly the same among the ES versions.
> 
> 4. regarding tests: the aim is always to target real ES backends to have 
> relevant tests (for reasons I already explained in another thread). The 
> problem is that es-test-framework used today is version dependent and is a 
> pain to use. We plan on using test containers per version (validated by ES 
> dev advocate) and launching them as part of the UTests. Obviously we will 
> launch only one container at the time per version and do all the test with it 
> to avoid paying the cost of launch too much. And the tests will be shipped in 
> per-version sub-modules and not in test dedicated modules like it is now.
> 
> WDYT ?
> 
> Best !
> 
> Etienne
> 
> On 30/01/2020 17:55, Alexey Romanenko wrote:
>> I’m second for this question. We have a similar (maybe a bit less painful) 
>> issue for KafkaIO and it would be useful to have a general strategy for such 
>> cases about how to deal with that.
>> 
>>> On 24 Jan 2020, at 21:54, Kenneth Knowles wrote:
>>> 
>>> Would it make sense to have different version-specialized connectors with a 
>>> common core library and common API package?
>>> 
>>> On Fri, Jan 24, 2020 at 11:52 AM Chamikara Jayalath wrote:
>>> Thanks for the contribution. I agree with Alexey that we should try to add 
>>> any new features brought in with the new PR into existing connector instead 
>>> of trying to maintain two implementations.  
>>> 
>>> Thanks,
>>> Cham
>>> 
>>> On Fri, Jan 24, 2020 at 9:01 AM Alexey Romanenko wrote:
>>> Hi Ludovic,
>>> 
>>> Thank you for working on this and sharing the details with us. This is 
>>> really great job!
>>> 
>>> As I recall, we already have some support of Elasticsearch7 in current 
>>> ElasticsearchIO (afaik, at least they are compatible), thanks to Zhong Chen 
>>> and Etienne Chauchot, who were working on adding this [1][2] and it should 
>>> be released in Beam 2.19.
>>> 
>>> Would you think you can leverage this in your work on adding new 
>>> Elasticsearch7 features? IMHO, supporting two different related IOs can be 
>>> quite tough task and I‘d rather raise my hand to add a new functionality 
>>> into existing IO than creating a new one, if it’s possible.
>>> 
>>> [1] https://issues.apache.org/jira/browse/BEAM-5192 
>>> 
>>> [2] https://github.com/apache/beam/pull/10433 
>>> 
>>> 
>>>> On 22 Jan 2020, at 19:23, Ludovic Boutros wrote:
>>>> 
>>>> Dear all,
>>>> 
>>>> I have written a 

Re: A new reworked Elasticsearch 7+ IO module

2020-02-05 Thread Etienne Chauchot

Hi all,

We had a long discussion with Ludovic about this IO. I'd like to share 
the outcome to keep you informed and to gather your opinions.


1. Regarding version support: ES v2 has not been maintained by Elastic 
since 2018/02, so we plan to remove it from the IO. We have already 
retired versions in the past (Spark 1.6, for instance).


2. Regarding the user: the aim is to unlock some new features (listed by 
Ludovic) and give users more flexibility in their requests. That 
requires using the high level Java ES client in place of the low level 
REST client (which was used because it is the only one compatible with 
all ES versions). We plan to replace the current API (JSON documents in 
and out) with more complete standard ES objects that contain both the 
request logic (insert/update, doc routing, etc.) and the data. There are 
already IOs, like SpannerIO, that use similar objects in the input 
PCollection rather than pure POJOs.


3. Regarding multiple/single module: the aim is to have a single 
production code base to ease maintenance. The problem is that using the 
high level client makes the code dependent on an ES lib version. We would 
like to make that invisible to users: they should select only one jar, 
and the IO should decide which lib to use behind the scenes. We are 
thinking about using one module with sub-modules per version, plus 
relocation, wrappers and a factory that detects the version the IO 
actually points to in order to instantiate the correct client version. 
It would also require DTOs in the IO, because the high level ES Java 
objects are not exactly the same across ES versions.


4. Regarding tests: the aim is still to target real ES backends to have 
relevant tests (for reasons I already explained in another thread). The 
problem is that the es-test-framework used today is version dependent 
and is a pain to use. We plan on using test containers per version 
(validated by an ES dev advocate) and launching them as part of the 
UTests. Obviously we will launch only one container at a time per 
version and run all the tests against it, to avoid paying the launch 
cost too often. The tests will be shipped in the per-version 
sub-modules and not in dedicated test modules as is the case now.
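
The version-detecting factory from point 3 could look roughly like the sketch below. It is only an illustration under assumptions: `EsClientWrapper`, `Es6ClientWrapper`, `Es7ClientWrapper` and `EsClientFactory` are hypothetical names, not existing Beam classes, and the wrappers are stubs standing in for the relocated per-version clients. Only two versions are shown to keep it short. The one thing taken from Elasticsearch itself is that a GET on the cluster root returns JSON containing a version `number` field.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical version-neutral facade that the IO production code would use. */
interface EsClientWrapper {
  String describe();
}

/** Stub standing in for a wrapper around the relocated ES 6 high level client. */
final class Es6ClientWrapper implements EsClientWrapper {
  @Override
  public String describe() {
    return "ES 6 client";
  }
}

/** Stub standing in for a wrapper around the relocated ES 7 high level client. */
final class Es7ClientWrapper implements EsClientWrapper {
  @Override
  public String describe() {
    return "ES 7 client";
  }
}

/** Hypothetical factory: picks the wrapper matching the version the cluster reports. */
final class EsClientFactory {
  // Captures the major part of a field such as "number" : "7.5.1".
  private static final Pattern VERSION_NUMBER =
      Pattern.compile("\"number\"\\s*:\\s*\"(\\d+)\\.");

  private EsClientFactory() {}

  /** Extracts the major version from the JSON body returned by GET / on the cluster. */
  static int parseMajorVersion(String rootResponseJson) {
    Matcher matcher = VERSION_NUMBER.matcher(rootResponseJson);
    if (!matcher.find()) {
      throw new IllegalArgumentException("No version number found in response");
    }
    return Integer.parseInt(matcher.group(1));
  }

  /** Instantiates the wrapper for a major version, rejecting unsupported ones. */
  static EsClientWrapper forMajorVersion(int majorVersion) {
    switch (majorVersion) {
      case 6:
        return new Es6ClientWrapper();
      case 7:
        return new Es7ClientWrapper();
      default:
        throw new IllegalArgumentException(
            "Unsupported Elasticsearch major version: " + majorVersion);
    }
  }
}
```

With this shape the IO would resolve the wrapper once at setup time, and a cluster reporting an unsupported version (ES 2, for instance) would fail fast with an exception instead of misbehaving later.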


WDYT ?

Best !

Etienne

On 30/01/2020 17:55, Alexey Romanenko wrote:
I’m second for this question. We have a similar (maybe a bit less 
painful) issue for KafkaIO and it would be useful to have a general 
strategy for such cases about how to deal with that.


On 24 Jan 2020, at 21:54, Kenneth Knowles wrote:


Would it make sense to have different version-specialized connectors 
with a common core library and common API package?


On Fri, Jan 24, 2020 at 11:52 AM Chamikara Jayalath wrote:


Thanks for the contribution. I agree with Alexey that we should
try to add any new features brought in with the new PR into
existing connector instead of trying to maintain two
implementations.

Thanks,
Cham

On Fri, Jan 24, 2020 at 9:01 AM Alexey Romanenko wrote:

Hi Ludovic,

Thank you for working on this and sharing the details with
us. This is really great job!

As I recall, we already have some support of Elasticsearch7
in current ElasticsearchIO (afaik, at least they are
compatible), thanks to Zhong Chen and Etienne Chauchot, who
were working on adding this [1][2] and it should be released
in Beam 2.19.

Would you think you can leverage this in your work on adding
new Elasticsearch7 features? IMHO, supporting two different
related IOs can be quite tough task and I‘d rather raise my
hand to add a new functionality into existing IO than
creating a new one, if it’s possible.

[1] https://issues.apache.org/jira/browse/BEAM-5192
[2] https://github.com/apache/beam/pull/10433


On 22 Jan 2020, at 19:23, Ludovic Boutros wrote:

Dear all,

I have written a completely reworked Elasticsearch 7+ IO module.
It can be found here:

https://github.com/ludovic-boutros/beam/tree/fresh-reworked-elasticsearch-io-v7/sdks/java/io/elasticsearch7

This is a quite advance WIP work but I'm a quite new user of
Apache Beam and I would like to get some help on this :)

I can create a JIRA issue now but I prefer to wait for your
wise avises first.

_Why a new module ?_

The current module was compliant with Elasticsearch 2.x, 5.x
and 6.x. This seems to be a good point but so many things
have been changed since Elasticsearch 2.x.



Probably this is not correct anymore due to
https://github.com/apache/beam/pull/10433 ?


Elasticsearch 7.x is now partially supported (document type
are removed, occ, updates...).

A fresh new module, only compliant with the last 

Re: A new reworked Elasticsearch 7+ IO module

2020-01-30 Thread Alexey Romanenko
I second this question. We have a similar (maybe a bit less painful) 
issue with KafkaIO, and it would be useful to have a general strategy for 
dealing with such cases.

> On 24 Jan 2020, at 21:54, Kenneth Knowles  wrote:
> 
> Would it make sense to have different version-specialized connectors with a 
> common core library and common API package?
> 
> On Fri, Jan 24, 2020 at 11:52 AM Chamikara Jayalath wrote:
> Thanks for the contribution. I agree with Alexey that we should try to add 
> any new features brought in with the new PR into existing connector instead 
> of trying to maintain two implementations.  
> 
> Thanks,
> Cham
> 
> On Fri, Jan 24, 2020 at 9:01 AM Alexey Romanenko wrote:
> Hi Ludovic,
> 
> Thank you for working on this and sharing the details with us. This is really 
> great job!
> 
> As I recall, we already have some support of Elasticsearch7 in current 
> ElasticsearchIO (afaik, at least they are compatible), thanks to Zhong Chen 
> and Etienne Chauchot, who were working on adding this [1][2] and it should be 
> released in Beam 2.19.
> 
> Would you think you can leverage this in your work on adding new 
> Elasticsearch7 features? IMHO, supporting two different related IOs can be 
> quite tough task and I‘d rather raise my hand to add a new functionality into 
> existing IO than creating a new one, if it’s possible.
> 
> [1] https://issues.apache.org/jira/browse/BEAM-5192 
> 
> [2] https://github.com/apache/beam/pull/10433 
> 
> 
>> On 22 Jan 2020, at 19:23, Ludovic Boutros wrote:
>> 
>> Dear all,
>> 
>> I have written a completely reworked Elasticsearch 7+ IO module.
>> It can be found here: 
>> https://github.com/ludovic-boutros/beam/tree/fresh-reworked-elasticsearch-io-v7/sdks/java/io/elasticsearch7
>>  
>> 
>> 
>> This is a quite advance WIP work but I'm a quite new user of Apache Beam and 
>> I would like to get some help on this :)
>> 
>> I can create a JIRA issue now but I prefer to wait for your wise avises 
>> first.
>> 
>> Why a new module ?
>> 
>> The current module was compliant with Elasticsearch 2.x, 5.x and 6.x. This 
>> seems to be a good point but so many things have been changed since 
>> Elasticsearch 2.x.
> 
> 
> Probably this is not correct anymore due to 
> https://github.com/apache/beam/pull/10433 ?
>  
>> Elasticsearch 7.x is now partially supported (document type are removed, 
>> occ, updates...).
>> 
>> A fresh new module, only compliant with the last version of Elasticsearch, 
>> can easily benefit a lot from the last evolutions of Elasticsearch (Java 
>> High Level Http Client).
>> 
>> It is therefore far simpler than the current one.
>> 
>> Error management
>> 
>> Currently, errors are caught and transformed into simple exceptions. This is 
>> not always what is needed. If we would like to do specific processing on 
>> these errors (send documents in error topics for instance), it is not 
>> possible with the current module.
> 
> 
> Seems like this is some sort of a dead letter queue implementation.. This 
> will be a very good feature to add to the existing connector.
>  
>> 
>> Philosophy
>> 
>> This module directly uses the Elasticsearch Java client classes as inputs 
>> and outputs. 
>> 
>> This way you can configure any options you need directly in the 
>> `DocWriteRequest` objects.
>> 
>> For instance: 
>> - If you need to use external versioning 
>> (https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#index-versioning),
>> you can.
>> - If you need to use an ingest pipelines, you can.
>> - If you need to configure an update document/script, you can.
>> - If you need to use upserts, you can.
>> 
>> Actually, you should be able to do everything you can do directly with 
>> Elasticsearch.
>> 
>> Furthermore, it should be easier to keep updating the module with future 
>> Elasticsearch evolutions.
>> 
>> Write outputs
>> 
>> Two outputs are available:
>> - Successful indexing output ;
>> - Failed indexing output.
>> 
>> They are available in a `WriteResult` object.
>> 
>> These two outputs are represented by 
>> `PCollection<BulkItemResponseContainer>` objects.
>> 
>> A `BulkItemResponseContainer` contains:
>> - the original index request ;
>> - the Elasticsearch response ;
>> - a batch id.
>> 
>> You can apply any process afterwards (reprocessing, alerting, ...).
>> 
>> Read input
>> 
>> You can read documents from Elasticsearch with this module.
>> You can specify a `QueryBuilder` in order to filter the retrieved documents.
>> By default, it retrieves the whole document collection.

Re: A new reworked Elasticsearch 7+ IO module

2020-01-30 Thread Etienne Chauchot

Hi Ludovic,

First of all thanks for your work.

Then, please be aware that the current ES IO on master supports ES7 
already and will be part of Beam 2.19.


I understand that your approach enables many new features which is great !

For the record, the current ES module was designed to have only one 
production code base (only the test modules differ among the versions, 
because they use an embedded ES):


- there are some if(version) checks, but not that many.

- we use the low level REST client because it is the only one that is 
compatible with all the versions of ES.


My only concern is to reduce the maintenance burden. Having 2 modules, 
(v2, v5, v6, v7) and (v7), looks difficult to maintain. It can be 
done for a limited period of time by only doing bug fixes on the first 
module, as you suggest, but pretty quickly we would need to have only 
one module. You mentioned that you tried supporting these new features 
in the current module (which would not be maintainable because of the 
different ES classes), but have you tried supporting v2, v5, v6 and v7 of 
ES in your new module with the high level client, to enable the new 
features and still support all the versions with one production 
code base? Would it be feasible without ending up with spaghetti code? 
Because at some point, as you mention, we will end up retiring the old 
module (at a major version) and thus we would still need to support 
older versions of ES.


Regarding the question on the MIT license: it is a category A license, 
which can be included in ASF projects.


Best,

Etienne


On 25/01/2020 14:23, Ludovic Boutros wrote:

Hi all,

First, thank you for your great answers.
I thank Zhong Chen and Etienne Chauchot for their great job on this too !

Alexey and Chamikara, I understand your point of view.
Actually, I have the same as much as possible.

But in this case, my goal was to be able to do all the following 
things in a Beam pipeline with Elasticsearch:


- be compliant with Elasticsearch 7.x (as the current one now) ;
- be able to retrieve errors in order to some processing on them ;
- be able to (atomic) update (with scripts) or delete documents ;
- be able to manage Optimistic Concurrency Control ;
- be able to manage document versioning ;
- be able to test the module easily, even with SSL and so on (I'm 
using Testcontainers, is the MIT Licence compliant with the Apache one 
?) ;

- and even more.

I can insure you that I first tried to implement all these functions 
in the current module, but, because it tries to be compliant with 
Elasticsearch 2.x-7.x (it's already a big challenge ;)), the result 
would have been like a spaghetti plate with quite a lot of 
"if-then-else" blocs everywhere.


And finally, I really don't think this is the way to go.

I'm playing a lot with another Apache project, Apache Camel.
And I think Apache Camel is the Master of Component Management :)

What they'are doing in this case is exactly what I would like to 
achieve here.
They keep the old one, implement a new one which support newer 
versions in order to keep the module maintenable, precisely.


Both module supports Elasticsearch 7.x, but Elasticsearch 8 will be 
released in a few months with document type removal and more breaking 
changes. It will be harder and harder to maintain it.


What I would like to propose is:
- as Kenneth said, we can keep both modules (and I think Elasticsearch 
deserves it ;)) ;
- current users, can keep using the old one and can migrate or not to 
the new one ;
- evolutions and migrations should be done on the new one which 
directly uses the Elasticsearch classes ;

- only bug fixes are done on the old one.

Apache Camel does switch the modules really later on. Basically only 
on major releases (you can check the release note of the 3.0 release, 
see the mongoDb part).


And again, this is a work in progress, I would be glad and I will keep 
improving it (more documentations for instance).
I'm really happy to share this with you and I will take in account 
each remark.


Thank you again and have a good week-end,

Ludovic.




On Fri, Jan 24, 2020 at 21:54, Kenneth Knowles wrote:


Would it make sense to have different version-specialized
connectors with a common core library and common API package?

On Fri, Jan 24, 2020 at 11:52 AM Chamikara Jayalath wrote:

Thanks for the contribution. I agree with Alexey that we
should try to add any new features brought in with the new PR
into existing connector instead of trying to maintain two
implementations.

Thanks,
Cham

On Fri, Jan 24, 2020 at 9:01 AM Alexey Romanenko wrote:

Hi Ludovic,

Thank you for working on this and sharing the details with
us. This is really great job!

As I recall, we already have some support
of Elasticsearch7 in current 

Re: A new reworked Elasticsearch 7+ IO module

2020-01-25 Thread Ludovic Boutros
Hi all,

First, thank you for your great answers.
I also thank Zhong Chen and Etienne Chauchot for their great work on this!

Alexey and Chamikara, I understand your point of view.
Actually, I share it as much as possible.

But in this case, my goal was to be able to do all of the following in
a Beam pipeline with Elasticsearch:

- be compliant with Elasticsearch 7.x (as the current module now is);
- be able to retrieve errors in order to do some processing on them;
- be able to update (atomically, with scripts) or delete documents;
- be able to manage Optimistic Concurrency Control;
- be able to manage document versioning;
- be able to test the module easily, even with SSL and so on (I'm using
Testcontainers; is the MIT License compatible with the Apache one?);
- and even more.

I can assure you that I first tried to implement all these functions in the
current module but, because it tries to be compliant with Elasticsearch
2.x-7.x (which is already a big challenge ;)), the result would have been
a plate of spaghetti with quite a lot of "if-then-else" blocks everywhere.

And in the end, I really don't think this is the way to go.

I'm playing a lot with another Apache project, Apache Camel.
And I think Apache Camel is the Master of Component Management :)

What they are doing in this case is exactly what I would like to achieve
here.
They keep the old component and implement a new one which supports newer
versions, precisely in order to keep the module maintainable.

Both modules support Elasticsearch 7.x, but Elasticsearch 8 will be
released in a few months with document type removal and more breaking
changes. The current module will become harder and harder to maintain.

What I would like to propose is:
- as Kenneth said, we can keep both modules (and I think Elasticsearch
deserves it ;));
- current users can keep using the old one and can migrate to the new
one or not;
- evolutions and migrations should be done on the new one, which directly
uses the Elasticsearch classes;
- only bug fixes are done on the old one.

Apache Camel switches modules much later, basically only on
major releases (you can check the release notes of the 3.0 release; see the
MongoDB part).

And again, this is a work in progress; I would be glad to keep
improving it (more documentation, for instance).
I'm really happy to share this with you and I will take each remark into
account.

Thank you again and have a good week-end,

Ludovic.




On Fri, Jan 24, 2020 at 21:54, Kenneth Knowles wrote:

> Would it make sense to have different version-specialized connectors with
> a common core library and common API package?
>
> On Fri, Jan 24, 2020 at 11:52 AM Chamikara Jayalath 
> wrote:
>
>> Thanks for the contribution. I agree with Alexey that we should try to
>> add any new features brought in with the new PR into existing connector
>> instead of trying to maintain two implementations.
>>
>> Thanks,
>> Cham
>>
>> On Fri, Jan 24, 2020 at 9:01 AM Alexey Romanenko <
>> aromanenko@gmail.com> wrote:
>>
>>> Hi Ludovic,
>>>
>>> Thank you for working on this and sharing the details with us. This is
>>> really great job!
>>>
>>> As I recall, we already have some support of Elasticsearch7 in current
>>> ElasticsearchIO (afaik, at least they are compatible), thanks to Zhong Chen
>>> and Etienne Chauchot, who were working on adding this [1][2] and it should
>>> be released in Beam 2.19.
>>>
>>> Would you think you can leverage this in your work on adding new
>>> Elasticsearch7 features? IMHO, supporting two different related IOs can be
>>> quite tough task and I‘d rather raise my hand to add a new functionality
>>> into existing IO than creating a new one, if it’s possible.
>>>
>>> [1] https://issues.apache.org/jira/browse/BEAM-5192
>>> [2] https://github.com/apache/beam/pull/10433
>>>
>>> On 22 Jan 2020, at 19:23, Ludovic Boutros  wrote:
>>>
>>> Dear all,
>>>
>>> I have written a completely reworked Elasticsearch 7+ IO module.
>>> It can be found here:
>>> https://github.com/ludovic-boutros/beam/tree/fresh-reworked-elasticsearch-io-v7/sdks/java/io/elasticsearch7
>>>
>>> This is a quite advance WIP work but I'm a quite new user of Apache Beam
>>> and I would like to get some help on this :)
>>>
>>> I can create a JIRA issue now but I prefer to wait for your wise avises
>>> first.
>>>
>>> *Why a new module ?*
>>>
>>> The current module was compliant with Elasticsearch 2.x, 5.x and 6.x.
>>> This seems to be a good point but so many things have been changed since
>>> Elasticsearch 2.x.
>>>
>>>
>> Probably this is not correct anymore due to
>> https://github.com/apache/beam/pull/10433 ?
>>
>>
>>> Elasticsearch 7.x is now partially supported (document type are removed,
>>> occ, updates...).
>>>
>>> A fresh new module, only compliant with the last version of
>>> Elasticsearch, can easily benefit a lot from the last evolutions of
>>> Elasticsearch (Java High Level Http Client).
>>>
>>> It is therefore far simpler than the current one.
>>>
>>> *Error 

Re: A new reworked Elasticsearch 7+ IO module

2020-01-24 Thread Kenneth Knowles
Would it make sense to have different version-specialized connectors with a
common core library and common API package?
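
One way to picture that split is the minimal plain-Java sketch below. Everything in it is an assumption made for illustration: `BulkBodyBuilder` plays the role of the common API package, `AbstractBulkBodyBuilder` the common core library, and the two subclasses the version-specialized connectors. None of these names exist in Beam. The version-specific part shown is the bulk action line, which carries a document type on ES 6 and none on ES 7.

```java
import java.util.List;

/** Hypothetical common API package: what every version-specialized connector exposes. */
interface BulkBodyBuilder {
  String toBulkBody(String index, List<String> jsonDocuments);
}

/** Hypothetical common core library: logic shared by all versions. */
abstract class AbstractBulkBodyBuilder implements BulkBodyBuilder {
  @Override
  public final String toBulkBody(String index, List<String> jsonDocuments) {
    StringBuilder body = new StringBuilder();
    for (String document : jsonDocuments) {
      // The bulk format is newline-delimited: one action line, then the document source.
      body.append(actionLine(index)).append('\n').append(document).append('\n');
    }
    return body.toString();
  }

  /**
   * The only version-specific part in this sketch.
   * The index name is assumed not to need JSON escaping.
   */
  protected abstract String actionLine(String index);
}

/** Version-specialized connector for ES 6, where the action line carries a type. */
final class Es6BulkBodyBuilder extends AbstractBulkBodyBuilder {
  @Override
  protected String actionLine(String index) {
    return "{\"index\":{\"_index\":\"" + index + "\",\"_type\":\"_doc\"}}";
  }
}

/** Version-specialized connector for ES 7, where document types are removed. */
final class Es7BulkBodyBuilder extends AbstractBulkBodyBuilder {
  @Override
  protected String actionLine(String index) {
    return "{\"index\":{\"_index\":\"" + index + "\"}}";
  }
}
```

The trade-off is that users pick a connector per ES version, while batching and formatting logic stays in one place.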

On Fri, Jan 24, 2020 at 11:52 AM Chamikara Jayalath 
wrote:

> Thanks for the contribution. I agree with Alexey that we should try to add
> any new features brought in with the new PR into existing connector instead
> of trying to maintain two implementations.
>
> Thanks,
> Cham
>
> On Fri, Jan 24, 2020 at 9:01 AM Alexey Romanenko 
> wrote:
>
>> Hi Ludovic,
>>
>> Thank you for working on this and sharing the details with us. This is
>> really great job!
>>
>> As I recall, we already have some support of Elasticsearch7 in current
>> ElasticsearchIO (afaik, at least they are compatible), thanks to Zhong Chen
>> and Etienne Chauchot, who were working on adding this [1][2] and it should
>> be released in Beam 2.19.
>>
>> Would you think you can leverage this in your work on adding new
>> Elasticsearch7 features? IMHO, supporting two different related IOs can be
>> quite tough task and I‘d rather raise my hand to add a new functionality
>> into existing IO than creating a new one, if it’s possible.
>>
>> [1] https://issues.apache.org/jira/browse/BEAM-5192
>> [2] https://github.com/apache/beam/pull/10433
>>
>> On 22 Jan 2020, at 19:23, Ludovic Boutros  wrote:
>>
>> Dear all,
>>
>> I have written a completely reworked Elasticsearch 7+ IO module.
>> It can be found here:
>> https://github.com/ludovic-boutros/beam/tree/fresh-reworked-elasticsearch-io-v7/sdks/java/io/elasticsearch7
>>
>> This is a quite advance WIP work but I'm a quite new user of Apache Beam
>> and I would like to get some help on this :)
>>
>> I can create a JIRA issue now but I prefer to wait for your wise avises
>> first.
>>
>> *Why a new module ?*
>>
>> The current module was compliant with Elasticsearch 2.x, 5.x and 6.x.
>> This seems to be a good point but so many things have been changed since
>> Elasticsearch 2.x.
>>
>>
> Probably this is not correct anymore due to
> https://github.com/apache/beam/pull/10433 ?
>
>
>> Elasticsearch 7.x is now partially supported (document type are removed,
>> occ, updates...).
>>
>> A fresh new module, only compliant with the last version of
>> Elasticsearch, can easily benefit a lot from the last evolutions of
>> Elasticsearch (Java High Level Http Client).
>>
>> It is therefore far simpler than the current one.
>>
>> *Error management*
>>
>> Currently, errors are caught and transformed into simple exceptions. This
>> is not always what is needed. If we would like to do specific processing on
>> these errors (send documents in error topics for instance), it is not
>> possible with the current module.
>>
>>
> Seems like this is some sort of a dead letter queue implementation.. This
> will be a very good feature to add to the existing connector.
>
>
>>
>> *Philosophy*
>>
>> This module directly uses the Elasticsearch Java client classes as inputs
>> and outputs.
>>
>> This way you can configure any options you need directly in the
>> `DocWriteRequest` objects.
>>
>> For instance:
>> - If you need to use external versioning (
>> https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#index-versioning),
>> you can.
>> - If you need to use an ingest pipelines, you can.
>> - If you need to configure an update document/script, you can.
>> - If you need to use upserts, you can.
>>
>> Actually, you should be able to do everything you can do directly with
>> Elasticsearch.
>>
>> Furthermore, it should be easier to keep updating the module with future
>> Elasticsearch evolutions.
>>
>> *Write outputs*
>>
>> Two outputs are available:
>> - Successful indexing output ;
>> - Failed indexing output.
>>
>> They are available in a `WriteResult` object.
>>
>> These two outputs are represented by
>> `PCollection<BulkItemResponseContainer>` objects.
>>
>> A `BulkItemResponseContainer` contains:
>> - the original index request ;
>> - the Elasticsearch response ;
>> - a batch id.
>>
>> You can apply any process afterwards (reprocessing, alerting, ...).
>>
>> *Read input*
>>
>> You can read documents from Elasticsearch with this module.
>> You can specify a `QueryBuilder` in order to filter the retrieved
>> documents.
>> By default, it retrieves the whole document collection.
>>
>> If the Elasticsearch index is sharded, multiple slices can be used during
>> fetch. That many bundles are created. The maximum bundle count is equal to
>> the index shard count.
>>
>> Thank you !
>>
>> Ludovic
>>
>>
>>


Re: A new reworked Elasticsearch 7+ IO module

2020-01-24 Thread Chamikara Jayalath
Thanks for the contribution. I agree with Alexey that we should try to add
any new features brought in with the new PR into existing connector instead
of trying to maintain two implementations.

Thanks,
Cham

On Fri, Jan 24, 2020 at 9:01 AM Alexey Romanenko 
wrote:

> Hi Ludovic,
>
> Thank you for working on this and sharing the details with us. This is
> really great job!
>
> As I recall, we already have some support of Elasticsearch7 in current
> ElasticsearchIO (afaik, at least they are compatible), thanks to Zhong Chen
> and Etienne Chauchot, who were working on adding this [1][2] and it should
> be released in Beam 2.19.
>
> Would you think you can leverage this in your work on adding new
> Elasticsearch7 features? IMHO, supporting two different related IOs can be
> quite tough task and I‘d rather raise my hand to add a new functionality
> into existing IO than creating a new one, if it’s possible.
>
> [1] https://issues.apache.org/jira/browse/BEAM-5192
> [2] https://github.com/apache/beam/pull/10433
>
> On 22 Jan 2020, at 19:23, Ludovic Boutros  wrote:
>
> Dear all,
>
> I have written a completely reworked Elasticsearch 7+ IO module.
> It can be found here:
> https://github.com/ludovic-boutros/beam/tree/fresh-reworked-elasticsearch-io-v7/sdks/java/io/elasticsearch7
>
> This is quite an advanced work in progress, but I'm quite a new user of
> Apache Beam and I would like to get some help on this :)
>
> I can create a JIRA issue now but I would prefer to wait for your wise
> advice first.
>
> *Why a new module ?*
>
> The current module was compliant with Elasticsearch 2.x, 5.x and 6.x. This
> seems to be a good point but so many things have been changed since
> Elasticsearch 2.x.
>
>
Probably this is not correct anymore due to
https://github.com/apache/beam/pull/10433 ?


> Elasticsearch 7.x is now partially supported (document type are removed,
> occ, updates...).
>
> A fresh new module, only compliant with the last version of Elasticsearch,
> can easily benefit a lot from the last evolutions of Elasticsearch (Java
> High Level Http Client).
>
> It is therefore far simpler than the current one.
>
> *Error management*
>
> Currently, errors are caught and transformed into simple exceptions. This
> is not always what is needed. If we would like to do specific processing on
> these errors (send documents in error topics for instance), it is not
> possible with the current module.
>
>
Seems like this is some sort of dead letter queue implementation. This
will be a very good feature to add to the existing connector.
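
To make the dead-letter idea concrete, here is a minimal plain-Java sketch with no Beam or Elasticsearch dependency. `ItemResult` and `PartitionedResults` are hypothetical names, loosely modeled on the container described in the proposal (original request, response, batch id), and the `failed` flag is an assumption about how a per-document error would be exposed. In an actual Beam transform the same routing would more likely be done with a multi-output ParDo.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical per-document result, loosely modeled on the described container. */
final class ItemResult {
  final String originalRequest;
  final String response;
  final boolean failed;
  final String batchId;

  ItemResult(String originalRequest, String response, boolean failed, String batchId) {
    this.originalRequest = originalRequest;
    this.response = response;
    this.failed = failed;
    this.batchId = batchId;
  }
}

/** Hypothetical holder of the two outputs: successes and the dead-letter side. */
final class PartitionedResults {
  final List<ItemResult> successful;
  final List<ItemResult> failed;

  private PartitionedResults(List<ItemResult> successful, List<ItemResult> failed) {
    this.successful = Collections.unmodifiableList(successful);
    this.failed = Collections.unmodifiableList(failed);
  }

  /** Routes each item to one of the two outputs instead of throwing on the first error. */
  static PartitionedResults partition(List<ItemResult> items) {
    List<ItemResult> ok = new ArrayList<>();
    List<ItemResult> ko = new ArrayList<>();
    for (ItemResult item : items) {
      (item.failed ? ko : ok).add(item);
    }
    return new PartitionedResults(ok, ko);
  }
}
```

Downstream code can then reprocess or alert on the `failed` list while the `successful` list continues through the pipeline.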


>
> *Philosophy*
>
> This module directly uses the Elasticsearch Java client classes as inputs
> and outputs.
>
> This way you can configure any options you need directly in the
> `DocWriteRequest` objects.
>
> For instance:
> - If you need to use external versioning (
> https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#index-versioning),
> you can.
> - If you need to use an ingest pipelines, you can.
> - If you need to configure an update document/script, you can.
> - If you need to use upserts, you can.
>
> Actually, you should be able to do everything you can do directly with
> Elasticsearch.
>
> Furthermore, it should be easier to keep updating the module with future
> Elasticsearch evolutions.
>
> *Write outputs*
>
> Two outputs are available:
> - Successful indexing output ;
> - Failed indexing output.
>
> They are available in a `WriteResult` object.
>
> These two outputs are represented by
> `PCollection<BulkItemResponseContainer>` objects.
>
> A `BulkItemResponseContainer` contains:
> - the original index request ;
> - the Elasticsearch response ;
> - a batch id.
>
> You can apply any process afterwards (reprocessing, alerting, ...).
>
> *Read input*
>
> You can read documents from Elasticsearch with this module.
> You can specify a `QueryBuilder` in order to filter the retrieved
> documents.
> By default, it retrieves the whole document collection.
>
> If the Elasticsearch index is sharded, multiple slices can be used during
> fetch. That many bundles are created. The maximum bundle count is equal to
> the index shard count.
>
> Thank you !
>
> Ludovic
>
>
>


Re: A new reworked Elasticsearch 7+ IO module

2020-01-24 Thread Alexey Romanenko
Hi Ludovic,

Thank you for working on this and sharing the details with us. This is a really 
great job!

As I recall, we already have some support for Elasticsearch 7 in the current 
ElasticsearchIO (afaik, at least they are compatible), thanks to Zhong Chen and 
Etienne Chauchot, who worked on adding this [1][2], and it should be 
released in Beam 2.19.

Do you think you could leverage this in your work on adding new Elasticsearch 7 
features? IMHO, supporting two different related IOs can be quite a tough task, 
and I'd rather vote for adding new functionality to the existing IO than 
creating a new one, if that's possible.

[1] https://issues.apache.org/jira/browse/BEAM-5192 

[2] https://github.com/apache/beam/pull/10433 


> On 22 Jan 2020, at 19:23, Ludovic Boutros wrote:
> 
> Dear all,
> 
> I have written a completely reworked Elasticsearch 7+ IO module.
> It can be found here: 
> https://github.com/ludovic-boutros/beam/tree/fresh-reworked-elasticsearch-io-v7/sdks/java/io/elasticsearch7
> 
> This is a fairly advanced WIP, but I'm quite a new user of Apache Beam and 
> I would like to get some help on this :)
> 
> I can create a JIRA issue now, but I prefer to wait for your wise advice first.
> 
> Why a new module ?
> 
> The current module was compliant with Elasticsearch 2.x, 5.x and 6.x. This 
> seems to be a good point, but so many things have changed since 
> Elasticsearch 2.x.
> Elasticsearch 7.x is now partially supported (document types are removed, OCC, 
> updates...).
> 
> A fresh new module, only compliant with the latest version of Elasticsearch, 
> can easily benefit a lot from the latest evolutions of Elasticsearch (Java High 
> Level REST Client).
> 
> It is therefore far simpler than the current one.
> 
> Error management
> 
> Currently, errors are caught and transformed into simple exceptions. This is 
> not always what is needed. If we would like to do specific processing on 
> these errors (sending failed documents to error topics, for instance), it is 
> not possible with the current module.
> 
> Philosophy
> 
> This module directly uses the Elasticsearch Java client classes as inputs and 
> outputs. 
> 
> This way you can configure any options you need directly in the 
> `DocWriteRequest` objects.
> 
> For instance: 
> - If you need to use external versioning 
> (https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#index-versioning),
> you can.
> - If you need to use an ingest pipeline, you can.
> - If you need to configure an update document/script, you can.
> - If you need to use upserts, you can.
> 
> Actually, you should be able to do everything you can do directly with 
> Elasticsearch.
> 
> Furthermore, it should be easier to keep updating the module with future 
> Elasticsearch evolutions.
> 
> Write outputs
> 
> Two outputs are available:
> - Successful indexing output ;
> - Failed indexing output.
> 
> They are available in a `WriteResult` object.
> 
> These two outputs are represented by `PCollection<BulkItemResponseContainer>` 
> objects.
> 
> A `BulkItemResponseContainer` contains:
> - the original index request ;
> - the Elasticsearch response ;
> - a batch id.
> 
> You can apply any process afterwards (reprocessing, alerting, ...).
> 
> Read input
> 
> You can read documents from Elasticsearch with this module.
> You can specify a `QueryBuilder` in order to filter the retrieved documents.
> By default, it retrieves the whole document collection.
> 
> If the Elasticsearch index is sharded, multiple slices can be used during 
> the fetch, and one bundle is created per slice. The maximum bundle count is 
> equal to the index shard count.
> 
> Thank you !
> 
> Ludovic
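
The slicing rule in the "Read input" part of the quoted proposal (one bundle per slice, capped at the index shard count) can be sketched in plain Java as follows. This is only an illustration under assumptions: `SliceSketch`, `effectiveSlices` and `sliceBodies` are hypothetical names, and the sketch assumes the module relies on Elasticsearch's sliced scroll, whose request body carries a `slice` clause with `id` and `max`. The proposal itself does not say how slices are implemented.

```java
import java.util.ArrayList;
import java.util.List;

public class SliceSketch {

    /** Caps the requested slice count at the shard count; never returns less than 1. */
    static int effectiveSlices(int requestedSlices, int shardCount) {
        return Math.max(1, Math.min(requestedSlices, shardCount));
    }

    /**
     * Builds one search body per slice (and therefore per bundle).
     * With a single slice no "slice" clause is emitted, because sliced
     * scroll only makes sense with max greater than 1.
     */
    static List<String> sliceBodies(int requestedSlices, int shardCount) {
        int max = effectiveSlices(requestedSlices, shardCount);
        List<String> bodies = new ArrayList<>();
        if (max == 1) {
            bodies.add("{}");
            return bodies;
        }
        for (int id = 0; id < max; id++) {
            bodies.add("{\"slice\":{\"id\":" + id + ",\"max\":" + max + "}}");
        }
        return bodies;
    }

    public static void main(String[] args) {
        // 8 slices requested on a 3-shard index: capped at 3 bundles.
        for (String body : sliceBodies(8, 3)) {
            System.out.println(body);
        }
    }
}
```

A filter built with a `QueryBuilder` would be added as the `query` part of each of these bodies, so every slice reads a disjoint subset of the filtered documents.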