Re: [DISCUSS] Autoformat python code with Black

2020-01-24 Thread Kamil Wasilewski
PR is ready: https://github.com/apache/beam/pull/10684. Please share your
comments ;-) I've managed to reduce the impact a bit:
501 files changed, 18245 insertions(+), 19495 deletions(-)

We still need to consider how to enforce the use of the autoformatter.
A pre-commit hook sounds like a nice addition, but it still needs to be
installed manually by each developer. On the other hand, a Jenkins precommit
job that fails if any unformatted code is detected seems too strict. What do
you think?
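[Editor's note: for concreteness, the developer-side option could look like the following `.pre-commit-config.yaml` sketch. This is hypothetical; the hook repo and `rev` are illustrative, assuming the pre-commit framework's yapf mirror hook.]

```yaml
# Hypothetical .pre-commit-config.yaml sketch; repo and rev are illustrative.
repos:
  - repo: https://github.com/pre-commit/mirrors-yapf
    rev: v0.29.0
    hooks:
      - id: yapf
        files: ^sdks/python/
```

Note that each developer still has to run `pre-commit install` once per clone, which is exactly the manual step the thread weighs against a CI-side check.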

On Thu, Jan 23, 2020 at 8:37 PM Robert Bradshaw  wrote:

> Thanks! Now we get to debate what knobs to twiddle :-P
>
> FYI, I did a simple run (just pushed to
> https://github.com/apache/beam/compare/master...robertwb:yapf) to see
> the impact. The diff is
>
> $ git diff --stat master
> ...
>  547 files changed, 22118 insertions(+), 21129 deletions(-)
>
> For reference
>
> $ find sdks/python/apache_beam -name '*.py' | xargs wc
> ...
> 200424  612002 7431637 total
>
> which means a little over 10% of lines get touched. I think there are
> some options, such as SPLIT_ALL_TOP_LEVEL_COMMA_SEPARATED_VALUES and
> COALESCE_BRACKETS, that will conform more to the style we are already
> (mostly) following.
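[Editor's note: the knobs Robert names would live in a yapf style file. A hypothetical `.style.yapf` sketch follows; the option values are illustrative, not a decided configuration.]

```ini
# Hypothetical .style.yapf sketch; values are illustrative, not decided.
[style]
based_on_style = chromium  # 2-space indentation, closest to Beam's style
indent_width = 2
continuation_indent_width = 4
split_all_top_level_comma_separated_values = true
coalesce_brackets = true
```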
>
>
> On Thu, Jan 23, 2020 at 1:59 AM Kamil Wasilewski
>  wrote:
> >
> > Thank you Michał for creating the ticket. I have some free time and I'd
> like to volunteer myself for this task.
> > Indeed, it looks like there's consensus for `yapf`, so I'll try `yapf`
> first.
> >
> > Best,
> > Kamil
> >
> >
> > On Thu, Jan 23, 2020 at 10:37 AM Michał Walenia <
> michal.wale...@polidea.com> wrote:
> >>
> >> Hi all,
> >> I created a JIRA issue for this and summarized the available tools
> >>
> >> https://issues.apache.org/jira/browse/BEAM-9175
> >>
> >> Cheers,
> >> Michal
> >>
> >> On Thu, Jan 23, 2020 at 1:49 AM Udi Meiri  wrote:
> >>>
> >>> Sorry, backing off on this due to time constraints.
> >>>
> >>> On Wed, Jan 22, 2020 at 3:39 PM Udi Meiri  wrote:
> 
>  It sounds like there's a consensus for yapf. I volunteer to take this
> on
> 
>  On Wed, Jan 22, 2020, 10:31 Udi Meiri  wrote:
> >
> > +1 to autoformatting
> >
> > On Wed, Jan 22, 2020 at 9:57 AM Luke Cwik  wrote:
> >>
> >> +1 to autoformatters. Also the Beam Java SDK went through a one
> time pass to apply the spotless formatting.
> >>
> >> On Tue, Jan 21, 2020 at 9:52 PM Ahmet Altay 
> wrote:
> >>>
> >>> +1 to autoformatters and yapf. It appears to be a well maintained
> project. I do support making a one time pass to apply formatting the whole
> code base.
> >>>
> >>> On Tue, Jan 21, 2020 at 5:38 PM Chad Dombrova 
> wrote:
> >
> >
> > It'd be good if there was a way to only apply to violating (or at
> > least changed) lines.
> 
> 
>  I assumed the first thing we’d do is convert all of the code in
> one go, since it’s a very safe operation. Did you have something else in
> mind?
> 
>  -chad
> 
> 
> 
> 
> >
> >
> > On Tue, Jan 21, 2020 at 1:56 PM Chad Dombrova 
> wrote:
> > >
> > > +1 to autoformatting
> > >
> > > Let me add some nuance to that.
> > >
> > > The way I see it there are 2 varieties of formatters:  those
> which take the original formatting into consideration (autopep8) and those
> which disregard it (yapf, black).
> > >
> > > I much prefer yapf to black, because you have plenty of
> options to tweak with yapf (enough to make the output a pretty close match
> to the current Beam style), and you can mark areas to preserve the original
> formatting, which could be very useful with Pipeline building with pipe
> operators.  Please don't pick black.
> > >
> > > autopep8 is more along the lines of spotless in Java -- it
> only corrects code that breaks the project's style rules.  The big problem
> with Beam's current style is that it is so esoteric that autopep8 can't
> enforce it -- and I'm not just talking about 2-spaces, which I don't really
> have a problem with -- the problem is the use of either 2 or 4 spaces
> depending on context (expression start vs hanging indent, etc).  This is my
> *biggest* gripe about the current style.  PyCharm doesn't have enough
> control either.  So, if we can choose a style that can be expressed by
> flake8 or pycodestyle then we can use autopep8 to enforce it.
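[Editor's note: a minimal, hypothetical snippet, not code from the thread, illustrating the mixed-indent convention Chad describes: 2-space indentation for blocks combined with 4-space hanging indents for continuations.]

```python
# Hypothetical illustration of the mixed-indent style described above:
# 2-space indentation for code blocks, 4-space hanging indents for
# continuation lines inside brackets.


def compute_totals(values, multiplier=2):
  totals = []  # 2-space block indent
  for value in values:
    totals.append(
        value * multiplier)  # 4-space hanging indent for the continuation
  return totals


print(compute_totals([1, 2, 3]))  # → [2, 4, 6]
```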
> > >
> > > I'd prefer autopep8 to yapf because I like having a little
> wiggle room to influence the style, but on a big project like Beam all that
> wiggle room ends up to minor but noticeable inconsistencies in style
> throughout the project.  yapf ensures completely consistent style, but the
> tradeoff is that it's sometimes ugly, especially in scenarios with similar
> repeated entries like argparse, where yapf might insert line breaks in
> visually inconsistent and unappealin

Re: [DISCUSS] Autoformat python code with Black

2020-01-24 Thread Ismaël Mejía
The Java build fails on any unformatted code, so the Python build should
probably work the same way. We have to ensure, however, that it fails early.
As Robert said, time to debate the knobs :)


Re: Dynamic timers now supported!

2020-01-24 Thread Ismaël Mejía
This looks great, thanks for the contribution Rehman!

I have some questions (note I have not looked at the code at all).

- Is this working for both portable and non-portable runners?
- What do other runners need to implement to support this (e.g. Spark)?

It might be worth adding this to the website's Compatibility Matrix.

Regards,
Ismaël


On Fri, Jan 24, 2020 at 8:42 AM Rehman Murad Ali <
rehman.murad...@venturedive.com> wrote:

> Thank you Reuven for the guidance throughout the development process. I am
> delighted to contribute my two cents to the Beam project.
>
> Looking forward to more active contributions.
>
>
> *Thanks & Regards*
>
>
>
> *Rehman Murad Ali*
> Software Engineer
> Mobile: +92 3452076766
> Skype: rehman.muradali
>
>
> On Thu, Jan 23, 2020 at 11:09 PM Reuven Lax  wrote:
>
>> Thanks to a lot of hard work by Rehman, Beam now supports dynamic timers.
>> As a reminder, this was discussed on the dev list some time back.
>>
>> As background, previously one had to statically declare all timers in
>> your code. So if you wanted to have two timers, you needed to create two
>> timer variables and two callbacks - one for each timer. A number of users
>> kept hitting stumbling blocks where they needed a dynamic set of timers
>> (often based on the element), which was not supported in Beam. The
>> workarounds were quite ugly and complicated.
>>
>> The new support allows declaring a TimerMap, which is a map of timers.
>> Each TimerMap is scoped by a family name, so you can create multiple
>> TimerMaps each with its own callback. The use looks as follows:
>>
>> class MyDoFn extends DoFn<...> {
>>   @TimerFamily("timers")
>>   private final TimerSpec timerMap =
>>       TimerSpecs.timerMap(TimeDomain.EVENT_TIME);
>>
>>   @ProcessElement
>>   public void process(
>>       @TimerFamily("timers") TimerMap timers, @Element Type e) {
>>     timers.set("mainTimer", timestamp);
>>     timers.set("actionType" + e.getActionType(), timestamp);
>>   }
>>
>>   @OnTimerFamily("timers")
>>   public void onTimer(@TimerId String timerId) {
>>     System.out.println("Timer fired. id: " + timerId);
>>   }
>> }
>>
>> This currently works for the Flink and the Dataflow runners.
>>
>> Thank you Rehman for getting this done! Beam users will find it very
>> valuable.
>>
>> Reuven
>>
>


Re: A new reworked Elasticsearch 7+ IO module

2020-01-24 Thread Alexey Romanenko
Hi Ludovic,

Thank you for working on this and sharing the details with us. This is really
a great job!

As I recall, we already have some support for Elasticsearch 7 in the current
ElasticsearchIO (afaik, at least they are compatible), thanks to Zhong Chen and
Etienne Chauchot, who were working on adding this [1][2], and it should be
released in Beam 2.19.

Do you think you can leverage this in your work on adding new Elasticsearch 7
features? IMHO, supporting two different related IOs can be quite a tough task,
and I'd rather raise my hand to add new functionality to the existing IO than
create a new one, if it's possible.

[1] https://issues.apache.org/jira/browse/BEAM-5192
[2] https://github.com/apache/beam/pull/10433

> On 22 Jan 2020, at 19:23, Ludovic Boutros  wrote:
> 
> Dear all,
> 
> I have written a completely reworked Elasticsearch 7+ IO module.
> It can be found here:
> https://github.com/ludovic-boutros/beam/tree/fresh-reworked-elasticsearch-io-v7/sdks/java/io/elasticsearch7
> 
> This is quite advanced WIP, but I'm quite a new user of Apache Beam and
> I would like to get some help on this :)
> 
> I can create a JIRA issue now, but I prefer to wait for your wise advice
> first.
> 
> Why a new module ?
> 
> The current module was compliant with Elasticsearch 2.x, 5.x and 6.x. This
> seems like a good point, but so many things have changed since
> Elasticsearch 2.x.
> Elasticsearch 7.x is now only partially supported (document types are
> removed, OCC, updates...).
> 
> A fresh new module, only compliant with the last version of Elasticsearch, 
> can easily benefit a lot from the last evolutions of Elasticsearch (Java High 
> Level Http Client).
> 
> It is therefore far simpler than the current one.
> 
> Error management
> 
> Currently, errors are caught and transformed into simple exceptions. This is
> not always what is needed. If we want to do specific processing on these
> errors (sending documents to error topics, for instance), it is not possible
> with the current module.
> 
> Philosophy
> 
> This module directly uses the Elasticsearch Java client classes as inputs and 
> outputs. 
> 
> This way you can configure any options you need directly in the 
> `DocWriteRequest` objects.
> 
> For instance: 
> - If you need to use external versioning
> (https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#index-versioning),
> you can.
> - If you need to use an ingest pipeline, you can.
> - If you need to configure an update document/script, you can.
> - If you need to use upserts, you can.
> 
> Actually, you should be able to do everything you can do directly with 
> Elasticsearch.
> 
> Furthermore, it should be easier to keep updating the module with future 
> Elasticsearch evolutions.
> 
> Write outputs
> 
> Two outputs are available:
> - Successful indexing output ;
> - Failed indexing output.
> 
> They are available in a `WriteResult` object.
> 
> These two outputs are represented by
> `PCollection<BulkItemResponseContainer>` objects.
> 
> A `BulkItemResponseContainer` contains:
> - the original index request ;
> - the Elasticsearch response ;
> - a batch id.
> 
> You can apply any process afterwards (reprocessing, alerting, ...).
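[Editor's note: the two-output error handling described above can be sketched as plain Python. This is hypothetical; `BulkItemResponseContainer` and its fields mirror the proposal's wording, and the splitting logic is illustrative, not real Beam or Elasticsearch code.]

```python
# Hypothetical sketch of the successful/failed write outputs described above.
from dataclasses import dataclass


@dataclass
class BulkItemResponseContainer:
    original_request: dict  # the original index request
    response: dict          # the Elasticsearch bulk-item response
    batch_id: int


def split_write_result(containers):
    """Partition bulk responses into successful and failed outputs."""
    successful = [c for c in containers if "error" not in c.response]
    failed = [c for c in containers if "error" in c.response]
    # Failed items can then feed an error topic, alerting, reprocessing, ...
    return successful, failed


ok, bad = split_write_result([
    BulkItemResponseContainer({"_id": "1"}, {"result": "created"}, 0),
    BulkItemResponseContainer({"_id": "2"}, {"error": "mapping"}, 0),
])
print(len(ok), len(bad))  # → 1 1
```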
> 
> Read input
> 
> You can read documents from Elasticsearch with this module.
> You can specify a `QueryBuilder` in order to filter the retrieved documents.
> By default, it retrieves the whole document collection.
> 
> If the Elasticsearch index is sharded, multiple slices can be used during
> the fetch, and that many bundles are created. The maximum bundle count is
> equal to the index shard count.
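[Editor's note: a hypothetical sketch of the read-side slicing rule described above, one bundle per Elasticsearch sliced-scroll slice, with the slice (and hence bundle) count capped at the index's shard count. The request shape follows Elasticsearch's slice syntax; the planning function itself is illustrative.]

```python
# Hypothetical bundle-planning sketch: cap slices at the shard count and
# emit one Elasticsearch "sliced scroll" request body per bundle.

def plan_slices(desired_bundles, index_shard_count):
    n = min(desired_bundles, index_shard_count)
    # Slice i of n reads a disjoint subset of the documents.
    return [{"slice": {"id": i, "max": n}} for i in range(n)]


print(len(plan_slices(10, 5)))  # → 5
```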
> 
> Thank you !
> 
> Ludovic



Re: A new reworked Elasticsearch 7+ IO module

2020-01-24 Thread Chamikara Jayalath
Thanks for the contribution. I agree with Alexey that we should try to add
any new features brought in with the new PR to the existing connector instead
of trying to maintain two implementations.

Thanks,
Cham

On Fri, Jan 24, 2020 at 9:01 AM Alexey Romanenko 
wrote:

> On 22 Jan 2020, at 19:23, Ludovic Boutros  wrote:
>
> Dear all,
>
> I have written a completely reworked Elasticsearch 7+ IO module.
> It can be found here:
> https://github.com/ludovic-boutros/beam/tree/fresh-reworked-elasticsearch-io-v7/sdks/java/io/elasticsearch7
>
> This is quite advanced WIP, but I'm quite a new user of Apache Beam
> and I would like to get some help on this :)
>
> I can create a JIRA issue now, but I prefer to wait for your wise advice
> first.
>
> *Why a new module ?*
>
> The current module was compliant with Elasticsearch 2.x, 5.x and 6.x. This
> seems like a good point, but so many things have changed since
> Elasticsearch 2.x.
>
>
This is probably no longer accurate after
https://github.com/apache/beam/pull/10433, is it?


> Elasticsearch 7.x is now only partially supported (document types are
> removed, OCC, updates...).
>
> A fresh new module, only compliant with the last version of Elasticsearch,
> can easily benefit a lot from the last evolutions of Elasticsearch (Java
> High Level Http Client).
>
> It is therefore far simpler than the current one.
>
> *Error management*
>
> Currently, errors are caught and transformed into simple exceptions. This
> is not always what is needed. If we would like to do specific processing on
> these errors (send documents in error topics for instance), it is not
> possible with the current module.
>
>
This seems like some sort of dead-letter queue implementation. It would be a
very good feature to add to the existing connector.


>
> *Philosophy*
>
> This module directly uses the Elasticsearch Java client classes as inputs
> and outputs.
>
> This way you can configure any options you need directly in the
> `DocWriteRequest` objects.
>
> For instance:
> - If you need to use external versioning (
> https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#index-versioning),
> you can.
> - If you need to use an ingest pipeline, you can.
> - If you need to configure an update document/script, you can.
> - If you need to use upserts, you can.
>
> Actually, you should be able to do everything you can do directly with
> Elasticsearch.
>
> Furthermore, it should be easier to keep updating the module with future
> Elasticsearch evolutions.
>
> *Write outputs*
>
> Two outputs are available:
> - Successful indexing output ;
> - Failed indexing output.
>
> They are available in a `WriteResult` object.
>
> These two outputs are represented by
> `PCollection<BulkItemResponseContainer>` objects.
>
> A `BulkItemResponseContainer` contains:
> - the original index request ;
> - the Elasticsearch response ;
> - a batch id.
>
> You can apply any process afterwards (reprocessing, alerting, ...).
>
> *Read input*
>
> You can read documents from Elasticsearch with this module.
> You can specify a `QueryBuilder` in order to filter the retrieved
> documents.
> By default, it retrieves the whole document collection.
>
> If the Elasticsearch index is sharded, multiple slices can be used during
> fetch. That many bundles are created. The maximum bundle count is equal to
> the index shard count.
>
> Thank you !
>
> Ludovic
>
>
>


Re: Dynamic timers now supported!

2020-01-24 Thread Reuven Lax
The new timer family is in the portability protos. I think TimerReceiver
needs to be updated to set it though (I think a 1-line change).

The TimerInternals class that runners implement today already handles
dynamic timers, so most of the work was in the Beam SDK  to provide an API
that allows users to access this feature.

The main work needed in the runner was to take into account the timer family.
Beam semantics say that if a timer is set twice with the same id, the second
setting overwrites the first. Several runners therefore had maps from
timer id -> timer. However, since the timer family scopes the timers, we now
allow two timers with the same id as long as the timer families are
different. Runners had to be updated to include the timer family id in the
map keys.
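[Editor's note: the key-scoping change described above can be sketched in a few lines of hypothetical Python. This is not runner code; it only illustrates why keying pending timers by timer id alone overwrites across families, while keying by (family, id) keeps them distinct.]

```python
# Hypothetical bookkeeping sketch: timer id alone vs. (family, id) keys.

pending_old = {}  # timer_id -> timestamp (pre-TimerMap bookkeeping)
pending_new = {}  # (family_id, timer_id) -> timestamp


def set_timer_old(timer_id, ts):
    pending_old[timer_id] = ts  # the same id always overwrites


def set_timer_new(family_id, timer_id, ts):
    pending_new[(family_id, timer_id)] = ts  # overwrite only within a family


set_timer_old("t", 10)
set_timer_old("t", 20)             # overwrites: one pending timer left

set_timer_new("familyA", "t", 10)
set_timer_new("familyB", "t", 20)  # distinct families: both timers survive
set_timer_new("familyA", "t", 30)  # same family + id: overwrites the first

print(len(pending_old), len(pending_new))  # → 1 2
```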

Surprisingly, the new TimerMap tests seem to pass on Spark ValidatesRunner,
even though the Spark runner wasn't updated! I wonder if this means that
the Spark runner was incorrectly implementing the Beam semantics before,
and setTimer was not overwriting timers with the same id?

Reuven



Re: A new reworked Elasticsearch 7+ IO module

2020-01-24 Thread Kenneth Knowles
Would it make sense to have different version-specialized connectors with a
common core library and common API package?



Re: GSOC announced!

2020-01-24 Thread Xinbin Huang
Hi Rui,

Yes, I would like to contribute to Apache Beam, but I don't have a specific
topic of interest in mind.

I have reviewed some of the issues on JIRA and would like to work on a few of
them. I have read through the contributing page,
https://beam.apache.org/contribute/, and it gives me an idea of the desired
workflow. Besides that, are there any other sources I should refer to?

I will open a separate email to get permission on JIRA.

Cheers.
Bin

On Wed, Jan 15, 2020 at 1:16 PM Rui Wang  wrote:

> Hi Xinbin,
>
> I assume you want to contribute to Apache Beam while you are less
> experienced, thus you want to seek for some mentorship?
>
> This topic was discussed before. I don't think we decided to build a
> formal mentorship program for Beam. Instead, would you share your interest
> first and then probably we could ask if there are people that know the
> topic who can actually mentor?
>
>
> -Rui
>
> On Wed, Jan 15, 2020 at 9:30 AM Xinbin Huang 
> wrote:
>
>> Hi community,
>>
>> I am pretty new to the Apache Beam community and want to contribute to
>> the project. I think GSoC is a great opportunity for people to learn and
>> contribute, but I am not eligible because I am not a student. That being
>> said, would there be opportunities for non-students to participate, or
>> other opportunities suitable for less experienced people who want to
>> contribute?
>>
>> Thanks!
>> Bin
>>
>> On Wed, Jan 15, 2020 at 8:52 AM Ismaël Mejía  wrote:
>>
>>> Thanks for bringing this info. +1 on the Nexmark + Python + Portability
>>> project.
>>> Let's sync on that one Pablo. I am interested on co-mentoring it.
>>>
>>>
>>> On Tue, Jan 14, 2020 at 7:55 PM Rui Wang  wrote:
>>>
 Great! I will try to propose something for BeamSQL.


 -Rui

 On Tue, Jan 14, 2020 at 10:40 AM Pablo Estrada 
 wrote:

> Hello everyone,
>
> As with every year, the Google Summer of Code has been announced[1],
> so we can start preparing for it if anyone is interested. It's early in 
> the
> process for now, but it's good to prepare early : )
>
> Here are the ASF mentor guidelines[2]. For now, the thing to do is to
> file JIRA issues for your projects, and apply the labels "mentor", "gsoc",
> "gsoc2020".
>
> When the time comes, the next steps are to join the
> ment...@community.apache.org list, and request the PMC for approval
> of a project.
>
> My current plan is to have these projects, though these are subject to
> change:
> - Build Nexmark pipelines for Python SDK (Ismael FYI)
> - Azure Blobstore File System for Java & Python
>
> I'll try to keep the dev@ list updated with other steps of the
> process.
> Thanks!
> -P.
>
> [1] https://summerofcode.withgoogle.com/
> [2]
> https://community.apache.org/gsoc.html#prospective-asf-mentors-read-this
>



New contributor

2020-01-24 Thread Xinbin Huang
Hi

This is Xinbin Huang. Can someone add me as a contributor for Beam's Jira
issue tracker? I would like to create/assign tickets for my work.

Thanks
Bin


Re: Dynamic timers now supported!

2020-01-24 Thread Maximilian Michels
The Flink Runner allowed setting a timer multiple times before we
made it comply with the Beam semantics of overwriting past invocations.
I wouldn't be surprised if the Spark Runner never addressed this. Flink
and Spark themselves allow a timer to be set multiple times. In order
to fix this for Beam, the Flink Runner has to maintain a checkpointed
map which sits outside of its built-in TimerService.


As far as I can see, multiple timer families are currently not supported
in the Flink Runner because the map does not take the family name into
account. This can easily be fixed, though.


-Max

On 24.01.20 21:31, Reuven Lax wrote:
The new timer family is in the portability protos. I think TimerReceiver 
needs to be updated to set it though (I think a 1-line change).


The TimerInternals class that runners implement today already handles 
dynamic timers, so most of the work was in the Beam SDK to provide an 
API that allows users to access this feature.


The main work needed in the runner was to take into account the timer 
family. Beam semantics say that if a timer is set twice with the same 
id, then the second timer overwrites the first. Several runners 
therefore had maps from timer id -> timer. However, since the 
timer family scopes the timers, we now allow two timers with the same id 
as long as the timer families are different. Runners had to be updated 
to include the timer family id in the map keys.


Surprisingly, the new TimerMap tests seem to pass on Spark 
ValidatesRunner, even though the Spark runner wasn't updated! I wonder 
if this means that the Spark runner was incorrectly implementing the 
Beam semantics before, and setTimer was not overwriting timers with the 
same id?


Reuven

On Fri, Jan 24, 2020 at 7:31 AM Ismaël Mejía  wrote:


This looks great, thanks for the contribution Rehman!

I have some questions (note I have not looked at the code at all).

- Is this working for both portable and non portable runners?
- What do other runners need to implement to support this (e.g. Spark)?

Maybe worth to add this to the website Compatibility Matrix.

Regards,
Ismaël


On Fri, Jan 24, 2020 at 8:42 AM Rehman Murad Ali  wrote:

Thank you Reuven for the guidance throughout the development
process. I am delighted to contribute my two cents to the Beam
project.

Looking forward to more active contributions.


*Thanks & Regards*



*Rehman Murad Ali*
Software Engineer
Mobile: +92 3452076766 
Skype: rehman.muradali



On Thu, Jan 23, 2020 at 11:09 PM Reuven Lax  wrote:

Thanks to a lot of hard work by Rehman, Beam now supports
dynamic timers. As a reminder, this was discussed on the dev
list some time back.

As background, previously one had to statically declare all
timers in your code. So if you wanted to have two timers,
you needed to create two timer variables and two callbacks -
one for each timer. A number of users kept hitting stumbling
blocks where they needed a dynamic set of timers (often
based on the element), which was not supported in Beam. The
workarounds were quite ugly and complicated.

The new support allows declaring a TimerMap, which is a map
of timers. Each TimerMap is scoped by a family name, so you
can create multiple TimerMaps each with its own callback.
The use looks as follows:

class MyDoFn extends DoFn<...> {
  @TimerFamily("timers")
  private final TimerSpec timerMap =
      TimerSpecs.timerMap(TimeDomain.EVENT_TIME);

  @ProcessElement
  public void process(@TimerFamily("timers") TimerMap timers, @Element Type e) {
    timers.set("mainTimer", timestamp);
    timers.set("actionType" + e.getActionType(), timestamp);
  }

  @OnTimerFamily("timers")
  public void onTimer(@TimerId String timerId) {
    System.out.println("Timer fired. id: " + timerId);
  }
}

This currently works for the Flink and the Dataflow runners.

Thank you Rehman for getting this done! Beam users will find
it very valuable.

Reuven



Re: New contributor

2020-01-24 Thread Ahmet Altay
Welcome. What is your jira user name?

On Fri, Jan 24, 2020 at 1:25 PM Xinbin Huang  wrote:

> Hi
>
> This is Xinbin Huang. Can someone add me as a contributor for Beam's Jira
> issue tracker? I would like to create/assign tickets for my work.
>
> Thanks
> Bin
>


Re: Dynamic timers now supported!

2020-01-24 Thread Reuven Lax
Yes. For now we exclude the flink runner, but fixing this should be fairly
trivial.

On Fri, Jan 24, 2020 at 3:35 PM Maximilian Michels  wrote:

> The Flink Runner was allowing to set a timer multiple times before we
> made it comply with the Beam semantics of overwriting past invocations.
> I wouldn't be surprised if the Spark Runner never addressed this. Flink
> and Spark itself allow for a timer to be set to multiple times. In order
> to fix this for Beam, the Flink Runner has to maintain a checkpointed
> map which sits outside of its builtin TimerService.
>
> As far as I can see, multiple timer families are currently not supported
> in the Flink Runner due to the map not taking the family name into
> account. This can be easily fixed though.
>
> -Max
>
> On 24.01.20 21:31, Reuven Lax wrote:
> > The new timer family is in the portability protos. I think TimerReceiver
> > needs to be updated to set it though (I think a 1-line change).
> >
> > The TimerInternals class that runners implement today already handles
> > dynamic timers, so most of the work was in the Beam SDK  to provide an
> > API that allows users to access this feature.
> >
> > The main work needed in the runner was to take in account the timer
> > family. Beam semantics say that if a timer is set twice with the same
> > id, then the second timer overwrites the first.  Several runners
> > therefore had maps from timer id -> timer. However since the
> > timer family scopes the timers, we now allow two timers with the same id
> > as long as the timer families are different. Runners had to be updated
> > to include the timer family id in the map keys.
> >
> > Surprisingly, the new TimerMap tests seem to pass on Spark
> > ValidatesRunner, even though the Spark runner wasn't updated! I wonder
> > if this means that the Spark runner was incorrectly implementing the
> > Beam semantics before, and setTimer was not overwriting timers with the
> > same id?
> >
> > Reuven
> >
> > On Fri, Jan 24, 2020 at 7:31 AM Ismaël Mejía  wrote:
> >
> > This looks great, thanks for the contribution Rehman!
> >
> > I have some questions (note I have not looked at the code at all).
> >
> > - Is this working for both portable and non portable runners?
> > - What do other runners need to implement to support this (e.g.
> Spark)?
> >
> > Maybe worth to add this to the website Compatibility Matrix.
> >
> > Regards,
> > Ismaël
> >
> >
> > On Fri, Jan 24, 2020 at 8:42 AM Rehman Murad Ali  wrote:
> >
> > Thank you Reuven for the guidance throughout the development
> > process. I am delighted to contribute my two cents to the Beam
> > project.
> >
> > Looking forward to more active contributions.
> >
> >
> > *Thanks & Regards*
> >
> >
> >
> > *Rehman Murad Ali*
> > Software Engineer
> > Mobile: +92 3452076766
> > Skype: rehman.muradali
> >
> >
> >
> > On Thu, Jan 23, 2020 at 11:09 PM Reuven Lax  wrote:
> >
> > Thanks to a lot of hard work by Rehman, Beam now supports
> > dynamic timers. As a reminder, this was discussed on the dev
> > list some time back.
> >
> > As background, previously one had to statically declare all
> > timers in your code. So if you wanted to have two timers,
> > you needed to create two timer variables and two callbacks -
> > one for each timer. A number of users kept hitting stumbling
> > blocks where they needed a dynamic set of timers (often
> > based on the element), which was not supported in Beam. The
> > workarounds were quite ugly and complicated.
> >
> > The new support allows declaring a TimerMap, which is a map
> > of timers. Each TimerMap is scoped by a family name, so you
> > can create multiple TimerMaps each with its own callback.
> > The use looks as follows:
> >
> > class MyDoFn extends DoFn<...> {
> >   @TimerFamily("timers")
> >   private final TimerSpec timerMap =
> >       TimerSpecs.timerMap(TimeDomain.EVENT_TIME);
> >
> >   @ProcessElement
> >   public void process(@TimerFamily("timers") TimerMap timers, @Element Type e) {
> >     timers.set("mainTimer", timestamp);
> >     timers.set("actionType" + e.getActionType(), timestamp);
> >   }
> >
> >   @OnTimerFamily("timers")
> >   public void onTimer(@TimerId String timerId) {
> >     System.out.println("Timer fired. id: " + timerId);
> >   }
> > }
> >
> > This currently works for the Flink and the Dataflow runners.
> >

Re: help with this error, please

2020-01-24 Thread Vasu Nori
Just to close this issue out,
I found out what the problem was.

A class constructor was failing at runtime.
When the constructor fails, it *somehow* becomes a NoClassDefFoundError.
What confused me was that the error-causing line wasn't identified
correctly in the stack trace.
I stripped out the segments I added one at a time and finally found the
specific line that caused the failure.
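
[Editor's note] This behavior is standard JVM semantics for failed class
initialization: the first use of the class throws ExceptionInInitializerError
wrapping the real cause, and every subsequent use throws NoClassDefFoundError
("Could not initialize class …") with no reference to the original failure —
which is why the stack trace points at the wrong place. A minimal,
self-contained sketch (class names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class NoClassDefDemo {
    // Hypothetical class whose static initialization always fails,
    // standing in for a field initializer or constructor that throws.
    static class Unstable {
        static final int VALUE = init();
        static int init() { throw new RuntimeException("initializer failed"); }
    }

    // Touch the broken class twice and record which error each attempt throws.
    static List<String> probe() {
        List<String> errors = new ArrayList<>();
        for (int i = 0; i < 2; i++) {
            try {
                System.out.println(Unstable.VALUE);
            } catch (Throwable t) {
                errors.add(t.getClass().getSimpleName());
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        // First access: ExceptionInInitializerError (carries the real cause).
        // Second access: NoClassDefFoundError "Could not initialize class ...".
        System.out.println(probe());
    }
}
```

The second access is exactly the "Could not initialize class
org.apache.beam.sdk.options.PipelineOptionsFactory" failure seen in the
original stack trace: the real cause was thrown earlier (often swallowed by a
test harness), and only the opaque follow-up error survives.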

thanks

On Thu, Jan 23, 2020 at 12:13 PM Vasu Nori  wrote:

> Luke, Toko
>
> Thanks for the responses.
> I will upload a change that is the minimum needed to reproduce this error.
> shortly.
>
> thanks,
>
> On Wed, Jan 22, 2020 at 7:48 PM Tomo Suzuki  wrote:
>
>> Hi Vasu,
>> (Ignore my message if Luke's advice resolves the issue already)
>>
>> Would you add the entire error message, if any? A NoClassDefFoundError
>> is usually caused by an underlying ClassNotFoundException saying a class
>> is not found.
>> I don't see the missing class name in your stacktrace. I'll need (1)
>> the entire error message and (2) your branch in GitHub and a command
>> to reproduce the error.
>>
>> Regards,
>> Tomo
>>
>>
>> On Wed, Jan 22, 2020 at 7:53 PM Luke Cwik  wrote:
>> >
>> > boolean properties only allow for getYYY, isYYY and setYYY, you can't
>> use "should". I think you should have gotten a better error message though
>> so it's likely something else is not working for you. How are you trying to
>> run the test?
>> >
>> > All pipeline options use a global namespace so UseGrpc will "reserve"
>> that name from it being used anywhere else. Do you want to define a new
>> property UseGrpc for all GCP IOs or use a name that is specific to GCS?
>> >
>> >
>> > On Wed, Jan 22, 2020 at 3:59 PM Vasu Nori  wrote:
>> >>
>> >> sorry I didn't realize I posted a screenshot link that wasn't visible
>> outside google.com.
>> >> here is the code I was trying to add to GcsOptions.java
>> >>
>> >>   @Description("Whether to use gRPC or not, as transport mechanism.")
>> >>   @Default.Boolean(false)
>> >>   Boolean shouldUseGrpc();
>> >>
>> >>   void setUseGrpc(Boolean useGrpc);
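
[Editor's note] Luke's point further down the thread — that boolean properties
only allow getYYY/isYYY/setYYY, not "should" — is the JavaBean accessor
convention. Beam's PipelineOptionsFactory does its own reflection, but the same
convention can be demonstrated with the JDK's Introspector; the Options
interface below is hypothetical, not Beam's GcsOptions:

```java
import java.beans.BeanInfo;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.ArrayList;
import java.util.List;

public class BeanNamingDemo {
    // Hypothetical options interface for illustration.
    public interface Options {
        boolean isUseGrpc();            // "is" prefix: recognized as property "useGrpc"
        void setUseGrpc(boolean value);
        boolean shouldUseGrpc();        // "should" prefix: not a JavaBean getter
    }

    // Names of properties that the Introspector recognizes a read method for.
    static List<String> readableProperties() throws Exception {
        BeanInfo info = Introspector.getBeanInfo(Options.class);
        List<String> names = new ArrayList<>();
        for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
            if (pd.getReadMethod() != null) {
                names.add(pd.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readableProperties()); // [useGrpc]
    }
}
```

Only "useGrpc" shows up as a readable property; shouldUseGrpc() is an ordinary
method, invisible to property-based frameworks — hence the original snippet's
getter is never paired with its setter.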
>> >>
>> >>
>> >> On Wed, Jan 22, 2020 at 2:37 PM Vasu Nori  wrote:
>> >>>
>> >>> Hello
>> >>>
>> >>> I am trying to add a new property to this file like this
>> >>> but this results in some error I can't understand the origins of.
>> >>> Stacktrace is below.
>> >>> any pointers would be appreciated.
>> >>>
>> >>> java.lang.NoClassDefFoundError: Could not initialize class
>> org.apache.beam.sdk.options.PipelineOptionsFactory
>> >>> at
>> org.apache.beam.sdk.extensions.gcp.storage.GcsFileSystemTest.setUp(GcsFileSystemTest.java:65)
>> >>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> >>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> >>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> >>> at java.lang.reflect.Method.invoke(Method.java:498)
>> >>> at
>> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>> >>> at
>> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>> >>> at
>> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>> >>> at
>> org.junit.internal.runners.statements.RunBefores.invokeMethod(RunBefores.java:33)
>> >>> at
>> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
>> >>> at
>> org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:266)
>> >>> at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:305)
>> >>> at
>> org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
>> >>> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:365)
>> >>> at
>> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
>> >>> at
>> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
>> >>> at org.junit.runners.ParentRunner$4.run(ParentRunner.java:330)
>> >>> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:78)
>> >>> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:328)
>> >>> at org.junit.runners.ParentRunner.access$100(ParentRunner.java:65)
>> >>> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:292)
>> >>> at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:305)
>> >>> at org.junit.runners.ParentRunner.run(ParentRunner.java:412)
>> >>> at
>> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:110)
>> >>> at
>> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:58)
>> >>> at
>> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:38)
>> >>> at
>> org.gradle.api.internal.tasks.testing.junit.AbstractJUnitTestClassProcessor.processTestClass(AbstractJUnitTestClassProcessor.java:62)
>> >>> at
>> org.gradle.api.internal.tasks.testing.SuiteTest

Re: New contributor

2020-01-24 Thread Xinbin Huang
Oh sorry, I forgot to paste it here. My username is xbhuang.

Thanks


On Fri, Jan 24, 2020, 4:25 PM Ahmet Altay  wrote:

> Welcome. What is your jira user name?
>
> On Fri, Jan 24, 2020 at 1:25 PM Xinbin Huang 
> wrote:
>
>> Hi
>>
>> This is Xinbin Huang. Can someone add me as a contributor for Beam's Jira
>> issue tracker? I would like to create/assign tickets for my work.
>>
>> Thanks
>> Bin
>>
>