Jenkins build is back to stable : beam_Release_NightlySnapshot #504

2017-08-15 Thread Apache Jenkins Server
See 




Re: Adding back PipelineRunner#apply method

2017-08-15 Thread Shen Li
Hi Eugene,

Thanks for sharing the info. That PAssertionSite tracks where an assertion
error occurred. Do you know if it is possible to get the class name and
line number where a PTransform was added?

Thanks,
Shen

On Mon, Aug 14, 2017 at 10:54 PM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:

> Hi Shen,
> Responding just to one part of your message - "remember the line at which
> the PTransform was added": take a look at
> https://github.com/apache/beam/pull/2247 which does this for PAssert.
>
> On Mon, Aug 14, 2017 at 7:32 PM Shen Li  wrote:
>
> > In 0.5.0 or earlier releases, PipelineRunner provides an
> > apply(PTransform, InputT) method which allows runner
> > implementation to perform actions when the PTransform is added to the
> > pipeline. In later releases, that apply(...) method has been replaced by
> a
> > Pipeline#replaceAll() method, where the runner can only get involved
> after
> > the pipeline has been fully constructed. In terms of override
> PTransforms,
> > these two APIs are identical. But, the early API could still be helpful.
> > For example, the runner could remember the line at which the PTransform
> was
> > added, and provide that info to users to assist debugging. Is it possible
> > to add that API back? Or is there any other way to involve the runner
> when
> > a PTransform is added?
> >
> > Thanks,
> > Shen
> >
>


Re: Adding back PipelineRunner#apply method

2017-08-15 Thread Eugene Kirpichov
In general, no - but the implementation of PAssertionSite exemplifies the
approach. I guess it could be useful to make this a general Beam feature
and remember it for all transforms. It would probably be best to implement
inside Pipeline.apply().
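
To make the idea concrete, here is a minimal sketch of recording the call site at the moment a transform is applied. It is only an illustration under stated assumptions: `ToyPipeline` and its `apply` method are hypothetical stand-ins rather than Beam classes, and Python's standard `traceback` module stands in for whatever PAssertionSite does internally.

```python
import traceback


class ToyPipeline:
    """Hypothetical stand-in for a pipeline; not Beam's Pipeline class."""

    def __init__(self):
        # Each entry is (transform_name, filename, lineno) of the apply() call.
        self.applied = []

    def apply(self, transform_name):
        # extract_stack(limit=2) returns the two innermost frames, oldest
        # first: index 0 is the caller, index 1 is this apply() frame.
        caller = traceback.extract_stack(limit=2)[0]
        self.applied.append((transform_name, caller.filename, caller.lineno))
        return self


def build():
    p = ToyPipeline()
    p.apply("ReadLines")
    p.apply("CountWords")
    return p


pipeline = build()
for name, filename, lineno in pipeline.applied:
    print(f"{name} applied at {filename}:{lineno}")
```

The recorded site could then be attached to whatever the transform produces, so error messages can point back at the user's line.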

On Tue, Aug 15, 2017, 7:02 AM Shen Li  wrote:

> Hi Eugene,
>
> Thanks for sharing the info. That PAssertionSite tracks where an assertion
> error occurred. Do you know if it is possible to get the class name and
> line number where a PTransform was added?
>
> Thanks,
> Shen
>
> On Mon, Aug 14, 2017 at 10:54 PM, Eugene Kirpichov <
> kirpic...@google.com.invalid> wrote:
>
> > Hi Shen,
> > Responding just to one part of your message - "remember the line at which
> > the PTransform was added": take a look at
> > https://github.com/apache/beam/pull/2247 which does this for PAssert.
> >
> > On Mon, Aug 14, 2017 at 7:32 PM Shen Li  wrote:
> >
> > > In 0.5.0 or earlier releases, PipelineRunner provides an
> > > apply(PTransform, InputT) method which allows runner
> > > implementation to perform actions when the PTransform is added to the
> > > pipeline. In later releases, that apply(...) method has been replaced
> by
> > a
> > > Pipeline#replaceAll() method, where the runner can only get involved
> > after
> > > the pipeline has been fully constructed. In terms of override
> > PTransforms,
> > > these two APIs are identical. But, the early API could still be
> helpful.
> > > For example, the runner could remember the line at which the PTransform
> > was
> > > added, and provide that info to users to assist debugging. Is it
> possible
> > > to add that API back? Or is there any other way to involve the runner
> > when
> > > a PTransform is added?
> > >
> > > Thanks,
> > > Shen
> > >
> >
>


Re: Adding back PipelineRunner#apply method

2017-08-15 Thread Eugene Kirpichov
... and remember it and make it available inside the PCollection (i.e. which
application produced this collection).

On Tue, Aug 15, 2017, 8:39 AM Eugene Kirpichov  wrote:

> In general, no - but the implementation of PAssertionSite exemplifies the
> approach. I guess it could be useful to make this a general beam feature
> and remember it for all transforms. It would probably be best to implement
> inside Pipeline.apply().
>
> On Tue, Aug 15, 2017, 7:02 AM Shen Li  wrote:
>
>> Hi Eugene,
>>
>> Thanks for sharing the info. That PAssertionSite tracks where an assertion
>> error occurred. Do you know if it is possible to get the class name and
>> line number where a PTransform was added?
>>
>> Thanks,
>> Shen
>>
>> On Mon, Aug 14, 2017 at 10:54 PM, Eugene Kirpichov <
>> kirpic...@google.com.invalid> wrote:
>>
>> > Hi Shen,
>> > Responding just to one part of your message - "remember the line at
>> which
>> > the PTransform was added": take a look at
>> > https://github.com/apache/beam/pull/2247 which does this for PAssert.
>> >
>> > On Mon, Aug 14, 2017 at 7:32 PM Shen Li  wrote:
>> >
>> > > In 0.5.0 or earlier releases, PipelineRunner provides an
>> > > apply(PTransform, InputT) method which allows runner
>> > > implementation to perform actions when the PTransform is added to the
>> > > pipeline. In later releases, that apply(...) method has been replaced
>> by
>> > a
>> > > Pipeline#replaceAll() method, where the runner can only get involved
>> > after
>> > > the pipeline has been fully constructed. In terms of override
>> > PTransforms,
>> > > these two APIs are identical. But, the early API could still be
>> helpful.
>> > > For example, the runner could remember the line at which the
>> PTransform
>> > was
>> > > added, and provide that info to users to assist debugging. Is it
>> possible
>> > > to add that API back? Or is there any other way to involve the runner
>> > when
>> > > a PTransform is added?
>> > >
>> > > Thanks,
>> > > Shen
>> > >
>> >
>>
>


Re: Adding back PipelineRunner#apply method

2017-08-15 Thread Thomas Groh
This style of method doesn't fit with the current approach of pipeline
construction, where the PipelineRunner need not be specified until the
pipeline is run; as such, the runner can't observe the construction of the
Pipeline, as it may not exist during the construction of the Pipeline.
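
A toy model may help show why a construction-time hook does not fit. This is a sketch under assumptions, not Beam's API: `DeferredPipeline` and `ListingRunner` are hypothetical names, and the only point is that no runner object exists while `apply` is being called.

```python
class DeferredPipeline:
    """Toy pipeline that knows nothing about runners until run() is called."""

    def __init__(self):
        self.transforms = []

    def apply(self, name):
        # No runner exists yet, so there is nothing to notify here.
        self.transforms.append(name)
        return self

    def run(self, runner):
        # The runner first sees the pipeline here, fully constructed.
        return runner.run_pipeline(self)


class ListingRunner:
    """Hypothetical runner that just reports the transforms it was handed."""

    def run_pipeline(self, pipeline):
        return [f"executing {name}" for name in pipeline.transforms]


p = DeferredPipeline().apply("Read").apply("Count").apply("Write")
result = p.run(ListingRunner())  # the runner is chosen only at this point
print(result)
```

In this model, anything a runner wants to know about construction (such as call sites) has to be recorded by the pipeline itself and handed over at run time.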

On Tue, Aug 15, 2017 at 8:41 AM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:

> ... And remember it and make available inside PCollection (which
> application produced this collection).
>
> On Tue, Aug 15, 2017, 8:39 AM Eugene Kirpichov 
> wrote:
>
> > In general, no - but the implementation of PAssertionSite exemplifies the
> > approach. I guess it could be useful to make this a general beam feature
> > and remember it for all transforms. It would probably be best to
> implement
> > inside Pipeline.apply().
> >
> > On Tue, Aug 15, 2017, 7:02 AM Shen Li  wrote:
> >
> >> Hi Eugene,
> >>
> >> Thanks for sharing the info. That PAssertionSite tracks where an
> assertion
> >> error occurred. Do you know if it is possible to get the class name and
> >> line number where a PTransform was added?
> >>
> >> Thanks,
> >> Shen
> >>
> >> On Mon, Aug 14, 2017 at 10:54 PM, Eugene Kirpichov <
> >> kirpic...@google.com.invalid> wrote:
> >>
> >> > Hi Shen,
> >> > Responding just to one part of your message - "remember the line at
> >> which
> >> > the PTransform was added": take a look at
> >> > https://github.com/apache/beam/pull/2247 which does this for PAssert.
> >> >
> >> > On Mon, Aug 14, 2017 at 7:32 PM Shen Li  wrote:
> >> >
> >> > > In 0.5.0 or earlier releases, PipelineRunner provides an
> >> > > apply(PTransform, InputT) method which allows
> runner
> >> > > implementation to perform actions when the PTransform is added to
> the
> >> > > pipeline. In later releases, that apply(...) method has been
> replaced
> >> by
> >> > a
> >> > > Pipeline#replaceAll() method, where the runner can only get involved
> >> > after
> >> > > the pipeline has been fully constructed. In terms of override
> >> > PTransforms,
> >> > > these two APIs are identical. But, the early API could still be
> >> helpful.
> >> > > For example, the runner could remember the line at which the
> >> PTransform
> >> > was
> >> > > added, and provide that info to users to assist debugging. Is it
> >> possible
> >> > > to add that API back? Or is there any other way to involve the
> runner
> >> > when
> >> > > a PTransform is added?
> >> > >
> >> > > Thanks,
> >> > > Shen
> >> > >
> >> >
> >>
> >
>


Hello from a newbie to the data world living in the city by the bay!

2017-08-15 Thread Griselda Cuevas
Hi Beam community,

I’m Griselda (Gris) Cuevas and I’m very excited to join the community. I’m
looking forward to learning awesome things from you and to getting the
chance to collaborate on great initiatives.

I’m currently working at Google and I’m studying for a master’s in operations
research and data science at UC Berkeley. I’m interested in Natural
Language Processing, Information Retrieval and Online Communities. Some
other fun topics I love are juggling, camping and (just getting into it)
listening to podcasts, so if you ever want to talk about any
of these topics, here I am!

Another reason I’m here is that I want to help this project grow and
thrive. This means that you’ll see me contributing to the project, reaching
out to ask questions as I get familiar with our community, and also
helping evangelize Apache Beam by organizing meetups, hangouts, etc.

I say bye for now, I’ll see you around,

Cheers,

G


Re: Adding back PipelineRunner#apply method

2017-08-15 Thread Shen Li
Hi Thomas,

Does it mean future Pipeline implementations would allow applications to
set the runner after a pipeline has been constructed?

Thanks,
Shen

On Tue, Aug 15, 2017 at 12:36 PM, Thomas Groh 
wrote:

> This style of method doesn't fit with the current approach of pipeline
> construction, where the PipelineRunner need not be specified until the
> pipeline is run; as such, the runner can't observe the construction of the
> Pipeline, as it may not exist during the construction of the Pipeline.
>
> On Tue, Aug 15, 2017 at 8:41 AM, Eugene Kirpichov <
> kirpic...@google.com.invalid> wrote:
>
> > ... And remember it and make available inside PCollection (which
> > application produced this collection).
> >
> > On Tue, Aug 15, 2017, 8:39 AM Eugene Kirpichov 
> > wrote:
> >
> > > In general, no - but the implementation of PAssertionSite exemplifies
> the
> > > approach. I guess it could be useful to make this a general beam
> feature
> > > and remember it for all transforms. It would probably be best to
> > implement
> > > inside Pipeline.apply().
> > >
> > > On Tue, Aug 15, 2017, 7:02 AM Shen Li  wrote:
> > >
> > >> Hi Eugene,
> > >>
> > >> Thanks for sharing the info. That PAssertionSite tracks where an
> > assertion
> > >> error occurred. Do you know if it is possible to get the class name
> and
> > >> line number where a PTransform was added?
> > >>
> > >> Thanks,
> > >> Shen
> > >>
> > >> On Mon, Aug 14, 2017 at 10:54 PM, Eugene Kirpichov <
> > >> kirpic...@google.com.invalid> wrote:
> > >>
> > >> > Hi Shen,
> > >> > Responding just to one part of your message - "remember the line at
> > >> which
> > >> > the PTransform was added": take a look at
> > >> > https://github.com/apache/beam/pull/2247 which does this for
> PAssert.
> > >> >
> > >> > On Mon, Aug 14, 2017 at 7:32 PM Shen Li 
> wrote:
> > >> >
> > >> > > In 0.5.0 or earlier releases, PipelineRunner provides an
> > >> > > apply(PTransform, InputT) method which allows
> > runner
> > >> > > implementation to perform actions when the PTransform is added to
> > the
> > >> > > pipeline. In later releases, that apply(...) method has been
> > replaced
> > >> by
> > >> > a
> > >> > > Pipeline#replaceAll() method, where the runner can only get
> involved
> > >> > after
> > >> > > the pipeline has been fully constructed. In terms of override
> > >> > PTransforms,
> > >> > > these two APIs are identical. But, the early API could still be
> > >> helpful.
> > >> > > For example, the runner could remember the line at which the
> > >> PTransform
> > >> > was
> > >> > > added, and provide that info to users to assist debugging. Is it
> > >> possible
> > >> > > to add that API back? Or is there any other way to involve the
> > runner
> > >> > when
> > >> > > a PTransform is added?
> > >> > >
> > >> > > Thanks,
> > >> > > Shen
> > >> > >
> > >> >
> > >>
> > >
> >
>


Re: Adding back PipelineRunner#apply method

2017-08-15 Thread Robert Bradshaw
On Tue, Aug 15, 2017 at 10:21 AM, Shen Li  wrote:
> Hi Thomas,
>
> Does it mean future Pipeline implementations would allow applications to
> set the runner after a pipeline has been constructed?

Correct, that's the intent.

>
> Thanks,
> Shen
>
> On Tue, Aug 15, 2017 at 12:36 PM, Thomas Groh 
> wrote:
>
>> This style of method doesn't fit with the current approach of pipeline
>> construction, where the PipelineRunner need not be specified until the
>> pipeline is run; as such, the runner can't observe the construction of the
>> Pipeline, as it may not exist during the construction of the Pipeline.
>>
>> On Tue, Aug 15, 2017 at 8:41 AM, Eugene Kirpichov <
>> kirpic...@google.com.invalid> wrote:
>>
>> > ... And remember it and make available inside PCollection (which
>> > application produced this collection).
>> >
>> > On Tue, Aug 15, 2017, 8:39 AM Eugene Kirpichov 
>> > wrote:
>> >
>> > > In general, no - but the implementation of PAssertionSite exemplifies
>> the
>> > > approach. I guess it could be useful to make this a general beam
>> feature
>> > > and remember it for all transforms. It would probably be best to
>> > implement
>> > > inside Pipeline.apply().
>> > >
>> > > On Tue, Aug 15, 2017, 7:02 AM Shen Li  wrote:
>> > >
>> > >> Hi Eugene,
>> > >>
>> > >> Thanks for sharing the info. That PAssertionSite tracks where an
>> > assertion
>> > >> error occurred. Do you know if it is possible to get the class name
>> and
>> > >> line number where a PTransform was added?
>> > >>
>> > >> Thanks,
>> > >> Shen
>> > >>
>> > >> On Mon, Aug 14, 2017 at 10:54 PM, Eugene Kirpichov <
>> > >> kirpic...@google.com.invalid> wrote:
>> > >>
>> > >> > Hi Shen,
>> > >> > Responding just to one part of your message - "remember the line at
>> > >> which
>> > >> > the PTransform was added": take a look at
>> > >> > https://github.com/apache/beam/pull/2247 which does this for
>> PAssert.
>> > >> >
>> > >> > On Mon, Aug 14, 2017 at 7:32 PM Shen Li 
>> wrote:
>> > >> >
>> > >> > > In 0.5.0 or earlier releases, PipelineRunner provides an
>> > >> > > apply(PTransform, InputT) method which allows
>> > runner
>> > >> > > implementation to perform actions when the PTransform is added to
>> > the
>> > >> > > pipeline. In later releases, that apply(...) method has been
>> > replaced
>> > >> by
>> > >> > a
>> > >> > > Pipeline#replaceAll() method, where the runner can only get
>> involved
>> > >> > after
>> > >> > > the pipeline has been fully constructed. In terms of override
>> > >> > PTransforms,
>> > >> > > these two APIs are identical. But, the early API could still be
>> > >> helpful.
>> > >> > > For example, the runner could remember the line at which the
>> > >> PTransform
>> > >> > was
>> > >> > > added, and provide that info to users to assist debugging. Is it
>> > >> possible
>> > >> > > to add that API back? Or is there any other way to involve the
>> > runner
>> > >> > when
>> > >> > > a PTransform is added?
>> > >> > >
>> > >> > > Thanks,
>> > >> > > Shen
>> > >> > >
>> > >> >
>> > >>
>> > >
>> >
>>


Re: Adding back PipelineRunner#apply method

2017-08-15 Thread Shen Li
Thank you!

Shen

On Tue, Aug 15, 2017 at 1:29 PM, Robert Bradshaw <
rober...@google.com.invalid> wrote:

> On Tue, Aug 15, 2017 at 10:21 AM, Shen Li  wrote:
> > Hi Thomas,
> >
> > Does it mean future Pipeline implementations would allow applications to
> > set the runner after a pipeline has been constructed?
>
> Correct, that's the intent.
>
> >
> > Thanks,
> > Shen
> >
> > On Tue, Aug 15, 2017 at 12:36 PM, Thomas Groh 
> > wrote:
> >
> >> This style of method doesn't fit with the current approach of pipeline
> >> construction, where the PipelineRunner need not be specified until the
> >> pipeline is run; as such, the runner can't observe the construction of
> the
> >> Pipeline, as it may not exist during the construction of the Pipeline.
> >>
> >> On Tue, Aug 15, 2017 at 8:41 AM, Eugene Kirpichov <
> >> kirpic...@google.com.invalid> wrote:
> >>
> >> > ... And remember it and make available inside PCollection (which
> >> > application produced this collection).
> >> >
> >> > On Tue, Aug 15, 2017, 8:39 AM Eugene Kirpichov 
> >> > wrote:
> >> >
> >> > > In general, no - but the implementation of PAssertionSite
> exemplifies
> >> the
> >> > > approach. I guess it could be useful to make this a general beam
> >> feature
> >> > > and remember it for all transforms. It would probably be best to
> >> > implement
> >> > > inside Pipeline.apply().
> >> > >
> >> > > On Tue, Aug 15, 2017, 7:02 AM Shen Li  wrote:
> >> > >
> >> > >> Hi Eugene,
> >> > >>
> >> > >> Thanks for sharing the info. That PAssertionSite tracks where an
> >> > assertion
> >> > >> error occurred. Do you know if it is possible to get the class name
> >> and
> >> > >> line number where a PTransform was added?
> >> > >>
> >> > >> Thanks,
> >> > >> Shen
> >> > >>
> >> > >> On Mon, Aug 14, 2017 at 10:54 PM, Eugene Kirpichov <
> >> > >> kirpic...@google.com.invalid> wrote:
> >> > >>
> >> > >> > Hi Shen,
> >> > >> > Responding just to one part of your message - "remember the line
> at
> >> > >> which
> >> > >> > the PTransform was added": take a look at
> >> > >> > https://github.com/apache/beam/pull/2247 which does this for
> >> PAssert.
> >> > >> >
> >> > >> > On Mon, Aug 14, 2017 at 7:32 PM Shen Li 
> >> wrote:
> >> > >> >
> >> > >> > > In 0.5.0 or earlier releases, PipelineRunner provides an
> >> > >> > > apply(PTransform, InputT) method which allows
> >> > runner
> >> > >> > > implementation to perform actions when the PTransform is added
> to
> >> > the
> >> > >> > > pipeline. In later releases, that apply(...) method has been
> >> > replaced
> >> > >> by
> >> > >> > a
> >> > >> > > Pipeline#replaceAll() method, where the runner can only get
> >> involved
> >> > >> > after
> >> > >> > > the pipeline has been fully constructed. In terms of override
> >> > >> > PTransforms,
> >> > >> > > these two APIs are identical. But, the early API could still be
> >> > >> helpful.
> >> > >> > > For example, the runner could remember the line at which the
> >> > >> PTransform
> >> > >> > was
> >> > >> > > added, and provide that info to users to assist debugging. Is
> it
> >> > >> possible
> >> > >> > > to add that API back? Or is there any other way to involve the
> >> > runner
> >> > >> > when
> >> > >> > > a PTransform is added?
> >> > >> > >
> >> > >> > > Thanks,
> >> > >> > > Shen
> >> > >> > >
> >> > >> >
> >> > >>
> >> > >
> >> >
> >>
>


Re: Policy for stale PRs

2017-08-15 Thread Kenneth Knowles
Yeah, I think we will need a policy like this eventually, or we will face an
unbounded number of old PRs. I would be OK with closing after 60 or 30 days of
silence, too, since all that is needed is a reply, plus they can always
re-open. What have other projects done?

On Mon, Aug 14, 2017 at 5:29 PM, Ted Yu  wrote:

> The proposal makes sense.
>
> If the author of a PR doesn't respond for 90 days, the PR is likely out of
> sync with the current repo.
>
> Cheers
>
> On Mon, Aug 14, 2017 at 5:27 PM, Ahmet Altay 
> wrote:
>
> > Hi all,
> >
> > Do we have an existing policy for handling stale PRs? If not, could we
> > come up with one? We are getting close to 100 open PRs. Some of the open PRs
> > have not been touched for a while, and if we exclude the pings the number
> > will be higher.
> >
> > For example, we could close PRs that have not been updated by the
> original
> > author for 90 days even after multiple attempts to reach them (e.g. [1],
> > [2] are such PRs.)
> >
> > What do you think?
> >
> > Thank you,
> > Ahmet
> >
> > [1] https://github.com/apache/beam/pull/1464
> > [2] https://github.com/apache/beam/pull/2949
> >
>


Re: Hello from a newbie to the data world living in the city by the bay!

2017-08-15 Thread Umang Sharma
Hi Gris,
Nice to meet you.

I'd like to take this opportunity to introduce myself to you and everyone else
on the dev team.

I’m Umang Sharma. I'm an associate in Data Science and Applications at
Accenture Digital.


I write in Python, Java and a number of other languages.
I'd love to contribute to Beam. It'd be great if someone could guide me in
getting started with contributing :)

Among the other things I like are polo golf, giving talks and talking about
my work.

Thanks,
Umang


On Aug 15, 2017 22:40, "Griselda Cuevas"  wrote:

Hi Beam community,

I’m Griselda (Gris) Cuevas and I’m very excited to join the community. I’m
looking forward to learning awesome things from you and to getting the
chance to collaborate on great initiatives.

I’m currently working at Google and I’m studying for a master’s in operations
research and data science at UC Berkeley. I’m interested in Natural
Language Processing, Information Retrieval and Online Communities. Some
other fun topics I love are juggling, camping and (just getting into it)
listening to podcasts, so if you ever want to talk about any
of these topics, here I am!

Another reason I’m here is that I want to help this project grow and
thrive. This means that you’ll see me contributing to the project, reaching
out to ask questions as I get familiar with our community, and also
helping evangelize Apache Beam by organizing meetups, hangouts, etc.

I say bye for now, I’ll see you around,

Cheers,

G


Re: [ANNOUNCEMENT] New committers, August 2017 edition!

2017-08-15 Thread Mark Liu
Congrats! Excellent work!

On Mon, Aug 14, 2017 at 11:50 PM, Aviem Zur  wrote:

> Congrats!
>
> On Mon, Aug 14, 2017 at 6:43 PM Tyler Akidau 
> wrote:
>
> > Congrats and thanks all around!
> >
> > On Sat, Aug 12, 2017 at 12:09 AM Aljoscha Krettek 
> > wrote:
> >
> > > Congrats, everyone! It's well deserved.
> > >
> > > Best,
> > > Aljoscha
> > >
> > > > On 12. Aug 2017, at 08:06, Pei HE  wrote:
> > > >
> > > > Congratulations to all!
> > > > --
> > > > Pei
> > > >
> > > > On Sat, Aug 12, 2017 at 10:50 AM, James 
> wrote:
> > > >
> > > >> Thank you guys, glad to contribute to this great project, and
> > > >> congratulations to all the new committers!
> > > >>
> > > >> On Sat, Aug 12, 2017 at 8:36 AM Manu Zhang  >
> > > >> wrote:
> > > >>
> > > >>> Thanks everyone !!! It's a great journey.
> > > >>> Congrats to other new committers !
> > > >>>
> > > >>> Thanks,
> > > >>> Manu
> > > >>>
> > > >>> On Sat, Aug 12, 2017 at 5:23 AM Jean-Baptiste Onofré <
> > j...@nanthrax.net>
> > > >>> wrote:
> > > >>>
> > >  Congrats and welcome !
> > > 
> > >  Regards
> > >  JB
> > > 
> > >  On 08/11/2017 07:40 PM, Davor Bonaci wrote:
> > > > Please join me and the rest of Beam PMC in welcoming the
> following
> > > > contributors as our newest committers. They have significantly
> > >  contributed
> > > > to the project in different ways, and we look forward to many
> more
> > > > contributions in the future.
> > > >
> > > > * Reuven Lax
> > > > Reuven has been with the project since the very beginning,
> > > >> contributing
> > > > mostly to the core SDK and the GCP IO connectors. He accumulated
> 52
> > >  commits
> > > > (19,824 ++ / 12,039 --). Most recently, Reuven re-wrote several
> IO
> > > > connectors that significantly expanded their functionality.
> > > >>> Additionally,
> > > > Reuven authored important new design documents relating to update
> > and
> > > > snapshot functionality.
> > > >
> > > > * Jingsong Lee
> > > > Jingsong has been contributing to Apache Beam since the beginning
> > of
> > > >>> the
> > > > year, particularly to the Flink runner. He has accumulated 34
> > commits
> > > > (11,214 ++ / 6,314 --) of deep, fundamental changes that
> > > >> significantly
> > > > improved the quality of the runner. Additionally, Jingsong has
> > >  contributed
> > > > to the project in other ways too -- reviewing contributions, and
> > > > participating in discussions on the mailing list, design
> documents,
> > > >> and
> > > > JIRA issue tracker.
> > > >
> > > > * Mingmin Xu
> > > > Mingmin started the SQL DSL effort, and has driven it to the
> point
> > of
> > > > merging to the master branch. In this effort, he extended the
> > project
> > > >>> to
> > > > the significant new user community.
> > > >
> > > > * Mingming (James) Xu
> > > > James joined the SQL DSL effort, contributing some of the
> trickier
> > > >>> parts,
> > > > such as the Join functionality. Additionally, he's consistently
> > shown
> > > > himself to be an insightful code reviewer, significantly
> impacting
> > > >> the
> > > > project’s code quality and ensuring the success of the new major
> > >  component.
> > > >
> > > > * Manu Zhang
> > > > Manu initiated and developed a runner for the Apache Gearpump
> > >  (incubating)
> > > > engine, and has driven it to the point of merging to the master
> > > >> branch.
> > >  In
> > > > this effort, he accumulated 65 commits (7,812 ++ / 4,882 --) and
> > > >>> extended
> > > > the project to the new user community.
> > > >
> > > > Congratulations to all five! Welcome!
> > > >
> > > > Davor
> > > >
> > > 
> > >  --
> > >  Jean-Baptiste Onofré
> > >  jbono...@apache.org
> > >  http://blog.nanthrax.net
> > >  Talend - http://www.talend.com
> > > 
> > > >>>
> > > >>
> > >
> > >
> >
>


Re: Hello from a newbie to the data world living in the city by the bay!

2017-08-15 Thread Ahmet Altay
Welcome both of you!

Some helpful starting points:
- Contribution guide [1]
- Unassigned starter issues in JIRA [2]

Ahmet

[1] https://beam.apache.org/contribute/contribution-guide/
[2]
https://issues.apache.org/jira/browse/BEAM-2632?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20Reopened)%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20starter%20AND%20assignee%20in%20(EMPTY)%20ORDER%20BY%20created%20DESC%2C%20priority%20DESC

On Tue, Aug 15, 2017 at 11:13 AM, Umang Sharma  wrote:

> Hi Gris,
> Nice to meet you.
>
> I'd like to take this opportunity to introduce myself to you and everyone else
> on the dev team.
>
> I’m Umang Sharma. I'm an associate in Data Science and Applications at
> Accenture Digital.
>
>
> I write in Python, Java and a number of other languages.
> I'd love to contribute to Beam. It'd be great if someone could guide me in
> getting started with contributing :)
>
> Among the other things I like are polo golf, giving talks and talking about
> my work.
>
> Thanks,
> Umang
>
>
> On Aug 15, 2017 22:40, "Griselda Cuevas"  wrote:
>
> Hi Beam community,
>
> I’m Griselda (Gris) Cuevas and I’m very excited to join the community. I’m
> looking forward to learning awesome things from you and to getting the
> chance to collaborate on great initiatives.
>
> I’m currently working at Google and I’m studying for a master’s in operations
> research and data science at UC Berkeley. I’m interested in Natural
> Language Processing, Information Retrieval and Online Communities. Some
> other fun topics I love are juggling, camping and (just getting into it)
> listening to podcasts, so if you ever want to talk about any
> of these topics, here I am!
>
> Another reason I’m here is that I want to help this project grow and
> thrive. This means that you’ll see me contributing to the project, reaching
> out to ask questions as I get familiar with our community, and also
> helping evangelize Apache Beam by organizing meetups, hangouts, etc.
>
> I say bye for now, I’ll see you around,
>
> Cheers,
>
> G
>


Re: [PROPOSAL] "Requires deterministic input"

2017-08-15 Thread Robert Bradshaw
On Sat, Aug 12, 2017 at 1:13 AM, Reuven Lax  wrote:
> On Fri, Aug 11, 2017 at 10:52 PM, Robert Bradshaw <
>> The question here is whether the ordering is part of the "content" of
>> an iterable.
>
> My initial instinct was to say yes - but maybe it should not be until Beam
> has a first-class notion of sorted values after a GBK?

Yeah, I'm not sure on this either. Interestingly, if we consider
ordering to be important, then the composite gbk + ungroup will be
stable despite its components not being so.

>> >> As I mention above, the iterable is semantically [part of] a single
>> >> element. So just to unpack this, to make sure we are talking about the
>> same
>> >> thing, I think you are talking about GBK as implemented via GBKO + GABW.
>> >>
>> >> When the output of GABW is required to be stable but the output of GBKO
>> is
>> >> not stable, we don't have stability for free in all cases by inserting a
>> >> GBK, but require something more to make the output of GABW stable, in
>> the
>> >> worst case a full materialization.
>> >>
>> >
>> > Correct. My point is that there are alternate, cheaper ways of doing
>> this.
>> > If GABW stores state in an ordered list, it can simply checkpoint a
>> > marker into that list to ensure that the output is stable.
>>
>> In the presence of non-trivial triggering and/or late data, I'm not so
>> sure this is "easy." E.g. A bundle may fail, and more data may come in
>> from upstream (and get appended to the buffer) before it is retried.
>>
>
> That will still work. If the subsequent ParDo has processed the Iterable,
> that means we'll have successfully checkpointed a marker to the list (using
> whatever technique the runner supports). More data coming in will get
> appended after the marker, so we can ensure that the retry still sees the
> same elements in the Iterable.

I'm thinking of the following.

1. (k, v1) and (k, v2) come into the GABW and [v1, v2] gets stored in
the state. A trigger gets set.
2. The trigger is fired and (k, [v1, v2]) gets sent downstream, but
for some reason fails.
3. (k, v3) comes into the GABW and [v3] gets appended to the state.
4. The trigger is again fired, and this time (k, [v1, v2, v3]) is sent
downstream.

It is unclear when a marker would be added to the list. Is this in
step 2, which, despite failing, still results in the modified state [v1, v2,
marker]? (And this state modification would have to be committed
before attempting the bundle, in case the "failure" was something like
a VM shutdown.) And only on success is the state modified to be (say
this is accumulating mode) [v1, v2]?

I think it could be done, but it may significantly complicate things.
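
To pin down one possible reading of the marker idea against the four steps above, here is a small simulation. It is a sketch under assumptions, not how any runner implements this: `MarkedBuffer` and its methods are hypothetical, the marker is modeled as a committed end index, and state is kept in accumulating mode.

```python
class MarkedBuffer:
    """Toy per-key buffer; hypothetical, not a Beam or runner class."""

    def __init__(self):
        self.values = []    # everything appended so far, in arrival order
        self.marker = None  # committed end index of an in-flight firing

    def add(self, value):
        self.values.append(value)

    def begin_fire(self):
        # Commit the marker before emitting. If a marker is already present,
        # an earlier attempt failed, so replay exactly the same prefix.
        if self.marker is None:
            self.marker = len(self.values)
        return list(self.values[: self.marker])

    def commit_fire(self):
        # Downstream succeeded; clear the marker. Values are kept, as in
        # accumulating mode.
        self.marker = None


buf = MarkedBuffer()
buf.add("v1")
buf.add("v2")
first_attempt = buf.begin_fire()  # step 2: emitted, then assume it fails
buf.add("v3")                     # step 3: more data arrives before the retry
retry = buf.begin_fire()          # the retry replays the marked prefix
buf.commit_fire()                 # assume the retry succeeds
next_firing = buf.begin_fire()    # a later firing sees all three values

print(first_attempt, retry, next_firing)
```

Under this reading the marker has to be durably committed before the first attempt, which is the extra state write in question.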


Re: Hello from a newbie to the data world living in the city by the bay!

2017-08-15 Thread Justin T
Hello Beam community,

I am also a new member, and I feel a little better knowing that there are
others in the same boat :)

My name is Justin and I work as a full stack engineer for Neustar, a
marketing analytics company in San Diego. Over the past few weeks I have
been getting more familiar with Beam via documentation, papers, videos, and
the old email archives and I am very excited to start making contributions.
Thank you Altay for the useful links!

-Justin Tumale

On Tue, Aug 15, 2017 at 11:19 AM, Ahmet Altay 
wrote:

> Welcome both of you!
>
> Some helpful starting points:
> - Contribution guide [1]
> - Unassigned starter issues in JIRA [2]
>
> Ahmet
>
> [1] https://beam.apache.org/contribute/contribution-guide/
> [2]
> https://issues.apache.org/jira/browse/BEAM-2632?jql=
> project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20Reopened)%20AND%
> 20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20starter%20AND%
> 20assignee%20in%20(EMPTY)%20ORDER%20BY%20created%20DESC%
> 2C%20priority%20DESC
>
> On Tue, Aug 15, 2017 at 11:13 AM, Umang Sharma 
> wrote:
>
> > Hi Gris,
> > Nice to meet you.
> >
> > I'd like to take this opportunity to introduce myself to you and everyone
> > else on the dev team.
> >
> > I’m Umang Sharma. I'm an associate in Data Science and Applications at
> > Accenture Digital.
> >
> >
> > I write in Python, Java and a number of other languages.
> > I'd love to contribute to Beam. It'd be great if someone could guide me in
> > getting started with contributing :)
> >
> > Among the other things I like are polo golf, giving talks and talking
> > about my work.
> >
> > Thanks,
> > Umang
> >
> >
> > On Aug 15, 2017 22:40, "Griselda Cuevas" 
> wrote:
> >
> > Hi Beam community,
> >
> > I’m Griselda (Gris) Cuevas and I’m very excited to join the community,
> I’m
> > looking forward to learning awesome things from you and to getting the
> > chance to collaborate on great initiatives.
> >
> > I’m currently working at Google and I’m studying a masters in operations
> > research and data science at UC Berkeley. I’m interested in Natural
> > Language Processing, Information Retrieval and Online Communities. Some
> > other fun topics I love are juggling, camping and -just getting into it-
> >  listening to podcasts, so if you ever want to discuss and talk about any
> > of these topics, here I am!
> >
> > Another reason why I’m here is because I want to help this project grow
> > and thrive. This means that you’ll see me contributing to the project,
> > reaching out to ask questions as I get familiar with our community, and
> > also helping evangelize Apache Beam by organizing meetups, hangouts, etc.
> >
> > I say bye for now, I’ll see you around,
> >
> > Cheers,
> >
> > G
> >
>


Re: [PROPOSAL] "Requires deterministic input"

2017-08-15 Thread Reuven Lax
On Tue, Aug 15, 2017 at 1:59 PM, Robert Bradshaw <
rober...@google.com.invalid> wrote:

> On Sat, Aug 12, 2017 at 1:13 AM, Reuven Lax 
> wrote:
> > On Fri, Aug 11, 2017 at 10:52 PM, Robert Bradshaw <
> >> The question here is whether the ordering is part of the "content" of
> >> an iterable.
> >
> > My initial instinct was to say yes - but maybe it should not be until
> Beam
> > has a first-class notion of sorted values after a GBK?
>
> Yeah, I'm not sure on this either. Interestingly, if we consider
> ordering to be important, then the composite gbk + ungroup will be
> stable despite its components not being so.
>
> >> >> As I mention above, the iterable is semantically [part of] a single
> >> >> element. So just to unpack this, to make sure we are talking about
> the
> >> same
> >> >> thing, I think you are talking about GBK as implemented via GBKO +
> GABW.
> >> >>
> >> >> When the output of GABW is required to be stable but the output of
> GBKO
> >> is
> >> >> not stable, we don't have stability for free in all cases by
> inserting a
> >> >> GBK, but require something more to make the output of GABW stable, in
> >> the
> >> >> worst case a full materialization.
> >> >>
> >> >
> >> > Correct. My point is that there are alternate, cheaper ways of doing
> >> this.
> >> > If GABW stores state in an ordered list, it can simply checkpoint a
> >> > marker into that list to ensure that the output is stable.
> >>
> >> In the presence of non-trivial triggering and/or late data, I'm not so
> >> sure this is "easy." E.g. A bundle may fail, and more data may come in
> >> from upstream (and get appended to the buffer) before it is retried.
> >>
> >
> > That will still work. If the subsequent ParDo has processed the Iterable,
> > that means we'll have successfully checkpointed a marker to the list
> (using
> > whatever technique the runner supports). More data coming in will get
> > appended after the marker, so we can ensure that the retry still sees the
> > same elements in the Iterable.
>
> I'm thinking of the following.
>
> 1. (k, v1) and (k, v2) come into the GABW and [v1, v2] gets stored in
> the state. A trigger gets set.
> 2. The trigger is fired and (k, [v1, v2]) gets sent downstream, but
> for some reason fails.
> 3. (k, v3) comes into the GABW and [v3] gets appended to the state.
> 4. The trigger is again fired, and this time (k, [v1, v2, v3]) is sent
> downstream.
>
>
If you add the annotation specifying stable input, then we will not do this.
Before we send anything downstream, we will add a marker to the list, and
only forward data downstream once the marker has been checkpointed. This
adds a bit of cost and latency of course, but the assumption is that adding
this annotation will always add some cost.
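
As a rough illustration of the marker idea, here is a minimal Python sketch.
It is not Beam code: OrderedBuffer, begin_firing and commit_firing are
hypothetical names, and "checkpointing" is reduced to setting a field in
memory, where a real runner would have to commit the marker durably before
forwarding anything downstream.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class OrderedBuffer:
    """Hypothetical per-key state for a GABW-like step: an append-only list."""

    values: List[str] = field(default_factory=list)
    # Index just past the last element covered by the in-flight firing,
    # or None when no firing is in flight. This stands in for the marker.
    marker: Optional[int] = None

    def append(self, value: str) -> None:
        # New data always lands at the end, i.e. after any existing marker.
        self.values.append(value)

    def begin_firing(self) -> List[str]:
        # Record the marker before emitting anything. On a retry the marker
        # already exists, so the same prefix is emitted again.
        if self.marker is None:
            self.marker = len(self.values)
        return list(self.values[: self.marker])

    def commit_firing(self) -> None:
        # Downstream processing succeeded: drop the marker. The values are
        # kept, which corresponds to accumulating mode.
        self.marker = None


if __name__ == "__main__":
    buf = OrderedBuffer()
    buf.append("v1")
    buf.append("v2")
    first_attempt = buf.begin_firing()  # sent downstream, then fails
    buf.append("v3")                    # arrives before the retry
    retry = buf.begin_firing()          # retry reuses the existing marker
    print(first_attempt, retry)
    buf.commit_firing()
    print(buf.begin_firing())           # the next firing includes v3
```

In this sketch the failed attempt and its retry both see the prefix ending
at the marker, and v3 only shows up in the firing that follows a successful
commit. The extra cost mentioned above corresponds to making the marker
durable before begin_firing returns.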



> It is unclear when a marker would be added to the list. Is this in
> step 2 which, despite failing, still results in modified state [v1, v2,
> marker]? (And this state modification would have to be committed
> before attempting the bundle, in case the "failure" was something like
> a VM shutdown.) And only on success the state is modified to be (say
> this is accumulating mode) [v1, v2]?
>
> I think it could be done, but it may significantly complicate things.
>


Re: [PROPOSAL] "Requires deterministic input"

2017-08-15 Thread Robert Bradshaw
On Tue, Aug 15, 2017 at 2:14 PM, Reuven Lax  wrote:
> On Tue, Aug 15, 2017 at 1:59 PM, Robert Bradshaw <
> rober...@google.com.invalid> wrote:
>
>> On Sat, Aug 12, 2017 at 1:13 AM, Reuven Lax 
>> wrote:
>> > On Fri, Aug 11, 2017 at 10:52 PM, Robert Bradshaw <
>> >> The question here is whether the ordering is part of the "content" of
>> >> an iterable.
>> >
>> > My initial instinct was to say yes - but maybe it should not be until
>> Beam
>> > has a first-class notion of sorted values after a GBK?
>>
>> Yeah, I'm not sure on this either. Interestingly, if we consider
>> ordering to be important, then the composite gbk + ungroup will be
>> stable despite its components not being so.
>>
>> >> >> As I mention above, the iterable is semantically [part of] a single
>> >> >> element. So just to unpack this, to make sure we are talking about
>> the
>> >> same
>> >> >> thing, I think you are talking about GBK as implemented via GBKO +
>> GABW.
>> >> >>
>> >> >> When the output of GABW is required to be stable but the output of
>> GBKO
>> >> is
>> >> >> not stable, we don't have stability for free in all cases by
>> inserting a
>> >> >> GBK, but require something more to make the output of GABW stable, in
>> >> the
>> >> >> worst case a full materialization.
>> >> >>
>> >> >
>> >> > Correct. My point is that there are alternate, cheaper ways of doing
>> >> this.
>> >> > If GABW stores state in an ordered list, it can simply checkpoint a
>> >> > marker into that list to ensure that the output is stable.
>> >>
>> >> In the presence of non-trivial triggering and/or late data, I'm not so
>> >> sure this is "easy." E.g. A bundle may fail, and more data may come in
>> >> from upstream (and get appended to the buffer) before it is retried.
>> >>
>> >
>> > That will still work. If the subsequent ParDo has processed the Iterable,
>> > that means we'll have successfully checkpointed a marker to the list
>> (using
>> > whatever technique the runner supports). More data coming in will get
>> > appended after the marker, so we can ensure that the retry still sees the
>> > same elements in the Iterable.
>>
>> I'm thinking of the following.
>>
>> 1. (k, v1) and (k, v2) come into the GABW and [v1, v2] gets stored in
>> the state. A trigger gets set.
>> 2. The trigger is fired and (k, [v1, v2]) gets sent downstream, but
>> for some reason fails.
>> 3. (k, v3) comes into the GABW and [v3] gets appended to the state.
>> 4. The trigger is again fired, and this time (k, [v1, v2, v3]) is sent
>> downstream.
>>
>>
> If you add the annotation specifying stable input, then we will not do this.
> Before we send anything downstream, we will add a marker to the list, and
> only forward data downstream once the marker has been checkpointed. This
> adds a bit of cost and latency of course, but the assumption is that adding
> this annotation will always add some cost.

I don't think you can checkpoint anything "before sending data
downstream" if it's being executed as part of a fused graph, unless we
add special support for this in the Fn API. I suppose the runner could
pre-emptively modify the state of any GABW operations before firing
triggers...

>> It is unclear when a marker would be added to the list. Is this in
>> step 2 which, despite failing, still results in modified state [v1, v2,
>> marker]? (And this state modification would have to be committed
>> before attempting the bundle, in case the "failure" was something like
>> a VM shutdown.) And only on success the state is modified to be (say
>> this is accumulating mode) [v1, v2]?
>>
>> I think it could be done, but it may significantly complicate things.
>>


Re: Policy for stale PRs

2017-08-15 Thread Jean-Baptiste Onofré
If we consider the author, it makes sense.

Regards
JB

On Aug 15, 2017, at 01:29, Ted Yu wrote:
>The proposal makes sense.
>
>If the author of PR doesn't respond for 90 days, the PR is likely out
>of
>sync with current repo.
>
>Cheers
>
>On Mon, Aug 14, 2017 at 5:27 PM, Ahmet Altay 
>wrote:
>
>> Hi all,
>>
>> Do we have an existing policy for handling stale PRs? If not, could we
>> come up with one? We are getting close to 100 open PRs. Some of the open
>PRs
>> have not been touched for a while, and if we exclude the pings the
>number
>> will be higher.
>>
>> For example, we could close PRs that have not been updated by the
>original
>> author for 90 days even after multiple attempts to reach them (e.g.
>[1],
>> [2] are such PRs.)
>>
>> What do you think?
>>
>> Thank you,
>> Ahmet
>>
>> [1] https://github.com/apache/beam/pull/1464
>> [2] https://github.com/apache/beam/pull/2949
>>


Re: [PROPOSAL] "Requires deterministic input"

2017-08-15 Thread Reuven Lax
Well the Fn API is still being designed, so this is something we'd have to
think about.

On Tue, Aug 15, 2017 at 2:19 PM, Robert Bradshaw <
rober...@google.com.invalid> wrote:

> On Tue, Aug 15, 2017 at 2:14 PM, Reuven Lax 
> wrote:
> > On Tue, Aug 15, 2017 at 1:59 PM, Robert Bradshaw <
> > rober...@google.com.invalid> wrote:
> >
> >> On Sat, Aug 12, 2017 at 1:13 AM, Reuven Lax 
> >> wrote:
> >> > On Fri, Aug 11, 2017 at 10:52 PM, Robert Bradshaw <
> >> >> The question here is whether the ordering is part of the "content" of
> >> >> an iterable.
> >> >
> >> > My initial instinct was to say yes - but maybe it should not be until
> >> Beam
> >> > has a first-class notion of sorted values after a GBK?
> >>
> >> Yeah, I'm not sure on this either. Interestingly, if we consider
> >> ordering to be important, then the composite gbk + ungroup will be
> >> stable despite its components not being so.
> >>
> >> >> >> As I mention above, the iterable is semantically [part of] a
> single
> >> >> >> element. So just to unpack this, to make sure we are talking about
> >> the
> >> >> same
> >> >> >> thing, I think you are talking about GBK as implemented via GBKO +
> >> GABW.
> >> >> >>
> >> >> >> When the output of GABW is required to be stable but the output of
> >> GBKO
> >> >> is
> >> >> >> not stable, we don't have stability for free in all cases by
> >> inserting a
> >> >> >> GBK, but require something more to make the output of GABW
> stable, in
> >> >> the
> >> >> >> worst case a full materialization.
> >> >> >>
> >> >> >
> >> >> > Correct. My point is that there are alternate, cheaper ways of
> doing
> >> >> this.
> >> >> > If GABW stores state in an ordered list, it can simply checkpoint a
> >> >> > marker into that list to ensure that the output is stable.
> >> >>
> >> >> In the presence of non-trivial triggering and/or late data, I'm not
> so
> >> >> sure this is "easy." E.g. A bundle may fail, and more data may come
> in
> >> >> from upstream (and get appended to the buffer) before it is retried.
> >> >>
> >> >
> >> > That will still work. If the subsequent ParDo has processed the
> Iterable,
> >> > that means we'll have successfully checkpointed a marker to the list
> >> (using
> >> > whatever technique the runner supports). More data coming in will get
> >> > appended after the marker, so we can ensure that the retry still sees
> the
> >> > same elements in the Iterable.
> >>
> >> I'm thinking of the following.
> >>
> >> 1. (k, v1) and (k, v2) come into the GABW and [v1, v2] gets stored in
> >> the state. A trigger gets set.
> >> 2. The trigger is fired and (k, [v1, v2]) gets sent downstream, but
> >> for some reason fails.
> >> 3. (k, v3) comes into the GABW and [v3] gets appended to the state.
> >> 4. The trigger is again fired, and this time (k, [v1, v2, v3]) is sent
> >> downstream.
> >>
> >>
> > If you add the annotation specifying stable input, then we will not do
> > this.
> > Before we send anything downstream, we will add a marker to the list, and
> > only forward data downstream once the marker has been checkpointed. This
> > adds a bit of cost and latency of course, but the assumption is that
> adding
> > this annotation will always add some cost.
>
> I don't think you can checkpoint anything "before sending data
> downstream" if it's being executed as part of a fused graph, unless we
> add special support for this in the Fn API. I suppose the runner could
> pre-emptively modify the state of any GABW operations before firing
> triggers...
>
> >> It is unclear when a marker would be added to the list. Is this in
> >> step 2 which, despite failing, still results in modified state [v1, v2,
> >> marker]? (And this state modification would have to be committed
> >> before attempting the bundle, in case the "failure" was something like
> >> a VM shutdown.) And only on success the state is modified to be (say
> >> this is accumulating mode) [v1, v2]?
> >>
> >> I think it could be done, but it may significantly complicate things.
> >>
>


Re: [VOTE] Release 2.1.0, release candidate #3

2017-08-15 Thread Eugene Kirpichov
Hey all,

Seems like we're missing one more affirmative vote from a PMC member (so
far we have JB and Ahmet) to proceed with the release.
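
For reference, the passing condition from the vote announcement quoted
below (majority approval, with at least 3 PMC affirmative votes) can be
written as a small check. This is only a sketch: vote_passes is a
hypothetical name, and treating "majority approval" as more binding +1
than binding -1 votes is an assumption about how the rule is applied.

```python
def vote_passes(binding_plus: int, binding_minus: int, min_binding: int = 3) -> bool:
    """Check a release vote using only binding (PMC) votes."""
    enough_support = binding_plus >= min_binding
    majority = binding_plus > binding_minus  # assumed reading of "majority approval"
    return enough_support and majority


if __name__ == "__main__":
    # Two binding +1 votes so far: not yet enough.
    print(vote_passes(binding_plus=2, binding_minus=0))
    # One more binding +1 would satisfy the rule.
    print(vote_passes(binding_plus=3, binding_minus=0))
```

Non-binding votes are deliberately left out of the tally in this sketch.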

On Mon, Aug 14, 2017 at 9:30 AM Ahmet Altay 
wrote:

> On Mon, Aug 14, 2017 at 6:32 AM, Ismaël Mejía  wrote:
>
> > +1 (non-binding)
> >
> > - Validated signatures OK
> > - mvn clean verify -Prelease on both OpenJDK 1.7 and Oracle JDK 8 with
> > the docker development images (WIP), both OK
> > - Run WordCount on local Flink and Spark runners OK
> >
> > Everything looks nice, only one minor thing (not blocking at all). The
> > proto generated files for python are not cleaned correctly and this
> > causes the validation to complain because the maven rat plugin does
> > not find the apache headers on the files  (this happens if you execute
> > mvn clean verify -Prelease immediately after the validation).
> >
>
> Ismaël, could you create a JIRA issue for this (to be fixed in a future
> release)?
>
>
> >
> > On Sun, Aug 13, 2017 at 6:52 AM, Jean-Baptiste Onofré 
> > wrote:
> > > +1 (binding)
> > >
> > > I did my own tests and am casting my own vote ;)
> > >
> > > Regards
> > > JB
> > >
> > > On 08/09/2017 07:08 AM, Jean-Baptiste Onofré wrote:
> > >>
> > >> Hi everyone,
> > >>
> > >> Please review and vote on the release candidate #3 for the version
> > 2.1.0,
> > >> as follows:
> > >>
> > >> [ ] +1, Approve the release
> > >> [ ] -1, Do not approve the release (please provide specific comments)
> > >>
> > >>
> > >> The complete staging area is available for your review, which
> includes:
> > >> * JIRA release notes [1],
> > >> * the official Apache source release to be deployed to
> dist.apache.org
> > >> [2], which is signed with the key with fingerprint C8282E76 [3],
> > >> * all artifacts to be deployed to the Maven Central Repository [4],
> > >> * source code tag "v2.1.0-RC3" [5],
> > >> * website pull request listing the release and publishing the API
> > >> reference manual [6].
> > >> * Python artifacts are deployed along with the source release to the
> > >> dist.apache.org [2].
> > >>
> > >> The vote will be open for at least 72 hours. It is adopted by majority
> > >> approval, with at least 3 PMC affirmative votes.
> > >>
> > >> Thanks,
> > >> JB
> > >>
> > >> [1]
> > >> https://issues.apache.org/jira/secure/ReleaseNote.jspa?
> > projectId=12319527&version=12340528
> > >> [2] https://dist.apache.org/repos/dist/dev/beam/2.1.0/
> > >> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> > >> [4] https://repository.apache.org/content/repositories/
> > orgapachebeam-1020/
> > >> [5] https://github.com/apache/beam/tree/v2.1.0-RC3
> > >> [6] https://github.com/apache/beam-site/pull/270
> > >
> > >
> > > --
> > > Jean-Baptiste Onofré
> > > jbono...@apache.org
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> >
>