Re: Selectively running tests?

2019-03-18 Thread Alan Myrvold
The includedRegions setting was set up as part of
https://issues.apache.org/jira/browse/BEAM-4445 and there are additional
paths added in
https://github.com/apache/beam/blob/6bb4b2332b11bd8295ac6965be8426b9c38fa454/.test-infra/jenkins/PrecommitJobBuilder.groovy#L65

Not sure why they are not working, but it would be good to get this going
again. Might have stopped at the same time Jenkins was updated.

On Mon, Mar 18, 2019 at 3:28 PM Pablo Estrada  wrote:

> We used to have tests run selectively depending on which directories were
> changed. I've just noticed that this is not the case anymore.
>
> Did we stop doing that? Or maybe the selector is faulty? Anyone know what
> happened here?
> Thanks!
> -P.
>


Connected streams with Beam

2019-03-18 Thread Suneel Marthi
Could someone point me to how to do a connected stream from 2 sources with both
the Java and Python APIs?

Thanks

Sent from my iPhone
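
One common pattern (a hedged editorial sketch, not an archived reply): to
merge two sources into a single stream, the Java SDK uses Flatten, or
CoGroupByKey if the streams should instead be joined by key; the Python SDK
has the equivalent beam.Flatten. The Create sources below are stand-ins for
real reads such as KafkaIO or PubsubIO.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.Flatten;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionList;

    public class ConnectedStreams {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create();

        // Two sources; in a real job these would be unbounded reads.
        PCollection<String> first = p.apply("SourceA", Create.of("a1", "a2"));
        PCollection<String> second = p.apply("SourceB", Create.of("b1", "b2"));

        // Flatten merges both inputs into one PCollection that downstream
        // transforms consume as a single stream.
        PCollection<String> merged =
            PCollectionList.of(first).and(second).apply(Flatten.pCollections());

        p.run().waitUntilFinish();
      }
    }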

Re: JIRA hygiene

2019-03-18 Thread Pablo Estrada
I feel resistance to put that burden on the committer, but that's a mostly
selfish feeling : )

I think it makes sense to expect committers to prompt/ask the contributor
or decide whether to close a JIRA issue. I do find that an auto-close on
merge would be more convenient, except for the rarer case where multiple
PRs address a single JIRA issue.

It would be great if there were some JIRA/GitHub integration tool that
would help with these tasks (e.g. notify on the PR that the JIRA issue will
be closed, so the committer can decide to reopen, or some workflow like
that).

Best
-P.

On Mon, Mar 18, 2019 at 2:22 PM Reuven Lax  wrote:

> Oh I agree with this. I'm just saying that an automated close on merge
> might not always do the right thing.
>
> On Mon, Mar 18, 2019 at 4:51 AM Etienne Chauchot 
> wrote:
>
>> Well, I agree but a contributor might not have the rights on jira and,
>> more importantly, he might be unable to choose a target version for the
>> jira. Targeting the ticket to the correct version requires knowing the
>> release cut date, which is not the date of the commit to which the release
>> tag points in the case of cherry picks. It seems a bit complicated for a
>> one-time contributor. This is why I proposed that the committer/reviewer
>> does the jira closing.
>>
>> Etienne
>>
>>
>> On Wednesday, March 13, 2019 at 17:08 -0700, Ahmet Altay wrote:
>>
>> I agree with defining the workflow for closing JIRAs. Would not the
>> contributor be in a better position to close JIRAs or keep them open? It
>> would make sense for the committer to ask about this, but I think the
>> contributor (presumably the person who is the assignee of the JIRA) could
>> be the responsible party for updating their JIRAs. On the other hand, I
>> understand the argument that the committer could do this at the time of
>> merging and fill a gap in the process.
>>
>> On Wed, Mar 13, 2019 at 4:59 PM Michael Luckey 
>> wrote:
>>
>> Hi,
>>
>> definitely +1 to properly establish a workflow to maintain jira status.
>> Naively I'd think the reporter should close it, as she is the one to
>> confirm whether the reported issue is fixed or not. But for obvious
>> reasons that will not work here, so - although it puts another burden on
>> committers - you are probably right that the committer is the best choice
>> to ensure that the ticket gets promoted: whether it is resolved, or it is
>> clarified what's still to be done.
>>
>> Looking into the current state, we seem to have tons of issues with
>> merged PRs, which for anyone trying to find an existing jira issue to work
>> on makes it unnecessarily difficult to decide whether to look into that or
>> not. From my personal experience, it is somehow frustrating going through
>> open issues, selecting one, and after investing some (or even more) time to
>> first understand a problem and then the PR, realising nothing has to be
>> done anymore. Or not knowing what's left out and for what reason. But of
>> course, this is another issue which we definitely need to invest time into
>> - kenn already asked for our support here.
>>
>> thx,
>>
>> michel
>>
>> On Tue, Mar 12, 2019 at 11:30 AM Etienne Chauchot 
>> wrote:
>>
>> Hi Thomas,
>>
>> I agree, the committer that merges a PR should close the ticket. And, if
>> needed, he could discuss with the author (inside the PR) to assess if the
>> PR covers the ticket scope.
>>
>> This is the rule I apply to myself when I merge a PR (even though it has
>> happened that I forgot to close one or two tickets :) ).
>>
>> Etienne
>>
>>
>> On Monday, March 11, 2019 at 14:17 -0700, Thomas Weise wrote:
>>
>> JIRA probably deserves a separate discussion. It is messy. We also have
>> examples of tickets being referenced by users that were not closed,
>> although the feature was long since implemented or the issue fixed.
>>
>> There is no clear ownership in our workflow.
>>
>> A while ago I proposed in another context to make resolving JIRA part of
>> committer duty. I would like to bring this up for discussion again:
>>
>> https://github.com/apache/beam/pull/7129#discussion_r236405202
>>
>> Thomas
>>
>>
>> On Mon, Mar 11, 2019 at 1:47 PM Ahmet Altay  wrote:
>>
>> I agree this is a good idea. I used the same technique for 2.11 blog post
>> (JIRA release notes -> editorialized list + diffed the dependencies).
>>
>> On Mon, Mar 11, 2019 at 1:40 PM Kenneth Knowles  wrote:
>>
>> That is a good idea. The blog post is probably the main avenue where
>> folks will find out about new features or bug fixes.
>>
>> When I did 2.10.0 I just used the automated Jira release notes and pulled
>> out significant things based on my judgment. I would also suggest that our
>> Jira hygiene could be significantly improved to make this process more
>> effective.
>>
>>
>> +1 to improving JIRA notes as well. Oftentimes issues are closed with no
>> real comment on what happened or whether it has been resolved. It becomes
>> an exercise in reading the linked PRs to figure out what happened.
>>
>>
>>
>> Kenn
>>
>

Selectively running tests?

2019-03-18 Thread Pablo Estrada
We used to have tests run selectively depending on which directories were
changed. I've just noticed that this is not the case anymore.

Did we stop doing that? Or maybe the selector is faulty? Anyone know what
happened here?
Thanks!
-P.


[Announcement] New Website for Beam Summits

2019-03-18 Thread Aizhamal Nurmamat kyzy
Hi folks!

We are thrilled to announce the launch of beamsummit.org, a website
dedicated to Beam Summits!

The current version of the website provides information about the upcoming
Beam Summit in Europe on June 19-20th, 2019. We will update it for the
future summits in Asia and North America accordingly. There you can find
all the necessary information about the conference theme, speakers and
sessions, the abstract submission timeline, the registration process, the
conference venues, and much more that you will find useful before and
during the Beam Summits 2019.

We are working to make the website easy to use, so that anyone who is
organizing a Beam event can rely on it. You can find the code for it on
GitHub.

The pages will be updated on a regular basis, but we also love hearing
thoughts from our community! Let us know if you have any questions,
comments or suggestions, and help us improve. Also, if you are thinking of
organizing a Beam event, please feel free to reach out for support, and to
use the code in GitHub as well.

We sincerely hope that you like the new Beam Summit site and will find it
useful for accessing information. Enjoy browsing around!

See you in Berlin :)


Aizhamal


Re: JIRA hygiene

2019-03-18 Thread Reuven Lax
Oh I agree with this. I'm just saying that an automated close on merge
might not always do the right thing.

On Mon, Mar 18, 2019 at 4:51 AM Etienne Chauchot 
wrote:

> Well, I agree but a contributor might not have the rights on jira and,
> more importantly, he might be unable to choose a target version for the
> jira. Targeting the ticket to the correct version requires knowing the
> release cut date, which is not the date of the commit to which the release
> tag points in the case of cherry picks. It seems a bit complicated for a
> one-time contributor. This is why I proposed that the committer/reviewer
> does the jira closing.
>
> Etienne
>
>
> On Wednesday, March 13, 2019 at 17:08 -0700, Ahmet Altay wrote:
>
> I agree with defining the workflow for closing JIRAs. Would not the
> contributor be in a better position to close JIRAs or keep them open? It
> would make sense for the committer to ask about this, but I think the
> contributor (presumably the person who is the assignee of the JIRA) could
> be the responsible party for updating their JIRAs. On the other hand, I
> understand the argument that the committer could do this at the time of
> merging and fill a gap in the process.
>
> On Wed, Mar 13, 2019 at 4:59 PM Michael Luckey 
> wrote:
>
> Hi,
>
> definitely +1 to properly establish a workflow to maintain jira status.
> Naively I'd think the reporter should close it, as she is the one to
> confirm whether the reported issue is fixed or not. But for obvious
> reasons that will not work here, so - although it puts another burden on
> committers - you are probably right that the committer is the best choice
> to ensure that the ticket gets promoted: whether it is resolved, or it is
> clarified what's still to be done.
>
> Looking into the current state, we seem to have tons of issues with
> merged PRs, which for anyone trying to find an existing jira issue to work
> on makes it unnecessarily difficult to decide whether to look into that or
> not. From my personal experience, it is somehow frustrating going through
> open issues, selecting one, and after investing some (or even more) time to
> first understand a problem and then the PR, realising nothing has to be
> done anymore. Or not knowing what's left out and for what reason. But of
> course, this is another issue which we definitely need to invest time into
> - kenn already asked for our support here.
>
> thx,
>
> michel
>
> On Tue, Mar 12, 2019 at 11:30 AM Etienne Chauchot 
> wrote:
>
> Hi Thomas,
>
> I agree, the committer that merges a PR should close the ticket. And, if
> needed, he could discuss with the author (inside the PR) to assess if the
> PR covers the ticket scope.
>
> This is the rule I apply to myself when I merge a PR (even though it has
> happened that I forgot to close one or two tickets :) ).
>
> Etienne
>
>
> On Monday, March 11, 2019 at 14:17 -0700, Thomas Weise wrote:
>
> JIRA probably deserves a separate discussion. It is messy. We also have
> examples of tickets being referenced by users that were not closed,
> although the feature was long since implemented or the issue fixed.
>
> There is no clear ownership in our workflow.
>
> A while ago I proposed in another context to make resolving JIRA part of
> committer duty. I would like to bring this up for discussion again:
>
> https://github.com/apache/beam/pull/7129#discussion_r236405202
>
> Thomas
>
>
> On Mon, Mar 11, 2019 at 1:47 PM Ahmet Altay  wrote:
>
> I agree this is a good idea. I used the same technique for 2.11 blog post
> (JIRA release notes -> editorialized list + diffed the dependencies).
>
> On Mon, Mar 11, 2019 at 1:40 PM Kenneth Knowles  wrote:
>
> That is a good idea. The blog post is probably the main avenue where folks
> will find out about new features or bug fixes.
>
> When I did 2.10.0 I just used the automated Jira release notes and pulled
> out significant things based on my judgment. I would also suggest that our
> Jira hygiene could be significantly improved to make this process more
> effective.
>
>
> +1 to improving JIRA notes as well. Oftentimes issues are closed with no
> real comment on what happened or whether it has been resolved. It becomes an
> exercise in reading the linked PRs to figure out what happened.
>
>
>
> Kenn
>
> On Mon, Mar 11, 2019 at 1:04 PM Thomas Weise  wrote:
>
> Ahmet, thanks for managing the release!
>
> I have a suggestion (not specific to only this release):
>
> The release blogs could be more useful to users. In this case, we have a
> long list of dependency updates on the top, but probably the improvements
> and features section should come first. I was also very surprised to find
> "Portable Flink runner support for running cross-language transforms."
> mentioned, since that is only being worked on now. On the other hand, there
> are probably items that we miss.
>
> Since this can only be addressed by more eyes, I suggest that going
> forward the blog pull request is included and reviewed as part of the
> release vote.
>
> Also, we should make announcing the releas

Re: Executing gradlew build

2019-03-18 Thread Brian Hulette
> Also on our contributor site [3] we recommend ensuring that one is able to
run all tests with 'gradlew check', which is not too far away from a full
build. Probably no one would expect a full build to fail here, which again
makes me think we need an equivalent job on Jenkins here?

I think this is a really important point. When I first started, I wasted a
decent amount of time trying to make `./gradlew check` work because it's
mentioned in the contributor guide. Eventually my teammates told me not to
worry about it, but it would have been nice to get that warm fuzzy feeling
of completing a full build (and had I not been part of a team with Beam
experience, it could have been very discouraging). Maybe we should put a
note there pointing out that not all checks are expected to pass on master?
We could suggest new contributors run the checks from the latest release
tag instead.

Brian


On Thu, Mar 14, 2019 at 4:53 AM Michael Luckey  wrote:

> Hi,
>
> +1 for getting a controlled environment. I am a bit scared to rely on
> developers' configurations to actually build release artefacts. For
> instance, during my release process testing I used a docker image, only to
> realise after quite a few cycles that the default locale was off, which
> led to Java using ASCII file encodings. This is most likely something we do
> not want to happen in real life. And the plain fact that probably every
> developer uses a different JDK build makes the release build - apart from
> other differences - not reproducible, no?
>
> On the other hand, I finally managed to find the Jenkins build which
> actually executes a full build [1]. Unfortunately it also seems to be
> consistently failing since March 1 [2].
>
> [1] https://builds.apache.org/job/beam_Release_NightlySnapshot/
> [2]
> https://builds.apache.org/job/beam_Release_NightlySnapshot/buildTimeTrend
>
> On Mon, Mar 11, 2019 at 9:41 PM Ahmet Altay  wrote:
>
>>
>>
>> On Mon, Mar 11, 2019 at 7:03 AM Michael Luckey 
>> wrote:
>>
>>>
>>>
>>> On Mon, Mar 11, 2019 at 3:51 AM Kenneth Knowles  wrote:
>>>
 I have never actually tried to run a full build recently. It takes a
 long time and usually isn't what is needed for a particular change. FWIW I
 view Beam at this point as a mini-monorepo, so each directory and target
 can be healthy/unhealthy on its own.

>>>
>>> Fair Point. Totally agree.
>>>
>>>

 But it does seem like we should at least know what is unhealthy and
 why. Have you been filing Jiras about the failures? Are they predictable?
 Are they targets that pass in Jenkins but not in vanilla build? That would
 mean our Jenkins environment is too rich and should be slimmed down
 probably.

 Kenn

>>>
>>> Unfortunately those failures are not really predictable. Usually, I
>>> start with a plain './gradlew build' and keep adding exclusions like '-x
>>> :beam-sdks-python:testPy2Gcp -x :beam-sdks-python:testPython' until the
>>> build succeeds. Afterwards it seems to be possible to remove these
>>> exclusions step by step, thereby filling the build cache, which on
>>> subsequent reruns might have some impact on how tasks are executed.
>>>
>>> Most of the failures are Python-related. I haven't had much success
>>> getting into those. From time to time I see 'seemingly' similar failures
>>> on Jenkins, but tracing Python is more difficult coming from a Java
>>> background. Also, on Mac I seem to remember that the preinstalled Python
>>> had some issues/differences compared with private installs. Others are
>>> the Elasticsearch tests - which were worked on lately - and the
>>> ClickHouse tests, which seem to be still flaky.
>>>
>>
>>> So I mainly blamed my setup and did not yet have the time to further
>>> track those failures down. But
>>>
>>> As I did use a vanilla system and was not able to get beam to build, I
>>> got thinking about
>>>
>>> 1. The release process
>>> The release manager has a lot of stuff to take care of, but is also
>>> supposed to run a full gradle build on her local machine [1]. Apart from
>>> that being a long-running task, if it keeps failing this puts an
>>> additional burden on the release manager. So I was wondering why we
>>> cannot push that to Jenkins as we do with all these other tests [2]. I
>>> did not find any existing job doing this, so I wanted to ask for feedback
>>> here.
>>>
>>> If a full build has to be run - and of course it makes some sense on
>>> release - I would suggest getting it running on a regular basis on
>>> Jenkins, just to ensure we are not surprised during the release. And as a
>>> side effect, this would enable the release manager to also delegate the
>>> build to Jenkins, freeing her time (and dev box).
>>>
>>
>> +1, this will be great. Quite often we end up catching issues right when
>> we are doing the release. I would one up this request and suggest a Jenkins
>> job running most of the release process as much as possible to avoid last
>> minute surprises.
>>
>> Also, I wonder if we could build our releases in a controlled environment
>> (e

Re: "Contributors" Wiki Page

2019-03-18 Thread Aizhamal Nurmamat kyzy
I think this is a great idea, Max.

I am currently working on introductory materials for new Beam contributors,
and this is exactly a section that I've wanted to include in my materials.
I agree that having a high-level overview of Beam components and the people
who work on them would be very useful to contributors trying to become
familiar with the project. It would also allow potential contributors to
get a better understanding of Beam, and of where their skills may be most
useful.

Thanks,
Aizhamal

On Mon, Mar 18, 2019 at 11:56 AM Maximilian Michels  wrote:

> This is different from the Git log. For new contributors a project can
> look very opaque. The Wiki page could provide information that is not
> contained in the Git log (e.g. component description, area of interest,
> contact information), but more importantly it conveys a general openness
> to new contributors. Having to search the Git log is not the most
> pleasant on-boarding experience.
>
> The disadvantage (and advantage) is that the page needs to be maintained
> but I think that is doable.
>
> To give an example:
>
> Maximilian Michels
> Components: Flink Runner, Portability
> Contact: m...@apache.org / "mxm" @ ASF Slack
>
> That's pretty coarse but I think it is nice to have some information on
> the people behind Beam. And again, this shouldn't be limited to
> committers / PMC members.
>
> -Max
>
> On 18.03.19 18:38, Ruoyun Huang wrote:
> > Sounds like something helpful for new starters, though I am trying to
> > understand what exactly is proposed to be listed on this page.
> >
> > What extra information can one get from this page, compared to looking
> > at the "History" or "Blame" pages on GitHub?
> >
> >
> >
> > On Mon, Mar 18, 2019 at 8:45 AM Maximilian Michels  wrote:
> >
> > Hi,
> >
> > This is a follow-up from a past thread. We often get questions like
> > "Who
> > is working on component XY?" or "Whom can I ping for a review/ask a
> > question?" Providing insight into the project structure is important
> > for
> > new contributors to get started.
> >
> > What do you think about creating a Wiki page with Beam contributors?
> > Contributors would be free to leave their name, contact information,
> > and
> > a description of their work in Beam. Note that this should be for
> > everybody, not only committers/PMC members. The page could be
> organized
> > by Beam components.
> >
> > Let me know what you think.
> >
> > Cheers,
> > Max
> >
> >
> >
> > --
> > 
> > Ruoyun  Huang
> >


Re: "Contributors" Wiki Page

2019-03-18 Thread Maximilian Michels
This is different from the Git log. For new contributors a project can 
look very opaque. The Wiki page could provide information that is not 
contained in the Git log (e.g. component description, area of interest, 
contact information), but more importantly it conveys a general openness 
to new contributors. Having to search the Git log is not the most 
pleasant on-boarding experience.


The disadvantage (and advantage) is that the page needs to be maintained 
but I think that is doable.


To give an example:

Maximilian Michels
Components: Flink Runner, Portability
Contact: m...@apache.org / "mxm" @ ASF Slack

That's pretty coarse but I think it is nice to have some information on 
the people behind Beam. And again, this shouldn't be limited to 
committers / PMC members.


-Max

On 18.03.19 18:38, Ruoyun Huang wrote:
Sounds like something helpful for new starters, though I am trying to
understand what exactly is proposed to be listed on this page.


What extra information can one get from this page, compared to looking
at the "History" or "Blame" pages on GitHub?




On Mon, Mar 18, 2019 at 8:45 AM Maximilian Michels wrote:


Hi,

This is a follow-up from a past thread. We often get questions like
"Who
is working on component XY?" or "Whom can I ping for a review/ask a
question?" Providing insight into the project structure is important
for
new contributors to get started.

What do you think about creating a Wiki page with Beam contributors?
Contributors would be free to leave their name, contact information,
and
a description of their work in Beam. Note that this should be for
everybody, not only committers/PMC members. The page could be organized
by Beam components.

Let me know what you think.

Cheers,
Max



--

Ruoyun  Huang



Re: "Contributors" Wiki Page

2019-03-18 Thread Ruoyun Huang
Sounds like something helpful for new starters, though I am trying to
understand what exactly is proposed to be listed on this page.

What extra information can one get from this page, compared to looking at
the "History" or "Blame" pages on GitHub?



On Mon, Mar 18, 2019 at 8:45 AM Maximilian Michels  wrote:

> Hi,
>
> This is a follow-up from a past thread. We often get questions like "Who
> is working on component XY?" or "Whom can I ping for a review/ask a
> question?" Providing insight into the project structure is important for
> new contributors to get started.
>
> What do you think about creating a Wiki page with Beam contributors?
> Contributors would be free to leave their name, contact information, and
> a description of their work in Beam. Note that this should be for
> everybody, not only committers/PMC members. The page could be organized
> by Beam components.
>
> Let me know what you think.
>
> Cheers,
> Max
>


-- 

Ruoyun  Huang


Re: Beam JobService Problem

2019-03-18 Thread Lukasz Cwik
I believe at one point in time we wanted to separate the preparation_id
from the job_id so that you could have one definition but multiple
instances of it. (e.g. preparation_id is a class name while job_id is the
pointer to the instance of the class)
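
A hedged sketch of the client-side sequence at issue, using the portability
Job API stubs (names follow beam_job_api.proto at the time; treat exact
classes and fields as approximate, and the localhost:8099 job server
endpoint is an assumption). The race discussed below is visible in the
ordering: the job_id needed to open the message stream only exists once
RunJobResponse arrives.

    import io.grpc.ManagedChannel;
    import io.grpc.ManagedChannelBuilder;
    import java.util.Iterator;
    import org.apache.beam.model.jobmanagement.v1.JobApi.JobMessagesRequest;
    import org.apache.beam.model.jobmanagement.v1.JobApi.JobMessagesResponse;
    import org.apache.beam.model.jobmanagement.v1.JobApi.PrepareJobRequest;
    import org.apache.beam.model.jobmanagement.v1.JobApi.PrepareJobResponse;
    import org.apache.beam.model.jobmanagement.v1.JobApi.RunJobRequest;
    import org.apache.beam.model.jobmanagement.v1.JobApi.RunJobResponse;
    import org.apache.beam.model.jobmanagement.v1.JobServiceGrpc;
    import org.apache.beam.model.pipeline.v1.RunnerApi;

    public class JobServiceRace {
      public static void main(String[] args) {
        // Assumption: a job server listening on localhost:8099.
        ManagedChannel channel = ManagedChannelBuilder
            .forAddress("localhost", 8099).usePlaintext().build();
        JobServiceGrpc.JobServiceBlockingStub jobService =
            JobServiceGrpc.newBlockingStub(channel);

        // Phase 1: prepare returns only a preparation_id.
        PrepareJobResponse prep = jobService.prepare(
            PrepareJobRequest.newBuilder()
                // Placeholder; a real client sends the translated pipeline.
                .setPipeline(RunnerApi.Pipeline.getDefaultInstance())
                .setJobName("demo")
                .build());

        // Phase 2: run returns the job_id -- the job is already started.
        RunJobResponse run = jobService.run(
            RunJobRequest.newBuilder()
                .setPreparationId(prep.getPreparationId())
                .build());

        // Only now can the streams be opened; anything the runner emitted
        // between run() and this call may never be seen.
        Iterator<JobMessagesResponse> messages = jobService.getMessageStream(
            JobMessagesRequest.newBuilder().setJobId(run.getJobId()).build());
        messages.forEachRemaining(System.out::println);
      }
    }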

On Tue, Jan 15, 2019 at 1:45 PM Sam Rohde  wrote:

> On Tue, Jan 15, 2019 at 5:23 AM Robert Bradshaw 
> wrote:
>
>> On Tue, Jan 15, 2019 at 1:19 AM Ankur Goenka  wrote:
>> >
>> > Thanks Sam for bringing this to the list.
>> >
>> > As preparation_ids are not reusable, having preparation_id and job_id
>> the same makes sense to me for Flink.
>>
>> I think we should change the protocol and only have one kind of ID. As
>> well as solving the problem at hand, it also simplifies the API.
>>
> That sounds fantastic.
>
> On Tue, Jan 15, 2019 at 5:23 AM Robert Bradshaw 
> wrote:
>
>> On Tue, Jan 15, 2019 at 1:19 AM Ankur Goenka  wrote:
>
> > Another option is to have a subscription for all states/messages on the
>> JobServer.
>> The problem is forcing the job service to remember all logs that were
>> ever logged, in case someone requests them at some future date.
>> Best to have a way to register a listener earlier.
>
> I agree with Robert that the caller should be in charge of what to do
> with generated monitoring data. This is especially true with long-running
> jobs that generate potentially gigabytes worth of logs.
>
> I made https://issues.apache.org/jira/browse/BEAM-6442 to track this. Let
> me know if I missed anything.
>
> On Tue, Jan 15, 2019 at 5:23 AM Robert Bradshaw 
> wrote:
>
>> On Tue, Jan 15, 2019 at 1:19 AM Ankur Goenka  wrote:
>> >
>> > Thanks Sam for bringing this to the list.
>> >
>> > As preparation_ids are not reusable, having preparation_id and job_id
>> the same makes sense to me for Flink.
>>
>> I think we should change the protocol and only have one kind of ID. As
>> well as solving the problem at hand, it also simplifies the API.
>>
>> > Another option is to have a subscription for all states/messages on the
>> JobServer.
>>
>> The problem is forcing the job service to remember all logs that were
>> ever logged, in case someone requests them at some future date.
>> Best to have a way to register a listener earlier.
>>
>> > This will be similar to "docker". As the container id is created after
>> the container creation, the only way to get the container creation event is
>> to start "docker events" before starting a container.
>> >
>> > On Mon, Jan 14, 2019 at 11:13 AM Maximilian Michels 
>> wrote:
>> >>
>> >> Hi Sam,
>> >>
>> >> Good observation. Looks like we should fix that.
>> >>
>> >> Looking at InMemoryJobService, it appears that the state can only be
>> retrieved
>> >> by the client once the job is running, with a job/invocation id
>> associated.
>> >> Indeed, any messages until then could be lost.
>> >>
>> >> For Flink the JobId is generated here:
>> >>
>> https://github.com/apache/beam/blob/3db71dd9f6f32684903c54b15a5368991cd41f36/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkJobInvoker.java#L64
>> >>
>> >> I don't see any benefit of having two separate IDs, as the IDs are
>> already
>> >> scoped by preparation and invocation phase.
>> >>
>> >> - Would it be possible to just pass the preparation id as the
>> invocation id at
>> >> JobInvoker#invoke(..)?
>> >>
>> >> - Alternatively, we could have an additional prepare phase for
>> JobInvoker to get
>> >> the job id for the invocation, before we start the job.
>> >>
>> >> Thanks,
>> >> Max
>> >>
>> >> On 14.01.19 12:39, Sam Rohde wrote:
>> >> > Hello all,
>> >> >
>> >> > While going through the codebase I noticed a problem with the Beam
>> JobService.
>> >> > In particular, the API allows for the possibility of never seeing
>> some messages
>> >> > or states with Get(State|Message)Stream. This is because the
>> >> > Get(State|Message)Stream calls need to have the job id which can
>> only be
>> >> > obtained from the RunJobResponse. But in order to see all
>> messages/states the
>> >> > streams need to be opened before the job starts.
>> >> >
>> >> > This is fine in Dataflow as the preparation_id == job_id, but this
>> is not true
>> >> > in Flink. What do you all think of this? Am I misunderstanding
>> something?
>> >> >
>> >> > Thanks,
>> >> > Sam
>> >> >
>>
>


Re: [DISCUSS] SplittableDoFn Java SDK User Facing API

2019-03-18 Thread Lukasz Cwik
Thanks Kenn, based upon the error message there was a small amount of code
that I missed when updating the code. I'll attempt to fix this in the next
few days.

On Mon, Jan 14, 2019 at 7:26 PM Kenneth Knowles  wrote:

> I wanted to use this thread to ping that the change to the user-facing API
> in order to wrap RestrictionTracker broke the Watch transform, which has
> been sickbayed for a long time. It would be helpful for experts to weigh in
> on https://issues.apache.org/jira/browse/BEAM-6352 about how the
> functionality used here should be implemented.
>
> Kenn
>
> On Wed, Dec 5, 2018 at 4:45 PM Lukasz Cwik  wrote:
>
>> Based upon the current Java SDK API, I was able to implement Runner
>> initiated checkpointing that the Java SDK honors within PR
>> https://github.com/apache/beam/pull/7200.
>>
>> This is an exciting first step to a splitting implementation, feel free
>> to take a look and comment. I have added two basic tests, execute SDF
>> without splitting and execute SDF with a runner initiated checkpoint.
>>
>> On Fri, Nov 30, 2018 at 4:52 PM Robert Bradshaw 
>> wrote:
>>
>>> On Fri, Nov 30, 2018 at 10:14 PM Lukasz Cwik  wrote:
>>> >
>>> > On Fri, Nov 30, 2018 at 1:02 PM Robert Bradshaw 
>>> wrote:
>>> >>
>>> >> On Fri, Nov 30, 2018 at 6:38 PM Lukasz Cwik  wrote:
>>> >> >
>>> >> > Sorry, for some reason I thought I had answered these.
>>> >>
>>> >> No problem, thanks for you patience :).
>>> >>
>>> >> > On Fri, Nov 30, 2018 at 2:20 AM Robert Bradshaw <
>>> rober...@google.com> wrote:
>>> >> >>
>>> >> >> I still have outstanding questions (above) about
>>> >> >>
>>> >> >> 1) Why we need arbitrary precision for backlog, instead of just
>>> using
>>> >> >> a (much simpler) double.
>>> >> >
>>> >> >
>>> >> > Double lacks the precision for reporting backlogs for byte key
>>> ranges (HBase, Bigtable, ...). Scanning a key range such as ["a", "b")
>>> with a large number of keys sharing a really long common prefix such as
>>> "aab" and "aac", ... leads
>>> to the backlog not changing even though we are making progress through the
>>> key space. This also prevents splitting within such an area since the
>>> double can't provide the necessary precision (without multiple rounds of
>>> splitting, which adds complexity).
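
A toy sketch (an editorial illustration, not code from the thread) of that
precision argument: mapping byte keys onto [0, 1) as a double, any byte
past roughly the seventh position of a shared prefix falls below the 53-bit
significand, so distinct keys report identical backlog.

    public class BacklogPrecision {
      // Interpret the leading bytes of a key as a base-256 fraction in [0, 1).
      static double keyToFraction(byte[] key) {
        double fraction = 0;
        double scale = 1.0 / 256;
        for (int i = 0; i < Math.min(key.length, 12); i++) {
          fraction += (key[i] & 0xff) * scale;
          scale /= 256;
        }
        return fraction;
      }

      public static void main(String[] args) {
        // Keys differing only in the 8th byte: the difference (256^-8) is far
        // below the ulp of the result, so both lines print the same double.
        System.out.println(keyToFraction("aaaaaaab".getBytes()));
        System.out.println(keyToFraction("aaaaaaac".getBytes()));
      }
    }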
>>> >>
>>> >> We'll have to support multiple rounds of splitting regardless. I can
>>> >> see how this gives more information up front though.
>>> >
>>> > I agree that we will need to support multiple rounds of splitting from
>>> the SDK side but this adds complexity from the runner side since it can
>>> only increase the accuracy for a split by performing multiple rounds of
>>> splitting at once.
>>> >
>>> >> (As an aside, I've been thinking about some ways of solving the dark
>>> >> matter problem, and it might depend on knowing the actual key, using
>>> >> the fact that character boundaries are likely cut-off points for
>>> >> changes in density, which would get obscured by alternative
>>> >> representations.)
>>> >
>>> > Every time I think about this issue, I can never get it to apply
>>> meaningfully for unbounded sources such as a message queue like pubsub.
>>>
>>> Yeah, neither can I.
>>>
>>> > Also, having an infinitely precise backlog such as the decimal format
>>> would still provide density information as the rate of change through the
>>> backlog for a bounded source would change once a "cluster" was hit.
>>>
>>> This is getting to somewhat of a tangential topic, but the key insight
>>> is that although it's easy to find the start of a cluster, to split
>>> ideally one would want to know where the end of the cluster is. For
>>> keyspaces, this is likely to be at binary fractions, and in particular
>>> looking at the longevity of common prefixes of length n one could make
>>> heuristic guesses as to where this density dropoff may be. (This also
>>> requires splitting at a key, not splitting relative to a current
>>> position, which has its issues...)
>>>
>>> >> >> 2) Whether its's worth passing backlog back to split requests,
>>> rather
>>> >> >> than (again) a double representing "portion of current remaining"
>>> >> >> which may change over time. (The most common split request is into
>>> >> >> even portions, and specifically half, which can't accurately be
>>> >> >> requested from a stale backlog.)
>>> >> >
>>> >> > I see two scenarios here:
>>> >> > * the fraction is exposed to the SDF author and then the SDF author
>>> needs to map from their restriction space to backlog and also map fractions
>>> onto their restriction space, meaning that they are required to write
>>> mappings between three different models.
>>> >> > * the fraction is not exposed to the SDF author and the framework
>>> code multiplies the fraction against the backlog and provides the backlog
>>> to the user (this solves the backlog skew issue but still has the limited
>>> precision issue).
>>> >>
>>> >> Limited precision is not as much of an issue here because

Re: [PROPOSAL] Preparing for Beam 2.12.0 release

2019-03-18 Thread Etienne Chauchot
Sounds great, thanks for volunteering to do the release.
Etienne
On Wednesday, March 13, 2019 at 12:08 -0700, Andrew Pilloud wrote:
> Hello Beam community!
> Beam 2.12 release branch cut date is March 27th according to the release 
> calendar [1]. I would like to volunteer
> myself to do this release. I intend to cut the branch as planned on March 
> 27th and cherrypick fixes if needed.
> 
> If you have release-blocking issues for 2.12 please mark their "Fix 
> Version" as 2.12.0. Kenn created a 2.13.0
> release in JIRA in case you would like to move any non-blocking issues to 
> that version.
> 
> Does this sound reasonable?
> 
> Andrew
> 
> [1] 
> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com&ctz=America%2FLos_Angeles


"Contributors" Wiki Page

2019-03-18 Thread Maximilian Michels

Hi,

This is a follow-up from a past thread. We often get questions like "Who 
is working on component XY?" or "Whom can I ping for a review/ask a 
question?" Providing insight into the project structure is important for 
new contributors to get started.


What do you think about creating a Wiki page with Beam contributors? 
Contributors would be free to leave their name, contact information, and 
a description of their work in Beam. Note that this should be for 
everybody, not only committers/PMC members. The page could be organized 
by Beam components.


Let me know what you think.

Cheers,
Max


Re: JIRA hygiene

2019-03-18 Thread Maximilian Michels
+1 for the committer to close the JIRA upon merge. I think it is better to 
risk closing it prematurely and having it reopened than not closing it 
at all.


Also, if in doubt, committers can just leave a comment in the JIRA asking the 
reporter if it can be closed.


-Max

On 14.03.19 23:42, Reuven Lax wrote:
I have in the past created JIRAs that took several PRs to close. Often 
this happened when I decided that a PR was too large and split it up 
into several smaller PRs.


On Thu, Mar 14, 2019 at 12:09 PM Thomas Weise > wrote:


I see many other projects where JIRAs are resolved when the PR is
merged. It is simple and seems to work well.

As a reviewer that merges a PR, I need to have an understanding of
the scope of the work. I cannot remember an instance where it wasn't
clear to me if a JIRA should be resolved or not. Do others have
examples?

It is important that JIRAs are resolved and fields like component,
issue type and fix version are properly set for the releases. IMO
the committer is in the best position to verify that.

The contributor usually isn't incentivised to massage a JIRA ticket
after the code was merged. That is probably the main reason why we
find ourselves with many dangling tickets. Merge is usually the last
action in the workflow, at which point we also know what the fix
version is.

Thomas




On Thu, Mar 14, 2019 at 11:58 AM Mikhail Gryzykhin
mailto:mig...@google.com>> wrote:

I believe that there are too many scenarios that we have to
cover if we are to design a generic approach. A common pattern
I've seen most often is that the assignee on the ticket, who's
usually the author of the relevant PR, is expected to either
resolve the ticket or pass it to the feature owner for verification.

We can have a bot that will check stale assigned tickets and
poke assignees. It could go further and unassign tickets
if no response comes, and remove the "triaged" label. This would
always highlight all non-updated tickets and keep forgotten
tickets in the available pool. Giving a hint to pass ownership of
a ticket to the committer (or the person who merged the PR) can be a
simple answer for contributors who are not sure whether a ticket can be
closed.
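
A minimal sketch of the query half of such a bot, assuming the standard
JIRA REST search endpoint on issues.apache.org and an illustrative JQL
filter; the poking/unassigning logic is left out.

    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;

    public class StaleTicketCheck {
      public static void main(String[] args) throws Exception {
        // Illustrative JQL: assigned BEAM tickets untouched for 30+ days.
        String jql = "project = BEAM AND resolution = Unresolved"
            + " AND assignee is not EMPTY AND updated <= -30d";
        URI uri = URI.create(
            "https://issues.apache.org/jira/rest/api/2/search?jql="
                + URLEncoder.encode(jql, StandardCharsets.UTF_8));

        // Anonymous read; a real bot would authenticate and page results.
        HttpResponse<String> response = HttpClient.newHttpClient().send(
            HttpRequest.newBuilder(uri).GET().build(),
            HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON with matching issues
      }
    }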

--Mikhail



On Wed, Mar 13, 2019 at 6:00 PM Michael Luckey  wrote:

Totally agree. The contributor is most likely the better
target. But as she is probably less familiar with the
process, we might be better off putting the responsibility on
the committer to kindly ask/discuss with her how to proceed
with the corresponding jira ticket?

On Thu, Mar 14, 2019 at 1:18 AM Ahmet Altay  wrote:

I agree with defining the workflow for closing JIRAs.
Would not the contributor be in a better position to close
JIRAs or keep them open? It would make sense for the
committer to ask about this, but I think the contributor
(presumably the person who is the assignee of the JIRA)
could be the responsible party for updating their JIRAs.
On the other hand, I understand the argument that the
committer could do this at the time of merging and fill
a gap in the process.

On Wed, Mar 13, 2019 at 4:59 PM Michael Luckey  wrote:

Hi,

definitely +1 to properly establish a workflow to
maintain jira status. Naively I'd think the
reporter should close it, as she is the one to confirm
whether the reported issue is fixed or not. But for
obvious reasons that will not work here, so -
although it puts another burden on committers - you
are probably right that the committer is the best
choice to ensure that the ticket gets promoted:
whether it is resolved, or it is clarified what's
still to be done.

Looking into the current state, we seem to have tons
of issues with merged PRs, which for anyone trying
to find an existing jira issue to work on makes it
unnecessarily difficult to decide whether to look into
that or not. From my personal experience, it is
somehow frustrating going through open issues,
selecting one and after investing some (or even
more) time to first understand a problem and then
the PR to realise nothing has to be done anymore. Or
 

Beam Dependency Check Report (2019-03-18)

2019-03-18 Thread Apache Jenkins Server
ERROR: File 'src/build/dependencyUpdates/beam-dependency-check-report.html' does not exist

Re: JIRA hygiene

2019-03-18 Thread Etienne Chauchot
Well, I agree but a contributor might not have the rights on jira and, 
more importantly, he might be unable to choose a
target version for the jira. Targeting the ticket to the correct version 
requires knowing the release cut date, which is
not the date of the commit to which the release tag points in the case of 
cherry picks. It seems a bit complicated for a
one-time contributor. This is why I proposed that the committer/reviewer does 
the jira closing.
Etienne

On Wednesday, March 13, 2019 at 17:08 -0700, Ahmet Altay wrote:
> I agree with defining the workflow for closing JIRAs. Would not the 
> contributor be in a better position to close JIRAs or
> keep them open? It would make sense for the committer to ask about this but I 
> think the contributor (presumably the person
> who is the assignee of the JIRA) could be the responsible party for updating 
> their JIRAs. On the other hand, I
> understand the argument that the committer could do this at the time of 
> merging and fill a gap in the process.
> On Wed, Mar 13, 2019 at 4:59 PM Michael Luckey  wrote:
> > Hi,
> > 
> > definitely +1 to properly establish a workflow to maintain jira status. 
> > Naively I'd think the reporter should close
> > it, as she is the one to confirm whether the reported issue is fixed or not. 
> > But for obvious reasons that will not work
> > here, so - although it puts another burden on committers - you are probably 
> > right that the committer is the best
> > choice to ensure that the ticket gets promoted: whether it is resolved 
> > or clarified what's still to be done.
> > 
> > Looking into the current state, we seem to have tons of issues with merged 
> > PRs, which for anyone trying to find an
> > existing jira issue to work on makes it unnecessarily difficult to decide 
> > whether to look into that or not. From my
> > personal experience, it is somehow frustrating going through open issues, 
> > selecting one and after investing some (or
> > even more) time to first understand a problem and then the PR to realise 
> > nothing has to be done anymore. Or not
> > knowing what's left out and for what reason. But of course, this is another 
> > issue which we definitely need to invest
> > time into - kenn already asked for our support here.
> > 
> > thx,
> > 
> > michel
> > On Tue, Mar 12, 2019 at 11:30 AM Etienne Chauchot  
> > wrote:
> > > Hi Thomas,
> > > I agree, the committer that merges a PR should close the ticket. And, if 
> > > needed, he could discuss with the author
> > > (inside the PR) to assess if the PR covers the ticket scope.
> > > This is the rule I apply to myself when I merge a PR (even though it has 
> > > happened that I forgot to close one or
> > > two tickets :) ) .
> > > Etienne
> > > 
> > > On Monday, March 11, 2019 at 14:17 -0700, Thomas Weise wrote:
> > > > JIRA probably deserves a separate discussion. It is messy. We also 
> > > > have examples of tickets being referenced by
> > > > users that were not closed, although the feature was long since implemented 
> > > > or the issue fixed.
> > > > 
> > > > There is no clear ownership in our workflow.
> > > > 
> > > > A while ago I proposed in another context to make resolving JIRA part 
> > > > of committer duty. I would like to bring
> > > > this up for discussion again:
> > > > 
> > > > https://github.com/apache/beam/pull/7129#discussion_r236405202
> > > > 
> > > > Thomas
> > > > 
> > > > 
> > > > On Mon, Mar 11, 2019 at 1:47 PM Ahmet Altay  wrote:
> > > > > I agree this is a good idea. I used the same technique for 2.11 blog 
> > > > > post (JIRA release notes -> editorialized
> > > > > list + diffed the dependencies).
> > > > > 
> > > > > On Mon, Mar 11, 2019 at 1:40 PM Kenneth Knowles  
> > > > > wrote:
> > > > > > That is a good idea. The blog post is probably the main avenue 
> > > > > > where folks will find out about new features
> > > > > > or bug fixes.
> > > > > > When I did 2.10.0 I just used the automated Jira release notes and 
> > > > > > pulled out significant things based on my
> > > > > > judgment. I would also suggest that our Jira hygiene could be 
> > > > > > significantly improved to make this process
> > > > > > more effective.
> > > > > > 
> > > > > 
> > > > > +1 to improving JIRA notes as well. Oftentimes issues are closed 
> > > > > with no real comment on what happened or whether
> > > > > it has been resolved. It becomes an exercise in reading the linked 
> > > > > PRs to figure out what happened.
> > > > >  
> > > > > > Kenn
> > > > > > On Mon, Mar 11, 2019 at 1:04 PM Thomas Weise  
> > > > > > wrote:
> > > > > > > Ahmet, thanks for managing the release!
> > > > > > > I have a suggestion (not specific to only this release): 
> > > > > > > 
> > > > > > > The release blogs could be more useful to users. In this case, we 
> > > > > > > have a long list of dependency updates
> > > > > > > on the top, but probably the improvements and features section 
> > > > > > > should come first. I was also very
> > > > > > > surprised to find "Portable Flink runner suppor

Re: [PROPOSAL] Preparing for Beam 2.12.0 release

2019-03-18 Thread Robert Bradshaw
I agree with Kenn on both accounts. We can (and should) keep 2.7.x
alive with an imminent 2.7.1 release, and choose the next one at a
future date based on actual experience with an existing release.

On Fri, Mar 15, 2019 at 5:36 PM Ahmet Altay  wrote:
>
> +1 to extending 2.7.x LTS lifetime for a little longer and simultaneously 
> making a 2.7.1 release.
>
> On Fri, Mar 15, 2019 at 9:32 AM Kenneth Knowles  wrote:
>>
>> We actually have some issues queued up for 2.7.1, and IMO it makes sense to 
>> extend 2.7 since the 6 month period was just a pilot and like you say we 
>> haven't really exercised LTS.
>>
>> Re 2.12.0 I strongly feel LTS should be designated after a release has seen 
>> some use. If we extend 2.7 for another while then we will have some 
>> candidate by the time it expires. (2.8, 2.9, 2.10 all have major issues, 
>> while 2.11 and 2.12 are untried)
>>
>> Kenn
>>
>> On Fri, Mar 15, 2019 at 7:50 AM Thomas Weise  wrote:
>>>
>>> Given no LTS activity for 2.7.x - do we really need it?
>>>
>>>
>>> On Fri, Mar 15, 2019 at 6:54 AM Ismaël Mejía  wrote:

 After looking at the dates it seems that 2.12 should be the next LTS
 since it will be exactly 6 months after the release of 2.7.0. Does anyone
 have comments, or would you prefer to do the LTS for the next version
 (2.13) instead?

 On Thu, Mar 14, 2019 at 12:13 PM Michael Luckey  
 wrote:
 >
 > @mxm
 >
 > Sure we should. Unfortunately the scripts do not have any '--dry-run' 
 > toggle. Implementing this seemed not too easy at first sight, as those 
 > release scripts do assume committed outputs of their predecessors and 
 > are not yet in shape to be parameterised.
 >
 > So here is what I did:
 > 1. As I did not want the scripts to do 'sudo' installs on my machine, 
 > I first created a docker image with required prerequisites.
 > 2. Cloned beam to that machine (to get the release.scripts)
 > 3. Edited the places which seemed to call to the outside
 > - disabled any git push
 > - changed git url to point to some copy on local filesystem to pull 
 > required changes from there
 > - changed './gradlew' build to './gradlew assemble' as build will 
 > not work on docker anyway
 > - changed publish to publishToMavenLocal
 > - probably some more changes to ensure I do not write to remote
 > 4. run the scripts
 >
 > What I missed out:
 > 1. There is some communication with svn (signing artefacts downloaded 
 > from svn and committing). I just skipped those steps, as I was too 
 > scared to miss some commit and do an accidental push to some remote 
 > system (where I am hopefully not authorised anyway without doing proper 
 > authentication)
 >
 > If you believe I missed something which could be tested in advance, I'd 
 > happily do more testing to ensure a smooth release process.
 >
 > michel
 >
 > On Thu, Mar 14, 2019 at 11:23 AM Maximilian Michels  
 > wrote:
 >>
 >> Hi Andrew,
 >>
 >> Sounds good. Thank you for being the release manager.
 >>
 >> @Michael Shall we perform some dry-run release testing to ensure
 >> Gradle 5 compatibility?
 >>
 >> Thanks,
 >> Max
 >>
 >> On 14.03.19 00:28, Michael Luckey wrote:
 >> > Sounds good. Thanks for volunteering.
 >> >
 >> > Just as a side note: @aaltay had trouble releasing caused by the 
 >> > switch
 >> > to gradle5. Although that should be fixed now, you will be the first
 >> > using those changes in production. So if you encounter any issues, do
 >> > not hesitate to blame and contact me. Also I am currently looking into
 >> > some improvements to the process suggested by @kenn. So your feedback 
 >> > on
 >> > the current state would be greatly appreciated. I hope to get at least
 >> > https://issues.apache.org/jira/browse/BEAM-6798 done by then.
 >> >
 >> > Thanks again,
 >> >
 >> > michel
 >> >
 >> > On Wed, Mar 13, 2019 at 8:13 PM Ahmet Altay  wrote:
 >> >
 >> > Sounds great, thank you!
 >> >
 >> > On Wed, Mar 13, 2019 at 12:09 PM Andrew Pilloud  wrote:
 >> >
 >> > Hello Beam community!
 >> >
 >> > Beam 2.12 release branch cut date is March 27th according to 
 >> > the
 >> > release calendar [1]. I would like to volunteer myself to do
 >> > this release. I intend to cut the branch as planned on March
 >> > 27th and cherrypick fixes if needed.
 >> >
 >> > If you have release-blocking issues for 2.12 please mark 
 >> > their
 >> > "Fix Version" as 2.12.0. Kenn created a 2.13.0 release in JIRA
 >> > in case you would like to move any non-blocking issues to that
 >> >