Re: Apache Pulsar connector for Beam

2019-10-25 Thread Taher Koitawala
I would be interested in contributing to the Pulsar Beam connector. That's
one of the reasons I started the email thread.


Regards,
Taher Koitawala

On Sat, Oct 26, 2019, 9:41 AM Sijie Guo  wrote:

> This is Sijie Guo from StreamNative and Pulsar PMC.
>
> Maximilian - thank you for adding us to the email thread!
>
> We do have one roadmap item for adding a Beam connector for Pulsar. It was
> planned for this quarter, but we haven’t started the implementation yet. If
> the Beam community is interested in it, we are happy to collaborate with
> the Beam community.
>
> Thanks,
> Sijie
>
> On Sat, Oct 26, 2019 at 12:36 AM Maximilian Michels 
> wrote:
>
>> It would be great to have a Pulsar connector. We might want to ask the
>> folks from StreamNative (in CC). Any plans? :)
>>
>> Cheers,
>> Max
>>
>> On 24.10.19 18:31, Pablo Estrada wrote:
>> > There's a JIRA issue to track this:
>> > https://issues.apache.org/jira/browse/BEAM-8218
>> >
>> > Alex was kind enough to file it. +Alex Van Boxel
>> >  : )
>> > Best
>> > -P
>> >
>> > On Thu, Oct 24, 2019 at 12:01 AM Taher Koitawala > > > wrote:
>> >
>> > Hi Reza,
>> >   Thanks for your reply. However, I do not see Pulsar
>> > listed there. Should we file a JIRA?
>> >
>> > On Thu, Oct 24, 2019, 12:16 PM Reza Rokni > > > wrote:
>> >
>> > Hi Taher,
>> >
>> > You can see the list of current and WIP I/Os here:
>> >
>> > https://beam.apache.org/documentation/io/built-in/
>> >
>> > Cheers
>> >
>> > Reza
>> >
>> > On Thu, 24 Oct 2019 at 13:56, Taher Koitawala
>> > mailto:taher...@gmail.com>> wrote:
>> >
>> > Hi All,
>> >   I've been wanting to know if we have a Pulsar connector
>> > for Beam. Pulsar is another messaging queue like Kafka and I
>> > would like to build a streaming pipeline with Pulsar. Any
>> > help would be appreciated.
>> >
>> >
>> > Regards,
>> > Taher Koitawala
>> >
>> >
>> >
>>
>


Re: Apache Pulsar connector for Beam

2019-10-25 Thread Sijie Guo
This is Sijie Guo from StreamNative and Pulsar PMC.

Maximilian - thank you for adding us to the email thread!

We do have one roadmap item for adding a Beam connector for Pulsar. It was
planned for this quarter, but we haven’t started the implementation yet. If
the Beam community is interested in it, we are happy to collaborate with
the Beam community.

Thanks,
Sijie

On Sat, Oct 26, 2019 at 12:36 AM Maximilian Michels  wrote:

> It would be great to have a Pulsar connector. We might want to ask the
> folks from StreamNative (in CC). Any plans? :)
>
> Cheers,
> Max
>
> On 24.10.19 18:31, Pablo Estrada wrote:
> > There's a JIRA issue to track this:
> > https://issues.apache.org/jira/browse/BEAM-8218
> >
> > Alex was kind enough to file it. +Alex Van Boxel
> >  : )
> > Best
> > -P
> >
> > On Thu, Oct 24, 2019 at 12:01 AM Taher Koitawala  > > wrote:
> >
> > Hi Reza,
> >   Thanks for your reply. However, I do not see Pulsar
> > listed there. Should we file a JIRA?
> >
> > On Thu, Oct 24, 2019, 12:16 PM Reza Rokni  > > wrote:
> >
> > Hi Taher,
> >
> > You can see the list of current and WIP I/Os here:
> >
> > https://beam.apache.org/documentation/io/built-in/
> >
> > Cheers
> >
> > Reza
> >
> > On Thu, 24 Oct 2019 at 13:56, Taher Koitawala
> > mailto:taher...@gmail.com>> wrote:
> >
> > Hi All,
> >   I've been wanting to know if we have a Pulsar connector
> > for Beam. Pulsar is another messaging queue like Kafka and I
> > would like to build a streaming pipeline with Pulsar. Any
> > help would be appreciated.
> >
> >
> > Regards,
> > Taher Koitawala
> >
> >
> >
>


Re: Python Precommit duration pushing 2 hours

2019-10-25 Thread Valentyn Tymofieiev
Thanks, Brian.
+Udi Meiri 
As a next step, it would be good to know whether the slowdown is caused by tests
in this PR, or by its effect on other tests, and to confirm that only Python 2
code paths were affected.

On Fri, Oct 25, 2019 at 6:35 PM Brian Hulette  wrote:

> I did a bisect based on the runtime of `./gradlew
> :sdks:python:test-suites:tox:py2:testPy2Gcp` around the commits between 9/1
> and 9/15 to see if I could find the source of the spike that happened
> around 9/6. It looks like it was due to PR#9283 [1]. I thought maybe this
> search would reveal some misguided configuration change, but as far as I
> can tell 9283 just added a well-tested feature. I don't think there's
> anything to learn from that... I just wanted to circle back about it in
> case others are curious about that spike.
>
> I'm +1 on bumping some FnApiRunner configurations.
>
> Brian
>
> [1] https://github.com/apache/beam/pull/9283
>
> On Fri, Oct 25, 2019 at 4:49 PM Pablo Estrada  wrote:
>
>> I think it makes sense to remove some of the extra FnApiRunner
>> configurations. Perhaps some of the multiworkers and some of the grpc
>> versions?
>> Best
>> -P.
>>
>> On Fri, Oct 25, 2019 at 12:27 PM Robert Bradshaw 
>> wrote:
>>
>>> It looks like fn_api_runner_test.py is quite expensive, taking 10-15+
>>> minutes on each version of Python. This test consists of a base class
>>> that is basically a validates runner suite, and is then run in several
>>> configurations, many more of which (including some expensive ones)
>>> have been added lately.
>>>
>>> class FnApiRunnerTest(unittest.TestCase):
>>> class FnApiRunnerTestWithGrpc(FnApiRunnerTest):
>>> class FnApiRunnerTestWithGrpcMultiThreaded(FnApiRunnerTest):
>>> class FnApiRunnerTestWithDisabledCaching(FnApiRunnerTest):
>>> class FnApiRunnerTestWithMultiWorkers(FnApiRunnerTest):
>>> class FnApiRunnerTestWithGrpcAndMultiWorkers(FnApiRunnerTest):
>>> class FnApiRunnerTestWithBundleRepeat(FnApiRunnerTest):
>>> class FnApiRunnerTestWithBundleRepeatAndMultiWorkers(FnApiRunnerTest):
>>>
>>> I'm not convinced we need to run all of these permutations, or at
>>> least not all tests in all permutations.
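
A minimal, hypothetical sketch of the pattern being discussed (not the actual
fn_api_runner_test.py code): each subclass re-runs every test of the base
class, so the cost multiplies with every added configuration, and an expensive
permutation can be skipped wholesale.

import unittest


class FnRunnerSuite(unittest.TestCase):
    # Stand-ins for the many validates-runner style tests in the real base class.
    def test_map(self):
        self.assertEqual([x + 1 for x in [1, 2]], [2, 3])

    def test_flatten(self):
        self.assertEqual(sorted([1] + [3, 2]), [1, 2, 3])


class FnRunnerSuiteWithGrpc(FnRunnerSuite):
    # Hypothetical permutation: the real subclasses override create_pipeline()
    # to build a differently configured runner; every base test re-runs here.
    pass


@unittest.skip("expensive permutation; could run in postcommit only")
class FnRunnerSuiteWithGrpcMultiThreaded(FnRunnerSuite):
    # Skipping a whole permutation is one way to trim precommit time.
    pass


if __name__ == '__main__':
    unittest.main()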
>>>
>>> On Fri, Oct 25, 2019 at 10:57 AM Valentyn Tymofieiev
>>>  wrote:
>>> >
>>> > I took another look at this and precommit ITs are already running in
>>> parallel, albeit in the same suite. However, it appears Python precommits
>>> became slower, especially Python 2 precommits [35 min per suite x 3
>>> suites], see [1]. Not sure yet what caused the increase, but precommits
>>> used to be faster. Perhaps we have added a slow test or a lot of new tests.
>>> >
>>> > [1]
>>> https://scans.gradle.com/s/jvcw5fpqfc64k/timeline?task=ancsbov425524
>>> >
>>> > On Thu, Oct 24, 2019 at 4:53 PM Ahmet Altay  wrote:
>>> >>
>>> >> Ack. Separating precommit ITs to a different suite sounds good.
>>> Anyone is interested in doing that?
>>> >>
>>> >> On Thu, Oct 24, 2019 at 2:41 PM Valentyn Tymofieiev <
>>> valen...@google.com> wrote:
>>> >>>
>>> >>> This should not increase the queue time substantially, since
>>> precommit ITs are running sequentially with precommit tests, unlike
>>> multiple precommit tests which run in parallel to each other.
>>> >>>
>>> >>> The precommit ITs we run are batch and streaming wordcount tests on
>>> Py2 and one Py3 version, so it's not a lot of tests.
>>> >>>
>>> >>> On Thu, Oct 24, 2019 at 1:07 PM Ahmet Altay 
>>> wrote:
>>> 
>>>  +1 to separating ITs from precommit. Downside would be, when Chad
>>> tried to do something similar [1] it was noted that the total time to run
>>> all precommit tests would increase and also potentially increase the queue
>>> time.
>>> 
>>>  Another alternative, we could run a smaller set of IT tests in
>>> precommits and run the whole suite as part of post commit tests.
>>> 
>>>  [1] https://github.com/apache/beam/pull/9642
>>> 
>>>  On Thu, Oct 24, 2019 at 12:15 PM Valentyn Tymofieiev <
>>> valen...@google.com> wrote:
>>> >
>>> > One improvement could be to move Precommit IT tests into a
>>> separate suite from precommit tests, and run it in parallel.
>>> >
>>> > On Thu, Oct 24, 2019 at 11:41 AM Brian Hulette <
>>> bhule...@google.com> wrote:
>>> >>
>>> >> Python Precommits are taking quite a while now [1]. Just visually
>>> it looks like the average length is 1.5h or so, but it spikes up to 2h.
>>> I've had several precommit runs get aborted due to the 2 hour limit.
>>> >>
>>> >> It looks like there was a spike up above 1h back on 9/6 and the
>>> duration has been steadily rising since then. Is there anything we can do
>>> about this?
>>> >>
>>> >> Brian
>>> >>
>>> >> [1]
>>> http://104.154.241.245/d/_TNndF2iz/pre-commit-test-latency?orgId=1&from=now-90d&to=now&fullscreen&panelId=4
>>>
>>


Re: Python Precommit duration pushing 2 hours

2019-10-25 Thread Brian Hulette
I did a bisect based on the runtime of `./gradlew
:sdks:python:test-suites:tox:py2:testPy2Gcp` around the commits between 9/1
and 9/15 to see if I could find the source of the spike that happened
around 9/6. It looks like it was due to PR#9283 [1]. I thought maybe this
search would reveal some misguided configuration change, but as far as I
can tell 9283 just added a well-tested feature. I don't think there's
anything to learn from that... I just wanted to circle back about it in
case others are curious about that spike.

I'm +1 on bumping some FnApiRunner configurations.

Brian

[1] https://github.com/apache/beam/pull/9283
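
For anyone repeating this kind of investigation, a rough sketch of a timing
harness that could drive `git bisect run` (the 30-minute threshold and script
name are made up, not what was actually used):

# bisect_task_time.py: exits 0 ("good") when the gradle task succeeds within
# the threshold, and non-zero ("bad") when it is slower or fails.
import subprocess
import sys
import time

TASK = ':sdks:python:test-suites:tox:py2:testPy2Gcp'
THRESHOLD_SECONDS = 30 * 60  # illustrative cutoff between "fast" and "slow"

start = time.time()
subprocess.run(['./gradlew', TASK], check=True)
elapsed = time.time() - start
print('%s took %.1f minutes' % (TASK, elapsed / 60))
sys.exit(0 if elapsed < THRESHOLD_SECONDS else 1)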

On Fri, Oct 25, 2019 at 4:49 PM Pablo Estrada  wrote:

> I think it makes sense to remove some of the extra FnApiRunner
> configurations. Perhaps some of the multiworkers and some of the grpc
> versions?
> Best
> -P.
>
> On Fri, Oct 25, 2019 at 12:27 PM Robert Bradshaw 
> wrote:
>
>> It looks like fn_api_runner_test.py is quite expensive, taking 10-15+
>> minutes on each version of Python. This test consists of a base class
>> that is basically a validates runner suite, and is then run in several
>> configurations, many more of which (including some expensive ones)
>> have been added lately.
>>
>> class FnApiRunnerTest(unittest.TestCase):
>> class FnApiRunnerTestWithGrpc(FnApiRunnerTest):
>> class FnApiRunnerTestWithGrpcMultiThreaded(FnApiRunnerTest):
>> class FnApiRunnerTestWithDisabledCaching(FnApiRunnerTest):
>> class FnApiRunnerTestWithMultiWorkers(FnApiRunnerTest):
>> class FnApiRunnerTestWithGrpcAndMultiWorkers(FnApiRunnerTest):
>> class FnApiRunnerTestWithBundleRepeat(FnApiRunnerTest):
>> class FnApiRunnerTestWithBundleRepeatAndMultiWorkers(FnApiRunnerTest):
>>
>> I'm not convinced we need to run all of these permutations, or at
>> least not all tests in all permutations.
>>
>> On Fri, Oct 25, 2019 at 10:57 AM Valentyn Tymofieiev
>>  wrote:
>> >
>> > I took another look at this and precommit ITs are already running in
>> parallel, albeit in the same suite. However, it appears Python precommits
>> became slower, especially Python 2 precommits [35 min per suite x 3
>> suites], see [1]. Not sure yet what caused the increase, but precommits
>> used to be faster. Perhaps we have added a slow test or a lot of new tests.
>> >
>> > [1]
>> https://scans.gradle.com/s/jvcw5fpqfc64k/timeline?task=ancsbov425524
>> >
>> > On Thu, Oct 24, 2019 at 4:53 PM Ahmet Altay  wrote:
>> >>
>> >> Ack. Separating precommit ITs to a different suite sounds good. Anyone
>> is interested in doing that?
>> >>
>> >> On Thu, Oct 24, 2019 at 2:41 PM Valentyn Tymofieiev <
>> valen...@google.com> wrote:
>> >>>
>> >>> This should not increase the queue time substantially, since
>> precommit ITs are running sequentially with precommit tests, unlike
>> multiple precommit tests which run in parallel to each other.
>> >>>
>> >>> The precommit ITs we run are batch and streaming wordcount tests on
>> Py2 and one Py3 version, so it's not a lot of tests.
>> >>>
>> >>> On Thu, Oct 24, 2019 at 1:07 PM Ahmet Altay  wrote:
>> 
>>  +1 to separating ITs from precommit. Downside would be, when Chad
>> tried to do something similar [1] it was noted that the total time to run
>> all precommit tests would increase and also potentially increase the queue
>> time.
>> 
>>  Another alternative, we could run a smaller set of IT tests in
>> precommits and run the whole suite as part of post commit tests.
>> 
>>  [1] https://github.com/apache/beam/pull/9642
>> 
>>  On Thu, Oct 24, 2019 at 12:15 PM Valentyn Tymofieiev <
>> valen...@google.com> wrote:
>> >
>> > One improvement could be to move Precommit IT tests into a separate
>> suite from precommit tests, and run it in parallel.
>> >
>> > On Thu, Oct 24, 2019 at 11:41 AM Brian Hulette 
>> wrote:
>> >>
>> >> Python Precommits are taking quite a while now [1]. Just visually
>> it looks like the average length is 1.5h or so, but it spikes up to 2h.
>> I've had several precommit runs get aborted due to the 2 hour limit.
>> >>
>> >> It looks like there was a spike up above 1h back on 9/6 and the
>> duration has been steadily rising since then. Is there anything we can do
>> about this?
>> >>
>> >> Brian
>> >>
>> >> [1]
>> http://104.154.241.245/d/_TNndF2iz/pre-commit-test-latency?orgId=1&from=now-90d&to=now&fullscreen&panelId=4
>>
>


Re: [EXT] Re: Interactive Beam Example Failing [BEAM-8451]

2019-10-25 Thread Chuck Yang
IMO returning the input PCollection from a PTransform should be a valid,
albeit trivial, use case. I have put up a suggested fix for supporting
these kinds of transforms in the interactive runner here:
https://github.com/apache/beam/pull/9865 . I'm also new to Beam dev, so
if there's something I'm missing please let me know :-)
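
A minimal sketch of the pattern in question (class names are illustrative, not
from the PR): a composite transform whose expand() hands back the PCollection
it was given, next to the identity-Map workaround mentioned for MaybeReshuffle
in the quoted thread below.

import apache_beam as beam


class PassThrough(beam.PTransform):
    # Returns its input unchanged, so the same PCollection object ends up
    # recorded as both consumed and produced by this transform.
    def expand(self, pcoll):
        return pcoll


class PassThroughViaIdentityMap(beam.PTransform):
    # The workaround: an identity Map makes the output a distinct PCollection.
    def expand(self, pcoll):
        return pcoll | beam.Map(lambda x: x)


if __name__ == '__main__':
    with beam.Pipeline() as p:
        _ = (p
             | beam.Create([1, 2, 3])
             | PassThrough()
             | beam.Map(print))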

On Mon, Oct 21, 2019 at 4:46 PM Robert Bradshaw  wrote:
>
> Thanks for trying this out. Yes, this is definitely something that
> should be supported (and tested).
>
> On Mon, Oct 21, 2019 at 3:40 PM Igor Durovic  wrote:
> >
> > Hi everyone,
> >
> > The interactive beam example using the DirectRunner fails after execution 
> > of the last cell. The recursion limit is exceeded during the calculation of 
> > the cache label because of a circular reference in the PipelineInfo object.
> >
> > The constructor for the PipelineInfo class creates a mapping from each 
> > pcollection to the transforms that produce and consume it. The issue arises 
> > when there exists a transform that is both a producer and a consumer for 
> > the same pcollection. This occurs when a transform's expand method returns 
> > the same pcoll object that's passed into it. The specific transform causing 
> > the failure of the example is MaybeReshuffle, which is used in the Create 
> > transform. Replacing "return pcoll" with "return pcoll | Map(lambda x: x)" 
> > seems to fix the problem.
> >
> > A workaround for this issue on the interactive beam side would be fairly 
> > simple, but it seems to me that there should be more validation of 
> > pipelines to prevent the use of transforms that return the same pcoll 
> > that's passed in, or at least a mention of this in the transform style 
> > guide. My understanding is that pcollections are produced by a single 
> > transform (they even have a field called "producer" that references only 
> > one transform). If that's the case then that property of pcollections 
> > should be enforced.
> >
> > I made ticket BEAM-8451 to track this issue.
> >
> > I'm still new to beam so I apologize if I'm fundamentally misunderstanding 
> > something. I'm not exactly sure what the next step should be and would 
> > appreciate some recommendations. I can submit a PR to solve the immediate 
> > problem of the failing example but the underlying problem should also be 
> > addressed at some point. I also apologize if people are already aware of 
> > this problem.
> >
> > Thank You!
> > Igor Durovic



Re: Python Precommit duration pushing 2 hours

2019-10-25 Thread Pablo Estrada
I think it makes sense to remove some of the extra FnApiRunner
configurations. Perhaps some of the multiworkers and some of the grpc
versions?
Best
-P.

On Fri, Oct 25, 2019 at 12:27 PM Robert Bradshaw 
wrote:

> It looks like fn_api_runner_test.py is quite expensive, taking 10-15+
> minutes on each version of Python. This test consists of a base class
> that is basically a validates runner suite, and is then run in several
> configurations, many more of which (including some expensive ones)
> have been added lately.
>
> class FnApiRunnerTest(unittest.TestCase):
> class FnApiRunnerTestWithGrpc(FnApiRunnerTest):
> class FnApiRunnerTestWithGrpcMultiThreaded(FnApiRunnerTest):
> class FnApiRunnerTestWithDisabledCaching(FnApiRunnerTest):
> class FnApiRunnerTestWithMultiWorkers(FnApiRunnerTest):
> class FnApiRunnerTestWithGrpcAndMultiWorkers(FnApiRunnerTest):
> class FnApiRunnerTestWithBundleRepeat(FnApiRunnerTest):
> class FnApiRunnerTestWithBundleRepeatAndMultiWorkers(FnApiRunnerTest):
>
> I'm not convinced we need to run all of these permutations, or at
> least not all tests in all permutations.
>
> On Fri, Oct 25, 2019 at 10:57 AM Valentyn Tymofieiev
>  wrote:
> >
> > I took another look at this and precommit ITs are already running in
> parallel, albeit in the same suite. However, it appears Python precommits
> became slower, especially Python 2 precommits [35 min per suite x 3
> suites], see [1]. Not sure yet what caused the increase, but precommits
> used to be faster. Perhaps we have added a slow test or a lot of new tests.
> >
> > [1] https://scans.gradle.com/s/jvcw5fpqfc64k/timeline?task=ancsbov425524
> >
> > On Thu, Oct 24, 2019 at 4:53 PM Ahmet Altay  wrote:
> >>
> >> Ack. Separating precommit ITs to a different suite sounds good. Anyone
> is interested in doing that?
> >>
> >> On Thu, Oct 24, 2019 at 2:41 PM Valentyn Tymofieiev <
> valen...@google.com> wrote:
> >>>
> >>> This should not increase the queue time substantially, since precommit
> ITs are running sequentially with precommit tests, unlike multiple
> precommit tests which run in parallel to each other.
> >>>
> >>> The precommit ITs we run are batch and streaming wordcount tests on
> Py2 and one Py3 version, so it's not a lot of tests.
> >>>
> >>> On Thu, Oct 24, 2019 at 1:07 PM Ahmet Altay  wrote:
> 
>  +1 to separating ITs from precommit. Downside would be, when Chad
> tried to do something similar [1] it was noted that the total time to run
> all precommit tests would increase and also potentially increase the queue
> time.
> 
>  Another alternative, we could run a smaller set of IT tests in
> precommits and run the whole suite as part of post commit tests.
> 
>  [1] https://github.com/apache/beam/pull/9642
> 
>  On Thu, Oct 24, 2019 at 12:15 PM Valentyn Tymofieiev <
> valen...@google.com> wrote:
> >
> > One improvement could be to move Precommit IT tests into a separate
> suite from precommit tests, and run it in parallel.
> >
> > On Thu, Oct 24, 2019 at 11:41 AM Brian Hulette 
> wrote:
> >>
> >> Python Precommits are taking quite a while now [1]. Just visually
> it looks like the average length is 1.5h or so, but it spikes up to 2h.
> I've had several precommit runs get aborted due to the 2 hour limit.
> >>
> >> It looks like there was a spike up above 1h back on 9/6 and the
> duration has been steadily rising since then. Is there anything we can do
> about this?
> >>
> >> Brian
> >>
> >> [1]
> http://104.154.241.245/d/_TNndF2iz/pre-commit-test-latency?orgId=1&from=now-90d&to=now&fullscreen&panelId=4
>


Re: JIRA priorities explaination

2019-10-25 Thread Pablo Estrada
That SGTM

On Fri, Oct 25, 2019 at 4:18 PM Robert Bradshaw  wrote:

> +1 to both.
>
> On Fri, Oct 25, 2019 at 3:58 PM Valentyn Tymofieiev 
> wrote:
> >
> > On Fri, Oct 25, 2019 at 3:39 PM Kenneth Knowles  wrote:
> >>
> >> Suppose, hypothetically, we say that if Fix Version is set, then
> P0/Blocker and P1/Critical block release and lower priorities get bumped.
> >
> >
> > +1 to Kenn's suggestion.  In addition, we can discourage setting Fix
> version for non-critical issues before issues are fixed.
> >
> >>
> >>
> >> Most likely the release manager still pings and asks about all those
> before bumping. Which means that in effect they were part of the burn down
> and do block the release in the sense that they must be re-triaged away to
> the next release. I would prefer less work for the release manager and more
> emphasis on the default being nonblocking.
> >>
> >> One very different possibility is to ignore Fix Version on open bugs
> and use a different search query as the burndown, auto bump everything that
> didn't make it.
> >
> > This may create a situation where an issue will eventually be closed,
> but Fix Version not updated, and confuse the users who will rely on "Fix
> Version" to find which release actually fixes the issue. A pass over open
> bugs with a Fix Version set to next release (as currently done by a release
> manager) helps to make sure that unfixed bugs won't have Fix Version tag of
> the upcoming release.
> >
> >>
> >> Kenn
> >>
> >> On Fri, Oct 25, 2019, 14:16 Robert Bradshaw 
> wrote:
> >>>
> >>> I'm fine with that, but in that case we should have a priority for
> >>> release blockers, below which bugs get automatically bumped to the
> >>> next release (and which becomes the burndown list).
> >>>
> >>> On Fri, Oct 25, 2019 at 1:58 PM Kenneth Knowles 
> wrote:
> >>> >
> >>> > My takeaway from this thread is that priorities should have a shared
> community intuition and/or policy around how they are treated, which could
> eventually be formalized into SLOs.
> >>> >
> >>> > At a practical level, I do think that build breaks are higher
> priority than release blockers. If you are on this thread but not looking
> at the PR, here is the verbiage I added about urgency:
> >>> >
> >>> > P0/Blocker: "A P0 issue is more urgent than simply blocking the next
> release"
> >>> > P1/Critical: "Most critical bugs should block release"
> >>> > P2/Major: "No special urgency is associated"
> >>> > ...
> >>> >
> >>> > Kenn
> >>> >
> >>> > On Fri, Oct 25, 2019 at 11:46 AM Robert Bradshaw <
> rober...@google.com> wrote:
> >>> >>
> >>> >> We cut a release every 6 weeks, according to schedule, making it
> easy
> >>> >> to plan for, and the release manager typically sends out a warning
> >>> >> email to remind everyone. I don't think it makes sense to do that
> for
> >>> >> every ticket. Blockers should be reserved for things we really
> >>> >> shouldn't release without.
> >>> >>
> >>> >> On Fri, Oct 25, 2019 at 11:33 AM Pablo Estrada 
> wrote:
> >>> >> >
> >>> >> > I mentioned on the PR that I had been using the 'blocker'
> priority along with the 'fix version' field to mark issues that I want to
> get in the release.
> >>> >> > Of course, this little practice of mine only matters much around
> release branch cutting time - and has been useful for me to track which
> things I want to ensure getting into the release / bump to the next /etc.
> >>> >> > I've also found it to be useful as a way to communicate with the
> release manager without having to sync directly.
> >>> >> >
> >>> >> > What would be a reasonable way to tell the release manager "I'd
> like to get this feature in. please talk to me if you're about to cut the
> branch" - that also uses the priorities appropriately? - and that allows
> the release manager to know when a fix version is "more optional" / "less
> optional"?
> >>> >> >
> >>> >> > On Wed, Oct 23, 2019 at 12:20 PM Kenneth Knowles 
> wrote:
> >>> >> >>
> >>> >> >> I finally got around to writing some of this up. It is minimal.
> Feedback is welcome, especially if what I have written does not accurately
> represent the community's approach.
> >>> >> >>
> >>> >> >> https://github.com/apache/beam/pull/9862
> >>> >> >>
> >>> >> >> Kenn
> >>> >> >>
> >>> >> >> On Mon, Feb 11, 2019 at 3:21 PM Daniel Oliveira <
> danolive...@google.com> wrote:
> >>> >> >>>
> >>> >> >>> Ah, sorry, I missed that Alex was just quoting from our Jira
> installation (didn't read his email closely enough). Also I wasn't aware
> about those pages on our website.
> >>> >> >>>
> >>> >> >>> Seeing as we do have definitions for our priorities, I guess my
> main request would be that they be made more discoverable somehow. I don't
> think the tooltips are reliable, and the pages on the website are
> informative, but hard to find. Since it feels a bit lazy to say "this isn't
> discoverable enough" without suggesting any improvements, I'd like to
> propose these two changes:
> >>> >> >>>
> >>> >> >>> 1. We should write a Be

Re: JIRA priorities explaination

2019-10-25 Thread Robert Bradshaw
+1 to both.

On Fri, Oct 25, 2019 at 3:58 PM Valentyn Tymofieiev  wrote:
>
> On Fri, Oct 25, 2019 at 3:39 PM Kenneth Knowles  wrote:
>>
>> Suppose, hypothetically, we say that if Fix Version is set, then P0/Blocker 
>> and P1/Critical block release and lower priorities get bumped.
>
>
> +1 to Kenn's suggestion.  In addition, we can discourage setting Fix version 
> for non-critical issues before issues are fixed.
>
>>
>>
>> Most likely the release manager still pings and asks about all those before 
>> bumping. Which means that in effect they were part of the burn down and do 
>> block the release in the sense that they must be re-triaged away to the next 
>> release. I would prefer less work for the release manager and more emphasis 
>> on the default being nonblocking.
>>
>> One very different possibility is to ignore Fix Version on open bugs and use 
>> a different search query as the burndown, auto bump everything that didn't 
>> make it.
>
> This may create a situation where an issue will eventually be closed, but Fix 
> Version not updated, and confuse the users who will rely on "Fix Version" to
> find which release actually fixes the issue. A pass over open bugs with a Fix 
> Version set to next release (as currently done by a release manager) helps to 
> make sure that unfixed bugs won't have Fix Version tag of the upcoming 
> release.
>
>>
>> Kenn
>>
>> On Fri, Oct 25, 2019, 14:16 Robert Bradshaw  wrote:
>>>
>>> I'm fine with that, but in that case we should have a priority for
>>> release blockers, below which bugs get automatically bumped to the
>>> next release (and which becomes the burndown list).
>>>
>>> On Fri, Oct 25, 2019 at 1:58 PM Kenneth Knowles  wrote:
>>> >
>>> > My takeaway from this thread is that priorities should have a shared 
>>> > community intuition and/or policy around how they are treated, which 
>>> > could eventually be formalized into SLOs.
>>> >
>>> > At a practical level, I do think that build breaks are higher priority 
>>> > than release blockers. If you are on this thread but not looking at the 
>>> > PR, here is the verbiage I added about urgency:
>>> >
>>> > P0/Blocker: "A P0 issue is more urgent than simply blocking the next 
>>> > release"
>>> > P1/Critical: "Most critical bugs should block release"
>>> > P2/Major: "No special urgency is associated"
>>> > ...
>>> >
>>> > Kenn
>>> >
>>> > On Fri, Oct 25, 2019 at 11:46 AM Robert Bradshaw  
>>> > wrote:
>>> >>
>>> >> We cut a release every 6 weeks, according to schedule, making it easy
>>> >> to plan for, and the release manager typically sends out a warning
>>> >> email to remind everyone. I don't think it makes sense to do that for
>>> >> every ticket. Blockers should be reserved for things we really
>>> >> shouldn't release without.
>>> >>
>>> >> On Fri, Oct 25, 2019 at 11:33 AM Pablo Estrada  
>>> >> wrote:
>>> >> >
>>> >> > I mentioned on the PR that I had been using the 'blocker' priority 
>>> >> > along with the 'fix version' field to mark issues that I want to get 
>>> >> > in the release.
>>> >> > Of course, this little practice of mine only matters much around 
>>> >> > release branch cutting time - and has been useful for me to track 
>>> >> > which things I want to ensure getting into the release / bump to the 
>>> >> > next /etc.
>>> >> > I've also found it to be useful as a way to communicate with the 
>>> >> > release manager without having to sync directly.
>>> >> >
>>> >> > What would be a reasonable way to tell the release manager "I'd like 
>>> >> > to get this feature in. please talk to me if you're about to cut the 
>>> >> > branch" - that also uses the priorities appropriately? - and that 
>>> >> > allows the release manager to know when a fix version is "more 
>>> >> > optional" / "less optional"?
>>> >> >
>>> >> > On Wed, Oct 23, 2019 at 12:20 PM Kenneth Knowles  
>>> >> > wrote:
>>> >> >>
>>> >> >> I finally got around to writing some of this up. It is minimal. 
>>> >> >> Feedback is welcome, especially if what I have written does not 
>>> >> >> accurately represent the community's approach.
>>> >> >>
>>> >> >> https://github.com/apache/beam/pull/9862
>>> >> >>
>>> >> >> Kenn
>>> >> >>
>>> >> >> On Mon, Feb 11, 2019 at 3:21 PM Daniel Oliveira 
>>> >> >>  wrote:
>>> >> >>>
>>> >> >>> Ah, sorry, I missed that Alex was just quoting from our Jira 
>>> >> >>> installation (didn't read his email closely enough). Also I wasn't 
>>> >> >>> aware about those pages on our website.
>>> >> >>>
>>> >> >>> Seeing as we do have definitions for our priorities, I guess my main 
>>> >> >>> request would be that they be made more discoverable somehow. I 
>>> >> >>> don't think the tooltips are reliable, and the pages on the website 
>>> >> >>> are informative, but hard to find. Since it feels a bit lazy to say 
>>> >> >>> "this isn't discoverable enough" without suggesting any 
>>> >> >>> improvements, I'd like to propose these two changes:
>>> >> >>>
>>> >> >>> 1. We should write a Beam Jira Guide with basic 

Re: Multiple Outputs from Expand in Python

2019-10-25 Thread Robert Bradshaw
You can literally return a Python tuple of outputs from a composite
transform as well. (Dicts with PCollections as values are also
supported, if you want things to be named rather than referenced by
index.)
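
For example, a small sketch of a composite transform doing exactly that
(transform and label names are made up); returning a dict such as
{'evens': evens, 'odds': odds} instead would make the outputs addressable by
name rather than by position.

import apache_beam as beam


class SplitEvenOdd(beam.PTransform):
    def expand(self, pcoll):
        evens = pcoll | 'FilterEven' >> beam.Filter(lambda x: x % 2 == 0)
        odds = pcoll | 'FilterOdd' >> beam.Filter(lambda x: x % 2 == 1)
        # A plain Python tuple of PCollections as the composite's outputs.
        return evens, odds


if __name__ == '__main__':
    with beam.Pipeline() as p:
        evens, odds = p | beam.Create(range(10)) | SplitEvenOdd()
        _ = evens | 'PrintEvens' >> beam.Map(lambda x: print('even', x))
        _ = odds | 'PrintOdds' >> beam.Map(lambda x: print('odd', x))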

On Fri, Oct 25, 2019 at 4:06 PM Ahmet Altay  wrote:
>
> Is DoOutputsTuple what you are looking for? [1] You can look at this expand 
> function using it [2].
>
> [1] 
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/pvalue.py#L204
> [2] 
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/core.py#L1283
>
> On Fri, Oct 25, 2019 at 3:51 PM Luke Cwik  wrote:
>>
>> My example is about multiple inputs and not multiple outputs; from further
>> investigation, it seems I don't know.
>>
>> Looking at the documentation online[1], it doesn't seem to specify how to do
>> this for composite transforms either. All the examples are of the single
>> output variety as well[2].
>>
>> 1: 
>> https://beam.apache.org/documentation/programming-guide/#composite-transforms
>> 2: 
>> https://github.com/apache/beam/blob/4ba731fe93f7f8385c771caf576745d14edf34b8/sdks/python/apache_beam/examples/cookbook/custom_ptransform.py
>>
>> On Fri, Oct 25, 2019 at 10:24 AM Luke Cwik  wrote:
>>>
>>> I believe PCollectionTuple should be unnecessary since Python has first 
>>> class support for tuples as shown in the example below[1]. Can we use 
>>> tuples to solve your issue?
>>>
>>> wordsStartingWithA = \
>>> p | 'Words starting with A' >> beam.Create(['apple', 'ant', 'arrow'])
>>>
>>> wordsStartingWithB = \
>>> p | 'Words starting with B' >> beam.Create(['ball', 'book', 'bow'])
>>>
>>> ((wordsStartingWithA, wordsStartingWithB)
>>> | beam.Flatten()
>>> | LogElements())
>>>
>>> 1: 
>>> https://github.com/apache/beam/blob/238659bce8043e6a64619a959ab44453dbe22dff/learning/katas/python/Core%20Transforms/Flatten/Flatten/task.py#L29
>>>
>>> On Fri, Oct 25, 2019 at 10:11 AM Sam Rohde  wrote:

 Talked to Daniel offline and it looks like the Python SDK is missing 
 PCollection Tuples like the one Java has: 
 https://github.com/rohdesamuel/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/PCollectionTuple.java.

 I'll go ahead and implement that for the Python SDK.

 On Thu, Oct 24, 2019 at 5:20 PM Sam Rohde  wrote:
>
> Hey All,
>
> I'm trying to implement an expand override with multiple output 
> PCollections. The kicker is that I want to insert a new transform for 
> each output PCollection. How can I do this?
>
> Regards,
> Sam


Re: Multiple Outputs from Expand in Python

2019-10-25 Thread Ahmet Altay
Is DoOutputsTuple what you are looking for? [1] You can look at this expand
function using it [2].

[1]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/pvalue.py#L204
[2]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/core.py#L1283
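
For reference, the usual way to end up with a DoOutputsTuple is a multi-output
ParDo via with_outputs(); a small sketch with illustrative tag names:

import apache_beam as beam


class TagEven(beam.DoFn):
    def process(self, element):
        if element % 2 == 0:
            # Emit to an additional, tagged output.
            yield beam.pvalue.TaggedOutput('even', element)
        else:
            # Untagged emissions go to the main output.
            yield element


if __name__ == '__main__':
    with beam.Pipeline() as p:
        results = (p
                   | beam.Create(range(10))
                   | beam.ParDo(TagEven()).with_outputs('even', main='odd'))
        # `results` is a DoOutputsTuple; each output is addressable by its tag.
        _ = results.even | 'PrintEven' >> beam.Map(lambda x: print('even', x))
        _ = results.odd | 'PrintOdd' >> beam.Map(lambda x: print('odd', x))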

On Fri, Oct 25, 2019 at 3:51 PM Luke Cwik  wrote:

> My example is about multiple inputs and not multiple outputs; from further
> investigation, it seems I don't know.
>
> Looking at the documentation online[1], it doesn't seem to specify how to do
> this for composite transforms either. All the examples are of the single
> output variety as well[2].
>
> 1:
> https://beam.apache.org/documentation/programming-guide/#composite-transforms
> 2:
> https://github.com/apache/beam/blob/4ba731fe93f7f8385c771caf576745d14edf34b8/sdks/python/apache_beam/examples/cookbook/custom_ptransform.py
>
> On Fri, Oct 25, 2019 at 10:24 AM Luke Cwik  wrote:
>
>> I believe PCollectionTuple should be unnecessary since Python has first
>> class support for tuples as shown in the example below[1]. Can we use
>> tuples to solve your issue?
>>
>> wordsStartingWithA = \
>> p | 'Words starting with A' >> beam.Create(['apple', 'ant', 'arrow'])
>>
>> wordsStartingWithB = \
>> p | 'Words starting with B' >> beam.Create(['ball', 'book', 'bow'])
>>
>> ((wordsStartingWithA, wordsStartingWithB)
>> | beam.Flatten()
>> | LogElements())
>>
>> 1:
>> https://github.com/apache/beam/blob/238659bce8043e6a64619a959ab44453dbe22dff/learning/katas/python/Core%20Transforms/Flatten/Flatten/task.py#L29
>>
>> On Fri, Oct 25, 2019 at 10:11 AM Sam Rohde  wrote:
>>
>>> Talked to Daniel offline and it looks like the Python SDK is missing
>>> PCollection Tuples like the one Java has:
>>> https://github.com/rohdesamuel/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/PCollectionTuple.java
>>> .
>>>
>>> I'll go ahead and implement that for the Python SDK.
>>>
>>> On Thu, Oct 24, 2019 at 5:20 PM Sam Rohde  wrote:
>>>
 Hey All,

 I'm trying to implement an expand override with multiple output
 PCollections. The kicker is that I want to insert a new transform for each
 output PCollection. How can I do this?

 Regards,
 Sam

>>>


Re: JIRA priorities explaination

2019-10-25 Thread Valentyn Tymofieiev
On Fri, Oct 25, 2019 at 3:39 PM Kenneth Knowles  wrote:

> Suppose, hypothetically, we say that if Fix Version is set, then
> P0/Blocker and P1/Critical block release and lower priorities get bumped.
>

+1 to Kenn's suggestion.  In addition, we can discourage setting Fix
version for non-critical issues before issues are fixed.


>
> Most likely the release manager still pings and asks about all those
> before bumping. Which means that in effect they were part of the burn down
> and do block the release in the sense that they must be re-triaged away to
> the next release. I would prefer less work for the release manager and more
> emphasis on the default being nonblocking.
>
> One very different possibility is to ignore Fix Version on open bugs and
> use a different search query as the burndown, auto bump everything that
> didn't make it.
>
This may create a situation where an issue will eventually be closed, but
Fix Version not updated, and confuse the users who will rely on "Fix Version"
to find which release actually fixes the issue. A pass over open bugs with
a Fix Version set to next release (as currently done by a release manager)
helps to make sure that unfixed bugs won't have Fix Version tag of the
upcoming release.


> Kenn
>
> On Fri, Oct 25, 2019, 14:16 Robert Bradshaw  wrote:
>
>> I'm fine with that, but in that case we should have a priority for
>> release blockers, below which bugs get automatically bumped to the
>> next release (and which becomes the burndown list).
>>
>> On Fri, Oct 25, 2019 at 1:58 PM Kenneth Knowles  wrote:
>> >
>> > My takeaway from this thread is that priorities should have a shared
>> community intuition and/or policy around how they are treated, which could
>> eventually be formalized into SLOs.
>> >
>> > At a practical level, I do think that build breaks are higher priority
>> than release blockers. If you are on this thread but not looking at the PR,
>> here is the verbiage I added about urgency:
>> >
>> > P0/Blocker: "A P0 issue is more urgent than simply blocking the next
>> release"
>> > P1/Critical: "Most critical bugs should block release"
>> > P2/Major: "No special urgency is associated"
>> > ...
>> >
>> > Kenn
>> >
>> > On Fri, Oct 25, 2019 at 11:46 AM Robert Bradshaw 
>> wrote:
>> >>
>> >> We cut a release every 6 weeks, according to schedule, making it easy
>> >> to plan for, and the release manager typically sends out a warning
>> >> email to remind everyone. I don't think it makes sense to do that for
>> >> every ticket. Blockers should be reserved for things we really
>> >> shouldn't release without.
>> >>
>> >> On Fri, Oct 25, 2019 at 11:33 AM Pablo Estrada 
>> wrote:
>> >> >
>> >> > I mentioned on the PR that I had been using the 'blocker' priority
>> along with the 'fix version' field to mark issues that I want to get in the
>> release.
>> >> > Of course, this little practice of mine only matters much around
>> release branch cutting time - and has been useful for me to track which
>> things I want to ensure getting into the release / bump to the next /etc.
>> >> > I've also found it to be useful as a way to communicate with the
>> release manager without having to sync directly.
>> >> >
>> >> > What would be a reasonable way to tell the release manager "I'd like
>> to get this feature in. please talk to me if you're about to cut the
>> branch" - that also uses the priorities appropriately? - and that allows
>> the release manager to know when a fix version is "more optional" / "less
>> optional"?
>> >> >
>> >> > On Wed, Oct 23, 2019 at 12:20 PM Kenneth Knowles 
>> wrote:
>> >> >>
>> >> >> I finally got around to writing some of this up. It is minimal.
>> Feedback is welcome, especially if what I have written does not accurately
>> represent the community's approach.
>> >> >>
>> >> >> https://github.com/apache/beam/pull/9862
>> >> >>
>> >> >> Kenn
>> >> >>
>> >> >> On Mon, Feb 11, 2019 at 3:21 PM Daniel Oliveira <
>> danolive...@google.com> wrote:
>> >> >>>
>> >> >>> Ah, sorry, I missed that Alex was just quoting from our Jira
>> installation (didn't read his email closely enough). Also I wasn't aware
>> about those pages on our website.
>> >> >>>
>> >> >>> Seeing as we do have definitions for our priorities, I guess my
>> main request would be that they be made more discoverable somehow. I don't
>> think the tooltips are reliable, and the pages on the website are
>> informative, but hard to find. Since it feels a bit lazy to say "this isn't
>> discoverable enough" without suggesting any improvements, I'd like to
>> propose these two changes:
>> >> >>>
>> >> >>> 1. We should write a Beam Jira Guide with basic information about
>> our Jira. I think the bug priorities should go in here, but also anything
>> else we would want someone to know before filing any Jira issues, like how
>> our components are organized or what the different issue types mean. This
>> guide could either be written in the website or the wiki, but I think it
>> should definitely be l

Re: Multiple Outputs from Expand in Python

2019-10-25 Thread Luke Cwik
My example is about multiple inputs and not multiple outputs; from further
investigation, it seems I don't know.

Looking at the documentation online[1], it doesn't seem to specify how to do
this for composite transforms either. All the examples are of the single
output variety as well[2].

1:
https://beam.apache.org/documentation/programming-guide/#composite-transforms
2:
https://github.com/apache/beam/blob/4ba731fe93f7f8385c771caf576745d14edf34b8/sdks/python/apache_beam/examples/cookbook/custom_ptransform.py

On Fri, Oct 25, 2019 at 10:24 AM Luke Cwik  wrote:

> I believe PCollectionTuple should be unnecessary since Python has first
> class support for tuples as shown in the example below[1]. Can we use
> tuples to solve your issue?
>
> wordsStartingWithA = \
> p | 'Words starting with A' >> beam.Create(['apple', 'ant', 'arrow'])
>
> wordsStartingWithB = \
> p | 'Words starting with B' >> beam.Create(['ball', 'book', 'bow'])
>
> ((wordsStartingWithA, wordsStartingWithB)
> | beam.Flatten()
> | LogElements())
>
> 1:
> https://github.com/apache/beam/blob/238659bce8043e6a64619a959ab44453dbe22dff/learning/katas/python/Core%20Transforms/Flatten/Flatten/task.py#L29
>
> On Fri, Oct 25, 2019 at 10:11 AM Sam Rohde  wrote:
>
>> Talked to Daniel offline and it looks like the Python SDK is missing
>> PCollection Tuples like the one Java has:
>> https://github.com/rohdesamuel/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/PCollectionTuple.java
>> .
>>
>> I'll go ahead and implement that for the Python SDK.
>>
>> On Thu, Oct 24, 2019 at 5:20 PM Sam Rohde  wrote:
>>
>>> Hey All,
>>>
>>> I'm trying to implement an expand override with multiple output
>>> PCollections. The kicker is that I want to insert a new transform for each
>>> output PCollection. How can I do this?
>>>
>>> Regards,
>>> Sam
>>>
>>


Re: JIRA priorities explaination

2019-10-25 Thread Kenneth Knowles
Suppose, hypothetically, we say that if Fix Version is set, then P0/Blocker
and P1/Critical block release and lower priorities get bumped.

Most likely the release manager still pings and asks about all those before
bumping. Which means that in effect they were part of the burn down and do
block the release in the sense that they must be re-triaged away to the
next release. I would prefer less work for the release manager and more
emphasis on the default being nonblocking.

One very different possibility is to ignore Fix Version on open bugs and
use a different search query as the burndown, auto bump everything that
didn't make it.

Kenn

On Fri, Oct 25, 2019, 14:16 Robert Bradshaw  wrote:

> I'm fine with that, but in that case we should have a priority for
> release blockers, below which bugs get automatically bumped to the
> next release (and which becomes the burndown list).
>
> On Fri, Oct 25, 2019 at 1:58 PM Kenneth Knowles  wrote:
> >
> > My takeaway from this thread is that priorities should have a shared
> community intuition and/or policy around how they are treated, which could
> eventually be formalized into SLOs.
> >
> > At a practical level, I do think that build breaks are higher priority
> than release blockers. If you are on this thread but not looking at the PR,
> here is the verbiage I added about urgency:
> >
> > P0/Blocker: "A P0 issue is more urgent than simply blocking the next
> release"
> > P1/Critical: "Most critical bugs should block release"
> > P2/Major: "No special urgency is associated"
> > ...
> >
> > Kenn
> >
> > On Fri, Oct 25, 2019 at 11:46 AM Robert Bradshaw 
> wrote:
> >>
> >> We cut a release every 6 weeks, according to schedule, making it easy
> >> to plan for, and the release manager typically sends out a warning
> >> email to remind everyone. I don't think it makes sense to do that for
> >> every ticket. Blockers should be reserved for things we really
> >> shouldn't release without.
> >>
> >> On Fri, Oct 25, 2019 at 11:33 AM Pablo Estrada 
> wrote:
> >> >
> >> > I mentioned on the PR that I had been using the 'blocker' priority
> along with the 'fix version' field to mark issues that I want to get in the
> release.
> >> > Of course, this little practice of mine only matters much around
> release branch cutting time - and has been useful for me to track which
> things I want to ensure getting into the release / bump to the next /etc.
> >> > I've also found it to be useful as a way to communicate with the
> release manager without having to sync directly.
> >> >
> >> > What would be a reasonable way to tell the release manager "I'd like
> to get this feature in. please talk to me if you're about to cut the
> branch" - that also uses the priorities appropriately? - and that allows
> the release manager to know when a fix version is "more optional" / "less
> optional"?
> >> >
> >> > On Wed, Oct 23, 2019 at 12:20 PM Kenneth Knowles 
> wrote:
> >> >>
> >> >> I finally got around to writing some of this up. It is minimal.
> Feedback is welcome, especially if what I have written does not accurately
> represent the community's approach.
> >> >>
> >> >> https://github.com/apache/beam/pull/9862
> >> >>
> >> >> Kenn
> >> >>
> >> >> On Mon, Feb 11, 2019 at 3:21 PM Daniel Oliveira <
> danolive...@google.com> wrote:
> >> >>>
> >> >>> Ah, sorry, I missed that Alex was just quoting from our Jira
> installation (didn't read his email closely enough). Also I wasn't aware
> about those pages on our website.
> >> >>>
> >> >>> Seeing as we do have definitions for our priorities, I guess my
> main request would be that they be made more discoverable somehow. I don't
> think the tooltips are reliable, and the pages on the website are
> informative, but hard to find. Since it feels a bit lazy to say "this isn't
> discoverable enough" without suggesting any improvements, I'd like to
> propose these two changes:
> >> >>>
> >> >>> 1. We should write a Beam Jira Guide with basic information about
> our Jira. I think the bug priorities should go in here, but also anything
> else we would want someone to know before filing any Jira issues, like how
> our components are organized or what the different issue types mean. This
> guide could either be written in the website or the wiki, but I think it
> should definitely be linked in https://beam.apache.org/contribute/ so
> that newcomers read it before getting their Jira account approved. The goal
> here being to have a reference for the basics of our Jira since at the
> moment it doesn't seem like we have anything for this.
> >> >>>
> >> >>> 2. The existing info on Post-commit and pre-commit policies doesn't
> seem very discoverable to someone monitoring the Pre/Post-commits. I've
> reported a handful of test-failures already and haven't seen this link
> mentioned much. We should try to find a way to funnel people towards this
> link when there's an issue, the same way we try to funnel people towards
> the contribution guide when they write a PR. A

Re: [DISCUSS] How to stopp SdkWorker in SdkHarness

2019-10-25 Thread Luke Cwik
Approach 3 is about caching the bundle descriptor forever but tearing down
a "live" instance of the DoFns at some SDK-chosen, arbitrary point in time.
This way, if a future ProcessBundleRequest comes in, a new "live" instance
can be constructed.
Approach 2 is still needed so that when the workers are being shut down, all
the "live" instances are torn down.
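
For context, a minimal sketch of a DoFn that holds a resource across bundles;
the thread is about when the harness should call teardown() on such a "live"
instance (the file path and class name are illustrative):

import apache_beam as beam


class WriteWithConnection(beam.DoFn):
    def setup(self):
        # Called once when the SDK harness constructs a "live" DoFn instance.
        self._sink = open('/tmp/example-sink.txt', 'a')

    def process(self, element):
        self._sink.write('%s\n' % element)
        yield element

    def teardown(self):
        # The open question in this thread: should this run on an SDK-chosen
        # eviction (Approach 3), on a runner-driven request (Approach 1),
        # and/or only at worker shutdown (Approach 2)?
        self._sink.close()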

On Fri, Oct 25, 2019 at 12:56 PM Robert Burke  wrote:

> Approach 2 isn't incompatible with approach 3. Approach 3 simply sets down a
> convention/configuration for the conditions under which the SDK will do this
> after the process bundle has completed.
>
>
>
> On Fri, Oct 25, 2019, 12:34 PM Robert Bradshaw 
> wrote:
>
>> I think we'll still need approach (2) for when the pipeline finishes
>> and a runner is tearing down workers.
>>
>> On Fri, Oct 25, 2019 at 10:36 AM Maximilian Michels 
>> wrote:
>> >
>> > Hi Jincheng,
>> >
>> > Thanks for bringing this up and capturing the ideas in the doc.
>> >
>> > Intuitively, I would have also considered adding a new Proto message for
>> > the teardown, but I think the idea to trigger this logic when the SDK
>> > Harness evicts process bundle descriptors is more elegant.
>> >
>> > Thanks,
>> > Max
>> >
>> > On 25.10.19 17:23, Luke Cwik wrote:
>> > > I like approach 3 since it doesn't add additional complexity to the
>> API
>> > > and individual SDKs can choose to implement any clean-up strategy they
>> > > want or none at all which is the simplest.
>> > >
>> > > On Thu, Oct 24, 2019 at 8:46 PM jincheng sun <
>> sunjincheng...@gmail.com
>> > > > wrote:
>> > >
>> > > Hi,
>> > >
>> > > Thanks for your comments in the doc, I have added Approach 3, which you
>> > > mentioned! @Luke
>> > >
>> > > For now, we should make a decision between Approach 3 and Approach 1.
>> > > Details can be found in the doc [1]
>> > >
>> > > Welcome anyone's feedback :)
>> > >
>> > > Regards,
>> > > Jincheng
>> > >
>> > > [1]
>> > >
>> https://docs.google.com/document/d/1sCgy9VQPf9zVXKRquK8P6N4x7aB62GEO8ozkujRSHZg/edit?usp=sharing
>> > >
>> > > jincheng sun wrote on Fri, Oct 25, 2019 at 10:40 AM:
>> > >
>> > > Hi,
>> > >
>> > > It is functionally capable of `abort`, but it will be called at the
>> > > end of the operator. So, I prefer `dispose` semantics, i.e., all
>> > > normal logic has been executed.
>> > >
>> > > Best,
>> > > Jincheng
>> > >
>> > > Harsh Vardhan  wrote on Wed, Oct 23, 2019 at 12:14 AM:
>> > >
>> > > Would approach 1 be akin to abort semantics?
>> > >
>> > > On Mon, Oct 21, 2019 at 8:01 PM jincheng sun
>> > > > sunjincheng...@gmail.com>>
>> > > wrote:
>> > >
>> > > Hi Luke,
>> > >
>> > > Thanks a lot for your reply. Since it allows sharing
>> > > one SDK harness between multiple executable stages, the
>> > > control service termination may occur much later than
>> > > the completion of an executable stage. This is the main
>> > > reason I prefer runners to control the teardown of DoFns.
>> > >
>> > > Regarding "SDK harnesses can terminate instances any
>> > > time they want and start new instances anytime as
>> > > well.", personally I think it does not conflict with the
>> > > proposed Approach 1, as the SDK harness could decide what
>> > > to do when receiving the teardown request. It could do
>> > > nothing if the DoFns have already been torn down and
>> > > could also tear down the DoFns if needed.
>> > >
>> > > What do you think?
>> > >
>> > > Best,
>> > > Jincheng
>> > >
>> > > Luke Cwik  wrote on Tue, Oct 22, 2019 at 2:05 AM:
>> > >
>> > > Approach 2 is currently the suggested approach[1]
>> > > for DoFns to shut down.
>> > > Note that SDK harnesses can terminate instances any
>> > > time they want and start new instances anytime as well.
>> > >
>> > > Why do you want to expose this logic so that Runners
>> > > could control it?
>> > >
>> > > 1:
>> > >
>> https://docs.google.com/document/d/1n6s3BOxOPct3uF4UgbbI9O9rpdiKWFH9R6mtVmR7xp0/edit#
>> > >
>> > > On Mon, Oct 21, 2019 at 4:27 AM jincheng sun
>> > > > > > > wrote:
>> > >
>> > > Hi,
>> > > I found that `SdkHarness` does not stop the
>> > > `SdkWorker` when finished. We should add the
>> > > logic to stop the `SdkWorker` in

Re: JIRA priorities explaination

2019-10-25 Thread Kenneth Knowles
My takeaway from this thread is that priorities should have a shared
community intuition and/or policy around how they are treated, which could
eventually be formalized into SLOs.

At a practical level, I do think that build breaks are higher priority than
release blockers. If you are on this thread but not looking at the PR, here
is the verbiage I added about urgency:

P0/Blocker: "A P0 issue is more urgent than simply blocking the next
release"
P1/Critical: "Most critical bugs should block release"
P2/Major: "No special urgency is associated"
...

Kenn

On Fri, Oct 25, 2019 at 11:46 AM Robert Bradshaw 
wrote:

> We cut a release every 6 weeks, according to schedule, making it easy
> to plan for, and the release manager typically sends out a warning
> email to remind everyone. I don't think it makes sense to do that for
> every ticket. Blockers should be reserved for things we really
> shouldn't release without.
>
> On Fri, Oct 25, 2019 at 11:33 AM Pablo Estrada  wrote:
> >
> > I mentioned on the PR that I had been using the 'blocker' priority along
> with the 'fix version' field to mark issues that I want to get in the
> release.
> > Of course, this little practice of mine only matters much around release
> branch cutting time - and has been useful for me to track which things I
> want to ensure getting into the release / bump to the next /etc.
> > I've also found it to be useful as a way to communicate with the release
> manager without having to sync directly.
> >
> > What would be a reasonable way to tell the release manager "I'd like to
> get this feature in. please talk to me if you're about to cut the branch" -
> that also uses the priorities appropriately? - and that allows the release
> manager to know when a fix version is "more optional" / "less optional"?
> >
> > On Wed, Oct 23, 2019 at 12:20 PM Kenneth Knowles 
> wrote:
> >>
> >> I finally got around to writing some of this up. It is minimal.
> Feedback is welcome, especially if what I have written does not accurately
> represent the community's approach.
> >>
> >> https://github.com/apache/beam/pull/9862
> >>
> >> Kenn
> >>
> >> On Mon, Feb 11, 2019 at 3:21 PM Daniel Oliveira 
> wrote:
> >>>
> >>> Ah, sorry, I missed that Alex was just quoting from our Jira
> installation (didn't read his email closely enough). Also I wasn't aware
> about those pages on our website.
> >>>
> >>> Seeing as we do have definitions for our priorities, I guess my main
> request would be that they be made more discoverable somehow. I don't think
> the tooltips are reliable, and the pages on the website are informative,
> but hard to find. Since it feels a bit lazy to say "this isn't discoverable
> enough" without suggesting any improvements, I'd like to propose these two
> changes:
> >>>
> >>> 1. We should write a Beam Jira Guide with basic information about our
> Jira. I think the bug priorities should go in here, but also anything else
> we would want someone to know before filing any Jira issues, like how our
> components are organized or what the different issue types mean. This guide
> could either be written in the website or the wiki, but I think it should
> definitely be linked in https://beam.apache.org/contribute/ so that
> newcomers read it before getting their Jira account approved. The goal here
> being to have a reference for the basics of our Jira since at the moment it
> doesn't seem like we have anything for this.
> >>>
> >>> 2. The existing info on Post-commit and pre-commit policies doesn't
> seem very discoverable to someone monitoring the Pre/Post-commits. I've
> reported a handful of test-failures already and haven't seen this link
> mentioned much. We should try to find a way to funnel people towards this
> link when there's an issue, the same way we try to funnel people towards
> the contribution guide when they write a PR. As a note, while writing this
> email I remembered this link that someone gave me before (
> https://s.apache.org/beam-test-failure). That mentions the Post-commit
> policies page, so maybe it's just a matter of pasting that all over our
> Jenkins builds whenever we have a failing test?
> >>>
> >>> PS: I'm also definitely for SLOs, but I figure it's probably better
> discussed in a separate thread so I'm trying to stick to the subject of
> priority definitions.
> >>>
> >>> On Mon, Feb 11, 2019 at 9:17 AM Scott Wegner  wrote:
> 
>  Thanks for driving this discussion. I also was not aware of these
> existing definitions. Once we agree on the terms, let's add them to our
> Contributor Guide and start using them.
> 
>  +1 in general; I like both Alex and Kenn's definitions; Additional
> wordsmithing could be moved to a Pull Request. Can we make the definitions
> useful for both the person filing a bug, and the assignee, i.e.
> 
>  :  assigned>. 
> 
>  On Sun, Feb 10, 2019 at 7:49 PM Kenneth Knowles 
> wrote:
> >
> > The content that Alex posted* is the definition from our Jira
> install

Re: [DISCUSS] How to stopp SdkWorker in SdkHarness

2019-10-25 Thread Robert Burke
Approach 2 isn't incompatible with approach 3. Approach 3 simply sets down a
convention/configuration for the conditions under which the SDK will do this
after a process bundle has completed.



On Fri, Oct 25, 2019, 12:34 PM Robert Bradshaw  wrote:

> I think we'll still need approach (2) for when the pipeline finishes
> and a runner is tearing down workers.
>
> On Fri, Oct 25, 2019 at 10:36 AM Maximilian Michels 
> wrote:
> >
> > Hi Jincheng,
> >
> > Thanks for bringing this up and capturing the ideas in the doc.
> >
> > Intuitively, I would have also considered adding a new Proto message for
> > the teardown, but I think the idea to trigger this logic when the SDK
> > Harness evicts process bundle descriptors is more elegant.
> >
> > Thanks,
> > Max
> >
> > On 25.10.19 17:23, Luke Cwik wrote:
> > > I like approach 3 since it doesn't add additional complexity to the API
> > > and individual SDKs can choose to implement any clean-up strategy they
> > > want or none at all which is the simplest.
> > >
> > > On Thu, Oct 24, 2019 at 8:46 PM jincheng sun  > > > wrote:
> > >
> > > Hi,
> > >
> > > Thanks for your comments in doc, I have add Approach 3 which you
> > > mentioned! @Luke
> > >
> > > For now, we should do a decision for Approach 3 and Approach 1.
> > > Detail can be found in doc [1]
> > >
> > > Welcome anyone's feedback :)
> > >
> > > Regards,
> > > Jincheng
> > >
> > > [1]
> > >
> https://docs.google.com/document/d/1sCgy9VQPf9zVXKRquK8P6N4x7aB62GEO8ozkujRSHZg/edit?usp=sharing
> > >
> > > jincheng sun wrote on Fri, Oct 25, 2019 at 10:40 AM:
> > >
> > > Hi,
> > >
> > > Functionally capable of `abort`, but it will be called at the
> > > end of operator. So, I prefer `dispose` semantics. i.e., all
> > > normal logic has been executed.
> > >
> > > Best,
> > > Jincheng
> > >
> > > Harsh Vardhan mailto:anan...@google.com>>
> > > wrote on Wed, Oct 23, 2019 at 12:14 AM:
> > >
> > > Would approach 1 be akin to abort semantics?
> > >
> > > On Mon, Oct 21, 2019 at 8:01 PM jincheng sun
> > > mailto:sunjincheng...@gmail.com
> >>
> > > wrote:
> > >
> > > Hi Luke,
> > >
> > > Thanks a lot for your reply. Since it allows to share
> > > one SDK harness between multiple executable stages, the
> > > control service termination may occur much later than
> > > the completion of an executable stage. This is the main
> > > reason I prefer runners to control the teardown of
> DoFns.
> > >
> > > Regarding to "SDK harnesses can terminate instances any
> > > time they want and start new instances anytime as
> > > well.", personally I think it's not conflict with the
> > > proposed Approach 1 as the SDK harness could decide
> what
> > > to do when receiving the teardown request. It could do
> > > nothing if the DoFns has already been teared down and
> > > could also tear down the DoFns if needed.
> > >
> > > What do you think?
> > >
> > > Best,
> > > Jincheng
> > >
> > > Luke Cwik mailto:lc...@google.com>>
> > > wrote on Tue, Oct 22, 2019 at 2:05 AM:
> > >
> > > Approach 2 is currently the suggested approach[1]
> > > for DoFn's to shutdown.
> > > Note that SDK harnesses can terminate instances any
> > > time they want and start new instances anytime as
> well.
> > >
> > > Why do you want to expose this logic so that
> Runners
> > > could control it?
> > >
> > > 1:
> > >
> https://docs.google.com/document/d/1n6s3BOxOPct3uF4UgbbI9O9rpdiKWFH9R6mtVmR7xp0/edit#
> > >
> > > On Mon, Oct 21, 2019 at 4:27 AM jincheng sun
> > >  > > > wrote:
> > >
> > > Hi,
> > > I found that in `SdkHarness` do not  stop the
> > > `SdkWorker` when finish.  We should add the
> > > logic for stop the `SdkWorker` in `SdkHarness`.
> > > More detail can be found [1].
> > >
> > > There are two approaches to solve this issue:
> > >
> > > Approach 1:  We can add a Fn API for teardown
> > > purpose and the runner will teardown a specific
> > > bundle descriptor via this teardown Fn API
> > > during disposing.
> > > Approach 2: The control service termination
> > > could be seen as a signal and once SDK ha

Re: JIRA priorities explaination

2019-10-25 Thread Robert Bradshaw
I'm fine with that, but in that case we should have a priority for
release blockers, below which bugs get automatically bumped to the
next release (and which becomes the burndown list).

On Fri, Oct 25, 2019 at 1:58 PM Kenneth Knowles  wrote:
>
> My takeaway from this thread is that priorities should have a shared 
> community intuition and/or policy around how they are treated, which could 
> eventually be formalized into SLOs.
>
> At a practical level, I do think that build breaks are higher priority than 
> release blockers. If you are on this thread but not looking at the PR, here 
> is the verbiage I added about urgency:
>
> P0/Blocker: "A P0 issue is more urgent than simply blocking the next release"
> P1/Critical: "Most critical bugs should block release"
> P2/Major: "No special urgency is associated"
> ...
>
> Kenn
>
> On Fri, Oct 25, 2019 at 11:46 AM Robert Bradshaw  wrote:
>>
>> We cut a release every 6 weeks, according to schedule, making it easy
>> to plan for, and the release manager typically sends out a warning
>> email to remind everyone. I don't think it makes sense to do that for
>> every ticket. Blockers should be reserved for things we really
>> shouldn't release without.
>>
>> On Fri, Oct 25, 2019 at 11:33 AM Pablo Estrada  wrote:
>> >
>> > I mentioned on the PR that I had been using the 'blocker' priority along 
>> > with the 'fix version' field to mark issues that I want to get in the 
>> > release.
>> > Of course, this little practice of mine only matters much around release 
>> > branch cutting time - and has been useful for me to track which things I 
>> > want to ensure getting into the release / bump to the next /etc.
>> > I've also found it to be useful as a way to communicate with the release 
>> > manager without having to sync directly.
>> >
>> > What would be a reasonable way to tell the release manager "I'd like to 
>> > get this feature in. please talk to me if you're about to cut the branch" 
>> > - that also uses the priorities appropriately? - and that allows the 
>> > release manager to know when a fix version is "more optional" / "less 
>> > optional"?
>> >
>> > On Wed, Oct 23, 2019 at 12:20 PM Kenneth Knowles  wrote:
>> >>
>> >> I finally got around to writing some of this up. It is minimal. Feedback 
>> >> is welcome, especially if what I have written does not accurately 
>> >> represent the community's approach.
>> >>
>> >> https://github.com/apache/beam/pull/9862
>> >>
>> >> Kenn
>> >>
>> >> On Mon, Feb 11, 2019 at 3:21 PM Daniel Oliveira  
>> >> wrote:
>> >>>
>> >>> Ah, sorry, I missed that Alex was just quoting from our Jira 
>> >>> installation (didn't read his email closely enough). Also I wasn't aware 
>> >>> about those pages on our website.
>> >>>
>> >>> Seeing as we do have definitions for our priorities, I guess my main 
>> >>> request would be that they be made more discoverable somehow. I don't 
>> >>> think the tooltips are reliable, and the pages on the website are 
>> >>> informative, but hard to find. Since it feels a bit lazy to say "this 
>> >>> isn't discoverable enough" without suggesting any improvements, I'd like 
>> >>> to propose these two changes:
>> >>>
>> >>> 1. We should write a Beam Jira Guide with basic information about our 
>> >>> Jira. I think the bug priorities should go in here, but also anything 
>> >>> else we would want someone to know before filing any Jira issues, like 
>> >>> how our components are organized or what the different issue types mean. 
>> >>> This guide could either be written in the website or the wiki, but I 
>> >>> think it should definitely be linked in 
>> >>> https://beam.apache.org/contribute/ so that newcomers read it before 
>> >>> getting their Jira account approved. The goal here being to have a 
>> >>> reference for the basics of our Jira since at the moment it doesn't seem 
>> >>> like we have anything for this.
>> >>>
>> >>> 2. The existing info on Post-commit and pre-commit policies doesn't seem 
>> >>> very discoverable to someone monitoring the Pre/Post-commits. I've 
>> >>> reported a handful of test-failures already and haven't seen this link 
>> >>> mentioned much. We should try to find a way to funnel people towards 
>> >>> this link when there's an issue, the same way we try to funnel people 
>> >>> towards the contribution guide when they write a PR. As a note, while 
>> >>> writing this email I remembered this link that someone gave me before 
>> >>> (https://s.apache.org/beam-test-failure). That mentions the Post-commit 
>> >>> policies page, so maybe it's just a matter of pasting that all over our 
>> >>> Jenkins builds whenever we have a failing test?
>> >>>
>> >>> PS: I'm also definitely for SLOs, but I figure it's probably better 
>> >>> discussed in a separate thread so I'm trying to stick to the subject of 
>> >>> priority definitions.
>> >>>
>> >>> On Mon, Feb 11, 2019 at 9:17 AM Scott Wegner  wrote:
>> 
>>  Thanks for driving this discussion. I also was not aware of thes

Re: contributor permission for Beam Jira tickets

2019-10-25 Thread Kenneth Knowles
Assuming your Jira id is pbhosale87, I have added you to the Contributors role,
so you can be assigned a Jira.

Kenn

On Fri, Oct 25, 2019 at 1:22 PM Pradeep Bhosale <
bhosale.pradeep1...@gmail.com> wrote:

> Hi,
>
>  This is Pradeep from Conde Nast.
>  I'm currently working on Apache beam and would like to suggest/fix few
> issues for AWS IOs.
> Can someone add me as a contributor for Beam's Jira issue tracker?
> I would like to create/assign tickets for my work.
>
>  Thanks,
> Pradeep
>
>
>


contributor permission for Beam Jira tickets

2019-10-25 Thread Pradeep Bhosale
Hi,

 This is Pradeep from Conde Nast.
 I'm currently working on Apache Beam and would like to suggest and fix a few
issues for the AWS IOs.
Can someone add me as a contributor for Beam's Jira issue tracker?
I would like to create/assign tickets for my work.

 Thanks,
Pradeep


Re: DynamoDBIO related issue

2019-10-25 Thread Pradeep Bhosale



On 2019/10/25 16:04:53, Luke Cwik  wrote: 
> If you create a JIRA account and share your user id with us, we will grant
> you contributor access which will allow you to create a JIRA issue.
> 
> Please take a look at the our contribution guide, it mentions how to
> connect with the Beam community including creating a JIRA account[1].
> 
> 1: https://beam.apache.org/contribute/#connect-with-the-beam-community
> 
> On Thu, Oct 24, 2019 at 7:15 PM Cam Mach  wrote:
> 
> > Hi Pradeep,
> >
> > What is your enhancement? Can you create a ticket and describe it?
> >
> > Thanks,
> > Cam
> >
> >
> >
> > On Thu, Oct 24, 2019 at 1:58 PM Pradeep Bhosale <
> > bhosale.pradeep1...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> This is Pradeep. I am using DynamoDB IO to write data to dynamo DB.
> >> I would like to report one enhancement.
> >>
> >> Please let me know how can I achieve that.
> >> I don't have *create issue* access on beam JIRA.
> >>
> >> https://issues.apache.org/jira/projects/BEAM/issues/BEAM-8368?filter=allopenissues
> >>
> >>
> Here is my userid: pbhosale87


Re: [DISCUSS] How to stopp SdkWorker in SdkHarness

2019-10-25 Thread Robert Bradshaw
I think we'll still need approach (2) for when the pipeline finishes
and a runner is tearing down workers.

On Fri, Oct 25, 2019 at 10:36 AM Maximilian Michels  wrote:
>
> Hi Jincheng,
>
> Thanks for bringing this up and capturing the ideas in the doc.
>
> Intuitively, I would have also considered adding a new Proto message for
> the teardown, but I think the idea to trigger this logic when the SDK
> Harness evicts process bundle descriptors is more elegant.
>
> Thanks,
> Max
>
> On 25.10.19 17:23, Luke Cwik wrote:
> > I like approach 3 since it doesn't add additional complexity to the API
> > and individual SDKs can choose to implement any clean-up strategy they
> > want or none at all which is the simplest.
> >
> > On Thu, Oct 24, 2019 at 8:46 PM jincheng sun  > > wrote:
> >
> > Hi,
> >
> > Thanks for your comments in doc, I have add Approach 3 which you
> > mentioned! @Luke
> >
> > For now, we should do a decision for Approach 3 and Approach 1.
> > Detail can be found in doc [1]
> >
> > Welcome anyone's feedback :)
> >
> > Regards,
> > Jincheng
> >
> > [1]
> > 
> > https://docs.google.com/document/d/1sCgy9VQPf9zVXKRquK8P6N4x7aB62GEO8ozkujRSHZg/edit?usp=sharing
> >
> > jincheng sun wrote on Fri, Oct 25, 2019 at 10:40 AM:
> >
> > Hi,
> >
> > Functionally capable of `abort`, but it will be called at the
> > end of operator. So, I prefer `dispose` semantics. i.e., all
> > normal logic has been executed.
> >
> > Best,
> > Jincheng
> >
> > Harsh Vardhan mailto:anan...@google.com>>
> > wrote on Wed, Oct 23, 2019 at 12:14 AM:
> >
> > Would approach 1 be akin to abort semantics?
> >
> > On Mon, Oct 21, 2019 at 8:01 PM jincheng sun
> > mailto:sunjincheng...@gmail.com>>
> > wrote:
> >
> > Hi Luke,
> >
> > Thanks a lot for your reply. Since it allows to share
> > one SDK harness between multiple executable stages, the
> > control service termination may occur much later than
> > the completion of an executable stage. This is the main
> > reason I prefer runners to control the teardown of DoFns.
> >
> > Regarding to "SDK harnesses can terminate instances any
> > time they want and start new instances anytime as
> > well.", personally I think it's not conflict with the
> > proposed Approach 1 as the SDK harness could decide what
> > to do when receiving the teardown request. It could do
> > nothing if the DoFns has already been teared down and
> > could also tear down the DoFns if needed.
> >
> > What do you think?
> >
> > Best,
> > Jincheng
> >
> > Luke Cwik mailto:lc...@google.com>>
> > wrote on Tue, Oct 22, 2019 at 2:05 AM:
> >
> > Approach 2 is currently the suggested approach[1]
> > for DoFn's to shutdown.
> > Note that SDK harnesses can terminate instances any
> > time they want and start new instances anytime as well.
> >
> > Why do you want to expose this logic so that Runners
> > could control it?
> >
> > 1:
> > 
> > https://docs.google.com/document/d/1n6s3BOxOPct3uF4UgbbI9O9rpdiKWFH9R6mtVmR7xp0/edit#
> >
> > On Mon, Oct 21, 2019 at 4:27 AM jincheng sun
> >  > > wrote:
> >
> > Hi,
> > I found that in `SdkHarness` do not  stop the
> > `SdkWorker` when finish.  We should add the
> > logic for stop the `SdkWorker` in `SdkHarness`.
> > More detail can be found [1].
> >
> > There are two approaches to solve this issue:
> >
> > Approach 1:  We can add a Fn API for teardown
> > purpose and the runner will teardown a specific
> > bundle descriptor via this teardown Fn API
> > during disposing.
> > Approach 2: The control service termination
> > could be seen as a signal and once SDK harness
> > receives this signal, the teardown of the bundle
> > descriptor will be performed.
> >
> > More detail can be found in [2].
> >
> > As the Approach 2, SDK harness could be shared
> > between multiple executable stages. The control
> > service termination only occurs when all the
> >

Re: Python Precommit duration pushing 2 hours

2019-10-25 Thread Robert Bradshaw
It looks like fn_api_runner_test.py is quite expensive, taking 10-15+
minutes on each version of Python. This test consists of a base class
that is basically a validates runner suite, and is then run in several
configurations, many more of which (including some expensive ones)
have been added lately.

class FnApiRunnerTest(unittest.TestCase):
class FnApiRunnerTestWithGrpc(FnApiRunnerTest):
class FnApiRunnerTestWithGrpcMultiThreaded(FnApiRunnerTest):
class FnApiRunnerTestWithDisabledCaching(FnApiRunnerTest):
class FnApiRunnerTestWithMultiWorkers(FnApiRunnerTest):
class FnApiRunnerTestWithGrpcAndMultiWorkers(FnApiRunnerTest):
class FnApiRunnerTestWithBundleRepeat(FnApiRunnerTest):
class FnApiRunnerTestWithBundleRepeatAndMultiWorkers(FnApiRunnerTest):

I'm not convinced we need to run all of these permutations, or at
least not all tests in all permutations.
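
(Editor's illustration, not part of the original message: one way to thin the
permutations out, assuming the suite names above. Expensive subclasses are gated
behind an opt-in environment variable, and individual slow tests can be skipped
inside a heavy permutation. The RUN_EXPENSIVE_FN_API flag is made up for the sketch.)

import os
import unittest

# Hypothetical opt-in flag: expensive permutations only run when a dedicated
# job sets RUN_EXPENSIVE_FN_API=1.
RUN_EXPENSIVE = os.environ.get('RUN_EXPENSIVE_FN_API') == '1'

class FnApiRunnerTest(unittest.TestCase):
  def test_create(self):
    pass  # stands in for the real validates-runner style tests

@unittest.skipUnless(RUN_EXPENSIVE,
                     'expensive permutation; set RUN_EXPENSIVE_FN_API=1')
class FnApiRunnerTestWithGrpcMultiThreaded(FnApiRunnerTest):
  pass

class FnApiRunnerTestWithBundleRepeat(FnApiRunnerTest):
  def test_create(self):
    # Alternatively, skip only the slow tests in a given permutation.
    raise unittest.SkipTest('covered by the base permutation')

if __name__ == '__main__':
  unittest.main()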

On Fri, Oct 25, 2019 at 10:57 AM Valentyn Tymofieiev
 wrote:
>
> I took another look at this and precommit ITs are already running in 
> parallel, albeit in the same suite. However it appears Python precommits 
> became slower, especially Python 2 precommits [35 min per suite x 3 suites], 
> see [1]. Not sure yet what caused the increase, but precommits used to be 
> faster. Perhaps we have added a slow test or a lot of new tests.
>
> [1] https://scans.gradle.com/s/jvcw5fpqfc64k/timeline?task=ancsbov425524
>
> On Thu, Oct 24, 2019 at 4:53 PM Ahmet Altay  wrote:
>>
>> Ack. Separating precommit ITs to a different suite sounds good. Anyone is 
>> interested in doing that?
>>
>> On Thu, Oct 24, 2019 at 2:41 PM Valentyn Tymofieiev  
>> wrote:
>>>
>>> This should not increase the queue time substantially, since precommit ITs 
>>> are running sequentially with precommit tests, unlike multiple precommit 
>>> tests which run in parallel to each other.
>>>
>>> The precommit ITs we run are batch and streaming wordcount tests on Py2 and 
>>> one Py3 version, so it's not a lot of tests.
>>>
>>> On Thu, Oct 24, 2019 at 1:07 PM Ahmet Altay  wrote:

 +1 to separating ITs from precommit. Downside would be, when Chad tried to 
 do something similar [1] it was noted that the total time to run all 
 precommit tests would increase and also potentially increase the queue 
 time.

 Another alternative, we could run a smaller set of IT tests in precommits 
 and run the whole suite as part of post commit tests.

 [1] https://github.com/apache/beam/pull/9642

 On Thu, Oct 24, 2019 at 12:15 PM Valentyn Tymofieiev  
 wrote:
>
> One improvement could be move to Precommit IT tests into a separate suite 
> from precommit tests, and run it in parallel.
>
> On Thu, Oct 24, 2019 at 11:41 AM Brian Hulette  
> wrote:
>>
>> Python Precommits are taking quite a while now [1]. Just visually it 
>> looks like the average length is 1.5h or so, but it spikes up to 2h. 
>> I've had several precommit runs get aborted due to the 2 hour limit.
>>
>> It looks like there was a spike up above 1h back on 9/6 and the duration 
>> has been steadily rising since then. Is there anything we can do about 
>> this?
>>
>> Brian
>>
>> [1] 
>> http://104.154.241.245/d/_TNndF2iz/pre-commit-test-latency?orgId=1&from=now-90d&to=now&fullscreen&panelId=4


Re: Interactive Beam Example Failing [BEAM-8451]

2019-10-25 Thread David Yan
+Sam Rohde  is working on streaming support for
Interactive Beam.

The high level idea is to capture a bounded segment of the unbounded data
source for replayablity and determinism, and to use TestStream, which has
the ability to control the clock of the pipeline, to replay the data, so
streaming semantics that depend on processing time (e.g. processing time
triggers) will be deterministic.

David

On Thu, Oct 24, 2019 at 4:39 PM Robert Bradshaw  wrote:

> Yes, there are plans to support streaming for interactive beam. David
> Yan (cc'd) is leading this effort.
>
> On Thu, Oct 24, 2019 at 1:50 PM Harsh Vardhan  wrote:
> >
> > Thanks, +1 to adding support for streaming on Interactive Beam (+David
> Yan)
> >
> >
> > On Thu, Oct 24, 2019 at 1:45 PM Hai Lu  wrote:
> >>
> >> Hi Robert,
> >>
> >> We're trying out iBeam at LinkedIn for Python. As Igor mentioned, there
> seems to be some inconsistency in the behavior of interactive beam. We can
> suggest some fixes from our end but we would need some support from the
> community.
> >>
> >> Also, is there a plan to support iBeam for streaming mode? We're
> interested in that use case as well.
> >>
> >> Thanks,
> >> Hai
> >>
> >> On Mon, Oct 21, 2019 at 4:45 PM Robert Bradshaw 
> wrote:
> >>>
> >>> Thanks for trying this out. Yes, this is definitely something that
> >>> should be supported (and tested).
> >>>
> >>> On Mon, Oct 21, 2019 at 3:40 PM Igor Durovic 
> wrote:
> >>> >
> >>> > Hi everyone,
> >>> >
> >>> > The interactive beam example using the DirectRunner fails after
> execution of the last cell. The recursion limit is exceeded during the
> calculation of the cache label because of a circular reference in the
> PipelineInfo object.
> >>> >
> >>> > The constructor for the PipelineInfo class creates a mapping from
> each pcollection to the transforms that produce and consume it. The issue
> arises when there exists a transform that is both a producer and a consumer
> for the same pcollection. This occurs when a transform's expand method
> returns the same pcoll object that's passed into it. The specific transform
> causing the failure of the example is MaybeReshuffle, which is used in the
> Create transform. Replacing "return pcoll" with "return pcoll | Map(lambda
> x: x)" seems to fix the problem.
> >>> >
> >>> > A workaround for this issue on the interactive beam side would be
> fairly simple, but it seems to me that there should be more validation of
> pipelines to prevent the use of transforms that return the same pcoll
> that's passed in, or at least a mention of this in the transform style
> guide. My understanding is that pcollections are produced by a single
> transform (they even have a field called "producer" that references only
> one transform). If that's the case then that property of pcollections
> should be enforced.
> >>> >
> >>> > I made ticket BEAM-8451 to track this issue.
> >>> >
> >>> > I'm still new to beam so I apologize if I'm fundamentally
> misunderstanding something. I'm not exactly sure what the next step should
> be and would appreciate some recommendations. I can submit a PR to solve
> the immediate problem of the failing example but the underlying problem
> should also be addressed at some point. I also apologize if people are
> already aware of this problem.
> >>> >
> >>> > Thank You!
> >>> > Igor Durovic
>
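
(Editor's illustration of the pattern Igor describes above. The two composite
transforms are made up for the sketch and are not Beam code.)

import apache_beam as beam

class ProblematicPassThrough(beam.PTransform):
  # Problematic shape: expand() returns the very PCollection it was given,
  # so the same PCollection ends up with this composite as both its
  # producer and one of its consumers.
  def expand(self, pcoll):
    return pcoll

class FixedPassThrough(beam.PTransform):
  # Workaround mentioned above: insert a pass-through Map so the composite
  # produces a distinct output PCollection.
  def expand(self, pcoll):
    return pcoll | beam.Map(lambda x: x)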


Re: JIRA priorities explaination

2019-10-25 Thread Robert Bradshaw
We cut a release every 6 weeks, according to schedule, making it easy
to plan for, and the release manager typically sends out a warning
email to remind everyone. I don't think it makes sense to do that for
every ticket. Blockers should be reserved for things we really
shouldn't release without.

On Fri, Oct 25, 2019 at 11:33 AM Pablo Estrada  wrote:
>
> I mentioned on the PR that I had been using the 'blocker' priority along with 
> the 'fix version' field to mark issues that I want to get in the release.
> Of course, this little practice of mine only matters much around release 
> branch cutting time - and has been useful for me to track which things I want 
> to ensure getting into the release / bump to the next /etc.
> I've also found it to be useful as a way to communicate with the release 
> manager without having to sync directly.
>
> What would be a reasonable way to tell the release manager "I'd like to get 
> this feature in. please talk to me if you're about to cut the branch" - that 
> also uses the priorities appropriately? - and that allows the release manager 
> to know when a fix version is "more optional" / "less optional"?
>
> On Wed, Oct 23, 2019 at 12:20 PM Kenneth Knowles  wrote:
>>
>> I finally got around to writing some of this up. It is minimal. Feedback is 
>> welcome, especially if what I have written does not accurately represent the 
>> community's approach.
>>
>> https://github.com/apache/beam/pull/9862
>>
>> Kenn
>>
>> On Mon, Feb 11, 2019 at 3:21 PM Daniel Oliveira  
>> wrote:
>>>
>>> Ah, sorry, I missed that Alex was just quoting from our Jira installation 
>>> (didn't read his email closely enough). Also I wasn't aware about those 
>>> pages on our website.
>>>
>>> Seeing as we do have definitions for our priorities, I guess my main 
>>> request would be that they be made more discoverable somehow. I don't think 
>>> the tooltips are reliable, and the pages on the website are informative, 
>>> but hard to find. Since it feels a bit lazy to say "this isn't discoverable 
>>> enough" without suggesting any improvements, I'd like to propose these two 
>>> changes:
>>>
>>> 1. We should write a Beam Jira Guide with basic information about our Jira. 
>>> I think the bug priorities should go in here, but also anything else we 
>>> would want someone to know before filing any Jira issues, like how our 
>>> components are organized or what the different issue types mean. This guide 
>>> could either be written in the website or the wiki, but I think it should 
>>> definitely be linked in https://beam.apache.org/contribute/ so that 
>>> newcomers read it before getting their Jira account approved. The goal here 
>>> being to have a reference for the basics of our Jira since at the moment it 
>>> doesn't seem like we have anything for this.
>>>
>>> 2. The existing info on Post-commit and pre-commit policies doesn't seem 
>>> very discoverable to someone monitoring the Pre/Post-commits. I've reported 
>>> a handful of test-failures already and haven't seen this link mentioned 
>>> much. We should try to find a way to funnel people towards this link when 
>>> there's an issue, the same way we try to funnel people towards the 
>>> contribution guide when they write a PR. As a note, while writing this 
>>> email I remembered this link that someone gave me before 
>>> (https://s.apache.org/beam-test-failure). That mentions the Post-commit 
>>> policies page, so maybe it's just a matter of pasting that all over our 
>>> Jenkins builds whenever we have a failing test?
>>>
>>> PS: I'm also definitely for SLOs, but I figure it's probably better 
>>> discussed in a separate thread so I'm trying to stick to the subject of 
>>> priority definitions.
>>>
>>> On Mon, Feb 11, 2019 at 9:17 AM Scott Wegner  wrote:

 Thanks for driving this discussion. I also was not aware of these existing 
 definitions. Once we agree on the terms, let's add them to our Contributor 
 Guide and start using them.

 +1 in general; I like both Alex and Kenn's definitions; Additional 
 wordsmithing could be moved to a Pull Request. Can we make the definitions 
 useful for both the person filing a bug, and the assignee, i.e.

 <priority>: <what it means to the person filing>. <what it means to the person assigned>.
 

 On Sun, Feb 10, 2019 at 7:49 PM Kenneth Knowles  wrote:
>
> The content that Alex posted* is the definition from our Jira 
> installation anyhow.
>
> I just searched around, and there's 
> https://community.atlassian.com/t5/Jira-questions/According-to-Jira-What-is-Blocker-Critical-Major-Minor-and/qaq-p/668774
>  which makes clear that this is really user-defined, since Jira has many 
> deployments with their own configs.
>
> I guess what I want to know about this thread is what action is being 
> proposed?
>
> Previously, there was a thread that resulted in 
> https://beam.apache.org/contribute/precommit-policies/ and 
> https://beam.apache.org/contribute/postcommits-policies/. These

Re: Intermittent No FileSystem found exception

2019-10-25 Thread Koprivica,Preston Blake
I misspoke on the temporary workaround. You should use the #withIgnoreWindowing()
option on FileIO instead.

On 10/25/19, 11:33 AM, "Maximilian Michels"  wrote:

Hi Maulik,

Thanks for reporting. As Preston already pointed out, this is fixed in
the upcoming 2.17.0 release.

Thanks,
Max

On 24.10.19 15:20, Koprivica,Preston Blake wrote:
> Hi Maulik,
>
> I believe you may be witnessing this issue:
> 
https://issues.apache.org/jira/browse/BEAM-8303.
  We ran into this using
> beam-2.15.0 on flink-1.8 over S3.  It looks like it’ll be fixed in 2.17.0.
>
> As a temporary workaround, you can set the #withNoSpilling() option if
> you’re using the FileIO api.  If not, it should be relatively easy to
> move to it.
>
> *From: *Maulik Soneji 
> *Reply-To: *"dev@beam.apache.org" 
> *Date: *Thursday, October 24, 2019 at 7:05 AM
> *To: *"dev@beam.apache.org" 
> *Subject: *Intermittent No FileSystem found exception
>
> Hi everyone,
>
> We are running a Batch job on flink that reads data from GCS and does
> some aggregation on this data.
> We are intermittently getting issue: `No filesystem found for scheme gs`
>
> We are running Beam version 2.15.0 with FlinkRunner, Flink version: 1.6.4
>
> On remote debugging the task managers, we found that in a few task
> managers, the *GcsFileSystemRegistrar is not added to the list of
> FileSystem Schemes*. In these task managers, we get this issue.
>
> The collection *SCHEME_TO_FILESYSTEM* is getting modified only in
> *setDefaultPipelineOptions* function call in
> org.apache.beam.sdk.io.FileSystems class and this function is not
> getting called and thus the GcsFileSystemRegistrar is not added to
> *SCHEME_TO_FILESYSTEM*.
>
> *Detailed stacktrace:*
>
>
> |java.lang.IllegalArgumentException: No filesystem found for scheme gs|
>
> | at
> 
org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:463)|
>
> | at
> org.apache.beam.sdk.io.FileSystems.matchNewResource(FileSystems.java:533)|
>
> | at
> org.apache.beam.sdk.io.fs.ResourceIdCoder.decode(ResourceIdCoder.java:49)|
>
> | at
> 
org.apache.beam.sdk.io.fs.MetadataCoder.decodeBuilder(MetadataCoder.java:62)|
>
> | at
> org.apache.beam.sdk.io.fs.MetadataCoder.decode(MetadataCoder.java:58)|
>
> | at
> org.apache.beam.sdk.io.fs.MetadataCoder.decode(MetadataCoder.java:36)|
>
> | at org.apache.beam.sdk.coders.Coder.decode(Coder.java:159)|
>
> | at org.apache.beam.sdk.coders.KvCoder.decode(KvCoder.java:82)|
>
> | at org.apache.beam.sdk.coders.KvCoder.decode(KvCoder.java:36)|
>
> | at
> 
org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.decode(WindowedValue.java:592)|
>
> | at
> 
org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.decode(WindowedValue.java:583)|
>
> | at
> 
org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.decode(WindowedValue.java:529)|
>
> | at
> 
org.apache.beam.runners.flink.translation.types.CoderTypeSerializer.deserialize(CoderTypeSerializer.java:92)|
>
> | at
> 
org.apache.flink.runtime.plugable.NonReusingDeserializationDelegate.read(NonReusingDeserializationDelegate.java:55)|
>
> | at
> 
org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.getNextRecord(SpillingAdaptiveSpanningRecordDeserializer.java:106)|
>
> | at
> 
org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:72)|
>
> | at
> 
org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:47)|
>
> | at
> 
org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:73)|
>
> | at
> org.apache.flink.runtime.operators.NoOpDriver.run(NoOpDriver.java:94)|
>
> | at
> org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:503)|
>
> | at
> org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:368)|
>
> | at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)|
>
> | at java.lang.Thread.run(Thread.java:748)|
>
> Inorder to resolve this issue, we tried calling the following in
> PTransform's expand function:
>
> 
FileSystems.setDefaultPipelineOptions

Re: JIRA priorities explaination

2019-10-25 Thread Pablo Estrada
I mentioned on the PR that I had been using the 'blocker' priority along
with the 'fix version' field to mark issues that I want to get in the
release.
Of course, this little practice of mine only matters much around release
branch cutting time - and has been useful for me to track which things I
want to ensure getting into the release / bump to the next /etc.
I've also found it to be useful as a way to communicate with the release
manager without having to sync directly.

What would be a reasonable way to tell the release manager "I'd like to get
this feature in. please talk to me if you're about to cut the branch" -
that also uses the priorities appropriately? - and that allows the release
manager to know when a fix version is "more optional" / "less optional"?

On Wed, Oct 23, 2019 at 12:20 PM Kenneth Knowles  wrote:

> I finally got around to writing some of this up. It is minimal. Feedback
> is welcome, especially if what I have written does not accurately represent
> the community's approach.
>
> https://github.com/apache/beam/pull/9862
>
> Kenn
>
> On Mon, Feb 11, 2019 at 3:21 PM Daniel Oliveira 
> wrote:
>
>> Ah, sorry, I missed that Alex was just quoting from our Jira installation
>> (didn't read his email closely enough). Also I wasn't aware about those
>> pages on our website.
>>
>> Seeing as we do have definitions for our priorities, I guess my main
>> request would be that they be made more discoverable somehow. I don't think
>> the tooltips are reliable, and the pages on the website are informative,
>> but hard to find. Since it feels a bit lazy to say "this isn't discoverable
>> enough" without suggesting any improvements, I'd like to propose these two
>> changes:
>>
>> 1. We should write a Beam Jira Guide with basic information about our
>> Jira. I think the bug priorities should go in here, but also anything else
>> we would want someone to know before filing any Jira issues, like how our
>> components are organized or what the different issue types mean. This guide
>> could either be written in the website or the wiki, but I think it should
>> definitely be linked in https://beam.apache.org/contribute/ so that
>> newcomers read it before getting their Jira account approved. The goal here
>> being to have a reference for the basics of our Jira since at the moment it
>> doesn't seem like we have anything for this.
>>
>> 2. The existing info on Post-commit and pre-commit policies doesn't seem
>> very discoverable to someone monitoring the Pre/Post-commits. I've reported
>> a handful of test-failures already and haven't seen this link mentioned
>> much. We should try to find a way to funnel people towards this link when
>> there's an issue, the same way we try to funnel people towards the
>> contribution guide when they write a PR. As a note, while writing this
>> email I remembered this link that someone gave me before (
>> https://s.apache.org/beam-test-failure
>> ).
>> That mentions the Post-commit policies page, so maybe it's just a matter of
>> pasting that all over our Jenkins builds whenever we have a failing test?
>>
>> PS: I'm also definitely for SLOs, but I figure it's probably better
>> discussed in a separate thread so I'm trying to stick to the subject of
>> priority definitions.
>>
>> On Mon, Feb 11, 2019 at 9:17 AM Scott Wegner  wrote:
>>
>>> Thanks for driving this discussion. I also was not aware of these
>>> existing definitions. Once we agree on the terms, let's add them to our
>>> Contributor Guide and start using them.
>>>
>>> +1 in general; I like both Alex and Kenn's definitions; Additional
>>> wordsmithing could be moved to a Pull Request. Can we make the definitions
>>> useful for both the person filing a bug, and the assignee, i.e.
>>>
>>> <priority>: <what it means to the person filing>. <what it means to the person assigned>.
>>>
>>> On Sun, Feb 10, 2019 at 7:49 PM Kenneth Knowles  wrote:
>>>
 The content that Alex posted* is the definition from our Jira
 installation anyhow.

 I just searched around, and there's
 https://community.atlassian.com/t5/Jira-questions/According-to-Jira-What-is-Blocker-Critical-Major-Minor-and/qaq-p/668774
 which makes clear that this is really user-defined, since Jira has many
 deployments with their own configs.

 I guess what I want to know about this thread is what action is being
 proposed?

 Previously, there was a thread that resulted in
 https://beam.apache.org/contribute/precommit-policies/ and
 https://beam.apache.org/contribute/postcommits-policies/. These have
 test failures and flakes as Critical. I agree with Alex that these should
 be Blocker. They disrupt the work of the entire community, so we need to
 drop everything and get green again.

 Other than that, I think what you - Daniel - are suggesting is that the
 definition might be best expressed as SLOs. I asked on
 u...@infra.apache.org about how 

Re: [DISCUSS] New Beam pipeline diagrams

2019-10-25 Thread Kenneth Knowles
These are really clean and clear. Nice!

On Thu, Oct 24, 2019 at 9:05 AM Cyrus Maden  wrote:

> Hi all,
>
> Thank you to everyone who reached out to me and/or commented on my
> proposal for new Beam pipeline diagrams[1]. I've compiled all of your
> suggestions and updated the design accordingly. *Here's a new Google Doc
> with sample diagrams[2].* I plan to draft a PR soon that replaces all of
> the diagrams in our docs with new ones. In the meantime, please feel free
> to leave comments and suggestions on the new doc!
>
> Best,
> Cyrus
>
> [1]
> https://docs.google.com/document/d/1khf9Bx4XJWsKUD6J1eDcYo_8dL9LBoHDtJpyDjDzOMM/edit?usp=sharing
> [2]
> https://docs.google.com/document/d/1MvL64o1QmJdzZcPFtkTFFlFM3_HrsZFjshxd1KWEBfg/edit?usp=sharing
>


Re: Python Precommit duration pushing 2 hours

2019-10-25 Thread Valentyn Tymofieiev
I took another look at this and precommit ITs are already running in
parallel, albeit in the same suite. However it appears Python precommits
became slower, especially Python 2 precommits [35 min per suite x 3
suites], see [1]. Not sure yet what caused the increase, but precommits
used to be faster. Perhaps we have added a slow test or a lot of new tests.

[1] https://scans.gradle.com/s/jvcw5fpqfc64k/timeline?task=ancsbov425524

On Thu, Oct 24, 2019 at 4:53 PM Ahmet Altay  wrote:

> Ack. Separating precommit ITs to a different suite sounds good. Anyone is
> interested in doing that?
>
> On Thu, Oct 24, 2019 at 2:41 PM Valentyn Tymofieiev 
> wrote:
>
>> This should not increase the queue time substantially, since precommit
>> ITs are running *sequentially* with precommit tests, unlike multiple
>> precommit tests which run in parallel to each other.
>>
>> The precommit ITs we run are batch and streaming wordcount tests on Py2
>> and one Py3 version, so it's not a lot of tests.
>>
>> On Thu, Oct 24, 2019 at 1:07 PM Ahmet Altay  wrote:
>>
>>> +1 to separating ITs from precommit. Downside would be, when Chad tried
>>> to do something similar [1] it was noted that the total time to run all
>>> precommit tests would increase and also potentially increase the queue time.
>>>
>>> Another alternative, we could run a smaller set of IT tests in
>>> precommits and run the whole suite as part of post commit tests.
>>>
>>> [1] https://github.com/apache/beam/pull/9642
>>>
>>> On Thu, Oct 24, 2019 at 12:15 PM Valentyn Tymofieiev <
>>> valen...@google.com> wrote:
>>>
 One improvement could be move to Precommit IT tests into a separate
 suite from precommit tests, and run it in parallel.

 On Thu, Oct 24, 2019 at 11:41 AM Brian Hulette 
 wrote:

> Python Precommits are taking quite a while now [1]. Just visually it
> looks like the average length is 1.5h or so, but it spikes up to 2h. I've
> had several precommit runs get aborted due to the 2 hour limit.
>
> It looks like there was a spike up above 1h back on 9/6 and the
> duration has been steadily rising since then. Is there anything we can do
> about this?
>
> Brian
>
> [1]
> http://104.154.241.245/d/_TNndF2iz/pre-commit-test-latency?orgId=1&from=now-90d&to=now&fullscreen&panelId=4
>



Re: [DISCUSS] How to stopp SdkWorker in SdkHarness

2019-10-25 Thread Maximilian Michels

Hi Jincheng,

Thanks for bringing this up and capturing the ideas in the doc.

Intuitively, I would have also considered adding a new Proto message for 
the teardown, but I think the idea to trigger this logic when the SDK 
Harness evicts process bundle descriptors is more elegant.
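
(Editor's sketch only. This is not the actual sdk_worker.py code, and the
processor objects with a teardown() method are a hypothetical stand-in; it just
shows the shape of the eviction-triggered teardown idea.)

import collections

class BundleProcessorCache(object):
  """Sketch: an LRU-style cache of bundle processors keyed by process bundle
  descriptor id. DoFn teardown runs when an entry is evicted, and again at
  shutdown for whatever is still cached."""

  def __init__(self, max_size=32):
    self._max_size = max_size
    self._processors = collections.OrderedDict()

  def get(self, descriptor_id, factory):
    if descriptor_id in self._processors:
      self._processors.move_to_end(descriptor_id)
      return self._processors[descriptor_id]
    processor = factory(descriptor_id)        # runs DoFn setup internally
    self._processors[descriptor_id] = processor
    if len(self._processors) > self._max_size:
      _, evicted = self._processors.popitem(last=False)
      evicted.teardown()                      # teardown on eviction
    return processor

  def shutdown(self):
    while self._processors:
      _, processor = self._processors.popitem(last=False)
      processor.teardown()                    # teardown for the remainder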


Thanks,
Max

On 25.10.19 17:23, Luke Cwik wrote:
I like approach 3 since it doesn't add additional complexity to the API 
and individual SDKs can choose to implement any clean-up strategy they 
want or none at all which is the simplest.


On Thu, Oct 24, 2019 at 8:46 PM jincheng sun > wrote:


Hi,

Thanks for your comments in doc, I have add Approach 3 which you
mentioned! @Luke

For now, we should do a decision for Approach 3 and Approach 1.
Detail can be found in doc [1]

Welcome anyone's feedback :)

Regards,
Jincheng

[1]

https://docs.google.com/document/d/1sCgy9VQPf9zVXKRquK8P6N4x7aB62GEO8ozkujRSHZg/edit?usp=sharing

jincheng sun mailto:sunjincheng...@gmail.com>> wrote on Fri, Oct 25, 2019 at 10:40 AM:

Hi,

Functionally capable of `abort`, but it will be called at the
end of operator. So, I prefer `dispose` semantics. i.e., all
normal logic has been executed.

Best,
Jincheng

Harsh Vardhan mailto:anan...@google.com>>
wrote on Wed, Oct 23, 2019 at 12:14 AM:

Would approach 1 be akin to abort semantics?

On Mon, Oct 21, 2019 at 8:01 PM jincheng sun
mailto:sunjincheng...@gmail.com>>
wrote:

Hi Luke,

Thanks a lot for your reply. Since it allows to share
one SDK harness between multiple executable stages, the
control service termination may occur much later than
the completion of an executable stage. This is the main
reason I prefer runners to control the teardown of DoFns.

Regarding to "SDK harnesses can terminate instances any
time they want and start new instances anytime as
well.", personally I think it's not conflict with the
proposed Approach 1 as the SDK harness could decide what
to do when receiving the teardown request. It could do
nothing if the DoFns has already been teared down and
could also tear down the DoFns if needed.

What do you think?

Best,
Jincheng

Luke Cwik mailto:lc...@google.com>>
wrote on Tue, Oct 22, 2019 at 2:05 AM:

Approach 2 is currently the suggested approach[1]
for DoFn's to shutdown.
Note that SDK harnesses can terminate instances any
time they want and start new instances anytime as well.

Why do you want to expose this logic so that Runners
could control it?

1:

https://docs.google.com/document/d/1n6s3BOxOPct3uF4UgbbI9O9rpdiKWFH9R6mtVmR7xp0/edit#

On Mon, Oct 21, 2019 at 4:27 AM jincheng sun
mailto:sunjincheng...@gmail.com>> wrote:

Hi,
I found that in `SdkHarness` do not  stop the
`SdkWorker` when finish.  We should add the
logic for stop the `SdkWorker` in `SdkHarness`. 
More detail can be found [1].


There are two approaches to solve this issue:

Approach 1:  We can add a Fn API for teardown
purpose and the runner will teardown a specific
bundle descriptor via this teardown Fn API
during disposing.
Approach 2: The control service termination
could be seen as a signal and once SDK harness
receives this signal, the teardown of the bundle
descriptor will be performed.

More detail can be found in [2].

As the Approach 2, SDK harness could be shared
between multiple executable stages. The control
service termination only occurs when all the
executable stages sharing the same SDK harness
finished. This means that the teardown of DoFns
may not be executed immediately after an
executable stage is finished.

So, I prefer Approach 1. Welcome any feedback :)

Best,
Jincheng

[1]

https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/sdk_worker.py
[2]

https:

Re: Multiple Outputs from Expand in Python

2019-10-25 Thread Luke Cwik
I believe PCollectionTuple should be unnecessary since Python has first
class support for tuples as shown in the example below[1]. Can we use
tuples to solve your issue?

wordsStartingWithA = \
p | 'Words starting with A' >> beam.Create(['apple', 'ant', 'arrow'])

wordsStartingWithB = \
p | 'Words starting with B' >> beam.Create(['ball', 'book', 'bow'])

((wordsStartingWithA, wordsStartingWithB)
| beam.Flatten()
| LogElements())

1:
https://github.com/apache/beam/blob/238659bce8043e6a64619a959ab44453dbe22dff/learning/katas/python/Core%20Transforms/Flatten/Flatten/task.py#L29

On Fri, Oct 25, 2019 at 10:11 AM Sam Rohde  wrote:

> Talked to Daniel offline and it looks like the Python SDK is missing
> PCollection Tuples like the one Java has:
> https://github.com/rohdesamuel/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/PCollectionTuple.java
> .
>
> I'll go ahead and implement that for the Python SDK.
>
> On Thu, Oct 24, 2019 at 5:20 PM Sam Rohde  wrote:
>
>> Hey All,
>>
>> I'm trying to implement an expand override with multiple output
>> PCollections. The kicker is that I want to insert a new transform for each
>> output PCollection. How can I do this?
>>
>> Regards,
>> Sam
>>
>


Re: Multiple Outputs from Expand in Python

2019-10-25 Thread Sam Rohde
Talked to Daniel offline and it looks like the Python SDK is missing
PCollection Tuples like the one Java has:
https://github.com/rohdesamuel/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/PCollectionTuple.java
.

I'll go ahead and implement that for the Python SDK.
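
(Editor's sketch with made-up transform and label names: returning a plain tuple,
or a dict, from expand() is already enough for a composite to hand back multiple
PCollections that callers can keep transforming.)

import apache_beam as beam

class SplitWords(beam.PTransform):
  # Illustrative composite that emits two PCollections from expand().
  def expand(self, pcoll):
    starts_with_a = pcoll | 'OnlyA' >> beam.Filter(lambda w: w.startswith('a'))
    everything_else = pcoll | 'NotA' >> beam.Filter(
        lambda w: not w.startswith('a'))
    # A tuple (or a dict of tag -> PCollection) is a valid expand() result.
    return starts_with_a, everything_else

with beam.Pipeline() as p:
  words = p | beam.Create(['apple', 'ant', 'ball', 'book'])
  a_words, other_words = words | SplitWords()
  _ = a_words | 'PrintA' >> beam.Map(print)
  _ = other_words | 'PrintOther' >> beam.Map(print)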

On Thu, Oct 24, 2019 at 5:20 PM Sam Rohde  wrote:

> Hey All,
>
> I'm trying to implement an expand override with multiple output
> PCollections. The kicker is that I want to insert a new transform for each
> output PCollection. How can I do this?
>
> Regards,
> Sam
>


Re: Apache Pulsar connector for Beam

2019-10-25 Thread Maximilian Michels
It would be great to have a Pulsar connector. We might want to ask the 
folks from StreamNative (in CC). Any plans? :)


Cheers,
Max

On 24.10.19 18:31, Pablo Estrada wrote:
There's a JIRA issue to track this: 
https://issues.apache.org/jira/browse/BEAM-8218


Alex was kind enough to file it. +Alex Van Boxel 
 : )

Best
-P

On Thu, Oct 24, 2019 at 12:01 AM Taher Koitawala > wrote:


Hi Reza,
              Thanks for your reply. However i do not see Pulsar
listed in there. Should we file a jira?

On Thu, Oct 24, 2019, 12:16 PM Reza Rokni mailto:r...@google.com>> wrote:

Hi Taher,

You can see the list of current and wip IO's here:

https://beam.apache.org/documentation/io/built-in/

Cheers

Reza

On Thu, 24 Oct 2019 at 13:56, Taher Koitawala
mailto:taher...@gmail.com>> wrote:

Hi All,
          Been wanting to know if we have a Pulsar connector
for Beam. Pulsar is another messaging queue like Kafka and I
would like to build a streaming pipeline with Pulsar. Any
help would be appreciated..


Regards,
Taher Koitawala



-- 


This email may be confidential and privileged. If you received
this communication by mistake, please don't forward it to anyone
else, please erase all copies and attachments, and please let me
know that it has gone to the wrong person.

The above terms reflect a potential business arrangement, are
provided solely as a basis for further discussion, and are not
intended to be and do not constitute a legally binding
obligation. No legally binding obligations will be created,
implied, or inferred until an agreement in final form is
executed in writing by all parties involved.



Re: Intermittent No FileSystem found exception

2019-10-25 Thread Maximilian Michels

Hi Maulik,

Thanks for reporting. As Preston already pointed out, this is fixed in 
the upcoming 2.17.0 release.


Thanks,
Max

On 24.10.19 15:20, Koprivica,Preston Blake wrote:

Hi Maulik,

I believe you may be witnessing this issue: 
https://issues.apache.org/jira/browse/BEAM-8303.  We ran into this using 
beam-2.15.0 on flink-1.8 over S3.  It looks like it’ll be fixed in 2.17.0.


As a temporary workaround, you can set the #withNoSpilling() option if 
you’re using the FileIO api.  If not, it should be relatively easy to 
move to it.


*From: *Maulik Soneji 
*Reply-To: *"dev@beam.apache.org" 
*Date: *Thursday, October 24, 2019 at 7:05 AM
*To: *"dev@beam.apache.org" 
*Subject: *Intermittent No FileSystem found exception

Hi everyone,

We are running a Batch job on flink that reads data from GCS and does 
some aggregation on this data.

We are intermittently getting issue: `No filesystem found for scheme gs`

We are running Beam version 2.15.0 with FlinkRunner, Flink version: 1.6.4

On remote debugging the task managers, we found that in a few task 
managers, the *GcsFileSystemRegistrar is not added to the list of 
FileSystem Schemes*. In these task managers, we get this issue.


The collection *SCHEME_TO_FILESYSTEM* is getting modified only in 
*setDefaultPipelineOptions* function call in 
org.apache.beam.sdk.io.FileSystems class and this function is not 
getting called and thus the GcsFileSystemRegistrar is not added to 
*SCHEME_TO_FILESYSTEM*.


*Detailed stacktrace:*


|java.lang.IllegalArgumentException: No filesystem found for scheme gs|

| at 
org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:463)|


| at 
org.apache.beam.sdk.io.FileSystems.matchNewResource(FileSystems.java:533)|


| at 
org.apache.beam.sdk.io.fs.ResourceIdCoder.decode(ResourceIdCoder.java:49)|


| at 
org.apache.beam.sdk.io.fs.MetadataCoder.decodeBuilder(MetadataCoder.java:62)|


| at 
org.apache.beam.sdk.io.fs.MetadataCoder.decode(MetadataCoder.java:58)|


| at 
org.apache.beam.sdk.io.fs.MetadataCoder.decode(MetadataCoder.java:36)|


| at org.apache.beam.sdk.coders.Coder.decode(Coder.java:159)|

| at org.apache.beam.sdk.coders.KvCoder.decode(KvCoder.java:82)|

| at org.apache.beam.sdk.coders.KvCoder.decode(KvCoder.java:36)|

| at 
org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.decode(WindowedValue.java:592)|


| at 
org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.decode(WindowedValue.java:583)|


| at 
org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.decode(WindowedValue.java:529)|


| at 
org.apache.beam.runners.flink.translation.types.CoderTypeSerializer.deserialize(CoderTypeSerializer.java:92)|


| at 
org.apache.flink.runtime.plugable.NonReusingDeserializationDelegate.read(NonReusingDeserializationDelegate.java:55)|


| at 
org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.getNextRecord(SpillingAdaptiveSpanningRecordDeserializer.java:106)|


| at 
org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:72)|


| at 
org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:47)|


| at 
org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:73)|


| at 
org.apache.flink.runtime.operators.NoOpDriver.run(NoOpDriver.java:94)|


| at 
org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:503)|


| at 
org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:368)|


| at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)|

| at java.lang.Thread.run(Thread.java:748)|

Inorder to resolve this issue, we tried calling the following in 
PTransform's expand function:


FileSystems.setDefaultPipelineOptions(PipelineOptionsFactory.create());

This function call is to make sure that the GcsFileSystemRegistrar is 
added to the list, but this hasn't solved the issue.


Can someone please help in checking why this might be happening and what 
can be done to resolve this issue.


Thanks and Regards,
Maulik

CONFIDENTIALITY NOTICE This message and any included attachments are 
from Cerner Corporation and are intended only for the addressee. The 
information contained in this message is confidential and may constitute 
inside or non-public information under international, federal, or state 
securities laws. Unauthorized forwarding, printing, copying, 
distribution, or use of such information is strictly prohibited and may 
be unlawful. If you are not the addressee, please promptly delete this 
message and notify the sender of the delivery error by e-mail or you may 
call Cerner's corporate offices in Kansas City, Missouri, U.S.A at (+1) 
(816)221-1024.




Re: DynamoDBIO related issue

2019-10-25 Thread Luke Cwik
If you create a JIRA account and share your user id with us, we will grant
you contributor access which will allow you to create a JIRA issue.

Please take a look at our contribution guide; it mentions how to
connect with the Beam community, including creating a JIRA account[1].

1: https://beam.apache.org/contribute/#connect-with-the-beam-community

On Thu, Oct 24, 2019 at 7:15 PM Cam Mach  wrote:

> Hi Pradeep,
>
> What is your enhancement? Can you create a ticket and describe it?
>
> Thanks,
> Cam
>
>
>
> On Thu, Oct 24, 2019 at 1:58 PM Pradeep Bhosale <
> bhosale.pradeep1...@gmail.com> wrote:
>
>> Hi,
>>
>> This is Pradeep. I am using DynamoDB IO to write data to dynamo DB.
>> I would like to report one enhancement.
>>
>> Please let me know how can I achieve that.
>> I don't have *create issue* access on beam JIRA.
>>
>> https://issues.apache.org/jira/projects/BEAM/issues/BEAM-8368?filter=allopenissues
>>
>>


Re: [DISCUSS] How to stopp SdkWorker in SdkHarness

2019-10-25 Thread Luke Cwik
I like approach 3 since it doesn't add additional complexity to the API, and
individual SDKs can choose to implement any clean-up strategy they want, or
none at all, which is the simplest.

On Thu, Oct 24, 2019 at 8:46 PM jincheng sun 
wrote:

> Hi,
>
> Thanks for your comments in doc, I have add Approach 3 which you
> mentioned! @Luke
>
> For now, we should do a decision for Approach 3 and Approach 1. Detail can
> be found in doc [1]
>
> Welcome anyone's feedback :)
>
> Regards,
> Jincheng
>
> [1]
> https://docs.google.com/document/d/1sCgy9VQPf9zVXKRquK8P6N4x7aB62GEO8ozkujRSHZg/edit?usp=sharing
>
> jincheng sun wrote on Fri, Oct 25, 2019 at 10:40 AM:
>
>> Hi,
>>
>> Functionally capable of `abort`, but it will be called at the end of
>> operator. So, I prefer `dispose` semantics. i.e., all normal logic has been
>> executed.
>>
>> Best,
>> Jincheng
>>
>>> Harsh Vardhan wrote on Wed, Oct 23, 2019 at 12:14 AM:
>>
>>> Would approach 1 be akin to abort semantics?
>>>
>>> On Mon, Oct 21, 2019 at 8:01 PM jincheng sun 
>>> wrote:
>>>
 Hi Luke,

 Thanks a lot for your reply. Since it allows to share one SDK harness
 between multiple executable stages, the control service termination may
 occur much later than the completion of an executable stage. This is the
 main reason I prefer runners to control the teardown of DoFns.

 Regarding to "SDK harnesses can terminate instances any time they want
 and start new instances anytime as well.", personally I think it's not
 conflict with the proposed Approach 1 as the SDK harness could decide what
 to do when receiving the teardown request. It could do nothing if the DoFns
 has already been teared down and could also tear down the DoFns if needed.

 What do you think?

 Best,
 Jincheng

 Luke Cwik wrote on Tue, Oct 22, 2019 at 2:05 AM:

> Approach 2 is currently the suggested approach[1] for DoFn's to
> shutdown.
> Note that SDK harnesses can terminate instances any time they want and
> start new instances anytime as well.
>
> Why do you want to expose this logic so that Runners could control it?
>
> 1:
> https://docs.google.com/document/d/1n6s3BOxOPct3uF4UgbbI9O9rpdiKWFH9R6mtVmR7xp0/edit#
>
> On Mon, Oct 21, 2019 at 4:27 AM jincheng sun 
> wrote:
>
>> Hi,
>> I found that in `SdkHarness` do not  stop the `SdkWorker` when
>> finish.  We should add the logic for stop the `SdkWorker` in 
>> `SdkHarness`.
>> More detail can be found [1].
>>
>> There are two approaches to solve this issue:
>>
>> Approach 1:  We can add a Fn API for teardown purpose and the runner
>> will teardown a specific bundle descriptor via this teardown Fn API 
>> during
>> disposing.
>> Approach 2: The control service termination could be seen as a signal
>> and once SDK harness receives this signal, the teardown of the bundle
>> descriptor will be performed.
>>
>> More detail can be found in [2].
>>
>> As the Approach 2, SDK harness could be shared between multiple
>> executable stages. The control service termination only occurs when all 
>> the
>> executable stages sharing the same SDK harness finished. This means that
>> the teardown of DoFns may not be executed immediately after an executable
>> stage is finished.
>>
>> So, I prefer Approach 1. Welcome any feedback :)
>>
>> Best,
>> Jincheng
>>
>> [1]
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/sdk_worker.py
>> [2]
>> https://docs.google.com/document/d/1sCgy9VQPf9zVXKRquK8P6N4x7aB62GEO8ozkujRSHZg/edit?usp=sharing
>>
> --
>>>
>>> Got feedback? go/harsh-feedback 
>>>
>>


Re: JdbcIO read needs to fit in memory

2019-10-25 Thread Ryan Skraba
One more thing to try -- depending on your pipeline, you can disable
the "auto-reshuffle" of JdbcIO.Read by setting
withOutputParallelization(false)

This is particularly useful if (1) you do aggressive and cheap
filtering immediately after the read or (2) you do your own
repartitioning action like GroupByKey after the read.

Given your investigation into the heap, I doubt this will help!  I'll
take a closer look at the DoFnOutputManager.  In the meantime, is
there anything particular about your job that might help the
investigation?

All my best, Ryan

On Fri, Oct 25, 2019 at 2:47 PM Jozef Vilcek  wrote:
>
> I agree I might have been too quick to say that DoFn output needs to fit in
> memory. Actually, I am not sure what the Beam model says on this matter and
> what the output managers of particular runners do about it.
>
> But SparkRunner definitely has an issue here. I did try setting a small
> `fetchSize` for JdbcIO as well as changing `storageLevel` to MEMORY_AND_DISK.
> All attempts fail with OOM.
> When looking at the heap, most of it is used by the linked-list multimap of
> the DoFnOutputManager here:
> https://github.com/apache/beam/blob/v2.15.0/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/MultiDoFnFunction.java#L234
>
>


Re: JdbcIO read needs to fit in memory

2019-10-25 Thread Eugene Kirpichov
Yeah - in this case your primary option is to use JdbcIO.readAll() and shard
your query, as suggested above.
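
To make that concrete, here is a minimal, hypothetical sketch of a sharded
read with JdbcIO.readAll(); the connection details, table, column names and
shard boundaries are assumptions for illustration, and in practice the
boundaries would be derived from the data (e.g. a MIN/MAX query on the key):

import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.KV;

public class ShardedJdbcRead {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Hypothetical shard boundaries over a numeric primary key.
    List<KV<Long, Long>> ranges = new ArrayList<>();
    for (long lo = 0; lo < 1_000_000L; lo += 100_000L) {
      ranges.add(KV.of(lo, lo + 100_000L));
    }

    p.apply(Create.of(ranges)
            .withCoder(KvCoder.of(VarLongCoder.of(), VarLongCoder.of())))
        .apply(
            JdbcIO.<KV<Long, Long>, String>readAll()
                .withDataSourceConfiguration(
                    JdbcIO.DataSourceConfiguration.create(
                        "org.postgresql.Driver", "jdbc:postgresql://host:5432/mydb"))
                // One query execution per input range, so no single worker
                // has to hold the full result set of the unsharded query.
                .withQuery("SELECT payload FROM big_table WHERE id >= ? AND id < ?")
                .withParameterSetter(
                    (KV<Long, Long> range, PreparedStatement stmt) -> {
                      stmt.setLong(1, range.getKey());
                      stmt.setLong(2, range.getValue());
                    })
                .withRowMapper((ResultSet rs) -> rs.getString("payload"))
                .withCoder(StringUtf8Coder.of()));

    p.run().waitUntilFinish();
  }
}

Since each range is its own element, the runner can distribute the per-shard
reads across workers instead of pulling one huge result set in a single DoFn.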

Alternative hypothesis: is the result set of your query actually big enough
that it *shouldn't* fit in memory? Or could it be a matter of inefficient
storage of its elements? Could you briefly describe how big the result set
is and in what form you store its elements?

On Fri, Oct 25, 2019 at 5:47 AM Jozef Vilcek  wrote:

> I agree I might have been too quick to say that DoFn output needs to fit in
> memory. Actually, I am not sure what the Beam model says on this matter and
> what the output managers of particular runners do about it.
>
> But SparkRunner definitely has an issue here. I did try setting a small
> `fetchSize` for JdbcIO as well as changing `storageLevel` to MEMORY_AND_DISK.
> All attempts fail with OOM.
> When looking at the heap, most of it is used by the linked-list multimap of
> the DoFnOutputManager here:
>
> https://github.com/apache/beam/blob/v2.15.0/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/MultiDoFnFunction.java#L234
>
>
>


Re: JdbcIO read needs to fit in memory

2019-10-25 Thread Jozef Vilcek
I agree I might have been too quick to say that DoFn output needs to fit in
memory. Actually, I am not sure what the Beam model says on this matter and
what the output managers of particular runners do about it.

But SparkRunner definitely has an issue here. I did try setting a small
`fetchSize` for JdbcIO as well as changing `storageLevel` to MEMORY_AND_DISK.
All attempts fail with OOM.
When looking at the heap, most of it is used by the linked-list multimap of
the DoFnOutputManager here:
https://github.com/apache/beam/blob/v2.15.0/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/MultiDoFnFunction.java#L234


Re: Jenkins jobs taking too long

2019-10-25 Thread Kyle Weaver
Thanks Alexey. I forgot the magic words. 🙂

On Fri, Oct 25, 2019 at 11:48 AM Alexey Romanenko 
wrote:

>
> On 25 Oct 2019, at 09:18, Kyle Weaver  wrote:
> Does anyone know if there's a PR command for "rerun all"?
>
>
> Do you mean “Retest this please”?
>
>
>
> On Fri, Oct 25, 2019 at 9:09 AM Rehman Murad Ali <
> rehman.murad...@venturedive.com> wrote:
>
>> Hello,
>>
>> It's been more than 17 hours and the Jenkins jobs are still running. Here
>> is the PR: https://github.com/apache/beam/pull/9677.
>>
>> Any help would be appreciated.
>>
>>
>> Rehman
>>
>
>


Re: Jenkins jobs taking too long

2019-10-25 Thread Alexey Romanenko

> On 25 Oct 2019, at 09:18, Kyle Weaver  wrote:
> Does anyone know if there's a PR command for "rerun all"?

Do you mean “Retest this please”?


> 
> On Fri, Oct 25, 2019 at 9:09 AM Rehman Murad Ali
> <rehman.murad...@venturedive.com> wrote:
> Hello,
> 
> It's been more than 17 hours and the Jenkins jobs are still running. Here is
> the PR: https://github.com/apache/beam/pull/9677.
> 
> Any help would be appreciated.
> 
> 
> 
> Rehman
> 



Re: Jenkins jobs taking too long

2019-10-25 Thread Rehman Murad Ali
Thank you, Kyle, for letting me know about the workaround.


*Thanks & Regards*


*Rehman Murad Ali*

Software Engineer
Mobile: +92 3452076766
Skype: rehman,muradali




On Fri, Oct 25, 2019 at 12:26 PM Kyle Weaver  wrote:

> Looks like Jenkins might have got stuck for a while (this happened on my
> PR as well), but it seems to be working again now. You can manually re-run
> individual jobs with a comment on the PR containing one of the commands
> listed (such as Run Python PreCommit).
>
> Does anyone know if there's a PR command for "rerun all"?
>
> On Fri, Oct 25, 2019 at 9:09 AM Rehman Murad Ali <
> rehman.murad...@venturedive.com> wrote:
>
>> Hello,
>>
>> It's been more than 17 hours and the Jenkins jobs are still running. Here
>> is the PR: https://github.com/apache/beam/pull/9677.
>>
>> Any help would be appreciated.
>>
>>
>> Rehman
>>
>


Re: Jenkins jobs taking too long

2019-10-25 Thread Kyle Weaver
Looks like Jenkins might have got stuck for a while (this happened on my PR
as well), but it seems to be working again now. You can manually re-run
individual jobs with a comment on the PR containing one of the commands
listed (such as Run Python PreCommit).

Does anyone know if there's a PR command for "rerun all"?

On Fri, Oct 25, 2019 at 9:09 AM Rehman Murad Ali <
rehman.murad...@venturedive.com> wrote:

> Hello,
>
> It's been more than 17 hours and the Jenkins jobs are still running. Here
> is the PR: https://github.com/apache/beam/pull/9677.
>
> Any help would be appreciated.
>
>
> Rehman
>


Jenkins jobs taking too long

2019-10-25 Thread Rehman Murad Ali
Hello,

It's been more than 17 hours and the Jenkins jobs are still running. Here is
the PR: https://github.com/apache/beam/pull/9677.

Any help would be appreciated.


Rehman