Re: Versioning published Java containers

2020-07-14 Thread Kenneth Knowles
On Tue, Jul 14, 2020 at 2:22 PM Ismaël Mejía  wrote:

> We should read the ticket Kyle mentions with a grain of salt. Most of the
> sub-tasks in that ticket are NOT about allowing users to run pipelines
> with Java
> 11 but about being able to fully build and run the tests and the source code
> of Beam with Java 11, which is a different goal (important but probably less
> for
> end users) and a task with lots of extra issues because of plugins /
> dependent
> systems etc.
>

Actually only the tests! Pawel set it up so that the main jars are built
with Java 8 and then just the tests build and run with Java 11. So this is
more like what a "real" user would do while Beam is still building with
Java 8. Most of them have been fixed except the runners, which we would not
expect anyhow, and the core SDK was not broken, only its tests. Happily,
no IO unit tests are failing except XmlIO's.

I think to get the runners working we need to re-organize. Right now gradle
project properties control how the tests are run, but what we really need
for this is independent tests for different JREs that we can opt-in or
opt-out of per module.

I agree with Ismaël that releasing early is OK. I would propose an easier
criterion: having continuous integration tests running. The code might
have bugs, but at least we know that we won't break it further, and when we
fix a bug it stays fixed. I also think a simple opt-in flag like
--experiments=java11 could help users know that things might break or be
weird. But really we are so close that maybe we don't need all that formality.
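The opt-in flag idea above can be sketched from the flag-parsing side. This is a minimal stdlib illustration, not Beam's actual PipelineOptions handling; only the flag name --experiments=java11 comes from this thread:

```python
import argparse

# Hedged sketch: the experimental Java 11 container would only be selected
# when the user explicitly opts in via --experiments=java11.
parser = argparse.ArgumentParser()
parser.add_argument("--experiments", action="append", default=[])
opts = parser.parse_args(["--experiments=java11"])

# Gate the experimental behavior on the presence of the flag value.
use_java11_container = "java11" in opts.experiments
assert use_java11_container
```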

Kenn


>
> For the Java 11 harness, what we need to guarantee is that users can run
> their
> code without issues with Java 11 and we can do this now for example by
> checking
> that portable runners that support Java 11 pass ValidatesRunner with the
> Java 11
> harness. Since some classic runners [1] already pass these tests, it
> should be
> relatively 'easy' to do so for portable runners.
>
> [1]
> https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink_Java11/
>
>
>
>
> On Sat, Jul 11, 2020 at 12:43 AM Ahmet Altay  wrote:
> >
> > Related to the naming question, +1 and this will be similar to the
> python container naming (e.g. beam_python3.7_sdk).
> >
> > On Fri, Jul 10, 2020 at 1:46 PM Pablo Estrada 
> wrote:
> >>
> >> I agree with Kenn. Dataflow already has some publishing of non-portable
> Java 11 containers, so I think it'll be great to formalize the process for
> portable containers, and let users play with it, and know of its
> availability.
> >> Best
> >> -P.
> >>
> >> On Fri, Jul 10, 2020 at 9:42 AM Kenneth Knowles 
> wrote:
> >>>
> >>> To the initial question: I'm +1 on the rename. The container is
> primarily something that the SDK should insert into the pipeline proto
> during construction, and only user-facing in more specialized situations.
> Given the state of Java and portability, it is a good time to get things
> named properly and unambiguously. I think a brief announce to dev@ and
> user@ when it happens is nice-to-have, but no need to give advance
> warning.
> >>>
> >>> Kenn
> >>>
> >>> On Fri, Jul 10, 2020 at 7:58 AM Kenneth Knowles 
> wrote:
> 
>  I believe Beam already has quite a few users that have forged ahead
> and used Java 11 with various runners, pre-portability. Mostly I believe
> the Java 11 limitations are with particular features (Schema codegen) and
> extensions/IOs/transitive deps.
> 
>  When it comes to the container, I'd be interested in looking at test
> coverage. The Flink & Spark portable ValidatesRunner suites use EMBEDDED
> environment, so they don't exercise the container. The first testing of the
> Java SDK harness container against the Python-based Universal Local Runner
> is in pull request now [1]. Are there other test suites to highlight? How
> hard would it be to run Flink & Spark against the container(s) too?
> 
>  Kenn
> 
>  [1] https://github.com/apache/beam/pull/11792 (despite the name
> ValidatesRunner, in this case it is validating both the runner and harness,
> since we don't have a compliance test suite for SDK harnesses)
> 
>  On Fri, Jul 10, 2020 at 7:54 AM Tyson Hamilton 
> wrote:
> >
> > What do we consider 'ready'?
> >
> > Maybe the only required outstanding bugs are supporting the direct
> runner (BEAM-10085), core tests (BEAM-10081), IO tests (BEAM-10084)  to
> start with? Notably this would exclude failing tests like those for GCP
> core, GCPIOs, Dataflow runner, Spark runner, Flink runner, Samza.
> >
> >
> > On Thu, Jul 9, 2020 at 4:44 PM Kyle Weaver 
> wrote:
> >>
> >> My main question is, are we confident the Java 11 container is
> ready to release? AFAIK there are still a number of issues blocking full
> Java 11 support (cf [1]; not sure how many of these, if any, affect the SDK
> harness specifically though.)
> >>
> >> For comparison, we recently decided to stop publishing Go SDK
> 

Re: [PROPOSAL] Make PBegin and PDone public in the Python SDK

2020-07-14 Thread Robert Bradshaw
SGTM.

On Tue, Jul 14, 2020 at 5:28 PM Udi Meiri  wrote:
>
> So it sounds like we should:
> - Make PBegin public
> - Deprecate PDone return type in favor of None
> - Update the programming guide's Composite Transforms section.
>
>
> On Tue, Jul 14, 2020 at 5:13 PM Robert Burke  wrote:
>>
>> For contrast, the Go SDK provides an Impulse transform directly (analogous
>> to PBegin, part of the model) and has a ParDo0 (which, like PDone, has no
>> output PCollections). The numeral suffixing the Go ParDo functions indicates
>> the number of output PCollections expected from the passed-in DoFn.
>>
>> On Tue, Jul 14, 2020, 5:03 PM Robert Bradshaw  wrote:
>>>
>>> Yes, PBegin and PDone are used in the SDKs, but are not part of the model.
>>>
>>> I would be supportive of making PBegin more public to denote that a
>>> transform is a "root" of the pipeline. PDone was required for Java,
>>> however I don't think there's any use for it in the Python SDK (a
>>> transform can simply not return any value (which is equivalent to
>>> returning None) if it has no outputs.
>>>
>>> On Mon, Jul 13, 2020 at 8:17 PM Udi Meiri  wrote:
>>> >
>>> > Details:
>>> > One item of interest that came up during the implementation of BEAM-10258 
>>> > [1] is how to treat PTransforms that act like sources or sinks.
>>> > These transforms have either no input or output PCollections, 
>>> > respectively.
>>> >
>>> > Internally, we use PBegin and PDone to denote this. (ex: [2])
>>> > IIUC, PBegin and PDone aren't part of the Beam model, and in the pipeline 
>>> > description they manifest as empty input and output lists.
>>> >
>>> > To support type hinting, I propose making these types public.
>>> > They are not currently listed in [3] and the documentation implies 
>>> > they're internal.
>>> > Java SDK already supports these types and makes them public. [4]
>>> >
>>> >
>>> > [1] https://github.com/apache/beam/pull/12009
>>> > [2] 
>>> > https://github.com/apache/beam/blob/e73e1d1cce93930fa3d85046b9bbae7c724926bf/sdks/python/apache_beam/io/gcp/experimental/spannerio.py#L471-L506
>>> > [3] 
>>> > https://github.com/apache/beam/blob/e73e1d1cce93930fa3d85046b9bbae7c724926bf/sdks/python/apache_beam/pvalue.py#L61
>>> > [4] 
>>> > https://github.com/apache/beam/blob/e73e1d1cce93930fa3d85046b9bbae7c724926bf/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/PTransform.java#L53-L56


Re: [PROPOSAL] Make PBegin and PDone public in the Python SDK

2020-07-14 Thread Robert Bradshaw
PBegin is somewhat analogous to Go's Pipeline.Root() scope, though Go
does (Composite and otherwise) transforms quite differently.

On Tue, Jul 14, 2020 at 5:13 PM Robert Burke  wrote:
>
> For contrast, the Go SDK provides an Impulse transform directly (analogous to
> PBegin, part of the model) and has a ParDo0 (which, like PDone, has no output
> PCollections). The numeral suffixing the Go ParDo functions indicates the
> number of output PCollections expected from the passed-in DoFn.
>
> On Tue, Jul 14, 2020, 5:03 PM Robert Bradshaw  wrote:
>>
>> Yes, PBegin and PDone are used in the SDKs, but are not part of the model.
>>
>> I would be supportive of making PBegin more public to denote that a
>> transform is a "root" of the pipeline. PDone was required for Java,
>> however I don't think there's any use for it in the Python SDK (a
>> transform can simply not return any value (which is equivalent to
>> returning None) if it has no outputs.
>>
>> On Mon, Jul 13, 2020 at 8:17 PM Udi Meiri  wrote:
>> >
>> > Details:
>> > One item of interest that came up during the implementation of BEAM-10258 
>> > [1] is how to treat PTransforms that act like sources or sinks.
>> > These transforms have either no input or output PCollections, respectively.
>> >
>> > Internally, we use PBegin and PDone to denote this. (ex: [2])
>> > IIUC, PBegin and PDone aren't part of the Beam model, and in the pipeline 
>> > description they manifest as empty input and output lists.
>> >
>> > To support type hinting, I propose making these types public.
>> > They are not currently listed in [3] and the documentation implies they're 
>> > internal.
>> > Java SDK already supports these types and makes them public. [4]
>> >
>> >
>> > [1] https://github.com/apache/beam/pull/12009
>> > [2] 
>> > https://github.com/apache/beam/blob/e73e1d1cce93930fa3d85046b9bbae7c724926bf/sdks/python/apache_beam/io/gcp/experimental/spannerio.py#L471-L506
>> > [3] 
>> > https://github.com/apache/beam/blob/e73e1d1cce93930fa3d85046b9bbae7c724926bf/sdks/python/apache_beam/pvalue.py#L61
>> > [4] 
>> > https://github.com/apache/beam/blob/e73e1d1cce93930fa3d85046b9bbae7c724926bf/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/PTransform.java#L53-L56


Re: [PROPOSAL] Make PBegin and PDone public in the Python SDK

2020-07-14 Thread Udi Meiri
So it sounds like we should:
- Make PBegin public
- Deprecate PDone return type in favor of None
- Update the programming guide's Composite Transforms section.


On Tue, Jul 14, 2020 at 5:13 PM Robert Burke  wrote:

> For contrast, the Go SDK provides an Impulse transform directly (analogous
> to PBegin, part of the model) and has a ParDo0 (which, like PDone, has no
> output PCollections). The numeral suffixing the Go ParDo functions
> indicates the number of output PCollections expected from the passed-in
> DoFn.
>
> On Tue, Jul 14, 2020, 5:03 PM Robert Bradshaw  wrote:
>
>> Yes, PBegin and PDone are used in the SDKs, but are not part of the model.
>>
>> I would be supportive of making PBegin more public to denote that a
>> transform is a "root" of the pipeline. PDone was required for Java,
>> however I don't think there's any use for it in the Python SDK (a
>> transform can simply not return any value (which is equivalent to
>> returning None) if it has no outputs.
>>
>> On Mon, Jul 13, 2020 at 8:17 PM Udi Meiri  wrote:
>> >
>> > Details:
>> > One item of interest that came up during the implementation of
>> BEAM-10258 [1] is how to treat PTransforms that act like sources or sinks.
>> > These transforms have either no input or output PCollections,
>> respectively.
>> >
>> > Internally, we use PBegin and PDone to denote this. (ex: [2])
>> > IIUC, PBegin and PDone aren't part of the Beam model, and in the
>> pipeline description they manifest as empty input and output lists.
>> >
>> > To support type hinting, I propose making these types public.
>> > They are not currently listed in [3] and the documentation implies
>> they're internal.
>> > Java SDK already supports these types and makes them public. [4]
>> >
>> >
>> > [1] https://github.com/apache/beam/pull/12009
>> > [2]
>> https://github.com/apache/beam/blob/e73e1d1cce93930fa3d85046b9bbae7c724926bf/sdks/python/apache_beam/io/gcp/experimental/spannerio.py#L471-L506
>> > [3]
>> https://github.com/apache/beam/blob/e73e1d1cce93930fa3d85046b9bbae7c724926bf/sdks/python/apache_beam/pvalue.py#L61
>> > [4]
>> https://github.com/apache/beam/blob/e73e1d1cce93930fa3d85046b9bbae7c724926bf/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/PTransform.java#L53-L56
>>
>




Re: [PROPOSAL] Make PBegin and PDone public in the Python SDK

2020-07-14 Thread Robert Burke
For contrast, the Go SDK provides an Impulse transform directly (analogous
to PBegin, part of the model) and has a ParDo0 (which, like PDone, has no
output PCollections). The numeral suffixing the Go ParDo functions
indicates the number of output PCollections expected from the passed-in
DoFn.

On Tue, Jul 14, 2020, 5:03 PM Robert Bradshaw  wrote:

> Yes, PBegin and PDone are used in the SDKs, but are not part of the model.
>
> I would be supportive of making PBegin more public to denote that a
> transform is a "root" of the pipeline. PDone was required for Java,
> however I don't think there's any use for it in the Python SDK (a
> transform can simply not return any value (which is equivalent to
> returning None) if it has no outputs.
>
> On Mon, Jul 13, 2020 at 8:17 PM Udi Meiri  wrote:
> >
> > Details:
> > One item of interest that came up during the implementation of
> BEAM-10258 [1] is how to treat PTransforms that act like sources or sinks.
> > These transforms have either no input or output PCollections,
> respectively.
> >
> > Internally, we use PBegin and PDone to denote this. (ex: [2])
> > IIUC, PBegin and PDone aren't part of the Beam model, and in the
> pipeline description they manifest as empty input and output lists.
> >
> > To support type hinting, I propose making these types public.
> > They are not currently listed in [3] and the documentation implies
> they're internal.
> > Java SDK already supports these types and makes them public. [4]
> >
> >
> > [1] https://github.com/apache/beam/pull/12009
> > [2]
> https://github.com/apache/beam/blob/e73e1d1cce93930fa3d85046b9bbae7c724926bf/sdks/python/apache_beam/io/gcp/experimental/spannerio.py#L471-L506
> > [3]
> https://github.com/apache/beam/blob/e73e1d1cce93930fa3d85046b9bbae7c724926bf/sdks/python/apache_beam/pvalue.py#L61
> > [4]
> https://github.com/apache/beam/blob/e73e1d1cce93930fa3d85046b9bbae7c724926bf/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/PTransform.java#L53-L56
>


Re: [PROPOSAL] Make PBegin and PDone public in the Python SDK

2020-07-14 Thread Robert Bradshaw
Yes, PBegin and PDone are used in the SDKs, but are not part of the model.

I would be supportive of making PBegin more public to denote that a
transform is a "root" of the pipeline. PDone was required for Java,
however I don't think there's any use for it in the Python SDK (a
transform can simply not return any value (which is equivalent to
returning None) if it has no outputs.
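The "not returning a value is equivalent to returning None" point is plain Python semantics, which is why a sink-style transform needs no PDone at all. A toy sketch (ConsumeRecords is a hypothetical name, not a Beam API):

```python
class ConsumeRecords:
    # A sink-style transform: it consumes its input and produces no output
    # PCollections. Falling off the end of expand() is the same as
    # `return None`, which the proposal suggests instead of returning PDone.
    def expand(self, pcoll):
        for _ in pcoll:
            pass  # pretend to write each element somewhere

result = ConsumeRecords().expand(["a", "b"])
assert result is None
```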

On Mon, Jul 13, 2020 at 8:17 PM Udi Meiri  wrote:
>
> Details:
> One item of interest that came up during the implementation of BEAM-10258 [1] 
> is how to treat PTransforms that act like sources or sinks.
> These transforms have either no input or output PCollections, respectively.
>
> Internally, we use PBegin and PDone to denote this. (ex: [2])
> IIUC, PBegin and PDone aren't part of the Beam model, and in the pipeline 
> description they manifest as empty input and output lists.
>
> To support type hinting, I propose making these types public.
> They are not currently listed in [3] and the documentation implies they're 
> internal.
> Java SDK already supports these types and makes them public. [4]
>
>
> [1] https://github.com/apache/beam/pull/12009
> [2] 
> https://github.com/apache/beam/blob/e73e1d1cce93930fa3d85046b9bbae7c724926bf/sdks/python/apache_beam/io/gcp/experimental/spannerio.py#L471-L506
> [3] 
> https://github.com/apache/beam/blob/e73e1d1cce93930fa3d85046b9bbae7c724926bf/sdks/python/apache_beam/pvalue.py#L61
> [4] 
> https://github.com/apache/beam/blob/e73e1d1cce93930fa3d85046b9bbae7c724926bf/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/PTransform.java#L53-L56
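What the type-hinting proposal enables can be sketched with stdlib stand-ins (PBegin below is a stand-in for apache_beam.pvalue.PBegin, and List[str] stands in for PCollection[str], purely to keep the sketch self-contained):

```python
from typing import List

class PBegin:
    """Stand-in: marks the input of a root (source-like) transform."""

class ReadFromSource:
    # With PBegin public, a source transform's expand() can carry a real
    # type hint: its input is a PBegin (no input PCollections).
    def expand(self, pbegin: PBegin) -> List[str]:
        return ["record-1", "record-2"]

out = ReadFromSource().expand(PBegin())
assert out == ["record-1", "record-2"]
```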


Re: Versioning published Java containers

2020-07-14 Thread Robert Burke
Disallowing the Go containers was largely due to not having a simple check
on the Go boot code's licenses, which is required for containers hosted
under the main Apache namespace.

A manual verification reveals it's only Go's standard-library BSD license
and gRPC's Apache v2 license. Not impossible, but not yet done by us. The
JIRA issue has a link to the appropriate license finder for Go packages.

The amusing bit is that very similar Go boot code is included in the Java
and Python containers too, so we're only accidentally in compliance there,
if at all.



On Tue, Jul 14, 2020, 2:22 PM Ismaël Mejía  wrote:

> +1 for naming as python containers, and quick release so users can try it.
>
> Not related to this thread, but I am also curious about the reasons to
> remove the
> Go docker images; was this discussed/voted on the ML (maybe I missed it)?
>
> I don't think Beam has been historically a conservative project about
> releasing
> early in-progress versions and I have learnt to appreciate this because it
> helps
> for early user testing and bug reports, which will definitely be a must for
> Java
> 11.
>
> We should read the ticket Kyle mentions with a grain of salt. Most of the
> sub-tasks in that ticket are NOT about allowing users to run pipelines
> with Java
> 11 but about being able to fully build and run the tests and the source code
> of Beam with Java 11, which is a different goal (important but probably less
> for
> end users) and a task with lots of extra issues because of plugins /
> dependent
> systems etc.
>
> For the Java 11 harness, what we need to guarantee is that users can run
> their
> code without issues with Java 11 and we can do this now for example by
> checking
> that portable runners that support Java 11 pass ValidatesRunner with the
> Java 11
> harness. Since some classic runners [1] already pass these tests, it
> should be
> relatively 'easy' to do so for portable runners.
>
> [1]
> https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink_Java11/
>
>
>
>
> On Sat, Jul 11, 2020 at 12:43 AM Ahmet Altay  wrote:
> >
> > Related to the naming question, +1 and this will be similar to the
> python container naming (e.g. beam_python3.7_sdk).
> >
> > On Fri, Jul 10, 2020 at 1:46 PM Pablo Estrada 
> wrote:
> >>
> >> I agree with Kenn. Dataflow already has some publishing of non-portable
> Java 11 containers, so I think it'll be great to formalize the process for
> portable containers, and let users play with it, and know of its
> availability.
> >> Best
> >> -P.
> >>
> >> On Fri, Jul 10, 2020 at 9:42 AM Kenneth Knowles 
> wrote:
> >>>
> >>> To the initial question: I'm +1 on the rename. The container is
> primarily something that the SDK should insert into the pipeline proto
> during construction, and only user-facing in more specialized situations.
> Given the state of Java and portability, it is a good time to get things
> named properly and unambiguously. I think a brief announce to dev@ and
> user@ when it happens is nice-to-have, but no need to give advance
> warning.
> >>>
> >>> Kenn
> >>>
> >>> On Fri, Jul 10, 2020 at 7:58 AM Kenneth Knowles 
> wrote:
> 
>  I believe Beam already has quite a few users that have forged ahead
> and used Java 11 with various runners, pre-portability. Mostly I believe
> the Java 11 limitations are with particular features (Schema codegen) and
> extensions/IOs/transitive deps.
> 
>  When it comes to the container, I'd be interested in looking at test
> coverage. The Flink & Spark portable ValidatesRunner suites use EMBEDDED
> environment, so they don't exercise the container. The first testing of the
> Java SDK harness container against the Python-based Universal Local Runner
> is in pull request now [1]. Are there other test suites to highlight? How
> hard would it be to run Flink & Spark against the container(s) too?
> 
>  Kenn
> 
>  [1] https://github.com/apache/beam/pull/11792 (despite the name
> ValidatesRunner, in this case it is validating both the runner and harness,
> since we don't have a compliance test suite for SDK harnesses)
> 
>  On Fri, Jul 10, 2020 at 7:54 AM Tyson Hamilton 
> wrote:
> >
> > What do we consider 'ready'?
> >
> > Maybe the only required outstanding bugs are supporting the direct
> runner (BEAM-10085), core tests (BEAM-10081), IO tests (BEAM-10084)  to
> start with? Notably this would exclude failing tests like those for GCP
> core, GCPIOs, Dataflow runner, Spark runner, Flink runner, Samza.
> >
> >
> > On Thu, Jul 9, 2020 at 4:44 PM Kyle Weaver 
> wrote:
> >>
> >> My main question is, are we confident the Java 11 container is
> ready to release? AFAIK there are still a number of issues blocking full
> Java 11 support (cf [1]; not sure how many of these, if any, affect the SDK
> harness specifically though.)
> >>
> >> For comparison, we recently decided to stop publishing Go SDK
> containers until the Go SDK 

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-07-14 Thread Ismaël Mejía
It has been really interesting to read all the perspectives in this thread and I
have even switched sides back and forth given the advantages / issues exposed
here, so it means we have clear pros/cons.

One ‘not so nice‘ discovery related to this discussion for me was BEAM-10375 [1]
tl;dr: Reads use Java serialization, so they don’t have a default deterministic
coder, and if they are used as keys they break on GBK because Java’s
implementation requires key coders to be deterministic [2] (is this the case in
all the other languages?). We can work around this by having an alternative
Coder for Reads, but somehow it does not feel natural and adds extra maintenance.
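The underlying problem is easy to demonstrate in any language: native object serialization is generally not deterministic for equal values, which is exactly what a GBK key coder must guarantee. A stdlib Python analogue of the Java-serialization issue (illustration only, not Beam code):

```python
import pickle

# Two equal dicts built in different insertion orders...
d1 = {"a": 1, "b": 2}
d2 = {"b": 2, "a": 1}
assert d1 == d2

# ...serialize to different byte strings, because pickle records items in
# iteration (insertion) order. Grouping by serialized bytes would therefore
# send equal values to different groups.
assert pickle.dumps(d1) != pickle.dumps(d2)
```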

I really like Kenn’s idea that we should rethink from scratch, or write a
proposal of how we would have designed this with the present awareness of
DoFn-based composition, code reuse, and schema friendliness. Maybe it's worth
enumerating first which essentials we want to have (or not). I will be OOO for
the next month so I cannot actively work on this, but I will be interested in
reviewing/contributing in case someone wants to take the lead on a better
solution, or we can in the meantime keep bringing ideas to this thread.

Configuration based on functions translates poorly across languages, so I
wonder if we should also have a mechanism to map those. Notice that an important
use case of this is the detailed configuration of clients for IOs, which we have
started to expose in some IOs to avoid filling the IO APIs with ‘knobs‘ and
instead let users do their own tuning by providing a client via a function.
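The portability gap between data-based and function-based configuration can be seen with plain JSON (a stdlib illustration; the field names are made up):

```python
import json

# Data-based configuration: trivially encodable, so it can cross a language
# boundary (e.g. inside a cross-language expansion request).
config = {"endpoint": "https://example.invalid", "timeout_secs": 30}
assert json.loads(json.dumps(config)) == config

# Function-based configuration: a closure has no portable encoding, so a
# client factory handed to an IO cannot be shipped to another SDK's harness.
def client_factory():
    return object()

try:
    json.dumps(client_factory)
    portable = True
except TypeError:
    portable = False
assert not portable
```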

[1] https://issues.apache.org/jira/browse/BEAM-10375
[2] 
https://github.com/apache/beam/blob/a9d70fed9069c4f4e9a12860ef711652f5f9c21a/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/GroupByKey.java#L232-L237

On Thu, Jul 9, 2020 at 5:52 AM Kenneth Knowles  wrote:
>
> If we are forced to create a fresh class due to a breaking change, let's 
> migrate to the "what we would do from scratch" approach, please.
>
> Kenn
>
> On Wed, Jul 8, 2020 at 5:15 PM Robert Bradshaw  wrote:
>>
>> OK, I'm +0 on this change. Using the PTransform as an element is
>> probably better than duplicating the full API on another interface,
>> and I think it's worth getting this unblocked. This will require a Read2
>> if we have to add options in an upgrade-compatible way.
>>
>> On Tue, Jul 7, 2020 at 3:19 PM Luke Cwik  wrote:
>> >
>> > Robert, you're correct in your understanding that the Read PTransform 
>> > would be encoded via the schema coder.
>> >
>> > Kenn, different serializers are ok as long as the output coder can 
>> > encode/decode the output type. Different watermark fns are also ok since 
>> > it is about computing the watermark for each individual source and won't 
>> > impact the watermark computed by other sources. Watermark advancement will 
>> > still be held back by the source that is furthest behind and still has the 
>> > same problems when a user chooses a watermark fn that was incompatible 
>> > with the windowing strategy for producing output (e.g. global window + 
>> > default trigger + streaming pipeline).
>> >
>> > Both are pretty close so if we started from scratch then it could go 
>> > either way but we aren't starting from scratch (I don't think a Beam 3.0 
>> > is likely to happen in the next few years as there isn't enough stuff that 
>> > we want to remove vs the amount of stuff we would gain).
>> >
>> > On Tue, Jul 7, 2020 at 2:57 PM Kenneth Knowles  wrote:
>> >>
>> >> On Tue, Jul 7, 2020 at 2:24 PM Robert Bradshaw  
>> >> wrote:
>> >>>
>> >>> On Tue, Jul 7, 2020 at 2:06 PM Luke Cwik  wrote:
>> >>> >
>> >>> > Robert, the intent is that the Read object would use a schema coder 
>> >>> > and for XLang purposes would be no different than a POJO.
>> >>>
>> >>> Just to clarify, you're saying that the Read PTransform would be
>> >>> encoded via the schema coder? That still feels a bit odd (and
>> >>> specificically if we were designing IO from scratch rather than
>> >>> adapting to what already exists would we choose to use PTransforms as
>> >>> elements?) but would solve the cross language issue.
>> >>
>> >>
>> >> I like this question. If we were designing from scratch, what would we 
>> >> do? Would we encourage users to feed Create.of(SourceDescriptor) into 
>> >> ReadAll? We would probably provide a friendly wrapper for reading one 
>> >> static thing, and call it Read. But it would probably have an API like 
>> >> Read.from(SourceDescriptor), thus eliminating duplicate documentation and 
>> >> boilerplate that Luke described while keeping the separation that Brian 
>> >> described and clarity around xlang environments. But I'm +0 on whatever 
>> >> has momentum. I think the main downside is the weirdness around 
>> >> serializers/watermarkFn/etc on Read. I am not sure how much this will 
>> >> cause users problems. It would be very ambitious of them to produce a 
>> >> PCollection where they had different fns per element...
>> >>
>> >> Kenn
>> >>
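Kenn's "friendly wrapper" idea above, Read.from(SourceDescriptor) as sugar over Create plus ReadAll, can be modeled with a toy sketch (all names and behavior are hypothetical, not Beam code):

```python
class ReadAll:
    # Toy stand-in: expands a collection of source descriptors into reads.
    def expand(self, descriptors):
        return [f"read:{d}" for d in descriptors]

class Read:
    # The friendly wrapper for reading one static descriptor: equivalent to
    # Create.of(descriptor) feeding ReadAll, so there is no duplicated API
    # surface and no duplicated documentation.
    @staticmethod
    def from_(descriptor):
        return ReadAll().expand([descriptor])

assert Read.from_("topicA") == ["read:topicA"]
```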

Re: Versioning published Java containers

2020-07-14 Thread Ismaël Mejía
+1 for naming as python containers, and quick release so users can try it.
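The naming convention being +1'd here mirrors the Python containers (e.g. beam_python3.7_sdk, quoted later in this thread). A sketch of that convention; the apache/ registry prefix and the exact tag layout are assumptions on my part:

```python
def sdk_container_image(lang: str, lang_version: str, beam_version: str) -> str:
    # Mirrors the Python container naming, e.g. beam_python3.7_sdk, and
    # would yield beam_java11_sdk for the Java 11 harness under the rename.
    return f"apache/beam_{lang}{lang_version}_sdk:{beam_version}"

assert sdk_container_image("python", "3.7", "2.23.0") == "apache/beam_python3.7_sdk:2.23.0"
assert sdk_container_image("java", "11", "2.23.0") == "apache/beam_java11_sdk:2.23.0"
```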

Not related to this thread, but I am also curious about the reasons to remove
the Go docker images; was this discussed/voted on the ML (maybe I missed it)?

I don't think Beam has historically been a conservative project about releasing
early in-progress versions, and I have learnt to appreciate this because it
helps early user testing and bug reports, which will definitely be a must for
Java 11.

We should read the ticket Kyle mentions with a grain of salt. Most of the
sub-tasks in that ticket are NOT about allowing users to run pipelines with Java
11 but about being able to fully build and run the tests and the source code
of Beam with Java 11, which is a different goal (important but probably less for
end users) and a task with lots of extra issues because of plugins / dependent
systems etc.

For the Java 11 harness, what we need to guarantee is that users can run their
code without issues with Java 11 and we can do this now for example by checking
that portable runners that support Java 11 pass ValidatesRunner with the Java 11
harness. Since some classic runners [1] already pass these tests, it should be
relatively 'easy' to do so for portable runners.

[1] 
https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink_Java11/




On Sat, Jul 11, 2020 at 12:43 AM Ahmet Altay  wrote:
>
> Related to the naming question, +1 and this will be similar to the python 
> container naming (e.g. beam_python3.7_sdk).
>
> On Fri, Jul 10, 2020 at 1:46 PM Pablo Estrada  wrote:
>>
>> I agree with Kenn. Dataflow already has some publishing of non-portable Java 
>> 11 containers, so I think it'll be great to formalize the process for 
>> portable containers, and let users play with it, and know of its 
>> availability.
>> Best
>> -P.
>>
>> On Fri, Jul 10, 2020 at 9:42 AM Kenneth Knowles  wrote:
>>>
>>> To the initial question: I'm +1 on the rename. The container is primarily 
>>> something that the SDK should insert into the pipeline proto during 
>>> construction, and only user-facing in more specialized situations. Given 
>>> the state of Java and portability, it is a good time to get things named 
>>> properly and unambiguously. I think a brief announce to dev@ and user@ when 
>>> it happens is nice-to-have, but no need to give advance warning.
>>>
>>> Kenn
>>>
>>> On Fri, Jul 10, 2020 at 7:58 AM Kenneth Knowles  wrote:

 I believe Beam already has quite a few users that have forged ahead and 
 used Java 11 with various runners, pre-portability. Mostly I believe the 
 Java 11 limitations are with particular features (Schema codegen) and 
 extensions/IOs/transitive deps.

 When it comes to the container, I'd be interested in looking at test 
 coverage. The Flink & Spark portable ValidatesRunner suites use EMBEDDED 
 environment, so they don't exercise the container. The first testing of 
 the Java SDK harness container against the Python-based Universal Local 
 Runner is in pull request now [1]. Are there other test suites to 
 highlight? How hard would it be to run Flink & Spark against the 
 container(s) too?

 Kenn

 [1] https://github.com/apache/beam/pull/11792 (despite the name 
 ValidatesRunner, in this case it is validating both the runner and 
 harness, since we don't have a compliance test suite for SDK harnesses)

 On Fri, Jul 10, 2020 at 7:54 AM Tyson Hamilton  wrote:
>
> What do we consider 'ready'?
>
> Maybe the only required outstanding bugs are supporting the direct runner 
> (BEAM-10085), core tests (BEAM-10081), IO tests (BEAM-10084)  to start 
> with? Notably this would exclude failing tests like those for GCP core, 
> GCPIOs, Dataflow runner, Spark runner, Flink runner, Samza.
>
>
> On Thu, Jul 9, 2020 at 4:44 PM Kyle Weaver  wrote:
>>
>> My main question is, are we confident the Java 11 container is ready to 
>> release? AFAIK there are still a number of issues blocking full Java 11 
>> support (cf [1]; not sure how many of these, if any, affect the SDK 
>> harness specifically though.)
>>
>> For comparison, we recently decided to stop publishing Go SDK containers 
>> until the Go SDK is considered mature [2]. In the meantime, those who 
>> want to use the Go SDK can build their own container images from source.
>>
>> Do we already have a Gradle task to build Java 11 containers? If not, 
>> this would be a good intermediate step, letting users opt-in to Java 11 
>> without us overpromising support.
>
>
> We do not. From what I can tell, the build.gradle [1] for the Java 
> container is only for the one version. There is a Dockerfile used for 
> Jenkins tests.
>
> [1] 
> https://github.com/apache/beam/blob/master/sdks/java/container/build.gradle
>
>>
>>
>> When we eventually do the 

Re: KinesisIO Tests - are they run anywhere?

2020-07-14 Thread Ismaël Mejía
Just for reference, we already run localstack for the DynamoDB tests via
testcontainers [1], so everything should be ready (at least module-wise) to run
other simulated ITs. Being able to run all our AWS-related ITs with localstack
is a really simple and nice contribution to do if anyone is interested.

About real AWS runs, there have been conversations about AWS credits donations
in the past, but sadly with no progress. The most recent ones were at the
beginning of this year, so it is not a 'forgotten' subject; it is just hard to
find someone to donate the credits or contribute a credit card for this purpose
[2]. Contributions welcome ;)

[1] https://www.testcontainers.org/modules/localstack/
[2] 
https://aws.amazon.com/blogs/opensource/aws-promotional-credits-open-source-projects/
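Running the same IT code against localstack mostly comes down to overriding the client endpoint. A minimal sketch, assuming a localstack container listening on its default edge port 4566 and accepting dummy credentials (which localstack does):

```python
LOCALSTACK_ENDPOINT = "http://localhost:4566"  # localstack's default edge port

def aws_client_kwargs(service: str, endpoint: str = LOCALSTACK_ENDPOINT) -> dict:
    # These kwargs can be passed straight to boto3.client(**kwargs); only
    # endpoint_url differs from a real-AWS run, so the IT body is unchanged.
    return {
        "service_name": service,
        "endpoint_url": endpoint,
        "region_name": "us-east-1",
        "aws_access_key_id": "test",       # dummy credentials for localstack
        "aws_secret_access_key": "test",
    }

assert aws_client_kwargs("kinesis")["endpoint_url"] == "http://localhost:4566"
```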

On Tue, Jul 14, 2020 at 11:00 PM Kenneth Knowles  wrote:
>
> Localstack looks really cool. We have to consider resource consumption and 
> latency for where it fits in the testing plan. It definitely seems useful for 
> having ITs easily. For individual IOs, an in-process implementation can be 
> much more performant without losing much accuracy (since it is also just 
> another implementation that is not the real one) so if that is available I 
> would also use one of those (any time you would mock, use an in-process fake 
> instead). Like direct runner vs ULR. You don't need the ULR to test your 
> transform's correctness. Localstack's README discusses the pro/con of each 
> pretty well. Fault injection is pretty compelling and process isolation might 
> reduce dependency troubles.
>
> Kenn
>
> On Fri, Jul 10, 2020 at 4:45 AM Alexey Romanenko  
> wrote:
>>
>> I think that we should get back to this question, since we now have 
>> more and more AWS-related IO connectors (mostly in the Java SDK, afaik).
>>
>> It would be great to have dedicated Beam credentials to run all our 
>> AWS-related ITs against a real AWS instance, but until then, I’m +1 to run 
>> such tests against 3rd-party implementations, for example “localstack”, as 
>> the most comprehensive one. Of course we can observe some potential discrepancy 
>> in behaviour between real AWS and other implementations, but if it’s not 
>> fundamental things then it should not be a stopper. I believe that regularly 
>> running ITs, especially for IO connectors, is very important.
>>
>> On 9 Jul 2020, at 22:18, Luke Cwik  wrote:
>>
>> It has come up a few times[1, 2, 3, 4] and there have also been a few 
>> comments over time about whether someone could donate AWS resources to the 
>> project.
>>
>> 1: https://issues.apache.org/jira/browse/BEAM-601
>> 2: https://issues.apache.org/jira/browse/BEAM-3373
>> 3: https://issues.apache.org/jira/browse/BEAM-3550
>> 4: https://issues.apache.org/jira/browse/BEAM-3032
>>
>> On Thu, Jul 9, 2020 at 1:02 PM Mani Kolbe  wrote:
>>>
>>> Have you guys considered using localstack to run AWS service based 
>>> integration tests?
>>>
>>> https://github.com/localstack/localstack
>>>
>>> On Thu, 9 Jul, 2020, 5:25 PM Piotr Szuberski,  
>>> wrote:

 Yeah, I meant KinesisIOIT tests. I'll do the same with the cross-language 
 it tests then. Thanks for your reply :)

 On 2020/07/08 17:13:11, Alexey Romanenko  wrote:
 > If you mean Java KinesisIO tests, then unit tests are running on Jenkins 
 > [1] and ITs are not running since it requires AWS credentials that we 
 > don’t have dedicated to Beam for the moment.
 >
 > At the same time, you can run KinesisIOIT with your own credentials, 
 > like we do in Talend (a company that I work for).
 >
 > [1] 
 > https://ci-beam.apache.org/job/beam_PreCommit_Java_Commit/12209/testReport/org.apache.beam.sdk.io.kinesis/
 >  
 > 
 >
 > > On 8 Jul 2020, at 13:11, Piotr Szuberski  
 > > wrote:
 > >
 > > I'm writing KinesisIO external transform with python wrapper and I 
 > > found that the tests aren't executed anywhere in Jenkins. Am I wrong 
 > > or there is a reason for that?
 >
 >
>>
>>


Re: Canceling Jenkins builds when the update to PR makes prior build irrelevant

2020-07-14 Thread Kenneth Knowles
Yes, thanks. Really helpful.

Kenn

On Tue, Jul 14, 2020 at 9:48 AM Ahmet Altay  wrote:

> Great news! Thank you for the update and making this happen!
>
> On Tue, Jul 14, 2020 at 1:09 AM Tobiasz Kędzierski <
> tobiasz.kedzier...@polidea.com> wrote:
>
>> I think the experiment went quite well.
>> Now all nodes have an increased number of executors [1].
>> Canceling Jenkins builds after PR updates seems to work as well. Queue is
>> usually very short.
>>
>> [1]: https://ci-beam.apache.org/label/beam/load-statistics?type=hour
>>
>> BR
>> Tobiasz
>>
>> On Mon, Jul 6, 2020 at 4:52 PM Tyson Hamilton  wrote:
>>
>>> How did the experiment go? The load status graphs on the executors seem
>>> to show no increase in queued jobs [1]. There is a periodic bump every 6h,
>>> possibly due to cron firing off a bunch at the same time, which can also be
>>> seen by changing the timespan in [1]. The number of executors on 1/4 of the
>>> nodes was also increased, so the combination of these things makes contention
>>> appear quite low or even non-existent.
>>>
>>> [1]: https://ci-beam.apache.org/label/beam/load-statistics?type=hour
>>>
>>> On Mon, Jun 29, 2020 at 9:17 AM Tobiasz Kędzierski <
>>> tobiasz.kedzier...@polidea.com> wrote:
>>>
 Hi

 Agree with Ahmet that even in that shape it should improve the queue
 length. Both _Commit/_Phrase cross-cancelling and a "cancel all" phrase seem
 to require much effort, and I doubt it's worth doing.

 I will turn on `Cancel build on update` in ghprb-plugin on
 ci-beam.apache.org tomorrow (Tuesday).

 Some discussions related to job filtering issue (or feature) in
 ghprb-plugin:
 https://github.com/jenkinsci/ghprb-plugin/issues/678
 https://github.com/jenkinsci/ghprb-plugin/pull/680

 BR
 Tobiasz

 On Fri, Jun 26, 2020 at 2:07 AM Ahmet Altay  wrote:

>
>
> On Thu, Jun 25, 2020 at 4:27 PM Tobiasz Kędzierski <
> tobiasz.kedzier...@polidea.com> wrote:
>
>> Andrew thanks for great analysis +1
>> This bug with job filtering seems to be crucial to keep _Commit and
>> _Phrase separate.
>>
>> I was considering the situation where two PRs with the same
>> commit hash are created. I created an artificial situation where two
>> branches are identical and then opened two PRs from them. Two separate jobs were
>> triggered. As you wrote, due to the GH status being matched by job name and
>> hash, both PR statuses were pointing to the same job (the newest one,
>> which was wrong for one PR). As I tested, adding a new commit that cancels
>> the previous build would show a false status on the PR with the previously
>> wrong job link.
>> It is possible to reproduce it, but could you give the real life
>> situation where two jobs would be triggered with the same commit?
>> I am asking because I think that enabling `Cancel build on update`
>> may greatly improve Jenkins queue and it would be worthwhile to sacrifice
>> this rare and unlikely case for it (if it is rare and unlikely case of
>> course).
>>
>
> I agree with this.
>
>
>>
>> Ahmet, I think the cancelling _Commit build by following _Phrase
>> should be handled within ghprb-plugin if possible. I am not sure if we 
>> can
>> make some workaround. Do you have any suggestions how we may solve it?
>>
>
> I do not know jenkins enough to be able to make a good suggestion. We
> can try:
> - If it is possible to do this with ghprb plugin as you suggested, we
> can do that.
> - If not, we can make _Commit jobs cancel _Commit jobs only and
> _Phrase jobs cancel _Phrase jobs only. It will still be an improvement.
>
>
>>
>> BR
>> Tobiasz
>>
>> On Wed, Jun 24, 2020 at 12:28 AM Kenneth Knowles 
>> wrote:
>>
>>> +1 to Andrew's analysis
>>>
>>> On Tue, Jun 23, 2020 at 12:13 PM Ahmet Altay 
>>> wrote:
>>>
 Would it be possible to cancel any running _Phrase or _Commit
 variants, if either one of them is triggered?

 On Tue, Jun 23, 2020 at 10:41 AM Andrew Pilloud <
 apill...@google.com> wrote:

> I believe we split _Commit and _Phrase to work around a bug with
> job filtering. For example, when you make a python change only the 
> python
> tests are run based on the commit. We still want to be able to run 
> the java
> jobs by trigger phrase if needed. There are also performance tests 
> (Nexmark
> for example) that have different jobs to ensure PR runs don't end up
> published in the performance dashboard, but I think those have a 
> split of
> _Phrase and _Cron.
>
> As for canceling jobs, don't forget that the github status APIs
> are keyed on commit hash and job name (not PR). It is possible for a 

Re: KinesisIO Tests - are they run anywhere?

2020-07-14 Thread Kenneth Knowles
Localstack looks really cool. We have to consider resource consumption and
latency for where it fits in the testing plan. It definitely seems useful
for having ITs easily. For individual IOs, an in-process implementation can
be much more performant without losing much accuracy (since it is also just
another implementation that is not the real one) so if that is available I
would also use one of those (any time you would mock, use an in-process
fake instead). Like direct runner vs ULR. You don't need the ULR to test
your transform's correctness. Localstack's README discusses the pro/con of
each pretty well. Fault injection is pretty compelling and process
isolation might reduce dependency troubles.
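
The "any time you would mock, use an in-process fake instead" point can be sketched like this (hypothetical names, not Beam or AWS APIs). The fake is just another implementation of the same interface, so tests exercise real read/write semantics rather than pre-scripted mock answers:

```python
class KeyValueStore:
    """Interface a transform might write through (illustrative only)."""

    def put(self, key, value):
        raise NotImplementedError

    def get(self, key):
        raise NotImplementedError


class InProcessFakeStore(KeyValueStore):
    """In-process fake: a real, if simple, implementation.

    Unlike a mock, it enforces actual semantics (last write wins,
    missing keys raise), so tests catch genuine misuse of the store
    instead of just verifying that expected calls were made.
    """

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        if key not in self._data:
            raise KeyError(key)
        return self._data[key]


# Test code uses the fake exactly where production code would use the
# real client; no per-call scripting is required.
store = InProcessFakeStore()
store.put("a", 1)
store.put("a", 2)  # last write wins, like the real service
```

Swapping between the fake, localstack, and the real service is then just a matter of which implementation gets injected, which is the "also just another implementation" point above.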

Kenn

On Fri, Jul 10, 2020 at 4:45 AM Alexey Romanenko 
wrote:

> I think that we should get back to this question, since we now have more and
> more AWS-related IO connectors (mostly in the Java SDK, afaik).
>
> It would be great to have dedicated Beam credentials to run all our
> AWS-related ITs against a real AWS instance, but until then, I’m +1 to run
> such tests against 3rd-party implementations, for example “localstack”, as the
> most comprehensive one. Of course we can observe some potential discrepancy
> in behaviour between real AWS and other implementations, but if it’s not
> fundamental things then it should not be a stopper. I believe that regularly
> running ITs, especially for IO connectors, is very important.
>
> On 9 Jul 2020, at 22:18, Luke Cwik  wrote:
>
> It has come up a few times[1, 2, 3, 4] and there have also been a few
> comments over time about whether someone could donate AWS resources to the
> project.
>
> 1: https://issues.apache.org/jira/browse/BEAM-601
> 2: https://issues.apache.org/jira/browse/BEAM-3373
> 3: https://issues.apache.org/jira/browse/BEAM-3550
> 4: https://issues.apache.org/jira/browse/BEAM-3032
>
> On Thu, Jul 9, 2020 at 1:02 PM Mani Kolbe  wrote:
>
>> Have you guys considered using localstack to run AWS service based
>> integration tests?
>>
>> https://github.com/localstack/localstack
>>
>> On Thu, 9 Jul, 2020, 5:25 PM Piotr Szuberski, <
>> piotr.szuber...@polidea.com> wrote:
>>
>>> Yeah, I meant KinesisIOIT tests. I'll do the same with the
>>> cross-language it tests then. Thanks for your reply :)
>>>
>>> On 2020/07/08 17:13:11, Alexey Romanenko 
>>> wrote:
>>> > If you mean Java KinesisIO tests, then unit tests are running on
>>> Jenkins [1] and ITs are not running since it requires AWS credentials that
>>> we don’t have dedicated to Beam for the moment.
>>> >
>>> > At the same time, you can run KinesisIOIT with your own credentials,
>>> like we do in Talend (a company that I work for).
>>> >
>>> > [1]
>>> https://ci-beam.apache.org/job/beam_PreCommit_Java_Commit/12209/testReport/org.apache.beam.sdk.io.kinesis/
>>> <
>>> https://ci-beam.apache.org/job/beam_PreCommit_Java_Commit/12209/testReport/org.apache.beam.sdk.io.kinesis/
>>> >
>>> >
>>> > > On 8 Jul 2020, at 13:11, Piotr Szuberski <
>>> piotr.szuber...@polidea.com> wrote:
>>> > >
>>> > > I'm writing KinesisIO external transform with python wrapper and I
>>> found that the tests aren't executed anywhere in Jenkins. Am I wrong or
>>> there is a reason for that?
>>> >
>>> >
>>>
>>
>


[VOTE] Extension name of Interactive Beam Side Panel in JupyterLab

2020-07-14 Thread Ning Kang
Hi everyone,

Last week, I sent a design doc and proposals in this email thread about
creating a JupyterLab extension for Interactive Beam.
If you haven't had a chance to look at it and you're interested in
Interactive Beam, please feel free to leave comments.

Let's start a vote for the name of this extension to be used when published
to NPM.
Here are some of the candidate names:
[1] apache-beam-sidepanel
[2] apache-beam-interactive-sidepanel
[3] apache-beam-jupyterlab-sidepanel
[4] 


The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.

Thanks!

Ning.


Re: Canceling Jenkins builds when the update to PR makes prior build irrelevant

2020-07-14 Thread Ahmet Altay
Great news! Thank you for the update and making this happen!

On Tue, Jul 14, 2020 at 1:09 AM Tobiasz Kędzierski <
tobiasz.kedzier...@polidea.com> wrote:

> I think the experiment went quite well.
> Now all nodes have an increased number of executors [1].
> Canceling Jenkins builds after PR updates seems to work as well. Queue is
> usually very short.
>
> [1]: https://ci-beam.apache.org/label/beam/load-statistics?type=hour
>
> BR
> Tobiasz
>
> On Mon, Jul 6, 2020 at 4:52 PM Tyson Hamilton  wrote:
>
>> How did the experiment go? The load status graphs on the executors seem
>> to show no increase in queued jobs [1]. There is a periodic bump every 6h,
>> possibly due to cron firing off a bunch at the same time, which can also be
>> seen by changing the timespan in [1]. The number of executors on 1/4 of the
>> nodes was also increased, so the combination of these things makes contention
>> appear quite low or even non-existent.
>>
>> [1]: https://ci-beam.apache.org/label/beam/load-statistics?type=hour
>>
>> On Mon, Jun 29, 2020 at 9:17 AM Tobiasz Kędzierski <
>> tobiasz.kedzier...@polidea.com> wrote:
>>
>>> Hi
>>>
>>> Agree with Ahmet that even in that shape it should improve the queue
>>> length. Both _Commit/_Phrase cross-cancelling and a "cancel all" phrase seem
>>> to require much effort, and I doubt it's worth doing.
>>>
>>> I will turn on `Cancel build on update` in ghprb-plugin on
>>> ci-beam.apache.org tomorrow (Tuesday).
>>>
>>> Some discussions related to job filtering issue (or feature) in
>>> ghprb-plugin:
>>> https://github.com/jenkinsci/ghprb-plugin/issues/678
>>> https://github.com/jenkinsci/ghprb-plugin/pull/680
>>>
>>> BR
>>> Tobiasz
>>>
>>> On Fri, Jun 26, 2020 at 2:07 AM Ahmet Altay  wrote:
>>>


 On Thu, Jun 25, 2020 at 4:27 PM Tobiasz Kędzierski <
 tobiasz.kedzier...@polidea.com> wrote:

> Andrew thanks for great analysis +1
> This bug with job filtering seems to be crucial to keep _Commit and
> _Phrase separate.
>
> I was considering the situation where the two PRs with the same commit
> hash are created. I created an artificial situation where two branches are
> identical and then two PRs with them. Two separate jobs were triggered. As
> you wrote, due to the matching GH status by job name and hash, both PR
> statuses were pointing to the same job (the newest one, which was wrong 
> for
> one PR). As I tested, adding a new commit which will cancel the previous
> build would show false status on the PR with the previously wrong job 
> link.
> It is possible to reproduce it, but could you give the real life
> situation where two jobs would be triggered with the same commit?
> I am asking because I think that enabling `Cancel build on update` may
> greatly improve Jenkins queue and it would be worthwhile to sacrifice this
> rare and unlikely case for it (if it is rare and unlikely case of course).
>

 I agree with this.


>
> Ahmet, I think the cancelling _Commit build by following _Phrase
> should be handled within ghprb-plugin if possible. I am not sure if we can
> make some workaround. Do you have any suggestions how we may solve it?
>

 I do not know jenkins enough to be able to make a good suggestion. We
 can try:
 - If it is possible to do this with ghprb plugin as you suggested, we
 can do that.
 - If not, we can make _Commit jobs cancel _Commit jobs only and _Phrase
 jobs cancel _Phrase jobs only. It will still be an improvement.


>
> BR
> Tobiasz
>
> On Wed, Jun 24, 2020 at 12:28 AM Kenneth Knowles 
> wrote:
>
>> +1 to Andrew's analysis
>>
>> On Tue, Jun 23, 2020 at 12:13 PM Ahmet Altay 
>> wrote:
>>
>>> Would it be possible to cancel any running _Phrase or _Commit
>>> variants, if either one of them is triggered?
>>>
>>> On Tue, Jun 23, 2020 at 10:41 AM Andrew Pilloud 
>>> wrote:
>>>
 I believe we split _Commit and _Phrase to work around a bug with
 job filtering. For example, when you make a python change only the 
 python
 tests are run based on the commit. We still want to be able to run the 
 java
 jobs by trigger phrase if needed. There are also performance tests 
 (Nexmark
 for example) that have different jobs to ensure PR runs don't end up
 published in the performance dashboard, but i think those have a split 
 of
 _Phrase and _Cron.

 As for canceling jobs, don't forget that the github status APIs are
 keyed on commit hash and job name (not PR). It is possible for a 
 commit to
 be on multiple PRs and it is possible for a single PR to have
 multiple commits. There are workflows that will be broken if you are 
 keying
 off of a PR to automatically cancel jobs.

 

Re: Python SDK ReadFromKafka: Timeout expired while fetching topic metadata

2020-07-14 Thread Kamil Wasilewski
Never mind, I found this thread on user list:
https://lists.apache.org/thread.html/raeb69afbd820fdf32b3cf0a273060b6b149f80fa49c7414a1bb60528%40%3Cuser.beam.apache.org%3E,
which answers my question.

On Mon, Jul 13, 2020 at 4:10 PM Kamil Wasilewski <
kamil.wasilew...@polidea.com> wrote:

> I'd like to bump this thread up since I get the same error when trying to
> read from Kafka in Python SDK:
>
> *java.lang.UnsupportedOperationException: The ActiveBundle does not have a
> registered bundle checkpoint handler.*
>
> Can someone familiar with cross-language and Flink verify the problem? I
> use the latest Beam master with the following pipeline options:
>
> --runner=FlinkRunner
> --parallelism=2
> --experiment=beam_fn_api
> --environment_type=DOCKER
> --environment_cache_millis=1
>
> Those are the same options which are used in CrossLanguageKafkaIOTest:
> https://github.com/apache/beam/blob/master/sdks/python/test-suites/portable/common.gradle#L114
> Speaking of which, is there a specific reason why reading from Kafka is not
> yet being tested on Jenkins?
>
> Thanks,
> Kamil
>
> On Thu, Jun 18, 2020 at 11:35 PM Piotr Filipiuk 
> wrote:
>
>> Thank you for clarifying.
>>
>> Would you mind clarifying whether the issues that I experience running
>> Kafka IO on Flink (or DirectRunner for testing) specific to my setup etc.
>> or this setup is not yet fully functional (for Python SDK)?
>>
>> On Thu, Jun 18, 2020 at 12:03 PM Chamikara Jayalath 
>> wrote:
>>
>>> Beam does not have a concept of general availability. It's released with
>>> Beam so available. Some of the APIs used by Kafka are experimental so are
>>> subject to change (but less likely at this point).
>>> Various runners may offer their own levels of availability for
>>> cross-language transforms.
>>>
>>> Thanks,
>>> Cham
>>>
>>>
>>> On Thu, Jun 18, 2020 at 11:26 AM Piotr Filipiuk <
>>> piotr.filip...@gmail.com> wrote:
>>>
 I also wanted to clarify whether Kafka IO for Python SDK is general
 availability or is it still experimental?

 On Fri, Jun 12, 2020 at 2:52 PM Piotr Filipiuk <
 piotr.filip...@gmail.com> wrote:

> For completeness I am also attaching task manager logs.
>
> On Fri, Jun 12, 2020 at 2:32 PM Piotr Filipiuk <
> piotr.filip...@gmail.com> wrote:
>
>> Thank you for clarifying.
>>
>> I attempted to use FlinkRunner with 2.22 and I am
>> getting the following error, which I am not sure how to debug:
>>
>> ERROR:root:java.lang.UnsupportedOperationException: The ActiveBundle
>> does not have a registered bundle checkpoint handler.
>> INFO:apache_beam.runners.portability.portable_runner:Job state
>> changed to FAILED
>> Starting pipeline kafka = 192.168.1.219:9092, topic = piotr-test
>> Traceback (most recent call last):
>>   File "apache_beam/examples/streaming_wordcount_kafka.py", line 73,
>> in 
>> run()
>>   File "apache_beam/examples/streaming_wordcount_kafka.py", line 68,
>> in run
>> | "WriteUserScoreSums" >> beam.io.WriteToText(args.output)
>>   File
>> "/Users/piotr.filipiuk/.virtualenvs/apache-beam/lib/python3.7/site-packages/apache_beam/pipeline.py",
>> line 547, in __exit__
>> self.run().wait_until_finish()
>>   File
>> "/Users/piotr.filipiuk/.virtualenvs/apache-beam/lib/python3.7/site-packages/apache_beam/runners/portability/portable_runner.py",
>> line 583, in wait_until_finish
>> raise self._runtime_exception
>> RuntimeError: Pipeline
>> BeamApp-piotr0filipiuk-0612212344-159c632b_45b947f8-bc64-4aad-a96c-6fb2bd834d60
>> failed in state FAILED: java.lang.UnsupportedOperationException: The
>> ActiveBundle does not have a registered bundle checkpoint handler.
>>
>> My setup is (everything runs locally):
>> Beam Version: 2.22.0.
>> Kafka 2.5.0 (https://kafka.apache.org/quickstart - using default
>> config/server.properties)
>> Flink 1.10 (
>> https://ci.apache.org/projects/flink/flink-docs-stable/getting-started/tutorials/local_setup.html
>> )
>>
>> I run the pipeline using the following command:
>>
>> python apache_beam/examples/streaming_wordcount_kafka.py
>> --bootstrap_servers=192.168.1.219:9092 --topic=piotr-test
>> --runner=FlinkRunner --flink_version=1.10 --flink_master=
>> 192.168.1.219:8081 --environment_type=LOOPBACK
>>
>> I can see the following error in the logs:
>>
>> ERROR:apache_beam.runners.worker.data_plane:Failed to read inputs in
>> the data plane.
>> Traceback (most recent call last):
>>   File
>> "/Users/piotr.filipiuk/.virtualenvs/apache-beam/lib/python3.7/site-packages/apache_beam/runners/worker/data_plane.py",
>> line 528, in _read_inputs
>> for elements in elements_iterator:
>>   File
>> "/Users/piotr.filipiuk/.virtualenvs/apache-beam/lib/python3.7/site-packages/grpc/_channel.py",
>> line 416, in __next__

Re: Canceling Jenkins builds when the update to PR makes prior build irrelevant

2020-07-14 Thread Tobiasz Kędzierski
I think the experiment went quite well.
Now all nodes have an increased number of executors [1].
Canceling Jenkins builds after PR updates seems to work as well. Queue is
usually very short.

[1]: https://ci-beam.apache.org/label/beam/load-statistics?type=hour

BR
Tobiasz

On Mon, Jul 6, 2020 at 4:52 PM Tyson Hamilton  wrote:

> How did the experiment go? The load status graphs on the executors seem to
> show no increase in queued jobs [1]. There is a periodic bump every 6h,
> possibly due to cron firing off a bunch at the same time, which can also be
> seen by changing the timespan in [1]. The number of executors on 1/4 of the
> nodes was also increased, so the combination of these things makes contention
> appear quite low or even non-existent.
>
> [1]: https://ci-beam.apache.org/label/beam/load-statistics?type=hour
>
> On Mon, Jun 29, 2020 at 9:17 AM Tobiasz Kędzierski <
> tobiasz.kedzier...@polidea.com> wrote:
>
>> Hi
>>
>> Agree with Ahmet that even in that shape it should improve the queue
>> length. Both _Commit/_Phrase cross-cancelling and a "cancel all" phrase seem
>> to require much effort, and I doubt it's worth doing.
>>
>> I will turn on `Cancel build on update` in ghprb-plugin on
>> ci-beam.apache.org tomorrow (Tuesday).
>>
>> Some discussions related to job filtering issue (or feature) in
>> ghprb-plugin:
>> https://github.com/jenkinsci/ghprb-plugin/issues/678
>> https://github.com/jenkinsci/ghprb-plugin/pull/680
>>
>> BR
>> Tobiasz
>>
>> On Fri, Jun 26, 2020 at 2:07 AM Ahmet Altay  wrote:
>>
>>>
>>>
>>> On Thu, Jun 25, 2020 at 4:27 PM Tobiasz Kędzierski <
>>> tobiasz.kedzier...@polidea.com> wrote:
>>>
 Andrew thanks for great analysis +1
 This bug with job filtering seems to be crucial to keep _Commit and
 _Phrase separate.

 I was considering the situation where the two PRs with the same commit
 hash are created. I created an artificial situation where two branches are
 identical and then two PRs with them. Two separate jobs were triggered. As
 you wrote, due to the matching GH status by job name and hash, both PR
 statuses were pointing to the same job (the newest one, which was wrong for
 one PR). As I tested, adding a new commit which will cancel the previous
 build would show false status on the PR with the previously wrong job link.
 It is possible to reproduce it, but could you give the real life
 situation where two jobs would be triggered with the same commit?
 I am asking because I think that enabling `Cancel build on update` may
 greatly improve Jenkins queue and it would be worthwhile to sacrifice this
 rare and unlikely case for it (if it is rare and unlikely case of course).

>>>
>>> I agree with this.
>>>
>>>

 Ahmet, I think the cancelling _Commit build by following _Phrase should
 be handled within ghprb-plugin if possible. I am not sure if we can make
 some workaround. Do you have any suggestions how we may solve it?

>>>
>>> I do not know jenkins enough to be able to make a good suggestion. We
>>> can try:
>>> - If it is possible to do this with ghprb plugin as you suggested, we
>>> can do that.
>>> - If not, we can make _Commit jobs cancel _Commit jobs only and _Phrase
>>> jobs cancel _Phrase jobs only. It will still be an improvement.
>>>
>>>

 BR
 Tobiasz

 On Wed, Jun 24, 2020 at 12:28 AM Kenneth Knowles 
 wrote:

> +1 to Andrew's analysis
>
> On Tue, Jun 23, 2020 at 12:13 PM Ahmet Altay  wrote:
>
>> Would it be possible to cancel any running _Phrase or _Commit
>> variants, if either one of them is triggered?
>>
>> On Tue, Jun 23, 2020 at 10:41 AM Andrew Pilloud 
>> wrote:
>>
>>> I believe we split _Commit and _Phrase to work around a bug with job
>>> filtering. For example, when you make a python change only the python 
>>> tests
>>> are run based on the commit. We still want to be able to run the java 
>>> jobs
>>> by trigger phrase if needed. There are also performance tests (Nexmark 
>>> for
>>> example) that have different jobs to ensure PR runs don't end up 
>>> published
>>> in the performance dashboard, but I think those have a split of _Phrase 
>>> and
>>> _Cron.
>>>
>>> As for canceling jobs, don't forget that the github status APIs are
>>> keyed on commit hash and job name (not PR). It is possible for a commit 
>>> to
>>> be on multiple PRs and it is possible for a single PR to have
>>> multiple commits. There are workflows that will be broken if you are 
>>> keying
>>> off of a PR to automatically cancel jobs.
>>>
>>> On Tue, Jun 23, 2020 at 9:59 AM Tyson Hamilton 
>>> wrote:
>>>
 +1 the ability to cancel in-flight jobs is worth deduplicating
 _Phrase and _Commit. I don't see a benefit for having both.

 On Tue, Jun 23, 2020 at 9:02 AM Luke Cwik  wrote: