Re: Flink: Lost pane timing at some steps of pipeline

2020-05-04 Thread David Morávek
Hi Jozef, I think this is expected beahior as Flink does not use default
expansion for Reshuffle (uses round-robin rebalance ship strategy instead).
There is no aggregation that needs buffering (and triggering). All of the
elements are immediately emmited to downstream operations after the
Reshuffle.

In case of direct runner, this is just a side-effect of Reshuffle
expansion. See
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Reshuffle.java#L69
for more details.

I don't think we should expect Reshuffle to have the same semantics as GBK,
because it's only an performance optimization steps, that should not have
any effect to pipeline's overall result. Some runners may also completely
ignore this step as part of execution plan optimization process (eg. two
reshuffles in a row are idempotent). (
https://issues.apache.org/jira/browse/BEAM-9824)

D.

On Mon, May 4, 2020 at 2:48 PM Jozef Vilcek  wrote:

> I have a pipeline which
>
> 1. Read from KafkaIO
> 2. Does stuff with events and writes windowed file via FileIO
> 3. Apply statefull DoFn on written files info
>
> The statefull DoFn does some logic which depends on PaneInfo.Timing, if it
> is EARLY or something else. When testing in DirectRunner, all is good. But
> with FlinkRunner, panes are always NO_FIRING.
>
> To demonstrate this, here is a dummy test pipeline:
>
> val testStream = sc.testStream(testStreamOf[String]
>   .advanceWatermarkTo(new Instant(1))
>   .addElements(goodMessage, goodMessage)
>   .advanceWatermarkTo(new Instant(2))
>   .addElements(goodMessage, goodMessage)
>   .advanceWatermarkTo(new Instant(200))
>   .addElements(goodMessage, goodMessage)
>   .advanceWatermarkToInfinity())
>
> testStream
>   .withFixedWindows(
> duration = Duration.standardSeconds(1),
> options = WindowOptions(
>   trigger = AfterWatermark.pastEndOfWindow()
> .withEarlyFirings(AfterPane.elementCountAtLeast(1))
> .withLateFirings(AfterPane.elementCountAtLeast(1)),
>   accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES,
>   allowedLateness = Duration.standardDays(1)
> ))
>   .keyBy(_ => "static_key")
>   .withPaneInfo
>   .map { case (element, paneInfo) =>
> println(s"#1 - $paneInfo")
> element
>   }
>   //.groupByKey // <- need to uncomment this for Flink to work
>   .applyTransform(Reshuffle.viaRandomKey())
>   .withPaneInfo
>   .map { case (element, paneInfo) =>
> println(s"#2 - $paneInfo")
> element
>   }
>
> When executed with DirectRunner, #1 prints pane with UNKNOWN timing and #2
> with EARLY, which is what I expect. When run with Flink runner, both #1 and
> #2 writes UNKNOWN timing from PaneInfo.NO_FIRING. Only if I add extra GBK,
> then #2 writes panes with EARLY timing.
>
> This is run on Beam 2.19. I was trying to analyze where could be a problem
> but got lost. I will be happy for any suggestions or pointers. Does it
> sounds like bug or am I doing something wrong?
>


Re: GSoC 2020: Congratulations, your proposal with The Apache Software Foundation has been accepted!

2020-05-04 Thread Kai Jiang
Congratulations!

On Mon, May 4, 2020 at 8:07 PM John Mora  wrote:

> Hi all.
>
> My proposal for GSoC was accepted, so this summer I will be working with
> you guys in the aggregation analytics functionality of Beam. Thanks so much
> for your support during the application period, specially to my mentor Rui
> Wang.
>
> Please let me know if you have suggestions or ideas for my project.
>
> Cheers,
> John
>
> -- Forwarded message -
> De: Google Summer of Code 
> Date: lun., 4 may. 2020 a las 12:53
> Subject: GSoC 2020: Congratulations, your proposal with The Apache
> Software Foundation has been accepted!
> To: 
>
>
> [image: Google Summer of Code]
>
> Hi John Mora,
>
> Your proposal BeamSQL aggregation analytics functionality
> 
> has been accepted!
>
> Welcome to GSoC 2020!
>
> We look forward to seeing the great things you will accomplish this summer
> with The Apache Software Foundation.
>
> The next thing you need to do is read the Information for Accepted
> Students
> .
> It contains important information you need to know about your participation
> in GSoC 2020.
>
> You will receive another email in the next few days with information about
> your stipend.
>
> If you have any questions, please email the Google Summer of Code support
> team at gsoc-supp...@google.com.
>
> Have a great summer!
>
> -*Google Summer of Code team*
>
> This email was sent to jhnmora...@gmail.com.
>
> You are receiving this email because of your participation in Google
> Summer of Code 2020.
> https://summerofcode.withgoogle.com
>
> To leave the program and stop receiving all emails, you can go to your
> profile  and
> request deletion of your program profile.
>
> For any questions, please contact gsoc-supp...@google.com. Replies to
> this message go to an unmonitored mailbox.
>
> © 2020 Google LLC, 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA
>


Fwd: GSoC 2020: Congratulations, your proposal with The Apache Software Foundation has been accepted!

2020-05-04 Thread John Mora
Hi all.

My proposal for GSoC was accepted, so this summer I will be working with
you guys in the aggregation analytics functionality of Beam. Thanks so much
for your support during the application period, specially to my mentor Rui
Wang.

Please let me know if you have suggestions or ideas for my project.

Cheers,
John

-- Forwarded message -
De: Google Summer of Code 
Date: lun., 4 may. 2020 a las 12:53
Subject: GSoC 2020: Congratulations, your proposal with The Apache Software
Foundation has been accepted!
To: 


[image: Google Summer of Code]

Hi John Mora,

Your proposal BeamSQL aggregation analytics functionality

has been accepted!

Welcome to GSoC 2020!

We look forward to seeing the great things you will accomplish this summer
with The Apache Software Foundation.

The next thing you need to do is read the Information for Accepted Students
. It
contains important information you need to know about your participation in
GSoC 2020.

You will receive another email in the next few days with information about
your stipend.

If you have any questions, please email the Google Summer of Code support
team at gsoc-supp...@google.com.

Have a great summer!

-*Google Summer of Code team*

This email was sent to jhnmora...@gmail.com.

You are receiving this email because of your participation in Google Summer
of Code 2020.
https://summerofcode.withgoogle.com

To leave the program and stop receiving all emails, you can go to your
profile  and
request deletion of your program profile.

For any questions, please contact gsoc-supp...@google.com. Replies to this
message go to an unmonitored mailbox.

© 2020 Google LLC, 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA


Re: [REVIEW][please pause website changes] Migrated the Beam website to Hugo

2020-05-04 Thread Ahmet Altay
On Mon, May 4, 2020 at 6:58 PM Aizhamal Nurmamat kyzy 
wrote:

> Thanks everyone for your feedback and support with the review. Please add
> any other comments so we can address them soon, if not please share your
> LGTMs.
>
> @Robert, thanks for separating the PR!
>
> @Thomas, regarding your question "There are some changes missing though
> (for example [2]), are you planning to add more recent commits later?" -
> yes, after merging the PR we will update all of the recent changes that are
> missing.
>

If we are going to do this, could people continue editing website after
Wednesday if PR is still not merged?


>
> @Nam Bui  , can we look into using the feature from
> this PR [1] that Brian mentioned to keep dates in blog post file names?
>
> @everyone, Nam also had a question regarding staging functionality - it
> keeps showing the errors like below:
>
> RAT ("Run RAT PreCommit") — FAILURE
>

There are files without the license header. We either need to modify RAT
config or add license headers. See for example:

*13:01:08*   Printing headers for text files without a valid
license header...*13:01:08*   *13:01:08*
=*13:01:08*
== File: 
/home/jenkins/jenkins-slave/workspace/beam_PreCommit_RAT_Commit/src/website/www/site/static/js/bootstrap.min.js



> Website_Stage_GCS ("Run Website_Stage_GCS PreCommit") — FAILURE
>

Some links are failing with 404 errors. We need to update those urls.
Example:

*13:22:47* [=>  ] 1366
/ 1397  curl: (22) The requested URL returned error: 404 *13:22:47*
*13:22:47* 
https://www.talend.com/blog/2017/01/13/future-apache-beam-now-top-level-apache-software-foundation-project/*13:22:50*
[=>  ] 1367 / 1397
curl: (22) The requested URL returned error: 404 *13:22:50* *13:22:50*
https://www.talend.com/blog/2017/01/23/apache-beam-way-greater-data-agility/?utm_medium=socialpost_source=twitter_campaign=blog



> Website_Stage_GCS ("Run Website_Stage_GCS PreCommit") — FAILURE
>

Probably the same as above. Take a look at the logs, they usually have
sufficient information.


>
> The staging is working, but the jobs show up as failed. Does anyone have
> an idea what the failures are related to and how we can fix it?
>
> [1] https://github.com/gohugoio/hugo/pull/4494
>
>
>
> On Mon, May 4, 2020 at 6:30 PM Robert Bradshaw 
> wrote:
>
>> I took the massive commit and split it up into:
>>
>> (1) Infrastructure changes (basically everything outside of
>> (website/www/site/content)
>> (2) Sed script changes, and
>> (3) Manual changes (everything not in (1) and (2)).
>>
>
Thank you Robert. This makes it much easier. What is the source of the sed
script? I am not sure why some of those lines are there. It would be much
easier for us to comment on the script source if it is reviewable somewhere.


>
>> It does seem that (3) has a number of unintentional changes, some
>> stylistic (e.g. lost of removal of end-of-file newlines) and some
>> actual content that's not up to date. This cuts down the number of
>> lines to be reviewed by more than half (and, notably, the more
>> substantial ones).
>>
>> [1]
>> https://github.com/apache/beam/pull/11608/commits/1bcf519a0f041607dfa401f167164301acbca2ac
>> 72 files changed, 3546 insertions(+), 1472 deletions(-)
>> [2]
>> https://github.com/apache/beam/pull/11608/commits/8b9f488c519b97a11ca4c7e3b644bb9ffe12cb98
>> 252 files changed, 4136 insertions(+), 4684 deletions(-)
>> [3]
>> https://github.com/apache/beam/pull/11608/commits/f9d8bc13a0fda0a60a436aa56186139d0f71de4e
>> 228 files changed, 1859 insertions(+), 2370 deletions(-)
>>
>> I also separated out the compatibility matrix move, which was ~1700
>> lines.
>> https://github.com/apache/beam/pull/11608/commits/16516d036af047493445654d61940dea8d04eaaa
>>
>> On Mon, May 4, 2020 at 6:15 PM Robert Bradshaw 
>> wrote:
>> >
>> > On Mon, May 4, 2020 at 6:02 PM Thomas Weise 
>> wrote:
>> > >
>> > > I took a brief look at [1] and looks good overall.
>> > >
>> > > There are some changes missing though (for example [2]), are you
>> planning to add more recent commits later?
>> > >
>> > > Also, there was an earlier question from Brian regarding the
>> possibility to retain the post dates in blog file names. I would second
>> that, it would make the posts significantly easier to navigate.
>> >
>> > I'm OK with removing them from the URL if they're to distracting, but
>> > generally agree here. (If it's too difficult, it's not a huge issue.)
>> >
>> > > [1]
>> http://apache-beam-website-pull-requests.storage.googleapis.com/11554/index.html
>> > > [2]
>> http://apache-beam-website-pull-requests.storage.googleapis.com/11554/documentation/runners/flink/index.html
>> > >
>> > > Thomas
>> > >
>> > > On Mon, May 4, 2020 at 11:06 AM Hannah Jiang 
>> wrote:
>> > >>
>> > >> Hi Aizhamal,, yes, Wednesday sounds good to me. Thank you.
>> > >>
>> > >>
>> > >> On Mon, May 4, 2020 at 

Re: [DISCUSS] How many Python 3.x minor versions should Beam Python SDK aim to support concurrently?

2020-05-04 Thread Yoshiki Obata
Thank you for comment, Valentyn.

> 1) We can seed the smoke test suite with typehints tests, and add more tests 
> later if there is a need. We can identify them by the file path or by special 
> attributes in test files. Identifying them using filepath seems simpler and 
> independent of test runner.

Yes, making run_pylint.sh allow target test file paths as arguments is
good way if could.

> 3)  We should reduce the code duplication across  
> beam/sdks/python/test-suites/$runner/py3*. I think we could move the suite 
> definition into a common file like 
> beam/sdks/python/test-suites/$runner/build.gradle perhaps, and populate 
> individual suites (beam/sdks/python/test-suites/$runner/py38/build.gradle) 
> including the common file and/or logic from PythonNature [1].

Exactly. I'll check it out.

> 4) We have some tests that we run only under specific Python 3 versions, for 
> example: FlinkValidatesRunner test runs using Python 3.5: [2]
> HDFS Python 3 tests are running only with Python 3.7 [3]. Cross-language Py3 
> tests for Spark are running under Python 3.5[4]: , there may be more test 
> suites that selectively use particular versions.
> We need to correct such suites, so that we do not tie them  to a specific 
> Python version. I see several options here: such tests should run either for 
> all high-priority versions, or run only under the lowest version among the 
> high-priority versions.  We don't have to fix them all at the same time. In 
> general, we should try to make it as easy as possible to configure, whether a 
> suite runs across all  versions, all high-priority versions, or just one 
> version.

The way of high-priority/low-priority configuration would be useful for this.
And which versions to be tested may be related to 5).

> 5) If postcommit suites (that need to run against all versions) still 
> constitute too much load on the infrastructure, we may need to investigate 
> how to run these suites less frequently.

That's certainly true, beam_PostCommit_PythonXX and
beam_PostCommit_Python_Chicago_Taxi_(Dataflow|Flink) take about 1
hour.
Does anyone have knowledge about this?

2020年5月2日(土) 5:18 Valentyn Tymofieiev :
>
> Hi Yoshiki,
>
> Thanks a lot for your help with Python 3 support so far and most recently, 
> with your work on Python 3.8.
>
> Overall the proposal sounds good to me. I see several aspects here that we 
> need to address:
>
> 1) We can seed the smoke test suite with typehints tests, and add more tests 
> later if there is a need. We can identify them by the file path or by special 
> attributes in test files. Identifying them using filepath seems simpler and 
> independent of test runner.
>
> 2) Defining high priority/low priority versions in gradle.properties sounds 
> good to me.
>
> 3)  We should reduce the code duplication across  
> beam/sdks/python/test-suites/$runner/py3*. I think we could move the suite 
> definition into a common file like 
> beam/sdks/python/test-suites/$runner/build.gradle perhaps, and populate 
> individual suites (beam/sdks/python/test-suites/$runner/py38/build.gradle) 
> including the common file and/or logic from PythonNature [1].
>
> 4) We have some tests that we run only under specific Python 3 versions, for 
> example: FlinkValidatesRunner test runs using Python 3.5: [2]
> HDFS Python 3 tests are running only with Python 3.7 [3]. Cross-language Py3 
> tests for Spark are running under Python 3.5[4]: , there may be more test 
> suites that selectively use particular versions.
>
> We need to correct such suites, so that we do not tie them  to a specific 
> Python version. I see several options here: such tests should run either for 
> all high-priority versions, or run only under the lowest version among the 
> high-priority versions.  We don't have to fix them all at the same time. In 
> general, we should try to make it as easy as possible to configure, whether a 
> suite runs across all  versions, all high-priority versions, or just one 
> version.
>
> 5) If postcommit suites (that need to run against all versions) still 
> constitute too much load on the infrastructure, we may need to investigate 
> how to run these suites less frequently.
>
> [1] 
> https://github.com/apache/beam/blob/b78c7ed4836e44177a149155581cfa8188e8f748/sdks/python/test-suites/portable/py37/build.gradle#L19-L20
> [2] 
> https://github.com/apache/beam/blob/93181e792f648122d3b4a5080d683f21c6338132/.test-infra/jenkins/job_PostCommit_Python35_ValidatesRunner_Flink.groovy#L34
> [3] 
> https://github.com/apache/beam/blob/93181e792f648122d3b4a5080d683f21c6338132/sdks/python/test-suites/direct/py37/build.gradle#L58
> [4] 
> https://github.com/apache/beam/blob/93181e792f648122d3b4a5080d683f21c6338132/.test-infra/jenkins/job_PostCommit_CrossLanguageValidatesRunner_Spark.groovy#L44
>
> On Fri, May 1, 2020 at 8:42 AM Yoshiki Obata  wrote:
>>
>> Hello everyone.
>>
>> I'm working on Python 3.8 support[1] and now is the time for preparing
>> test infrastructure.
>> 

Re: [REVIEW][please pause website changes] Migrated the Beam website to Hugo

2020-05-04 Thread Aizhamal Nurmamat kyzy
Thanks everyone for your feedback and support with the review. Please add
any other comments so we can address them soon, if not please share your
LGTMs.

@Robert, thanks for separating the PR!

@Thomas, regarding your question "There are some changes missing though
(for example [2]), are you planning to add more recent commits later?" -
yes, after merging the PR we will update all of the recent changes that are
missing.

@Nam Bui  , can we look into using the feature from
this PR [1] that Brian mentioned to keep dates in blog post file names?

@everyone, Nam also had a question regarding staging functionality - it
keeps showing the errors like below:

RAT ("Run RAT PreCommit") — FAILURE
Website_Stage_GCS ("Run Website_Stage_GCS PreCommit") — FAILURE
Website_Stage_GCS ("Run Website_Stage_GCS PreCommit") — FAILURE

The staging is working, but the jobs show up as failed. Does anyone have an
idea what the failures are related to and how we can fix it?

[1] https://github.com/gohugoio/hugo/pull/4494



On Mon, May 4, 2020 at 6:30 PM Robert Bradshaw  wrote:

> I took the massive commit and split it up into:
>
> (1) Infrastructure changes (basically everything outside of
> (website/www/site/content)
> (2) Sed script changes, and
> (3) Manual changes (everything not in (1) and (2)).
>
> It does seem that (3) has a number of unintentional changes, some
> stylistic (e.g. lost of removal of end-of-file newlines) and some
> actual content that's not up to date. This cuts down the number of
> lines to be reviewed by more than half (and, notably, the more
> substantial ones).
>
> [1]
> https://github.com/apache/beam/pull/11608/commits/1bcf519a0f041607dfa401f167164301acbca2ac
> 72 files changed, 3546 insertions(+), 1472 deletions(-)
> [2]
> https://github.com/apache/beam/pull/11608/commits/8b9f488c519b97a11ca4c7e3b644bb9ffe12cb98
> 252 files changed, 4136 insertions(+), 4684 deletions(-)
> [3]
> https://github.com/apache/beam/pull/11608/commits/f9d8bc13a0fda0a60a436aa56186139d0f71de4e
> 228 files changed, 1859 insertions(+), 2370 deletions(-)
>
> I also separated out the compatibility matrix move, which was ~1700
> lines.
> https://github.com/apache/beam/pull/11608/commits/16516d036af047493445654d61940dea8d04eaaa
>
> On Mon, May 4, 2020 at 6:15 PM Robert Bradshaw 
> wrote:
> >
> > On Mon, May 4, 2020 at 6:02 PM Thomas Weise 
> wrote:
> > >
> > > I took a brief look at [1] and looks good overall.
> > >
> > > There are some changes missing though (for example [2]), are you
> planning to add more recent commits later?
> > >
> > > Also, there was an earlier question from Brian regarding the
> possibility to retain the post dates in blog file names. I would second
> that, it would make the posts significantly easier to navigate.
> >
> > I'm OK with removing them from the URL if they're to distracting, but
> > generally agree here. (If it's too difficult, it's not a huge issue.)
> >
> > > [1]
> http://apache-beam-website-pull-requests.storage.googleapis.com/11554/index.html
> > > [2]
> http://apache-beam-website-pull-requests.storage.googleapis.com/11554/documentation/runners/flink/index.html
> > >
> > > Thomas
> > >
> > > On Mon, May 4, 2020 at 11:06 AM Hannah Jiang 
> wrote:
> > >>
> > >> Hi Aizhamal,, yes, Wednesday sounds good to me. Thank you.
> > >>
> > >>
> > >> On Mon, May 4, 2020 at 10:40 AM Aizhamal Nurmamat kyzy <
> aizha...@apache.org> wrote:
> > >>>
> > >>> Hannah,
> > >>>
> > >>> We don't have an exact date, but we are hoping to address all the
> comments and merge the PR by Wednesday. Will it be possible for you to wait
> until then?
> > >>>
> > >>> On Thu, Apr 30, 2020 at 4:29 PM Hannah Jiang 
> wrote:
> > >
> > > Since we want to move forward with the PR, I would like to ask the
> community to hold off changes to the current Beam website for a week, until
> we are able to review and merge the PR. Is this acceptable to everyone?
> > 
> >  Do we have an exact date when we can push changes to the website? I
> have PRs to update documents so would like to plan ahead.
> > 
> >  On Thu, Apr 30, 2020 at 1:17 PM Nam Bui 
> wrote:
> > >
> > > Hey guys,
> > >
> > > I tried my best to handle renamed files in Git. I have no clue why
> GitHub doesn't show it, but finally, I made this commit [1] (thanks for
> your idea @bhulette) so you guys can review changes with ease (there is no
> bunch of deleted markdown files anymore :D). Also, new staged version is
> deployed, you could check it out [2].
> > >
> > > In case you are interested in translation, here is the proof of
> concept [3] (the earth icon on the right corner is temporarily used for
> switching languages). You can take a look at the translation guide for this
> PoC [4].
> > >
> > > [1]
> https://github.com/apache/beam/pull/11554/commits/b267bb360866a723ac2536f408f23de648c7cd4d
> > > [2]
> http://apache-beam-website-pull-requests.storage.googleapis.com/11554/index.html
> > > [3] 

Re: [REVIEW][please pause website changes] Migrated the Beam website to Hugo

2020-05-04 Thread Robert Bradshaw
I took the massive commit and split it up into:

(1) Infrastructure changes (basically everything outside of
(website/www/site/content)
(2) Sed script changes, and
(3) Manual changes (everything not in (1) and (2)).

It does seem that (3) has a number of unintentional changes, some
stylistic (e.g. lost of removal of end-of-file newlines) and some
actual content that's not up to date. This cuts down the number of
lines to be reviewed by more than half (and, notably, the more
substantial ones).

[1] 
https://github.com/apache/beam/pull/11608/commits/1bcf519a0f041607dfa401f167164301acbca2ac
72 files changed, 3546 insertions(+), 1472 deletions(-)
[2] 
https://github.com/apache/beam/pull/11608/commits/8b9f488c519b97a11ca4c7e3b644bb9ffe12cb98
252 files changed, 4136 insertions(+), 4684 deletions(-)
[3] 
https://github.com/apache/beam/pull/11608/commits/f9d8bc13a0fda0a60a436aa56186139d0f71de4e
228 files changed, 1859 insertions(+), 2370 deletions(-)

I also separated out the compatibility matrix move, which was ~1700
lines. 
https://github.com/apache/beam/pull/11608/commits/16516d036af047493445654d61940dea8d04eaaa

On Mon, May 4, 2020 at 6:15 PM Robert Bradshaw  wrote:
>
> On Mon, May 4, 2020 at 6:02 PM Thomas Weise  wrote:
> >
> > I took a brief look at [1] and looks good overall.
> >
> > There are some changes missing though (for example [2]), are you planning 
> > to add more recent commits later?
> >
> > Also, there was an earlier question from Brian regarding the possibility to 
> > retain the post dates in blog file names. I would second that, it would 
> > make the posts significantly easier to navigate.
>
> I'm OK with removing them from the URL if they're to distracting, but
> generally agree here. (If it's too difficult, it's not a huge issue.)
>
> > [1] 
> > http://apache-beam-website-pull-requests.storage.googleapis.com/11554/index.html
> > [2] 
> > http://apache-beam-website-pull-requests.storage.googleapis.com/11554/documentation/runners/flink/index.html
> >
> > Thomas
> >
> > On Mon, May 4, 2020 at 11:06 AM Hannah Jiang  wrote:
> >>
> >> Hi Aizhamal,, yes, Wednesday sounds good to me. Thank you.
> >>
> >>
> >> On Mon, May 4, 2020 at 10:40 AM Aizhamal Nurmamat kyzy 
> >>  wrote:
> >>>
> >>> Hannah,
> >>>
> >>> We don't have an exact date, but we are hoping to address all the 
> >>> comments and merge the PR by Wednesday. Will it be possible for you to 
> >>> wait until then?
> >>>
> >>> On Thu, Apr 30, 2020 at 4:29 PM Hannah Jiang  
> >>> wrote:
> >
> > Since we want to move forward with the PR, I would like to ask the 
> > community to hold off changes to the current Beam website for a week, 
> > until we are able to review and merge the PR. Is this acceptable to 
> > everyone?
> 
>  Do we have an exact date when we can push changes to the website? I have 
>  PRs to update documents so would like to plan ahead.
> 
>  On Thu, Apr 30, 2020 at 1:17 PM Nam Bui  wrote:
> >
> > Hey guys,
> >
> > I tried my best to handle renamed files in Git. I have no clue why 
> > GitHub doesn't show it, but finally, I made this commit [1] (thanks for 
> > your idea @bhulette) so you guys can review changes with ease (there is 
> > no bunch of deleted markdown files anymore :D). Also, new staged 
> > version is deployed, you could check it out [2].
> >
> > In case you are interested in translation, here is the proof of concept 
> > [3] (the earth icon on the right corner is temporarily used for 
> > switching languages). You can take a look at the translation guide for 
> > this PoC [4].
> >
> > [1] 
> > https://github.com/apache/beam/pull/11554/commits/b267bb360866a723ac2536f408f23de648c7cd4d
> > [2] 
> > http://apache-beam-website-pull-requests.storage.googleapis.com/11554/index.html
> > [3] https://safe-relation.surge.sh/
> > [4] 
> > https://github.com/PolideaInternal/beam/blob/website-develop/website/CONTRIBUTE.md#translation-guide
> >
> >
> > On Thu, Apr 30, 2020 at 7:24 PM Brian Hulette  
> > wrote:
> >>
> >> Changing the URLs is fine with me as long as the old urls will work 
> >> too.
> >>
> >> But do we need to change the filenames for the blog posts to 
> >> accomplish that? It's nice that the blog post markdown files start 
> >> with a date so they naturally sort chronologically. It looks like this 
> >> hugo PR [1] made it possible to extract date metadata and slug (i.e. 
> >> dataflow-python-sdk-is-now-public) separately from the filename.
> >>
> >> [1] https://github.com/gohugoio/hugo/pull/4494
> >>
> >> On Thu, Apr 30, 2020 at 10:06 AM Ahmet Altay  wrote:
> >>>
> >>>
> >>>
> >>> On Thu, Apr 30, 2020 at 9:55 AM Thomas Weise  wrote:
> 
>  For changed URLs, will previous URLs be mapped to avoid broken 
>  external links?
> >>>
> >>>
> >>> I believe the answer is yes 

Builtin IOs - Link to Java/Pydoc instead of code?

2020-05-04 Thread Pablo Estrada
Hi all,
I just noted that in our Built-in IOs page[1], we tend to link to the code
for the IOs that we mention.

I think it would be better to link to the Javadoc or the Pydoc for those
IOs instead. Thoughts?
Best
-P.

[1] https://beam.apache.org/documentation/io/built-in/


Re: [REVIEW][please pause website changes] Migrated the Beam website to Hugo

2020-05-04 Thread Robert Bradshaw
On Mon, May 4, 2020 at 6:02 PM Thomas Weise  wrote:
>
> I took a brief look at [1] and looks good overall.
>
> There are some changes missing though (for example [2]), are you planning to 
> add more recent commits later?
>
> Also, there was an earlier question from Brian regarding the possibility to 
> retain the post dates in blog file names. I would second that, it would make 
> the posts significantly easier to navigate.

I'm OK with removing them from the URL if they're to distracting, but
generally agree here. (If it's too difficult, it's not a huge issue.)

> [1] 
> http://apache-beam-website-pull-requests.storage.googleapis.com/11554/index.html
> [2] 
> http://apache-beam-website-pull-requests.storage.googleapis.com/11554/documentation/runners/flink/index.html
>
> Thomas
>
> On Mon, May 4, 2020 at 11:06 AM Hannah Jiang  wrote:
>>
>> Hi Aizhamal,, yes, Wednesday sounds good to me. Thank you.
>>
>>
>> On Mon, May 4, 2020 at 10:40 AM Aizhamal Nurmamat kyzy  
>> wrote:
>>>
>>> Hannah,
>>>
>>> We don't have an exact date, but we are hoping to address all the comments 
>>> and merge the PR by Wednesday. Will it be possible for you to wait until 
>>> then?
>>>
>>> On Thu, Apr 30, 2020 at 4:29 PM Hannah Jiang  wrote:
>
> Since we want to move forward with the PR, I would like to ask the 
> community to hold off changes to the current Beam website for a week, 
> until we are able to review and merge the PR. Is this acceptable to 
> everyone?

 Do we have an exact date when we can push changes to the website? I have 
 PRs to update documents so would like to plan ahead.

 On Thu, Apr 30, 2020 at 1:17 PM Nam Bui  wrote:
>
> Hey guys,
>
> I tried my best to handle renamed files in Git. I have no clue why GitHub 
> doesn't show it, but finally, I made this commit [1] (thanks for your 
> idea @bhulette) so you guys can review changes with ease (there is no 
> bunch of deleted markdown files anymore :D). Also, new staged version is 
> deployed, you could check it out [2].
>
> In case you are interested in translation, here is the proof of concept 
> [3] (the earth icon on the right corner is temporarily used for switching 
> languages). You can take a look at the translation guide for this PoC [4].
>
> [1] 
> https://github.com/apache/beam/pull/11554/commits/b267bb360866a723ac2536f408f23de648c7cd4d
> [2] 
> http://apache-beam-website-pull-requests.storage.googleapis.com/11554/index.html
> [3] https://safe-relation.surge.sh/
> [4] 
> https://github.com/PolideaInternal/beam/blob/website-develop/website/CONTRIBUTE.md#translation-guide
>
>
> On Thu, Apr 30, 2020 at 7:24 PM Brian Hulette  wrote:
>>
>> Changing the URLs is fine with me as long as the old urls will work too.
>>
>> But do we need to change the filenames for the blog posts to accomplish 
>> that? It's nice that the blog post markdown files start with a date so 
>> they naturally sort chronologically. It looks like this hugo PR [1] made 
>> it possible to extract date metadata and slug (i.e. 
>> dataflow-python-sdk-is-now-public) separately from the filename.
>>
>> [1] https://github.com/gohugoio/hugo/pull/4494
>>
>> On Thu, Apr 30, 2020 at 10:06 AM Ahmet Altay  wrote:
>>>
>>>
>>>
>>> On Thu, Apr 30, 2020 at 9:55 AM Thomas Weise  wrote:

 For changed URLs, will previous URLs be mapped to avoid broken 
 external links?
>>>
>>>
>>> I believe the answer is yes from Nam's response "For now, we keep the 
>>> old URLs working in terms of redirecting them". I very much agree that 
>>> this is very important and should work for all existing urls.
>>>



 On Thu, Apr 30, 2020 at 9:34 AM Aizhamal Nurmamat kyzy 
  wrote:
>
> Hi,
>
> To give a little more context regarding the URLs, the date should 
> still appear on the blog post, but not on the URL.
> For example, we'd have:
> https://beam.apache.org/beam/python/sdk/2016/02/25/python-sdk-now-public.html
>  become 
> https://beam.apache.org/blog/dataflow-python-sdk-is-now-public/.
>>>
>>>
>>> I am not a content marketer. IMO, this is a good change. In the past, a 
>>> few times, we edited dates on posts (e.g. a release date was entered 
>>> incorrectly) and we had to either have a mismatch between dates in the 
>>> url and the date in the blog, or change the url. This change 
>>> simplifies, by having date only in place (in content metadata).
>>>
>
>
> The blog posts would have a small header showing the title, author 
> and publish date. But the URL would not have it.
> Thoughts?
>
>
> On Thu, Apr 30, 2020 at 9:23 AM Nam Bui  wrote:
>>
>> Hi,
>>

Re: [REVIEW][please pause website changes] Migrated the Beam website to Hugo

2020-05-04 Thread Thomas Weise
I took a brief look at [1] and looks good overall.

There are some changes missing though (for example [2]), are you planning
to add more recent commits later?

Also, there was an earlier question from Brian regarding the possibility to
retain the post dates in blog file names. I would second that, it would
make the posts significantly easier to navigate.

[1]
http://apache-beam-website-pull-requests.storage.googleapis.com/11554/index.html
[2]
http://apache-beam-website-pull-requests.storage.googleapis.com/11554/documentation/runners/flink/index.html

Thomas

On Mon, May 4, 2020 at 11:06 AM Hannah Jiang  wrote:

> Hi Aizhamal,, yes, Wednesday sounds good to me. Thank you.
>
>
> On Mon, May 4, 2020 at 10:40 AM Aizhamal Nurmamat kyzy <
> aizha...@apache.org> wrote:
>
>> Hannah,
>>
>> We don't have an exact date, but we are hoping to address all the
>> comments and merge the PR by Wednesday. Will it be possible for you to wait
>> until then?
>>
>> On Thu, Apr 30, 2020 at 4:29 PM Hannah Jiang 
>> wrote:
>>
>>> Since we want to move forward with the PR, I would like to ask the
 community to hold off changes to the current Beam website for a week, until
 we are able to review and merge the PR. Is this acceptable to everyone?
>>>
>>> Do we have an exact date when we can push changes to the website? I have
>>> PRs to update documents so would like to plan ahead.
>>>
>>> On Thu, Apr 30, 2020 at 1:17 PM Nam Bui  wrote:
>>>
 Hey guys,

 I tried my best to handle renamed files in Git. I have no clue why
 GitHub doesn't show it, but finally, I made this commit [1] (thanks for
 your idea @bhulette) so you guys can review changes with ease (there is no
 bunch of deleted markdown files anymore :D). Also, new staged version is
 deployed, you could check it out [2].

 In case you are interested in translation, here is the proof of concept
 [3] (the earth icon on the right corner is temporarily used for switching
 languages). You can take a look at the translation guide for this PoC [4].

 [1]
 https://github.com/apache/beam/pull/11554/commits/b267bb360866a723ac2536f408f23de648c7cd4d
 [2]
 http://apache-beam-website-pull-requests.storage.googleapis.com/11554/index.html
 [3] https://safe-relation.surge.sh/
 [4]
 https://github.com/PolideaInternal/beam/blob/website-develop/website/CONTRIBUTE.md#translation-guide


 On Thu, Apr 30, 2020 at 7:24 PM Brian Hulette 
 wrote:

> Changing the URLs is fine with me as long as the old urls will work
> too.
>
> But do we need to change the filenames for the blog posts to
> accomplish that? It's nice that the blog post markdown files start with a
> date so they naturally sort chronologically. It looks like this hugo PR 
> [1]
> made it possible to extract date metadata and slug
> (i.e. dataflow-python-sdk-is-now-public) separately from the filename.
>
> [1] https://github.com/gohugoio/hugo/pull/4494
>
> On Thu, Apr 30, 2020 at 10:06 AM Ahmet Altay  wrote:
>
>>
>>
>> On Thu, Apr 30, 2020 at 9:55 AM Thomas Weise  wrote:
>>
>>> For changed URLs, will previous URLs be mapped to avoid broken
>>> external links?
>>>
>>
>> I believe the answer is yes from Nam's response "For now, we keep the
>> old URLs working in terms of redirecting them". I very much agree that 
>> this
>> is very important and should work for all existing urls.
>>
>>
>>>
>>>
>>> On Thu, Apr 30, 2020 at 9:34 AM Aizhamal Nurmamat kyzy <
>>> aizha...@apache.org> wrote:
>>>
 Hi,

 To give a little more context regarding the URLs, the date should
 still appear on the blog post, but not on the URL.
 For example, we'd have:

 https://beam.apache.org/beam/python/sdk/2016/02/25/python-sdk-now-public.html
 become
 https://beam.apache.org/blog/dataflow-python-sdk-is-now-public/.

>>>
>> I am not a content marketer. IMO, this is a good change. In the past,
>> a few times, we edited dates on posts (e.g. a release date was entered
>> incorrectly) and we had to either have a mismatch between dates in the 
>> url
>> and the date in the blog, or change the url. This change simplifies, by
>> having date only in place (in content metadata).
>>
>>
>>>
 The blog posts would have a small header showing the title, author
 and publish date. But the URL would not have it.
 Thoughts?


 On Thu, Apr 30, 2020 at 9:23 AM Nam Bui 
 wrote:

> Hi,
>
> @altay: Hey hey. Yeah, I didn't expect the baseUrl of staging
> version is "
> http://apache-beam-website-pull-requests.storage.googleapis.com/11554/;
> which also includes "/11554", and Hugo considers it as a path so it 
> breaks
> 

Re: [Proposal] Apache Beam Fn API - Histogram Style Metrics (Correct link this time)

2020-05-04 Thread Alex Amato
Thanks Ismaël :). Done

On Mon, May 4, 2020 at 3:59 PM Ismaël Mejía  wrote:

> Moving the short link to this thread
> https://s.apache.org/beam-histogram-metrics
>
> Alex can you add this link (and any other of your documents that may
> not be there) to
> https://cwiki.apache.org/confluence/display/BEAM/Design+Documents
>
>
> On Tue, May 5, 2020 at 12:51 AM Pablo Estrada  wrote:
> >
> > FYI +Boyuan Zhang worked on implementing a histogram metric that was
> performance-optimized into outer space for Python : ) - I don't recall if
> she ended up getting it merged, but it's worth looking at the work. I also
> remember Scott Wegner wrote the metrics for Java.
> >
> > Best
> > -P.
> >
> > On Mon, May 4, 2020 at 3:33 PM Alex Amato  wrote:
> >>
> >> Hello,
> >>
> >> I have created a proposal for Apache Beam FN API to support Histogram
> Style Metrics. Which defines a method to collect Histogram style metrics
> and pass them over the FN API.
> >>
> >> I would love to hear your feedback in order to improve this proposal,
> please let me know what you think. Thanks for taking a look :)
> >> Alex
>


Re: [Proposal] Apache Beam Fn API - Histogram Style Metrics (Correct link this time)

2020-05-04 Thread Ismaël Mejía
Moving the short link to this thread
https://s.apache.org/beam-histogram-metrics

Alex can you add this link (and any other of your documents that may
not be there) to
https://cwiki.apache.org/confluence/display/BEAM/Design+Documents


On Tue, May 5, 2020 at 12:51 AM Pablo Estrada  wrote:
>
> FYI +Boyuan Zhang worked on implementing a histogram metric that was 
> performance-optimized into outer space for Python : ) - I don't recall if she 
> ended up getting it merged, but it's worth looking at the work. I also 
> remember Scott Wegner wrote the metrics for Java.
>
> Best
> -P.
>
> On Mon, May 4, 2020 at 3:33 PM Alex Amato  wrote:
>>
>> Hello,
>>
>> I have created a proposal for Apache Beam FN API to support Histogram Style 
>> Metrics. Which defines a method to collect Histogram style metrics and pass 
>> them over the FN API.
>>
>> I would love to hear your feedback in order to improve this proposal, please 
>> let me know what you think. Thanks for taking a look :)
>> Alex


Re: [Proposal] Apache Beam Fn API - Histogram Style Metrics (Correct link this time)

2020-05-04 Thread Pablo Estrada
FYI +Boyuan Zhang  worked on implementing a histogram
metric that was performance-optimized into outer space for Python : ) - I
don't recall if she ended up getting it merged, but it's worth looking at
the work. I also remember Scott Wegner wrote the metrics for Java.

Best
-P.

On Mon, May 4, 2020 at 3:33 PM Alex Amato  wrote:

> Hello,
>
> I have created a proposal for Apache Beam FN API to support Histogram
> Style Metrics
> .
> Which defines a method to collect Histogram style metrics and pass them
> over the FN API.
>
> I would love to hear your feedback in order to improve this
> proposal, please let me know what you think. Thanks for taking a look :)
> Alex
>


[Proposal] Apache Beam Fn API - Histogram Style Metrics (Correct link this time)

2020-05-04 Thread Alex Amato
Hello,

I have created a proposal for Apache Beam FN API to support Histogram Style
Metrics
.
Which defines a method to collect Histogram style metrics and pass them
over the FN API.

I would love to hear your feedback in order to improve this
proposal, please let me know what you think. Thanks for taking a look :)
Alex


Re: [Proposal] Apache Beam Fn API - Histogram Style Metrics

2020-05-04 Thread Alex Amato
Sorry, wrong link. Let's close this thread and I'll send another...

On Mon, May 4, 2020 at 3:28 PM Pablo Estrada  wrote:

> Hi Alex!
> Thanks for the proposal. I've created
> https://s.apache.org/beam-histogram-metrics
>
> On Mon, May 4, 2020 at 2:44 PM Alex Amato  wrote:
>
>> Hello,
>>
>> I have created a proposal for Apache Beam FN API to support Histogram
>> Style Metrics
>> .
>> Which defines a method to collect Histogram style metrics and pass them
>> over the FN API.
>>
>> Also, I would appreciate it if someone could generate an s.apache.org
>> link for this document? Unless there is some way for me to do it myself.
>>
>> I would love to hear your feedback in order to improve this
>> proposal, please let me know what you think. Thanks for taking a look :)
>> Alex
>>
>


Re: [Proposal] Apache Beam Fn API - Histogram Style Metrics

2020-05-04 Thread Pablo Estrada
Hi Alex!
Thanks for the proposal. I've created
https://s.apache.org/beam-histogram-metrics

On Mon, May 4, 2020 at 2:44 PM Alex Amato  wrote:

> Hello,
>
> I have created a proposal for Apache Beam FN API to support Histogram
> Style Metrics
> .
> Which defines a method to collect Histogram style metrics and pass them
> over the FN API.
>
> Also, I would appreciate it if someone could generate an s.apache.org
> link for this document? Unless there is some way for me to do it myself.
>
> I would love to hear your feedback in order to improve this
> proposal, please let me know what you think. Thanks for taking a look :)
> Alex
>


[Proposal] Apache Beam Fn API - Histogram Style Metrics

2020-05-04 Thread Alex Amato
Hello,

I have created a proposal for Apache Beam FN API to support Histogram Style
Metrics
.
Which defines a method to collect Histogram style metrics and pass them
over the FN API.

Also, I would appreciate it if someone could generate an s.apache.org link
for this document? Unless there is some way for me to do it myself.

I would love to hear your feedback in order to improve this
proposal, please let me know what you think. Thanks for taking a look :)
Alex


Apache Beam application to Season of Docs 2020

2020-05-04 Thread Aizhamal Nurmamat kyzy
Hi all,

I have submitted the application to the Season of Docs program with the
project ideas we have developed last year [1]. I learnt about its deadline
a few hours ago and didn't want to miss it.

Feel free to add more project ideas (or edit the current ones) until May
7th.

If Beam gets approved, we will get 1 or 2 experienced technical writers to
help us document community processes or some Beam features. Is anyone else
willing to mentor for these projects?

[1] https://cwiki.apache.org/confluence/display/BEAM/Google+Season+of+Docs


Re: [DISCUSS] finishBundle once per window

2020-05-04 Thread Reuven Lax
I assume you are referring to elements output from finishBundle.

The problem is that the current window is an input to
WindowFn.assignWindows. The new window can depend on the timestamp, the
element itself, and the original window. I'm not sure how many users rely
on this, however it has been part of our public windowing API for a long
time, so I would guess that some users do use this in their WindowFns.

Reuven

On Mon, May 4, 2020 at 11:48 AM Jan Lukavský  wrote:

> There was a mention in some other thread, that in order to make user
> experience as predictable as possible, we should try to make windows
> idempotent, and once window is assigned, it should be never changed (and
> timestamp move outside of the scope of window, unless a different windowfn
> is applied). Because all Beam window functions are actually time based, and
> output timestamp is known, what is the issue of applying windowfn to
> elements output from @FinishBundle and assign the windows automatically?
> On 5/4/20 8:07 PM, Reuven Lax wrote:
>
> This should not affect the ability of the user to specify the output
> timestamp. Today FinishBundleContext.output forces you to specify the
> window as well as the timestamp, which is a bit awkward. (I believe that it
> also lets you create brand new windows in finishBundle, which is
> interesting, but I'm not quite sure of the use case).
>
> On Mon, May 4, 2020 at 10:29 AM Robert Bradshaw 
> wrote:
>
>> This is a really nice idea. Would the user still need to specify the
>> timestamp of the output? I'm a bit ambivalent about calling it
>> multiple times if OuptutReceiver alone is in the parameter list; this
>> might not be obvious and could be surprising behavior.
>>
>> On Mon, May 4, 2020 at 10:13 AM Reuven Lax  wrote:
>> >
>> > I would like to discuss a minor extension to the Beam model.
>> >
>> > Beam bundles have very few restrictions on what can be in a bundle, in
>> particular s bundle might contain records for many different windows. This
>> was an explicit decision as bundling primarily exists for performance
>> reasons and we found that limiting bundling based on windows or timestamps
>> often led to severe performance problems. However it sometimes makes
>> finishBundle hard to use.
>> >
>> > I've seen multiple cases where users maintain some state in their DoFn
>> that needs finalizing (e.g. writing to an external service) in
>> finishBundle. Often users end up keeping lists of all windows seen in the
>> bundle so they can be processed separately (or sometimes not realizing that
>> their can be multiple windows and writing incorrect code).
>> >
>> > The lack of a window also means that we don't currently support
>> injecting an OuptutReceiver into finishBundle, as there's no good way of
>> knowing which window output should be put into.
>> >
>> > I would like to propose adding a way for finishBundle to inspect the
>> window, either directly (via a BoundedWindow parameter) or indirectly (via
>> an OutputReceiver parameter). In this case, we will execute finishBundle
>> once per window in the bundle. Otherwise, we will execute finishBundle once
>> at the end of the bundle as before. This behavior is backwards compatible,
>> as previously these parameters were disallowed in finishBundle.
>> >
>> > Note that this is similar to something Beam already does in
>> processElement. A single element can exist in multiple windows, however if
>> the processElement "observes" the window then Beam will call processElement
>> once per window.
>> >
>> > In Java, the user code could look like this:
>> >
>> > DoFn<> {
>> >  ...
>> >@FinishBundle
>> >public void finishBundle(IntervalWindow window, OutputReceiver o)
>> {
>> >// This finishBundle will be called once per window in the
>> bundle since it has
>> >   // a parameter that observes the window.
>> >}
>> > }
>> >
>> > This PR shows an implementation of this extension for the Java SDK.
>> >
>> > Thoughts?
>> >
>> > Reuven
>>
>


Re: [PROPOSAL] Preparing for Beam 2.22.0 release

2020-05-04 Thread Luke Cwik
Thanks Brian.

On Mon, May 4, 2020 at 10:57 AM Kyle Weaver  wrote:

> Thanks Brian!
>
> I'd also like to remind everyone to update CHANGES.md with any important
> recent changes that might be missing.
> https://github.com/apache/beam/blob/master/CHANGES.md
>
> On Mon, May 4, 2020 at 1:25 PM Brian Hulette  wrote:
>
>> Hi all,
>>
>> The next (2.22.0) release branch cut is scheduled for May 20, according
>> to the calendar
>> 
>> .
>> I would like to volunteer myself to do this release.
>> The plan is to cut the branch on that date, and cherrypick release-blocking
>> fixes afterwards if any.
>>
>> Any unresolved release blocking JIRA issues for 2.22.0 should have their
>> "Fix Version/s" marked as "2.22.0".
>>
>> Any comments or objections?
>>
>> Brian
>>
>


Re: Exploding windows and FnApiDoFnRunner

2020-05-04 Thread Luke Cwik
Kenn, the optimization is not complex, just never done.

The FnApiDoFnRunner was rewritten to be designed with portability first and
to move away from the assumptions that were baked into the existing DoFn
"runner" implementations and the constructs used in the non-portable
implementation. There are many DoFn "runner" implementations that exist in
Java that are layered on top of each other to handle several special cases
which are also used by "system" DoFns as well.

On Mon, May 4, 2020 at 10:38 AM Robert Burke  wrote:

> Ack ok. Thank you for clarifying!
>  Confirming that Kenn is right, the optimization is pretty much that
> simple. [1] is where it's done in the Go SDK
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/exec/pardo.go#L136
>
> On Mon, May 4, 2020, 10:18 AM Reuven Lax  wrote:
>
>> I wonder how often we even implement this optimization today. If the
>> processElement has an OutputReceiver parameter then we mark it as
>> observesWindow, and that's a pretty common parameter.
>>
>> Arguably this is a bug in our implementation of OutputReceiver though -
>> it should be able to copy all the windows into the output element.
>>
>> Reuven
>>
>> On Mon, May 4, 2020 at 9:37 AM Kenneth Knowles  wrote:
>>
>>> Is the optimization complex in the Fn API context? In non-Fn API it is
>>> basically "if (observesWindow) { explode } else { don't }" [1]. The DoFn
>>> signature tells you everything you need. This might be a good first commit
>>> for someone looking to contribute to the Java SDK harness?
>>>
>>> Kenn
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/591de3473144de54beef0932131025e2a4d8504b/runners/core-java/src/main/java/org/apache/beam/runners/core/SimpleDoFnRunner.java#L133
>>>
>>> On Mon, May 4, 2020 at 9:33 AM Robert Bradshaw 
>>> wrote:
>>>
 In Python we only explode windows if the Window is being inspected.
 (There is no separate "DoFnRunner" for FnApi vs. Legacy execution.)

 On Mon, May 4, 2020 at 9:21 AM Luke Cwik  wrote:
 >
 > Reuven you are correct that the optimization has yet to be
 implemented.
 > Robert the FnApiDoFnRunner is the name of a Java class that executes
 Java DoFns in the Java SDK harness. The poor name choice is my fault.
 >
 > On Fri, May 1, 2020 at 9:14 PM Reuven Lax  wrote:
 >>
 >> FnApiDoFnRunner does run Java DoFns.
 >>
 >> On Fri, May 1, 2020 at 9:10 PM Robert Burke 
 wrote:
 >>>
 >>> In the Go SDK this optimization is handled on the SDK side, inthe
 pardo execution node not one the runner side of the FnAPI
 >>>
 >>> But i think I'm about to learn that FnApiDoFnRunner is something
 that runs on the Java SDK side rather than on the runner side, despite the
 name.
 >>>
 >>> On Fri, May 1, 2020, 9:02 PM Reuven Lax  wrote:
 
  Ah - so we don't implement the optimization of not expanding the
 windows if not necessary?
 
  On Fri, May 1, 2020 at 8:56 PM Luke Cwik  wrote:
 >
 > In all the processElementYYY methods the currentWindow is
 assigned as can be seen here as we loop over the set of windows:
 >
 https://github.com/apache/beam/blob/9bb2990c0f6c08dd33d9c6fa1fd91842c644a8e3/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/FnApiDoFnRunner.java#L738
 >
 > On Fri, May 1, 2020 at 8:51 PM Reuven Lax 
 wrote:
 >>
 >> In Beam a WindowedValue can can contain multiple windows,
 because an element can be in multiple windows at once (for example, sliding
 windows). Usually we keep these elements unexpanded, but if the user's doFn
 observes the window  then we have to "explode" the element out, and we run
 the process function once per window. e.g. if the process function looks
 like this
 >>
 >> @ProcessElement
 >> public void process(@Element T e, IntervalWindow w)
 >>
 >> In SimpleDoFnRunner we do this inside processElement. However I
 can't find the equivalent code in FnApiDoFnRunner. How does window
 explosion work in the portable runner?
 >>
 >> Reuven

>>>


Re: [DISCUSS] finishBundle once per window

2020-05-04 Thread Jan Lukavský
There was a mention in some other thread, that in order to make user 
experience as predictable as possible, we should try to make windows 
idempotent, and once window is assigned, it should be never changed (and 
timestamp move outside of the scope of window, unless a different 
windowfn is applied). Because all Beam window functions are actually 
time based, and output timestamp is known, what is the issue of applying 
windowfn to elements output from @FinishBundle and assign the windows 
automatically?


On 5/4/20 8:07 PM, Reuven Lax wrote:
This should not affect the ability of the user to specify the output 
timestamp. Today FinishBundleContext.output forces you to specify the 
window as well as the timestamp, which is a bit awkward. (I believe 
that it also lets you create brand new windows in finishBundle, which 
is interesting, but I'm not quite sure of the use case).


On Mon, May 4, 2020 at 10:29 AM Robert Bradshaw > wrote:


This is a really nice idea. Would the user still need to specify the
timestamp of the output? I'm a bit ambivalent about calling it
multiple times if OuptutReceiver alone is in the parameter list; this
might not be obvious and could be surprising behavior.

On Mon, May 4, 2020 at 10:13 AM Reuven Lax mailto:re...@google.com>> wrote:
>
> I would like to discuss a minor extension to the Beam model.
>
> Beam bundles have very few restrictions on what can be in a
bundle, in particular s bundle might contain records for many
different windows. This was an explicit decision as bundling
primarily exists for performance reasons and we found that
limiting bundling based on windows or timestamps often led to
severe performance problems. However it sometimes makes
finishBundle hard to use.
>
> I've seen multiple cases where users maintain some state in
their DoFn that needs finalizing (e.g. writing to an external
service) in finishBundle. Often users end up keeping lists of all
windows seen in the bundle so they can be processed separately (or
sometimes not realizing that their can be multiple windows and
writing incorrect code).
>
> The lack of a window also means that we don't currently support
injecting an OuptutReceiver into finishBundle, as there's no good
way of knowing which window output should be put into.
>
> I would like to propose adding a way for finishBundle to inspect
the window, either directly (via a BoundedWindow parameter) or
indirectly (via an OutputReceiver parameter). In this case, we
will execute finishBundle once per window in the bundle.
Otherwise, we will execute finishBundle once at the end of the
bundle as before. This behavior is backwards compatible, as
previously these parameters were disallowed in finishBundle.
>
> Note that this is similar to something Beam already does in
processElement. A single element can exist in multiple windows,
however if the processElement "observes" the window then Beam will
call processElement once per window.
>
> In Java, the user code could look like this:
>
> DoFn<> {
>      ...
>    @FinishBundle
>    public void finishBundle(IntervalWindow window,
OutputReceiver o) {
>        // This finishBundle will be called once per window in
the bundle since it has
>       // a parameter that observes the window.
>    }
> }
>
> This PR shows an implementation of this extension for the Java SDK.
>
> Thoughts?
>
> Reuven



Re: [DISCUSS] finishBundle once per window

2020-05-04 Thread Reuven Lax
This should not affect the ability of the user to specify the output
timestamp. Today FinishBundleContext.output forces you to specify the
window as well as the timestamp, which is a bit awkward. (I believe that it
also lets you create brand new windows in finishBundle, which is
interesting, but I'm not quite sure of the use case).

On Mon, May 4, 2020 at 10:29 AM Robert Bradshaw  wrote:

> This is a really nice idea. Would the user still need to specify the
> timestamp of the output? I'm a bit ambivalent about calling it
> multiple times if OuptutReceiver alone is in the parameter list; this
> might not be obvious and could be surprising behavior.
>
> On Mon, May 4, 2020 at 10:13 AM Reuven Lax  wrote:
> >
> > I would like to discuss a minor extension to the Beam model.
> >
> > Beam bundles have very few restrictions on what can be in a bundle, in
> particular s bundle might contain records for many different windows. This
> was an explicit decision as bundling primarily exists for performance
> reasons and we found that limiting bundling based on windows or timestamps
> often led to severe performance problems. However it sometimes makes
> finishBundle hard to use.
> >
> > I've seen multiple cases where users maintain some state in their DoFn
> that needs finalizing (e.g. writing to an external service) in
> finishBundle. Often users end up keeping lists of all windows seen in the
> bundle so they can be processed separately (or sometimes not realizing that
> their can be multiple windows and writing incorrect code).
> >
> > The lack of a window also means that we don't currently support
> injecting an OuptutReceiver into finishBundle, as there's no good way of
> knowing which window output should be put into.
> >
> > I would like to propose adding a way for finishBundle to inspect the
> window, either directly (via a BoundedWindow parameter) or indirectly (via
> an OutputReceiver parameter). In this case, we will execute finishBundle
> once per window in the bundle. Otherwise, we will execute finishBundle once
> at the end of the bundle as before. This behavior is backwards compatible,
> as previously these parameters were disallowed in finishBundle.
> >
> > Note that this is similar to something Beam already does in
> processElement. A single element can exist in multiple windows, however if
> the processElement "observes" the window then Beam will call processElement
> once per window.
> >
> > In Java, the user code could look like this:
> >
> > DoFn<> {
> >  ...
> >@FinishBundle
> >public void finishBundle(IntervalWindow window, OutputReceiver o) {
> >// This finishBundle will be called once per window in the bundle
> since it has
> >   // a parameter that observes the window.
> >}
> > }
> >
> > This PR shows an implementation of this extension for the Java SDK.
> >
> > Thoughts?
> >
> > Reuven
>


Re: [REVIEW][please pause website changes] Migrated the Beam website to Hugo

2020-05-04 Thread Hannah Jiang
Hi Aizhamal,, yes, Wednesday sounds good to me. Thank you.


On Mon, May 4, 2020 at 10:40 AM Aizhamal Nurmamat kyzy 
wrote:

> Hannah,
>
> We don't have an exact date, but we are hoping to address all the comments
> and merge the PR by Wednesday. Will it be possible for you to wait until
> then?
>
> On Thu, Apr 30, 2020 at 4:29 PM Hannah Jiang 
> wrote:
>
>> Since we want to move forward with the PR, I would like to ask the
>>> community to hold off changes to the current Beam website for a week, until
>>> we are able to review and merge the PR. Is this acceptable to everyone?
>>
>> Do we have an exact date when we can push changes to the website? I have
>> PRs to update documents so would like to plan ahead.
>>
>> On Thu, Apr 30, 2020 at 1:17 PM Nam Bui  wrote:
>>
>>> Hey guys,
>>>
>>> I tried my best to handle renamed files in Git. I have no clue why
>>> GitHub doesn't show it, but finally, I made this commit [1] (thanks for
>>> your idea @bhulette) so you guys can review changes with ease (there is no
>>> bunch of deleted markdown files anymore :D). Also, new staged version is
>>> deployed, you could check it out [2].
>>>
>>> In case you are interested in translation, here is the proof of concept
>>> [3] (the earth icon on the right corner is temporarily used for switching
>>> languages). You can take a look at the translation guide for this PoC [4].
>>>
>>> [1]
>>> https://github.com/apache/beam/pull/11554/commits/b267bb360866a723ac2536f408f23de648c7cd4d
>>> [2]
>>> http://apache-beam-website-pull-requests.storage.googleapis.com/11554/index.html
>>> [3] https://safe-relation.surge.sh/
>>> [4]
>>> https://github.com/PolideaInternal/beam/blob/website-develop/website/CONTRIBUTE.md#translation-guide
>>>
>>>
>>> On Thu, Apr 30, 2020 at 7:24 PM Brian Hulette 
>>> wrote:
>>>
 Changing the URLs is fine with me as long as the old urls will work too.

 But do we need to change the filenames for the blog posts to accomplish
 that? It's nice that the blog post markdown files start with a date so they
 naturally sort chronologically. It looks like this hugo PR [1] made it
 possible to extract date metadata and slug
 (i.e. dataflow-python-sdk-is-now-public) separately from the filename.

 [1] https://github.com/gohugoio/hugo/pull/4494

 On Thu, Apr 30, 2020 at 10:06 AM Ahmet Altay  wrote:

>
>
> On Thu, Apr 30, 2020 at 9:55 AM Thomas Weise  wrote:
>
>> For changed URLs, will previous URLs be mapped to avoid broken
>> external links?
>>
>
> I believe the answer is yes from Nam's response "For now, we keep the
> old URLs working in terms of redirecting them". I very much agree that 
> this
> is very important and should work for all existing urls.
>
>
>>
>>
>> On Thu, Apr 30, 2020 at 9:34 AM Aizhamal Nurmamat kyzy <
>> aizha...@apache.org> wrote:
>>
>>> Hi,
>>>
>>> To give a little more context regarding the URLs, the date should
>>> still appear on the blog post, but not on the URL.
>>> For example, we'd have:
>>>
>>> https://beam.apache.org/beam/python/sdk/2016/02/25/python-sdk-now-public.html
>>> become
>>> https://beam.apache.org/blog/dataflow-python-sdk-is-now-public/.
>>>
>>
> I am not a content marketer. IMO, this is a good change. In the past,
> a few times, we edited dates on posts (e.g. a release date was entered
> incorrectly) and we had to either have a mismatch between dates in the url
> and the date in the blog, or change the url. This change simplifies, by
> having date only in place (in content metadata).
>
>
>>
>>> The blog posts would have a small header showing the title, author
>>> and publish date. But the URL would not have it.
>>> Thoughts?
>>>
>>>
>>> On Thu, Apr 30, 2020 at 9:23 AM Nam Bui  wrote:
>>>
 Hi,

 @altay: Hey hey. Yeah, I didn't expect the baseUrl of staging
 version is "
 http://apache-beam-website-pull-requests.storage.googleapis.com/11554/;
 which also includes "/11554", and Hugo considers it as a path so it 
 breaks
 the path of "static files" (like images). We made a fix. Now I'm 
 working on
 "getting git to recognize files as renames" as you suggested.

 @robert: The dates are nice but it causes verbose/long/ugly URLs.
 We discussed with Aizhamal in the development stage and agreed to get 
 rid
 of this. For now, we keep the old URLs working in terms of redirecting
 them. However, from now on, we should change the name convention on 
 blog
 posts to have a fancy URL like "beam.apache.org/blog/myblogpost.md".
 :)



 On Thu, Apr 30, 2020 at 2:57 AM Robert Bradshaw <
 rober...@google.com> wrote:

> On Wed, Apr 29, 2020 at 5:08 PM Ahmet Altay 

Re: [PROPOSAL] Preparing for Beam 2.22.0 release

2020-05-04 Thread Kyle Weaver
Thanks Brian!

I'd also like to remind everyone to update CHANGES.md with any important
recent changes that might be missing.
https://github.com/apache/beam/blob/master/CHANGES.md

On Mon, May 4, 2020 at 1:25 PM Brian Hulette  wrote:

> Hi all,
>
> The next (2.22.0) release branch cut is scheduled for May 20, according to
> the calendar
> 
> .
> I would like to volunteer myself to do this release.
> The plan is to cut the branch on that date, and cherrypick release-blocking
> fixes afterwards if any.
>
> Any unresolved release blocking JIRA issues for 2.22.0 should have their
> "Fix Version/s" marked as "2.22.0".
>
> Any comments or objections?
>
> Brian
>


Re: [REVIEW][please pause website changes] Migrated the Beam website to Hugo

2020-05-04 Thread Aizhamal Nurmamat kyzy
Hannah,

We don't have an exact date, but we are hoping to address all the comments
and merge the PR by Wednesday. Will it be possible for you to wait until
then?

On Thu, Apr 30, 2020 at 4:29 PM Hannah Jiang  wrote:

> Since we want to move forward with the PR, I would like to ask the
>> community to hold off changes to the current Beam website for a week, until
>> we are able to review and merge the PR. Is this acceptable to everyone?
>
> Do we have an exact date when we can push changes to the website? I have
> PRs to update documents so would like to plan ahead.
>
> On Thu, Apr 30, 2020 at 1:17 PM Nam Bui  wrote:
>
>> Hey guys,
>>
>> I tried my best to handle renamed files in Git. I have no clue why GitHub
>> doesn't show it, but finally, I made this commit [1] (thanks for your
>> idea @bhulette) so you guys can review changes with ease (there is no bunch
>> of deleted markdown files anymore :D). Also, new staged version is
>> deployed, you could check it out [2].
>>
>> In case you are interested in translation, here is the proof of concept
>> [3] (the earth icon on the right corner is temporarily used for switching
>> languages). You can take a look at the translation guide for this PoC [4].
>>
>> [1]
>> https://github.com/apache/beam/pull/11554/commits/b267bb360866a723ac2536f408f23de648c7cd4d
>> [2]
>> http://apache-beam-website-pull-requests.storage.googleapis.com/11554/index.html
>> [3] https://safe-relation.surge.sh/
>> [4]
>> https://github.com/PolideaInternal/beam/blob/website-develop/website/CONTRIBUTE.md#translation-guide
>>
>>
>> On Thu, Apr 30, 2020 at 7:24 PM Brian Hulette 
>> wrote:
>>
>>> Changing the URLs is fine with me as long as the old urls will work too.
>>>
>>> But do we need to change the filenames for the blog posts to accomplish
>>> that? It's nice that the blog post markdown files start with a date so they
>>> naturally sort chronologically. It looks like this hugo PR [1] made it
>>> possible to extract date metadata and slug
>>> (i.e. dataflow-python-sdk-is-now-public) separately from the filename.
>>>
>>> [1] https://github.com/gohugoio/hugo/pull/4494
>>>
>>> On Thu, Apr 30, 2020 at 10:06 AM Ahmet Altay  wrote:
>>>


 On Thu, Apr 30, 2020 at 9:55 AM Thomas Weise  wrote:

> For changed URLs, will previous URLs be mapped to avoid broken
> external links?
>

 I believe the answer is yes from Nam's response "For now, we keep the
 old URLs working in terms of redirecting them". I very much agree that this
 is very important and should work for all existing urls.


>
>
> On Thu, Apr 30, 2020 at 9:34 AM Aizhamal Nurmamat kyzy <
> aizha...@apache.org> wrote:
>
>> Hi,
>>
>> To give a little more context regarding the URLs, the date should
>> still appear on the blog post, but not on the URL.
>> For example, we'd have:
>>
>> https://beam.apache.org/beam/python/sdk/2016/02/25/python-sdk-now-public.html
>> become
>> https://beam.apache.org/blog/dataflow-python-sdk-is-now-public/.
>>
>
 I am not a content marketer. IMO, this is a good change. In the past, a
 few times, we edited dates on posts (e.g. a release date was entered
 incorrectly) and we had to either have a mismatch between dates in the url
 and the date in the blog, or change the url. This change simplifies, by
 having date only in place (in content metadata).


>
>> The blog posts would have a small header showing the title, author
>> and publish date. But the URL would not have it.
>> Thoughts?
>>
>>
>> On Thu, Apr 30, 2020 at 9:23 AM Nam Bui  wrote:
>>
>>> Hi,
>>>
>>> @altay: Hey hey. Yeah, I didn't expect the baseUrl of staging
>>> version is "
>>> http://apache-beam-website-pull-requests.storage.googleapis.com/11554/;
>>> which also includes "/11554", and Hugo considers it as a path so it 
>>> breaks
>>> the path of "static files" (like images). We made a fix. Now I'm 
>>> working on
>>> "getting git to recognize files as renames" as you suggested.
>>>
>>> @robert: The dates are nice but it causes verbose/long/ugly URLs. We
>>> discussed with Aizhamal in the development stage and agreed to get rid 
>>> of
>>> this. For now, we keep the old URLs working in terms of redirecting 
>>> them.
>>> However, from now on, we should change the name convention on blog 
>>> posts to
>>> have a fancy URL like "beam.apache.org/blog/myblogpost.md". :)
>>>
>>>
>>>
>>> On Thu, Apr 30, 2020 at 2:57 AM Robert Bradshaw 
>>> wrote:
>>>
 On Wed, Apr 29, 2020 at 5:08 PM Ahmet Altay 
 wrote:

> Nam, this looks better. At least links are working, and the
> website visually looks similar and generally in good shape. I think 
> there
> are still issues. For example, I do not see any of the images (e.g. 
> the

Re: Exploding windows and FnApiDoFnRunner

2020-05-04 Thread Robert Burke
Ack ok. Thank you for clarifying!
 Confirming that Kenn is right, the optimization is pretty much that
simple. [1] is where it's done in the Go SDK

[1]
https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/exec/pardo.go#L136

On Mon, May 4, 2020, 10:18 AM Reuven Lax  wrote:

> I wonder how often we even implement this optimization today. If the
> processElement has an OutputReceiver parameter then we mark it as
> observesWindow, and that's a pretty common parameter.
>
> Arguably this is a bug in our implementation of OutputReceiver though - it
> should be able to copy all the windows into the output element.
>
> Reuven
>
> On Mon, May 4, 2020 at 9:37 AM Kenneth Knowles  wrote:
>
>> Is the optimization complex in the Fn API context? In non-Fn API it is
>> basically "if (observesWindow) { explode } else { don't }" [1]. The DoFn
>> signature tells you everything you need. This might be a good first commit
>> for someone looking to contribute to the Java SDK harness?
>>
>> Kenn
>>
>> [1]
>> https://github.com/apache/beam/blob/591de3473144de54beef0932131025e2a4d8504b/runners/core-java/src/main/java/org/apache/beam/runners/core/SimpleDoFnRunner.java#L133
>>
>> On Mon, May 4, 2020 at 9:33 AM Robert Bradshaw 
>> wrote:
>>
>>> In Python we only explode windows if the Window is being inspected.
>>> (There is no separate "DoFnRunner" for FnApi vs. Legacy execution.)
>>>
>>> On Mon, May 4, 2020 at 9:21 AM Luke Cwik  wrote:
>>> >
>>> > Reuven you are correct that the optimization has yet to be implemented.
>>> > Robert the FnApiDoFnRunner is the name of a Java class that executes
>>> Java DoFns in the Java SDK harness. The poor name choice is my fault.
>>> >
>>> > On Fri, May 1, 2020 at 9:14 PM Reuven Lax  wrote:
>>> >>
>>> >> FnApiDoFnRunner does run Java DoFns.
>>> >>
>>> >> On Fri, May 1, 2020 at 9:10 PM Robert Burke 
>>> wrote:
>>> >>>
>>> >>> In the Go SDK this optimization is handled on the SDK side, inthe
>>> pardo execution node not one the runner side of the FnAPI
>>> >>>
>>> >>> But i think I'm about to learn that FnApiDoFnRunner is something
>>> that runs on the Java SDK side rather than on the runner side, despite the
>>> name.
>>> >>>
>>> >>> On Fri, May 1, 2020, 9:02 PM Reuven Lax  wrote:
>>> 
>>>  Ah - so we don't implement the optimization of not expanding the
>>> windows if not necessary?
>>> 
>>>  On Fri, May 1, 2020 at 8:56 PM Luke Cwik  wrote:
>>> >
>>> > In all the processElementYYY methods the currentWindow is assigned
>>> as can be seen here as we loop over the set of windows:
>>> >
>>> https://github.com/apache/beam/blob/9bb2990c0f6c08dd33d9c6fa1fd91842c644a8e3/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/FnApiDoFnRunner.java#L738
>>> >
>>> > On Fri, May 1, 2020 at 8:51 PM Reuven Lax 
>>> wrote:
>>> >>
>>> >> In Beam a WindowedValue can can contain multiple windows, because
>>> an element can be in multiple windows at once (for example, sliding
>>> windows). Usually we keep these elements unexpanded, but if the user's doFn
>>> observes the window  then we have to "explode" the element out, and we run
>>> the process function once per window. e.g. if the process function looks
>>> like this
>>> >>
>>> >> @ProcessElement
>>> >> public void process(@Element T e, IntervalWindow w)
>>> >>
>>> >> In SimpleDoFnRunner we do this inside processElement. However I
>>> can't find the equivalent code in FnApiDoFnRunner. How does window
>>> explosion work in the portable runner?
>>> >>
>>> >> Reuven
>>>
>>


Re: Flink Runner with RequiresStableInput fails after a certain number of checkpoints

2020-05-04 Thread Eleanore Jin
Hi Max,

Thanks for the information and I saw this PR is already merged, just wonder
is it backported to the affected versions already (i.e. 2.14.0, 2.15.0,
2.16.0, 2.17.0, 2.18.0, 2.19.0, 2.20.0)? Or I have to wait for the 2.20.1
release?

Thanks a lot!
Eleanore

On Wed, Apr 22, 2020 at 2:31 AM Maximilian Michels  wrote:

> Hi Eleanore,
>
> Exactly-once is not affected but the pipeline can fail to checkpoint
> after the maximum number of state cells have been reached. We are
> working on a fix [1].
>
> Cheers,
> Max
>
> [1] https://github.com/apache/beam/pull/11478
>
> On 22.04.20 07:19, Eleanore Jin wrote:
> > Hi Maxi,
> >
> > I assume this will impact the Exactly Once Semantics that beam provided
> > as in the KafkaExactlyOnceSink, the processElement method is also
> > annotated with @RequiresStableInput?
> >
> > Thanks a lot!
> > Eleanore
> >
> > On Tue, Apr 21, 2020 at 12:58 AM Maximilian Michels  > > wrote:
> >
> > Hi Stephen,
> >
> > Thanks for reporting the issue! David, good catch!
> >
> > I think we have to resort to only using a single state cell for
> > buffering on checkpoints, instead of using a new one for every
> > checkpoint. I was under the assumption that, if the state cell was
> > cleared, it would not be checkpointed but that does not seem to be
> > the case.
> >
> > Thanks,
> > Max
> >
> > On 21.04.20 09:29, David Morávek wrote:
> > > Hi Stephen,
> > >
> > > nice catch and awesome report! ;) This definitely needs a proper
> fix.
> > > I've created a new JIRA to track the issue and will try to resolve
> it
> > > soon as this seems critical to me.
> > >
> > > https://issues.apache.org/jira/browse/BEAM-9794
> > >
> > > Thanks,
> > > D.
> > >
> > > On Mon, Apr 20, 2020 at 10:41 PM Stephen Patel
> > mailto:stephenpate...@gmail.com>
> > >  > >> wrote:
> > >
> > > I was able to reproduce this in a unit test:
> > >
> > > @Test
> > >
> > >   *public* *void* test() *throws* InterruptedException,
> > > ExecutionException {
> > >
> > > FlinkPipelineOptions options =
> > > PipelineOptionsFactory./as/(FlinkPipelineOptions.*class*);
> > >
> > > options.setCheckpointingInterval(10L);
> > >
> > > options.setParallelism(1);
> > >
> > > options.setStreaming(*true*);
> > >
> > > options.setRunner(FlinkRunner.*class*);
> > >
> > > options.setFlinkMaster("[local]");
> > >
> > > options.setStateBackend(*new*
> > > MemoryStateBackend(Integer.*/MAX_VALUE/*));
> > >
> > > Pipeline pipeline = Pipeline./create/(options);
> > >
> > > pipeline
> > >
> > > .apply(Create./of/((Void) *null*))
> > >
> > > .apply(
> > >
> > > ParDo./of/(
> > >
> > > *new* DoFn() {
> > >
> > >
> > >   *private* *static* *final* *long*
> > > */serialVersionUID/* = 1L;
> > >
> > >
> > >   @RequiresStableInput
> > >
> > >   @ProcessElement
> > >
> > >   *public* *void* processElement() {}
> > >
> > > }));
> > >
> > > pipeline.run();
> > >
> > >   }
> > >
> > >
> > > It took a while to get to checkpoint 32,767, but eventually it
> > did,
> > > and it failed with the same error I listed above.
> > >
> > > On Thu, Apr 16, 2020 at 11:26 AM Stephen Patel
> > > mailto:stephenpate...@gmail.com>
> > >>
> > wrote:
> > >
> > > I have a Beam Pipeline (2.14) running on Flink (1.8.0,
> > > emr-5.26.0) that uses the RequiresStableInput feature.
> > >
> > > Currently it's configured to checkpoint once a minute, and
> > after
> > > around 32000-33000 checkpoints, it fails with:
> > >
> > > 2020-04-15 13:15:02,920 INFO
> > >
> >   org.apache.flink.runtime.checkpoint.CheckpointCoordinator
> > >   - Triggering checkpoint 32701 @ 1586956502911 for job
> > > 9953424f21e240112dd23ab4f8320b60.
> > > 2020-04-15 13:15:05,762 INFO
> > >
> >   org.apache.flink.runtime.checkpoint.CheckpointCoordinator
> > >   - Completed checkpoint 32701 for job
> > > 9953424f21e240112dd23ab4f8320b60 (795385496 bytes in
> > 2667 ms).
> > > 2020-04-15 13:16:02,919 INFO
> > >
> >   

Re: [DISCUSS] finishBundle once per window

2020-05-04 Thread Robert Bradshaw
This is a really nice idea. Would the user still need to specify the
timestamp of the output? I'm a bit ambivalent about calling it
multiple times if OuptutReceiver alone is in the parameter list; this
might not be obvious and could be surprising behavior.

On Mon, May 4, 2020 at 10:13 AM Reuven Lax  wrote:
>
> I would like to discuss a minor extension to the Beam model.
>
> Beam bundles have very few restrictions on what can be in a bundle, in 
> particular s bundle might contain records for many different windows. This 
> was an explicit decision as bundling primarily exists for performance reasons 
> and we found that limiting bundling based on windows or timestamps often led 
> to severe performance problems. However it sometimes makes finishBundle hard 
> to use.
>
> I've seen multiple cases where users maintain some state in their DoFn that 
> needs finalizing (e.g. writing to an external service) in finishBundle. Often 
> users end up keeping lists of all windows seen in the bundle so they can be 
> processed separately (or sometimes not realizing that their can be multiple 
> windows and writing incorrect code).
>
> The lack of a window also means that we don't currently support injecting an 
> OuptutReceiver into finishBundle, as there's no good way of knowing which 
> window output should be put into.
>
> I would like to propose adding a way for finishBundle to inspect the window, 
> either directly (via a BoundedWindow parameter) or indirectly (via an 
> OutputReceiver parameter). In this case, we will execute finishBundle once 
> per window in the bundle. Otherwise, we will execute finishBundle once at the 
> end of the bundle as before. This behavior is backwards compatible, as 
> previously these parameters were disallowed in finishBundle.
>
> Note that this is similar to something Beam already does in processElement. A 
> single element can exist in multiple windows, however if the processElement 
> "observes" the window then Beam will call processElement once per window.
>
> In Java, the user code could look like this:
>
> DoFn<> {
>  ...
>@FinishBundle
>public void finishBundle(IntervalWindow window, OutputReceiver o) {
>// This finishBundle will be called once per window in the bundle 
> since it has
>   // a parameter that observes the window.
>}
> }
>
> This PR shows an implementation of this extension for the Java SDK.
>
> Thoughts?
>
> Reuven


[PROPOSAL] Preparing for Beam 2.22.0 release

2020-05-04 Thread Brian Hulette
Hi all,

The next (2.22.0) release branch cut is scheduled for May 20, according to
the calendar

.
I would like to volunteer myself to do this release.
The plan is to cut the branch on that date, and cherrypick release-blocking
fixes afterwards if any.

Any unresolved release blocking JIRA issues for 2.22.0 should have their
"Fix Version/s" marked as "2.22.0".

Any comments or objections?

Brian


Re: Exploding windows and FnApiDoFnRunner

2020-05-04 Thread Reuven Lax
I wonder how often we even implement this optimization today. If the
processElement has an OutputReceiver parameter then we mark it as
observesWindow, and that's a pretty common parameter.

Arguably this is a bug in our implementation of OutputReceiver though - it
should be able to copy all the windows into the output element.

Reuven

On Mon, May 4, 2020 at 9:37 AM Kenneth Knowles  wrote:

> Is the optimization complex in the Fn API context? In non-Fn API it is
> basically "if (observesWindow) { explode } else { don't }" [1]. The DoFn
> signature tells you everything you need. This might be a good first commit
> for someone looking to contribute to the Java SDK harness?
>
> Kenn
>
> [1]
> https://github.com/apache/beam/blob/591de3473144de54beef0932131025e2a4d8504b/runners/core-java/src/main/java/org/apache/beam/runners/core/SimpleDoFnRunner.java#L133
>
> On Mon, May 4, 2020 at 9:33 AM Robert Bradshaw 
> wrote:
>
>> In Python we only explode windows if the Window is being inspected.
>> (There is no separate "DoFnRunner" for FnApi vs. Legacy execution.)
>>
>> On Mon, May 4, 2020 at 9:21 AM Luke Cwik  wrote:
>> >
>> > Reuven you are correct that the optimization has yet to be implemented.
>> > Robert the FnApiDoFnRunner is the name of a Java class that executes
>> Java DoFns in the Java SDK harness. The poor name choice is my fault.
>> >
>> > On Fri, May 1, 2020 at 9:14 PM Reuven Lax  wrote:
>> >>
>> >> FnApiDoFnRunner does run Java DoFns.
>> >>
>> >> On Fri, May 1, 2020 at 9:10 PM Robert Burke 
>> wrote:
>> >>>
>> >>> In the Go SDK this optimization is handled on the SDK side, inthe
>> pardo execution node not one the runner side of the FnAPI
>> >>>
>> >>> But i think I'm about to learn that FnApiDoFnRunner is something that
>> runs on the Java SDK side rather than on the runner side, despite the name.
>> >>>
>> >>> On Fri, May 1, 2020, 9:02 PM Reuven Lax  wrote:
>> 
>>  Ah - so we don't implement the optimization of not expanding the
>> windows if not necessary?
>> 
>>  On Fri, May 1, 2020 at 8:56 PM Luke Cwik  wrote:
>> >
>> > In all the processElementYYY methods the currentWindow is assigned
>> as can be seen here as we loop over the set of windows:
>> >
>> https://github.com/apache/beam/blob/9bb2990c0f6c08dd33d9c6fa1fd91842c644a8e3/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/FnApiDoFnRunner.java#L738
>> >
>> > On Fri, May 1, 2020 at 8:51 PM Reuven Lax  wrote:
>> >>
>> >> In Beam a WindowedValue can can contain multiple windows, because
>> an element can be in multiple windows at once (for example, sliding
>> windows). Usually we keep these elements unexpanded, but if the user's doFn
>> observes the window  then we have to "explode" the element out, and we run
>> the process function once per window. e.g. if the process function looks
>> like this
>> >>
>> >> @ProcessElement
>> >> public void process(@Element T e, IntervalWindow w)
>> >>
>> >> In SimpleDoFnRunner we do this inside processElement. However I
>> can't find the equivalent code in FnApiDoFnRunner. How does window
>> explosion work in the portable runner?
>> >>
>> >> Reuven
>>
>


[DISCUSS] finishBundle once per window

2020-05-04 Thread Reuven Lax
I would like to discuss a minor extension to the Beam model.

Beam bundles have very few restrictions on what can be in a bundle, in
particular s bundle might contain records for many different windows. This
was an explicit decision as bundling primarily exists for performance
reasons and we found that limiting bundling based on windows or timestamps
often led to severe performance problems. However it sometimes makes
finishBundle hard to use.

I've seen multiple cases where users maintain some state in their DoFn that
needs finalizing (e.g. writing to an external service) in finishBundle.
Often users end up keeping lists of all windows seen in the bundle so they
can be processed separately (or sometimes not realizing that their can be
multiple windows and writing incorrect code).

The lack of a window also means that we don't currently support injecting
an OuptutReceiver into finishBundle, as there's no good way of knowing
which window output should be put into.

I would like to propose adding a way for finishBundle to inspect the
window, either directly (via a BoundedWindow parameter) or indirectly (via
an OutputReceiver parameter). In this case, we will execute finishBundle
once per window in the bundle. Otherwise, we will execute finishBundle once
at the end of the bundle as before. This behavior is backwards compatible,
as previously these parameters were disallowed in finishBundle.

Note that this is similar to something Beam already does in processElement.
A single element can exist in multiple windows, however if the
processElement "observes" the window then Beam will call processElement
once per window.

In Java, the user code could look like this:

DoFn<> {
 ...
   @FinishBundle
   public void finishBundle(IntervalWindow window, OutputReceiver o) {
   // This finishBundle will be called once per window in the bundle
since it has
  // a parameter that observes the window.
   }
}

This PR  shows an implementation
of this extension for the Java SDK.

Thoughts?

Reuven


Re: [REVIEW][please pause website changes] Migrated the Beam website to Hugo

2020-05-04 Thread Nam Bui
Hey Kenn. Thanks so much for your useful information and research. It's
great to know.

On Mon, May 4, 2020 at 6:33 PM Kenneth Knowles  wrote:

> Regarding the detection of renames, I now recall that I have encountered
> this before: it is controlled by the config diff.renameLimit. The default
> value these days is high enough to work for this PR. I've confirmed this:
>
> git diff --shortstat $(git merge-base github/pr/11554 github/master)
> github/pr/11554
>  631 files changed, 10360 insertions(+), 9938 deletions(-)
>
> (in case the commits change, that is: git diff --shortstat
> 763b7ccd17a420eb634d6799adcd3ecfcf33d6a7
> 0162c9db3e7faf0d0e243c580ffa5ca5f497db98)
>
> But the GitHub UI does not match this. I believe the reason it works for
> the individual commit and fails for the overall PR is that it is calculated
> as part of displaying the end-to-end diff. Since it is n^2 perhaps GitHub
> sets it lower. Git doesn't store anything about any of this, but always
> computes it on the fly (by design, so that improvements apply to old git
> repos automatically).
>
> Other relevant flags are `git diff --find-copies` which finds copied files
> if the original was modified in the commit and `git diff
> --find-copies-harder` which finds copied files from anywhere in the repo.
> These could support a copy --> modify new --> delete old workflow, but I
> doubt any such workflow would preserve `git blame` in the GitHub UI. (you
> can still use these flags with git blame offline to get better blame
> accuracy)
>
> Kenn
>
> On Mon, May 4, 2020 at 1:21 AM Nam Bui  wrote:
>
>> Hey guys,
>>
>> How was your weekend? Thanks for some of the compliments and also
>> recommendations.
>>
>> About the commits, as Brian said, we worked together on the-asf slack. It
>> was the tough one, we even did a few experiments. And finally came up with
>> a solution that preserved all commits and used `git mv`.
>> IMHO, I know it's really difficult to review all of them at first, even
>> though we made a commit [1] which helps you to compare changes since there
>> are tons of files. Therefore, I recommend to check out my work, take a look
>> at Hugo structure and you will link it to Jekyll one quickly. There are no
>> chances about file or directory names, just organize the structure. I write
>> a short details here, hope it would be helpful in terms of reviewing.
>>
>> 1. Syntax
>> - I strongly prefer this one [2]. He wrote about Hugo syntax which is
>> corresponding to Jekyll syntax. It would make sense to your overview,
>> instead of skimming one by one markdown file.
>>
>> 2. Project structure
>> - The main part of Hugo is in "website/www/site". You will briefly
>> confused a little bit here with many directories, so please read this one
>> [3] first, then you'll get into it very quickly. The most important thing
>> here is the flow. In Jekyll, you write a markdown file and then pick the
>> layout with "layout: home" in frontmatter as an example. In Hugo, we have
>> separated "content" and "layouts" directory, the "layouts" will mimic the
>> structure of the "content", and at the end, Hugo will know how to connect
>> each of them behind the scene.
>> - In Jekyll, the components are in "website/src/_include" and it will be
>> moved to "website/src/layouts/partials" in Hugo.
>>
>> 3. Shortcodes.
>> - Just thinking "shortcodes" as utility functions and we will reuse it
>> many times in markdown files. One of the unique features from Hugo, and
>> it's located at "website/www/layouts/shortcodes".
>>
>> A quick Q:
>> @Altay: there are some deleted files if you see them in [1]. Some of them
>> have the different behaviour in Hugo. For instance,
>> "_data/capabilitymatrix.md" will be used directly in
>> frontmatter "website/www/site/content/en/blog/capability-matrix.md", the
>> reason is, it will take more works in Hugo to retrieve data from files and
>> pass them into "shortcodes" in markdown files (other data files are not
>> deleted because they are used in "layouts" HTML files).
>> @Robert: thanks for your review and comments on GitHub. I will walk
>> through all of them today.
>>
>> Best regards,
>> Nam
>>
>>
>> [1]
>> https://github.com/apache/beam/pull/11554/commits/b267bb360866a723ac2536f408f23de648c7cd4d
>> [2] https://simpleit.rocks/golang/hugo/migrating-a-jekyll-blog-to-hugo/
>> [3] https://gohugo.io/getting-started/directory-structure/
>>
>> On Fri, May 1, 2020 at 6:24 PM Brian Hulette  wrote:
>>
>>> Regarding move detection: I worked with Nam on this some on the-asf
>>> slack. We couldn't make squashing into a single large commit work - when I
>>> did it, `git log` still showed many dropped and added files. Breaking out a
>>> single commit with the file moves was the best we could manage. I tested a
>>> PR that used this approach on a single file and the github UI did pick up
>>> on it [1]. Sadly it seems to give up on the larger PR.
>>>
>>> I figured this was good enough though, it's difficult to review all of
>>> the 

Re: Exploding windows and FnApiDoFnRunner

2020-05-04 Thread Kenneth Knowles
Is the optimization complex in the Fn API context? In non-Fn API it is
basically "if (observesWindow) { explode } else { don't }" [1]. The DoFn
signature tells you everything you need. This might be a good first commit
for someone looking to contribute to the Java SDK harness?

Kenn

[1]
https://github.com/apache/beam/blob/591de3473144de54beef0932131025e2a4d8504b/runners/core-java/src/main/java/org/apache/beam/runners/core/SimpleDoFnRunner.java#L133

On Mon, May 4, 2020 at 9:33 AM Robert Bradshaw  wrote:

> In Python we only explode windows if the Window is being inspected.
> (There is no separate "DoFnRunner" for FnApi vs. Legacy execution.)
>
> On Mon, May 4, 2020 at 9:21 AM Luke Cwik  wrote:
> >
> > Reuven you are correct that the optimization has yet to be implemented.
> > Robert the FnApiDoFnRunner is the name of a Java class that executes
> Java DoFns in the Java SDK harness. The poor name choice is my fault.
> >
> > On Fri, May 1, 2020 at 9:14 PM Reuven Lax  wrote:
> >>
> >> FnApiDoFnRunner does run Java DoFns.
> >>
> >> On Fri, May 1, 2020 at 9:10 PM Robert Burke  wrote:
> >>>
> >>> In the Go SDK this optimization is handled on the SDK side, inthe
> pardo execution node not one the runner side of the FnAPI
> >>>
> >>> But i think I'm about to learn that FnApiDoFnRunner is something that
> runs on the Java SDK side rather than on the runner side, despite the name.
> >>>
> >>> On Fri, May 1, 2020, 9:02 PM Reuven Lax  wrote:
> 
>  Ah - so we don't implement the optimization of not expanding the
> windows if not necessary?
> 
>  On Fri, May 1, 2020 at 8:56 PM Luke Cwik  wrote:
> >
> > In all the processElementYYY methods the currentWindow is assigned
> as can be seen here as we loop over the set of windows:
> >
> https://github.com/apache/beam/blob/9bb2990c0f6c08dd33d9c6fa1fd91842c644a8e3/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/FnApiDoFnRunner.java#L738
> >
> > On Fri, May 1, 2020 at 8:51 PM Reuven Lax  wrote:
> >>
> >> In Beam a WindowedValue can can contain multiple windows, because
> an element can be in multiple windows at once (for example, sliding
> windows). Usually we keep these elements unexpanded, but if the user's doFn
> observes the window  then we have to "explode" the element out, and we run
> the process function once per window. e.g. if the process function looks
> like this
> >>
> >> @ProcessElement
> >> public void process(@Element T e, IntervalWindow w)
> >>
> >> In SimpleDoFnRunner we do this inside processElement. However I
> can't find the equivalent code in FnApiDoFnRunner. How does window
> explosion work in the portable runner?
> >>
> >> Reuven
>


Re: [REVIEW][please pause website changes] Migrated the Beam website to Hugo

2020-05-04 Thread Kenneth Knowles
Regarding the detection of renames, I now recall that I have encountered
this before: it is controlled by the config diff.renameLimit. The default
value these days is high enough to work for this PR. I've confirmed this:

git diff --shortstat $(git merge-base github/pr/11554 github/master)
github/pr/11554
 631 files changed, 10360 insertions(+), 9938 deletions(-)

(in case the commits change, that is: git diff --shortstat
763b7ccd17a420eb634d6799adcd3ecfcf33d6a7
0162c9db3e7faf0d0e243c580ffa5ca5f497db98)

But the GitHub UI does not match this. I believe the reason it works for
the individual commit and fails for the overall PR is that it is calculated
as part of displaying the end-to-end diff. Since it is n^2 perhaps GitHub
sets it lower. Git doesn't store anything about any of this, but always
computes it on the fly (by design, so that improvements apply to old git
repos automatically).

Other relevant flags are `git diff --find-copies` which finds copied files
if the original was modified in the commit and `git diff
--find-copies-harder` which finds copied files from anywhere in the repo.
These could support a copy --> modify new --> delete old workflow, but I
doubt any such workflow would preserve `git blame` in the GitHub UI. (you
can still use these flags with git blame offline to get better blame
accuracy)

Kenn

On Mon, May 4, 2020 at 1:21 AM Nam Bui  wrote:

> Hey guys,
>
> How was your weekend? Thanks for some of the compliments and also
> recommendations.
>
> About the commits, as Brian said, we worked together on the-asf slack. It
> was the tough one, we even did a few experiments. And finally came up with
> a solution that preserved all commits and used `git mv`.
> IMHO, I know it's really difficult to review all of them at first, even
> though we made a commit [1] which helps you to compare changes since there
> are tons of files. Therefore, I recommend to check out my work, take a look
> at Hugo structure and you will link it to Jekyll one quickly. There are no
> chances about file or directory names, just organize the structure. I write
> a short details here, hope it would be helpful in terms of reviewing.
>
> 1. Syntax
> - I strongly prefer this one [2]. He wrote about Hugo syntax which is
> corresponding to Jekyll syntax. It would make sense to your overview,
> instead of skimming one by one markdown file.
>
> 2. Project structure
> - The main part of Hugo is in "website/www/site". You will briefly
> confused a little bit here with many directories, so please read this one
> [3] first, then you'll get into it very quickly. The most important thing
> here is the flow. In Jekyll, you write a markdown file and then pick the
> layout with "layout: home" in frontmatter as an example. In Hugo, we have
> separated "content" and "layouts" directory, the "layouts" will mimic the
> structure of the "content", and at the end, Hugo will know how to connect
> each of them behind the scene.
> - In Jekyll, the components are in "website/src/_include" and it will be
> moved to "website/src/layouts/partials" in Hugo.
>
> 3. Shortcodes.
> - Just thinking "shortcodes" as utility functions and we will reuse it
> many times in markdown files. One of the unique features from Hugo, and
> it's located at "website/www/layouts/shortcodes".
>
> A quick Q:
> @Altay: there are some deleted files if you see them in [1]. Some of them
> have the different behaviour in Hugo. For instance,
> "_data/capabilitymatrix.md" will be used directly in
> frontmatter "website/www/site/content/en/blog/capability-matrix.md", the
> reason is, it will take more works in Hugo to retrieve data from files and
> pass them into "shortcodes" in markdown files (other data files are not
> deleted because they are used in "layouts" HTML files).
> @Robert: thanks for your review and comments on GitHub. I will walk
> through all of them today.
>
> Best regards,
> Nam
>
>
> [1]
> https://github.com/apache/beam/pull/11554/commits/b267bb360866a723ac2536f408f23de648c7cd4d
> [2] https://simpleit.rocks/golang/hugo/migrating-a-jekyll-blog-to-hugo/
> [3] https://gohugo.io/getting-started/directory-structure/
>
> On Fri, May 1, 2020 at 6:24 PM Brian Hulette  wrote:
>
>> Regarding move detection: I worked with Nam on this some on the-asf
>> slack. We couldn't make squashing into a single large commit work - when I
>> did it, `git log` still showed many dropped and added files. Breaking out a
>> single commit with the file moves was the best we could manage. I tested a
>> PR that used this approach on a single file and the github UI did pick up
>> on it [1]. Sadly it seems to give up on the larger PR.
>>
>> I figured this was good enough though, it's difficult to review all of
>> the changes at once, but you can at least review the individual commits
>> without being obfuscated by the moves.
>>
>> [1] https://github.com/apache/beam/pull/11579
>>
>>
>> On Fri, May 1, 2020 at 9:11 AM Robert Bradshaw 
>> wrote:
>>
>>> I just took a look, and added a 

Re: Exploding windows and FnApiDoFnRunner

2020-05-04 Thread Robert Bradshaw
In Python we only explode windows if the Window is being inspected.
(There is no separate "DoFnRunner" for FnApi vs. Legacy execution.)

On Mon, May 4, 2020 at 9:21 AM Luke Cwik  wrote:
>
> Reuven you are correct that the optimization has yet to be implemented.
> Robert the FnApiDoFnRunner is the name of a Java class that executes Java 
> DoFns in the Java SDK harness. The poor name choice is my fault.
>
> On Fri, May 1, 2020 at 9:14 PM Reuven Lax  wrote:
>>
>> FnApiDoFnRunner does run Java DoFns.
>>
>> On Fri, May 1, 2020 at 9:10 PM Robert Burke  wrote:
>>>
>>> In the Go SDK this optimization is handled on the SDK side, inthe pardo 
>>> execution node not one the runner side of the FnAPI
>>>
>>> But i think I'm about to learn that FnApiDoFnRunner is something that runs 
>>> on the Java SDK side rather than on the runner side, despite the name.
>>>
>>> On Fri, May 1, 2020, 9:02 PM Reuven Lax  wrote:

 Ah - so we don't implement the optimization of not expanding the windows 
 if not necessary?

 On Fri, May 1, 2020 at 8:56 PM Luke Cwik  wrote:
>
> In all the processElementYYY methods the currentWindow is assigned as can 
> be seen here as we loop over the set of windows:
> https://github.com/apache/beam/blob/9bb2990c0f6c08dd33d9c6fa1fd91842c644a8e3/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/FnApiDoFnRunner.java#L738
>
> On Fri, May 1, 2020 at 8:51 PM Reuven Lax  wrote:
>>
>> In Beam a WindowedValue can can contain multiple windows, because an 
>> element can be in multiple windows at once (for example, sliding 
>> windows). Usually we keep these elements unexpanded, but if the user's 
>> doFn observes the window  then we have to "explode" the element out, and 
>> we run the process function once per window. e.g. if the process 
>> function looks like this
>>
>> @ProcessElement
>> public void process(@Element T e, IntervalWindow w)
>>
>> In SimpleDoFnRunner we do this inside processElement. However I can't 
>> find the equivalent code in FnApiDoFnRunner. How does window explosion 
>> work in the portable runner?
>>
>> Reuven


Re: Exploding windows and FnApiDoFnRunner

2020-05-04 Thread Luke Cwik
Reuven you are correct that the optimization has yet to be implemented.
Robert the FnApiDoFnRunner is the name of a Java class that executes Java
DoFns in the Java SDK harness. The poor name choice is my fault.

On Fri, May 1, 2020 at 9:14 PM Reuven Lax  wrote:

> FnApiDoFnRunner does run Java DoFns.
>
> On Fri, May 1, 2020 at 9:10 PM Robert Burke  wrote:
>
>> In the Go SDK this optimization is handled on the SDK side, inthe pardo
>> execution node not one the runner side of the FnAPI
>>
>> But i think I'm about to learn that FnApiDoFnRunner is something that
>> runs on the Java SDK side rather than on the runner side, despite the name.
>>
>> On Fri, May 1, 2020, 9:02 PM Reuven Lax  wrote:
>>
>>> Ah - so we don't implement the optimization of not expanding the windows
>>> if not necessary?
>>>
>>> On Fri, May 1, 2020 at 8:56 PM Luke Cwik  wrote:
>>>
 In all the processElementYYY methods the currentWindow is assigned as
 can be seen here as we loop over the set of windows:

 https://github.com/apache/beam/blob/9bb2990c0f6c08dd33d9c6fa1fd91842c644a8e3/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/FnApiDoFnRunner.java#L738

 On Fri, May 1, 2020 at 8:51 PM Reuven Lax  wrote:

> In Beam a WindowedValue can can contain multiple windows, because an
> element can be in multiple windows at once (for example, sliding windows).
> Usually we keep these elements unexpanded, but if the user's doFn observes
> the window  then we have to "explode" the element out, and we run the
> process function once per window. e.g. if the process function looks like
> this
>
> @ProcessElement
> public void process(@Element T e, IntervalWindow w)
>
> In SimpleDoFnRunner we do this inside processElement. However I can't
> find the equivalent code in FnApiDoFnRunner. How does window explosion 
> work
> in the portable runner?
>
> Reuven
>



Re: Non-trivial joins examples

2020-05-04 Thread Marcin Kuthan
@Kenneth - thank for your response, for sure I was inspired a lot by
earlier discussions on the group and latest documentation updates about
Timers:
https://beam.apache.org/documentation/programming-guide/#timers

In the limitations I forgot to mention about SideInputs, it works quite
well for scenarios where one side of the join is updated slowly, very
slowly. But for scenarios where the main stream gets 50k+ events per
seconds and the joined stream ~100 events per second it simply does not
work. Especially if there is no support for updates in Map side input and
the side input has to be updated/broadcasted as a whole.

@Jan - very interesting, as I understood the joins are already implemented
(plenty of them in Scio, classic ones, sparse versions, etc.) the problem
is with limited windows semantics, triggering policy and the time of
emitted events.

Please look at LookupCacheDoFn, it looks like left outer join - but it
isn't. Only the latest Lookup value (right side of the join) is cached. And
the left side of the join is cached only until the first matching lookup is
observed. Not so generic but quite efficient.

https://github.com/mkuthan/beam-examples/blob/master/src/main/scala/org/mkuthan/beam/examples/LookupCacheDoFn.scala

Marcin

On Fri, 1 May 2020 at 22:22, Jan Lukavský  wrote:

> Interestingly, I'm currently also working on a proposal for generic join
> semantics. I plan to send a proposal for review, but unfortunately, there
> are still other things keeping me busy. I take this opportunity to review
> high-level thoughts, maybe someone can give some points.
>
> The general idea is to define a join that can incorporate all other types
> as special cases, where the generic implementation can be simplified or
> optimized, but the semantics remain the same. As I plan to put this down to
> a full design document I will just very roughly outline ideas:
>
>  a) the generic semantics, should be equivalent to running relational join
> against set of tables _after each individual modification of the relation_
> and producing results with timestamp of the last modification
>
>  b) windows "scope" state of each "table" - i.e. when time reaches
> window.maxTimestamp() corresponding "table" is cleared
>
>  c) it should be possible to derive other types of joins from this
> definition by certain manipulations (e.g. buffering multiple updates in
> single window and assigninig all elements timestamp of
> window.maxTimestamp() will yield the classical "windowed join" with the
> requirement to have same windows on both (all) sides as otherwise the
> result will be empty) - the goal of these modification is typically
> enabling some optimization (e.g. the fully generic implementation must
> include time sorting - either implicitly or explicitly, optimized variants
> can drop this requirement).
>
> It would be great is someone has any comments on this bottom-up approach.
>
> Jan
> On 5/1/20 5:30 PM, Kenneth Knowles wrote:
>
> +dev @beam and some people who I talk about joins
> with
>
> Interesting! It is a lot to take in and fully grok the code, so calling in
> reinforcements...
>
> Generally, I think there's agreement that for a lot of real use cases, you
> have to roll your own join using the lower level Beam primitives. So I
> think it would be great to get some of these other approaches to joins into
> Beam, perhaps as an extension of the Java SDK or even in the core (since
> schema joins are in the core). In particular:
>
>  - "join in fixed window with repeater" sounds similar (but not identical)
> to work by Mikhail
>  - "join in global window with cache" sounds similar (but not identical)
> to work and discussions w/ Reza and Tyson
>
> I want to be clear that I am *not* saying there's any duplication. I'm
> guessing these all fit into a collection of different ways to accomplish
> joins, and if everything comes to fruition we will have the great
> opportunity to document how a user should choose between them.
>
> Kenn
>
> On Fri, May 1, 2020 at 7:56 AM Marcin Kuthan 
> wrote:
>
>> Hi,
>>
>> it's my first post here but I'm a group reader for a while, so thank you
>> for sharing the knowledge!
>>
>> I've been using Beam/Scio on Dataflow for about a year, mostly for stream
>> processing from unbounded source like PubSub. During my daily work I found
>> that built-in windowing is very generic and provides reach watermark/late
>> events semantics but there are a few very annoying limitations, e.g:
>> - both side of the join must be defined within compatible windows
>> - for fixed windows, elements close to window boundaries (but in
>> different windows) won't be joined
>> - for sliding windows there is a huge overhead if the duration is much
>> longer than offset
>>
>> I would like to ask you to review a few "join/windowing patterns" with
>> custom stateful ParDos, not so generic as Beam built-ins but perhaps better
>> crafted for more specific needs. I published code with tests, feel free to
>> comment as 

Re: Jenkins jobs not running for my PR 10438

2020-05-04 Thread Robert Bradshaw
Done.

On Mon, May 4, 2020 at 7:35 AM Rehman Murad Ali
 wrote:
>
> Hi Beam committers,
>
> Would you please trigger the basic checks as well as validatesRunner check 
> for this PR?
> https://github.com/apache/beam/pull/11350
>
>
> Thanks & Regards
>
> Rehman Murad Ali
> Software Engineer
> Mobile: +92 3452076766
> Skype: rehman.muradali
>
>
>
> On Fri, May 1, 2020 at 5:11 PM Ismaël Mejía  wrote:
>>
>> done
>>
>> On Fri, May 1, 2020 at 5:31 AM Tomo Suzuki  wrote:
>> >
>> > Hi Beam committers,
>> >
>> > Would you trigger the precommit checks for this PR?
>> > https://github.com/apache/beam/pull/11586
>> >
>> > Regards,
>> > Tomo


Re: Jenkins jobs not running for my PR 10438

2020-05-04 Thread Rehman Murad Ali
Hi Beam committers,

Would you please trigger the basic checks as well as validatesRunner check
for this PR?
https://github.com/apache/beam/pull/11350


*Thanks & Regards*

*Rehman Murad Ali*
Software Engineer
Mobile: +92 3452076766
Skype: rehman.muradali


On Fri, May 1, 2020 at 5:11 PM Ismaël Mejía  wrote:

> done
>
> On Fri, May 1, 2020 at 5:31 AM Tomo Suzuki  wrote:
> >
> > Hi Beam committers,
> >
> > Would you trigger the precommit checks for this PR?
> > https://github.com/apache/beam/pull/11586
> >
> > Regards,
> > Tomo
>


Flink: Lost pane timing at some steps of pipeline

2020-05-04 Thread Jozef Vilcek
I have a pipeline which

1. Read from KafkaIO
2. Does stuff with events and writes windowed file via FileIO
3. Apply statefull DoFn on written files info

The statefull DoFn does some logic which depends on PaneInfo.Timing, if it
is EARLY or something else. When testing in DirectRunner, all is good. But
with FlinkRunner, panes are always NO_FIRING.

To demonstrate this, here is a dummy test pipeline:

val testStream = sc.testStream(testStreamOf[String]
  .advanceWatermarkTo(new Instant(1))
  .addElements(goodMessage, goodMessage)
  .advanceWatermarkTo(new Instant(2))
  .addElements(goodMessage, goodMessage)
  .advanceWatermarkTo(new Instant(200))
  .addElements(goodMessage, goodMessage)
  .advanceWatermarkToInfinity())

testStream
  .withFixedWindows(
duration = Duration.standardSeconds(1),
options = WindowOptions(
  trigger = AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterPane.elementCountAtLeast(1))
.withLateFirings(AfterPane.elementCountAtLeast(1)),
  accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES,
  allowedLateness = Duration.standardDays(1)
))
  .keyBy(_ => "static_key")
  .withPaneInfo
  .map { case (element, paneInfo) =>
println(s"#1 - $paneInfo")
element
  }
  //.groupByKey // <- need to uncomment this for Flink to work
  .applyTransform(Reshuffle.viaRandomKey())
  .withPaneInfo
  .map { case (element, paneInfo) =>
println(s"#2 - $paneInfo")
element
  }

When executed with DirectRunner, #1 prints pane with UNKNOWN timing and #2
with EARLY, which is what I expect. When run with Flink runner, both #1 and
#2 writes UNKNOWN timing from PaneInfo.NO_FIRING. Only if I add extra GBK,
then #2 writes panes with EARLY timing.

This is run on Beam 2.19. I was trying to analyze where could be a problem
but got lost. I will be happy for any suggestions or pointers. Does it
sounds like bug or am I doing something wrong?


Beam Dependency Check Report (2020-05-04)

2020-05-04 Thread Apache Jenkins Server
ERROR: File 'src/build/dependencyUpdates/beam-dependency-check-report.html' does not exist

Re: [REVIEW][please pause website changes] Migrated the Beam website to Hugo

2020-05-04 Thread Nam Bui
Hey guys,

How was your weekend? Thanks for some of the compliments and also
recommendations.

About the commits, as Brian said, we worked together on the-asf slack. It
was the tough one, we even did a few experiments. And finally came up with
a solution that preserved all commits and used `git mv`.
IMHO, I know it's really difficult to review all of them at first, even
though we made a commit [1] which helps you to compare changes since there
are tons of files. Therefore, I recommend to check out my work, take a look
at Hugo structure and you will link it to Jekyll one quickly. There are no
chances about file or directory names, just organize the structure. I write
a short details here, hope it would be helpful in terms of reviewing.

1. Syntax
- I strongly prefer this one [2]. He wrote about Hugo syntax which is
corresponding to Jekyll syntax. It would make sense to your overview,
instead of skimming one by one markdown file.

2. Project structure
- The main part of Hugo is in "website/www/site". You will briefly confused
a little bit here with many directories, so please read this one [3] first,
then you'll get into it very quickly. The most important thing here is the
flow. In Jekyll, you write a markdown file and then pick the layout with
"layout: home" in frontmatter as an example. In Hugo, we have separated
"content" and "layouts" directory, the "layouts" will mimic the structure
of the "content", and at the end, Hugo will know how to connect each of
them behind the scene.
- In Jekyll, the components are in "website/src/_include" and it will be
moved to "website/src/layouts/partials" in Hugo.

3. Shortcodes.
- Just thinking "shortcodes" as utility functions and we will reuse it many
times in markdown files. One of the unique features from Hugo, and it's
located at "website/www/layouts/shortcodes".

A quick Q:
@Altay: there are some deleted files if you see them in [1]. Some of them
have the different behaviour in Hugo. For instance,
"_data/capabilitymatrix.md" will be used directly in
frontmatter "website/www/site/content/en/blog/capability-matrix.md", the
reason is, it will take more works in Hugo to retrieve data from files and
pass them into "shortcodes" in markdown files (other data files are not
deleted because they are used in "layouts" HTML files).
@Robert: thanks for your review and comments on GitHub. I will walk through
all of them today.

Best regards,
Nam


[1]
https://github.com/apache/beam/pull/11554/commits/b267bb360866a723ac2536f408f23de648c7cd4d
[2] https://simpleit.rocks/golang/hugo/migrating-a-jekyll-blog-to-hugo/
[3] https://gohugo.io/getting-started/directory-structure/

On Fri, May 1, 2020 at 6:24 PM Brian Hulette  wrote:

> Regarding move detection: I worked with Nam on this some on the-asf slack.
> We couldn't make squashing into a single large commit work - when I did it,
> `git log` still showed many dropped and added files. Breaking out a single
> commit with the file moves was the best we could manage. I tested a PR that
> used this approach on a single file and the github UI did pick up on it
> [1]. Sadly it seems to give up on the larger PR.
>
> I figured this was good enough though, it's difficult to review all of the
> changes at once, but you can at least review the individual commits without
> being obfuscated by the moves.
>
> [1] https://github.com/apache/beam/pull/11579
>
>
> On Fri, May 1, 2020 at 9:11 AM Robert Bradshaw 
> wrote:
>
>> I just took a look, and added a couple of comments, but it mostly looks
>> good. Thanks for creating a commit that preserves changes; that's a big
>> improvement.
>>
>> +1 to Ahmet's suggestion about braking the huge commit up a bit more. I
>> would suggest one that adds the mechanics (etc.), one that applies a script
>> to auto-convert the content (where we can review the script and that it's
>> application give the resulting diff), and a final one that takes care of
>> the things that the script wasn't able to handle (or messed up, rather than
>> spending a huge amount of time getting the script perfect).
>>
>> On Fri, May 1, 2020 at 6:44 AM Kenneth Knowles  wrote:
>>
>>> I believe taking Brian and Robert's advice to help git detect moves
>>> (even more than you already have) will make this much more manageable. I
>>> just tried it out and squashing commits brings it to "631 files changed,
>>> 10363 insertions(+), 9945 deletions(-)" according to git, so that is more
>>> manageable than +47k - 47k. I'm not saying that a total squash is best.
>>> There may be a better way to factor the changes.
>>>
>>> Kenn
>>>
>>> On Thu, Apr 30, 2020 at 8:09 PM Ahmet Altay  wrote:
>>>
 Nam,

  - Website looks good and looks the same as the current website.
 (Visually comparing a few pages, not a deep analysis.)
 - contribute.md looks good. (this is new content.)
 - website/Dockerfile and website/README.md changes look good.
 - I do not know what is the new version of some files, for example: