Re: [VOTE] Release 2.27.0, release candidate #1

2020-12-22 Thread Jean-Baptiste Onofre
+1 (binding)

Regards
JB

> On 23 Dec 2020, at 06:46, Pablo Estrada wrote:
> [...]


[VOTE] Release 2.27.0, release candidate #1

2020-12-22 Thread Pablo Estrada
Hi everyone,
Please review and vote on release candidate #1 for version 2.27.0,
as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)


Reviewers are encouraged to test their own use cases with the release
candidate and vote +1 if no issues are found.

The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release to be deployed to dist.apache.org [2],
which is signed with the key with fingerprint
C79DDD47DAF3808F0B9DDFAC02B2D9F742008494 [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "v2.27.0-RC1" [5],
* website pull request listing the release [6], publishing the API
reference manual [7], and the blog post [8].
* Python artifacts are deployed along with the source release to
dist.apache.org [2].
* Validation sheet with a tab for the 2.27.0 release to help with validation
[9].
* Docker images published to Docker Hub [10].
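For reviewers validating the staged artifacts in [2], here is a minimal sketch of one common check: verifying a downloaded archive against its published SHA-512 digest (Python; the archive name and the presence of a digest sidecar file are assumptions, and the signature check against the KEYS file [3] would additionally use gpg):

```python
# Sketch: check a downloaded release artifact against a published SHA-512
# digest. File names here are illustrative, not taken from [2].
import hashlib

def sha512_hex(path, chunk_size=1 << 20):
    """Stream the file in chunks so large archives are not loaded into memory."""
    h = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def matches_published_digest(path, published_digest):
    # Published digest files often append the file name after the hex digest,
    # so compare only the first whitespace-separated token.
    return sha512_hex(path) == published_digest.strip().split()[0].lower()
```

A reviewer would download the archive and its digest text from the staging area [2] and then call `matches_published_digest(archive_path, digest_text)`.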

The vote will be open for at least 72 hours, but given the holidays, we
will likely extend for a few more days. The release will be adopted by
majority approval, with at least 3 PMC affirmative votes.

Thanks,
-P.

[1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12349380
[2] https://dist.apache.org/repos/dist/dev/beam/2.27.0/
[3] https://dist.apache.org/repos/dist/release/beam/KEYS
[4] https://repository.apache.org/content/repositories/orgapachebeam-1145/
[5] https://github.com/apache/beam/tree/v2.27.0-RC1
[6] https://github.com/apache/beam/pull/13602
[7] https://github.com/apache/beam-site/pull/610
[8] https://github.com/apache/beam/pull/13603
[9] https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=194829106
[10] https://hub.docker.com/search?q=apache%2Fbeam&type=image


Re: Usability regression using SDF Unbounded Source wrapper + DirectRunner

2020-12-22 Thread Antonio Si
Hi Boyuan,

Let me clarify: I have tried with and without the
--experiments=beam_fn_api,use_sdf_kafka_read option:

- with --experiments=use_deprecated_read --fasterCopy=true, I am able to
achieve 13K TPS
- with --experiments="beam_fn_api,use_sdf_kafka_read" --fasterCopy=true, I am
able to achieve 10K TPS
- with --fasterCopy=true alone, I am only able to achieve 5K TPS

In our test case, we have multiple topics and the checkpoint interval is 60s.
Some topics have much higher traffic than others. We looked a little at the
case with the --experiments="beam_fn_api,use_sdf_kafka_read" --fasterCopy=true
options. Based on our observation, each consumer poll() in
ReadFromKafkaDoFn.processElement() takes about 0.8 ms. So for topics with
high traffic, it will continue in the loop because every poll() returns some
records. Every poll returns about 200 records, so it takes about 0.8 ms for
every 200 records. I am not sure if that is part of the reason for the
performance difference.
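For what it's worth, a back-of-envelope check of those figures (assuming they hold steadily) suggests the poll loop itself is not the cap: 200 records per 0.8 ms poll implies on the order of 250K records per second, well above the observed 10K TPS, so poll() latency alone seems unlikely to explain the gap.

```python
# Back-of-envelope check of the figures reported above (illustrative only).
poll_latency_ms = 0.8    # observed time per consumer.poll()
records_per_poll = 200   # observed records returned per poll

# Implied upper bound on records one poll loop can drain per second:
implied_rps = records_per_poll / (poll_latency_ms / 1000.0)  # ~250,000
```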

Thanks.

Antonio.

On 2020/12/21 19:03:19, Boyuan Zhang  wrote: 
> Hi Antonio,
> 
> Thanks for the data point. That's very valuable information!
> 
> I didn't use DirectRunner. I am using FlinkRunner.
> > We measured the number of Kafka messages that we can process per second.
> > With Beam v2.26 with --experiments=use_deprecated_read and
> > --fasterCopy=true,
> > we are able to consume 13K messages per second, but with Beam v2.26
> > without the use_deprecated_read option, we are only able to process 10K
> > messages
> > per second for the same pipeline.
> 
> We do have an SDF implementation of Kafka Read instead of using the wrapper.
> Would you like to try it to see whether it helps improve your
> situation? You can use --experiments=beam_fn_api,use_sdf_kafka_read to
> switch to the Kafka SDF Read.
> 
> On Mon, Dec 21, 2020 at 10:54 AM Boyuan Zhang  wrote:
> 
> > Hi Jan,
> >>
> >> it seems that what we would want is to couple the lifecycle of the Reader
> >> not with the restriction but with the particular instance of
> >> (Un)boundedSource (after being split). That could be done in the processing
> >> DoFn, if it contained a cache mapping instance of the source to the
> >> (possibly null - i.e. not yet open) reader. In @NewTracker we could assign
> >> (or create) the reader to the tracker, as the tracker is created for each
> >> restriction.
> >>
> >> WDYT?
> >>
> > I was thinking about this, but it seems like it is not applicable to the
> > way UnboundedSource and UnboundedReader work together.
> > Please correct me if I'm wrong. The UnboundedReader is created from the
> > UnboundedSource per CheckpointMark [1], which means for certain sources the
> > CheckpointMark could affect attributes such as the start position of the
> > reader when resuming. So a single UnboundedSource could be mapped to
> > multiple readers because of different instances of CheckpointMark. That's
> > also the reason why we use CheckpointMark as the restriction.
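To make that distinction concrete, here is a hypothetical sketch (plain Python, not Beam's actual Java API; all names are illustrative) of why a reader cache cannot be keyed by the source instance alone: the same UnboundedSource with two different CheckpointMarks must yield two different readers, e.g. with different resume offsets.

```python
class ReaderCache:
    """Hypothetical cache, keyed by (source, checkpoint mark), not source alone."""

    def __init__(self, create_reader):
        # create_reader stands in for something like source.createReader(mark).
        self._create = create_reader
        self._cache = {}

    def reader_for(self, source, checkpoint_mark):
        # The same source with a different CheckpointMark must get a
        # different reader (e.g. a different resume position).
        key = (id(source), checkpoint_mark)
        if key not in self._cache:
            self._cache[key] = self._create(source, checkpoint_mark)
        return self._cache[key]
```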
> >
> > Please let me know if I misunderstand your suggestion.
> >
> > [1]
> > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/UnboundedSource.java#L73-L78
> >
> > On Mon, Dec 21, 2020 at 9:18 AM Antonio Si  wrote:
> >
> >> Hi Boyuan,
> >>
> >> Sorry for my late reply. I was off for a few days.
> >>
> >> I didn't use DirectRunner. I am using FlinkRunner.
> >>
> >> We measured the number of Kafka messages that we can process per second.
> >> With Beam v2.26 with --experiments=use_deprecated_read and
> >> --fasterCopy=true,
> >> we are able to consume 13K messages per second, but with Beam v2.26
> >> without the use_deprecated_read option, we are only able to process 10K
> >> messages
> >> per second for the same pipeline.
> >>
> >> Thanks and regards,
> >>
> >> Antonio.
> >>
> >> On 2020/12/11 22:19:40, Boyuan Zhang  wrote:
> >> > Hi Antonio,
> >> >
> >> > Thanks for the details! Which version of Beam SDK are you using? And are
> >> > you using --experiments=beam_fn_api with DirectRunner to launch your
> >> > pipeline?
> >> >
> >> > For ReadFromKafkaDoFn.processElement(), it takes a Kafka
> >> > topic+partition as its input element, and a KafkaConsumer is assigned to
> >> > this topic+partition to poll records continuously. The Kafka consumer
> >> > will resume reading and return from the process fn when:
> >> >
> >> >- There are no records currently available (this is a feature of SDF
> >> >known as a self-initiated checkpoint).
> >> >- The OutputAndTimeBoundedSplittableProcessElementInvoker issues a
> >> >checkpoint request to ReadFromKafkaDoFn for getting partial results. The
> >> >checkpoint frequency for DirectRunner is every 100 output records or
> >> >every 1 second.
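Those two return conditions can be modeled with a short sketch (plain Python for illustration, not Beam's actual Java implementation; the limits are the DirectRunner values described above):

```python
# Illustrative model of an SDF-style Kafka read loop: poll repeatedly, and
# return either on an empty poll (self-initiated checkpoint) or once the
# runner-issued checkpoint limits are hit (100 records or 1 second here,
# matching the DirectRunner figures described above).
import time

def process_element(poll, max_records=100, max_seconds=1.0):
    """poll() stands in for consumer.poll(); it returns a (possibly empty) list."""
    out = []
    deadline = time.monotonic() + max_seconds
    while True:
        records = poll()
        if not records:
            # No records currently available: defer via self-initiated checkpoint.
            return out, "self_initiated_checkpoint"
        out.extend(records)
        if len(out) >= max_records or time.monotonic() >= deadline:
            # Runner-issued checkpoint: return partial results.
            return out, "runner_checkpoint"
```

Either exit path reschedules the residual work, which is the overhead mentioned below.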
> >> >
> >> > It seems like either the self-initiated checkpoint or the
> >> > DirectRunner-issued checkpoint gives you the performance regression,
> >> > since there is overhead when rescheduling residuals. In your case, it's
> >> > more like that th