Re: Beam summit update on blog?

2020-09-21 Thread Matthias Baetens
+1 totally. I was thinking a retweet of the Beam Summit recording on the
Apache Beam channel would be helpful as well - and I'm happy to look into
doing a blog post too!

On Tue, Sep 22, 2020, 04:25 Ahmet Altay  wrote:

> Would it make sense to publish a blog post and a tweet about the Beam
> Summit 2020 with some highlights and links to videos? There are many hours
> of video content. This is great content, and a social-media-shareable post /
> tweet could help with promoting it.
>
> Ahmet
>


Re: Output from Window not getting materialized

2020-09-21 Thread Luke Cwik
On Mon, Sep 21, 2020 at 5:02 PM Praveen K Viswanathan <
harish.prav...@gmail.com> wrote:

> Hi Luke, thanks for the detailed explanation. This gives more insight to
> new people like me trying to grok the whole concept.
>
> *1) What timestamp you're going to output your records at*
>
> * use the upstream input element's timestamp: guidance is to use the default
> implementation, which gets the upstream watermark by default
> * use data from within the record being output, or external system state
> via an API call: use a watermark estimator
>
> In the above section, I do not think our source has a built-in watermark
> concept to derive and use in the SDF, so we will have to go with the second
> option. Supposing we could extract a timestamp from the source message,
> do we then have to setWatermark with that extracted timestamp before each
> output in @ProcessElement? And can we use the Manual Watermark Estimator
> itself for this approach?
>

You want to use context.outputWithTimestamp when parsing your own
timestamps and emitting records.

Using the manual one works, but also take a look at how timestamp observing
works, since the estimator will be told the timestamp of each element being
produced.
Using the timestamp observing ones (monotonically increasing or your own)
allows you to decouple the watermark estimator logic from the SDF
implementation.
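
To make that concrete, here is a minimal sketch (the class, the restriction
range, and the extractTimestamp helper are hypothetical) of an SDF that emits
records with timestamps parsed from the data and lets the timestamp-observing
MonotonicallyIncreasing estimator derive the watermark:

  import org.apache.beam.sdk.coders.Coder;
  import org.apache.beam.sdk.io.range.OffsetRange;
  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker;
  import org.apache.beam.sdk.transforms.splittabledofn.WatermarkEstimators;
  import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
  import org.joda.time.Instant;

  class TimestampedRecordsFn extends DoFn<String, String> {

    @GetInitialRestriction
    public OffsetRange initialRestriction(@Element String element) {
      return new OffsetRange(0, 100);
    }

    @GetRestrictionCoder
    public Coder<OffsetRange> restrictionCoder() {
      return OffsetRange.Coder.of();
    }

    // Seed the estimator state with the minimum timestamp.
    @GetInitialWatermarkEstimatorState
    public Instant initialWatermarkState() {
      return BoundedWindow.TIMESTAMP_MIN_VALUE;
    }

    // Timestamp-observing estimator: it is told the timestamp of every element
    // that is output, so the watermark logic stays out of the SDF body.
    @NewWatermarkEstimator
    public WatermarkEstimators.MonotonicallyIncreasing newWatermarkEstimator(
        @WatermarkEstimatorState Instant state) {
      return new WatermarkEstimators.MonotonicallyIncreasing(state);
    }

    @ProcessElement
    public void process(ProcessContext c, RestrictionTracker<OffsetRange, Long> tracker) {
      for (long i = tracker.currentRestriction().getFrom(); tracker.tryClaim(i); ++i) {
        Instant ts = extractTimestamp(i);         // hypothetical: parsed from the record
        c.outputWithTimestamp("record-" + i, ts); // the estimator observes this timestamp
      }
    }

    private Instant extractTimestamp(long offset) {
      return Instant.now(); // stand-in for a timestamp read out of the message
    }
  }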


>
>
> *2) How you want to compute the watermark estimate (if at all)*
> * the choice here depends on how the elements' timestamps progress: are
> they in exactly sorted order, almost sorted order, completely unsorted,
> ...?
>
> Our elements are in "almost sorted order", which is why we want to hold
> off processing message_01 with timestamp 11:00:10 AM until we process
> message_02 with timestamp 11:00:08 AM. How do we enable this ordering while
> processing the messages?
>
> Based on your suggestion, I tried the WallTime estimator and it worked for one
> of our many scenarios. I am planning to test it with a bunch of other
> window types and use it until we get a solid hold on the above-mentioned
> approach that can handle the unsorted messages.
>

If you're extracting the timestamps out of your data, it would likely be
best to use the monotonically increasing timestamp estimator or write one
that computes one using some statistical method appropriate to your source.
If you think you have written one that is generally useful, feel free to
contribute it to Beam.

You'll want to look into @RequiresTimeSortedInput[1]. This allows you to
produce the messages in any order and requires the runner to make sure they
are sorted by event time before being passed to the downstream stateful DoFn.

1:
https://lists.apache.org/thread.html/9cdac2a363e18be58fa1f14c838c61e8406ae3407e4e2d05e423234c%40%3Cdev.beam.apache.org%3E
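
A minimal sketch of what that can look like (the element type and the state
cell are illustrative, not from this thread):

  import org.apache.beam.sdk.state.StateSpec;
  import org.apache.beam.sdk.state.StateSpecs;
  import org.apache.beam.sdk.state.ValueState;
  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.values.KV;
  import org.joda.time.Instant;

  class TimeSortedFn extends DoFn<KV<String, Long>, String> {

    @StateId("lastTimestamp")
    private final StateSpec<ValueState<Instant>> lastTimestamp = StateSpecs.value();

    @RequiresTimeSortedInput
    @ProcessElement
    public void process(
        @Element KV<String, Long> element,
        @Timestamp Instant timestamp,
        @StateId("lastTimestamp") ValueState<Instant> lastSeen,
        OutputReceiver<String> out) {
      // The runner delivers elements per key in event-time order, so the
      // 11:00:08 message is seen before the 11:00:10 one.
      lastSeen.write(timestamp);
      out.output(element.getKey() + " @ " + timestamp);
    }
  }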


>
> Regards,
> Praveen
>
> On Fri, Sep 18, 2020 at 10:06 AM Luke Cwik  wrote:
>
>> To answer your specific question, you should create and return the
>> WallTime estimator. You shouldn't need to interact with it from within
>> your @ProcessElement call since your elements are using the current time
>> for their timestamp.
>>
>> On Fri, Sep 18, 2020 at 10:04 AM Luke Cwik  wrote:
>>
>>> Kafka is a complex example because it is adapting code from before there
>>> was an SDF implementation (namely the TimestampPolicy and the
>>> TimestampFn/TimestampFnS/WatermarkFn/WatermarkFn2 functions).
>>>
>>> There are three types of watermark estimators that are in the Beam Java
>>> SDK today:
>>> Manual: Can be invoked from within your @ProcessElement method within
>>> your SDF allowing you precise control over what the watermark is.
>>> WallTime: Doesn't need to be interacted with, will report the current
>>> time as the watermark time. Once it is instantiated and returned via the
>>> @NewWatermarkEstimator method you don't need to do anything with it. This
>>> is functionally equivalent to calling setWatermark(Instant.now()) right
>>> before returning from the @ProcessElement method in the SplittableDoFn on a
>>> Manual watermark.
>>> TimestampObserving: Is invoked using the output timestamp for every
>>> element that is output. This is functionally equivalent to calling
>>> setWatermark after each output within your @ProcessElement method in the
>>> SplittableDoFn. The MonotonicallyIncreasing implementation for
>>> the TimestampObserving estimator ensures that the largest timestamp seen so
>>> far will be reported for the watermark.
>>>
>>> The default is to not set any watermark estimate.
>>>
>>> For all watermark estimators you're allowed to set the watermark
>>> estimate to anything as the runner will recompute the output watermark as:
>>> new output watermark = max(previous output watermark, min(upstream
>>> watermark, watermark estimates))
>>> This effectively means that the watermark will never go backwards from
>>> the runner's point of view, but it does mean that setting the watermark
>>> estimate below the previous output watermark (which isn't observable) will
>>> not do anything beyond holding the watermark at the previous output
>>> watermark.

Beam summit update on blog?

2020-09-21 Thread Ahmet Altay
Would it make sense to publish a blog post and a tweet about the Beam
Summit 2020 with some highlights and links to videos? There are many hours
of video content. This is great content, and a social-media-shareable post /
tweet could help with promoting it.

Ahmet


Reviewers for SparkRunner

2020-09-21 Thread tclemons
About a week ago I submitted the following pull request:

https://github.com/apache/beam/pull/12850

I listed @jbonofre as a reviewer as they were listed in the OWNERS file for the 
Spark runners.  However, I've yet to hear any response.  Is there someone else 
I should add as a reviewer?

Appreciate it.

Tim.



Re: Output from Window not getting materialized

2020-09-21 Thread Praveen K Viswanathan
Hi Luke, thanks for the detailed explanation. This gives more insight to
new people like me trying to grok the whole concept.

*1) What timestamp you're going to output your records at*

* use the upstream input element's timestamp: guidance is to use the default
implementation, which gets the upstream watermark by default
* use data from within the record being output, or external system state
via an API call: use a watermark estimator

In the above section, I do not think our source has a built-in watermark
concept to derive and use in the SDF, so we will have to go with the second
option. Supposing we could extract a timestamp from the source message,
do we then have to setWatermark with that extracted timestamp before each
output in @ProcessElement? And can we use the Manual Watermark Estimator
itself for this approach?


*2) How you want to compute the watermark estimate (if at all)*
* the choice here depends on how the elements' timestamps progress: are
they in exactly sorted order, almost sorted order, completely unsorted,
...?

Our elements are in "almost sorted order", which is why we want to hold off
processing message_01 with timestamp 11:00:10 AM until we process
message_02 with timestamp 11:00:08 AM. How do we enable this ordering while
processing the messages?

Based on your suggestion, I tried the WallTime estimator and it worked for one
of our many scenarios. I am planning to test it with a bunch of other
window types and use it until we get a solid hold on the above-mentioned
approach that can handle the unsorted messages.

Regards,
Praveen

On Fri, Sep 18, 2020 at 10:06 AM Luke Cwik  wrote:

> To answer your specific question, you should create and return the
> WallTime estimator. You shouldn't need to interact with it from within
> your @ProcessElement call since your elements are using the current time
> for their timestamp.
>
> On Fri, Sep 18, 2020 at 10:04 AM Luke Cwik  wrote:
>
>> Kafka is a complex example because it is adapting code from before there
>> was an SDF implementation (namely the TimestampPolicy and the
>> TimestampFn/TimestampFnS/WatermarkFn/WatermarkFn2 functions).
>>
>> There are three types of watermark estimators that are in the Beam Java
>> SDK today:
>> Manual: Can be invoked from within your @ProcessElement method within
>> your SDF allowing you precise control over what the watermark is.
>> WallTime: Doesn't need to be interacted with, will report the current
>> time as the watermark time. Once it is instantiated and returned via the
>> @NewWatermarkEstimator method you don't need to do anything with it. This
>> is functionally equivalent to calling setWatermark(Instant.now()) right
>> before returning from the @ProcessElement method in the SplittableDoFn on a
>> Manual watermark.
>> TimestampObserving: Is invoked using the output timestamp for every
>> element that is output. This is functionally equivalent to calling
>> setWatermark after each output within your @ProcessElement method in the
>> SplittableDoFn. The MonotonicallyIncreasing implementation for
>> the TimestampObserving estimator ensures that the largest timestamp seen so
>> far will be reported for the watermark.
>>
>> The default is to not set any watermark estimate.
>>
>> For all watermark estimators you're allowed to set the watermark estimate
>> to anything as the runner will recompute the output watermark as:
>> new output watermark = max(previous output watermark, min(upstream
>> watermark, watermark estimates))
>> This effectively means that the watermark will never go backwards from
>> the runner's point of view, but it does mean that setting the watermark
>> estimate below the previous output watermark (which isn't observable) will
>> not do anything beyond holding the watermark at the previous output
>> watermark.
>>
>> Depending on the windowing strategy and allowed lateness, any records
>> that are output with a timestamp that is too early can be considered
>> droppably late, otherwise they will be late/ontime/early.
>>
>> So as an author for an SDF transform, you need to figure out:
>> 1) What timestamp you're going to output your records at
>> * use the upstream input element's timestamp: guidance is to use the default
>> implementation, which gets the upstream watermark by default
>> * use data from within the record being output, or external system state
>> via an API call: use a watermark estimator
>> 2) How you want to compute the watermark estimate (if at all)
>> * the choice here depends on how the elements' timestamps progress: are
>> they in exactly sorted order, almost sorted order, completely unsorted, ...?
>>
>> For both of these it is up to you to choose how much flexibility in these
>> decisions you want to give to your users, and that should guide what you
>> expose within the API (like how KafkaIO exposes a TimestampPolicy) - or you
>> can expose nothing, as many other sources do.
>>
>>
>> On Thu, Sep 17, 2020 at 8:43 AM Praveen K Viswanathan <
>> harish.prav...@gmail.com> wrote:
>>
>>> Hi Luke,
>>>

Re: Enabling checkpointing while running Flink Runner

2020-09-21 Thread Sruthi Sree Kumar
Thanks Kyle. 
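
For anyone who lands on this thread later: the interval is in milliseconds
(per the Flink runner docs linked below), so an invocation might pass
something like this (a sketch; the exact NEXMark arguments vary):

  --runner=FlinkRunner --streaming=true --checkpointingInterval=60000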

On 2020/09/09 14:20:30, Kyle Weaver  wrote: 
> > But, from the configuration, there is no way to pass the checkpoint
> interval.
> 
> Set the checkpointingInterval pipeline option.
> 
> https://beam.apache.org/documentation/runners/flink/
> 
> On Wed, Sep 9, 2020 at 4:44 AM Sruthi Sree Kumar 
> wrote:
> 
> > Hi,
> >
> > How do we enable checkpointing for the Flink Runner? To enable
> > checkpointing, we set the checkpoint interval. But, from the configuration,
> > there is no way to pass the checkpoint interval.
> >
> > We are trying to enable checkpointing while running NEXMark queries as
> > part of our experiments.
> >
> > --
> > Regards,
> >
> > Sruthi
> >
> >
> 


Re: Reviewers for SparkRunner

2020-09-21 Thread Kyle Weaver
I think Alexey would be a good reviewer for this.

Not sure about this particular case, but in general Beam's OWNERS files are
pretty outdated. If I'm not sure who to ask for a review, usually I look at
the git history of a file to see who has actively changed it recently.
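
For example, a command along these lines lists recent authors of a file (the
path is just an example):

  git log -5 --format='%an %ad %s' -- runners/spark/src/main/java/org/apache/beam/runners/spark/SparkRunner.java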

On Mon, Sep 21, 2020 at 1:53 PM  wrote:

> About a week ago I submitted the following pull request:
>
> https://github.com/apache/beam/pull/12850
>
> I listed @jbonofre as a reviewer as they were listed in the OWNERS file for
> the Spark runners.  However, I've yet to hear any response.  Is there
> someone else I should add as a reviewer?
>
> Appreciate it.
>
> Tim.
>
>


Re: What is the process to remove a Jenkins job?

2020-09-21 Thread Valentyn Tymofieiev
Thanks!

Added
https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-Deletingjobsthatarenolongerneeded
.

Feel free to correct if necessary.

On Mon, Sep 21, 2020 at 10:50 AM Luke Cwik  wrote:

> When the seed job runs next time, any job that isn't explicitly part of
> the seed job is disabled.
>
> The existing job history will stick around and eventually someone should
> delete them manually from Jenkins.
>

> On Mon, Sep 21, 2020 at 10:46 AM Valentyn Tymofieiev 
> wrote:
>
>> We are removing several jobs associated with Py2 and Py35. Is removing a
>> groovy file sufficient, or will Jenkins still remember the job from the
>> earlier 'Seed' invocation and continue running it until manually disabled?
>> If so, what's the process for manually disabling the job?
>>
>> Looked at Jenkins tips on the dev wiki[1] but didn't see these
>> instructions.
>>
>> Thanks!
>>
>> [1] https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips
>>
>


Re: What is the process to remove a Jenkins job?

2020-09-21 Thread Luke Cwik
When the seed job runs next time, any job that isn't explicitly part of the
seed job is disabled.

The existing job history will stick around and eventually someone should
delete them manually from Jenkins.

On Mon, Sep 21, 2020 at 10:46 AM Valentyn Tymofieiev 
wrote:

> We are removing several jobs associated with Py2 and Py35. Is removing a
> groovy file sufficient, or will Jenkins still remember the job from the
> earlier 'Seed' invocation and continue running it until manually disabled?
> If so, what's the process for manually disabling the job?
>
> Looked at Jenkins tips on the dev wiki[1] but didn't see these
> instructions.
>
> Thanks!
>
> [1] https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips
>


What is the process to remove a Jenkins job?

2020-09-21 Thread Valentyn Tymofieiev
We are removing several jobs associated with Py2 and Py35. Is removing a
groovy file sufficient, or will Jenkins still remember the job from the
earlier 'Seed' invocation and continue running it until manually disabled?
If so, what's the process for manually disabling the job?

Looked at Jenkins tips on the dev wiki[1] but didn't see these instructions.

Thanks!

[1] https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips


Re: How to gracefully stop a beam application

2020-09-21 Thread Alexey Romanenko
Well, I think cancel() should work, since it will call stop() in the end for 
the CANCELLED state.
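
For reference, a minimal sketch of the driver-side pattern using only the
public PipelineResult API:

  import java.io.IOException;
  import org.apache.beam.sdk.PipelineResult;

  class GracefulStop {
    static PipelineResult.State stop(PipelineResult result) throws IOException {
      result.cancel();                  // ask the runner to stop the job
      return result.waitUntilFinish();  // block until a terminal state, e.g. CANCELLED
    }
  }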

> On 21 Sep 2020, at 18:57, Mani Kolbe  wrote:
> 
> Hi Alexey,
> 
> stop() is a protected method. So I tried cancel() instead. Are they different 
> in behaviour?
> 
> Regards,
> Mani
> 
> On Mon, 21 Sep, 2020, 5:27 PM Alexey Romanenko wrote:
> This is how the Spark runner handles this:
> 
> https://github.com/apache/beam/blob/3e6f7b77add44b2b7b1a0ef4afd631642b7d0b59/runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineResult.java#L173
> 
>> On 21 Sep 2020, at 18:17, Luke Cwik wrote:
>> 
>> +user
>> 
>> On Mon, Sep 21, 2020 at 9:16 AM Luke Cwik wrote:
>> You need the "sources" to stop and advance the watermark to infinity and 
>> have that propagate through the entire pipeline. There are proposals for 
>> pipeline drain[1] and also for snapshot and update[2] for Apache Beam. We 
>> would love contributions in this space.
>> 
>> Max shared some more details about how Flink users typically do this[3], 
>> does that apply to Spark?
>> 
>> 1: https://docs.google.com/document/d/1NExwHlj-2q2WUGhSO4jTu8XGhDPmm3cllSN8IMmWci8
>> 2: https://docs.google.com/document/d/1UWhnYPgui0gUYOsuGcCjLuoOUlGA4QaY91n8p3wz9MY
>> 3: https://lists.apache.org/thread.html/864eb7b4e7192706074059eef1e116146382552fa885dd6054ef4988%40%3Cuser.beam.apache.org%3E
>> 
>> On Mon, Sep 21, 2020 at 7:43 AM Sunny, Mani Kolbe wrote:
>> Forgot to mention, we are using spark runner.
>> 
>>  
>> 
>> From: Sunny, Mani Kolbe <sun...@dnb.com>
>> Sent: Monday, September 21, 2020 12:33 PM
>> To: dev@beam.apache.org 
>> Subject: How to gracefully stop a beam application
>> 
>> Hello Beam community,
>> 
>>  
>> 
>> When you are running a Beam application in full stream mode, it is 
>> continuously running. What is the recommended way to stop it gracefully 
>> for, say, maintenance/upgrades etc.? When I say gracefully, I mean (1) 
>> without data loss and (2) the application exiting with exit code 0.
>> 
>>  
>> 
>> Regards,
>> 
>> Mani
>> 
> 



Re: How to gracefully stop a beam application

2020-09-21 Thread Mani Kolbe
Hi Alexey,

stop() is a protected method. So I tried cancel() instead. Are they
different in behaviour?

Regards,
Mani

On Mon, 21 Sep, 2020, 5:27 PM Alexey Romanenko, 
wrote:

> This is how the Spark runner handles this:
>
>
> https://github.com/apache/beam/blob/3e6f7b77add44b2b7b1a0ef4afd631642b7d0b59/runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineResult.java#L173
>
> On 21 Sep 2020, at 18:17, Luke Cwik  wrote:
>
> +user 
>
> On Mon, Sep 21, 2020 at 9:16 AM Luke Cwik  wrote:
>
>> You need the "sources" to stop and advance the watermark to infinity and
>> have that propagate through the entire pipeline. There are proposals for
>> pipeline drain[1] and also for snapshot and update[2] for Apache Beam. We
>> would love contributions in this space.
>>
>> Max shared some more details about how Flink users typically do this[3],
>> does that apply to Spark?
>>
>> 1:
>> https://docs.google.com/document/d/1NExwHlj-2q2WUGhSO4jTu8XGhDPmm3cllSN8IMmWci8
>> 2:
>> https://docs.google.com/document/d/1UWhnYPgui0gUYOsuGcCjLuoOUlGA4QaY91n8p3wz9MY
>> 3:
>> https://lists.apache.org/thread.html/864eb7b4e7192706074059eef1e116146382552fa885dd6054ef4988%40%3Cuser.beam.apache.org%3E
>>
>> On Mon, Sep 21, 2020 at 7:43 AM Sunny, Mani Kolbe  wrote:
>>
>>> Forgot to mention, we are using spark runner.
>>>
>>>
>>>
>>> *From:* Sunny, Mani Kolbe 
>>> *Sent:* Monday, September 21, 2020 12:33 PM
>>> *To:* dev@beam.apache.org
>>> *Subject:* How to gracefully stop a beam application
>>>
>>> Hello Beam community,
>>>
>>>
>>>
>>> When you are running a Beam application in full stream mode, it is
>>> continuously running. What is the recommended way to stop it gracefully
>>> for, say, maintenance/upgrades etc.? When I say gracefully, I mean (1)
>>> without data loss and (2) the application exiting with exit code 0.
>>>
>>>
>>>
>>> Regards,
>>>
>>> Mani
>>>
>>
>


Re: How to gracefully stop a beam application

2020-09-21 Thread Alexey Romanenko
This is how the Spark runner handles this:

https://github.com/apache/beam/blob/3e6f7b77add44b2b7b1a0ef4afd631642b7d0b59/runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineResult.java#L173

> On 21 Sep 2020, at 18:17, Luke Cwik  wrote:
> 
> +user  
> 
> On Mon, Sep 21, 2020 at 9:16 AM Luke Cwik wrote:
> You need the "sources" to stop and advance the watermark to infinity and have 
> that propagate through the entire pipeline. There are proposals for pipeline 
> drain[1] and also for snapshot and update[2] for Apache Beam. We would love 
> contributions in this space.
> 
> Max shared some more details about how Flink users typically do this[3], does 
> that apply to Spark?
> 
> 1: https://docs.google.com/document/d/1NExwHlj-2q2WUGhSO4jTu8XGhDPmm3cllSN8IMmWci8
> 2: https://docs.google.com/document/d/1UWhnYPgui0gUYOsuGcCjLuoOUlGA4QaY91n8p3wz9MY
> 3: https://lists.apache.org/thread.html/864eb7b4e7192706074059eef1e116146382552fa885dd6054ef4988%40%3Cuser.beam.apache.org%3E
> 
> On Mon, Sep 21, 2020 at 7:43 AM Sunny, Mani Kolbe wrote:
> Forgot to mention, we are using spark runner.
> 
>  
> 
> From: Sunny, Mani Kolbe  
> Sent: Monday, September 21, 2020 12:33 PM
> To: dev@beam.apache.org 
> Subject: How to gracefully stop a beam application
> 
> Hello Beam community,
> 
>  
> 
> When you are running a Beam application in full stream mode, it is 
> continuously running. What is the recommended way to stop it gracefully 
> for, say, maintenance/upgrades etc.? When I say gracefully, I mean (1) 
> without data loss and (2) the application exiting with exit code 0.
> 
>  
> 
> Regards,
> 
> Mani
> 



Re: How to gracefully stop a beam application

2020-09-21 Thread Luke Cwik
+user 

On Mon, Sep 21, 2020 at 9:16 AM Luke Cwik  wrote:

> You need the "sources" to stop and advance the watermark to infinity and
> have that propagate through the entire pipeline. There are proposals for
> pipeline drain[1] and also for snapshot and update[2] for Apache Beam. We
> would love contributions in this space.
>
> Max shared some more details about how Flink users typically do this[3],
> does that apply to Spark?
>
> 1:
> https://docs.google.com/document/d/1NExwHlj-2q2WUGhSO4jTu8XGhDPmm3cllSN8IMmWci8
> 2:
> https://docs.google.com/document/d/1UWhnYPgui0gUYOsuGcCjLuoOUlGA4QaY91n8p3wz9MY
> 3:
> https://lists.apache.org/thread.html/864eb7b4e7192706074059eef1e116146382552fa885dd6054ef4988%40%3Cuser.beam.apache.org%3E
>
> On Mon, Sep 21, 2020 at 7:43 AM Sunny, Mani Kolbe  wrote:
>
>> Forgot to mention, we are using spark runner.
>>
>>
>>
>> *From:* Sunny, Mani Kolbe 
>> *Sent:* Monday, September 21, 2020 12:33 PM
>> *To:* dev@beam.apache.org
>> *Subject:* How to gracefully stop a beam application
>>
>> Hello Beam community,
>>
>>
>>
>> When you are running a Beam application in full stream mode, it is
>> continuously running. What is the recommended way to stop it gracefully
>> for, say, maintenance/upgrades etc.? When I say gracefully, I mean (1)
>> without data loss and (2) the application exiting with exit code 0.
>>
>>
>>
>> Regards,
>>
>> Mani
>>
>


Re: [DISCUSS] Move Avro dependency out of core Beam

2020-09-21 Thread Cristian Constantinescu
All the proposed solutions seem reasonable. I'm not sure if one has an edge
over the other. I guess it depends on how cautiously the community would
like to move.

Maybe it's just my impression, but it seems to me that there are a few
changes that are held back for the sake of backwards compatibility. If this
is truly the case, maybe we can start thinking about the next major version
of Beam where:
- the dependency report email (Beam Dependency Check Report (2020-09-21))
can be tackled. There are quite a few deps in there that need to be
updated. I'm sure there will be more breaking changes.
- review some architectural decisions that are difficult to correct without
breaking things (eg: Avro in Core)
- compatibility with Java 11+
- features that we can't implement with our current code base

I'm not sure what Beam's roadmap is, but maybe we could set up a branch in
the main repo (and the checks) and try to tackle all this work, so we get a
better idea of the scope (and the unforeseen issues that will come forward)
that's really needed for a potential RC build in the short-term future.
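
As a concrete illustration of the reflective-adapter idea Luke sketches
further down in this thread, the version probe can run once at class load. (A
sketch only: the probe relies on Avro 1.9 having renamed the joda-time based
TimeConversions.TimestampConversion to the java.time based
TimestampMillisConversion - my assumption, not Beam code - and the adapter
types are the hypothetical ones from Luke's pseudo-code.)

  class AvroAdapters {
    static final AvroAdapter INSTANCE;
    static {
      boolean avro19;
      try {
        // Present in Avro 1.9+, absent in Avro 1.8.
        Class.forName("org.apache.avro.data.TimeConversions$TimestampMillisConversion");
        avro19 = true;
      } catch (ClassNotFoundException e) {
        avro19 = false;
      }
      INSTANCE = avro19 ? new Avro19Adapter() : new Avro18Adapter();
    }
  }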

On Wed, Sep 16, 2020 at 6:40 AM Robert Bradshaw  wrote:

> An adapter seems a reasonable approach, and shouldn't be too hard.
>
> If the breakage is "we no longer provide Avro 1.8 by default; please
> depend on it explicitly if this breaks you" that seems reasonable to me, as
> it's easy to detect and remedy.
>
> On Tue, Sep 15, 2020 at 2:42 PM Ismaël Mejía  wrote:
>
>> Avro differences in our implementation are pretty minimal if you look at
>> the PR,
>> to the point that an Adapter should be really tiny if even needed.
>>
>> The big backwards-incompatible changes in Avro > 1.8 were to remove
>> external libraries from the public APIs, e.g. guava, jackson and joda-time.
>> Of course this does not seem to be much, but almost every project using
>> Avro was using some of these dependencies; luckily for Beam it was only
>> joda-time, and that is already fixed.
>>
>> Keeping backwards compatibility by making Avro part of an extension that is
>> optional for core, and using only Avro 1.8 compatible features in Beam's
>> source code, is the simplest path and allows us to avoid all the issues.
>> Notice that the dependency that triggered the need for Avro 1.9 (and this
>> thread) is a runtime dependency used by Confluent Schema Registry, and this
>> is an issue because sdks-java-core is leaking Avro. Apart from this I am not
>> aware of any feature in any other project that obliges anyone to use Avro
>> 1.9 or 1.10 specific code.
>>
>> Of course, a really valid reason to want to use a more recent version of
>> Avro is that Avro 1.8 also leaks unmaintained dependencies with serious
>> security issues (Jackson 1.x).
>>
>> So in the end my main concern is breaking existing users' code. This has
>> less impact for us (Talend) but probably more for the rest of the
>> community; but if we agree to break backwards compatibility for the sake
>> of cleanliness, we should probably proceed, and of course also give users
>> some warning.
>>
>> On Mon, Sep 14, 2020 at 7:13 PM Luke Cwik  wrote:
>> >
>> > In the Kafka module we reflectively figure out which version of Kafka
>> exists[1] on the classpath and then reflectively invoke some APIs to work
>> around differences in Kafka allowing our users to bring whichever version
>> they want.
>> >
>> > We could do something similar here and reflectively figure out which
>> Avro is on the classpath and invoke the appropriate methods. If we are
>> worried about performance of using reflection, we can write and compile two
>> different versions of an Avro adapter class and choose which one to use
>> (using reflection only once during class loading).
>> >
>> > e.g.
>> > AvroAdapter {
>> >   static final AvroAdapter INSTANCE;
>> >   static {
>> >     if (avro19?) {
>> >       INSTANCE = new Avro19Adapter();
>> >     } else {
>> >       INSTANCE = new Avro18Adapter();
>> >     }
>> >   }
>> >
>> >   ... methods needed for AvroAdapter implementations ...
>> > }
>> >
>> > Using reflection allows the user to choose which version they use and
>> pushes down the incompatibility issue from Apache Beam to our deps (e.g.
>> Spark).
>> >
>> > 1:
>> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/ConsumerSpEL.java
>> >
>> > On Fri, Sep 11, 2020 at 11:42 AM Kenneth Knowles 
>> wrote:
>> >>
>> >> I am not deep on the details myself but have reviewed various Avro
>> upgrade changes such as https://github.com/apache/beam/pull/9779 and
>> also some internal that I cannot link to. I believe the changes are small
>> and quite possibly we can create sdks/java/extensions/avro that works with
>> both Avro 1.8 and 1.9 and make Dataflow worker compatible with whatever the
>> user chooses. (I would expect Spark is trying to get to that point too?)
>> >>
>> >> So then if we have that can we achieve the goals? Spark runner users
>> that do not use Avro 

RE: How to gracefully stop a beam application

2020-09-21 Thread Sunny, Mani Kolbe
Forgot to mention, we are using spark runner.

From: Sunny, Mani Kolbe 
Sent: Monday, September 21, 2020 12:33 PM
To: dev@beam.apache.org
Subject: How to gracefully stop a beam application

Hello Beam community,

When you are running a Beam application in full stream mode, it is continuously 
running. What is the recommended way to stop it gracefully for, say, 
maintenance/upgrades etc.? When I say gracefully, I mean (1) without data loss 
and (2) the application exiting with exit code 0.

Regards,
Mani


Beam Dependency Check Report (2020-09-21)

2020-09-21 Thread Apache Jenkins Server

High Priority Dependency Updates Of Beam Python SDK:

  Dependency Name | Current Version | Latest Version | Current Released | Latest Released | JIRA Issue
  cachetools | 3.1.1 | 4.1.1 | 2019-12-23 | 2020-07-08 | BEAM-9017
  chromedriver-binary | 83.0.4103.39.0 | 86.0.4240.22.0 | 2020-07-08 | 2020-09-07 | BEAM-10426
  fastavro | 0.23.6 | 1.0.0.post1 | 2020-08-03 | 2020-08-28 | BEAM-10798
  mock | 2.0.0 | 3.0.5 | 2019-05-20 | 2019-05-20 | BEAM-7369
  mypy-protobuf | 1.18 | 1.23 | 2020-03-24 | 2020-06-29 | BEAM-10346
  oauth2client | 3.0.0 | 4.1.3 | 2018-12-10 | 2018-12-10 | BEAM-6089
  pyarrow | 0.17.1 | 1.0.1 | 2020-07-27 | 2020-08-24 | BEAM-10582
  PyHamcrest | 1.10.1 | 2.0.2 | 2020-01-20 | 2020-07-08 | BEAM-9155
  pytest | 4.6.11 | 6.0.2 | 2020-07-08 | 2020-09-14 | BEAM-8606
  pytest-xdist | 1.34.0 | 2.1.0 | 2020-08-17 | 2020-08-28 | BEAM-10713
  setuptools | 49.6.0 | 50.3.0 | 2020-08-17 | 2020-09-07 | BEAM-10714
  tenacity | 5.1.5 | 6.2.0 | 2019-11-11 | 2020-06-29 | BEAM-8607

High Priority Dependency Updates Of Beam Java SDK:

  Dependency Name | Current Version | Latest Version | Current Released | Latest Released | JIRA Issue
  com.amazonaws:amazon-kinesis-producer | 0.13.1 | 0.14.1 | 2019-07-31 | 2020-07-31 | BEAM-10628
  com.azure:azure-storage-blob | 12.1.0 | 12.8.0 | 2019-12-05 | 2020-08-13 | BEAM-10800
  com.datastax.cassandra:cassandra-driver-core | 3.8.0 | 4.0.0 | 2019-10-29 | 2019-03-18 | BEAM-8674
  com.esotericsoftware:kryo | 4.0.2 | 5.0.0-RC9 | 2018-03-20 | 2020-08-14 | BEAM-5809
  com.esotericsoftware.kryo:kryo | 2.21 | 2.24.0 | 2013-02-27 | 2014-05-04 | BEAM-5574
  com.github.ben-manes.versions:com.github.ben-manes.versions.gradle.plugin | 0.29.0 | 0.33.0 | 2020-07-20 | 2020-09-14 | BEAM-6645
  com.google.api.grpc:grpc-google-cloud-pubsub-v1 | 1.85.1 | 1.90.1 | 2020-03-09 | 2020-08-04 | BEAM-8677
  com.google.api.grpc:grpc-google-common-protos | 1.12.0 | 1.18.1 | 2018-06-29 | 2020-08-11 | BEAM-8633
  com.google.api.grpc:proto-google-cloud-bigquerystorage-v1beta1 | 0.85.1 | 0.105.1 | 2020-01-08 | 2020-08-31 | BEAM-8678
  com.google.api.grpc:proto-google-cloud-bigtable-v2 | 1.9.1 | 1.15.0 | 2020-01-10 | 2020-09-02 | BEAM-8679
  com.google.api.grpc:proto-google-cloud-datastore-v1 | 0.85.0 | 0.88.0 | 2019-12-05 | 2020-09-17 | BEAM-8680
  com.google.api.grpc:proto-google-cloud-pubsub-v1 | 1.85.1 | 1.90.1 | 2020-03-09 | 2020-08-04 | BEAM-8681
  com.google.api.grpc:proto-google-cloud-spanner-admin-database-v1 | 1.59.0 | 2.0.1 | 2020-07-16 | 2020-09-18 | BEAM-8682
  com.google.apis:google-api-services-bigquery | v2-rev20200719-1.30.10 | v2-rev20200827-1.30.10 | 2020-07-26 | 2020-09-03 | BEAM-8684
  com.google.apis:google-api-services-clouddebugger | v2-rev20200501-1.30.10 | v2-rev20200807-1.30.10 | 2020-07-14 | 2020-08-17 | BEAM-8750
  com.google.apis:google-api-services-cloudresourcemanager | v1-rev20200720-1.30.10 | v2-rev20200831-1.30.10 | 2020-07-25 | 2020-09-03 | BEAM-8751
  com.google.apis:google-api-services-dataflow | v1b3-rev20200713-1.30.10 | v1beta3-rev12-1.20.0 | 2020-07-25 | 2015-04-29 | BEAM-8752
  com.google.apis:google-api-services-healthcare | v1beta1-rev20200713-1.30.10 | v1-rev20200909-1.30.10 | 2020-07-24 | 2020-09-15 | BEAM-10349
  com.google.apis:google-api-services-pubsub | v1-rev20200713-1.30.10 | v1-rev20200909-1.30.10 | 2020-07-25 | 2020-09-18 | BEAM-8753
  com.google.apis:google-api-services-storage | v1-rev20200611-1.30.10 | v1-rev20200814-1.30.10 | 2020-07-10 | 2020-09-07 | BEAM-8754
  com.google.auto.service:auto-service | 1.0-rc6 | 1.0-rc7 | 2019-07-16 | 2020-05-13 | BEAM-5541
  com.google.auto.service:auto-service-annotations | 1.0-rc6 | 1.0-rc7 | 2019-07-16 | 2020-05-13 | BEAM-10350
  com.google.cloud:google-cloud-bigquery | 1.108.0 | 1.118.0 | 2020-02-28 | 2020-09-17 | BEAM-8687
  com.google.cloud:google-cloud-bigquerystorage | 0.125.0-beta | 1.5.1 | 2020-02-20 | (report truncated)

How to gracefully stop a beam application

2020-09-21 Thread Sunny, Mani Kolbe
Hello Beam community,

When you are running a Beam application in full stream mode, it is continuously 
running. What is the recommended way to stop it gracefully for, say, 
maintenance/upgrades etc.? When I say gracefully, I mean (1) without data loss 
and (2) the application exiting with exit code 0.

Regards,
Mani


Re: [DISCUSS] Clearing timers (https://github.com/apache/beam/pull/12836)

2020-09-21 Thread Jan Lukavský
Big +1 to fixing this issue. Clearing & setting a timer should IMHO follow 
the same "happens-before" semantics. If a timer is cleared before actually 
firing, then it should definitely not fire.


Jan

On 9/19/20 2:26 AM, Reuven Lax wrote:
It appears that this PR tried to fix things for the Dataflow runner - 
https://github.com/apache/beam/pull/11924. It also ensures that if a 
timer fired in a bundle but was reset mid-bundle to a later time, then 
we will skip that timer firing.


I believe this bug was also fixed in the direct runner previously. We 
probably still need to fix it for other runners and portability.


Reuven

On Fri, Sep 18, 2020 at 4:48 PM Boyuan Zhang wrote:


Hi Reuven,

Would you like to share the links to potential fixes? We can
figure out what we can do there.

On Fri, Sep 18, 2020 at 4:21 PM Reuven Lax <re...@google.com> wrote:



On Fri, Sep 18, 2020 at 3:14 PM Luke Cwik <lc...@google.com> wrote:

PR 12836[1] is adding support for clearing timers and
there is a discussion about what the semantics for a
cleared timer should be.

So far we have:
1) Clearing an unset timer is a no-op
2) If the last action on the timer was to clear it, then a
future bundle should not see it fire

Ambiguity occurs if the last action on a timer was to
clear it within the same bundle: should the current
bundle then not see it fire, if the firing has yet to
become visible to the user? Since element processing and
timer firings are "unordered", this can happen.

Having the clear prevent the timer from firing within the
same bundle if it has yet to fire could make sense and
simplifies clearing timer loops. For example:

@ProcessElement process(ProcessContext c) {
  if (initialCondition) {
    setTimer();
  } else {
    clearTimer();
  }
}

@OnTimer onTimer(...) {
  // do some side effect
  // set timer to fire again in the future
}

would require logic within the onTimer() method to check
to see if we should stop instead of relying on the fact
that the clear will prevent the timer loop.

On the other hand, we currently don't prevent timers from
firing that are eligible within the same bundle if their
firing time is changed within the bundle to some future
time. Clearing timers could be treated conceptually like
setting them to "infinity" and hence the current set logic
would suggest that we shouldn't prevent timer firings
that are part of the same bundle.


This "current" behavior is a bug, and one that has led to some
weird effects. There have been some PRs attempting to fix it,
and I think we should prioritize fixing this bug.


Are there additional use cases that we should consider
that suggest one approach over the other?
What do people think?

1: https://github.com/apache/beam/pull/12836
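
For concreteness, a sketch of that loop against the real timer API (the
clear() call is the one this PR proposes; shouldRun is a hypothetical
predicate):

  import org.apache.beam.sdk.state.TimeDomain;
  import org.apache.beam.sdk.state.Timer;
  import org.apache.beam.sdk.state.TimerSpec;
  import org.apache.beam.sdk.state.TimerSpecs;
  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.values.KV;
  import org.joda.time.Duration;

  class TimerLoopFn extends DoFn<KV<String, String>, Void> {

    @TimerId("loop")
    private final TimerSpec loopSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

    @ProcessElement
    public void process(@Element KV<String, String> e, @TimerId("loop") Timer loop) {
      if (shouldRun(e)) {                                      // hypothetical condition
        loop.offset(Duration.standardMinutes(1)).setRelative();
      } else {
        loop.clear(); // proposed semantics: a cleared timer should not fire
      }
    }

    @OnTimer("loop")
    public void onLoop(@TimerId("loop") Timer loop) {
      // ... do some side effect, then keep the loop going:
      loop.offset(Duration.standardMinutes(1)).setRelative();
    }

    private boolean shouldRun(KV<String, String> e) {
      return e.getValue() != null;
    }
  }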