Re: [Proposal] Add a new Beam example to ingest data from Kafka to Pub/Sub

2020-10-14 Thread Reza Ardeshir Rokni
Just a thought, but what if in the future there were templates for other
runners?

Then having a template folder would fit nicely, no? We could even have a
runner-specific subfolder, and maybe even a shared area for things that
could be used by all templates across all runners?

On Thu, 15 Oct 2020 at 11:47, Kenneth Knowles  wrote:

> Hi Ilya,
>
> I have added you to the "Contributors" role on Jira so you can be assigned
> tickets, and given you the ticket you filed since you are already solving
> it. Thanks!
>
> I have a very high level thought: Since Dataflow's "Flex Templates"
> feature is just any pipeline, perhaps the main pipeline can be more of an
> "example" and fit into the `examples/` folder? Then the containerization
> and Google-specific* JSON could be alongside. In this way, users of other
> runners could possibly use or learn from it even if they are not interested
> in GCP. I understand this is not your primary goal, considering
> the contribution. I just want to open this for discussion.
>
> Kenn
>
> *In fact, the JSON is very generic. It is not really "Google specific" in
> concept, just in practice.
>
> On Wed, Oct 14, 2020 at 12:14 PM Ilya Kozyrev 
> wrote:
>
>> Hi Beam Community,
>>
>> There was no feedback on the proposal, so I would like to submit a PR for
>> it.
>>
>> I created a JIRA improvement
>>  to track this
>> proposal and am now submitting a PR
>>  in the Beam repository
>> related to the proposal that I sent before. We suggest adding a /template
>> folder to the repository root to make templates easier for developers to
>> discover. This will provide structure for future template development in Beam.
>>
>> Could someone kindly help with reviewing the PR
>>  ?
>>
>> Thank you,
>> Ilya
>>
>> On 7 Oct 2020, at 21:23, Ilya Kozyrev  wrote:
>>
>> Hi Beam Community,
>>
>> I have a proposal to add an Apache Beam example that is a template to ingest
>> data from Apache Kafka to Google Cloud Pub/Sub. More detailed information
>> about the proposed template can be found in the README file, and a prototype
>> was built with a team. I'd like to ask for your feedback before moving forward with
>> finishing the template.
>>
>> I did not see a folder that makes templates easily discoverable for
>> developers. I would like to propose adding a "templates" folder where other
>> Apache Beam templates may be added in the future. E.g.,
>> beam/templates/java/kafka-to-pubsub could be used for the Kafka to Pub/Sub
>> template.
>>
>> Please share your feedback/comments about this proposal in the thread.
>>
>> Thank you,
>> Ilya
>>
>>
>>


Re: beam-sdks-java-bom.pom cannot be signed after upgrade to Gradle 6

2020-10-14 Thread Garrett Jones
My knowledge of this stuff has gotten rusty. What I remember: I had to do
special work since Gradle didn't have support for generating BOMs. The
generation process is kind of finicky because various stages need to run in
order and that order isn't obvious based on the structure of the build
rules; it must be that the signing stage doesn't see the generated pom
file. If you just want to get the release unblocked, you might have to
either 1) add a hack to generate the asc file at the right stage, or 2)
downgrade Gradle. Also, it's worth checking to see if Gradle 6 now has
support for generating BOMs, which it did not previously have, but that would
be a bigger change than you might want to accept for unblocking the release.

-- Garrett Jones [ go/garrettjones-user-manual |
go/reconsider-inline-replies ]


On Wed, Oct 14, 2020 at 8:07 PM Kenneth Knowles  wrote:

> +Garrett Jones  who appears to have been
> involved and +Michael Luckey  who has touched this
> build.gradle (according to git) and is a bit of a build wizard.
>
> Kenn
>
> On Wed, Oct 14, 2020 at 7:34 PM Robin Qiu  wrote:
>
>> Hi all,
>>
>> I am working on creating Beam 2.25.0 RC1. The repo I created (
>> https://repository.apache.org/#stagingRepositories) failed to close
>> because
>>
>> Missing Signature:
>>> '/org/apache/beam/beam-sdks-java-bom/2.25.0/beam-sdks-java-bom-2.25.0.pom.asc'
>>> does not exist for 'beam-sdks-java-bom-2.25.0.pom'.
>>
>>
>> I checked pom files in other modules and their signatures are present, so
>> I think this problem only happens to beam-sdks-java-bom-2.25.0.pom. Also
>> this has not happened in previous releases. I suspect this is caused by the
>> recent upgrade to Gradle 6.
>>
>> I found that
>> https://github.com/apache/beam/blob/master/sdks/java/bom/build.gradle
>> does something special. It does not use a generated pom; instead, it uses
>> its own template and copies that
>> to sdks/java/bom/build/publications/mavenJava/ as pom-default.xml. When I
>> ran the publish task locally, I found in
>> sdks/java/bom/build/publications/mavenJava/ that the pom-default.xml is
>> signed (i.e. pom-default.xml.asc is present), but
>> beam-sdks-java-bom-2.25.0.pom is not signed (i.e.
>> beam-sdks-java-bom-2.25.0.pom.asc is not present) in the output repository.
>>
>> I tried to understand how the Gradle plugins (maven-publish and signing)
>> work and changed a few different configurations in
>> https://github.com/apache/beam/blob/master/sdks/java/bom/build.gradle
>> but that didn't help. Does anyone have context on this issue or have any
>> suggestions that I could try? This is currently blocking the 2.25.0 release.
>>
>> Thanks,
>> Robin
>>
>>
>


Re: [Proposal] Add a new Beam example to ingest data from Kafka to Pub/Sub

2020-10-14 Thread Kenneth Knowles
Hi Ilya,

I have added you to the "Contributors" role on Jira so you can be assigned
tickets, and given you the ticket you filed since you are already solving
it. Thanks!

I have a very high level thought: Since Dataflow's "Flex Templates" feature
is just any pipeline, perhaps the main pipeline can be more of an "example"
and fit into the `examples/` folder? Then the containerization and
Google-specific* JSON could be alongside. In this way, users of other
runners could possibly use or learn from it even if they are not interested
in GCP. I understand this is not your primary goal, considering
the contribution. I just want to open this for discussion.

Kenn

*In fact, the JSON is very generic. It is not really "Google specific" in
concept, just in practice.

On Wed, Oct 14, 2020 at 12:14 PM Ilya Kozyrev 
wrote:

> Hi Beam Community,
>
> There was no feedback on the proposal, so I would like to submit a PR for
> it.
>
> I created a JIRA improvement
>  to track this proposal
> and am now submitting a PR  in the
> Beam repository related to the proposal that I sent before. We suggest
> adding a /template folder to the repository root to make templates easier for
> developers to discover. This will provide structure for future template
> development in Beam.
>
> Could someone kindly help with reviewing the PR
>  ?
>
> Thank you,
> Ilya
>
> On 7 Oct 2020, at 21:23, Ilya Kozyrev  wrote:
>
> Hi Beam Community,
>
> I have a proposal to add an Apache Beam example that is a template to ingest
> data from Apache Kafka to Google Cloud Pub/Sub. More detailed information
> about the proposed template can be found in the README file, and a prototype
> was built with a team. I'd like to ask for your feedback before moving forward with
> finishing the template.
>
> I did not see a folder that makes templates easily discoverable for
> developers. I would like to propose adding a "templates" folder where other
> Apache Beam templates may be added in the future. E.g.,
> beam/templates/java/kafka-to-pubsub could be used for the Kafka to Pub/Sub
> template.
>
> Please share your feedback/comments about this proposal in the thread.
>
> Thank you,
> Ilya
>
>
>


Re: beam-sdks-java-bom.pom cannot be signed after upgrade to Gradle 6

2020-10-14 Thread Kenneth Knowles
+Garrett Jones  who appears to have been involved
and +Michael Luckey  who has touched this build.gradle
(according to git) and is a bit of a build wizard.

Kenn

On Wed, Oct 14, 2020 at 7:34 PM Robin Qiu  wrote:

> Hi all,
>
> I am working on creating Beam 2.25.0 RC1. The repo I created (
> https://repository.apache.org/#stagingRepositories) failed to close
> because
>
> Missing Signature:
>> '/org/apache/beam/beam-sdks-java-bom/2.25.0/beam-sdks-java-bom-2.25.0.pom.asc'
>> does not exist for 'beam-sdks-java-bom-2.25.0.pom'.
>
>
> I checked pom files in other modules and their signatures are present, so
> I think this problem only happens to beam-sdks-java-bom-2.25.0.pom. Also
> this has not happened in previous releases. I suspect this is caused by the
> recent upgrade to Gradle 6.
>
> I found that
> https://github.com/apache/beam/blob/master/sdks/java/bom/build.gradle
> does something special. It does not use a generated pom; instead, it uses
> its own template and copies that
> to sdks/java/bom/build/publications/mavenJava/ as pom-default.xml. When I
> ran the publish task locally, I found in
> sdks/java/bom/build/publications/mavenJava/ that the pom-default.xml is
> signed (i.e. pom-default.xml.asc is present), but
> beam-sdks-java-bom-2.25.0.pom is not signed (i.e.
> beam-sdks-java-bom-2.25.0.pom.asc is not present) in the output repository.
>
> I tried to understand how the Gradle plugins (maven-publish and signing)
> work and changed a few different configurations in
> https://github.com/apache/beam/blob/master/sdks/java/bom/build.gradle but
> that didn't help. Does anyone have context on this issue or have any
> suggestions that I could try? This is currently blocking the 2.25.0 release.
>
> Thanks,
> Robin
>
>


beam-sdks-java-bom.pom cannot be signed after upgrade to Gradle 6

2020-10-14 Thread Robin Qiu
Hi all,

I am working on creating Beam 2.25.0 RC1. The repo I created (
https://repository.apache.org/#stagingRepositories) failed to close because

Missing Signature:
> '/org/apache/beam/beam-sdks-java-bom/2.25.0/beam-sdks-java-bom-2.25.0.pom.asc'
> does not exist for 'beam-sdks-java-bom-2.25.0.pom'.


I checked pom files in other modules and their signatures are present, so I
think this problem only happens to beam-sdks-java-bom-2.25.0.pom. Also this
has not happened in previous releases. I suspect this is caused by the
recent upgrade to Gradle 6.

I found that
https://github.com/apache/beam/blob/master/sdks/java/bom/build.gradle does
something special. It does not use a generated pom; instead, it uses its own
template and copies that to sdks/java/bom/build/publications/mavenJava/ as
pom-default.xml. When I ran the publish task locally, I found in
sdks/java/bom/build/publications/mavenJava/ that the pom-default.xml is
signed (i.e. pom-default.xml.asc is present), but
beam-sdks-java-bom-2.25.0.pom is not signed (i.e.
beam-sdks-java-bom-2.25.0.pom.asc is not present) in the output repository.

I tried to understand how the Gradle plugins (maven-publish and signing)
work and changed a few different configurations in
https://github.com/apache/beam/blob/master/sdks/java/bom/build.gradle but
that didn't help. Does anyone have context on this issue or have any
suggestions that I could try? This is currently blocking the 2.25.0 release.

Thanks,
Robin


Re: [Proposal] Add a new Beam example to ingest data from Kafka to Pub/Sub

2020-10-14 Thread Ilya Kozyrev
Hi Beam Community,

There was no feedback on the proposal, so I would like to submit a PR for it.

I created a JIRA improvement 
to track this proposal and am now submitting a 
PR in the Beam repository related to 
the proposal that I sent before. We suggest adding a /template folder to the 
repository root to make templates easier for developers to discover. This will 
provide structure for future template development in Beam.

Could someone kindly help with reviewing the 
PR ?

Thank you,
Ilya

On 7 Oct 2020, at 21:23, Ilya Kozyrev <ilya.kozy...@akvelon.com> wrote:

Hi Beam Community,

I have a proposal to add an Apache Beam example that is a template to ingest data 
from Apache Kafka to Google Cloud Pub/Sub. More detailed information about the 
proposed template can be found in the README file, and a prototype was built with a 
team. I'd like to ask for your feedback before moving forward with finishing 
the template.
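
To give a rough feel for what the proposed template boils down to, here is a
minimal, illustrative Java sketch of a Kafka-to-Pub/Sub pipeline built from
KafkaIO and PubsubIO. The class name, broker address, and topic names are
placeholders, and this is not the template code itself; the actual
implementation (configuration options, containerization, etc.) is in the
linked PR and README.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Values;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaToPubsubSketch {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline pipeline = Pipeline.create(options);

    pipeline
        // Read string records from Kafka; broker and topic are placeholders.
        .apply("ReadFromKafka",
            KafkaIO.<String, String>read()
                .withBootstrapServers("kafka-broker:9092")
                .withTopic("input-topic")
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .withoutMetadata())
        // Keep only the record values.
        .apply("Values", Values.<String>create())
        // Publish each value as a Pub/Sub message; the topic path is a placeholder.
        .apply("WriteToPubsub",
            PubsubIO.writeStrings().to("projects/my-project/topics/my-topic"));

    pipeline.run();
  }
}
```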

I did not see a folder that makes templates easily discoverable for developers. 
I would like to propose adding a "templates" folder where other 
Apache Beam templates may be added in the future. E.g., 
beam/templates/java/kafka-to-pubsub could be used for the Kafka to Pub/Sub 
template.

Please share your feedback/comments about this proposal in the thread.

Thank you,
Ilya



Re: Dataflow updates fail with "Coder has changed" error using KafkaIO with SchemaCoder

2020-10-14 Thread Cameron Morgan
We are using an Avro Schema Registry and converting these schemas to Beam 
Schemas with `AvroUtils.toBeamSchema`: 
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/AvroUtils.java#L314
 


It could be possible that the order is not preserved, but it's not immediately 
obvious to me how that would happen. I will try sorting the fields before using 
the schema in the SchemaCoder.
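
As a rough sketch of that sorting idea (the helper below is hypothetical, not
something that exists in Beam or in our codebase), the Beam schema derived from
the Avro schema could be rebuilt with a deterministic field order before it is
handed to SchemaCoder. Note this only helps if the same ordering is applied
from the first deployment onward, since reordering fields also changes the
encoded layout:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.utils.AvroUtils;

class StableSchemas {
  // Convert an Avro schema to a Beam schema, then rebuild it with fields in a
  // deterministic (name-sorted) order so repeated pipeline submissions derive
  // the same field ordering for SchemaCoder.
  static Schema toStableBeamSchema(org.apache.avro.Schema avroSchema) {
    Schema beamSchema = AvroUtils.toBeamSchema(avroSchema);
    List<Schema.Field> fields = new ArrayList<>(beamSchema.getFields());
    fields.sort(Comparator.comparing(Schema.Field::getName));
    return Schema.builder().addFields(fields).build();
  }
}
```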

These issues are interesting, and they are probably a good place to get started 
contributing to Beam anyway. I will follow up.

Thanks,
Cameron

> On Oct 9, 2020, at 1:47 PM, Brian Hulette  wrote:
> 
> Hi Cameron,
> 
> Thanks for bringing this up on the dev list. I'm quite familiar with Beam 
> schemas, but I should be clear I'm not that familiar with Dataflow's pipeline 
> update. +Reuven Lax  may need to check me there.
> 
> > I am curious if it has been determined what makes a Schema the same as 
> > another schema. From what I have seen in the codebase, it changes.
> 
> You're right, schema equality means different things in different contexts, 
> and we should be more clear about this. As I understand it, for pipeline 
> update the important thing isn't so much whether the schemas are actually 
> equal, but whether data encoded with the old schema can be understood by a 
> SchemaCoder referencing the new schema, because it's probable that the new 
> SchemaCoder will receive data that was encoded with the old SchemaCoder. In 
> order to satisfy that requirement, the old and the new schemas must have the 
> same fields* in the same order.
> It might not seem like maintaining the ordering is an issue, but it is for 
> schemas inferred from Java types. That's because there's no guarantee about 
> the order in which we'll discover the fields or methods when using reflection 
> APIs. I believe Reuven did some experiments here and found that the ordering 
> is essentially random, so when we infer a schema from a Java type in two 
> different executions it can result in two completely different field orders.
> 
> There are a couple of things we definitely need to do on the Beam side to 
> support pipeline update for SchemaCoder with possibly out-of-order fields:
> - BEAM-10277: Java's RowCoder needs to respect the encoding_position field in 
> the schema proto. This provides a layer of indirection for field ordering 
> that runners can modify to "fix" schemas that have the same fields but out of 
> order.
> - Java's SchemaCoder needs to encode the schema in a portable way, so that 
> runners will be able to inspect and modify the schema proto as described 
> above. Currently SchemaCoder is still represented in the pipeline proto as a 
> serialized Java class, so runners can't easily inspect/modify it.
> 
> 
> 
> All that being said, it looks like you may not be using SchemaCoder with a 
> schema inferred from a Java type. Where is `outputSchema` coming from? Is it 
> possible to make sure it maintains a consistent field order?
> If you can do that, this may be an easier problem. I think then we could make 
> a change on the Dataflow side to ignore the schema's UUID when checking for 
> update compatibility.
> On the other hand, if you need to support pipeline update for schemas with 
> out-of-order fields, we'd need to address the above tasks first. If you're 
> willing to work on them I can help direct you, these are things I've been 
> hoping to work on but haven't been able to get to.
> 
> Brian
> 
> * Looking forward we don't actually want to require the schemas to have the 
> same fields, we could allow adding/removing fields with certain limitations.
> 
> On Thu, Oct 8, 2020 at 12:55 PM Cameron Morgan  > wrote:
> Hey everyone,
> 
> Summary: 
> 
> There is an issue with the Dataflow runner and the “Update” capability while 
> using the beam native Row type, which I imagine also blocks the snapshots 
> feature (as the docs say the snapshots have the same restrictions as the 
> Update feature) but I have no experience there.
> 
> Currently when reading from KafkaIO with the valueCoder set as a SchemaCoder:
> 
> ```
> KafkaIO.Read()
> .withTopic(topic)
> .withKeyDeserializer(ByteArrayDeserializer::class.java)
> .withValueDeserializerAndCoder([Deserializer], 
> SchemaCoder.of(outputSchema))
> ```
> 
> Updates fail consistently with the error:
> ```
> The original job has not been aborted., The Coder or type for step 
> ReadInputTopic/Read(KafkaUnboundedSource)/DataflowRunner.StreamingUnboundedRead.ReadWithIds
>  has changed
> ```
> 
> There is an open issue about this, 
> https://issues.apache.org/jira/browse/BEAM-9502 
>  but I have not seen it 
> discussed in the mailing list so I wanted to start it. 
> 
> Investigation so far: 
> 
> This failing on Beam 2.2

Re: [DISCUSS][BEAM-10670] Migrating BoundedSource/UnboundedSource to execute as a Splittable DoFn for non-portable Java runners

2020-10-14 Thread Luke Cwik
Thanks Alexey, that is correct.

On Wed, Oct 14, 2020 at 10:33 AM Alexey Romanenko 
wrote:

> Thanks Luke, I guess the proper link should be this one:
>
> https://docs.google.com/document/d/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE
>
> On 13 Oct 2020, at 00:23, Luke Cwik  wrote:
>
> I have a draft[1] off the blog ready. Please take a look.
>
> 1:
> http://doc/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE#heading=h.tbab2n97o3eo
>
> On Mon, Oct 5, 2020 at 4:28 PM Luke Cwik  wrote:
>
>>
>>
>> On Mon, Oct 5, 2020 at 3:45 PM Kenneth Knowles  wrote:
>>
>>>
>>>
>>> On Mon, Oct 5, 2020 at 2:44 PM Luke Cwik  wrote:
>>>
 For the 2.25 release the Java Direct, Flink, Jet, Samza, Twister2 will
 use SDF powered Read transforms. Users can opt-out
 with --experiments=use_deprecated_read.

>>>
>>> Huzzah! In our release notes maybe be clear about the expectations for
>>> users:
>>>
>>> Done in https://github.com/apache/beam/pull/13015
>>
>>
>>>  - semantics are expected to be the same: file bugs for any change in
>>> results
>>>  - perf may vary: file bugs or write to user@
>>>
>>> I was unable to get Spark done for 2.25 as I found out that Spark
 streaming doesn't support watermark holds[1]. If someone knows more about
 the watermark system in Spark I could use some guidance here as I believe I
 have a version of unbounded SDF support written for Spark (I get all the
 expected output from tests, just that watermarks aren't being held back so
 PAssert fails).

>>>
>>> Spark's watermarks are not comparable to Beam's. The rule as I
>>> understand it is that any data that is later than `max(seen timestamps) -
>>> allowedLateness` is dropped. One difference is that dropping is relative to
>>> the watermark instead of expiring windows, like early versions of Beam. The
>>> other difference is that it track the latest event (some call it a "high
>>> water mark" because it is the highest datetime value seen) where Beam's
>>> watermark is an approximation of the earliest (some call it a "low water
>>> mark" because it is a guarantee that it will not dip lower). When I chatted
>>> about this with Amit in the early days, it was necessary to implement a
>>> Beam-style watermark using Spark state. I think that may still be the case,
>>> for correct results.
>>>
>>>
>> In the Spark implementation I saw that watermark holds weren't wired at
>> all to control Sparks watermarks and this was causing triggers to fire too
>> early.
>>
>>
>>> Also, I started a doc[2] to produce an updated blog post since the
 original SplittableDoFn blog from 2017 is out of date[3]. I was thinking of
 making this a new blog post and having the old blog post point to it. We
 could also remove the old blog post and or update it. Any thoughts?

>>>
>>> New blog post w/ pointer from the old one.
>>>
>>> Finally, I have a clean-up PR[4] that pushes the Read -> primitive Read
 expansion into each of the runners instead of having it within Read
 transform within beam-sdks-java-core.

>>>
>>> Approved! I did CC a bunch of runner authors already. I think the
>>> important thing is if a default changes we should be sure everyone is OK
>>> with the perf changes, and everyone is confident that no incorrect results
>>> are produced. The abstractions between sdk-core, runners-core-*, and
>>> individual runners is important to me:
>>>
>>>  - The SDK's job is to produce a portable, un-tweaked pipeline so moving
>>> flags out of SDK core (and IOs) ASAP is super important.
>>>  - The runner's job is to execute that pipeline, if they can, however
>>> they want. If a runner wants to run Read transforms differently/directly
>>> that is fine. If a runner is incapable of supporting SDF, then Read is
>>> better than nothing. Etc.
>>>  - The runners-core-* job is to just be internal libraries for runner
>>> authors to share code, and should not make any decisions about the Beam
>>> model, etc.
>>>
>>> Kenn
>>>
>>> 1: https://github.com/apache/beam/pull/12603
 2: http://doc/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE
 3: https://beam.apache.org/blog/splittable-do-fn/
 4: https://github.com/apache/beam/pull/13006


 On Fri, Aug 28, 2020 at 1:45 AM Maximilian Michels 
 wrote:

> Thanks Luke! I've had a pass.
>
> -Max
>
> On 28.08.20 01:22, Luke Cwik wrote:
> > As an update.
> >
> > Direct and Twister2 are done.
> > Samza: is ready for review[1].
> > Flink: is almost ready for review. [2] lays all the groundwork for
> the
> > migration and [3] finishes the migration (there is a timeout
> happening
> > in FlinkSubmissionTest that I'm trying to figure out).
> > No further updates on Spark[4] or Jet[5].
> >
> > @Maximilian Michels  or @t...@apache.org
> > , can either of you take a look at
> the
> > Flink PRs?
> > @ke.wu...@icloud.com 

Re: [DISCUSS][BEAM-10670] Migrating BoundedSource/UnboundedSource to execute as a Splittable DoFn for non-portable Java runners

2020-10-14 Thread Alexey Romanenko
Thanks Luke, I guess the proper link should be this one:
https://docs.google.com/document/d/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE

> On 13 Oct 2020, at 00:23, Luke Cwik  wrote:
> 
> I have a draft[1] off the blog ready. Please take a look.
> 
> 1: 
> http://doc/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE#heading=h.tbab2n97o3eo
>  
> 
> On Mon, Oct 5, 2020 at 4:28 PM Luke Cwik  > wrote:
> 
> 
> On Mon, Oct 5, 2020 at 3:45 PM Kenneth Knowles  > wrote:
> 
> 
> On Mon, Oct 5, 2020 at 2:44 PM Luke Cwik  > wrote:
> For the 2.25 release the Java Direct, Flink, Jet, Samza, Twister2 will use 
> SDF powered Read transforms. Users can opt-out with 
> --experiments=use_deprecated_read.
> 
> Huzzah! In our release notes maybe be clear about the expectations for users:
> 
> Done in https://github.com/apache/beam/pull/13015 
> 
>  
>  - semantics are expected to be the same: file bugs for any change in results
>  - perf may vary: file bugs or write to user@
> 
> I was unable to get Spark done for 2.25 as I found out that Spark streaming 
> doesn't support watermark holds[1]. If someone knows more about the watermark 
> system in Spark I could use some guidance here as I believe I have a version 
> of unbounded SDF support written for Spark (I get all the expected output 
> from tests, just that watermarks aren't being held back so PAssert fails).
> 
> Spark's watermarks are not comparable to Beam's. The rule as I understand it 
> is that any data that is later than `max(seen timestamps) - allowedLateness` 
> is dropped. One difference is that dropping is relative to the watermark 
> instead of expiring windows, like early versions of Beam. The other 
> difference is that it track the latest event (some call it a "high water 
> mark" because it is the highest datetime value seen) where Beam's watermark 
> is an approximation of the earliest (some call it a "low water mark" because 
> it is a guarantee that it will not dip lower). When I chatted about this with 
> Amit in the early days, it was necessary to implement a Beam-style watermark 
> using Spark state. I think that may still be the case, for correct results.
> 
> 
> In the Spark implementation I saw that watermark holds weren't wired at all 
> to control Sparks watermarks and this was causing triggers to fire too early.
>  
> Also, I started a doc[2] to produce an updated blog post since the original 
> SplittableDoFn blog from 2017 is out of date[3]. I was thinking of making 
> this a new blog post and having the old blog post point to it. We could also 
> remove the old blog post and or update it. Any thoughts?
> 
> New blog post w/ pointer from the old one.
> 
> Finally, I have a clean-up PR[4] that pushes the Read -> primitive Read 
> expansion into each of the runners instead of having it within Read transform 
> within beam-sdks-java-core.
> 
> Approved! I did CC a bunch of runner authors already. I think the important 
> thing is if a default changes we should be sure everyone is OK with the perf 
> changes, and everyone is confident that no incorrect results are produced. 
> The abstractions between sdk-core, runners-core-*, and individual runners is 
> important to me:
> 
>  - The SDK's job is to produce a portable, un-tweaked pipeline so moving 
> flags out of SDK core (and IOs) ASAP is super important.
>  - The runner's job is to execute that pipeline, if they can, however they 
> want. If a runner wants to run Read transforms differently/directly that is 
> fine. If a runner is incapable of supporting SDF, then Read is better than 
> nothing. Etc.
>  - The runners-core-* job is to just be internal libraries for runner authors 
> to share code, and should not make any decisions about the Beam model, etc.
> 
> Kenn
> 
> 1: https://github.com/apache/beam/pull/12603 
> 
> 2: http://doc/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE 
> 
> 3: https://beam.apache.org/blog/splittable-do-fn/ 
> 
> 4: https://github.com/apache/beam/pull/13006 
> 
> 
> 
> On Fri, Aug 28, 2020 at 1:45 AM Maximilian Michels  > wrote:
> Thanks Luke! I've had a pass.
> 
> -Max
> 
> On 28.08.20 01:22, Luke Cwik wrote:
> > As an update.
> > 
> > Direct and Twister2 are done.
> > Samza: is ready for review[1].
> > Flink: is almost ready for review. [2] lays all the groundwork for the 
> > migration and [3] finishes the migration (there is a timeout happening 
> > in FlinkSubmissionTest that I'm trying to figure out).
> > No further updates on Spark[4] or Jet[5].
> > 
> > @Maximilian Michels > or 
> > @t...@apache.org 

Re: [DISCUSS] Move Avro dependency out of core Beam

2020-10-14 Thread Brian Hulette
It sounds like there's a consensus around making Beam core work with either
Avro 1.8 or 1.9. It looks like +Tomo Suzuki  actually
accomplished this already back in January for BEAM-9144 [1] and a user has
run pipelines on Dataflow with Avro 1.9.

Would we need to do anything else to unblock using Confluent Schema
Registry 5.5 (BEAM-9330 [2])?

[1] https://issues.apache.org/jira/browse/BEAM-9144
[2] https://issues.apache.org/jira/browse/BEAM-9330

On Tue, Oct 13, 2020 at 8:34 AM Piotr Szuberski 
wrote:

> We are working on IOs dependencies, most of them are pending in PRs right
> now.
>
> Any further thoughts/decisions on moving Avro out of Beam core?
>
> On 2020/09/21 14:56:16, Cristian Constantinescu  wrote:
> > All the proposed solutions seem reasonable. I'm not sure if one has an
> edge
> > over the other. I guess it depends on how cautiously the community would
> > like to move.
> >
> > Maybe it's just my impression, but it seems to me that there are a few
> > changes that are held back for the sake of backwards compatibility. If
> this
> > is truly the case, maybe we can start thinking about the next major
> version
> > of Beam where:
> > - the dependency report email (Beam Dependency Check Report (2020-09-21))
> > can be tackled. There are quite a few deps in there that need to be
> > updated. I'm sure there will be more breaking changes.
>


Re: [Proposal] Website Revamp project

2020-10-14 Thread Gris Cuevas
Hi Everyone, 

We're ready to start work on the revamp of the website; we'll use the PRD 
shared previously in this thread. 

Polidea will be the team working on this revamp and we'll be bringing designs 
and proposals to the community for review as the project progresses. 

Thank you! 
Gris

On 2020/09/09 18:14:18, Gris Cuevas  wrote: 
> Hi Beam community, 
> 
> In a previous thread [1] I mentioned that I was going to work on a product 
> requirements document (PRD) for a project to address some of the requests and 
> ideas collected by Aizhamal, Rose and David in previous efforts. 
> 
> The PRD is ready [2], and I'd like to get your feedback before moving forward 
> into implementation. Please add your comments by Sunday, September 13, 2020. 
> 
> We're looking to start work on this project in around 2 weeks. 
> 
> Thank you! 
> Gris
> 
> [1] 
> https://lists.apache.org/thread.html/r1a4cea1e8b53bef73c49f75df13956024d8d78bc81b36e54ef71f8a9%40%3Cdev.beam.apache.org%3E
> 
> [2] https://s.apache.org/beam-site-revamp
> 


Re: Self-checkpoint Support on Portable Flink

2020-10-14 Thread Maximilian Michels
Duplicates cannot happen because the state of all operators will be 
rolled back to the latest checkpoint, in case of failures.


On 14.10.20 06:31, Reuven Lax wrote:
Does this mean that we have to deal with duplicate messages over the 
back edge? Or will that not happen, since duplicates mean that we rolled 
back a checkpoint?


On Tue, Oct 13, 2020 at 2:59 AM Maximilian Michels  wrote:


There would be ways around the lack of checkpointing in cycles, e.g.
buffer and backloop only after checkpointing is complete, similar to how
we implement @RequiresStableInput in the Flink Runner.

-Max
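
For context, here is a minimal, illustrative sketch of the @RequiresStableInput
mechanism Max refers to. The DoFn and its names are hypothetical, and this is
not the Flink runner's internal implementation; it only shows what a user-facing
DoFn that requests stable (checkpointed) input looks like.

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Illustrative only: annotating @ProcessElement with @RequiresStableInput asks
// the runner to delay processing until the input is stable (e.g. checkpointed),
// so side effects are not replayed with different data after a failure. As
// described above, the Flink runner achieves this by buffering elements until
// checkpointing is complete.
class StableSideEffectFn extends DoFn<KV<String, String>, Void> {
  @RequiresStableInput
  @ProcessElement
  public void processElement(@Element KV<String, String> element) {
    // Perform the side effect (e.g. an external write) here; by this point the
    // element has been checkpointed and will be replayed identically on failure.
  }
}
```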

On 07.10.20 04:05, Reuven Lax wrote:
 > It appears that there's a proposal
 > (https://cwiki.apache.org/confluence/display/FLINK/FLIP-16%3A+Loop+Fault+Tolerance)
 > and an abandoned PR to fix this, but AFAICT this remains a limitation of
 > Flink. If Flink can't guarantee processing of records on back edges, I
 > don't think we can use cycles, as we might otherwise lose the residuals.
 >
 > On Tue, Oct 6, 2020 at 6:16 PM Reuven Lax  wrote:
 >
 >     This is what I was thinking of
 >
 >     "Flink currently only provides processing guarantees for jobs
 >     without iterations. Enabling checkpointing on an iterative job
 >     causes an exception. In order to force checkpointing on an iterative
 >     program the user needs to set a special flag when enabling
 >     checkpointing: env.enableCheckpointing(interval,
 >     CheckpointingMode.EXACTLY_ONCE, force = true).
 >
 >     Please note that records in flight in the loop edges (and the state
 >     changes associated with them) will be lost during failure."
 >
 >     On Tue, Oct 6, 2020 at 5:44 PM Boyuan Zhang  wrote:
 >
 >         Hi Reuven,
 >
 >         As Luke mentioned, at least there are some limitations around
 >         tracking watermark with flink cycles. I'm going to use State +
 >         Timer without flink cycle to support self-checkpoint. For
 >         dynamic split, we can either explore flink cycle approach or
 >         limit depth approach.
 >
 >         On Tue, Oct 6, 2020 at 5:33 PM Reuven Lax  wrote:
 >
 >             Aren't there some limitations associated with flink cycles?
 >             I seem to remember various features that could not be used.
 >             I'm assuming that watermarks are not supported across
 >             cycles, but is there anything else?
 >
 >             On Tue, Oct 6, 2020 at 7:12 AM Maximilian Michels  wrote:
 >
 >                 Thanks for starting the conversation. The two approaches
 >                 both look good to me. Probably we want to start with
 >                 approach #1 for all Runners to be able to support
 >                 delaying bundles. Flink supports cycles and thus
 >                 approach #2 would also be applicable and could be used
 >                 to implement dynamic splitting.
 >
 >                 -Max
 >
 >                 On 05.10.20 23:13, Luke Cwik wrote:
 >                  > Thanks Boyuan, I left a few comments.
 >                  >
 >                  > On Mon, Oct 5, 2020 at 11:12 AM Boyuan Zhang  wrote:
 >                  >
 >                  >     Hi team,
 >                  >
 >                  >     I'm looking at adding self-checkpoint support to
 >                  >     the portable Flink runner (BEAM-10940) for both