date:20190828

Re: [VOTE] Vendored Dependencies Release

2019-08-28 Thread Andrew Pilloud

You need to close the release for it to be published to the staging server.
I can help if you still have questions.

Andrew

On Wed, Aug 28, 2019, 8:48 PM Rui Wang  wrote:

> I can see that orgapachebeam-1083 is in open status in the staging repository.
> I am not sure why it is not publicly exposed. I probably need some guidance on it.
>
>
> -Rui
>
> On Wed, Aug 28, 2019 at 3:50 PM Kai Jiang  wrote:
>
>> Hi Rui,
>>
>> Regarding access to the artifacts [1] for the Maven Central Repository: is it
>> intentional that they are not publicly exposed?
>>
>> Best,
>> Kai
>>
>> [1]
>> https://repository.apache.org/content/repositories/orgapachebeam-1083/
>>
>> On Wed, Aug 28, 2019 at 11:57 AM Kai Jiang  wrote:
>>
>>> +1 (non-binding). Thanks Rui!
>>>
>>> On Tue, Aug 27, 2019 at 10:46 PM Rui Wang  wrote:
>>>
>>>> Please review the release of the following artifacts that we vendor:
>>>>
>>>>  * beam-vendor-calcite-1_20_0
>>>>
>>>> Hi everyone,
>>>>
>>>> Please review and vote on the release candidate #1 for the
>>>> org.apache.beam:beam-vendor-calcite-1_20_0:0.1, as follows:
>>>>
>>>> [ ] +1, Approve the release
>>>>
>>>> [ ] -1, Do not approve the release (please provide specific comments)
>>>>
>>>>
>>>> The complete staging area is available for your review, which includes:
>>>>
>>>> * the official Apache source release to be deployed to dist.apache.org
>>>> [1], which is signed with the key with fingerprint
>>>> 0D7BE1A252DBCEE89F6491BBDFA64862B703F5C8 [2],
>>>>
>>>> * all artifacts to be deployed to the Maven Central Repository [3],
>>>>
>>>> * commit hash "664e25019fc1977e7041e4b834e8d9628b912473" [4],
>>>>
>>>> The vote will be open for at least 72 hours. It is adopted by majority
>>>> approval, with at least 3 PMC affirmative votes.
>>>>
>>>> Thanks,
>>>>
>>>> Rui
>>>>
>>>> [1] https://dist.apache.org/repos/dist/dev/beam/vendor/calcite/1_20_0
>>>>
>>>> [2] https://dist.apache.org/repos/dist/release/beam/KEYS
>>>>
>>>> [3]
>>>> https://repository.apache.org/content/repositories/orgapachebeam-1083/
>>>>
>>>> [4]
>>>> https://github.com/apache/beam/commit/664e25019fc1977e7041e4b834e8d9628b912473
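
The adoption rule quoted in the vote email (majority approval, with at least 3 PMC affirmative votes) can be sketched as a small check. This is only an illustration: the Vote type and the vote_passes function are hypothetical names, and the sketch assumes that only binding (PMC) votes count toward the majority, so a non-binding +1 is recorded but does not change the outcome.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    voter: str     # hypothetical identifier, for illustration only
    value: int     # +1 to approve, -1 to reject
    binding: bool  # True when the vote is cast by a PMC member


def vote_passes(votes, min_binding_plus_ones=3):
    """Return True if the binding votes satisfy majority approval.

    Assumption: only binding votes are counted, there must be at least
    `min_binding_plus_ones` binding +1 votes, and binding +1 votes must
    outnumber binding -1 votes.
    """
    binding = [v for v in votes if v.binding]
    plus = sum(1 for v in binding if v.value > 0)
    minus = sum(1 for v in binding if v.value < 0)
    return plus >= min_binding_plus_ones and plus > minus


# A single non-binding +1 is not enough to adopt the release.
only_non_binding = [Vote("contributor", +1, binding=False)]

# Three binding +1 votes and no -1 votes satisfy the rule.
three_binding = only_non_binding + [
    Vote("pmc-a", +1, binding=True),
    Vote("pmc-b", +1, binding=True),
    Vote("pmc-c", +1, binding=True),
]
```

Under these assumptions, vote_passes(only_non_binding) is False and vote_passes(three_binding) is True. The 72-hour minimum voting window is not modelled here.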

Re: Is it too late to switch to Java 8 time for the schema aware Row and Beam SQL?

2019-08-28 Thread Rui Wang

We need to figure out a way to make sure we can read the data without
losing precision. It will likely be case by case.


-Rui
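
The precision loss discussed in this thread can be illustrated without any Beam or Joda dependency. The sketch below models a timestamp as a (seconds, nanos) pair, similar in shape to a Protobuf timestamp, and round-trips it through epoch milliseconds, which is the resolution mentioned for Joda and Beam Calcite. The function names and the sample value are hypothetical and chosen only for illustration.

```python
NANOS_PER_MILLI = 1_000_000


def to_epoch_millis(seconds, nanos):
    """Collapse a (seconds, nanos) timestamp to whole milliseconds."""
    return seconds * 1000 + nanos // NANOS_PER_MILLI


def from_epoch_millis(millis):
    """Expand epoch milliseconds back into a (seconds, nanos) pair."""
    seconds, remainder_ms = divmod(millis, 1000)
    return seconds, remainder_ms * NANOS_PER_MILLI


# A timestamp with nanosecond detail below the millisecond boundary.
original = (1566950400, 123_456_789)

# Round trip through a millisecond-only representation.
round_tripped = from_epoch_millis(to_epoch_millis(*original))

# Everything below one millisecond is gone after the round trip.
lost_nanos = original[1] - round_tripped[1]
```

Here round_tripped is (1566950400, 123000000) and lost_nanos is 456789, which is the sub-millisecond part that a higher-precision logical type would need to carry.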

On Wed, Aug 28, 2019 at 2:27 AM Alex Van Boxel  wrote:

> Thanks. How will ZetaSQL support higher precision, given that the input in
> general will be an Instant anyway? Will it rely on the "pending" standardized
> logical types?
>
>  _/
> _/ Alex Van Boxel
>
>
> On Mon, Aug 19, 2019 at 7:02 AM Rui Wang  wrote:
>
>> However, more challenges come from:
>>
>> 1. How to read data without losing precision. The Beam Java SDK already uses
>> Joda, so it's very likely that you will need to update IO somehow to support
>> higher precision.
>> 2. How to process higher precision in BeamSQL. It means SQL functions
>> should support higher precision. If you use Beam Calcite, unfortunately it
>> will only support up to millis. If you use Beam ZetaSQL (under review),
>> there are opportunities to support higher precision for SQL functions.
>>
>>
>> -Rui
>>
>> On Sun, Aug 18, 2019 at 9:52 PM Rui Wang  wrote:
>>
>>> We have been discussing it for a long time. I think if you only want to
>>> support more precision (e.g. up to nanosecond) for BeamSQL, it's actually
>>> relatively straightforward to support it by using a logical type for
>>> BeamSQL.
>>>
>>>
>>> -Rui
>>>
>>> On Sat, Aug 17, 2019 at 7:21 AM Alex Van Boxel  wrote:
>>>
>>>> I know it's probably futile, but the more I work on features that
>>>> are related to schema awareness, the more frustrated I get about the lack
>>>> of precision of the Joda instance.
>>>>
>>>> As soon as we have a conversion to DateTime I need to drop
>>>> precision. This happens with the Protobuf timestamp (nanoseconds), but I
>>>> also notice it with BigQuery (microseconds).
>>>>
>>>> Suggestions?
>>>>
>>>>  _/
>>>> _/ Alex Van Boxel
>>>>
>>>

Re: [VOTE] Vendored Dependencies Release

2019-08-28 Thread Rui Wang

I can see that orgapachebeam-1083 is in open status in the staging repository.
I am not sure why it is not publicly exposed. I probably need some guidance on it.


-Rui

On Wed, Aug 28, 2019 at 3:50 PM Kai Jiang  wrote:

> Hi Rui,
>
> Regarding access to the artifacts [1] for the Maven Central Repository: is it
> intentional that they are not publicly exposed?
>
> Best,
> Kai
>
> [1] https://repository.apache.org/content/repositories/orgapachebeam-1083/
>
> On Wed, Aug 28, 2019 at 11:57 AM Kai Jiang  wrote:
>
>> +1 (non-binding). Thanks Rui!
>>
>> On Tue, Aug 27, 2019 at 10:46 PM Rui Wang  wrote:
>>
>>> Please review the release of the following artifacts that we vendor:
>>>
>>>  * beam-vendor-calcite-1_20_0
>>>
>>> Hi everyone,
>>>
>>> Please review and vote on the release candidate #1 for the
>>> org.apache.beam:beam-vendor-calcite-1_20_0:0.1, as follows:
>>>
>>> [ ] +1, Approve the release
>>>
>>> [ ] -1, Do not approve the release (please provide specific comments)
>>>
>>>
>>> The complete staging area is available for your review, which includes:
>>>
>>> * the official Apache source release to be deployed to dist.apache.org
>>> [1], which is signed with the key with fingerprint
>>> 0D7BE1A252DBCEE89F6491BBDFA64862B703F5C8 [2],
>>>
>>> * all artifacts to be deployed to the Maven Central Repository [3],
>>>
>>> * commit hash "664e25019fc1977e7041e4b834e8d9628b912473" [4],
>>>
>>> The vote will be open for at least 72 hours. It is adopted by majority
>>> approval, with at least 3 PMC affirmative votes.
>>>
>>> Thanks,
>>>
>>> Rui
>>>
>>> [1] https://dist.apache.org/repos/dist/dev/beam/vendor/calcite/1_20_0
>>>
>>> [2] https://dist.apache.org/repos/dist/release/beam/KEYS
>>>
>>> [3]
>>> https://repository.apache.org/content/repositories/orgapachebeam-1083/
>>>
>>> [4]
>>> https://github.com/apache/beam/commit/664e25019fc1977e7041e4b834e8d9628b912473
>>>
>>>

Re: Improve container support

2019-08-28 Thread Sam Bourne

>
> If docker hub is the de facto place let's use that. bintray, or gcr.io (with
> a new GCP project) also sound like good options. I am impartial to the
> choice of the service. Does anyone have a strong preference here?


Not a strong preference, but it would simplify stuff for us if you choose
dockerhub to limit the number of places we need access to behind our
firewall. Also, the "Official Docker Images" of Flink are there
https://hub.docker.com/_/flink.

On Thu, Aug 29, 2019 at 6:45 AM Hannah Jiang  wrote:

> Thanks for letting me know about the access issue and sharing the solution.
> I created a new one with a gmail.com account.
> Please let me know if you still see problems.
>
> On Wed, Aug 28, 2019 at 6:08 AM Lukasz Cwik  wrote:
>
>> Google locks down docs created with @google.com addresses. Hannah, please
>> recreate the doc using a non @google.com address and share it with the
>> community. You'll want to replace the Google short link with an Apache short
>> link (s.apache.org).
>>
>> On Wed, Aug 28, 2019 at 5:40 AM Gleb Kanterov  wrote:
>>
>>> Google Doc doesn't seem to be shared with dev@. Can anybody
>>> double-check?
>>>
>>> On Wed, Aug 28, 2019 at 7:36 AM Hannah Jiang 
>>> wrote:
>>>
 add dev@

 On Tue, Aug 27, 2019 at 9:29 PM Hannah Jiang 
 wrote:

> Thanks for the comments and discussion.
> I created a Google Doc for easy commenting and reviewing. From this
> moment, all changes will be made in the Google Doc, and I will sync them
> to the wiki after finalizing all plans.
>
> Thanks,
> Hannah
>
> On Tue, Aug 27, 2019 at 9:24 PM Ahmet Altay  wrote:
>
>> Hi datapls-engprod,
>>
>> I have a question. Do you know what would it take to create a new gcp
>> project similar to apache-beam-testing for purposes of distributing gcr
>> packages? We can use the same billing account.
>>
>> Hannah, Robert, depending on the complexity of creating another gcp
>> project we can go with that, or simply create a new bintray account.
>> Either way would give us a clean new project to publish artifacts.
>>
>> Ahmet
>>
>> -- Forwarded message -
>> From: Robert Bradshaw 
>> Date: Tue, Aug 27, 2019 at 6:48 PM
>> Subject: Re: Improve container support
>> To: dev 
>>
>>
>> On Tue, Aug 27, 2019 at 6:20 PM Ahmet Altay  wrote:
>> >
>> > On Tue, Aug 27, 2019 at 5:50 PM Robert Bradshaw <
>> rober...@google.com> wrote:
>> >>
>> >> On Tue, Aug 27, 2019 at 3:35 PM Hannah Jiang <
>> hannahji...@google.com> wrote:
>> >> >
>> >> > Hi team
>> >> >
>> >> > I am working on improving docker container support for Beam. We
>> >> > would like to publish prebuilt containers for each release version and
>> >> > daily snapshot. Current work focuses on release images only and it would
>> >> > be part of the release process.
>> >>
>> >> This would be great!
>> >>
>> >> > The release images will be pushed to GCR which is publicly
>> >> > accessible (pullable). We will use the following locations.
>> >> > Repository: gcr.io/beam
>> >> > Project: apache-beam-testing
>> >>
>> >> Given that these are release artifacts, we should use a project with
>> >> more restricted access than "anyone who opens a PR on github."
>> >
>> >
>> > We have two options:
>> > - gcr.io works based on the permissions of the GCS bucket that is
>> > backing it. GCS supports bucket-only permissions. These permissions need
>> > to be explicitly granted, and the service accounts used by Jenkins jobs
>> > do not have these explicit permissions today.
>> > - we can create a new project in gcr, bintray or anything else that
>> > offers the same service.
>>
>> I think the cleanest is to simply have a new project whose membership
>> consists of (interested) PMC members. If we have to populate this
>> manually I think that'd still be OK as the churn is quite low.
>>
>
>>>
>>> --
>>> Cheers,
>>> Gleb
>>>
>>

Re: [VOTE] Vendored Dependencies Release

2019-08-28 Thread Kai Jiang

Hi Rui,

Regarding access to the artifacts [1] for the Maven Central Repository: is it
intentional that they are not publicly exposed?

Best,
Kai

[1] https://repository.apache.org/content/repositories/orgapachebeam-1083/

On Wed, Aug 28, 2019 at 11:57 AM Kai Jiang  wrote:

> +1 (non-binding). Thanks Rui!
>
> On Tue, Aug 27, 2019 at 10:46 PM Rui Wang  wrote:
>
>> Please review the release of the following artifacts that we vendor:
>>
>>  * beam-vendor-calcite-1_20_0
>>
>> Hi everyone,
>>
>> Please review and vote on the release candidate #1 for the
>> org.apache.beam:beam-vendor-calcite-1_20_0:0.1, as follows:
>>
>> [ ] +1, Approve the release
>>
>> [ ] -1, Do not approve the release (please provide specific comments)
>>
>>
>> The complete staging area is available for your review, which includes:
>>
>> * the official Apache source release to be deployed to dist.apache.org
>> [1], which is signed with the key with fingerprint
>> 0D7BE1A252DBCEE89F6491BBDFA64862B703F5C8 [2],
>>
>> * all artifacts to be deployed to the Maven Central Repository [3],
>>
>> * commit hash "664e25019fc1977e7041e4b834e8d9628b912473" [4],
>>
>> The vote will be open for at least 72 hours. It is adopted by majority
>> approval, with at least 3 PMC affirmative votes.
>>
>> Thanks,
>>
>> Rui
>>
>> [1] https://dist.apache.org/repos/dist/dev/beam/vendor/calcite/1_20_0
>>
>> [2] https://dist.apache.org/repos/dist/release/beam/KEYS
>>
>> [3]
>> https://repository.apache.org/content/repositories/orgapachebeam-1083/
>>
>> [4]
>> https://github.com/apache/beam/commit/664e25019fc1977e7041e4b834e8d9628b912473
>>
>>

[PROPOSAL] Preparing for Beam 2.16.0 release

2019-08-28 Thread Mark Liu

Hi all,

The Beam 2.16 release branch cut is scheduled for Sep 11 according to the
release calendar [1]. I would like to volunteer to do this release.
The plan is to cut the branch on that date, and cherry-pick release-blocking
fixes afterwards if any.

If you have release blocking issues for 2.16 please mark their "Fix
Version" as 2.16.0 [2]. This tag is already created in JIRA in case you
would like to move any non-blocking issues to that version.

Any thoughts, comments, objections?

Regards.
Mark Liu

[1]
https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
[2]
https://issues.apache.org/jira/browse/BEAM-8105?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Reopened%2C%20Open%2C%20%22In%20Progress%22%2C%20%22Under%20Discussion%22%2C%20%22In%20Implementation%22%2C%20%22Triage%20Needed%22)%20AND%20fixVersion%20%3D%202.16.0
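
The JIRA link [2] carries a URL-encoded JQL filter, which is hard to read in its raw form. As a small sketch using only the Python standard library, the filter can be decoded to see exactly which issues it selects. The variable names are illustrative.

```python
from urllib.parse import parse_qs, urlsplit

# The link from [2], copied verbatim from the message above.
JIRA_URL = (
    "https://issues.apache.org/jira/browse/BEAM-8105?jql="
    "project%20%3D%20BEAM%20AND%20status%20in%20(Reopened%2C%20Open%2C%20"
    "%22In%20Progress%22%2C%20%22Under%20Discussion%22%2C%20"
    "%22In%20Implementation%22%2C%20%22Triage%20Needed%22)%20AND%20"
    "fixVersion%20%3D%202.16.0"
)

# parse_qs percent-decodes the query string and returns a list per key.
jql = parse_qs(urlsplit(JIRA_URL).query)["jql"][0]
```

The decoded filter selects issues in project BEAM that are still in an unresolved status and have fixVersion set to 2.16.0.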

Re: RowWithGetters, FieldValueGetter being Serializable?

2019-08-28 Thread Reuven Lax

My first thought is that Row shouldn't need to be serializable at all.
However it's possible that we have a dependency in some code on it being
serializable. Worth testing it out to see what breaks.

On Wed, Aug 28, 2019 at 2:55 AM Alex Van Boxel  wrote:

> Hi,
>
> I noticed that RowWithGetters and FieldValueGetter are both Serializable
> (all in package org.apache.beam.sdk.values). I doubt that they
> should be.
>
> Certainly RowWithGetters would be problematic:
>
>    - it references the underlying object, which could be anything and is *not
>    guaranteed* to be Serializable. In my case I'm referring to a Protobuf
>    message that is not.
>    - the FieldValueGetter should also not be Serializable, as the getters are
>    generated by the factory. I'm implementing getters that also need
>    FieldDescriptors to access the underlying dynamic Protobuf fields, and
>    FieldDescriptors are also not serializable.
>
> The *only* class that should be serializable is RowWithStorage as the
> current implementation will convert any type of Row to this as soon as a
> serialization step needs to happen.
>
> Thoughts? If you all agree, I'll create a ticket and fix this, as it is
> somewhat blocking my Protobuf implementation (it won't pass SpotBugs, which
> complains about non-serializable fields in FieldValueGetter).
>
>  _/
> _/ Alex Van Boxel
>
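
The idea described above, that a getter-backed row is converted to a storage-backed row as soon as serialization is needed, can be sketched in Python with pickle. This is an analogy only, not Beam's Java implementation: the class names are borrowed from the thread, while Message is a hypothetical stand-in for a non-serializable Protobuf message.

```python
import pickle


class RowWithStorage:
    """Row that owns plain field values and is safe to serialize."""

    def __init__(self, values):
        self.values = list(values)

    def get(self, index):
        return self.values[index]


class RowWithGetters:
    """Row that reads fields lazily from a wrapped object through getters."""

    def __init__(self, target, getters):
        self._target = target    # may not be serializable
        self._getters = getters  # lambdas here, which pickle cannot handle

    def get(self, index):
        return self._getters[index](self._target)

    def __reduce__(self):
        # On serialization, materialize the values and hand pickle a
        # RowWithStorage instead of the target object and its getters.
        values = [getter(self._target) for getter in self._getters]
        return (RowWithStorage, (values,))


class Message:
    """Hypothetical stand-in for a message type that refuses serialization."""

    def __init__(self, name, count):
        self.name = name
        self.count = count

    def __reduce__(self):
        raise TypeError("Message is not serializable")


row = RowWithGetters(Message("a", 3), [lambda m: m.name, lambda m: m.count])

# The round trip succeeds because only the extracted values are pickled.
restored = pickle.loads(pickle.dumps(row))
```

After the round trip, restored is a RowWithStorage holding the values "a" and 3, and neither the wrapped Message nor the getters were ever serialized. That is the sense in which only the storage-backed class needs to be serializable.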

Re: [VOTE] Vendored Dependencies Release

2019-08-28 Thread Kai Jiang

+1 (non-binding). Thanks Rui!

On Tue, Aug 27, 2019 at 10:46 PM Rui Wang  wrote:

> Please review the release of the following artifacts that we vendor:
>
>  * beam-vendor-calcite-1_20_0
>
> Hi everyone,
>
> Please review and vote on the release candidate #1 for the
> org.apache.beam:beam-vendor-calcite-1_20_0:0.1, as follows:
>
> [ ] +1, Approve the release
>
> [ ] -1, Do not approve the release (please provide specific comments)
>
>
> The complete staging area is available for your review, which includes:
>
> * the official Apache source release to be deployed to dist.apache.org
> [1], which is signed with the key with fingerprint
> 0D7BE1A252DBCEE89F6491BBDFA64862B703F5C8 [2],
>
> * all artifacts to be deployed to the Maven Central Repository [3],
>
> * commit hash "664e25019fc1977e7041e4b834e8d9628b912473" [4],
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
> Thanks,
>
> Rui
>
> [1] https://dist.apache.org/repos/dist/dev/beam/vendor/calcite/1_20_0
>
> [2] https://dist.apache.org/repos/dist/release/beam/KEYS
>
> [3] https://repository.apache.org/content/repositories/orgapachebeam-1083/
>
> [4]
> https://github.com/apache/beam/commit/664e25019fc1977e7041e4b834e8d9628b912473
>
>

Re: Improve container support

2019-08-28 Thread Ahmet Altay

+1 to Pablo's comment. I believe any place that offers this service would
be fine and we would need to figure out how to set it up correctly. Our
pypi model seems to work fine so far (PMC members are owners, release
managers are added as maintainers as needed.). To me, the important part is
that we make progress so that we can add the containers to the release
process. We postponed this decision in the past 2 releases because the
conversation started late along with the release branch cut emails.

If docker hub is the de facto place let's use that. bintray, or gcr.io (with
a new GCP project) also sound like good options. I am impartial to the
choice of the service. Does anyone have a strong preference here?

Ahmet

On Wed, Aug 28, 2019 at 8:25 AM Pablo Estrada  wrote:

> A question about the repository we're using. It seems that Docker Hub is
> the de-facto repo for docker images? Is GCR pretty much the same in terms
> of access, authentication, etc?
> We don't have to figure out the repository immediately, and it's fine to
> iterate - but I wanted to make sure we think about that : )
>
> On Wed, Aug 28, 2019 at 6:08 AM Lukasz Cwik  wrote:
>
>> Google locks down docs created with @google.com addresses. Hannah, please
>> recreate the doc using a non @google.com address and share it with the
>> community. You'll want to replace the Google short link with an Apache short
>> link (s.apache.org).
>>
>> On Wed, Aug 28, 2019 at 5:40 AM Gleb Kanterov  wrote:
>>
>>> Google Doc doesn't seem to be shared with dev@. Can anybody
>>> double-check?
>>>
>>> On Wed, Aug 28, 2019 at 7:36 AM Hannah Jiang 
>>> wrote:
>>>
 add dev@

 On Tue, Aug 27, 2019 at 9:29 PM Hannah Jiang 
 wrote:

> Thanks for the comments and discussion.
> I created a Google Doc for easy commenting and reviewing. From this
> moment, all changes will be made in the Google Doc, and I will sync them
> to the wiki after finalizing all plans.
>
> Thanks,
> Hannah
>
> On Tue, Aug 27, 2019 at 9:24 PM Ahmet Altay  wrote:
>
>> Hi datapls-engprod,
>>
>> I have a question. Do you know what would it take to create a new gcp
>> project similar to apache-beam-testing for purposes of distributing gcr
>> packages? We can use the same billing account.
>>
>> Hannah, Robert, depending on the complexity of creating another gcp
>> project we can go with that, or simply create a new bintray account. 
>> Either
>> way would give us a clean new project to publish artifacts.
>>
>> Ahmet
>>
>> -- Forwarded message -
>> From: Robert Bradshaw 
>> Date: Tue, Aug 27, 2019 at 6:48 PM
>> Subject: Re: Improve container support
>> To: dev 
>>
>>
>> On Tue, Aug 27, 2019 at 6:20 PM Ahmet Altay  wrote:
>> >
>> > On Tue, Aug 27, 2019 at 5:50 PM Robert Bradshaw <
>> rober...@google.com> wrote:
>> >>
>> >> On Tue, Aug 27, 2019 at 3:35 PM Hannah Jiang <
>> hannahji...@google.com> wrote:
>> >> >
>> >> > Hi team
>> >> >
>> >> > I am working on improving docker container support for Beam. We
>> would like to publish prebuilt containers for each release version and
>> daily snapshot. Current work focuses on release images only and it would 
>> be
>> part of the release process.
>> >>
>> >> This would be great!
>> >>
>> >> > The release images will be pushed to GCR which is publicly
>> accessible(pullable). We will use the following locations.
>> >> > Repository: gcr.io/beam
>> >> > Project: apache-beam-testing
>> >>
>> >> Given that these are release artifacts, we should use a project
>> with
>> >> more restricted access than "anyone who opens a PR on github."
>> >
>> >
>> > We have two options:
>> > -  gcr.io works based on the permissions of the gcs bucket that is
>> backing it. GCS supports bucket only permissions. These permissions needs
>> to be explicitly granted and the service accounts used by jenkins jobs 
>> does
>> not have these explicit permissions today.
>> > - we can create a new project in gcr, bintray or anything else that
>> offers the same service.
>>
>> I think the cleanest is to simply have a new project whose membership
>> consists of (interested) PMC members. If we have to populate this
>> manually I think that'd still be OK as the churn is quite low.
>>
>
>>>
>>> --
>>> Cheers,
>>> Gleb
>>>
>>

Re: Improve container support

2019-08-28 Thread Pablo Estrada

A question about the repository we're using. It seems that Docker Hub is
the de-facto repo for docker images? Is GCR pretty much the same in terms
of access, authentication, etc?
We don't have to figure out the repository immediately, and it's fine to
iterate - but I wanted to make sure we think about that : )

On Wed, Aug 28, 2019 at 6:08 AM Lukasz Cwik  wrote:

> Google locks down docs created with @google.com addresses. Hannah, please
> recreate the doc using a non @google.com address and share it with the
> community. You'll want to replace the Google short link with an Apache short
> link (s.apache.org).
>
> On Wed, Aug 28, 2019 at 5:40 AM Gleb Kanterov  wrote:
>
>> Google Doc doesn't seem to be shared with dev@. Can anybody double-check?
>>
>> On Wed, Aug 28, 2019 at 7:36 AM Hannah Jiang 
>> wrote:
>>
>>> add dev@
>>>
>>> On Tue, Aug 27, 2019 at 9:29 PM Hannah Jiang 
>>> wrote:
>>>
 Thanks for the comments and discussion.
 I created a Google Doc for easy commenting and reviewing. From this
 moment, all changes will be made in the Google Doc, and I will sync them
 to the wiki after finalizing all plans.

 Thanks,
 Hannah

 On Tue, Aug 27, 2019 at 9:24 PM Ahmet Altay  wrote:

> Hi datapls-engprod,
>
> I have a question. Do you know what would it take to create a new gcp
> project similar to apache-beam-testing for purposes of distributing gcr
> packages? We can use the same billing account.
>
> Hannah, Robert, depending on the complexity of creating another gcp
> project we can go with that, or simply create a new bintray account. 
> Either
> way would give us a clean new project to publish artifacts.
>
> Ahmet
>
> -- Forwarded message -
> From: Robert Bradshaw 
> Date: Tue, Aug 27, 2019 at 6:48 PM
> Subject: Re: Improve container support
> To: dev 
>
>
> On Tue, Aug 27, 2019 at 6:20 PM Ahmet Altay  wrote:
> >
> > On Tue, Aug 27, 2019 at 5:50 PM Robert Bradshaw 
> wrote:
> >>
> >> On Tue, Aug 27, 2019 at 3:35 PM Hannah Jiang <
> hannahji...@google.com> wrote:
> >> >
> >> > Hi team
> >> >
> >> > I am working on improving docker container support for Beam. We
> would like to publish prebuilt containers for each release version and
> daily snapshot. Current work focuses on release images only and it would 
> be
> part of the release process.
> >>
> >> This would be great!
> >>
> >> > The release images will be pushed to GCR which is publicly
> accessible(pullable). We will use the following locations.
> >> > Repository: gcr.io/beam
> >> > Project: apache-beam-testing
> >>
> >> Given that these are release artifacts, we should use a project with
> >> more restricted access than "anyone who opens a PR on github."
> >
> >
> > We have two options:
> > -  gcr.io works based on the permissions of the gcs bucket that is
> backing it. GCS supports bucket only permissions. These permissions needs
> to be explicitly granted and the service accounts used by jenkins jobs 
> does
> not have these explicit permissions today.
> > - we can create a new project in gcr, bintray or anything else that
> offers the same service.
>
> I think the cleanest is to simply have a new project whose membership
> consists of (interested) PMC members. If we have to populate this
> manually I think that'd still be OK as the churn is quite low.
>

>>
>> --
>> Cheers,
>> Gleb
>>
>

Re: Write-through-cache in State logic

2019-08-28 Thread Maximilian Michels


I've tried to put the current design into code. Any feedback appreciated for 
these changes to enable caching of user state:

Proto: https://github.com/apache/beam/pull/9440
Runner: https://github.com/apache/beam/pull/9374
Python SDK: https://github.com/apache/beam/pull/9418

Thanks,
Max
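
For readers following the design, here is a minimal sketch of an LRU cache for user state keyed by state key and cache token. It is not the code in the linked pull requests: StateCache and its methods are hypothetical names, and the sketch assumes that a lookup with a different cache token must miss, so a new token from the runner effectively invalidates older entries.

```python
from collections import OrderedDict


class StateCache:
    """Least-recently-used cache keyed by (state_key, cache_token)."""

    def __init__(self, max_entries):
        self._max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, state_key, cache_token):
        key = (state_key, cache_token)
        if key not in self._entries:
            return None
        # Mark the entry as most recently used.
        self._entries.move_to_end(key)
        return self._entries[key]

    def put(self, state_key, cache_token, value):
        key = (state_key, cache_token)
        self._entries[key] = value
        self._entries.move_to_end(key)
        # Evict the least recently used entries once over capacity.
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)


cache = StateCache(max_entries=2)
cache.put("state-a", b"token-1", [1])
cache.put("state-b", b"token-1", [2])
hit = cache.get("state-a", b"token-1")         # touches state-a
cache.put("state-c", b"token-1", [3])          # evicts state-b
evicted = cache.get("state-b", b"token-1")
stale_token = cache.get("state-a", b"token-2")  # different token misses
```

In this run, hit is [1], while evicted and stale_token are both None. Keying on the token keeps validation implicit: the SDK never has to compare versions, it simply fails to find data cached under an outdated token.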

On 28.08.19 11:48, Maximilian Michels wrote:

> Just to clarify, the repeated list of cache tokens in the process
> bundle request is used to validate reading *and* stored when writing?
> In that sense, should they just be called version identifiers or
> something like that?

We could call them version identifiers, though cache tokens were always
a means to identify versions of a state.

On 28.08.19 11:10, Maximilian Michels wrote:
>> cachetools sounds like a fine choice to me.
>
> For the first version I've implemented a simple LRU cache. If you want
> to have a look:
> https://github.com/apache/beam/pull/9418/files#diff-ed2d70e99442b6e1668e30409d3383a6R60
>
>
>> Open up a PR for the proto changes and we can work through any minor
>> comments there.
>
> Proto changes: https://github.com/apache/beam/pull/9440
>
>
> Thanks,
> Max
>
> On 27.08.19 23:00, Robert Bradshaw wrote:
>> Just to clarify, the repeated list of cache tokens in the process
>> bundle request is used to validate reading *and* stored when writing?
>> In that sense, should they just be called version identifiers or
>> something like that?
>>
>> On Tue, Aug 27, 2019 at 11:33 AM Maximilian Michels 
>> wrote:
>>>
>>> Thanks. Updated:
>>>
>>> message ProcessBundleRequest {
>>>    // (Required) A reference to the process bundle descriptor that
>>> must be
>>>    // instantiated and executed by the SDK harness.
>>>    string process_bundle_descriptor_reference = 1;
>>>
>>>    // A cache token which can be used by an SDK to check for the
>>> validity
>>>    // of cached elements which have a cache token associated.
>>>    message CacheToken {
>>>
>>>  // A flag to indicate a cache token is valid for user state.
>>>  message UserState {}
>>>
>>>  // A flag to indicate a cache token is valid for a side input.
>>>  message SideInput {
>>>    // The id of a side input.
>>>    string side_input = 1;
>>>  }
>>>
>>>  // The scope of a cache token.
>>>  oneof type {
>>>    UserState user_state = 1;
>>>    SideInput side_input = 2;
>>>  }
>>>
>>>  // The cache token identifier which should be globally unique.
>>>  bytes token = 10;
>>>    }
>>>
>>>    // (Optional) A list of cache tokens that can be used by an SDK
>>> to reuse
>>>    // cached data returned by the State API across multiple bundles.
>>>    repeated CacheToken cache_tokens = 2;
>>> }
>>>
>>> On 27.08.19 19:22, Lukasz Cwik wrote:
>>>
>>> SideInputState -> SideInput (side_input_state -> side_input)
>>> + more comments around the messages and the fields.
>>>
>>>
>>> On Tue, Aug 27, 2019 at 10:18 AM Maximilian Michels 
>>> wrote:

 We would have to differentiate cache tokens for user state and side
 inputs. How about something like this?

 message ProcessBundleRequest {
    // (Required) A reference to the process bundle descriptor that
 must be
    // instantiated and executed by the SDK harness.
    string process_bundle_descriptor_reference = 1;

    message CacheToken {

  message UserState {
  }

  message SideInputState {
    string side_input_id = 1;
  }

  oneof type {
    UserState user_state = 1;
    SideInputState side_input_state = 2;
  }

  bytes token = 10;
    }

    // (Optional) A list of cache tokens that can be used by an SDK
 to reuse
    // cached data returned by the State API across multiple bundles.
    repeated CacheToken cache_tokens = 2;
 }

 -Max

 On 27.08.19 18:43, Lukasz Cwik wrote:

 The bundles view of side inputs should never change during
 processing and should have a point in time snapshot.

 I was just trying to say that the cache token for side inputs being
 deferred till side input request time simplified the runners
 implementation since that is conclusively when the runner would
 need to take a look at the side input. Putting them as part of the
 ProcessBundleRequest complicates that but does make the SDK
 implementation significantly simpler which is a win.

 On Tue, Aug 27, 2019 at 9:14 AM Maximilian Michels 
 wrote:
>
> Thanks for the quick response.
>
> Just to clarify, the issue with versioning side input is also present
> when supplying the cache tokens on a request basis instead of per
> bundle. The SDK never knows when the Runner receives a new version of
> the side input. Like you pointed out, it needs to mark side inputs as
> stale and generate new cache tokens for the stale side inputs.
>
> The difference between per-request tokens

Re: Master broken (likely due to Mockito upgrade)

2019-08-28 Thread Maximilian Michels


Master is ok again. Thank you Ryan for this one: 
https://github.com/apache/beam/pull/9442

On 28.08.19 14:32, Maximilian Michels wrote:

Ismael pointed out that reverting #9338 works but the real culprit is
https://github.com/apache/beam/pull/9000. Updated the PR to revert this
commit instead.

-Max

On 28.08.19 14:21, Maximilian Michels wrote:
> Hi,
>
> Most of you probably realized that the master is currently broken:
> https://builds.apache.org/job/beam_PreCommit_Java_Commit/7505/
> https://builds.apache.org/job/beam_PreCommit_Java_Cron/
>
> I did some bisecting and found that Mockito was updated:
> https://github.com/apache/beam/pull/9338
>
> This was merged 7 days ago. I don't know why we are only seeing these
> errors now, but testing this locally I was able to reproduce the errors
> with master and they were gone after reverting the Mockito changes. I
> opened up a PR with a revert: https://github.com/apache/beam/pull/9441
>
> Thanks,
> Max
>

Stop publishing unneeded Java artifacts

2019-08-28 Thread Łukasz Gajowy

Hi all,

I wanted to let you know that in PR 9417 I'm planning to turn off
publishing of the following modules' artifacts to the maven repository:

   - :runners:google-cloud-dataflow-java:worker:windmill
   - :sdks:java:build-tools
   - :sdks:java:javadoc
   - :sdks:java:testing:expansion-service
   - :sdks:java:io:bigquery-io-perf-tests
   - :sdks:java:io:file-based-io-tests
   - :sdks:java:io:elasticsearch-tests:elasticsearch-tests-2
   - :sdks:java:io:elasticsearch-tests:elasticsearch-tests-5
   - :sdks:java:io:elasticsearch-tests:elasticsearch-tests-6
   - :sdks:java:io:elasticsearch-tests:elasticsearch-tests-common
   - :sdks:java:testing:load-tests
   - :sdks:java:testing:nexmark
   - :sdks:java:testing:test-utils

AFAIK, the purpose of these modules is to keep related
tests/test-utils/utils together. We are not expecting users to make use of
such artifacts. Please let me know if you have any objections. If there are
none and the PR gets merged, the artifacts will no longer be published.

Thanks!
Łukasz

Re: [ANNOUNCE] New committer: Valentyn Tymofieiev

2019-08-28 Thread Frederik Bode

Congrats Valentyn!!


On Wed, 28 Aug 2019 at 15:28, Tanay Tummalapalli 
wrote:

> Congratulations Valentyn!
>
> On Wed, Aug 28, 2019 at 7:16 AM Ruoyun Huang  wrote:
>
>> Congratulations Valentyn!
>>
>> On Tue, Aug 27, 2019 at 6:16 PM Daniel Oliveira 
>> wrote:
>>
>>> Congratulations Valentyn!
>>>
>>> On Tue, Aug 27, 2019, 11:31 AM Boyuan Zhang  wrote:
>>>
 Congratulations!

 On Tue, Aug 27, 2019 at 10:44 AM Udi Meiri  wrote:

> Congrats!
>
> On Tue, Aug 27, 2019 at 9:50 AM Yichi Zhang  wrote:
>
>> Congrats Valentyn!
>>
>> On Tue, Aug 27, 2019 at 7:55 AM Valentyn Tymofieiev <
>> valen...@google.com> wrote:
>>
>>> Thank you everyone!
>>>
>>> On Tue, Aug 27, 2019 at 2:57 AM Alexey Romanenko <
>>> aromanenko@gmail.com> wrote:
>>>
 Congrats, well deserved!

 On 27 Aug 2019, at 11:25, Jan Lukavský  wrote:

 Congrats Valentyn!
 On 8/26/19 11:43 PM, Rui Wang wrote:

 Congratulations!


 -Rui

 On Mon, Aug 26, 2019 at 2:36 PM Hannah Jiang <
 hannahji...@google.com> wrote:

> Congratulations Valentyn, well deserved!
>
> On Mon, Aug 26, 2019 at 2:34 PM Chamikara Jayalath <
> chamik...@google.com> wrote:
>
>> Congrats Valentyn!
>>
>> On Mon, Aug 26, 2019 at 2:32 PM Pablo Estrada 
>> wrote:
>>
>>> Thanks Valentyn!
>>>
>>> On Mon, Aug 26, 2019 at 2:29 PM Robin Qiu 
>>> wrote:
>>>
 Thank you Valentyn! Congratulations!

 On Mon, Aug 26, 2019 at 2:28 PM Robert Bradshaw <
 rober...@google.com> wrote:

> Hi,
>
> Please join me and the rest of the Beam PMC in welcoming a new
> committer: Valentyn Tymofieiev
>
> Valentyn has made numerous contributions to Beam over the last
> several
> years (including 100+ pull requests), most recently pushing
> through
> the effort to make Beam compatible with Python 3. He is also
> an active
> participant in design discussions on the list, participates in
> release
> candidate validation, and proactively helps keep our tests
> green.
>
> In consideration of Valentyn's contributions, the Beam PMC
> trusts him
> with the responsibilities of a Beam committer [1].
>
> Thank you, Valentyn, for your contributions and looking
> forward to many more!
>
> Robert, on behalf of the Apache Beam PMC
>
> [1]
> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>


>>
>> --
>> 
>> Ruoyun  Huang
>>
>>

Re: [ANNOUNCE] New committer: Valentyn Tymofieiev

2019-08-28 Thread Tanay Tummalapalli

Congratulations Valentyn!

On Wed, Aug 28, 2019 at 7:16 AM Ruoyun Huang  wrote:

> Congratulations Valentyn!
>
> On Tue, Aug 27, 2019 at 6:16 PM Daniel Oliveira 
> wrote:
>
>> Congratulations Valentyn!
>>
>> On Tue, Aug 27, 2019, 11:31 AM Boyuan Zhang  wrote:
>>
>>> Congratulations!
>>>
>>> On Tue, Aug 27, 2019 at 10:44 AM Udi Meiri  wrote:
>>>
 Congrats!

 On Tue, Aug 27, 2019 at 9:50 AM Yichi Zhang  wrote:

> Congrats Valentyn!
>
> On Tue, Aug 27, 2019 at 7:55 AM Valentyn Tymofieiev <
> valen...@google.com> wrote:
>
>> Thank you everyone!
>>
>> On Tue, Aug 27, 2019 at 2:57 AM Alexey Romanenko <
>> aromanenko@gmail.com> wrote:
>>
>>> Congrats, well deserved!
>>>
>>> On 27 Aug 2019, at 11:25, Jan Lukavský  wrote:
>>>
>>> Congrats Valentyn!
>>> On 8/26/19 11:43 PM, Rui Wang wrote:
>>>
>>> Congratulations!
>>>
>>>
>>> -Rui
>>>
>>> On Mon, Aug 26, 2019 at 2:36 PM Hannah Jiang 
>>> wrote:
>>>
 Congratulations Valentyn, well deserved!

 On Mon, Aug 26, 2019 at 2:34 PM Chamikara Jayalath <
 chamik...@google.com> wrote:

> Congrats Valentyn!
>
> On Mon, Aug 26, 2019 at 2:32 PM Pablo Estrada 
> wrote:
>
>> Thanks Valentyn!
>>
>> On Mon, Aug 26, 2019 at 2:29 PM Robin Qiu 
>> wrote:
>>
>>> Thank you Valentyn! Congratulations!
>>>
>>> On Mon, Aug 26, 2019 at 2:28 PM Robert Bradshaw <
>>> rober...@google.com> wrote:
>>>
 Hi,

 Please join me and the rest of the Beam PMC in welcoming a new
 committer: Valentyn Tymofieiev

 Valentyn has made numerous contributions to Beam over the last
 several
 years (including 100+ pull requests), most recently pushing
 through
 the effort to make Beam compatible with Python 3. He is also an
 active
 participant in design discussions on the list, participates in
 release
 candidate validation, and proactively helps keep our tests
 green.

 In consideration of Valentyn's contributions, the Beam PMC
 trusts him
 with the responsibilities of a Beam committer [1].

 Thank you, Valentyn, for your contributions and looking forward
 to many more!

 Robert, on behalf of the Apache Beam PMC

 [1]
 https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer

>>>
>>>
>
> --
> 
> Ruoyun  Huang
>
>

Re: Improve container support

2019-08-28 Thread Lukasz Cwik

Google locks down docs created with @google.com addresses. Hannah, please
recreate the doc using a non-@google.com address and share it with the
community. You'll want to replace the Google short link with an Apache short
link (s.apache.org).

On Wed, Aug 28, 2019 at 5:40 AM Gleb Kanterov  wrote:

> Google Doc doesn't seem to be shared with dev@. Can anybody double-check?
>
> On Wed, Aug 28, 2019 at 7:36 AM Hannah Jiang 
> wrote:
>
>> add dev@
>>
>> On Tue, Aug 27, 2019 at 9:29 PM Hannah Jiang 
>> wrote:
>>
>>> Thanks for the comments and discussion.
>>> I created a Google Doc for easy commenting and reviewing. From this moment,
>>> all changes will be made in the Google Doc, and I will sync them to the wiki
>>> after finalizing all plans.
>>>
>>> Thanks,
>>> Hannah
>>>
>>> On Tue, Aug 27, 2019 at 9:24 PM Ahmet Altay  wrote:
>>>
 Hi datapls-engprod,

 I have a question. Do you know what would it take to create a new gcp
 project similar to apache-beam-testing for purposes of distributing gcr
 packages? We can use the same billing account.

 Hannah, Robert, depending on the complexity of creating another gcp
 project we can go with that, or simply create a new bintray account. Either
 way would give us a clean new project to publish artifacts.

 Ahmet

 -- Forwarded message -
 From: Robert Bradshaw 
 Date: Tue, Aug 27, 2019 at 6:48 PM
 Subject: Re: Improve container support
 To: dev 

 On Tue, Aug 27, 2019 at 6:20 PM Ahmet Altay  wrote:
 >
 > On Tue, Aug 27, 2019 at 5:50 PM Robert Bradshaw 
 wrote:
 >>
 >> On Tue, Aug 27, 2019 at 3:35 PM Hannah Jiang 
 wrote:
 >> >
 >> > Hi team
 >> >
 >> > I am working on improving docker container support for Beam. We
 would like to publish prebuilt containers for each release version and
 daily snapshot. Current work focuses on release images only and it would be
 part of the release process.
 >>
 >> This would be great!
 >>
 >> > The release images will be pushed to GCR which is publicly
 accessible(pullable). We will use the following locations.
 >> > Repository: gcr.io/beam
 >> > Project: apache-beam-testing
 >>
 >> Given that these are release artifacts, we should use a project with
 >> more restricted access than "anyone who opens a PR on github."
 >
 >
 > We have two options:
 > - gcr.io works based on the permissions of the GCS bucket that is
 backing it. GCS supports bucket-only permissions. These permissions need
 to be explicitly granted, and the service accounts used by Jenkins jobs do
 not have these explicit permissions today.
 > - we can create a new project in gcr, bintray or anything else that
 offers the same service.

 I think the cleanest is to simply have a new project whose membership
 consists of (interested) PMC members. If we have to populate this
 manually I think that'd still be OK as the churn is quite low.

>>>
>
> --
> Cheers,
> Gleb
>

Re: Improve container support

2019-08-28 Thread Gleb Kanterov

Google Doc doesn't seem to be shared with dev@. Can anybody double-check?

On Wed, Aug 28, 2019 at 7:36 AM Hannah Jiang  wrote:

> add dev@
>
> On Tue, Aug 27, 2019 at 9:29 PM Hannah Jiang 
> wrote:
>
>> Thanks for the comments and discussion.
>> I created a Google Doc for easy commenting and reviewing. From this moment,
>> all changes will be made in the Google Doc, and I will sync them to the wiki
>> after finalizing all plans.
>>
>> Thanks,
>> Hannah
>>
>> On Tue, Aug 27, 2019 at 9:24 PM Ahmet Altay  wrote:
>>
>>> Hi datapls-engprod,
>>>
>>> I have a question. Do you know what would it take to create a new gcp
>>> project similar to apache-beam-testing for purposes of distributing gcr
>>> packages? We can use the same billing account.
>>>
>>> Hannah, Robert, depending on the complexity of creating another gcp
>>> project we can go with that, or simply create a new bintray account. Either
>>> way would give us a clean new project to publish artifacts.
>>>
>>> Ahmet
>>>
>>> -- Forwarded message -
>>> From: Robert Bradshaw 
>>> Date: Tue, Aug 27, 2019 at 6:48 PM
>>> Subject: Re: Improve container support
>>> To: dev 
>>>
>>>
>>> On Tue, Aug 27, 2019 at 6:20 PM Ahmet Altay  wrote:
>>> >
>>> > On Tue, Aug 27, 2019 at 5:50 PM Robert Bradshaw 
>>> wrote:
>>> >>
>>> >> On Tue, Aug 27, 2019 at 3:35 PM Hannah Jiang 
>>> wrote:
>>> >> >
>>> >> > Hi team
>>> >> >
>>> >> > I am working on improving docker container support for Beam. We
>>> would like to publish prebuilt containers for each release version and
>>> daily snapshot. Current work focuses on release images only and it would be
>>> part of the release process.
>>> >>
>>> >> This would be great!
>>> >>
>>> >> > The release images will be pushed to GCR which is publicly
>>> accessible(pullable). We will use the following locations.
>>> >> > Repository: gcr.io/beam
>>> >> > Project: apache-beam-testing
>>> >>
>>> >> Given that these are release artifacts, we should use a project with
>>> >> more restricted access than "anyone who opens a PR on github."
>>> >
>>> >
>>> > We have two options:
>>> > - gcr.io works based on the permissions of the GCS bucket that is
>>> backing it. GCS supports bucket-only permissions. These permissions need
>>> to be explicitly granted, and the service accounts used by Jenkins jobs do
>>> not have these explicit permissions today.
>>> > - we can create a new project in gcr, bintray or anything else that
>>> offers the same service.
>>>
>>> I think the cleanest is to simply have a new project whose membership
>>> consists of (interested) PMC members. If we have to populate this
>>> manually I think that'd still be OK as the churn is quite low.
>>>
>>

-- 
Cheers,
Gleb

Re: Master broken (likely due to Mockito upgrade)

2019-08-28 Thread Maximilian Michels


Ismael pointed out that reverting #9338 works but the real culprit is 
https://github.com/apache/beam/pull/9000. Updated the PR to revert this commit 
instead.

-Max

On 28.08.19 14:21, Maximilian Michels wrote:

Hi,

Most of you probably realized that the master is currently broken:
https://builds.apache.org/job/beam_PreCommit_Java_Commit/7505/
https://builds.apache.org/job/beam_PreCommit_Java_Cron/

I did some bisecting and found that Mockito was updated:
https://github.com/apache/beam/pull/9338

This was merged 7 days ago. I don't know why we are only seeing these
errors now, but testing this locally I was able to reproduce the errors
with master and they were gone after reverting the Mockito changes. I
opened up a PR with a revert: https://github.com/apache/beam/pull/9441

Thanks,
Max

Master broken (likely due to Mockito upgrade)

2019-08-28 Thread Maximilian Michels


Hi,

Most of you probably realized that the master is currently broken:
https://builds.apache.org/job/beam_PreCommit_Java_Commit/7505/
https://builds.apache.org/job/beam_PreCommit_Java_Cron/

I did some bisecting and found that Mockito was updated:
https://github.com/apache/beam/pull/9338

This was merged 7 days ago. I don't know why we are only seeing these
errors now, but testing this locally I was able to reproduce the errors
with master and they were gone after reverting the Mockito changes. I 
opened up a PR with a revert: https://github.com/apache/beam/pull/9441


Thanks,
Max

RowWithGetters, FieldValueGetter being Serializable?

2019-08-28 Thread Alex Van Boxel

Hi,

I noticed that RowWithGetters and FieldValueGetter are both serializable
(all in package org.apache.beam.sdk.values). I have my doubts about whether
they should be.

Certainly RowWithGetters would be problematic:

   - It references the underlying object, which could be anything and is *not
   guaranteed* to be Serializable. In my case I'm referring to a Protobuf
   message, which is not.
   - FieldValueGetter should also not be Serializable, as the getters are
   generated by the factory. I'm implementing getters that also need
   FieldDescriptors to access the underlying dynamic Protobuf fields, and
   FieldDescriptors are not serializable either.

The *only* class that should be serializable is RowWithStorage as the
current implementation will convert any type of Row to this as soon as a
serialization step needs to happen.

Thoughts? If you all agree, I'll create a ticket and fix this, as it is
somewhat blocking my Protobuf implementation (it won't pass SpotBugs, which
complains about non-serializable fields in FieldValueGetter).

 _/
_/ Alex Van Boxel

Re: Write-through-cache in State logic

2019-08-28 Thread Maximilian Michels


Just to clarify, the repeated list of cache tokens in the process
bundle request is used to validate reading *and* stored when writing?
In that sense, should they just be called version identifiers or
something like that?


We could call them version identifiers, though cache tokens were always 
a means to identify versions of a state.


On 28.08.19 11:10, Maximilian Michels wrote:

cachetools sounds like a fine choice to me.


For the first version I've implemented a simple LRU cache. If you want 
to have a look: 
https://github.com/apache/beam/pull/9418/files#diff-ed2d70e99442b6e1668e30409d3383a6R60 



Open up a PR for the proto changes and we can work through any minor 
comments there.


Proto changes: https://github.com/apache/beam/pull/9440


Thanks,
Max

On 27.08.19 23:00, Robert Bradshaw wrote:

Just to clarify, the repeated list of cache tokens in the process
bundle request is used to validate reading *and* stored when writing?
In that sense, should they just be called version identifiers or
something like that?

On Tue, Aug 27, 2019 at 11:33 AM Maximilian Michels  
wrote:


Thanks. Updated:

message ProcessBundleRequest {
  // (Required) A reference to the process bundle descriptor that must be
  // instantiated and executed by the SDK harness.
  string process_bundle_descriptor_reference = 1;

  // A cache token which can be used by an SDK to check for the validity
  // of cached elements which have a cache token associated.
  message CacheToken {

    // A flag to indicate a cache token is valid for user state.
    message UserState {}

    // A flag to indicate a cache token is valid for a side input.
    message SideInput {
      // The id of a side input.
      string side_input = 1;
    }

    // The scope of a cache token.
    oneof type {
      UserState user_state = 1;
      SideInput side_input = 2;
    }

    // The cache token identifier which should be globally unique.
    bytes token = 10;
  }

  // (Optional) A list of cache tokens that can be used by an SDK to reuse
  // cached data returned by the State API across multiple bundles.
  repeated CacheToken cache_tokens = 2;
}

On 27.08.19 19:22, Lukasz Cwik wrote:

SideInputState -> SideInput (side_input_state -> side_input)
+ more comments around the messages and the fields.


On Tue, Aug 27, 2019 at 10:18 AM Maximilian Michels  
wrote:


We would have to differentiate cache tokens for user state and side 
inputs. How about something like this?


message ProcessBundleRequest {
  // (Required) A reference to the process bundle descriptor that must be
  // instantiated and executed by the SDK harness.
  string process_bundle_descriptor_reference = 1;

  message CacheToken {

    message UserState {
    }

    message SideInputState {
      string side_input_id = 1;
    }

    oneof type {
      UserState user_state = 1;
      SideInputState side_input_state = 2;
    }

    bytes token = 10;
  }

  // (Optional) A list of cache tokens that can be used by an SDK to reuse
  // cached data returned by the State API across multiple bundles.
  repeated CacheToken cache_tokens = 2;
}

-Max

On 27.08.19 18:43, Lukasz Cwik wrote:

The bundle's view of side inputs should never change during processing and
should be a point-in-time snapshot.

I was just trying to say that deferring the cache token for side inputs until
side input request time simplified the runner's implementation, since that is
conclusively when the runner would need to take a look at the side input.
Putting them in the ProcessBundleRequest complicates that but does make the
SDK implementation significantly simpler, which is a win.


On Tue, Aug 27, 2019 at 9:14 AM Maximilian Michels  
wrote:


Thanks for the quick response.

Just to clarify, the issue with versioning side input is also present
when supplying the cache tokens on a request basis instead of per
bundle. The SDK never knows when the Runner receives a new version of
the side input. Like you pointed out, it needs to mark side inputs as
stale and generate new cache tokens for the stale side inputs.

The difference between per-request tokens and per-bundle tokens would be
that the side input can only change after a bundle completes vs. during
the bundle. Side inputs are always fuzzy in that regard because there is
no precise instance where side inputs are atomically updated, other than
the assumption that they eventually will be updated. In that regard
per-bundle tokens for side input seem to be fine.

All of the above is not an issue for user state, as its cache can remain
valid for the lifetime of a Runner<=>SDK Harness connection. A simple
solution would be to not cache side input because there are many cases
where the caching just adds additional overhead. However, I can also
imagine cases where side input is valid forever and caching would be
very beneficial.

For the first version I want to focus on user state because that's where I

Re: Is it too late to switch to Java 8 time for the schema aware Row and Beam SQL?

2019-08-28 Thread Alex Van Boxel

Thanks. How will ZetaSQL support higher precision, given that the input will
generally be an Instant anyway? Will it rely on the "pending" standardized
logical types?
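
To make the precision concern in this thread concrete, here is a small Python sketch of what happens when a (seconds, nanos) timestamp, the shape used by the Protobuf Timestamp type, is squeezed through a millisecond value such as the one Joda-Time stores. The function names are invented for this illustration and are not Beam APIs.

```python
NANOS_PER_MILLI = 1_000_000


def to_epoch_millis(seconds, nanos):
    """Collapse a (seconds, nanos) timestamp to epoch milliseconds, truncating."""
    return seconds * 1000 + nanos // NANOS_PER_MILLI


def from_epoch_millis(millis):
    """Expand epoch milliseconds back to a (seconds, nanos) pair."""
    seconds, rem_millis = divmod(millis, 1000)
    return seconds, rem_millis * NANOS_PER_MILLI


# Hypothetical sample value: everything below the millisecond is lost.
original = (1566000000, 123456789)
round_tripped = from_epoch_millis(to_epoch_millis(*original))
lost_nanos = original[1] - round_tripped[1]
print(round_tripped, lost_nanos)
```

A logical type that carries the nanosecond part separately would avoid this truncation, which is roughly the direction suggested in the replies quoted below.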

 _/
_/ Alex Van Boxel


On Mon, Aug 19, 2019 at 7:02 AM Rui Wang  wrote:

> However, more challenges come from:
>
> 1. How to read data without losing precision. The Beam Java SDK already uses
> Joda, so it's very likely that you will need to update IO somehow to support
> higher precision.
> 2. How to process higher precision in BeamSQL. It means SQL functions
> should support higher precision. If you use Beam Calcite, unfortunately it
> will only support up to millis. If you use Beam ZetaSQL (under review),
> there are opportunities to support higher precision for SQL functions.
>
>
> -Rui
>
> On Sun, Aug 18, 2019 at 9:52 PM Rui Wang  wrote:
>
>> We have been discussing it for a long time. I think if you only want to
>> support more precision (e.g. up to nanosecond) for BeamSQL, it's actually
>> relatively straightforward to support it by using a logical type for
>> BeamSQL.
>>
>>
>> -Rui
>>
>> On Sat, Aug 17, 2019 at 7:21 AM Alex Van Boxel  wrote:
>>
>>> I know it's probably futile, but the more I'm working on features that
>>> are related to schema awareness I'm getting a bit frustrated about the lack
>>> of precision of the joda instance.
>>>
>>> As soon as we have a conversion to the DateTime I need to drop
>>> precision; this happens with the Protobuf timestamp (nanoseconds), but I
>>> also notice it with BigQuery (milliseconds).
>>>
>>> Suggestions?
>>>
>>>  _/
>>> _/ Alex Van Boxel
>>>
>>

Re: Write-through-cache in State logic

2019-08-28 Thread Maximilian Michels


cachetools sounds like a fine choice to me.


For the first version I've implemented a simple LRU cache. If you want 
to have a look: 
https://github.com/apache/beam/pull/9418/files#diff-ed2d70e99442b6e1668e30409d3383a6R60
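
For readers unfamiliar with the idea, a simple LRU cache can be sketched in a few lines of Python. This is only an illustration and not the code from the PR above; the class name `LRUCache` and its methods are made up for this example. The cachetools package mentioned above ships a ready-made LRU cache as well.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny least-recently-used cache (hypothetical, for illustration only)."""

    def __init__(self, max_entries):
        if max_entries <= 0:
            raise ValueError("max_entries must be positive")
        self._max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, key, default=None):
        if key not in self._entries:
            return default
        # A hit makes the entry the most recently used one.
        self._entries.move_to_end(key)
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        # Evict from the least recently used end once over capacity.
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)

    def __len__(self):
        return len(self._entries)
```

With a capacity of two, storing `a` and `b`, reading `a`, and then storing `c` evicts `b`, because `b` is the entry that was touched least recently.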



Open up a PR for the proto changes and we can work through any minor comments 
there.


Proto changes: https://github.com/apache/beam/pull/9440


Thanks,
Max

On 27.08.19 23:00, Robert Bradshaw wrote:

Just to clarify, the repeated list of cache tokens in the process
bundle request is used to validate reading *and* stored when writing?
In that sense, should they just be called version identifiers or
something like that?

On Tue, Aug 27, 2019 at 11:33 AM Maximilian Michels  wrote:


Thanks. Updated:

message ProcessBundleRequest {
  // (Required) A reference to the process bundle descriptor that must be
  // instantiated and executed by the SDK harness.
  string process_bundle_descriptor_reference = 1;

  // A cache token which can be used by an SDK to check for the validity
  // of cached elements which have a cache token associated.
  message CacheToken {

    // A flag to indicate a cache token is valid for user state.
    message UserState {}

    // A flag to indicate a cache token is valid for a side input.
    message SideInput {
      // The id of a side input.
      string side_input = 1;
    }

    // The scope of a cache token.
    oneof type {
      UserState user_state = 1;
      SideInput side_input = 2;
    }

    // The cache token identifier which should be globally unique.
    bytes token = 10;
  }

  // (Optional) A list of cache tokens that can be used by an SDK to reuse
  // cached data returned by the State API across multiple bundles.
  repeated CacheToken cache_tokens = 2;
}

On 27.08.19 19:22, Lukasz Cwik wrote:

SideInputState -> SideInput (side_input_state -> side_input)
+ more comments around the messages and the fields.


On Tue, Aug 27, 2019 at 10:18 AM Maximilian Michels  wrote:


We would have to differentiate cache tokens for user state and side inputs. How 
about something like this?

message ProcessBundleRequest {
  // (Required) A reference to the process bundle descriptor that must be
  // instantiated and executed by the SDK harness.
  string process_bundle_descriptor_reference = 1;

  message CacheToken {

    message UserState {
    }

    message SideInputState {
      string side_input_id = 1;
    }

    oneof type {
      UserState user_state = 1;
      SideInputState side_input_state = 2;
    }

    bytes token = 10;
  }

  // (Optional) A list of cache tokens that can be used by an SDK to reuse
  // cached data returned by the State API across multiple bundles.
  repeated CacheToken cache_tokens = 2;
}

-Max

On 27.08.19 18:43, Lukasz Cwik wrote:

The bundle's view of side inputs should never change during processing and
should be a point-in-time snapshot.

I was just trying to say that deferring the cache token for side inputs until
side input request time simplified the runner's implementation, since that is
conclusively when the runner would need to take a look at the side input.
Putting them in the ProcessBundleRequest complicates that but does make the
SDK implementation significantly simpler, which is a win.

On Tue, Aug 27, 2019 at 9:14 AM Maximilian Michels  wrote:


Thanks for the quick response.

Just to clarify, the issue with versioning side input is also present
when supplying the cache tokens on a request basis instead of per
bundle. The SDK never knows when the Runner receives a new version of
the side input. Like you pointed out, it needs to mark side inputs as
stale and generate new cache tokens for the stale side inputs.

The difference between per-request tokens and per-bundle tokens would be
that the side input can only change after a bundle completes vs. during
the bundle. Side inputs are always fuzzy in that regard because there is
no precise instance where side inputs are atomically updated, other than
the assumption that they eventually will be updated. In that regard
per-bundle tokens for side input seem to be fine.

All of the above is not an issue for user state, as its cache can remain
valid for the lifetime of a Runner<=>SDK Harness connection. A simple
solution would be to not cache side input because there are many cases
where the caching just adds additional overhead. However, I can also
imagine cases where side input is valid forever and caching would be
very beneficial.

For the first version I want to focus on user state because that's where
I see the most benefit for caching. I don't see a problem though for the
Runner to detect new side input and reflect that in the cache tokens
supplied for a new bundle.
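
One way an SDK harness could use per-bundle cache tokens is sketched below, under the assumption that cached values are keyed by the token that was valid when they were stored. Everything here (`CacheToken` as a Python tuple, `TokenScopedCache`, `start_bundle`) is hypothetical and only mirrors the shape of the proposed proto; it is not Beam's implementation.

```python
from collections import namedtuple

# Hypothetical mirror of the proposed proto message: an opaque token plus an
# optional side input id. side_input=None means the token covers user state.
CacheToken = namedtuple("CacheToken", ["token", "side_input"], defaults=[None])


class TokenScopedCache:
    """Hypothetical SDK-side cache keyed by (cache token, state key)."""

    def __init__(self):
        self._entries = {}
        self._user_state_token = None
        self._side_input_tokens = {}

    def start_bundle(self, cache_tokens):
        # Tokens arrive once per bundle, like ProcessBundleRequest.cache_tokens.
        self._user_state_token = None
        self._side_input_tokens = {}
        for t in cache_tokens:
            if t.side_input is None:
                self._user_state_token = t.token
            else:
                self._side_input_tokens[t.side_input] = t.token
        # Drop entries stored under tokens the runner no longer supplies.
        valid = {self._user_state_token, *self._side_input_tokens.values()}
        self._entries = {k: v for k, v in self._entries.items() if k[0] in valid}

    def _token_for(self, side_input):
        if side_input is None:
            return self._user_state_token
        return self._side_input_tokens.get(side_input)

    def get(self, state_key, side_input=None):
        token = self._token_for(side_input)
        if token is None:
            return None  # No token for this scope: treat it as uncached.
        return self._entries.get((token, state_key))

    def put(self, state_key, value, side_input=None):
        token = self._token_for(side_input)
        if token is not None:
            self._entries[(token, state_key)] = value
```

In this sketch, a runner that detects a new version of a side input simply supplies a different token for it in the next bundle, which invalidates the stale entries while the user state entries stay usable.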

-Max

On 26.08.19 22:27, Lukasz Cwik wrote:

Your summary below makes sense to me. I can see that recovery from
rolling back doesn't need to be a priority and simplifies the solution
for user state caching down to one token.

Providing cache

Re: Did someone create a Beam Bintray account?

2019-08-28 Thread Hannah Jiang

I created a Bintray account tonight. Can you please verify it? We can update 
the email later to a more reasonable one. 

> On Aug 27, 2019, at 11:04 PM, Pablo Estrada  wrote:
> 
> I don't know what Bintray is, but I seem to remember it's related to Python 
> wheels? 
> 
> In any case, there are emails coming into moderation asking to validate the 
> account. So if you created the account, please reach out to me/the PMC to 
> figure out what it is / what to do.
> Best
> -P.

Did someone create a Beam Bintray account?

2019-08-28 Thread Pablo Estrada

I don't know what Bintray is, but I seem to remember it's related to Python
wheels?

In any case, there are emails coming into moderation asking to validate the
account. So if you created the account, please reach out to me/the PMC to
figure out what it is / what to do.
Best
-P.
