Re: Proposal: Generalize S3FileSystem

2021-05-20 Thread Charles Chen

Is it feasible to keep the endpoint information in the path?  It seems
pretty desirable to keep URIs "universal" so that it's possible to
understand what is being pointed to without explicit service configuration,
so maybe you can have a scheme like "s3+endpoint=api.example.com
://my/bucket/path"?
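
A minimal Python sketch of how such a path could be split, assuming the
suggested "s3+endpoint=..." form. The function and type names here are
hypothetical and not part of Beam. Note that "=" is not among the characters
RFC 3986 allows in a URI scheme, so this sketch splits the string by hand
rather than relying on a standard URL parser.

```python
from typing import NamedTuple, Optional


class ParsedPath(NamedTuple):
    """Hypothetical result type for the sketch below."""
    scheme: str
    endpoint: Optional[str]
    bucket: str
    key: str


def parse_endpoint_uri(uri: str) -> ParsedPath:
    """Split 'scheme[+endpoint=HOST]://bucket/key' into its parts."""
    scheme_part, sep, rest = uri.partition("://")
    if not sep:
        raise ValueError(f"missing '://' in {uri!r}")

    # Everything after the first '+' is treated as the endpoint parameter.
    scheme, _, param = scheme_part.partition("+")
    endpoint = None
    if param:
        name, eq, value = param.partition("=")
        if name != "endpoint" or not eq or not value:
            raise ValueError(f"unsupported scheme parameter: {param!r}")
        endpoint = value

    # The first path segment is taken as the bucket, the remainder as the key.
    bucket, _, key = rest.partition("/")
    return ParsedPath(scheme, endpoint, bucket, key)
```

A plain "s3://bucket/key" path still parses, with the endpoint left as None,
so existing paths would not need to change under this assumption.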

On Thu, May 20, 2021 at 12:31 PM Kenneth Knowles  wrote:

> $.02
>
> Most important is community to maintain it. It cannot be a separate
> project or subproject (lots of ASF projects have this, so they share
> governance) without that.
>
> To add additional friction of separate release and dependency in build
> before you have community, it should be extremely stable so you upgrade
> rarely. See the process of upgrading our vendored deps. It is considerable.
>
> Kenn
>
> On Thu, May 20, 2021 at 12:07 PM Stephan Hoyer  wrote:
>
>> On Thu, May 20, 2021 at 10:12 AM Chad Dombrova  wrote:
>>
>>> Hi Brian,
>>> I think the main goal would be to make a python package that could be
>>> pip installed independently of apache_beam.  That goal could be
>>> accomplished with option 3, thus preserving all of the benefits of a
>>> monorepo. If it gains enough popularity and contributors outside of the
>>> Beam community, then options 1 and 2 could be considered to make it easier
>>> to foster a new community of contributors.
>>>
>>
>> This sounds like a lovely goal!
>>
>> I'll just mention the "fsspec" Python project, which came out of Dask:
>> https://filesystem-spec.readthedocs.io/en/latest/
>>
>> As far as I can tell, it serves basically this exact same purpose
>> (generic filesystems with high-performance IO), and has started to get some
>> traction in other projects, e.g., it's now used in pandas. I don't know if
>> it would be suitable for Beam, but it might be worth a try.
>>
>> Cheers,
>> Stephan
>>
>>
>>> Beam has a lot of great tech in it, and it makes me think of Celery,
>>> which is a much older python project of a similar ilk that spawned a series
>>> of useful independent projects: kombu [1], an AMQP messaging library, and
>>> billiard [2], a multiprocessing library.
>>>
>>> Obviously, there are a number of pros and cons to consider.  The cons
>>> are pretty clear: even within a monorepo it will make the Beam build more
>>> complicated.  The pros are a bit more abstract.  The fileIO project could
>>> appeal to a broader audience, and act as a signpost for Beam (on PyPI,
>>> etc), thereby increasing awareness of Beam amongst the types of
>>> cloud-friendly python developers who would need the fileIO package.
>>>
>>> -chad
>>>
>>> [1] https://github.com/celery/kombu
>>> [2] https://github.com/celery/billiard
>>>
>>>
>>>
>>>
>>> On Thu, May 20, 2021 at 7:57 AM Brian Hulette 
>>> wrote:
>>>
>>>> That's an interesting idea. What do you mean by its own project? A
>>>> couple of possibilities:
>>>> - Spinning off a new ASF project
>>>> - A separate Beam-governed repository (e.g. apache/beam-filesystems)
>>>> - More clearly separate it in the current build system and release
>>>> artifacts that allow it to be used independently
>>>>
>>>> Personally I'd be resistant to the first two (I am a Google engineer
>>>> and I like monorepos after all), but I don't see a major problem with the
>>>> last one, except that it gives us another surface to maintain.
>>>>
>>>> Brian
>>>>
>>>> On Wed, May 19, 2021 at 8:38 PM Chad Dombrova
>>>> wrote:
>>>>
>>>>> This is a random idea, but the whole file IO system inside Beam would
>>>>> actually be awesome to extract into its own project.  IIRC, it’s not
>>>>> particularly tied to Beam.
>>>>>
>>>>> I’m not saying this should be done now, but it’d be nice to keep it
>>>>> in mind as a future goal.
>>>>>
>>>>> -chad
>>>>>
>>>>> On Wed, May 19, 2021 at 10:23 AM Pablo Estrada
>>>>> wrote:
>>>>>
>>>>>> That would be great to add, Matt. Of course it's important to make
>>>>>> this backwards compatible, but other than that, the addition would be
>>>>>> very welcome.
>>>>>>
>>>>>> On Wed, May 19, 2021 at 9:41 AM Matt Rudary
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> This is a quick sketch of a proposal – I wanted to get a sense of
>>>>>>> whether there’s general support for this idea before fleshing it out
>>>>>>> further, getting internal approvals, etc.
>>>>>>>
>>>>>>> I’m working with multiple storage systems that speak the S3 API. I
>>>>>>> would like to support FileIO operations for these storage systems, but
>>>>>>> S3FileSystem hardcodes the s3 scheme (the various systems use different
>>>>>>> URI schemes) and it is in any case impossible to instantiate more than
>>>>>>> one in the current design.
>>>>>>>
>>>>>>> I’d like to refactor the code in org.apache.beam.sdk.io.aws.s3 (and
>>>>>>> maybe …aws.options) somewhat to enable this use-case. I haven’t worked
>>>>>>> out the details yet, but it will take some thought to make this work in
>>>>>>> a non-hacky way.
>>>>>>>
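
To illustrate the limitation described in the original proposal (one
hardcoded scheme, one instance), here is a rough Python sketch of a registry
that keeps a separate configuration per URI scheme. It is only an assumption
about the shape such a design could take: the class names, the example schemes
"storea" and "storeb", and the endpoints are all hypothetical and do not
reflect Beam's actual S3FileSystem or its options.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class S3CompatibleConfig:
    """Hypothetical per-scheme settings; not Beam's actual options."""
    scheme: str
    endpoint: str


class FileSystemRegistry:
    """Maps each URI scheme to its own configuration instead of hardcoding one."""

    def __init__(self) -> None:
        self._by_scheme: Dict[str, S3CompatibleConfig] = {}

    def register(self, config: S3CompatibleConfig) -> None:
        # Refuse duplicates so each scheme resolves to exactly one backend.
        if config.scheme in self._by_scheme:
            raise ValueError(f"scheme already registered: {config.scheme}")
        self._by_scheme[config.scheme] = config

    def resolve(self, path: str) -> S3CompatibleConfig:
        # Look up the configuration by the scheme prefix of the path.
        scheme, sep, _ = path.partition("://")
        if not sep or scheme not in self._by_scheme:
            raise KeyError(f"no filesystem registered for: {path}")
        return self._by_scheme[scheme]


registry = FileSystemRegistry()
registry.register(S3CompatibleConfig("storea", "https://a.example.com"))
registry.register(S3CompatibleConfig("storeb", "https://b.example.com"))
```

With both schemes registered, registry.resolve("storeb://bucket/key") returns
the second configuration, so two S3-compatible backends can coexist.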

Re: [ANNOUNCE] New PMC Member: Chamikara Jayalath

2021-01-21 Thread Charles Chen

Congrats Cham!

On Thu, Jan 21, 2021, 5:39 PM Chamikara Jayalath 
wrote:

> Thanks everybody :)
>
> - Cham
>
> On Thu, Jan 21, 2021 at 5:22 PM Pablo Estrada  wrote:
>
>> Yoohoo Cham : )
>>
>> On Thu, Jan 21, 2021 at 5:20 PM Udi Meiri  wrote:
>>
>>> Congrats Cham!
>>>
>>> On Thu, Jan 21, 2021 at 4:25 PM Griselda Cuevas  wrote:
>>>
 Congratulations Cham!!! Well deserved :)

 On Thu, 21 Jan 2021 at 15:23, Connell O'Callaghan 
 wrote:

> Well done Cham!!! Thank you for all your contributions to date!!!
>
>
> On Thu, Jan 21, 2021 at 3:18 PM Rui Wang  wrote:
>
>> Congratulations, Cham!
>>
>> -Rui
>>
>> On Thu, Jan 21, 2021 at 3:15 PM Robert Bradshaw 
>> wrote:
>>
>>> Congratulations, Cham!
>>>
>>> On Thu, Jan 21, 2021 at 3:13 PM Brian Hulette 
>>> wrote:
>>>
 Great news, congratulations Cham!

 On Thu, Jan 21, 2021 at 3:08 PM Robin Qiu 
 wrote:

> Congratulations, Cham!
>
> On Thu, Jan 21, 2021 at 3:05 PM Tyson Hamilton 
> wrote:
>
>> Woo! Congrats Cham!
>>
>> On Thu, Jan 21, 2021 at 3:02 PM Robert Burke 
>> wrote:
>>
>>> Congratulations! That's fantastic news.
>>>
>>> On Thu, Jan 21, 2021, 2:59 PM Reza Rokni  wrote:
>>>
 Congratulations!

 On Fri, Jan 22, 2021 at 6:58 AM Ankur Goenka 
 wrote:

> Congrats Cham!
>
> On Thu, Jan 21, 2021 at 2:57 PM Ahmet Altay 
> wrote:
>
>> Hi all,
>>
>> Please join me and the rest of Beam PMC in welcoming
>> Chamikara Jayalath as our
>> newest PMC member.
>>
>> Cham has been part of the Beam community from its early days
>> and contributed to the project in significant ways, including
>> contributing new features and improvements, especially related to
>> Beam IOs, advocating for users, and mentoring new community members.
>>
>> Congratulations Cham! And thanks for being a part of Beam!
>>
>> Ahmet
>>
>

Re: [ANNOUNCE] New committer: Valentyn Tymofieiev

2019-08-26 Thread Charles Chen

Thank you and congratulations Valentyn!  Much appreciated and deserved!

On Mon, Aug 26, 2019 at 2:33 PM Reza Rokni  wrote:

> Thanks Valentyn!
>
> On Tue, 27 Aug 2019, 05:32 Pablo Estrada,  wrote:
>
>> Thanks Valentyn!
>>
>> On Mon, Aug 26, 2019 at 2:29 PM Robin Qiu  wrote:
>>
>>> Thank you Valentyn! Congratulations!
>>>
>>> On Mon, Aug 26, 2019 at 2:28 PM Robert Bradshaw 
>>> wrote:
>>>
 Hi,

 Please join me and the rest of the Beam PMC in welcoming a new
 committer: Valentyn Tymofieiev

 Valentyn has made numerous contributions to Beam over the last several
 years (including 100+ pull requests), most recently pushing through
 the effort to make Beam compatible with Python 3. He is also an active
 participant in design discussions on the list, participates in release
 candidate validation, and proactively helps keep our tests green.

 In consideration of Valentyn's contributions, the Beam PMC trusts him
 with the responsibilities of a Beam committer [1].

 Thank you, Valentyn, for your contributions and looking forward to many
 more!

 Robert, on behalf of the Apache Beam PMC

 [1]
 https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer

>>>

Re: [ANNOUNCE] Beam 2.15.0 Released!

2019-08-23 Thread Charles Chen

Thank you Yifan!

On Fri, Aug 23, 2019 at 11:12 AM Hannah Jiang 
wrote:

> Thank you Yifan!
>
> On Fri, Aug 23, 2019 at 11:09 AM Yichi Zhang  wrote:
>
>> Thank you Yifan!
>>
>> On Fri, Aug 23, 2019 at 11:06 AM Robin Qiu  wrote:
>>
>>> Thank you Yifan!
>>>
>>> On Fri, Aug 23, 2019 at 11:05 AM Rui Wang  wrote:
>>>
 Thank you Yifan!

 -Rui

 On Fri, Aug 23, 2019 at 9:21 AM Pablo Estrada 
 wrote:

> Thanks Yifan!
>
> On Fri, Aug 23, 2019 at 8:54 AM Connell O'Callaghan <
> conne...@google.com> wrote:
>
>>
>> +1 thank you Yifan!!!
>>
>> On Fri, Aug 23, 2019 at 8:49 AM Ahmet Altay  wrote:
>>
>>> Thank you Yifan!
>>>
>>> On Fri, Aug 23, 2019 at 8:00 AM Yifan Zou 
>>> wrote:
>>>
 The Apache Beam team is pleased to announce the release of version
 2.15.0.

 Apache Beam is an open source unified programming model to define and
 execute data processing pipelines, including ETL, batch and stream
 (continuous) processing. See https://beam.apache.org

 You can download the release here:

 https://beam.apache.org/get-started/downloads/

 This release includes bug fixes, features, and improvements detailed on
 the Beam blog:
 https://beam.apache.org/blog/2019/08/22/beam-2.15.0.html

 Thanks to everyone who contributed to this release, and we hope you enjoy
 using Beam 2.15.0.

 Yifan Zou

>>>

Re: [ANNOUNCE] New PMC Member: Pablo Estrada

2019-05-15 Thread Charles Chen

Congrats Pablo and thank you for your contributions!

On Wed, May 15, 2019, 10:53 AM Valentyn Tymofieiev 
wrote:

> Congrats, Pablo!
>
> On Wed, May 15, 2019 at 10:41 AM Yifan Zou  wrote:
>
>> Congratulations, Pablo!
>>
>> *From: *Maximilian Michels 
>> *Date: *Wed, May 15, 2019 at 2:06 AM
>> *To: * 
>>
>> Congrats Pablo! Thank you for your help to grow the Beam community!
>>>
>>> On 15.05.19 10:33, Tim Robertson wrote:
>>> > Congratulations Pablo
>>> >
>>> > On Wed, May 15, 2019 at 10:22 AM Ismaël Mejía wrote:
>>> >
>>> > Congrats Pablo, well deserved, nice to see your work recognized!
>>> >
>>> > On Wed, May 15, 2019 at 9:59 AM Pei HE wrote:
>>> >  >
>>> >  > Congrats, Pablo!
>>> >  >
>>> >  > On Tue, May 14, 2019 at 11:41 PM Tanay Tummalapalli wrote:
>>> >  > >
>>> >  > > Congratulations Pablo!
>>> >  > >
>>> >  > > On Wed, May 15, 2019, 12:08 Michael Luckey <
>>> adude3...@gmail.com
>>> > > wrote:
>>> >  > >>
>>> >  > >> Congrats, Pablo!
>>> >  > >>
>>> >  > >> On Wed, May 15, 2019 at 8:21 AM Connell O'Callaghan
>>> > mailto:conne...@google.com>> wrote:
>>> >  > >>>
>>> >  > >>> Awesome well done Pablo!!!
>>> >  > >>>
>>> >  > >>> Kenn thank you for sharing this great news with us!!!
>>> >  > >>>
>>> >  > >>> On Tue, May 14, 2019 at 11:01 PM Ahmet Altay
>>> > mailto:al...@google.com>> wrote:
>>> >  > 
>>> >  >  Congratulations!
>>> >  > 
>>> >  >  On Tue, May 14, 2019 at 9:11 PM Robert Burke
>>> > mailto:rob...@frantil.com>> wrote:
>>> >  > >
>>> >  > > Woohoo! Well deserved.
>>> >  > >
>>> >  > > On Tue, May 14, 2019, 8:34 PM Reuven Lax <
>>> re...@google.com
>>> > > wrote:
>>> >  > >>
>>> >  > >> Congratulations!
>>> >  > >>
>>> >  > >> From: Mikhail Gryzykhin >> > >
>>> >  > >> Date: Tue, May 14, 2019 at 8:32 PM
>>> >  > >> To: mailto:dev@beam.apache.org>>
>>> >  > >>
>>> >  > >>> Congratulations Pablo!
>>> >  > >>>
>>> >  > >>> On Tue, May 14, 2019, 20:25 Kenneth Knowles
>>> > mailto:k...@apache.org>> wrote:
>>> >  > 
>>> >  >  Hi all,
>>> >  > 
>>> >  >  Please join me and the rest of the Beam PMC in welcoming
>>> >  >  Pablo Estrada to join the PMC.
>>> >  >
>>> >  >  Pablo first picked up BEAM-722 in October of 2016 and has been
>>> >  >  a steady part of the Beam community since then. In addition to
>>> >  >  technical work on Beam Python & Java & runners, I would highlight
>>> >  >  how Pablo grows Beam's community by helping users, working on
>>> >  >  GSoC, giving talks at Beam Summits and other OSS conferences
>>> >  >  including Flink Forward, and holding training workshops. I cannot
>>> >  >  do justice to Pablo's contributions in a single paragraph.
>>> >  > 
>>> >  >  Thanks Pablo, for being a part of Beam.
>>> >  > 
>>> >  >  Kenn
>>> >
>>>
>>

Re: Beam's Conda package

2019-05-10 Thread Charles Chen

Looks like this is where it's living:
https://github.com/conda-forge/apache-beam-feedstock/tree/c96274713fcc5970c967c20e84859e73d0efa0d0

*From: *Lukasz Cwik 
*Date: *Fri, May 10, 2019 at 1:02 PM
*To: *dev

I'm not aware of who set up conda as well. There seem to have been ~4500
> downloads of the package so that is a good amount of users.
>
> On Fri, May 10, 2019 at 11:45 AM Ahmet Altay  wrote:
>
>> Hi all,
>>
>> There is a conda package for apache-beam [1]. As far as I know, we do not
>> release this package. Does anyone know who owns this? It was last updated
>> to use 2.9.0; at the least, it would be good to add a newer version there.
>>
>> We also don't test in that environment so I am not sure how well it works
>> or who uses it.
>>
>> Thank you,
>> Ahmet
>>
>> [1] https://anaconda.org/conda-forge/apache-beam
>>
>

Re: [ANNOUNCE] New committer announcement: Udi Meiri

2019-05-03 Thread Charles Chen

Thank you Udi!

On Fri, May 3, 2019, 1:51 PM Aizhamal Nurmamat kyzy 
wrote:

> Congratulations, Udi! Thank you for all your contributions!!!
>
> *From: *Pablo Estrada 
> *Date: *Fri, May 3, 2019 at 1:45 PM
> *To: *dev
>
> Thanks Udi and congrats!
>>
>> On Fri, May 3, 2019 at 1:44 PM Kenneth Knowles  wrote:
>>
>>> Hi all,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming a new committer:
>>> Udi Meiri.
>>>
>>> Udi has been contributing to Beam since late 2017, starting with HDFS
>>> support in the Python SDK and continuing with a ton of Python work. I also
>>> will highlight his work on community-building infrastructure, including
>>> documentation, experiments with ways to find reviewers for pull requests,
>>> gradle build work, analyzing and reducing build times.
>>>
>>> In consideration of Udi's contributions, the Beam PMC trusts Udi with
>>> the responsibilities of a Beam committer [1].
>>>
>>> Thank you, Udi, for your contributions.
>>>
>>> Kenn
>>>
>>> [1]
>>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>>
>>

Re: [VOTE] Release 2.11.0, release candidate #2

2019-02-26 Thread Charles Chen

Thank you, +1.  I tested Python 3 support in batch and streaming mode
(using wordcount and streaming wordcount) on both DirectRunner and
DataflowRunner.

On Tue, Feb 26, 2019 at 7:54 AM Konstantinos Katsiapis 
wrote:

> +1.
> (Same rationale as my earlier post for RC1).
>
> On Tue, Feb 26, 2019 at 2:19 AM Maximilian Michels  wrote:
>
>> +1 (binding)
>>
>> * Verified checksums
>> * Ran quickstart WordCount tests local/cluster with the Flink Runner
>>
>> -Max
>>
>> On 26.02.19 10:40, Ahmet Altay wrote:
>> > Hi everyone,
>> >
>> > Please review and vote on the release candidate #2 for the version
>> > 2.11.0, as follows:
>> >
>> > [ ] +1, Approve the release
>> > [ ] -1, Do not approve the release (please provide specific comments)
>> >
>> > The complete staging area is available for your review, which includes:
>> > * JIRA release notes [1],
>> > * the official Apache source release to be deployed to dist.apache.org
>> >  [2], which is signed with the key with
>> > fingerprint 64B84A5AD91F9C20F5E9D9A7D62E71416096FA00 [3],
>> > * all artifacts to be deployed to the Maven Central Repository [4],
>> > * source code tag "v2.11.0-RC2" [5],
>> > * website pull request listing the release [6] and publishing the API
>> > reference manual [7].
>> > * Python artifacts are deployed along with the source release to
>> > dist.apache.org [2].
>> > * Validation sheet with a tab for 2.11.0 release to help with
>> > validation [8].
>> >
>> > The vote will be open for at least 72 hours. It is adopted by majority
>> > approval, with at least 3 PMC affirmative votes.
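
The adoption rule stated above can be sketched as a small check. This is an
illustration only: it assumes that "majority approval" means more binding +1
votes than binding -1 votes and that only binding (PMC) votes are counted.
The function name and the vote encoding are made up for the example.

```python
from typing import Iterable, Tuple


def vote_passes(votes: Iterable[Tuple[int, bool]],
                min_binding_plus_ones: int = 3) -> bool:
    """Each vote is (value, is_binding), where value is +1 or -1."""
    # Assumption: non-binding votes are informative but not counted.
    binding = [value for value, is_binding in votes if is_binding]
    plus = sum(1 for value in binding if value == 1)
    minus = sum(1 for value in binding if value == -1)
    # Needs the minimum number of binding +1s and more +1s than -1s.
    return plus >= min_binding_plus_ones and plus > minus
```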
>> >
>> > Thanks,
>> > Ahmet
>> >
>> > [1]
>> >
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12344775
>> > [2] https://dist.apache.org/repos/dist/dev/beam/2.11.0/
>> > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>> > [4]
>> https://repository.apache.org/content/repositories/orgapachebeam-1064/
>> > [5] https://github.com/apache/beam/tree/v2.11.0-RC2
>> > [6] https://github.com/apache/beam/pull/7924
>> > [7] https://github.com/apache/beam-site/pull/587
>> > [8]
>> >
>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=542393513
>> >
>>
>
>
> --
> Gus Katsiapis | Software Engineer | katsia...@google.com | 650-918-7487
>

Re: [VOTE] Release 2.11.0, release candidate #1

2019-02-25 Thread Charles Chen

+1.  I tested Python 3 support in batch and streaming mode (using wordcount
and streaming wordcount) on both DirectRunner and DataflowRunner.

On Mon, Feb 25, 2019 at 7:54 AM Łukasz Gajowy  wrote:

> Hi,
>
> https://issues.apache.org/jira/browse/BEAM-6697 Is this issue a release
> blocker? I'm asking because performance tests are not part of the release
> verification checklist. (Should they be?)
>
> The issue seems to be related to `google_cloud_bigdataoss_version` change
> (1.9.0 -> 1.9.12)
>
> Łukasz
>
> pon., 25 lut 2019 o 11:08 Maximilian Michels  napisał(a):
>
>> +1 (binding)
>>
>> On 25.02.19 03:44, Konstantinos Katsiapis wrote:
>> > +1
>> >
>> > We (TFX) are really looking forward to the Python 3 compatibility
>> > that Apache Beam 2.11 brings. The 2.11 release will allow several of
>> > our existing Apache Beam based libraries like TensorFlow Data
>> > Validation, TensorFlow Transform and TensorFlow Model Analysis to be
>> > Python 3 Compatible (since they are already Python 3 "Ready" and as
>> > such blocked on this release).
>> >
>> > Thanks,
>> > Gus
>> >
>> > On Fri, Feb 22, 2019 at 7:08 PM Reuven Lax > > > wrote:
>> >
>> > +1 (binding)
>> >
>> > On Fri, Feb 22, 2019 at 5:09 PM Robert Bradshaw <
>> rober...@google.com
>> > > wrote:
>> >
>> > +1 (binding)
>> >
>> > I verified the artifacts for correctness, as well as one of the
>> > wheels
>> > on simple pipelines (Python 3).
>> >
>> >
>> > On Sat, Feb 23, 2019 at 1:01 AM Kenneth Knowles <
>> k...@apache.org
>> > > wrote:
>> >  >
>> >  > +1 (binding)
>> >  >
>> >  > Kenn
>> >  >
>> >  > On Fri, Feb 22, 2019 at 3:51 PM Ahmet Altay <
>> al...@google.com
>> > > wrote:
>> >  >>
>> >  >>
>> >  >>
>> >  >> On Fri, Feb 22, 2019 at 3:46 PM Kenneth Knowles
>> > mailto:k...@apache.org>> wrote:
>> >  >>>
>> >  >>> I believe you need to sign & hash the Python wheels. The
>> >  >>> instructions are unfortunately a bit hidden in the release guide
>> >  >>> without an entry in the table of contents:
>> >  >>
>> >  >>
>> >  >> Done, thank you for the pointer.
>> >  >>
>> >  >>>
>> >  >>>
>> >  >>> "Once all python wheels have been staged dist.apache.org, please
>> >  >>> run ./sign_hash_python_wheels.sh to sign and hash python wheels."
>> >  >>>
>> >  >>> On Fri, Feb 22, 2019 at 8:40 AM Ahmet Altay
>> > mailto:al...@google.com>> wrote:
>> >  
>> >  
>> >  
>> >   On Fri, Feb 22, 2019 at 1:32 AM Robert Bradshaw
>> > mailto:rober...@google.com>> wrote:
>> >  >
>> >  > It looks like
>> > https://github.com/apache/beam/blob/release-2.11.0/build.gradle
>> >  > differs from the copy in the release source tarball (line
>> > 22, and some
>> >  > whitespace below). Other than that, the artifacts and
>> > signatures look
>> >  > good.
>> >  
>> >  
>> >   Thank you. I fixed the issue (please take a look again).
>> > The difference was due to
>> > https://issues.apache.org/jira/browse/BEAM-6726.
>> >  
>> >  >
>> >  >
>> >  > On Fri, Feb 22, 2019 at 9:50 AM Ahmet Altay
>> > mailto:al...@google.com>> wrote:
>> >  > >
>> >  > > Hi everyone,
>> >  > >
>> >  > > Please review and vote on the release candidate #1 for
>> > the version 2.11.0, as follows:
>> >  > >
>> >  > > [ ] +1, Approve the release
>> >  > > [ ] -1, Do not approve the release (please provide
>> > specific comments)
>> >  > >
>> >  > > The complete staging area is available for your review,
>> > which includes:
>> >  > > * JIRA release notes [1],
>> >  > > * the official Apache source release to be deployed to
>> > dist.apache.org  [2], which is signed
>> > with the key with fingerprint
>> > 64B84A5AD91F9C20F5E9D9A7D62E71416096FA00 [3],
>> >  > > * all artifacts to be deployed to the Maven Central
>> > Repository [4],
>> >  > > * source code tag "v2.11.0-RC1" [5],
>> >  > > * website pull request listing the release [6] and
>> > publishing the API reference manual [7].
>> >  > > * Python artifacts are

Re: 2.7.1 (LTS) release?

2019-01-31 Thread Charles Chen

I would be in favor of keeping the old 2.7.0 release branch / tag static so
that referring to it will always get the right 2.7.0 code.

On Thu, Jan 31, 2019 at 10:24 AM Kenneth Knowles  wrote:

> I have waffled on whether to have release-2.7 and only branch
> release-2.7.1 when starting that release. I think that whenever we release
> 2.7.n the branch for 2.7.(n+1) should start from exactly that point, no? Or
> perhaps on release-2.7 branch the hardcoded version strings could be
> 2.7.1-SNAPSHOT/dev and remove the SNAPSHOT/dev when cutting the new release
> branch? I guess I think either one is fine. I think starting the branch now
> is smart, so that you can accumulate cherrypicks of backports.
>
> Kenn
>
> On Thu, Jan 31, 2019 at 7:55 AM Maximilian Michels  wrote:
>
>> 2.10.0 will be done when it's done. Same goes for 2.7.1, which is likely
>> going to be done later since we are focusing on 2.10.0 at the moment.
>>
>> I've created the release-2.7.1 branch because there is no other place for
>> fixes of future versions. It would be helpful to have a minor version
>> branch (e.g. release-2.7) which can be continuously updated.
>>
>> More generally speaking, we should dedicate time for LTS releases. What is
>> the point otherwise of having an LTS version?
>>
>> -Max
>>
>> On 31.01.19 16:28, Thomas Weise wrote:
>> > Since you were originally thinking of 2.9.x as target, 2.10.0 seems
>> > closer both in time and upgrade path.
>> >
>> > I see no reason why a 2.7.1 release would materialize any sooner than
>> > 2.10.0.
>> >
>> > Or is the intention to just stack up fixes in the 2.7.x branch for a
>> > potential future release?
>> >
>> > Thomas
>> >
>> >
>> > On Thu, Jan 31, 2019 at 5:03 AM Maximilian Michels wrote:
>> >
>> > I agree it's better to take some extra time to ensure the quality
>> > of 2.10.0.
>> >
>> > I've created a 2.7.1 branch and cherry-picked the relevant
>> > commits [1]. We could start collecting other fixes in case there
>> > are any.
>> >
>> > -Max
>> >
>> > [1] https://github.com/apache/beam/pull/7687
>> >
>> > On 30.01.19 20:57, Kenneth Knowles wrote:
>> >  > Sounds good to me to target 2.7.1 and 2.10.0. I will have to
>> >  > re-roll RC2 after confirming fixes for the latest blockers that
>> >  > were found. These are not regressions from 2.9.0. But they seem
>> >  > severe enough that they are worth taking an extra day or two,
>> >  > because 2.9.0 had enough problems that I would like to make
>> >  > 2.10.0 a more attractive upgrade target for users still on very
>> >  > old versions.
>> >  >
>> >  > Kenn
>> >  >
>> >  > On Wed, Jan 30, 2019 at 5:22 AM Maximilian Michels wrote:
>> >  >
>> >  > Hi everyone,
>> >  >
>> >  > I know we are in the midst of releasing 2.10.0, but with the
>> >  > release process taking its time I am considering creating a patch
>> >  > release for this issue in the FlinkRunner:
>> >  > https://jira.apache.org/jira/browse/BEAM-5386
>> >  >
>> >  > Initially I thought it would be good to do a 2.9.1 release, but
>> >  > since we have an LTS version, we should probably do a 2.7.1 (LTS)
>> >  > release instead.
>> >  >
>> >  > What do you think? I could only find one Fix Version 2.7.1 issue
>> >  > in JIRA:
>> >  >
>> >
>> https://jira.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20fixVersion%20%3D%202.7.1
>> >  >
>> >  > Best,
>> >  > Max
>> >  >
>> >
>>
>

Re: [PROPOSAL] Prepare Beam 2.9.0 release

2018-11-15 Thread Charles Chen

+1

Note that we need to temporarily revert
https://github.com/apache/beam/pull/6683 before the release branch cut per
the discussion at
https://lists.apache.org/thread.html/78fe33dc41b04886f5355d66d50359265bfa2985580bb70f79c53545@%3Cdev.beam.apache.org%3E

On Thu, Nov 15, 2018 at 9:18 PM Tim  wrote:

> Thanks Cham
> +1
>
> On 16 Nov 2018, at 05:30, Thomas Weise  wrote:
>
> +1
>
>
> On Thu, Nov 15, 2018 at 4:34 PM Ahmet Altay  wrote:
>
>> +1 Thank you.
>>
>> On Thu, Nov 15, 2018 at 4:22 PM, Kenneth Knowles  wrote:
>>
>>> SGTM. Thanks for keeping track of the schedule.
>>>
>>> Kenn
>>>
>>> On Thu, Nov 15, 2018 at 1:59 PM Chamikara Jayalath 
>>> wrote:
>>>
 Hi All,

 According to the release calendar [1] branch cut date for Beam 2.9.0
 release is 11/21/2018. Since previous release branch was cut close to the
 respective calendar date I'd like to propose cutting release branch for
 2.9.0 on 11/21/2018. Next week is Thanksgiving holiday in US and possibly
 some folks will be out so we can try to produce RC1 on Monday after
 (11/26/2018). We can attend to current blocker JIRAs [2] in the meantime.

 I'd like to volunteer to perform this release.

 WDYT ?

 Thanks,
 Cham

 [1]
 https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com=America%2FLos_Angeles
 [2] https://s.apache.org/beam-2.9.0-burndown

>>

Re: [DISCUSS] More precision supported by DATETIME field in Schema

2018-11-06 Thread Charles Chen

Is the proposal to do this for both Beam Schema DATETIME fields as well as
for Beam timestamps in general?  The latter likely has a bunch of
downstream consequences for all runners.

On Tue, Nov 6, 2018 at 12:38 AM Ismaël Mejía  wrote:

> +1 to more precision even to the nano level, probably via Reuven's
> proposal of a different internal representation.
> On Tue, Nov 6, 2018 at 9:19 AM Robert Bradshaw 
> wrote:
> >
> > +1 to offering more granular timestamps in general. I think it will be
> > odd if setting the element timestamp from a row DATETIME field is
> > lossy, so we should seriously consider upgrading that as well.
> > On Tue, Nov 6, 2018 at 6:42 AM Charles Chen  wrote:
> > >
> > > One related issue that came up before is that we (perhaps
> unnecessarily) restrict the precision of timestamps in the Python SDK to
> milliseconds because of legacy reasons related to the Java runner's use of
> Joda time.  Perhaps Beam portability should natively use a more granular
> timestamp unit.
> > >
> > > On Mon, Nov 5, 2018 at 9:34 PM Rui Wang  wrote:
> > >>
> > >> Thanks Reuven!
> > >>
> > >> I think Reuven gives the third option:
> > >>
> > >> Change internal representation of DATETIME field in Row. Still keep
> public ReadableDateTime getDateTime(String fieldName) API to be compatible
> with existing code. And I think we could add one more API to
> getDataTimeNanosecond. This option is different from the option one because
> option one actually maintains two implementation of time.
> > >>
> > >> -Rui
> > >>
> > >> On Mon, Nov 5, 2018 at 9:26 PM Reuven Lax  wrote:
> > >>>
> > >>> I would vote that we change the internal representation of Row to
> something other than Joda. Java 8 times would give us at least
> microseconds, and if we want nanoseconds we could simply store it as a
> number.
> > >>>
> > >>> We should still keep accessor methods that return and take Joda
> objects, as the rest of Beam still depends on Joda.
> > >>>
> > >>> Reuven
> > >>>
> > >>> On Mon, Nov 5, 2018 at 9:21 PM Rui Wang  wrote:
> > >>>>
> > >>>> Hi Community,
> > >>>>
> > >>>> The DATETIME field in Beam Schema/Row is implemented by Joda's
> Datetime (see Row.java#L611 and Row.java#L169). Joda's Datetime is limited
> to the precision of millisecond. It has good enough precision to represent
> timestamp of event time, but it is not enough for the real "time" data. For
> the "time" type data, we probably need to support even up to the precision
> of nanosecond.
> > >>>>
> > >>>> Unfortunately, Joda decided to keep the precision of millisecond:
> https://github.com/JodaOrg/joda-time/issues/139.
> > >>>>
> > >>>> If we want to support the precision of nanosecond, we could have
> two options:
> > >>>>
> > >>>> Option one: utilize current FieldType's metadata field, such that
> we could set something into meta data and Row could check the metadata to
> decide what's saved in DATETIME field: Joda's Datetime or an implementation
> that supports nanosecond.
> > >>>>
> > >>>> Option two: have another field (maybe called TIMESTAMP field?), to
> have an implementation to support higher precision of time.
> > >>>>
> > >>>> What do you think about the need of higher precision for time type
> and which option is preferred?
> > >>>>
> > >>>> -Rui
>

Re: [DISCUSS] More precision supported by DATETIME field in Schema

2018-11-05 Thread Charles Chen

One related issue that came up before is that we (perhaps unnecessarily)
restrict the precision of timestamps in the Python SDK to milliseconds
because of legacy reasons related to the Java runner's use of Joda time.
Perhaps Beam portability should natively use a more granular timestamp unit.

On Mon, Nov 5, 2018 at 9:34 PM Rui Wang  wrote:

> Thanks Reuven!
>
> I think Reuven gives the third option:
>
> Change internal representation of DATETIME field in Row. Still keep public
> ReadableDateTime getDateTime(String fieldName) API to be compatible with
> existing code. And I think we could add one more API to
> getDataTimeNanosecond. This option is different from option one because
> option one actually maintains two implementations of time.
>
> -Rui
>
> On Mon, Nov 5, 2018 at 9:26 PM Reuven Lax  wrote:
>
>> I would vote that we change the internal representation of Row to
>> something other than Joda. Java 8 times would give us at least
>> microseconds, and if we want nanoseconds we could simply store it as a
>> number.
>>
>> We should still keep accessor methods that return and take Joda objects,
>> as the rest of Beam still depends on Joda.
>>
>> Reuven
>>
>> On Mon, Nov 5, 2018 at 9:21 PM Rui Wang  wrote:
>>
>>> Hi Community,
>>>
>>> The DATETIME field in Beam Schema/Row is implemented by Joda's Datetime
>>> (see Row.java#L611 and Row.java#L169). Joda's Datetime is limited to
>>> the precision of millisecond. It has good enough precision to represent
>>> timestamp of event time, but it is not enough for the real "time" data.
>>> For the "time" type data, we probably need to support even up to the
>>> precision of nanosecond.
>>>
>>> Unfortunately, Joda decided to keep the precision of millisecond:
>>> https://github.com/JodaOrg/joda-time/issues/139.
>>>
>>> If we want to support the precision of nanosecond, we could have two
>>> options:
>>>
>>> Option one: utilize current FieldType's metadata field, such that we
>>> could set something into metadata and Row could check the metadata to
>>> decide what's saved in DATETIME field: Joda's Datetime or an
>>> implementation that supports nanosecond.
>>>
>>> Option two: have another field (maybe called TIMESTAMP field?), to have
>>> an implementation to support higher precision of time.
>>>
>>> What do you think about the need of higher precision for time type and
>>> which option is preferred?
>>>
>>> -Rui
>>>
>>
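
The "third option" discussed in this thread (change the internal
representation, keep the existing millisecond accessor, add a finer-grained
one) can be sketched in Python as follows. This is an assumption-laden
illustration rather than Beam's Row implementation: the class and method
names are hypothetical, and storing nanoseconds since the epoch as an integer
is just one reading of "simply store it as a number".

```python
from dataclasses import dataclass

NANOS_PER_MILLI = 1_000_000


@dataclass(frozen=True)
class DateTimeValue:
    """Hypothetical DATETIME value stored as integer nanoseconds since epoch."""
    epoch_nanos: int

    @classmethod
    def from_millis(cls, epoch_millis: int) -> "DateTimeValue":
        # Existing millisecond-based callers can still construct values.
        return cls(epoch_millis * NANOS_PER_MILLI)

    def get_millis(self) -> int:
        # Compatibility accessor: truncates anything finer than a millisecond.
        return self.epoch_nanos // NANOS_PER_MILLI

    def get_nanos(self) -> int:
        # New accessor: exposes the full stored precision.
        return self.epoch_nanos


value = DateTimeValue(1_541_400_000_123_456_789)
```

Here value.get_millis() drops the trailing 456789 nanoseconds while
value.get_nanos() keeps them, which is the lossiness the thread is concerned
about when only a millisecond representation is available.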

Re: New Edit button on beam.apache.org pages

2018-10-24 Thread Charles Chen

This is great!  Thanks!

On Wed, Oct 24, 2018 at 2:26 PM Ahmet Altay  wrote:

> Really cool! Thank you!
>
> On Wed, Oct 24, 2018 at 2:24 PM, Alan Myrvold  wrote:
>
>> To make small documentation changes easier, there is now an Edit button
>> at the top right of the pages on https://beam.apache.org. This button
>> opens the source .md file on the master branch of the beam repository in
>> the github web editor. After making changes you can create a pull request
>> to ask to have it merged.
>>
>> Thanks to Scott for the suggestion to add this in [BEAM-4431].
>>
>> Let me know if you run into any issues.
>>
>> Alan
>>
>>
>>
>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

2018-10-12 Thread Charles Chen

The current release branch (
https://github.com/apache/beam/commits/release-2.8.0) was cut after the
revert went in.  Sent out https://github.com/apache/beam/pull/6683 as a
revert of the revert.  Regarding your comment above, I can help out with
the design / PR reviews for common Python code as you suggest.

On Fri, Oct 12, 2018 at 4:48 PM Thomas Weise  wrote:

> Thanks, will tag you and looking forward to feedback so we can ensure that
> changes work for everyone.
>
> Looking at the PR, I see agreement from Max to revert the change on the
> release branch, but not in master. Would you mind restoring it in master?
>
> Thanks
>
> On Fri, Oct 12, 2018 at 4:40 PM Ahmet Altay  wrote:
>
>>
>>
>> On Fri, Oct 12, 2018 at 11:31 AM, Charles Chen  wrote:
>>
>>> What I mean is that a user may find that it works for them to pass
>>> "--myarg blah" and access it as "options.myarg" without explicitly defining
>>> a "my_arg" flag due to the added logic.  This is not the intended behavior
>>> and we may want to change this implementation detail in the future.
>>> However, having this logic in a released version makes it hard to change
>>> this behavior since users may erroneously depend on this undocumented
>>> behavior.  Instead, we should namespace / scope this so that it is obvious
>>> that this is meant for runner (and not Beam user) consumption.
>>>
>>> On Fri, Oct 12, 2018 at 10:48 AM Thomas Weise  wrote:
>>>
>>>> Can you please elaborate on what practical problems this introduces
>>>> for users?
>>>>
>>>> I can see that this change allows a user to specify a runner specific
>>>> option, which in the future may change because we decide to scope
>>>> differently. If this only affects users of the portable Flink runner (like
>>>> us), then no need to revert, because at this early stage we prefer
>>>> something that works over being blocked.
>>>>
>>>> It would also be really great if some of the core Python SDK developers
>>>> could help out with the design aspects and PR reviews of changes that
>>>> affect common Python code. Anyone who specifically wants to be tagged on
>>>> relevant JIRAs and PRs?
>>>>
>>>
>> I would be happy to be tagged, and I can also help with including other
>> relevant folks whenever possible. In general I think Robert, Charles,
>> myself are good candidates.
>>
>>
>>
>>>
>>>> Thanks
>>>>
>>>>
>>>> On Fri, Oct 12, 2018 at 10:20 AM Ahmet Altay  wrote:
>>>>
>>>>>
>>>>>
>>>>> On Fri, Oct 12, 2018 at 10:11 AM, Charles Chen  wrote:
>>>>>
>>>>>> For context, I made comments on
>>>>>> https://github.com/apache/beam/pull/6600 noting that the changes
>>>>>> being made were not good for Beam backwards-compatibility.  The change
>>>>>> as is allows users to use pipeline options without explicitly defining
>>>>>> them, which is not the type of usage we would like to encourage since
>>>>>> we prefer to be explicit whenever possible.  If users write pipelines
>>>>>> with this sort of pattern, they will potentially encounter pain when
>>>>>> upgrading to a later version since this is an implementation detail
>>>>>> and not an officially supported pattern.  I agree with the comments
>>>>>> above that this is ultimately a scoping issue.  I would not have a
>>>>>> problem with these changes if they were explicitly scoped under either
>>>>>> a runner or unparsed options namespace.
>>>>>>
>>>>>> As a second note, since the 2.8.0 release is being cut right now,
>>>>>> because of these backwards-compatibility concerns, I would suggest
>>>>>> reverting these changes, at least until 2.8.0 is cut, so we can have a
>>>>>> discussion here before committing to and releasing any API-level changes.
>>>>>>
>>>>>
>>>>> +1 I would like to revert the changes in order not to rush this into
>>>>> the release. Once this discussion results in an agreement, the changes
>>>>> can be brought back.
>>>>>
>>>>>
>>>>>>
>>>>>> On Fri, Oct 12, 2018 at 9:26 AM Henning Rohde 
>>>>>> wrote:

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

2018-10-12 Thread Charles Chen

What I mean is that a user may find that it works for them to pass "--myarg
blah" and access it as "options.myarg" without explicitly defining a
"my_arg" flag due to the added logic.  This is not the intended behavior
and we may want to change this implementation detail in the future.
However, having this logic in a released version makes it hard to change
this behavior since users may erroneously depend on this undocumented
behavior.  Instead, we should namespace / scope this so that it is obvious
that this is meant for runner (and not Beam user) consumption.
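
As a rough illustration of the concern above, here is a minimal sketch using
only the standard library's argparse. It is not the Beam implementation; the
function names and the lenient parsing logic are assumptions made up for this
example.

```python
import argparse


def parse_explicit(argv):
    """Only flags declared up front become attributes on the result."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--my_arg", default=None)
    known, unknown = parser.parse_known_args(argv)
    return known, unknown


def parse_implicit(argv):
    """Hypothetical lenient parser: every unknown '--name value' pair is
    silently promoted to an attribute, so users can come to depend on flags
    that were never declared."""
    parser = argparse.ArgumentParser()
    known, unknown = parser.parse_known_args(argv)
    tokens = iter(unknown)
    for token in tokens:
        if token.startswith("--"):
            setattr(known, token[2:], next(tokens, None))
    return known


explicit, leftover = parse_explicit(["--myarg", "blah"])
implicit = parse_implicit(["--myarg", "blah"])
```

With the explicit parser, "--myarg" stays in the leftover list and there is no
"myarg" attribute; with the lenient one, "implicit.myarg" works by accident,
which is the undocumented behavior users could start relying on.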

On Fri, Oct 12, 2018 at 10:48 AM Thomas Weise  wrote:

> Can you please elaborate on what practical problems this introduces for
> users?
>
> I can see that this change allows a user to specify a runner specific
> option, which in the future may change because we decide to scope
> differently. If this only affects users of the portable Flink runner (like
> us), then no need to revert, because at this early stage we prefer
> something that works over being blocked.
>
> It would also be really great if some of the core Python SDK developers
> could help out with the design aspects and PR reviews of changes that
> affect common Python code. Anyone who specifically wants to be tagged on
> relevant JIRAs and PRs?
>
> Thanks
>
>
> On Fri, Oct 12, 2018 at 10:20 AM Ahmet Altay  wrote:
>
>>
>>
>> On Fri, Oct 12, 2018 at 10:11 AM, Charles Chen  wrote:
>>
>>> For context, I made comments on https://github.com/apache/beam/pull/6600
>>> noting that the changes being made were not good for Beam
>>> backwards-compatibility.  The change as is allows users to use pipeline
>>> options without explicitly defining them, which is not the type of usage we
>>> would like to encourage since we prefer to be explicit whenever possible.
>>> If users write pipelines with this sort of pattern, they will potentially
>>> encounter pain when upgrading to a later version since this is an
>>> implementation detail and not an officially supported pattern.  I agree
>>> with the comments above that this is ultimately a scoping issue.  I would
>>> not have a problem with these changes if they were explicitly scoped under
>>> either a runner or unparsed options namespace.
>>>
>>> As a second note, since the 2.8.0 release is being cut right now,
>>> because of these backwards-compatibility concerns, I would suggest
>>> reverting these changes, at least until 2.8.0 is cut, so we can have a
>>> discussion here before committing to and releasing any API-level changes.
>>>
>>
>> +1 I would like to revert the changes in order not to rush this into the
>> release. Once this discussion results in an agreement, the changes can be
>> brought back.
>>
>>
>>>
>>> On Fri, Oct 12, 2018 at 9:26 AM Henning Rohde 
>>> wrote:
>>>
>>>> Agree that pipeline options lack some mechanism for scoping. It is also
>>>> not always possible to distinguish options meant to be consumed at
>>>> pipeline construction time, by the runner, by the SDK harness, by the
>>>> user code or any combination -- and this causes confusion every now and
>>>> then.
>>>>
>>>> For Dataflow, we have been using "experiments" for arbitrary
>>>> runner-specific options. It's simply a string list pipeline option that all
>>>> SDKs support and, for Go at least, is sent to portable runners. Flink can
>>>> do the same in the short term to move forward.
>>>>
>>>> Henning
>>>>
>>>>
>>>> On Fri, Oct 12, 2018 at 8:50 AM Thomas Weise  wrote:
>>>>
>>>>> [moving to the list]
>>>>>
>>>>> The requirement driving this part of the change was to allow a user to
>>>>> specify pipeline options that a runner supports without having to declare
>>>>> those in each language SDK.
>>>>>
>>>>> In the specific scenario, we have options that the Flink runner
>>>>> supports (and can validate), that are not enumerated in the Python SDK.
>>>>>
>>>>> I think we have a bigger problem scoping pipeline options. For
>>>>> example, the runner options are dumped into the SDK worker. There is
>>>>> also a possibility of name collisions. So I think this would benefit
>>>>> from broader feedback.
>>>>>
>>>>> Thanks,
>>>>> Thomas
>>>>>
>>>>>
>>>>> -- Forwarded message -
>>>>> From: Charles Chen 
>>>>> Date: Fri, Oct 12, 2018 at 8:36 AM
>>>>> Subject: Re: [apache/beam] [BEAM-5442] Store duplicate unknown options
>>>>> in a list argument (#6600)
>>>>> To: apache/beam 
>>>>> Cc: Thomas Weise , Mention <
>>>>> ment...@noreply.github.com>
>>>>>
>>>>>
>>>>> CC: @tweise <https://github.com/tweise>
>>>>>
>>>>>
>>>>
>>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

2018-10-12 Thread Charles Chen

For context, I made comments on https://github.com/apache/beam/pull/6600
noting that the changes being made were not good for Beam
backwards-compatibility.  The change as is allows users to use pipeline
options without explicitly defining them, which is not the type of usage we
would like to encourage since we prefer to be explicit whenever possible.
If users write pipelines with this sort of pattern, they will potentially
encounter pain when upgrading to a later version since this is an
implementation detail and not an officially supported pattern.  I agree
with the comments above that this is ultimately a scoping issue.  I would
not have a problem with these changes if they were explicitly scoped under
either a runner or unparsed options namespace.
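
One possible shape for such scoping, sketched with the standard library's
argparse rather than Beam's own option classes: declared options are parsed
normally, a repeated string-list option (modeled on the "experiments" idea
mentioned in the quoted reply below) carries runner-specific values, and
anything else is kept in a clearly named list instead of becoming a top-level
attribute. The names "unparsed_options" and "--parallelism" are hypothetical.

```python
import argparse


def parse_scoped(argv):
    """Parse declared options; keep everything else under one namespace."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--runner", default="DirectRunner")
    # Repeated string-list option for arbitrary runner-specific values.
    parser.add_argument("--experiments", action="append", default=[])
    known, unparsed = parser.parse_known_args(argv)
    # Unknown flags are preserved for the runner, but never promoted to
    # attributes that user code could start depending on.
    known.unparsed_options = list(unparsed)
    return known


opts = parse_scoped([
    "--runner", "FlinkRunner",
    "--experiments", "flag_a",
    "--experiments", "key=value",
    "--parallelism", "4",  # hypothetical runner-only flag
])
```

Here user code can only reach the runner-only flag through
"opts.unparsed_options", which makes it obvious that the value is meant for
runner consumption.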

As a second note, since the 2.8.0 release is being cut right now, because
of these backwards-compatibility concerns, I would suggest reverting these
changes, at least until 2.8.0 is cut, so we can have a discussion here
before committing to and releasing any API-level changes.

On Fri, Oct 12, 2018 at 9:26 AM Henning Rohde  wrote:

> Agree that pipeline options lack some mechanism for scoping. It is also
> not always possible to distinguish options meant to be consumed at pipeline
> construction time, by the runner, by the SDK harness, by the user code or
> any combination -- and this causes confusion every now and then.
>
> For Dataflow, we have been using "experiments" for arbitrary
> runner-specific options. It's simply a string list pipeline option that all
> SDKs support and, for Go at least, is sent to portable runners. Flink can
> do the same in the short term to move forward.
>
> Henning
>
>
> On Fri, Oct 12, 2018 at 8:50 AM Thomas Weise  wrote:
>
>> [moving to the list]
>>
>> The requirement driving this part of the change was to allow a user to
>> specify pipeline options that a runner supports without having to declare
>> those in each language SDK.
>>
>> In the specific scenario, we have options that the Flink runner supports
>> (and can validate), that are not enumerated in the Python SDK.
>>
>> I think we have a bigger problem scoping pipeline options. For example,
>> the runner options are dumped into the SDK worker. There is also a
>> possibility of name collisions. So I think this would benefit from broader
>> feedback.
>>
>> Thanks,
>> Thomas
>>
>>
>> -- Forwarded message -
>> From: Charles Chen 
>> Date: Fri, Oct 12, 2018 at 8:36 AM
>> Subject: Re: [apache/beam] [BEAM-5442] Store duplicate unknown options in
>> a list argument (#6600)
>> To: apache/beam 
>> Cc: Thomas Weise , Mention <
>> ment...@noreply.github.com>
>>
>>
>> CC: @tweise <https://github.com/tweise>
>>
>>
>

Re: Finalizing the 2.7.0 release

2018-10-09 Thread Charles Chen

On it.

On Tue, Oct 9, 2018 at 9:30 AM Jean-Baptiste Onofré  wrote:

> Sorry, by announcement, it was not the mailing list, I meant the tag
> alias, the Jira version, the artifacts update, the dist cleanup, etc.
>
> Regards
> JB
>
> On 09/10/2018 18:18, Thomas Weise wrote:
> > BTW on our github readme, the latest version in Maven is displayed as
> > 2.5.0, while 2.7.0 was published. This could be an image caching issue.
> > But instead of showing something incorrect, maybe rather remove, unless
> > it can be fixed?
> >
> >
> > On Tue, Oct 9, 2018 at 9:13 AM Thomas Weise wrote:
> >
> > Announcement was sent:
> > https://lists.apache.org/thread.html/b970a49a59fe97754eb6483823cca78e7208f17ae486bc959ee28e25@%3Cdev.beam.apache.org%3E
> >
> > I did not check other release finalization steps, is the tag not
> > automated?
> >
> >
> > On Tue, Oct 9, 2018 at 9:06 AM Jean-Baptiste Onofré wrote:
> >
> > Hi guys,
> >
> > it seems the latest steps to finalize the 2.7.0 release have to be
> > performed.
> >
> > Can I create the 2.7.0 git tag based on the RC3 one ?
> >
> > About the announcement, is someone planning to tackle that ?
> >
> > Thanks !
> > Regards
> > JB
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org 
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>

[ANNOUNCE] Apache Beam 2.7.0 released!

2018-10-03 Thread Charles Chen

The Apache Beam team is pleased to announce the release of version 2.7.0!

Apache Beam is an open source unified programming model to define and
execute data processing pipelines, including ETL, batch and stream
(continuous) processing. See https://beam.apache.org

You can download the release here:

https://beam.apache.org/get-started/downloads/

This release includes the following major new features & improvements,
among others:
- New KuduIO, Amazon SNS sink, and Amazon SqsIO.
- Dependencies upgraded to new versions.
- Experimental support for Python on local Flink runner for simple examples.
- Various bugfixes and minor improvements.

You can take a look at the Release Notes for more details:

https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12343654

Thanks to everyone who contributed to this release, and we hope you enjoy
using Beam 2.7.0.
--
Charles Chen, on behalf of The Apache Beam team

[HELP] Blog post for upcoming 2.7.0 release

2018-09-30 Thread Charles Chen

Hi all,

We will be announcing the Apache Beam 2.7.0 release shortly.  As part of
this, we will be doing a blog post with improvement and feature highlights.

Please add your release notes and comments to this doc:
https://docs.google.com/document/d/1jIk0pc8CxTMmtz5b7UL0gSPxmjKnyerVFS6FcpP2Ym8/edit?usp=sharing

Thank you everyone who contributed to this release.

Best,
Charles

Re: [VOTE] Release 2.7.0, release candidate #3

2018-09-28 Thread Charles Chen

Thank you everyone for your work on this release.  I'm pleased to announce
that the 2.7.0 RC3 is approved for release with 3 PMC +1 votes and no -1
votes.

On Thu, Sep 27, 2018 at 5:31 AM Łukasz Gajowy 
wrote:

> +1
>
> I once again looked at the Nexmark dashboards, it seems that there are no
> performance regressions.
>
> On Thu, Sep 27, 2018, 00:02 Jean-Baptiste Onofré wrote:
>
>> +1 (binding)
>>
>> Regards
>> JB
>> On Sep 26, 2018, at 18:00, Ahmet Altay wrote:
>>>
>>> +1. Thank you all!
>>>
>>> On Wed, Sep 26, 2018 at 2:33 PM, Charles Chen  wrote:
>>>
>>>> +1. Performed additional validations as listed in the spreadsheet.
>>>>
>>>>
>>>> On Wed, Sep 26, 2018, 3:24 AM Robert Bradshaw < rober...@google.com>
>>>> wrote:
>>>>
>>>>> +1 (binding), same verification as before.
>>>>>
>>>>> On Wed, Sep 26, 2018 at 7:36 AM Charles Chen < c...@google.com> wrote:
>>>>>
>>>>>> To clarify, the only difference between RC2 and RC3 is the Python
>>>>>> fix  https://github.com/apache/beam/pull/6494.
>>>>>>
>>>>>> This means that the Java validations from RC2 should carry over,
>>>>>> though I reran validations with RC3 anyway, as detailed on the
>>>>>> spreadsheet.
>>>>>>
>>>>>> On Wed, Sep 26, 2018 at 12:41 AM Charles Chen < c...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> As with before, please add any validation performed to the
>>>>>>> spreadsheet here:
>>>>>>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1675964688
>>>>>>>
>>>>>>> On Wed, Sep 26, 2018 at 12:30 AM Charles Chen < c...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> Please review and vote on the release candidate #3 for the version
>>>>>>>> 2.7.0, as follows:
>>>>>>>> [ ] +1, Approve the release
>>>>>>>> [ ] -1, Do not approve the release (please provide specific
>>>>>>>> comments)
>>>>>>>>
>>>>>>>> The complete staging area is available for your review, which
>>>>>>>> includes:
>>>>>>>> * JIRA release notes [1],
>>>>>>>> * the official Apache source release to be deployed to
>>>>>>>> dist.apache.org [2], which is signed with the key with fingerprint
>>>>>>>> 45C60AAAD115F560 [3],
>>>>>>>> * all artifacts to be deployed to the Maven Central Repository [4],
>>>>>>>> * source code tag "v2.7.0-RC3" [5],
>>>>>>>> * website pull request listing the release and publishing the API
>>>>>>>> reference manual [6].
>>>>>>>> * Java artifacts were built with Gradle 4.8 and OpenJDK
>>>>>>>> 1.8.0_181-8u181-b13-1~deb9u1.
>>>>>>>> * Python artifacts are deployed along with the source release to
>>>>>>>> the dist.apache.org [2].
>>>>>>>>
>>>>>>>> The vote will be open for at least 72 hours. It is adopted by
>>>>>>>> majority approval, with at least 3 PMC affirmative votes.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Charles
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12343654
>>>>>>>> [2] https://dist.apache.org/repos/dist/dev/beam/2.7.0
>>>>>>>> [3] https://dist.apache.org/repos/dist/dev/beam/KEYS
>>>>>>>> [4]
>>>>>>>> https://repository.apache.org/content/repositories/orgapachebeam-1048/
>>>>>>>> [5] https://github.com/apache/beam/tree/v2.7.0-RC3
>>>>>>>> [6] https://github.com/apache/beam-site/pull/549
>>>>>>>>
>>>>>>>
>>>

Re: [DISCUSS] Committer Guidelines / Hygiene before merging PRs

2018-09-28 Thread Charles Chen

 making it PR
>>>>> author's responsibility to squash fixup commits. Having that expectation
>>>>> described clearly in the Contributor's Guide, along with some simple
>>>>> step-by-step instructions for how to do so should be enough. I mainly
>>>>> support this because I've been doing the squashing myself since I saw a
>>>>> thread about it here a few months ago. It's not nearly as huge a burden on
>>>>> me as it probably is for committers who have to merge in many more PRs,
>>>>> it's very easy to learn how to do, and it's one less barrier to having my
>>>>> code merged in.
>>>>>
>>>>> Of course I wouldn't expect that committers wait for PR authors to
>>>>> squash their fixup commits, but I think leaving a message like "For
>>>>> future pull requests you should squash any small fixup commits, as
>>>>> described here: " should be fine.
>>>>>
>>>>>
>>>>>> I was also thinking about the possibility of wanting to revert
>>>>>> individual commits from a merge commit. The solution you propose
>>>>>> works, but only if you want to revert everything.
>>>>>
>>>>>
>>>>> Does this happen often? I might not have enough context since I'm not
>>>>> a committer, but it seems to me that often the person performing a
>>>>> revert is not the original author of a change and doesn't have the
>>>>> context or time to pick out an individual commit to revert.
>>>>>
>>>>> On Wed, Sep 19, 2018 at 1:32 PM Maximilian Michels 
>>>>> wrote:
>>>>>
>>>>>> I tend to agree with you Lukasz. Of course we should try to follow
>>>>>> the guidelines as much as possible, but if it requires an extra back
>>>>>> and forth with the PR author for a cosmetic change, it may not be
>>>>>> worth the time.
>>>>>>
>>>>>> On 19.09.18 22:17, Lukasz Cwik wrote:
>>>>>> > I have to say I'm guilty of not following the merge guidelines,
>>>>>> > sometimes doing merges without rebasing/flattening commits.
>>>>>> >
>>>>>> > I find that it takes a few extra minutes of my time to fix someone's
>>>>>> > PR history if they have more than one logical commit they want to
>>>>>> > keep separate, while it usually takes days for the PR author to do
>>>>>> > the merging, and the extra burden as a committer of keeping track of
>>>>>> > another PR and its state (waiting for clean-up) is taxing. I really
>>>>>> > liked the idea of the mergebot (even though it didn't work out in
>>>>>> > practice) because it could do all the policy work on my behalf.
>>>>>> >
>>>>>> > Anything that reduces my overhead as a committer is useful: of the
>>>>>> > 100s of PRs that I have merged, I've only had to roll back a couple,
>>>>>> > so I'm for Charles's suggestion, which makes the rollback flow
>>>>>> > slightly more complicated in exchange for a significantly easier PR
>>>>>> > merge workflow.
>>>>>> >
>>>>>> > On Wed, Sep 19, 2018 at 1:13 PM Charles Chen <c...@google.com> wrote:
>>>>>> >
>>>>>> > What I mean is that if you get the first-parent commit using
>>>>>> > "git log --first-parent", it will incorporate any and all fixup
>>>>>> > commits so we don't need to worry about missing any.
>>>>>> >
>>>>>> > On Wed, Sep 19, 2018, 1:07 PM Maximilian Michels <m...@apache.org>
>>>>>> > wrote:
>>>>>> >
>>>>>> > Generally, +1 for isolated commits which are easy to revert.
>>>>>> >
>>>>>> >  > I don't think it's actually harder to roll back a set of
>>>>>> > com
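
For readers following the rollback discussion quoted above, here is a small
sketch of the two git invocations involved, built as argument lists in Python
so they can be inspected without touching a repository. "git log
--first-parent" and "git revert -m 1" are real git options; the helper names,
the branch default and the sample SHA are assumptions for illustration.

```python
def first_parent_log_cmd(branch="master"):
    """List only the commits on the branch's own line, i.e. the merge
    commits themselves rather than the fixup commits they pulled in."""
    return ["git", "log", "--first-parent", "--oneline", branch]


def revert_merge_cmd(merge_sha):
    """Revert a whole merge commit; '-m 1' keeps the first parent (the
    branch that was merged into) as the mainline."""
    return ["git", "revert", "-m", "1", merge_sha]


# "abc1234" is a placeholder, not a real Beam commit.
log_cmd = " ".join(first_parent_log_cmd())
revert_cmd = " ".join(revert_merge_cmd("abc1234"))
```

Reverting the merge commit this way undoes every commit in the PR at once,
which matches the point made in the thread that it only helps when you want to
revert everything.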

Re: [VOTE] Release 2.7.0, release candidate #3

2018-09-26 Thread Charles Chen

+1. Performed additional validations as listed in the spreadsheet.

On Wed, Sep 26, 2018, 3:24 AM Robert Bradshaw  wrote:

> +1 (binding), same verification as before.
>
> On Wed, Sep 26, 2018 at 7:36 AM Charles Chen  wrote:
>
>> To clarify, the only difference between RC2 and RC3 is the Python fix
>> https://github.com/apache/beam/pull/6494.
>>
>> This means that the Java validations from RC2 should carry over, though I
>> reran validations with RC3 anyway, as detailed on the spreadsheet.
>>
>> On Wed, Sep 26, 2018 at 12:41 AM Charles Chen  wrote:
>>
>>> As with before, please add any validation performed to the spreadsheet
>>> here:
>>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1675964688
>>>
>>> On Wed, Sep 26, 2018 at 12:30 AM Charles Chen  wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> Please review and vote on the release candidate #3 for the version
>>>> 2.7.0, as follows:
>>>> [ ] +1, Approve the release
>>>> [ ] -1, Do not approve the release (please provide specific comments)
>>>>
>>>> The complete staging area is available for your review, which includes:
>>>> * JIRA release notes [1],
>>>> * the official Apache source release to be deployed to dist.apache.org
>>>> [2], which is signed with the key with fingerprint 45C60AAAD115F560 [3],
>>>> * all artifacts to be deployed to the Maven Central Repository [4],
>>>> * source code tag "v2.7.0-RC3" [5],
>>>> * website pull request listing the release and publishing the API
>>>> reference manual [6].
>>>> * Java artifacts were built with Gradle 4.8 and OpenJDK
>>>> 1.8.0_181-8u181-b13-1~deb9u1.
>>>> * Python artifacts are deployed along with the source release to the
>>>> dist.apache.org [2].
>>>>
>>>> The vote will be open for at least 72 hours. It is adopted by majority
>>>> approval, with at least 3 PMC affirmative votes.
>>>>
>>>> Thanks,
>>>> Charles
>>>>
>>>> [1]
>>>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12343654
>>>> [2] https://dist.apache.org/repos/dist/dev/beam/2.7.0
>>>> [3] https://dist.apache.org/repos/dist/dev/beam/KEYS
>>>> [4]
>>>> https://repository.apache.org/content/repositories/orgapachebeam-1048/
>>>> [5] https://github.com/apache/beam/tree/v2.7.0-RC3
>>>> [6] https://github.com/apache/beam-site/pull/549
>>>>
>>>

Re: [VOTE] Release 2.7.0, release candidate #3

2018-09-25 Thread Charles Chen

To clarify, the only difference between RC2 and RC3 is the Python fix
https://github.com/apache/beam/pull/6494.

This means that the Java validations from RC2 should carry over, though I
reran validations with RC3 anyway, as detailed on the spreadsheet.

On Wed, Sep 26, 2018 at 12:41 AM Charles Chen  wrote:

> As with before, please add any validation performed to the spreadsheet
> here:
> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1675964688
>
> On Wed, Sep 26, 2018 at 12:30 AM Charles Chen  wrote:
>
>> Hi everyone,
>>
>> Please review and vote on the release candidate #3 for the version 2.7.0,
>> as follows:
>> [ ] +1, Approve the release
>> [ ] -1, Do not approve the release (please provide specific comments)
>>
>> The complete staging area is available for your review, which includes:
>> * JIRA release notes [1],
>> * the official Apache source release to be deployed to dist.apache.org
>> [2], which is signed with the key with fingerprint 45C60AAAD115F560 [3],
>> * all artifacts to be deployed to the Maven Central Repository [4],
>> * source code tag "v2.7.0-RC3" [5],
>> * website pull request listing the release and publishing the API
>> reference manual [6].
>> * Java artifacts were built with Gradle 4.8 and OpenJDK
>> 1.8.0_181-8u181-b13-1~deb9u1.
>> * Python artifacts are deployed along with the source release to the
>> dist.apache.org [2].
>>
>> The vote will be open for at least 72 hours. It is adopted by majority
>> approval, with at least 3 PMC affirmative votes.
>>
>> Thanks,
>> Charles
>>
>> [1]
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12343654
>> [2] https://dist.apache.org/repos/dist/dev/beam/2.7.0
>> [3] https://dist.apache.org/repos/dist/dev/beam/KEYS
>> [4]
>> https://repository.apache.org/content/repositories/orgapachebeam-1048/
>> [5] https://github.com/apache/beam/tree/v2.7.0-RC3
>> [6] https://github.com/apache/beam-site/pull/549
>>
>

[VOTE] Release 2.7.0, release candidate #3

2018-09-25 Thread Charles Chen

Hi everyone,

Please review and vote on the release candidate #3 for the version 2.7.0,
as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release to be deployed to dist.apache.org [2],
which is signed with the key with fingerprint 45C60AAAD115F560 [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "v2.7.0-RC3" [5],
* website pull request listing the release and publishing the API reference
manual [6].
* Java artifacts were built with Gradle 4.8 and OpenJDK
1.8.0_181-8u181-b13-1~deb9u1.
* Python artifacts are deployed along with the source release to the
dist.apache.org [2].

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.
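
The adoption rule stated above can be sketched as a small check. This is only
an illustration of the rule as worded in this message; it assumes that only
binding (PMC) votes count toward the majority, and the Vote type and voter
names are made up.

```python
from dataclasses import dataclass


@dataclass
class Vote:
    voter: str
    value: int      # +1 or -1
    binding: bool   # True for PMC members


def release_approved(votes, min_binding_plus_ones=3):
    """Majority approval with at least three binding +1 votes."""
    binding = [v for v in votes if v.binding]
    plus = sum(1 for v in binding if v.value > 0)
    minus = sum(1 for v in binding if v.value < 0)
    return plus >= min_binding_plus_ones and plus > minus


# Hypothetical tallies.
passing = [
    Vote("pmc_a", +1, True),
    Vote("pmc_b", +1, True),
    Vote("pmc_c", +1, True),
    Vote("contributor_d", +1, False),
]
too_few_binding = passing[:2] + [Vote("contributor_e", +1, False)]
```

Non-binding +1 votes are still useful as validation signal, but in this sketch
they do not change the outcome.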

Thanks,
Charles

[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12343654
[2] https://dist.apache.org/repos/dist/dev/beam/2.7.0
[3] https://dist.apache.org/repos/dist/dev/beam/KEYS
[4] https://repository.apache.org/content/repositories/orgapachebeam-1048/
[5] https://github.com/apache/beam/tree/v2.7.0-RC3
[6] https://github.com/apache/beam-site/pull/549

Re: [VOTE] Release 2.7.0, release candidate #2

2018-09-25 Thread Charles Chen

I have merged https://github.com/apache/beam/pull/6494 and
https://github.com/apache/beam/pull/6495 which revert the offending commit
in master and release-2.7.0, respectively.  I am building 2.7.0 RC3 which
will be out shortly.

On Tue, Sep 25, 2018 at 9:52 PM Charles Chen  wrote:

> Unfortunately, I have to -1 this RC.  There is a problem with the Python
> release validation.  Specifically, wordcount on the local DirectRunner
> fails with:
>
> *21:02:11* ERROR:root:Exception in _rename_batch. src: 
> /tmp/tmp.PqwH113fm4/beam-temp-wordcount_direct.txt-cb5837b8c12711e8981742010a800020/5d1be1d0-bc76-4d30-90e4-bf76f30e17a0.wordcount_direct.txt,
>  dst: wordcount_direct.txt-0-of-1, err: [Errno 2] No such file or 
> directory: ''
>
>
> https://builds.apache.org/job/beam_PostRelease_Python_Candidate/133/console
>
> I ran a git bisect and identified https://github.com/apache/beam/pull/5903
> as the culprit (it looks like the change doesn't deal well with relative
> paths):
>
> d903dcfc7ff7300355688b08b779da880f15fe9d is the first bad commit
> commit d903dcfc7ff7300355688b08b779da880f15fe9d
> Author: Ryan Williams 
> Date:   Fri Jul 27 19:46:19 2018 -0400
>
> [BEAM-4747] mkdirs if they don't exist in localfilesystem (#5903)
>
> * mkdirs if they don't exist in localfilesystem
> * make localfilesystem create ancestor directories for output paths
>
> :04 04 4f66c0745f0cb28fbc9527657da7d9126d685f99 
> b48a4cca7dd3454d1babeb1b0ed15d7b6912f046 M  sdks
> bisect run success
>
>
> On Tue, Sep 25, 2018 at 4:53 PM Charles Chen  wrote:
>
>> Hi all, please add any validation performed to the spreadsheet here:
>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1675964688
>>
>> On Tue, Sep 25, 2018 at 12:11 PM Romain Manni-Bucau <
>> rmannibu...@gmail.com> wrote:
>>
>>> +1 tested spark local and a 1 node cluster, direct runner and java test
>>> tools
>>>
>>> On Tue, Sep 25, 2018 at 13:08, Łukasz Gajowy wrote:
>>>
>>>> +1
>>>>
>>>> I ran the Java - Quickstart on core runners (local) and Dataflow
>>>> Quickstarts (on Dataflow instance) either way. Looks good. I wanted to
>>>> learn how easy it is - kudos to the author of the script! :)
>>>>
>>>> On Tue, Sep 25, 2018 at 12:39, Robert Bradshaw wrote:
>>>>
>>>>> +1 (binding)
>>>>>
>>>>> I verified all the signatures and hashes, as well as one of the Python
>>>>> wheels, and that we're not shipping gradle[w] but otherwise the content
>>>>> matches the git repo (except a SNAPSHOT vs version change to the source).
>>>>>
>>>>> The changes [1] look minimal compared to RC1, so most of the
>>>>> verification there should apply as well.
>>>>>
>>>>> [1] https://github.com/apache/beam/compare/v2.7.0-RC1...v2.7.0-RC2
>>>>>
>>>>> On Tue, Sep 25, 2018 at 4:59 AM Charles Chen  wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> Please review and vote on the release candidate #2 for the version
>>>>>> 2.7.0, as follows:
>>>>>> [ ] +1, Approve the release
>>>>>> [ ] -1, Do not approve the release (please provide specific comments)
>>>>>>
>>>>>> The complete staging area is available for your review, which
>>>>>> includes:
>>>>>> * JIRA release notes [1],
>>>>>> * the official Apache source release to be deployed to
>>>>>> dist.apache.org [2], which is signed with the key with fingerprint
>>>>>> 45C60AAAD115F560 [3],
>>>>>> * all artifacts to be deployed to the Maven Central Repository [4],
>>>>>> * source code tag "v2.7.0-RC2" [5],
>>>>>> * website pull request listing the release and publishing the API
>>>>>> reference manual [6].
>>>>>> * Java artifacts were built with Gradle 4.8 and OpenJDK
>>>>>> 1.8.0_181-8u181-b13-1~deb9u1.
>>>>>> * Python artifacts are deployed along with the source release to the
>>>>>> dist.apache.org [2].
>>>>>>
>>>>>> The vote will be open for at least 72 hours. It is adopted by
>>>>>> majority approval, with at least 3 PMC affirmative votes.
>>>>>>
>>>>>> Thanks,
>>>>>> Charles
>>>>>>
>>>>>> [1]
>>>>>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12343654
>>>>>> [2] https://dist.apache.org/repos/dist/dev/beam/2.7.0
>>>>>> [3] https://dist.apache.org/repos/dist/dev/beam/KEYS
>>>>>> [4]
>>>>>> https://repository.apache.org/content/repositories/orgapachebeam-1047/
>>>>>> [5] https://github.com/apache/beam/tree/v2.7.0-RC2
>>>>>> [6] https://github.com/apache/beam-site/pull/549
>>>>>>
>>>>>

Re: [VOTE] Release 2.7.0, release candidate #2

2018-09-25 Thread Charles Chen

Unfortunately, I have to -1 this RC.  There is a problem with the Python
release validation.  Specifically, wordcount on the local DirectRunner
fails with:

*21:02:11* ERROR:root:Exception in _rename_batch. src:
/tmp/tmp.PqwH113fm4/beam-temp-wordcount_direct.txt-cb5837b8c12711e8981742010a800020/5d1be1d0-bc76-4d30-90e4-bf76f30e17a0.wordcount_direct.txt,
dst: wordcount_direct.txt-0-of-1, err: [Errno 2] No such file
or directory: ''


https://builds.apache.org/job/beam_PostRelease_Python_Candidate/133/console

I ran a git bisect and identified https://github.com/apache/beam/pull/5903
as the culprit (it looks like the change doesn't deal well with relative
paths):

d903dcfc7ff7300355688b08b779da880f15fe9d is the first bad commit
commit d903dcfc7ff7300355688b08b779da880f15fe9d
Author: Ryan Williams 
Date:   Fri Jul 27 19:46:19 2018 -0400

[BEAM-4747] mkdirs if they don't exist in localfilesystem (#5903)

* mkdirs if they don't exist in localfilesystem
* make localfilesystem create ancestor directories for output paths

:04 04 4f66c0745f0cb28fbc9527657da7d9126d685f99
b48a4cca7dd3454d1babeb1b0ed15d7b6912f046 M  sdks
bisect run success
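
The empty path in "[Errno 2] No such file or directory: ''" is what you get
when a parent directory is derived from a destination that has no directory
component. Here is a minimal sketch of that failure mode, assuming the code
computes the parent with os.path.dirname. Both helper names are hypothetical
and this is not taken from the Beam source:

```python
import os


def ensure_parent_dir_naive(path):
    """Hypothetical helper: create the parent directory of `path` if missing."""
    parent = os.path.dirname(path)  # '' when `path` is a bare file name
    if not os.path.exists(parent):
        # On Linux, os.makedirs('') raises [Errno 2] No such file or directory: ''
        os.makedirs(parent)


def ensure_parent_dir_safe(path):
    """Same idea, but resolve the path first so the parent is never ''."""
    parent = os.path.dirname(os.path.abspath(path))
    os.makedirs(parent, exist_ok=True)
    return parent


if __name__ == "__main__":
    dst = "wordcount_direct.txt-0-of-1"  # relative destination, as in the log
    try:
        ensure_parent_dir_naive(dst)
    except OSError as err:
        print("naive:", err)
    print("safe:", ensure_parent_dir_safe(dst))
```

Resolving the path before taking its directory name is only one possible
guard; skipping directory creation when the parent is empty would work too.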


On Tue, Sep 25, 2018 at 4:53 PM Charles Chen  wrote:

> Hi all, please add any validation performed to the spreadsheet here:
> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1675964688
>
> On Tue, Sep 25, 2018 at 12:11 PM Romain Manni-Bucau 
> wrote:
>
>> +1 tested spark local and a 1 node cluster, direct runner and java test
>> tools
>>
>> On Tue, Sep 25, 2018 at 13:08, Łukasz Gajowy wrote:
>>
>>> +1
>>>
>>> I ran the Java - Quickstart on core runners (local) and Dataflow
>>> Quickstarts (on Dataflow instance) either way. Looks good. I wanted to
>>> learn how easy it is - kudos to the author of the script! :)
>>>
>>> On Tue, Sep 25, 2018 at 12:39, Robert Bradshaw wrote:
>>>
>>>> +1 (binding)
>>>>
>>>> I verified all the signatures and hashes, as well as one of the Python
>>>> wheels, and that we're not shipping gradle[w] but otherwise the content
>>>> matches the git repo (except a SNAPSHOT vs version change to the source).
>>>>
>>>> The changes [1] look minimal compared to RC1, so most of the
>>>> verification there should apply as well.
>>>>
>>>> [1] https://github.com/apache/beam/compare/v2.7.0-RC1...v2.7.0-RC2
>>>>
>>>> On Tue, Sep 25, 2018 at 4:59 AM Charles Chen  wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> Please review and vote on the release candidate #2 for the version
>>>>> 2.7.0, as follows:
>>>>> [ ] +1, Approve the release
>>>>> [ ] -1, Do not approve the release (please provide specific comments)
>>>>>
>>>>> The complete staging area is available for your review, which includes:
>>>>> * JIRA release notes [1],
>>>>> * the official Apache source release to be deployed to dist.apache.org
>>>>> [2], which is signed with the key with fingerprint 45C60AAAD115F560 [3],
>>>>> * all artifacts to be deployed to the Maven Central Repository [4],
>>>>> * source code tag "v2.7.0-RC2" [5],
>>>>> * website pull request listing the release and publishing the API
>>>>> reference manual [6].
>>>>> * Java artifacts were built with Gradle 4.8 and OpenJDK
>>>>> 1.8.0_181-8u181-b13-1~deb9u1.
>>>>> * Python artifacts are deployed along with the source release to the
>>>>> dist.apache.org [2].
>>>>>
>>>>> The vote will be open for at least 72 hours. It is adopted by majority
>>>>> approval, with at least 3 PMC affirmative votes.
>>>>>
>>>>> Thanks,
>>>>> Charles
>>>>>
>>>>> [1]
>>>>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12343654
>>>>> [2] https://dist.apache.org/repos/dist/dev/beam/2.7.0
>>>>> [3] https://dist.apache.org/repos/dist/dev/beam/KEYS
>>>>> [4]
>>>>> https://repository.apache.org/content/repositories/orgapachebeam-1047/
>>>>> [5] https://github.com/apache/beam/tree/v2.7.0-RC2
>>>>> [6] https://github.com/apache/beam-site/pull/549
>>>>>
>>>>

Re: [VOTE] Release 2.7.0, release candidate #2

2018-09-25 Thread Charles Chen

Hi all, please add any validation performed to the spreadsheet here:
https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1675964688

On Tue, Sep 25, 2018 at 12:11 PM Romain Manni-Bucau 
wrote:

> +1 tested spark local and a 1 node cluster, direct runner and java test
> tools
>
> On Tue, Sep 25, 2018 at 13:08, Łukasz Gajowy wrote:
>
>> +1
>>
>> I ran the Java - Quickstart on core runners (local) and Dataflow
>> Quickstarts (on Dataflow instance) either way. Looks good. I wanted to
>> learn how easy it is - kudos to the author of the script! :)
>>
>> On Tue, Sep 25, 2018 at 12:39, Robert Bradshaw wrote:
>>
>>> +1 (binding)
>>>
>>> I verified all the signatures and hashes, as well as one of the Python
>>> wheels, and that we're not shipping gradle[w] but otherwise the content
>>> matches the git repo (except a SNAPSHOT vs version change to the source).
>>>
>>> The changes [1] look minimal compared to RC1, so most of the
>>> verification there should apply as well.
>>>
>>> [1] https://github.com/apache/beam/compare/v2.7.0-RC1...v2.7.0-RC2
>>>
>>> On Tue, Sep 25, 2018 at 4:59 AM Charles Chen  wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> Please review and vote on the release candidate #2 for the version
>>>> 2.7.0, as follows:
>>>> [ ] +1, Approve the release
>>>> [ ] -1, Do not approve the release (please provide specific comments)
>>>>
>>>> The complete staging area is available for your review, which includes:
>>>> * JIRA release notes [1],
>>>> * the official Apache source release to be deployed to dist.apache.org
>>>> [2], which is signed with the key with fingerprint 45C60AAAD115F560 [3],
>>>> * all artifacts to be deployed to the Maven Central Repository [4],
>>>> * source code tag "v2.7.0-RC2" [5],
>>>> * website pull request listing the release and publishing the API
>>>> reference manual [6].
>>>> * Java artifacts were built with Gradle 4.8 and OpenJDK
>>>> 1.8.0_181-8u181-b13-1~deb9u1.
>>>> * Python artifacts are deployed along with the source release to the
>>>> dist.apache.org [2].
>>>>
>>>> The vote will be open for at least 72 hours. It is adopted by majority
>>>> approval, with at least 3 PMC affirmative votes.
>>>>
>>>> Thanks,
>>>> Charles
>>>>
>>>> [1]
>>>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12343654
>>>> [2] https://dist.apache.org/repos/dist/dev/beam/2.7.0
>>>> [3] https://dist.apache.org/repos/dist/dev/beam/KEYS
>>>> [4]
>>>> https://repository.apache.org/content/repositories/orgapachebeam-1047/
>>>> [5] https://github.com/apache/beam/tree/v2.7.0-RC2
>>>> [6] https://github.com/apache/beam-site/pull/549
>>>>
>>>

[VOTE] Release 2.7.0, release candidate #2

2018-09-24 Thread Charles Chen

Hi everyone,

Please review and vote on the release candidate #2 for the version 2.7.0,
as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release to be deployed to dist.apache.org [2],
which is signed with the key with fingerprint 45C60AAAD115F560 [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "v2.7.0-RC2" [5],
* website pull request listing the release and publishing the API reference
manual [6].
* Java artifacts were built with Gradle 4.8 and OpenJDK
1.8.0_181-8u181-b13-1~deb9u1.
* Python artifacts are deployed along with the source release to the
dist.apache.org [2].

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.

Thanks,
Charles

[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12343654
[2] https://dist.apache.org/repos/dist/dev/beam/2.7.0
[3] https://dist.apache.org/repos/dist/dev/beam/KEYS
[4] https://repository.apache.org/content/repositories/orgapachebeam-1047/
[5] https://github.com/apache/beam/tree/v2.7.0-RC2
[6] https://github.com/apache/beam-site/pull/549

Re: Python PreCommit broken

2018-09-21 Thread Charles Chen

I can speak to the breakage from https://github.com/apache/beam/pull/6151:
here, the root cause was that an interface in a file the PR did not modify
was refactored upstream in the interim.

On Fri, Sep 21, 2018 at 2:26 PM Ryan Williams  wrote:

>
> On Fri, Sep 21, 2018 at 5:08 PM Robert Burke  wrote:
>
>> The issue is time from commit to merge, and without manual intervention,
>> commits from other PRs aren't accounted for, if there's a lag between LGTM
>> and merge.
>>
>
> it's not between LGTM and merge, it's between [last commit to the PR's
> branch] and merge, right?
>
>
>>
>> On Fri, Sep 21, 2018, 1:52 PM Ahmet Altay  wrote:
>>
>>> I will suggest a rollback in this case, and in general as a good
>>> practice to unblock people.
>>>
>>> On Fri, Sep 21, 2018 at 1:02 PM, Charles Chen  wrote:
>>>
>>>> Relatedly, https://github.com/apache/beam/pull/6151 also recently
>>>> broke the build (
>>>> https://builds.apache.org/view/A-D/view/Beam/job/beam_PostCommit_Java_GradleBuild/1503/console)
>>>> because the Precommits were very out of date when merged.
>>>>
>>>
>>> Does not github change the test signals from green to yellow when new
>>> commits are added?
>>>
>>
> it does this if you add commits to your PR's branch, not if commits happen
> upstream.
>
> in the latter case, github only checks whether your PR can be merged
> cleanly (i.e. with no merge-conflicts) according to some line-diffing
> heuristic (presumably the same as git uses), but not whether the tests
> would still pass post-merge.
>
> the unwritten assumption is that a non-conflicting merge with upstream
> won't break any tests, but obviously there can be false-positives; thanks
> for documenting those cases, Valentyn and Charles.
>
> it would be nice to run precommits on potential post-merge commits, but
> i'd guess that our precommits take longer than the average time between
> upstream commits, so it would be hard to get anything merged that way!
>
> we'd need to get smarter about knowing exactly which tests need re-running
> when new commits happen upstream, and only re-run those tests, which we're
> pretty far from today.
>
> so, the possibility of this kind of breakage is always present,
> unfortunately.
>
> i would be curious to see exactly how the line-diffing heuristic was
> fooled in these cases. an easy way to construct such a case is to have one
> side rename a variable, and another side add a new reference to the
> variable's old name, in a newly-created file; each side can touch a
> disjoint set of files and have all its tests pass in isolation, and
> git{,hub} will let you merge them, but the tests will fail post-merge.
>
>
>>
>>>
>>>
>>>>
>>>> On Fri, Sep 21, 2018 at 12:50 PM Valentyn Tymofieiev <
>>>> valen...@google.com> wrote:
>>>>
>>>>> The change https://github.com/apache/beam/pull/6424 was not deemed
>>>>> particularly risky, and its purpose was adding more tests to precommit
>>>>> test suite.
>>>>> There was a green Precommit signal on Jenkins, and I believe
>>>>> Postcommit test suite (at the same time) wouldn't catch this.
>>>>>
>>>>> The reason the breakage was introduced is that
>>>>> https://github.com/apache/beam/commit/7689f12db5 was committed after
>>>>> the PR 6424 was reviewed, but before it was merged. A combination of both
>>>>> introduced the breakage.
>>>>>
>>>>> Had we re-run the tests closer to the merge, we should have caught
>>>>> this. Can we automatically re-run precommit tests at merge time when
>>>>> some of the files a PR touches have changed since the last precommit run?
>>>>>
>>>>> I suggest we proceed with https://github.com/apache/beam/pull/6464 or
>>>>> revert https://github.com/apache/beam/pull/6424 in the meantime,
>>>>> while we are iterating on the fix.
>>>>>
>>>>> On Fri, Sep 21, 2018 at 11:41 AM Charles Chen  wrote:
>>>>>
>>>>>> Do we happen to know the root cause for why this wasn't caught during
>>>>>> review / precommit?
>>>>>>
>>>>>> In the future, can we manually run postcommits for risky changes
>>>>>> like these?  That is, trigger it by commenting "Run Python PostCommit"?
>>>>>>
>>>>>> On Fri, Sep 21, 2018 at 10:10 AM Pablo Estrada 
>>>>>> wrote:

Re: Python PreCommit broken

2018-09-21 Thread Charles Chen

Relatedly, https://github.com/apache/beam/pull/6151 also recently broke the
build (
https://builds.apache.org/view/A-D/view/Beam/job/beam_PostCommit_Java_GradleBuild/1503/console)
because the Precommits were very out of date when merged.

On Fri, Sep 21, 2018 at 12:50 PM Valentyn Tymofieiev 
wrote:

> The change https://github.com/apache/beam/pull/6424 was not deemed
> particularly risky, and its purpose was adding more tests to precommit
> test suite.
> There was a green Precommit signal on Jenkins, and I believe Postcommit
> test suite (at the same time) wouldn't catch this.
>
> The reason the breakage was introduced is that
> https://github.com/apache/beam/commit/7689f12db5 was committed after the
> PR 6424 was reviewed, but before it was merged. A combination of both
> introduced the breakage.
>
> Had we re-run the tests closer to the merge, we should have caught this.
> Can we automatically re-run precommit tests at merge time when some of
> the files a PR touches have changed since the last precommit run?
>
> I suggest we proceed with https://github.com/apache/beam/pull/6464 or
> revert https://github.com/apache/beam/pull/6424 in the meantime, while
> we are iterating on the fix.
>
> On Fri, Sep 21, 2018 at 11:41 AM Charles Chen  wrote:
>
>> Do we happen to know the root cause for why this wasn't caught during
>> review / precommit?
>>
>> In the future, can we manually run postcommits for risky changes like
>> these?  That is, trigger it by commenting "Run Python PostCommit"?
>>
>> On Fri, Sep 21, 2018 at 10:10 AM Pablo Estrada 
>> wrote:
>>
>>> Robbe has prepared a better fix on
>>> https://github.com/apache/beam/pull/6465
>>> Hopefully that'll be the last of this breakage : )
>>> -P.
>>>
>>> On Fri, Sep 21, 2018 at 9:13 AM Jean-Baptiste Onofré 
>>> wrote:
>>>
>>>> By the way, it fails for me on my machine as well.
>>>>
>>>> Regards
>>>> JB
>>>>
>>>> On 21/09/2018 18:10, Pablo Estrada wrote:
>>>> > I investigated. This failure comes from the activation of
>>>> > apache_beam.pipeline_test in Python 3 unit tests.
>>>> >
>>>> > I have https://github.com/apache/beam/pull/6464 out to fix this.
>>>> >
>>>> > In any case, it's very exciting that we have some unit tests running
>>>> > on Py3 : )
>>>> > Best
>>>> > -P.
>>>> >
>>>> > On Fri, Sep 21, 2018 at 6:40 AM Maximilian Michels <m...@apache.org> wrote:
>>>> >
>>>> > Hi,
>>>> >
>>>> > The Python PreCommit tests are currently broken. Is anybody looking
>>>> > into this?
>>>> >
>>>> > Example:
>>>> > https://builds.apache.org/job/beam_PreCommit_Python_Commit/1308/#showFailuresLink
>>>> > JIRA: https://issues.apache.org/jira/browse/BEAM-5458
>>>> >
>>>> > I'm sure this was just an accident. No big deal but let's make sure
>>>> > PreCommit passes when merging. A broken PreCommit means that we can't
>>>> > merge any other changes with confidence.
>>>> >
>>>> > Thanks,
>>>> > Max
>>>> >
>>>>
>>>> --
>>>> Jean-Baptiste Onofré
>>>> jbono...@apache.org
>>>> http://blog.nanthrax.net
>>>> Talend - http://www.talend.com
>>>>
>>>

Re: Python PreCommit broken

2018-09-21 Thread Charles Chen

Do we happen to know the root cause for why this wasn't caught during
review / precommit?

In the future, can we manually run postcommits for risky changes like
these?  That is, trigger it by commenting "Run Python PostCommit"?

On Fri, Sep 21, 2018 at 10:10 AM Pablo Estrada  wrote:

> Robbe has prepared a better fix on
> https://github.com/apache/beam/pull/6465
> Hopefully that'll be the last of this breakage : )
> -P.
>
> On Fri, Sep 21, 2018 at 9:13 AM Jean-Baptiste Onofré 
> wrote:
>
>> By the way, it fails for me on my machine as well.
>>
>> Regards
>> JB
>>
>> On 21/09/2018 18:10, Pablo Estrada wrote:
>> > I investigated. This failure comes from the activation of
>> > apache_beam.pipeline_test in Python 3 unit tests.
>> >
>> > I have https://github.com/apache/beam/pull/6464 out to fix this.
>> >
>> > In any case, it's very exciting that we have some unit tests running on
>> > Py3 : )
>> > Best
>> > -P.
>> >
>> > On Fri, Sep 21, 2018 at 6:40 AM Maximilian Michels wrote:
>> >
>> > Hi,
>> >
>> > The Python PreCommit tests are currently broken. Is anybody looking
>> > into this?
>> >
>> > Example:
>> > https://builds.apache.org/job/beam_PreCommit_Python_Commit/1308/#showFailuresLink
>> > JIRA: https://issues.apache.org/jira/browse/BEAM-5458
>> >
>> > I'm sure this was just an accident. No big deal but let's make sure
>> > PreCommit passes when merging. A broken PreCommit means that we can't
>> > merge any other changes with confidence.
>> >
>> > Thanks,
>> > Max
>> >
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>

Re: [VOTE] Release 2.7.0, release candidate #1

2018-09-20 Thread Charles Chen

My mistake, it looks like the correct beam staging repository (
https://repository.apache.org/content/repositories/orgapachebeam-1046/) is
specified in your pom file.

On Thu, Sep 20, 2018 at 2:10 PM Charles Chen  wrote:

> Hey Romain and JB, do you have any progress on this?  One thing I would
> like to point out is that 2.7.0 isn't yet pushed to Maven Central, so
> referring to it by version is not expected to work (and it looks like this
> is what is done in your repo:
> https://github.com/rmannibucau/beam-2.7.0-fails).  Luke indicated above
> that he doesn't see any dependency changes.  Can you isolate and reproduce
> this problem so that we can develop a fix, if necessary?  I would like to
> proceed with an RC2 as soon as possible.
>
> On Wed, Sep 19, 2018 at 6:37 AM Romain Manni-Bucau 
> wrote:
>
>> Quick update on the Spark issue: I didn't get enough time to identify it
>> clearly, but I managed to get a passing run of my test by changing a bunch
>> of versions.
>> I suspect my code triggers a class conflict between Spark and my shaded
>> jar, leading to a serialization issue. I didn't test Spark's
>> userClassPathFirst option, but it could be interesting to enable in the
>> Beam runner. However, it is still very confusing that the job stops
>> running just from upgrading the Beam version, and the Spark error is very
>> hard to understand.
>>
>> Romain Manni-Bucau
>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>> <https://rmannibucau.metawerx.net/> | Old Blog
>> <http://rmannibucau.wordpress.com> | Github
>> <https://github.com/rmannibucau> | LinkedIn
>> <https://www.linkedin.com/in/rmannibucau> | Book
>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>
>>
>> On Tue, Sep 18, 2018 at 20:17, Lukasz Cwik wrote:
>>
>>> Romain hinted that this was a dependency issue but when comparing the
>>> two dependency trees I don't get much of a difference:
>>>
>>> lcwik@lcwik0: ~$ diff /tmp/260 /tmp/270
>>> < [INFO] +- org.apache.beam:beam-runners-spark:jar:2.6.0:compile
>>> < [INFO] |  +- org.apache.beam:beam-model-pipeline:jar:2.6.0:compile
>>> ---
>>> > [INFO] +- org.apache.beam:beam-runners-spark:jar:2.7.0:compile
>>> > [INFO] |  +- org.apache.beam:beam-model-pipeline:jar:2.7.0:compile
>>> 5c6
>>> < [INFO] |  +- org.apache.beam:beam-sdks-java-core:jar:2.6.0:compile
>>> ---
>>> > [INFO] |  +- org.apache.beam:beam-sdks-java-core:jar:2.7.0:compile
>>> 14,18c15,19
>>> < [INFO] |  |  \- org.tukaani:xz:jar:1.5:compile
>>> < [INFO] |  +-
>>> org.apache.beam:beam-runners-core-construction-java:jar:2.6.0:compile
>>> < [INFO] |  |  \-
>>> org.apache.beam:beam-model-job-management:jar:2.6.0:compile
>>> < [INFO] |  +- org.apache.beam:beam-runners-core-java:jar:2.6.0:compile
>>> < [INFO] |  |  \-
>>> org.apache.beam:beam-model-fn-execution:jar:2.6.0:compile
>>> ---
>>> > [INFO] |  |  \- org.tukaani:xz:jar:1.8:compile
>>> > [INFO] |  +-
>>> org.apache.beam:beam-runners-core-construction-java:jar:2.7.0:compile
>>> > [INFO] |  |  \-
>>> org.apache.beam:beam-model-job-management:jar:2.7.0:compile
>>> > [INFO] |  +- org.apache.beam:beam-runners-core-java:jar:2.7.0:compile
>>> > [INFO] |  |  \-
>>> org.apache.beam:beam-model-fn-execution:jar:2.7.0:compile
>>>
>>> Other than Beam package changes, the only other change is xz which I
>>> don't believe could be causing the issue.
>>>
>>> On Tue, Sep 18, 2018 at 8:38 AM Jean-Baptiste Onofré 
>>> wrote:
>>>
>>>> Thanks, let me take a look.
>>>>
>>>> Regards
>>>> JB
>>>>
>>>> On 18/09/2018 17:36, Romain Manni-Bucau wrote:
>>>> >
>>>> >
>>>> >
>>>> > On Tue, Sep 18, 2018 at 16:44, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>> >
>>>> > Hi,
>>>> >
>>>> > I don't have the issue ;)
>>>> >
>>>> > As said in my vote, I tested 2.7.0 RC1 on beam-samples with Spark
>>>> > without problem.
>>>> >
>>>> > I can't reproduce Romain's issue either.
>>>> >
>>>> > @Romain can you provide some details to reproduce the issue ?
>>>> >
>>>> >
>>>> > Sure, you can use this
>>>> > reproducer

Re: [VOTE] Release 2.7.0, release candidate #1

2018-09-20 Thread Charles Chen

Hey Romain and JB, do you have any progress on this?  One thing I would
like to point out is that 2.7.0 isn't yet pushed to Maven Central, so
referring to it by version is not expected to work (and it looks like this
is what is done in your repo:
https://github.com/rmannibucau/beam-2.7.0-fails).  Luke indicated above
that he doesn't see any dependency changes.  Can you isolate and reproduce
this problem so that we can develop a fix, if necessary?  I would like to
proceed with an RC2 as soon as possible.

On Wed, Sep 19, 2018 at 6:37 AM Romain Manni-Bucau 
wrote:

> Quick update on the Spark issue: I didn't get enough time to identify it
> clearly, but I managed to get a passing run of my test by changing a bunch
> of versions.
> I suspect my code triggers a class conflict between Spark and my shaded
> jar, leading to a serialization issue. I didn't test Spark's
> userClassPathFirst option, but it could be interesting to enable in the
> Beam runner. However, it is still very confusing that the job stops
> running just from upgrading the Beam version, and the Spark error is very
> hard to understand.
>
> Romain Manni-Bucau
> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
> <https://rmannibucau.metawerx.net/> | Old Blog
> <http://rmannibucau.wordpress.com> | Github
> <https://github.com/rmannibucau> | LinkedIn
> <https://www.linkedin.com/in/rmannibucau> | Book
> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>
>
> On Tue, Sep 18, 2018 at 20:17, Lukasz Cwik wrote:
>
>> Romain hinted that this was a dependency issue but when comparing the two
>> dependency trees I don't get much of a difference:
>>
>> lcwik@lcwik0: ~$ diff /tmp/260 /tmp/270
>> < [INFO] +- org.apache.beam:beam-runners-spark:jar:2.6.0:compile
>> < [INFO] |  +- org.apache.beam:beam-model-pipeline:jar:2.6.0:compile
>> ---
>> > [INFO] +- org.apache.beam:beam-runners-spark:jar:2.7.0:compile
>> > [INFO] |  +- org.apache.beam:beam-model-pipeline:jar:2.7.0:compile
>> 5c6
>> < [INFO] |  +- org.apache.beam:beam-sdks-java-core:jar:2.6.0:compile
>> ---
>> > [INFO] |  +- org.apache.beam:beam-sdks-java-core:jar:2.7.0:compile
>> 14,18c15,19
>> < [INFO] |  |  \- org.tukaani:xz:jar:1.5:compile
>> < [INFO] |  +-
>> org.apache.beam:beam-runners-core-construction-java:jar:2.6.0:compile
>> < [INFO] |  |  \-
>> org.apache.beam:beam-model-job-management:jar:2.6.0:compile
>> < [INFO] |  +- org.apache.beam:beam-runners-core-java:jar:2.6.0:compile
>> < [INFO] |  |  \-
>> org.apache.beam:beam-model-fn-execution:jar:2.6.0:compile
>> ---
>> > [INFO] |  |  \- org.tukaani:xz:jar:1.8:compile
>> > [INFO] |  +-
>> org.apache.beam:beam-runners-core-construction-java:jar:2.7.0:compile
>> > [INFO] |  |  \-
>> org.apache.beam:beam-model-job-management:jar:2.7.0:compile
>> > [INFO] |  +- org.apache.beam:beam-runners-core-java:jar:2.7.0:compile
>> > [INFO] |  |  \-
>> org.apache.beam:beam-model-fn-execution:jar:2.7.0:compile
>>
>> Other than Beam package changes, the only other change is xz which I
>> don't believe could be causing the issue.
>>
>> On Tue, Sep 18, 2018 at 8:38 AM Jean-Baptiste Onofré 
>> wrote:
>>
>>> Thanks, let me take a look.
>>>
>>> Regards
>>> JB
>>>
>>> On 18/09/2018 17:36, Romain Manni-Bucau wrote:
>>> >
>>> >
>>> >
>>> > On Tue, Sep 18, 2018 at 16:44, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>> >
>>> > Hi,
>>> >
>>> > I don't have the issue ;)
>>> >
>>> > As said in my vote, I tested 2.7.0 RC1 on beam-samples with Spark
>>> > without problem.
>>> >
>>> > I can't reproduce Romain's issue either.
>>> >
>>> > @Romain can you provide some details to reproduce the issue ?
>>> >
>>> >
>>> > Sure, you can use this
>>> > reproducer: https://github.com/rmannibucau/beam-2.7.0-fails
>>> > It shows that it succeeds on 2.6 and fails on 2.7.
>>> >
>>> >
>>> >
>>> > Regards
>>> > JB
>>> >
>>> > On 17/09/2018 19:17, Charles Chen wrote:
>>> > > Luke, Maximillian, Raghu, can you please propose cherry-pick PRs
>>> > to the
>>> > > release-2.7.0 for your issues and add me as a reviewer
>>> > (@charlesccychen)?
>>> > >
>>> > > Romain, J

Re: [DISCUSS] Committer Guidelines / Hygiene before merging PRs

2018-09-19 Thread Charles Chen

What I mean is that the first-parent commit you find with "git log
--first-parent" is the PR's merge commit, which incorporates any and all
fixup commits, so we don't need to worry about missing any.

On Wed, Sep 19, 2018, 1:07 PM Maximilian Michels  wrote:

> Generally, +1 for isolated commits which are easy to revert.
>
> > I don't think it's actually harder to roll back a set of commits that
> are merged together.
> I think Thomas was mainly concerned about "fixup" commits to land in
> master (as part of a merge). These indeed make reverting commits more
> difficult because you have to check whether you missed a "fixup".
>
> > Ideally every commit should compile and pass tests though, right?
>
> That is definitely what we should strive for when doing a merge against
> master.
>
> > Perhaps the bigger issue is that we need better documentation and a
> playbook on how to do these common tasks in git.
>
> We do actually have basic documentation about this but most people don't
> read it. For example, the commit message of a Merge commit should be:
>
> Merge pull request #: [BEAM-] Issue title
>
> But most merge commits don't comply with this rule :) See
> https://beam.apache.org/contribute/committer-guide/#merging-it
>
> On 19.09.18 21:34, Reuven Lax wrote:
> > Ideally every commit should compile and pass tests though, right?
> >
> > On Wed, Sep 19, 2018 at 12:15 PM Ankur Goenka <goe...@google.com> wrote:
> >
> > I agree with the cleanliness of the Commit history.
> > Commit messages of the "Fixup!", "Address comments", "Address even more
> > comments" type do not convey meaningful information and are not very
> > useful. It's a good idea to squash them.
> > However, I think it's OK to keep separate commits for different
> > logical pieces of the code, which makes reviewing and revisiting code
> > easier.
> > Example PR: Support X in the pipeline
> > Commit 1: Restructuring a bunch of code without any logical change.
> > Commit 2: Changing validation logic for pipeline.
> > Commit 3: Supporting new field "X" for pipeline.
> >
> > On Wed, Sep 19, 2018 at 11:27 AM Charles Chen <c...@google.com> wrote:
> >
> > To be concrete, it is very easy to revert a commit in any case:
> >
> >  1. First, use "git log --first-parent" to find the first-parent
> > commit corresponding to a PR merge (this is a one-to-one
> > correspondence).
> >  2. Use "git revert -m 1 " to revert the commit; this
> > selects the first parent as the base for a merge commit (in
> > the case where a single commit needs to be reverted, just
> > use "git revert " without the "-m 1" flag).
> >
> > In any case, as a general good engineering practice, I do agree
> > that it is highly desirable to have small independent PRs
> > instead of large jumbo PRs whenever possible.
> >
> > On Wed, Sep 19, 2018 at 11:20 AM Charles Chen <c...@google.com> wrote:
> >
> > I don't think it's actually harder to roll back a set of
> > commits that are merged together.  Git has the notion of
> > first-parent commits (you can see, for example, "git log
> > --first-parent", which filters out the intermediate
> > commits).  In this sense, PRs still get merged as one unit
> > and this is preserved even if intermediate commits are
> > kept.  Perhaps the bigger issue is that we need better
> > documentation and a playbook on how to do these common
> > tasks in git.
> >
> > On Wed, Sep 19, 2018 at 9:27 AM Thomas Weise <t...@apache.org> wrote:
> >
> > Wanted to bring this up as reminder as well as
> > opportunity to discuss potential changes to our
> > committer guide. It has been a while since last related
> > discussion and we welcomed several new committers since
> > then.
> >
> > Finishing up pull requests pre-merge:
> >
> >
> https://beam.apache.org/contribute/committer-guide/#finishing-touches
> >
> > PRs are worked on over time and may accumulate many
> > commits. Sometimes because scope expands, sometimes just

Re: [DISCUSS] Committer Guidelines / Hygiene before merging PRs

2018-09-19 Thread Charles Chen

To be concrete, it is very easy to revert a commit in any case:

   1. First, use "git log --first-parent" to find the first-parent commit
   corresponding to a PR merge (this is a one-to-one correspondence).
   2. Use "git revert -m 1 " to revert the commit; this selects
   the first parent as the base for a merge commit (in the case where a single
   commit needs to be reverted, just use "git revert " without the
   "-m 1" flag).

In any case, as a general good engineering practice, I do agree that it is
highly desirable to have small independent PRs instead of large jumbo PRs
whenever possible.
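
The revert steps above can be tried end to end in a throwaway repository.
This is a minimal sketch, assuming git is on the PATH. The repository layout,
file names, and commit messages are made up for illustration, and only the
"git log --first-parent" and "git revert -m 1" invocations come from the
steps themselves:

```python
import os
import shutil
import subprocess
import tempfile


def git(repo, *args):
    """Run a git command inside `repo` and return its stdout as text."""
    env = dict(os.environ,
               GIT_AUTHOR_NAME="demo", GIT_AUTHOR_EMAIL="demo@example.com",
               GIT_COMMITTER_NAME="demo", GIT_COMMITTER_EMAIL="demo@example.com")
    done = subprocess.run(["git", *args], cwd=repo, env=env, check=True,
                          capture_output=True, text=True)
    return done.stdout.strip()


def commit_file(repo, name, text, message):
    """Write `text` to `name` and commit it with `message`."""
    with open(os.path.join(repo, name), "w") as handle:
        handle.write(text)
    git(repo, "add", name)
    git(repo, "commit", "-q", "-m", message)


def demo():
    """Merge a two-commit branch, then revert the whole merge in one step."""
    if shutil.which("git") is None:
        raise RuntimeError("this sketch needs git on the PATH")
    with tempfile.TemporaryDirectory() as repo:
        git(repo, "init", "-q")
        commit_file(repo, "base.txt", "base\n", "Initial commit")
        mainline = git(repo, "rev-parse", "--abbrev-ref", "HEAD")

        # A "PR" branch with a real commit plus a fixup commit.
        git(repo, "checkout", "-q", "-b", "feature")
        commit_file(repo, "feature.txt", "v1\n", "Add feature")
        commit_file(repo, "feature.txt", "v2\n", "Fixup! address comments")

        # Merge it as one unit, keeping the intermediate commits.
        git(repo, "checkout", "-q", mainline)
        git(repo, "merge", "-q", "--no-ff", "-m", "Merge hypothetical PR", "feature")

        # Step 1: the first-parent history shows one entry per merge.
        first_parent = git(repo, "log", "--first-parent", "--format=%s").splitlines()
        everything = git(repo, "log", "--format=%s").splitlines()

        # Step 2: revert the merge commit relative to its first parent.
        merge_sha = git(repo, "rev-parse", "HEAD")
        git(repo, "revert", "--no-edit", "-m", "1", merge_sha)
        feature_gone = not os.path.exists(os.path.join(repo, "feature.txt"))
        return first_parent, everything, feature_gone
```

Calling demo() returns the first-parent subjects, the full list of subjects,
and whether the merged file disappeared after the revert, so the fixup commit
is undone together with the rest of the merge.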

On Wed, Sep 19, 2018 at 11:20 AM Charles Chen  wrote:

> I don't think it's actually harder to roll back a set of commits that are
> merged together.  Git has the notion of first-parent commits (you can see,
> for example, "git log --first-parent", which filters out the intermediate
> commits).  In this sense, PRs still get merged as one unit and this is
> preserved even if intermediate commits are kept.  Perhaps the bigger issue
> is that we need better documentation and a playbook on how to do these
> common tasks in git.
>
> On Wed, Sep 19, 2018 at 9:27 AM Thomas Weise  wrote:
>
>> Wanted to bring this up as reminder as well as opportunity to discuss
>> potential changes to our committer guide. It has been a while since last
>> related discussion and we welcomed several new committers since then.
>>
>> Finishing up pull requests pre-merge:
>>
>> https://beam.apache.org/contribute/committer-guide/#finishing-touches
>>
>> PRs are worked on over time and may accumulate many commits. Sometimes
>> because scope expands, sometimes just to separate independent changes but
>> most of the time the commits are just fixups that are added as review
>> progresses.
>>
>> It is important that the latter get squashed prior to PR merge, as
>> otherwise we lose the ability to roll back changes by reverting a single
>> commit and also generally cause a lot of noise in the commit history that
>> does not help other contributors. To be clear, I refer to the "Fixup!",
>> "Address comments", "Address even more comments" type of entries :)
>>
>> I would also propose that every commit gets tagged with a JIRA (except
>> those fixups that will be squashed). Having the JIRA and possibly other
>> tags makes it easier for others not involved in the PR to identify changes
>> after they were merged, for example when looking at the revision history or
>> annotated source.
>>
>> As for other scenarios of jumbo PRs with many commits, there are probably
>> situations where work needs to be broken down into smaller units, making
>> life better for both contributor and reviewer(s). Ideally, every PR would
>> have only one commit, but that may be a bit much to mandate? Is the general
>> expectation something we need to document more clearly?
>>
>> Thanks,
>> Thomas
>>
>>

Re: [DISCUSS] Committer Guidelines / Hygene before merging PRs

2018-09-19 Thread Charles Chen

I don't think it's actually harder to roll back a set of commits that are
merged together.  Git has the notion of first-parent commits (you can see,
for example, "git log --first-parent", which filters out the intermediate
commits).  In this sense, PRs still get merged as one unit and this is
preserved even if intermediate commits are kept.  Perhaps the bigger issue
is that we need better documentation and a playbook on how to do these
common tasks in git.
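
To make the first-parent idea concrete, here is a minimal sketch that models a commit history as a plain dictionary and walks it the way "git log --first-parent" is described above. The commit names and the `first_parent_log` helper are invented for illustration; this is not git's actual implementation.

```python
from typing import Dict, List, Optional

# Toy history: each commit maps to its parents, first parent first.
# All commit names are made up for this example.
PARENTS: Dict[str, List[str]] = {
    "merge-pr-2": ["merge-pr-1", "pr2-fixup"],
    "pr2-fixup": ["pr2-feature"],
    "pr2-feature": ["merge-pr-1"],
    "merge-pr-1": ["root", "pr1-address-comments"],
    "pr1-address-comments": ["pr1-feature"],
    "pr1-feature": ["root"],
    "root": [],
}


def first_parent_log(head: str, parents: Dict[str, List[str]]) -> List[str]:
    """Walk back from `head`, following only the first parent of each commit."""
    history: List[str] = []
    current: Optional[str] = head
    while current is not None:
        history.append(current)
        commit_parents = parents[current]
        current = commit_parents[0] if commit_parents else None
    return history


print(first_parent_log("merge-pr-2", PARENTS))
# → ['merge-pr-2', 'merge-pr-1', 'root']
```

The intermediate "fixup" and "address comments" commits stay in the history, but they sit behind the second parent of each merge, so the first-parent walk only sees one entry per merged PR.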

On Wed, Sep 19, 2018 at 9:27 AM Thomas Weise  wrote:

> Wanted to bring this up as a reminder as well as an opportunity to discuss
> potential changes to our committer guide. It has been a while since the last
> related discussion and we have welcomed several new committers since then.
>
> Finishing up pull requests pre-merge:
>
> https://beam.apache.org/contribute/committer-guide/#finishing-touches
>
> PRs are worked on over time and may accumulate many commits. Sometimes
> because scope expands, sometimes just to separate independent changes but
> most of the time the commits are just fixups that are added as review
> progresses.
>
> It is important that the latter get squashed prior to PR merge, as
> otherwise we lose the ability to roll back changes by reverting a single
> commit and also generally cause a lot of noise in the commit history that
> does not help other contributors. To be clear, I refer to the "Fixup!",
> "Address comments", "Address even more comments" type of entries :)
>
> I would also propose that every commit gets tagged with a JIRA (except
> those fixups that will be squashed). Having the JIRA and possibly other
> tags makes it easier for others not involved in the PR to identify changes
> after they were merged, for example when looking at the revision history or
> annotated source.
>
> As for other scenarios of jumbo PRs with many commits, there are probably
> situations where work needs to be broken down into smaller units, making
> life better for both contributor and reviewer(s). Ideally, every PR would
> have only one commit, but that may be a bit much to mandate? Is the general
> expectation something we need to document more clearly?
>
> Thanks,
> Thomas
>
>

Re: Proposal for Beam Python User State and Timer APIs

2018-09-18 Thread Charles Chen

An update: the reference DirectRunner implementation of (and common
execution code for) the Python user state and timers API has been merged:
https://github.com/apache/beam/pull/6304

On Thu, Aug 30, 2018 at 1:48 AM Charles Chen  wrote:

> Another update: the reference DirectRunner implementation of the Python
> user state and timers API is out for review:
> https://github.com/apache/beam/pull/6304
>
> On Mon, Jul 9, 2018 at 2:18 PM Charles Chen  wrote:
>
>> An update: https://github.com/apache/beam/pull/5691 has been merged.  I
>> hope to send out a reference implementation in the DirectRunner soon.  On
>> the roadmap after that is work on the relevant portability interfaces here
>> so we can get this working on runners like Beam Python on Flink.
>>
>> On Wed, Jun 20, 2018 at 10:00 AM Charles Chen  wrote:
>>
>>> An update on the implementation: I recently sent out the user-facing
>>> pipeline construction part of the API implementation out for review:
>>> https://github.com/apache/beam/pull/5691.
>>>
>>> On Tue, Jun 5, 2018 at 5:26 PM Charles Chen  wrote:
>>>
>>>> Thanks everyone for contributing here.  We've reached rough consensus
>>>> on the approach we should take with this API, and I've summarized this in
>>>> the new "Community consensus" sections I added to the doc (
>>>> https://s.apache.org/beam-python-user-state-and-timers).  I will begin
>>>> initial implementation of this API soon.
>>>>
>>>> On Wed, May 23, 2018 at 8:08 PM Thomas Weise  wrote:
>>>>
>>>>> Nice proposal; it's exciting to see this about to be added to the SDK
>>>>> as it enables a set of more complex use cases.
>>>>>
>>>>> I also think that some of the content can later be repurposed as user
>>>>> documentation.
>>>>>
>>>>> Thanks,
>>>>> Thomas
>>>>>
>>>>>
>>>>> On Wed, May 23, 2018 at 11:49 AM, Charles Chen  wrote:
>>>>>
>>>>>> Thanks everyone for the detailed comments and discussions.  It looks
>>>>>> like by now, we mostly agree with the requirements and overall direction
>>>>>> needed for the API, though there is continuing discussion on specific
>>>>>> details.  I want to highlight two new sections of the doc, which address
>>>>>> some discussions that have come up:
>>>>>>
>>>>>>    - *Existing state and transactionality*: this section addresses
>>>>>>      how we will address an existing transactionality inconsistency in the
>>>>>>      existing Java API.
>>>>>>      (https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ofyl9jspiz3b)
>>>>>>    - *State for merging windows*: this section addresses how we will
>>>>>>      deal with non-combinable state in conjunction with merging windows.
>>>>>>      (https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ctxkcgabtzpy)
>>>>>>
>>>>>> Let me know any further comments and suggestions.
>>>>>>
>>>>>> On Tue, May 22, 2018 at 9:29 AM Kenneth Knowles 
>>>>>> wrote:
>>>>>>
>>>>>>> Nice. I know that Java users have found it helpful to have this
>>>>>>> lower-level way of writing pipelines when the high-level primitives
>>>>>>> don't quite have the tight control they are looking for. I hope it
>>>>>>> will be a big draw for Python, too.
>>>>>>>
>>>>>>> (commenting on the doc)
>>>>>>>
>>>>>>> Kenn
>>>>>>>
>>>>>>> On Mon, May 21, 2018 at 5:15 PM Charles Chen  wrote:
>>>>>>>
>>>>>>>> I want to share a proposal for adding user state and timer support
>>>>>>>> to the Beam Python SDK and get the community's thoughts on how such
>>>>>>>> an API should look: https://s.apache.org/beam-python-user-state-and-timers
>>>>>>>>
>>>>>>>> Let me know what you think and please add any comments and
>>>>>>>> suggestions you may have.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Charles
>>>>>>>>
>>>>>>>
>>>>>

Re: [VOTE] Release 2.7.0, release candidate #1

2018-09-17 Thread Charles Chen

Luke, Maximillian, Raghu, can you please propose cherry-pick PRs to the
release-2.7.0 for your issues and add me as a reviewer (@charlesccychen)?

Romain, JB: is there any way I can help with debugging the issue you're
facing so we can unblock the release?

On Fri, Sep 14, 2018 at 1:49 PM Raghu Angadi  wrote:

> I would like to propose one more cherry-pick for RC2:
> https://github.com/apache/beam/pull/6391
> This is a KafkaIO bug fix. Once a user hits this bug, there is no easy
> workaround for them, especially on Dataflow. The only workaround on Dataflow
> is to restart or reload the job.
>
> The fix itself is fairly safe and is tested.
> Raghu.
>
> On Fri, Sep 14, 2018 at 12:52 AM Alexey Romanenko <
> aromanenko@gmail.com> wrote:
>
>> Perhaps this helps: I ran a simple WordCount (built with Beam 2.7)
>> on a YARN/Spark (HDP Sandbox) cluster and it worked fine for me.
>>
>> On 14 Sep 2018, at 06:56, Romain Manni-Bucau 
>> wrote:
>>
>> Hi Charles,
>>
>> I didn't get enough time to check deeply, but it is clearly a dependency
>> issue, and it is not in the Beam Spark runner itself but in another transitive
>> module of Beam. It does not happen in the existing Spark tests because none of
>> them run in a cluster (even one with just 1 worker), but this seems to be a
>> regression since 2.6 works OOTB.
>>
>> Romain Manni-Bucau
>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>> <https://rmannibucau.metawerx.net/> | Old Blog
>> <http://rmannibucau.wordpress.com/> | Github
>> <https://github.com/rmannibucau> | LinkedIn
>> <https://www.linkedin.com/in/rmannibucau> | Book
>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>
>>
>> On Thu, Sep 13, 2018 at 22:15, Charles Chen wrote:
>>
>>> Romain and JB, can you please add the results of your investigations
>>> into the errors you've seen above?  Given that the existing SparkRunner
>>> tests pass for this RC, and that the integration test you ran is in another
>>> repo that is not continuously tested with Beam, it is not clear how we
>>> should move forward and whether this is a blocking issue, unless we can
>>> find a root cause in Beam.
>>>
>>> On Wed, Sep 12, 2018 at 2:08 AM Etienne Chauchot 
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> on a performance and functional regression stand point I see no
>>>> regression:
>>>>
>>>> I looked at nexmark graphs "output pcollection size" and "execution
>>>> time" around release cut date on dataflow, spark, flink and direct runner
>>>> in batch and streaming modes. There seems to be no regression.
>>>>
>>>> Etienne
>>>>
>>>> On Tuesday, September 11, 2018 at 12:25 -0700, Charles Chen wrote:
>>>>
>>>> The SparkRunner validation test (here:
>>>> https://beam.apache.org/contribute/release-guide/#run-validation-tests)
>>>> passes on my machine.  It looks like we are likely missing test coverage
>>>> where Romain is hitting issues.
>>>>
>>>> On Tue, Sep 11, 2018 at 12:15 PM Ahmet Altay  wrote:
>>>>
>>>> Could anyone else help with looking at these issues earlier?
>>>>
>>>> On Tue, Sep 11, 2018 at 12:03 PM, Romain Manni-Bucau <
>>>> rmannibu...@gmail.com> wrote:
>>>>
>>>> I'm running this main [1] through this IT [2]. It had been working fine for ~1
>>>> year, but 2.7.0 broke it. I didn't investigate further but can have a look later
>>>> this month if it helps.
>>>>
>>>> [1]
>>>> https://github.com/Talend/component-runtime/blob/master/component-runtime-beam/src/it/serialization-over-cluster/src/main/java/org/talend/sdk/component/beam/it/clusterserialization/Main.java
>>>> [2]
>>>> https://github.com/Talend/component-runtime/blob/master/component-runtime-beam/src/it/serialization-over-cluster/src/test/java/org/talend/sdk/component/beam/it/SerializationOverClusterIT.java
>>>>
>>>> On Tue, Sep 11, 2018 at 20:54, Charles Chen wrote:
>>>>
>>>> Romain: can you give more details on the failure you're encountering,
>>>> i.e. how you are performing this validation?
>>>>
>>>> On Tue, Sep 11, 2018 at 9:36 AM Jean-Baptiste Onofré 
>>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> weird, I didn't have it on Beam samples. Let me try to reproduce and I
>>>> will create th

Re: [VOTE] Release 2.7.0, release candidate #1

2018-09-13 Thread Charles Chen

Romain and JB, can you please add the results of your investigations into
the errors you've seen above?  Given that the existing SparkRunner tests
pass for this RC, and that the integration test you ran is in another repo
that is not continuously tested with Beam, it is not clear how we should
move forward and whether this is a blocking issue, unless we can find a
root cause in Beam.

On Wed, Sep 12, 2018 at 2:08 AM Etienne Chauchot 
wrote:

> Hi all,
>
> on a performance and functional regression stand point I see no regression:
>
> I looked at nexmark graphs "output pcollection size" and "execution time"
> around release cut date on dataflow, spark, flink and direct runner in
> batch and streaming modes. There seems to be no regression.
>
> Etienne
>
> On Tuesday, September 11, 2018 at 12:25 -0700, Charles Chen wrote:
>
> The SparkRunner validation test (here:
> https://beam.apache.org/contribute/release-guide/#run-validation-tests)
> passes on my machine.  It looks like we are likely missing test coverage
> where Romain is hitting issues.
>
> On Tue, Sep 11, 2018 at 12:15 PM Ahmet Altay  wrote:
>
> Could anyone else help with looking at these issues earlier?
>
> On Tue, Sep 11, 2018 at 12:03 PM, Romain Manni-Bucau <
> rmannibu...@gmail.com> wrote:
>
> I'm running this main [1] through this IT [2]. It had been working fine for ~1
> year, but 2.7.0 broke it. I didn't investigate further but can have a look later
> this month if it helps.
>
> [1]
> https://github.com/Talend/component-runtime/blob/master/component-runtime-beam/src/it/serialization-over-cluster/src/main/java/org/talend/sdk/component/beam/it/clusterserialization/Main.java
> [2]
> https://github.com/Talend/component-runtime/blob/master/component-runtime-beam/src/it/serialization-over-cluster/src/test/java/org/talend/sdk/component/beam/it/SerializationOverClusterIT.java
>
> On Tue, Sep 11, 2018 at 20:54, Charles Chen wrote:
>
> Romain: can you give more details on the failure you're encountering, i.e.
> how you are performing this validation?
>
> On Tue, Sep 11, 2018 at 9:36 AM Jean-Baptiste Onofré 
> wrote:
>
> Hi,
>
> weird, I didn't have it on Beam samples. Let me try to reproduce and I
> will create the Jira.
>
> Regards
> JB
>
> On 11/09/2018 11:44, Romain Manni-Bucau wrote:
> > -1, seems spark integration is broken (tested with spark 2.3.1 and
> > 2.2.1):
> >
> > 18/09/11 11:33:29 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID
> > 0, RMANNIBUCAU, executor 0): java.lang.ClassCastException: cannot assign
> > instance of scala.collection.immutable.List$SerializationProxy to field
> > org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_
> > of type scala.collection.Seq in instance of
> > org.apache.spark.rdd.MapPartitionsRDD
> >   at
> > java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2233)
> >
> >
> > Also the issue Lukasz identified is important even if workarounds can be
> > put in place so +1 to fix it as well if possible.
> >
> > Romain Manni-Bucau
> > @rmannibucau <https://twitter.com/rmannibucau> | Blog
> > <https://rmannibucau.metawerx.net/> | Old Blog
> > <http://rmannibucau.wordpress.com> | Github
> > <https://github.com/rmannibucau> | LinkedIn
> > <https://www.linkedin.com/in/rmannibucau> | Book
> > <https://www.packtpub.com/application-development/java-ee-8-high-performance>
> >
> >
> > On Mon, Sep 10, 2018 at 20:48, Lukasz Cwik <lc...@google.com> wrote:
> >
> > I found an issue where we are no longer packaging the pom.xml within
> > the artifact jars at META-INF/maven/groupId/artifactId. More details
> > in https://issues.apache.org/jira/browse/BEAM-5351. I wouldn't
> > consider this a blocker but it was an easy fix
> > (https://github.com/apache/beam/pull/6358) and users may rely on the
> > pom.xml.
> >
> > Should we recut the release candidate to include this?
> >
> > On Mon, Sep 10, 2018 at 4:58 AM Jean-Baptiste Onofré
> > <j...@nanthrax.net> wrote:
> >
> > +1 (binding)
> >
> > Tested successfully on Beam Samples.
> >
> > Thanks !
> >
> > Regards
> > JB
> >
> > On 07/09/2018 23:56, Charles Chen wrote:
> >  > Hi everyone,
> >  >
> >  > Please review and vote on the release candidate #1 for the
> > version
> >  > 2.7.0, as follows:
> >

Re: [VOTE] Release 2.7.0, release candidate #1

2018-09-11 Thread Charles Chen

The SparkRunner validation test (here:
https://beam.apache.org/contribute/release-guide/#run-validation-tests)
passes on my machine.  It looks like we are likely missing test coverage
where Romain is hitting issues.

On Tue, Sep 11, 2018 at 12:15 PM Ahmet Altay  wrote:

> Could anyone else help with looking at these issues earlier?
>
> On Tue, Sep 11, 2018 at 12:03 PM, Romain Manni-Bucau <
> rmannibu...@gmail.com> wrote:
>
>> I'm running this main [1] through this IT [2]. It had been working fine for ~1
>> year, but 2.7.0 broke it. I didn't investigate further but can have a look later
>> this month if it helps.
>>
>> [1]
>> https://github.com/Talend/component-runtime/blob/master/component-runtime-beam/src/it/serialization-over-cluster/src/main/java/org/talend/sdk/component/beam/it/clusterserialization/Main.java
>> [2]
>> https://github.com/Talend/component-runtime/blob/master/component-runtime-beam/src/it/serialization-over-cluster/src/test/java/org/talend/sdk/component/beam/it/SerializationOverClusterIT.java
>>
>> On Tue, Sep 11, 2018 at 20:54, Charles Chen wrote:
>>
>>> Romain: can you give more details on the failure you're encountering,
>>> i.e. how you are performing this validation?
>>>
>>> On Tue, Sep 11, 2018 at 9:36 AM Jean-Baptiste Onofré 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> weird, I didn't have it on Beam samples. Let me try to reproduce and I
>>>> will create the Jira.
>>>>
>>>> Regards
>>>> JB
>>>>
>>>> On 11/09/2018 11:44, Romain Manni-Bucau wrote:
>>>> > -1, seems spark integration is broken (tested with spark 2.3.1 and
>>>> > 2.2.1):
>>>> >
>>>> > 18/09/11 11:33:29 WARN TaskSetManager: Lost task 0.0 in stage 0.0
>>>> > (TID 0, RMANNIBUCAU, executor 0): java.lang.ClassCastException: cannot
>>>> > assign instance of scala.collection.immutable.List$SerializationProxy to
>>>> > field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_
>>>> > of type scala.collection.Seq in instance of
>>>> > org.apache.spark.rdd.MapPartitionsRDD
>>>> >   at
>>>> > java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2233)
>>>> >
>>>> >
>>>> > Also the issue Lukasz identified is important even if workarounds can
>>>> be
>>>> > put in place so +1 to fix it as well if possible.
>>>> >
>>>> > Romain Manni-Bucau
>>>> > @rmannibucau <https://twitter.com/rmannibucau> | Blog
>>>> > <https://rmannibucau.metawerx.net/> | Old Blog
>>>> > <http://rmannibucau.wordpress.com> | Github
>>>> > <https://github.com/rmannibucau> | LinkedIn
>>>> > <https://www.linkedin.com/in/rmannibucau> | Book
>>>> > <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>> >
>>>> >
>>>> > On Mon, Sep 10, 2018 at 20:48, Lukasz Cwik <lc...@google.com> wrote:
>>>> >
>>>> > I found an issue where we are no longer packaging the pom.xml
>>>> within
>>>> > the artifact jars at META-INF/maven/groupId/artifactId. More
>>>> details
>>>> > in https://issues.apache.org/jira/browse/BEAM-5351. I wouldn't
>>>> > consider this a blocker but it was an easy fix
>>>> > (https://github.com/apache/beam/pull/6358) and users may rely on
>>>> the
>>>> > pom.xml.
>>>> >
>>>> > Should we recut the release candidate to include this?
>>>> >
>>>> > On Mon, Sep 10, 2018 at 4:58 AM Jean-Baptiste Onofré
>>>> > <j...@nanthrax.net> wrote:
>>>> >
>>>> > +1 (binding)
>>>> >
>>>> > Tested successfully on Beam Samples.
>>>> >
>>>> > Thanks !
>>>> >
>>>> > Regards
>>>> > JB
>>>> >
>>>> > On 07/09/2018 23:56, Charles Chen wrote:
>>>> >  > Hi everyone,
>>>> >  >
>>>> >  > Please review and vote on the release candidate #1 for the
>>>> > version
>>>> >  > 2

Re: [VOTE] Release 2.7.0, release candidate #1

2018-09-11 Thread Charles Chen

Romain: can you give more details on the failure you're encountering, i.e.
how you are performing this validation?

On Tue, Sep 11, 2018 at 9:36 AM Jean-Baptiste Onofré 
wrote:

> Hi,
>
> weird, I didn't have it on Beam samples. Let me try to reproduce and I
> will create the Jira.
>
> Regards
> JB
>
> On 11/09/2018 11:44, Romain Manni-Bucau wrote:
> > -1, seems spark integration is broken (tested with spark 2.3.1 and
> > 2.2.1):
> >
> > 18/09/11 11:33:29 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID
> > 0, RMANNIBUCAU, executor 0): java.lang.ClassCastException: cannot assign
> > instance of scala.collection.immutable.List$SerializationProxy to field
> > org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_
> > of type scala.collection.Seq in instance of
> > org.apache.spark.rdd.MapPartitionsRDD
> >   at
> > java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2233)
> >
> >
> > Also the issue Lukasz identified is important even if workarounds can be
> > put in place so +1 to fix it as well if possible.
> >
> > Romain Manni-Bucau
> > @rmannibucau <https://twitter.com/rmannibucau> | Blog
> > <https://rmannibucau.metawerx.net/> | Old Blog
> > <http://rmannibucau.wordpress.com> | Github
> > <https://github.com/rmannibucau> | LinkedIn
> > <https://www.linkedin.com/in/rmannibucau> | Book
> > <https://www.packtpub.com/application-development/java-ee-8-high-performance>
> >
> >
> > On Mon, Sep 10, 2018 at 20:48, Lukasz Cwik <lc...@google.com> wrote:
> >
> > I found an issue where we are no longer packaging the pom.xml within
> > the artifact jars at META-INF/maven/groupId/artifactId. More details
> > in https://issues.apache.org/jira/browse/BEAM-5351. I wouldn't
> > consider this a blocker but it was an easy fix
> > (https://github.com/apache/beam/pull/6358) and users may rely on the
> > pom.xml.
> >
> > Should we recut the release candidate to include this?
> >
> > On Mon, Sep 10, 2018 at 4:58 AM Jean-Baptiste Onofré
> > <j...@nanthrax.net> wrote:
> >
> > +1 (binding)
> >
> > Tested successfully on Beam Samples.
> >
> > Thanks !
> >
> > Regards
> > JB
> >
> > On 07/09/2018 23:56, Charles Chen wrote:
> >  > Hi everyone,
> >  >
> >  > Please review and vote on the release candidate #1 for the
> > version
> >  > 2.7.0, as follows:
> >  > [ ] +1, Approve the release
> >  > [ ] -1, Do not approve the release (please provide specific
> > comments)
> >  >
> >  > The complete staging area is available for your review, which
> > includes:
> >  > * JIRA release notes [1],
> >  > * the official Apache source release to be deployed to
> >  > dist.apache.org [2], which is signed with the key with
> >  > fingerprint 45C60AAAD115F560 [3],
> >  > * all artifacts to be deployed to the Maven Central
> > Repository [4],
> >  > * source code tag "v2.7.0-RC1" [5],
> >  > * website pull request listing the release and publishing the
> API
> >  > reference manual [6].
> >  > * Java artifacts were built with Gradle 4.8 and OpenJDK
> >  > 1.8.0_181-8u181-b13-1~deb9u1-b13.
> >  > * Python artifacts are deployed along with the source release to the
> >  > dist.apache.org [2].
> >  >
> >  > The vote will be open for at least 72 hours. It is adopted by
> > majority
> >  > approval, with at least 3 PMC affirmative votes.
> >  >
> >  > Thanks,
> >  > Charles
> >  >
> >  > [1]
> >  >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12343654
> >  > [2] https://dist.apache.org/repos/dist/dev/beam/2.7.0
> >  > [3] https://dist.apache.org/repos/dist/dev/beam/KEYS
> >  > [4]
> >
> https://repository.apache.org/content/repositories/orgapachebeam-1046/
> >  > [5] https://github.com/apache/beam/tree/v2.7.0-RC1
> >  > [6] https://github.com/apache/beam-site/pull/549
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org <mailto:jbono...@apache.org>
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>

[VOTE] Release 2.7.0, release candidate #1

2018-09-07 Thread Charles Chen

Hi everyone,

Please review and vote on the release candidate #1 for the version 2.7.0,
as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release to be deployed to dist.apache.org [2],
which is signed with the key with fingerprint 45C60AAAD115F560 [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "v2.7.0-RC1" [5],
* website pull request listing the release and publishing the API reference
manual [6].
* Java artifacts were built with Gradle 4.8 and OpenJDK
1.8.0_181-8u181-b13-1~deb9u1-b13.
* Python artifacts are deployed along with the source release to the
dist.apache.org [2].

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.
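
For readers tallying the thread, the adoption rule stated above can be sketched as a small check. This is an illustrative reading, not official tooling: the `Vote` type and `vote_passes` function are hypothetical names, and the assumption that "majority approval" means more +1 than -1 votes among all votes cast is mine, since the message does not spell out who is counted.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Iterable


@dataclass(frozen=True)
class Vote:
    value: int     # +1 to approve, -1 to reject
    binding: bool  # True if cast by a PMC member


def vote_passes(votes: Iterable[Vote], open_for: timedelta) -> bool:
    """Apply the rule from the message: open for at least 72 hours,
    at least 3 affirmative PMC votes, and majority approval.

    Assumption: the majority is taken over all votes cast.
    """
    votes = list(votes)
    approvals = sum(1 for v in votes if v.value > 0)
    rejections = sum(1 for v in votes if v.value < 0)
    binding_approvals = sum(1 for v in votes if v.value > 0 and v.binding)
    return (
        open_for >= timedelta(hours=72)
        and binding_approvals >= 3
        and approvals > rejections
    )


# Made-up tally: three binding +1 votes and one non-binding -1.
example = [Vote(1, True), Vote(1, True), Vote(1, True), Vote(-1, False)]
print(vote_passes(example, open_for=timedelta(hours=72)))
```

In practice a -1 with a technical justification usually leads to a new release candidate rather than a simple count, so a check like this is at most a sanity test on the final tally.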

Thanks,
Charles

[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12343654
[2] https://dist.apache.org/repos/dist/dev/beam/2.7.0
[3] https://dist.apache.org/repos/dist/dev/beam/KEYS
[4] https://repository.apache.org/content/repositories/orgapachebeam-1046/
[5] https://github.com/apache/beam/tree/v2.7.0-RC1
[6] https://github.com/apache/beam-site/pull/549

Re: [PROPOSAL] Prepare Beam 2.7.0 release

2018-09-06 Thread Charles Chen

Making steady progress, but there is an issue compiling with Protobuf for
Python which I am investigating right now.

On Wed, Sep 5, 2018 at 11:13 PM Charles Chen  wrote:

> Thanks!  I talked to Boyuan who indicated that this may be an issue
> specific to the JVM version I am running, since she saw a similar issue
> while performing the 2.5.0 release.  Next step on my end is to set up
> another machine with a different JVM version to run these steps.
>
> On Wed, Sep 5, 2018 at 3:03 AM Alexey Romanenko 
> wrote:
>
>> Charles,
>> Does it happen when you run this task separately or as a part of the
>> whole project build? I run it independently and don’t see any issues on my
>> side. Also, I don’t remember that it failed on Jenkins recently. So,
>> probably this segfault is not related to exactly this task.
>>
>> On 5 Sep 2018, at 06:46, Charles Chen  wrote:
>>
>> I attempted to cut the 2.7.0 RC1 today, but encountered an issue where
>> the JVM consistently segfaulted at
>> the :beam-sdks-java-io-hadoop-input-format:test task.  I will investigate
>> further tomorrow.  Let me know if anyone knows of this issue and has
>> any insight.
>>
>> On Wed, Aug 29, 2018 at 12:21 PM Jean-Baptiste Onofré 
>> wrote:
>>
>>> It sounds good to me. I'm late on my PRs, but they are not release blockers
>>> anyway. I will wait for the next release cycle ;)
>>>
>>> Regards
>>> JB
>>>
>>> On 29/08/2018 21:18, Charles Chen wrote:
>>> > I have cut the release branch for 2.7.0, since there continue to be no
>>> > blockers listed in JIRA.  I will build the first release candidate
>>> soon.
>>> >
>>> > On Mon, Aug 27, 2018 at 10:38 PM Charles Chen <c...@google.com> wrote:
>>> >
>>> > Hey everyone, I want to highlight again to those who missed it that
>>> > if you are aware of any 2.7.0 release blockers, you should add it
>>> as
>>> > a blocker in JIRA with a target version of 2.7.0.  It is very
>>> > helpful to know this information in advance.  As of right now,
>>> there
>>> > are no such blockers listed in JIRA, and I will be cutting the
>>> > release branch on Wednesday 8/29.
>>> >
>>> > Best,
>>> > Charles
>>> >
>>> > On Fri, Aug 24, 2018 at 2:38 PM Charles Chen <c...@google.com> wrote:
>>> >
>>> > Thanks everyone.  Again, we will proceed with the initial
>>> > release cut on August 29.
>>> >
>>> > A reminder to please tag any blocking issues as "Priority:
>>> > Blocker" and "Fix version: 2.7.0" in JIRA.  We recently
>>> > resolved https://issues.apache.org/jira/browse/BEAM-5180, and
>>> > there are no other blocking bugs at the moment.
>>> >
>>> > On Thu, Aug 23, 2018 at 8:19 PM Griselda Cuevas <
>>> g...@google.com
>>> > <mailto:g...@google.com>> wrote:
>>> >
>>> > +1 Thanks for volunteering and keeping us in schedule!
>>> >
>>> >
>>> >
>>> >
>>> > On Thu, 23 Aug 2018 at 11:58, Udi Meiri >> > <mailto:eh...@google.com>> wrote:
>>> >
>>> > +1
>>> >
>>> > On Mon, Aug 20, 2018 at 3:33 PM Boyuan Zhang
>>> > mailto:boyu...@google.com>>
>>> wrote:
>>> >
>>> > +1
>>> > Thanks for volunteering, Charles!
>>> >
>>> > On Mon, Aug 20, 2018 at 3:22 PM Rafael Fernandez
>>> > mailto:rfern...@google.com>>
>>> > wrote:
>>> >
>>> > +1, thanks for volunteering, Charles!
>>> >
>>> > On Mon, Aug 20, 2018 at 12:09 PM Charles Chen
>>> > mailto:c...@google.com>>
>>> wrote:
>>> >
>>> > Thank you Andrew for pointing out my
>>> > mistake.  We should follow the calendar and
>>> > aim to cut on 8/29, not 9/7 as I
>>> incorrectly
>>> > wrote earlier.
>>>

Re: [PROPOSAL] Prepare Beam 2.7.0 release

2018-09-06 Thread Charles Chen

Thanks!  I talked to Boyuan who indicated that this may be an issue
specific to the JVM version I am running, since she saw a similar issue
while performing the 2.5.0 release.  Next step on my end is to set up
another machine with a different JVM version to run these steps.

On Wed, Sep 5, 2018 at 3:03 AM Alexey Romanenko 
wrote:

> Charles,
> Does it happen when you run this task separately or as a part of the whole
> project build? I run it independently and don’t see any issues on my side.
> Also, I don’t remember that it failed on Jenkins recently. So, probably
> this segfault is not related to exactly this task.
>
> On 5 Sep 2018, at 06:46, Charles Chen  wrote:
>
> I attempted to cut the 2.7.0 RC1 today, but encountered an issue where the
> JVM consistently segfaulted at
> the :beam-sdks-java-io-hadoop-input-format:test task.  I will investigate
> further tomorrow.  Let me know if anyone knows of this issue and has
> any insight.
>
> On Wed, Aug 29, 2018 at 12:21 PM Jean-Baptiste Onofré 
> wrote:
>
>> It sounds good to me. I'm late on my PRs, but they are not release blockers
>> anyway. I will wait for the next release cycle ;)
>>
>> Regards
>> JB
>>
>> On 29/08/2018 21:18, Charles Chen wrote:
>> > I have cut the release branch for 2.7.0, since there continue to be no
>> > blockers listed in JIRA.  I will build the first release candidate soon.
>> >
>> > On Mon, Aug 27, 2018 at 10:38 PM Charles Chen <c...@google.com> wrote:
>> >
>> > Hey everyone, I want to highlight again to those who missed it that
>> > if you are aware of any 2.7.0 release blockers, you should add it as
>> > a blocker in JIRA with a target version of 2.7.0.  It is very
>> > helpful to know this information in advance.  As of right now, there
>> >     are no such blockers listed in JIRA, and I will be cutting the
>> > release branch on Wednesday 8/29.
>> >
>> > Best,
>> > Charles
>> >
>> > On Fri, Aug 24, 2018 at 2:38 PM Charles Chen <c...@google.com> wrote:
>> >
>> > Thanks everyone.  Again, we will proceed with the initial
>> > release cut on August 29.
>> >
>> > A reminder to please tag any blocking issues as "Priority:
>> > Blocker" and "Fix version: 2.7.0" in JIRA.  We recently
>> > resolved https://issues.apache.org/jira/browse/BEAM-5180, and
>> > there are no other blocking bugs at the moment.
>> >
>> > On Thu, Aug 23, 2018 at 8:19 PM Griselda Cuevas <
>> g...@google.com
>> > <mailto:g...@google.com>> wrote:
>> >
>> > +1 Thanks for volunteering and keeping us in schedule!
>> >
>> >
>> >
>> >
>> > On Thu, 23 Aug 2018 at 11:58, Udi Meiri > > <mailto:eh...@google.com>> wrote:
>> >
>> > +1
>> >
>> >     On Mon, Aug 20, 2018 at 3:33 PM Boyuan Zhang
>> > mailto:boyu...@google.com>> wrote:
>> >
>> > +1
>> > Thanks for volunteering, Charles!
>> >
>> > On Mon, Aug 20, 2018 at 3:22 PM Rafael Fernandez
>> > mailto:rfern...@google.com>>
>> > wrote:
>> >
>> > +1, thanks for volunteering, Charles!
>> >
>> > On Mon, Aug 20, 2018 at 12:09 PM Charles Chen
>> > mailto:c...@google.com>> wrote:
>> >
>> > Thank you Andrew for pointing out my
>> > mistake.  We should follow the calendar and
>> > aim to cut on 8/29, not 9/7 as I incorrectly
>> > wrote earlier.
>> >
>> > On Mon, Aug 20, 2018 at 12:02 PM Andrew
>> > Pilloud > > <mailto:apill...@google.com>> wrote:
>> >
>> > +1 Thanks for volunteering! The calendar
>> > I have puts the cut date at August 29th,
>> > which looks to be 6 weeks from when
>> > 2.6.0 was cut. Do I have the wrong
>> calendar?
>> >
>> >

Re: Python 3: final step

2018-09-05 Thread Charles Chen

This is great!  Feel free to add me as a reviewer.

On Wed, Sep 5, 2018 at 9:38 AM Andrew Pilloud  wrote:

> Cool! I know very little about Python 3, but happy to help review.
>
> Andrew
>
> On Wed, Sep 5, 2018 at 9:21 AM Ahmet Altay  wrote:
>
>> Thank you Robbe, this is great news!
>>
>> On Wed, Sep 5, 2018 at 9:11 AM, Robbe Sneyders 
>> wrote:
>>
>>> Hi everyone,
>>>
>>> With the merging of [1], we now have Python 3 tests running on Jenkins,
>>> which allows us to move forward with the last step of the Python 3 porting.
>>>
>>> You can follow the progress on the Jira Kanban Board [2]. If you're
>>> interested in helping by porting a module, you can assign one of the issues
>>> to yourself and start coding. You can find the different steps outlined in
>>> the design document [3].
>>>
>>> We could also use some extra reviewers. If you're interested, let us
>>> know, and we'll tag you in our PRs.
>>>
>>> [1] https://github.com/apache/beam/pull/6266
>>> [2] https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245
>>> [3] https://s.apache.org/beam-python-3
>>>
>>> kind regards,
>>> Robbe
>>> --
>>>
>>> * Robbe Sneyders*
>>>
>>> ML6 Gent
>>> 
>>>
>>> M: +32 474 71 31 08
>>>
>>
>>

Re: [PROPOSAL] Prepare Beam 2.7.0 release

2018-09-04 Thread Charles Chen

I attempted to cut the 2.7.0 RC1 today, but encountered an issue where the
JVM consistently segfaulted at
the :beam-sdks-java-io-hadoop-input-format:test task.  I will investigate
further tomorrow.  Let me know if anyone knows of this issue and has
any insight.

On Wed, Aug 29, 2018 at 12:21 PM Jean-Baptiste Onofré 
wrote:

> It sounds good to me. I'm late on my PRs, but they are not release blockers
> anyway. I will wait for the next release cycle ;)
>
> Regards
> JB
>
> On 29/08/2018 21:18, Charles Chen wrote:
> > I have cut the release branch for 2.7.0, since there continue to be no
> > blockers listed in JIRA.  I will build the first release candidate soon.
> >
> > On Mon, Aug 27, 2018 at 10:38 PM Charles Chen <c...@google.com> wrote:
> >
> > Hey everyone, I want to highlight again to those who missed it that
> > if you are aware of any 2.7.0 release blockers, you should add it as
> > a blocker in JIRA with a target version of 2.7.0.  It is very
> > helpful to know this information in advance.  As of right now, there
> > are no such blockers listed in JIRA, and I will be cutting the
> > release branch on Wednesday 8/29.
> >
> > Best,
> > Charles
> >
> > On Fri, Aug 24, 2018 at 2:38 PM Charles Chen <c...@google.com> wrote:
> >
> > Thanks everyone.  Again, we will proceed with the initial
> > release cut on August 29.
> >
> > A reminder to please tag any blocking issues as "Priority:
> > Blocker" and "Fix version: 2.7.0" in JIRA.  We recently
> > resolved https://issues.apache.org/jira/browse/BEAM-5180, and
> > there are no other blocking bugs at the moment.
> >
> > On Thu, Aug 23, 2018 at 8:19 PM Griselda Cuevas <g...@google.com> wrote:
> >
> > +1 Thanks for volunteering and keeping us in schedule!
> >
> >
> >
> >
> > On Thu, 23 Aug 2018 at 11:58, Udi Meiri <eh...@google.com> wrote:
> >
> > +1
> >
> > On Mon, Aug 20, 2018 at 3:33 PM Boyuan Zhang <boyu...@google.com> wrote:
> >
> > +1
> > Thanks for volunteering, Charles!
> >
> > On Mon, Aug 20, 2018 at 3:22 PM Rafael Fernandez <rfern...@google.com> wrote:
> >
> > +1, thanks for volunteering, Charles!
> >
> > On Mon, Aug 20, 2018 at 12:09 PM Charles Chen <c...@google.com> wrote:
> >
> > Thank you Andrew for pointing out my
> > mistake.  We should follow the calendar and
> > aim to cut on 8/29, not 9/7 as I incorrectly
> > wrote earlier.
> >
> > On Mon, Aug 20, 2018 at 12:02 PM Andrew Pilloud <apill...@google.com> wrote:
> >
> > +1 Thanks for volunteering! The calendar
> > I have puts the cut date at August 29th,
> > which looks to be 6 weeks from when
> > 2.6.0 was cut. Do I have the wrong calendar?
> >
> > See:
> > https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com=America%2FLos_Angeles
> >
> > Andrew
> >
> > On Mon, Aug 20, 2018 at 11:43 AM Connell O'Callaghan <conne...@google.com> wrote:
> >
> > +1 Charles thank you for taking this
> > up and helping us maintain this
> > schedule.
> >
> > On Mon, Aug 20, 2018 at 11:29 AM Charles Chen <c...@google.com> wrote:
> >
> > Hey everyone,
> >
> > Our release calendar in

Re: delayed emit (timer) in py-beam?

2018-08-30 Thread Charles Chen

FYI: the reference DirectRunner implementation of the Python user state and
timers API is out for review: https://github.com/apache/beam/pull/6304
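
For readers following the use case in this thread, here is a toy, Beam-free
model of the pattern "buffer an element, set a timer, emit when the timer
fires". It is only a sketch: it does not use the API under review in the PR
above, the class name DelayedEmitter and its methods are hypothetical, and
the current time is passed in explicitly instead of being read from a clock.

```python
import heapq


class DelayedEmitter:
    """Toy model of delayed emit; hypothetical, not part of any Beam API."""

    def __init__(self, delay_seconds):
        self.delay_seconds = delay_seconds
        self._pending = []  # min-heap of (fire_time, sequence, element)
        self._sequence = 0  # tie-breaker so elements are never compared

    def process(self, element, now):
        # Buffer the element (the "state") and record when it should
        # fire (the "timer").
        fire_time = now + self.delay_seconds
        heapq.heappush(self._pending, (fire_time, self._sequence, element))
        self._sequence += 1

    def advance_to(self, now):
        # Emit every buffered element whose timer has fired by `now`.
        ready = []
        while self._pending and self._pending[0][0] <= now:
            _, _, element = heapq.heappop(self._pending)
            ready.append(element)
        return ready


DAY = 24 * 60 * 60  # the 24-hour wait from the original question

emitter = DelayedEmitter(DAY)
emitter.process("a", now=0)
emitter.process("b", now=3600)

early = emitter.advance_to(DAY - 1)      # no timer has fired yet
first = emitter.advance_to(DAY)          # the timer for "a" fires
second = emitter.advance_to(DAY + 3600)  # the timer for "b" fires
```

In a real pipeline the buffer and the timer would be managed per key by the
runner; the sketch keeps both in one in-memory heap purely for illustration.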

On Mon, Jul 30, 2018 at 3:57 PM Austin Bennett 
wrote:

> Fantastic; thanks, Charles!
>
>
>
> On Mon, Jul 30, 2018 at 3:49 PM, Charles Chen  wrote:
>
>> Hey Austin,
>>
>> This API is not yet implemented in the Python SDK.  I am working on this
>> feature:  the next step from my end is to finish a reference implementation
>> in the local DirectRunner.  As you note, the doc at
>> https://s.apache.org/beam-python-user-state-and-timers describes the
>> design.
>>
>> You can track progress on the mailing list thread here:
>> https://lists.apache.org/thread.html/51ba1a00027ad8635bc1d2c0df805ce873995170c75d6a08dfe21997@%3Cdev.beam.apache.org%3E
>>
>> Best,
>> Charles
>>
>> On Mon, Jul 30, 2018 at 3:34 PM Austin Bennett <
>> whatwouldausti...@gmail.com> wrote:
>>
>>> What's going on with timers and python?
>>>
>>> Am looking at building a pipeline (assuming another group in my company
>>> will grant access to the Kafka topic):
>>>
>>> Kafka -> beam -> have beam wait 24 hours -> do transform(s) and emit a
>>> record.  If I read things correctly that's not currently possible in python
>>> on beam.  What all is needed?  (trying to figure out whether that is
>>> something that I am capable of and there is room for me to help with).
>>> Looking for similar functionality to
>>> https://www.rabbitmq.com/blog/2015/04/16/scheduling-messages-with-rabbitmq/
>>> (though don't need alternate routing, nor is that example in python).
>>>
>>>
>>> For example, I see:
>>> https://beam.apache.org/blog/2017/08/28/timely-processing.html
>>>
>>> and tickets like:  https://issues.apache.org/jira/browse/BEAM-4594
>>>
>>>
>>>
>

Re: Proposal for Beam Python User State and Timer APIs

2018-08-30 Thread Charles Chen

Another update: the reference DirectRunner implementation of the Python
user state and timers API is out for review:
https://github.com/apache/beam/pull/6304

On Mon, Jul 9, 2018 at 2:18 PM Charles Chen  wrote:

> An update: https://github.com/apache/beam/pull/5691 has been merged.  I
> hope to send out a reference implementation in the DirectRunner soon.  On
> the roadmap after that is work on the relevant portability interfaces here
> so we can get this working on runners like Beam Python on Flink.
>
> On Wed, Jun 20, 2018 at 10:00 AM Charles Chen  wrote:
>
>> An update on the implementation: I recently sent out the user-facing
>> pipeline construction part of the API implementation out for review:
>> https://github.com/apache/beam/pull/5691.
>>
>> On Tue, Jun 5, 2018 at 5:26 PM Charles Chen  wrote:
>>
>>> Thanks everyone for contributing here.  We've reached rough consensus on
>>> the approach we should take with this API, and I've summarized this in the
>>> new "Community consensus" sections I added to the doc (
>>> https://s.apache.org/beam-python-user-state-and-timers).  I will begin
>>> initial implementation of this API soon.
>>>
>>> On Wed, May 23, 2018 at 8:08 PM Thomas Weise  wrote:
>>>
>>>> Nice proposal; it's exciting to see this about to be added to the SDK
>>>> as it enables a set of more complex use cases.
>>>>
>>>> I also think that some of the content can later be repurposed as user
>>>> documentation.
>>>>
>>>> Thanks,
>>>> Thomas
>>>>
>>>>
>>>> On Wed, May 23, 2018 at 11:49 AM, Charles Chen  wrote:
>>>>
>>>>> Thanks everyone for the detailed comments and discussions.  It looks
>>>>> like by now, we mostly agree with the requirements and overall direction
>>>>> needed for the API, though there is continuing discussion on specific
>>>>> details.  I want to highlight two new sections of the doc, which address
>>>>> some discussions that have come up:
>>>>>
>>>>>    - *Existing state and transactionality*: this section describes
>>>>>    how we will handle a transactionality inconsistency in the
>>>>>    existing Java API.  (
>>>>>    https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ofyl9jspiz3b
>>>>>    )
>>>>>    - *State for merging windows*: this section addresses how we will
>>>>>    deal with non-combinable state in conjunction with merging windows.  (
>>>>>    https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ctxkcgabtzpy
>>>>>    )
>>>>>
>>>>> Let me know any further comments and suggestions.
>>>>>
>>>>> On Tue, May 22, 2018 at 9:29 AM Kenneth Knowles 
>>>>> wrote:
>>>>>
>>>>>> Nice. I know that Java users have found it helpful to have this
>>>>>> lower-level way of writing pipelines when the high-level primitives don't
>>>>>> quite have the tight control they are looking for. I hope it will be a big
>>>>>> draw for Python, too.
>>>>>>
>>>>>> (commenting on the doc)
>>>>>>
>>>>>> Kenn
>>>>>>
>>>>>> On Mon, May 21, 2018 at 5:15 PM Charles Chen  wrote:
>>>>>>
>>>>>>> I want to share a proposal for adding user state and timer support
>>>>>>> to the Beam Python SDK and get the community's thoughts on how such an API
>>>>>>> should look: https://s.apache.org/beam-python-user-state-and-timers
>>>>>>>
>>>>>>> Let me know what you think and please add any comments and
>>>>>>> suggestions you may have.
>>>>>>>
>>>>>>> Best,
>>>>>>> Charles
>>>>>>>
>>>>>>
>>>>

Re: [PROPOSAL] Prepare Beam 2.7.0 release

2018-08-29 Thread Charles Chen

I have cut the release branch for 2.7.0, since there continue to be no
blockers listed in JIRA.  I will build the first release candidate soon.

On Mon, Aug 27, 2018 at 10:38 PM Charles Chen  wrote:

> Hey everyone, I want to highlight again to those who missed it that if you
> are aware of any 2.7.0 release blockers, you should add it as a blocker in
> JIRA with a target version of 2.7.0.  It is very helpful to know this
> information in advance.  As of right now, there are no such blockers listed
> in JIRA, and I will be cutting the release branch on Wednesday 8/29.
>
> Best,
> Charles
>
> On Fri, Aug 24, 2018 at 2:38 PM Charles Chen  wrote:
>
>> Thanks everyone.  Again, we will proceed with the initial release cut on
>> August 29.
>>
>> A reminder to please tag any blocking issues as "Priority: Blocker" and
>> "Fix version: 2.7.0" in JIRA.  We recently resolved
>> https://issues.apache.org/jira/browse/BEAM-5180, and there are no other
>> blocking bugs at the moment.
>>
>> On Thu, Aug 23, 2018 at 8:19 PM Griselda Cuevas  wrote:
>>
>>> +1 Thanks for volunteering and keeping us in schedule!
>>>
>>>
>>>
>>>
>>> On Thu, 23 Aug 2018 at 11:58, Udi Meiri  wrote:
>>>
>>>> +1
>>>>
>>>> On Mon, Aug 20, 2018 at 3:33 PM Boyuan Zhang 
>>>> wrote:
>>>>
>>>>> +1
>>>>> Thanks for volunteering, Charles!
>>>>>
>>>>> On Mon, Aug 20, 2018 at 3:22 PM Rafael Fernandez 
>>>>> wrote:
>>>>>
>>>>>> +1, thanks for volunteering, Charles!
>>>>>>
>>>>>> On Mon, Aug 20, 2018 at 12:09 PM Charles Chen  wrote:
>>>>>>
>>>>>>> Thank you Andrew for pointing out my mistake.  We should follow the
>>>>>>> calendar and aim to cut on 8/29, not 9/7 as I incorrectly wrote earlier.
>>>>>>>
>>>>>>> On Mon, Aug 20, 2018 at 12:02 PM Andrew Pilloud 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> +1 Thanks for volunteering! The calendar I have puts the cut date
>>>>>>>> at August 29th, which looks to be 6 weeks from when 2.6.0 was cut. Do I
>>>>>>>> have the wrong calendar?
>>>>>>>>
>>>>>>>> See:
>>>>>>>> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com=America%2FLos_Angeles
>>>>>>>>
>>>>>>>> Andrew
>>>>>>>>
>>>>>>>> On Mon, Aug 20, 2018 at 11:43 AM Connell O'Callaghan <
>>>>>>>> conne...@google.com> wrote:
>>>>>>>>
>>>>>>>>> +1 Charles thank you for taking this up and helping us maintain
>>>>>>>>> this schedule.
>>>>>>>>>
>>>>>>>>> On Mon, Aug 20, 2018 at 11:29 AM Charles Chen 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hey everyone,
>>>>>>>>>>
>>>>>>>>>> Our release calendar indicates that the process for the 2.7.0
>>>>>>>>>> Beam release should start on September 7.
>>>>>>>>>>
>>>>>>>>>> I volunteer to perform this release and propose the following
>>>>>>>>>> schedule:
>>>>>>>>>>
>>>>>>>>>>    - We start triaging issues in JIRA this week.
>>>>>>>>>>    - I will cut the initial 2.7.0 release branch on September 7.
>>>>>>>>>>    - After September 7, any blockers will need to be manually
>>>>>>>>>>    cherry-picked into the release branch.
>>>>>>>>>>    - After tests pass and blockers are fully addressed, I will
>>>>>>>>>>    move on and perform other release tasks.
>>>>>>>>>>
>>>>>>>>>> What do you think?
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Charles
>>>>>>>>>>
>>>>>>>>>

Re: [PROPOSAL] Prepare Beam 2.7.0 release

2018-08-27 Thread Charles Chen

Hey everyone, I want to highlight again to those who missed it that if you
are aware of any 2.7.0 release blockers, you should add it as a blocker in
JIRA with a target version of 2.7.0.  It is very helpful to know this
information in advance.  As of right now, there are no such blockers listed
in JIRA, and I will be cutting the release branch on Wednesday 8/29.

Best,
Charles

On Fri, Aug 24, 2018 at 2:38 PM Charles Chen  wrote:

> Thanks everyone.  Again, we will proceed with the initial release cut on
> August 29.
>
> A reminder to please tag any blocking issues as "Priority: Blocker" and
> "Fix version: 2.7.0" in JIRA.  We recently resolved
> https://issues.apache.org/jira/browse/BEAM-5180, and there are no other
> blocking bugs at the moment.
>
> On Thu, Aug 23, 2018 at 8:19 PM Griselda Cuevas  wrote:
>
>> +1 Thanks for volunteering and keeping us in schedule!
>>
>>
>>
>>
>> On Thu, 23 Aug 2018 at 11:58, Udi Meiri  wrote:
>>
>>> +1
>>>
>>> On Mon, Aug 20, 2018 at 3:33 PM Boyuan Zhang  wrote:
>>>
>>>> +1
>>>> Thanks for volunteering, Charles!
>>>>
>>>> On Mon, Aug 20, 2018 at 3:22 PM Rafael Fernandez 
>>>> wrote:
>>>>
>>>>> +1, thanks for volunteering, Charles!
>>>>>
>>>>> On Mon, Aug 20, 2018 at 12:09 PM Charles Chen  wrote:
>>>>>
>>>>>> Thank you Andrew for pointing out my mistake.  We should follow the
>>>>>> calendar and aim to cut on 8/29, not 9/7 as I incorrectly wrote earlier.
>>>>>>
>>>>>> On Mon, Aug 20, 2018 at 12:02 PM Andrew Pilloud 
>>>>>> wrote:
>>>>>>
>>>>>>> +1 Thanks for volunteering! The calendar I have puts the cut date at
>>>>>>> August 29th, which looks to be 6 weeks from when 2.6.0 was cut. Do I have
>>>>>>> the wrong calendar?
>>>>>>>
>>>>>>> See:
>>>>>>> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com=America%2FLos_Angeles
>>>>>>>
>>>>>>> Andrew
>>>>>>>
>>>>>>> On Mon, Aug 20, 2018 at 11:43 AM Connell O'Callaghan <
>>>>>>> conne...@google.com> wrote:
>>>>>>>
>>>>>>>> +1 Charles thank you for taking this up and helping us maintain
>>>>>>>> this schedule.
>>>>>>>>
>>>>>>>> On Mon, Aug 20, 2018 at 11:29 AM Charles Chen 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hey everyone,
>>>>>>>>>
>>>>>>>>> Our release calendar indicates that the process for the 2.7.0 Beam
>>>>>>>>> release should start on September 7.
>>>>>>>>>
>>>>>>>>> I volunteer to perform this release and propose the following
>>>>>>>>> schedule:
>>>>>>>>>
>>>>>>>>>    - We start triaging issues in JIRA this week.
>>>>>>>>>    - I will cut the initial 2.7.0 release branch on September 7.
>>>>>>>>>    - After September 7, any blockers will need to be manually
>>>>>>>>>    cherry-picked into the release branch.
>>>>>>>>>    - After tests pass and blockers are fully addressed, I will
>>>>>>>>>    move on and perform other release tasks.
>>>>>>>>>
>>>>>>>>> What do you think?
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Charles
>>>>>>>>>
>>>>>>>>

Re: [PROPOSAL] Prepare Beam 2.7.0 release

2018-08-24 Thread Charles Chen

Thanks everyone.  Again, we will proceed with the initial release cut on
August 29.

A reminder to please tag any blocking issues as "Priority: Blocker" and
"Fix version: 2.7.0" in JIRA.  We recently resolved
https://issues.apache.org/jira/browse/BEAM-5180, and there are no other
blocking bugs at the moment.
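
As a rough illustration of the tagging rule above, the check "are there any
open blockers for this release?" can be sketched as a filter over issue
records. This is only a sketch under stated assumptions: the Issue record,
its field names, and the EXAMPLE-n keys are hypothetical stand-ins, not the
JIRA data model or API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Issue:
    """Hypothetical issue record; not the JIRA schema."""
    key: str
    priority: str
    fix_versions: List[str]
    resolved: bool


def open_blockers(issues, version):
    # An issue blocks the release only if it carries both tags from the
    # reminder ("Priority: Blocker" and the matching fix version) and is
    # still unresolved.
    return [
        issue
        for issue in issues
        if issue.priority == "Blocker"
        and version in issue.fix_versions
        and not issue.resolved
    ]


# Made-up sample data for illustration only.
issues = [
    Issue("EXAMPLE-1", "Blocker", ["2.7.0"], resolved=True),
    Issue("EXAMPLE-2", "Major", ["2.7.0"], resolved=False),
    Issue("EXAMPLE-3", "Blocker", ["2.8.0"], resolved=False),
]

remaining = open_blockers(issues, "2.7.0")
can_cut_branch = not remaining
```

An issue that is tagged with only one of the two fields does not show up in
this filter, which is why the reminder asks for both.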

On Thu, Aug 23, 2018 at 8:19 PM Griselda Cuevas  wrote:

> +1 Thanks for volunteering and keeping us in schedule!
>
>
>
>
> On Thu, 23 Aug 2018 at 11:58, Udi Meiri  wrote:
>
>> +1
>>
>> On Mon, Aug 20, 2018 at 3:33 PM Boyuan Zhang  wrote:
>>
>>> +1
>>> Thanks for volunteering, Charles!
>>>
>>> On Mon, Aug 20, 2018 at 3:22 PM Rafael Fernandez 
>>> wrote:
>>>
>>>> +1, thanks for volunteering, Charles!
>>>>
>>>> On Mon, Aug 20, 2018 at 12:09 PM Charles Chen  wrote:
>>>>
>>>>> Thank you Andrew for pointing out my mistake.  We should follow the
>>>>> calendar and aim to cut on 8/29, not 9/7 as I incorrectly wrote earlier.
>>>>>
>>>>> On Mon, Aug 20, 2018 at 12:02 PM Andrew Pilloud 
>>>>> wrote:
>>>>>
>>>>>> +1 Thanks for volunteering! The calendar I have puts the cut date at
>>>>>> August 29th, which looks to be 6 weeks from when 2.6.0 was cut. Do I have
>>>>>> the wrong calendar?
>>>>>>
>>>>>> See:
>>>>>> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com=America%2FLos_Angeles
>>>>>>
>>>>>> Andrew
>>>>>>
>>>>>> On Mon, Aug 20, 2018 at 11:43 AM Connell O'Callaghan <
>>>>>> conne...@google.com> wrote:
>>>>>>
>>>>>>> +1 Charles thank you for taking this up and helping us maintain this
>>>>>>> schedule.
>>>>>>>
>>>>>>> On Mon, Aug 20, 2018 at 11:29 AM Charles Chen 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hey everyone,
>>>>>>>>
>>>>>>>> Our release calendar indicates that the process for the 2.7.0 Beam
>>>>>>>> release should start on September 7.
>>>>>>>>
>>>>>>>> I volunteer to perform this release and propose the following
>>>>>>>> schedule:
>>>>>>>>
>>>>>>>>    - We start triaging issues in JIRA this week.
>>>>>>>>    - I will cut the initial 2.7.0 release branch on September 7.
>>>>>>>>    - After September 7, any blockers will need to be manually
>>>>>>>>    cherry-picked into the release branch.
>>>>>>>>    - After tests pass and blockers are fully addressed, I will
>>>>>>>>    move on and perform other release tasks.
>>>>>>>>
>>>>>>>> What do you think?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Charles
>>>>>>>>
>>>>>>>

Re: BEAM-5180 for 2.7.0 ?

2018-08-24 Thread Charles Chen

Thank you for getting the partial rollback in.  I will close
https://issues.apache.org/jira/browse/BEAM-5180 as fixed.  Ankur: if you
have a more nuanced fix in mind, please open a new JIRA ticket to track and
update us on this thread.

On Fri, Aug 24, 2018 at 10:42 AM Ankur Goenka  wrote:

> Replies on the Jira and PR.
> For now we should go ahead with rollback to unblock 2.7
>
> On Fri, Aug 24, 2018 at 10:21 AM Udi Meiri  wrote:
>
>> +Ankur Goenka  (Kenneth is out of office)
>>
>> On Fri, Aug 24, 2018 at 3:20 AM Tim Robertson 
>> wrote:
>>
>>> Thanks Jozef for bringing this to dev@ and your work in reporting Jiras
>>> and offering fixes.
>>>
>>> I propose we consider BEAM-5180, BEAM-2277 blockers on 2.7.0. They break
>>> word count and file IO writing on HDFS unless the workaround is used (see
>>> BEAM-2277 commentary).
>>>
>>> In addition the performance of writing to HDFS is bad (to the point of
>>> being unusable) due to BEAM-5036 (blocked by BEAM-4861) which you have also
>>> observed. I don’t think we’ll make the cut for 2.7.0 on those because I
>>> anticipate significant testing is needed. I will focus on that as soon as I
>>> have time but I propose we make those blockers on 2.8.0 though.
>>>
>>>
>>>
>>> On Fri, Aug 24, 2018 at 7:53 AM Jozef Vilcek 
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> Does this JIRA have a chance to be part of the 2.7.0 release?
>>>> https://issues.apache.org/jira/browse/BEAM-5180
>>>>
>>>> It is a rather simple ask, but it has not yet been decided whether the
>>>> attached PR is the correct way of fixing it.
>>>>
>>>> Thanks,
>>>> Jozef
>>>>
>>>

Re: [PROPOSAL] Prepare Beam 2.7.0 release

2018-08-20 Thread Charles Chen

Thank you Andrew for pointing out my mistake.  We should follow the
calendar and aim to cut on 8/29, not 9/7 as I incorrectly wrote earlier.
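
The date arithmetic behind this correction can be checked with a short
sketch. It assumes, as Andrew's message suggests but does not state as a
date, that the 2.6.0 branch was cut exactly six weeks before August 29; the
variable names are illustrative only.

```python
from datetime import date, timedelta

CADENCE = timedelta(weeks=6)      # release cadence referred to in the thread

calendar_cut = date(2018, 8, 29)  # cut date according to the calendar
mistaken_cut = date(2018, 9, 7)   # date written in the original proposal

# Cut date of the previous release implied by a strict six-week cadence.
implied_previous_cut = calendar_cut - CADENCE

# How far the mistaken date would have slipped past the calendar date.
slip = mistaken_cut - calendar_cut

# Whole six-week cycles that fit into a 365-day year.
cuts_per_year = timedelta(days=365) // CADENCE

print(implied_previous_cut.isoformat(), slip.days, cuts_per_year)
```

Under that assumption the previous cut falls on a Wednesday, the same
weekday as 8/29, which is consistent with a fixed six-week rhythm.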

On Mon, Aug 20, 2018 at 12:02 PM Andrew Pilloud  wrote:

> +1 Thanks for volunteering! The calendar I have puts the cut date at
> August 29th, which looks to be 6 weeks from when 2.6.0 was cut. Do I have
> the wrong calendar?
>
> See:
> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com=America%2FLos_Angeles
>
> Andrew
>
> On Mon, Aug 20, 2018 at 11:43 AM Connell O'Callaghan 
> wrote:
>
>> +1 Charles thank you for taking this up and helping us maintain this
>> schedule.
>>
>> On Mon, Aug 20, 2018 at 11:29 AM Charles Chen  wrote:
>>
>>> Hey everyone,
>>>
>>> Our release calendar indicates that the process for the 2.7.0 Beam
>>> release should start on September 7.
>>>
>>> I volunteer to perform this release and propose the following schedule:
>>>
>>>    - We start triaging issues in JIRA this week.
>>>    - I will cut the initial 2.7.0 release branch on September 7.
>>>    - After September 7, any blockers will need to be manually
>>>    cherry-picked into the release branch.
>>>    - After tests pass and blockers are fully addressed, I will move on
>>>    and perform other release tasks.
>>>
>>> What do you think?
>>>
>>> Best,
>>> Charles
>>>
>>

[PROPOSAL] Prepare Beam 2.7.0 release

2018-08-20 Thread Charles Chen

Hey everyone,

Our release calendar indicates that the process for the 2.7.0 Beam release
should start on September 7.

I volunteer to perform this release and propose the following schedule:

   - We start triaging issues in JIRA this week.
   - I will cut the initial 2.7.0 release branch on September 7.
   - After September 7, any blockers will need to be manually cherry-picked
   into the release branch.
   - After tests pass and blockers are fully addressed, I will move on and
   perform other release tasks.

What do you think?

Best,
Charles

Re: [Discuss] Add EXTERNAL keyword to CREATE TABLE statement

2018-08-15 Thread Charles Chen

+1 for CREATE EXTERNAL TABLE.  It is a good balance between the general SQL
expectation of having tables as an abstraction and reinforcing that Beam
does not store your data.

On Wed, Aug 15, 2018 at 1:58 PM Rui Wang  wrote:

> >  I think users will be more confused to find that 'CREATE TABLE' doesn't
> exist than to learn that it might not always create a table.
>
> >> I think that having CREATE TABLE do something unexpected or not do
> something expected (or do the opposite things depending on the table type
> or some flag) is worse than having users look up the correct way of
> creating a data source in Beam SQL without expecting something we don't
> promise.
>
> I agree with this. Requiring users to look up the documentation for the
> correct way is better than letting them use an ambiguous way that could
> defeat their expectations.
>
>
> -Rui
>
> On Wed, Aug 15, 2018 at 1:46 PM Anton Kedin  wrote:
>
>> I think that something unique along the lines of `REGISTER EXTERNAL DATA
>> SOURCE` is probably fine, as it doesn't conflict with existing behaviors of
>> other dialects.
>>
>> > There is a lot of value in making sure our common operations closely
>> map to the equivalent common operations in other SQL dialects.
>>
>> We're trying to make opposite points using the same arguments :) A lot of
>> popular dialects distinguish between CREATE TABLE and CREATE EXTERNAL
>> TABLE (or similar):
>>  - T-SQL:
>>   create:
>> https://docs.microsoft.com/en-us/sql/t-sql/statements/create-table-transact-sql
>>   create external:
>> https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-2017
>>   external datasource:
>> https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-data-source-transact-sql?view=sql-server-2017
>>  - PL/SQL:
>>   create:
>> https://docs.oracle.com/cd/B28359_01/server.111/b28310/tables003.htm#i1106369
>>   create external:
>> https://docs.oracle.com/cd/B19306_01/server.102/b14215/et_concepts.htm#i1009127
>>  - postgres:
>>   import foreign schema:
>> https://www.postgresql.org/docs/9.5/static/sql-importforeignschema.html
>>   create table:
>> https://www.postgresql.org/docs/9.1/static/sql-createtable.html
>>  - redshift:
>>   create external schema:
>> https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_SCHEMA.html
>>   create table:
>> https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html
>>  - hive internal and external:
>> https://www.dezyre.com/hadoop-tutorial/apache-hive-tutorial-tables
>>
>> My understanding is that the behavior of create table is somewhat similar
>> in all of the above dialects, from the high-level perspective it usually
>> creates a persistent table in the current storage context (database).
>> That's not what Beam SQL's create table does right now, and my opinion is
>> that it should not be called create table for this reason.
>>
>> >  I think users will be more confused to find that 'CREATE TABLE'
>> doesn't exist than to learn that it might not always create a table.
>>
>> I think that having CREATE TABLE do something unexpected or not do
>> something expected (or do the opposite things depending on the table type
>> or some flag) is worse than having users look up the correct way of
>> creating a data source in Beam SQL without expecting something we don't
>> promise.
>>
>> >  (For example, a user guessing at the syntax of CREATE TABLE would have
>> a better experience with the error being "field LOCATION not specified"
>> rather than "operation CREATE TABLE not found".)
>>
>> They have to look it up anyway (what format is location for a Pubsub
>> topic? or is it a subscription?), and when doing so I think it would be
>> less confusing to read that to get data from Pubsub/Kafka/... in Beam SQL
>> you have to do something like `REGISTER EXTERNAL DATA SOURCE` than `CREATE
>> TABLE`.
>>
>> External tables and schemas don't have a standard approach and I don't
>> have a strong preference for any one of the above.
>>
>> On Wed, Aug 15, 2018 at 1:08 PM Rui Wang  wrote:
>>
>>> Adding dev@ back now.
>>>
>>> -Rui
>>>
>>> On Wed, Aug 15, 2018 at 1:01 PM Andrew Pilloud 
>>> wrote:
>>>
>>>> Did we drop the dev list from this on purpose? (I haven't added it
>>>> back, but we probably should.)
>>>>
>>>> I'm in favor of sticking with the simple 'CREATE TABLE' and 'CREATE
>>>> SCHEMA' if there is only to be one option. Sticking with those names
>>>> minimizes both our deviation from other implementations and user surprise.
>>>> There is a lot of value in making sure our common operations closely map to
>>>> the equivalent common operations in other SQL dialects. I think users will
>>>> be more confused to find that 'CREATE TABLE' doesn't exist than to learn
>>>> that it might not always create a table. This minimizes the overhead of
>>>> learning our dialect of SQL and maximizes the odds that a user will be able
>>>> to guess at the syntax of something

Re: [Discussion] Clarify the support story for released Beam versions

2018-08-13 Thread Charles Chen

(sending to the dev@ list thread as this is more relevant here than users@)

Will we be using a different / potentially more rigorous process for
releasing LTS releases?  Or do we feel that any validations that could
possibly be done should already be incorporated into each release?

On Mon, Aug 13, 2018 at 10:57 AM Ahmet Altay  wrote:

> Update:
>
> I sent out an email to user@ to collect their feedback [1]. I will
> encourage everyone here to collect feedback from the other channels
> available to you. To facilitate the discussion I drafted my proposal in a
> PR [2].
>
> Ahmet
>
> [1]
> https://lists.apache.org/thread.html/7d890d6ed221c722a95d9c773583450767b79ee0c0c78f48a56c7eba@%3Cuser.beam.apache.org%3E
> [2] https://github.com/apache/beam-site/pull/537
>
> On Fri, Aug 10, 2018 at 5:20 PM, Lukasz Cwik  wrote:
>
>> Thanks, I can see the reasoning for LTS releases based upon some
>> enterprise customers' needs.
>>
>> Forgot about the 2.1.1 Python release. Thanks for pointing that out.
>>
>> On Fri, Aug 10, 2018 at 4:47 PM Ahmet Altay  wrote:
>>
>>>
>>> On Fri, Aug 10, 2018 at 12:33 PM, Lukasz Cwik  wrote:
>>>
>>>> I like the ideas that you're proposing but am wondering what value, if any,
>>>> supporting LTS releases adds. We maintain semantic versioning and I would
>>>> expect that most users would be using the latest released version if not
>>>> the release just before that. There is likely a long tail of users who will
>>>> use a specific version and are unlikely to ever upgrade.

>>>
>>> I believe there is a category of enterprise users who would continue to
>>> use a specific version as long as they know they can get support for it.
>>> This usually stems from the need to have a stable environment. There is
>>> also the aspect of validating a new product before using it. I know some
>>> companies have validation cycles longer than 6 weeks. They will still
>>> upgrade but they would like to upgrade much less frequently.
>>>
>>> I was hoping that defining LTS releases would signal to these types of users
>>> which releases are worth upgrading to if they have a high cost of upgrading.
>>>
>>> This comes from my anecdotal evidence and I may be wrong.
>>>
>>>

>>>> I believe it would be valuable to ask our users what is most important
>>>> to them with respect to the policy (after we have discussed it a little
>>>> bit) as well since ultimately our goal is to help our users.

>>>
>>> I agree with this. Since I am referring to enterprise users primarily I
>>> think some of it will require the companies here to collect that feedback.
>>>
>>>
>>>> This could then be documented and we could provide guidance to
>>>> customers as to how to reach out to the group for big bugs. Also note that
>>>> Apache has a security policy[1] in place which we should direct users to.

>>>
>>> I think documenting what can be expected of Beam in terms of support
>>> would be very valuable by itself. It will also help us figure out what we
>>> could drop. For example in the recent discussion to drop old API docs,
>>> there was no clear guidance on which SDKs are still supported and should
>>> have their API docs hosted.
>>>
>>> I think we reference the Apache security policy on our website. If
>>> not, I agree we should add a reference to it.
>>>
>>>

>>>> Also, we don't have any experience in patching a release as we haven't
>>>> yet done one patch version bump. All issues that have been brought up were
>>>> always fixed in the next minor version bump.

>>>
>>> I agree. There was the Python 2.1.1 but that is the only example I could
>>> remember.
>>>
>>>

>>>> 1: http://www.apache.org/security/
>>>>
>>>> On Fri, Aug 10, 2018 at 11:50 AM Pablo Estrada 
>>>> wrote:

>>>>> I think this all sounds reasonable, and I think it would be a good
>>>>> story for our users. We don't have much experience with patching releases,
>>>>> but I guess it's a matter of learning and improving over time.
>>>>> -P.
>>>>>
>>>>> On Wed, Aug 8, 2018 at 9:04 PM Ahmet Altay  wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I would like us to clarify the life cycle of Beam releases a little
>>>>>> bit more for our users. I think we increased the predictability
>>>>>> significantly by agreeing to a release cadence and kudos to everyone on
>>>>>> that. As a follow up to that I would like to address the following problem:
>>>>>>
>>>>>> It is unclear for a user of Beam how long an existing version will be
>>>>>> supported. And if it will be supported at all, what does that support mean.
>>>>>> (This is especially an important problem for users who would like to use
>>>>>> stable versions and care less about being on the cutting edge.)
>>>>>>
>>>>>> Our current state is:
>>>>>>
>>>>>> - With our agreed release cadence Beam makes 8 releases a year.
>>>>>> - We have precedent for patching released versions for major issues.
>>>>>> - Patching all existing releases at any point (even patching a year
>>>>>> full of 8

Re: [VOTE] Community Examples Repository

2018-08-08 Thread Charles Chen

It looks like the main claim is that 1 and 2 have the benefit of increasing
visibility for examples on the Beam site.  I agree with Robert's comments
on the doc which claim that this is orthogonal to whether a separate
repository is created (the comments are unresolved:
https://docs.google.com/a/google.com/document/d/1vhcKJlP0qH1C7NZPDjohT2PUbOD-k71avv1CjEYapdw/edit?disco=BzifZxY
).

I would add that the maintenance and testing burden has not been adequately
addressed in the proposal (i.e. are we creating new Jenkins jobs?; will
postcommits on the main Beam repo run examples tests?; are we releasing
artifacts--if so, is this together with the main package or separately in
new packages?).  If we go with the half-way solution in (2), there is also
the issue of where the threshold is--for example, if a user-contributed
example is particularly useful, do we move it to the main repo?

On Wed, Aug 8, 2018 at 1:35 PM Griselda Cuevas  wrote:

> I'd vote for 2.
>
> Giving independence to an example repository and creating the right
> infrastructure to maintain them will give visibility to the efforts our
> users are making to solve their use cases with Beam. I also want to make
> the process of sharing common work easier.
>
> Re: the examples that will remain in core, I agree that it's crucial to
> keep some examples for testing.
>
>
> On Wed, 8 Aug 2018 at 11:44, Lukasz Cwik  wrote:
>
>> I would vote for 3.
>>
>> My reasoning is that Java already has a good mechanism to get a
>> starter/example project going by using the Maven archetypes. Our quickstart
>> guide for Apache Beam for the Java SDK already covers generating the
>> examples archetype.
>> We could point users to the starter project at the end of the Java
>> quickstart.
>>
>> If python/go have a similar mechanism that is commonly used, I would go
>> with those over creating a separate repo for examples and adding the
>> maintenance burden involved.
>>
>>
>>
>> On Wed, Aug 8, 2018 at 11:01 AM Rui Wang  wrote:
>>
>>> 2 - examples that rely on experimental APIs can stay where they are
>>> because such examples could still change.
>>>
>>> -Rui
>>>
>>> On Wed, Aug 8, 2018 at 10:52 AM Charles Chen  wrote:
>>>
>>>> 3 - We benefit from increased test coverage by having examples together
>>>> with the rest of the code.  As Robert mentions in the doc, hosting the Beam
>>>> examples in the main repository is the best way to keep the examples
>>>> visible, tested and maintained.  Given that we recently moved to a single
>>>> repository for the website since that previously caused a lot of pain, it
>>>> makes sense to be consistent here.
>>>>
>>>> On Wed, Aug 8, 2018 at 10:27 AM Ahmet Altay  wrote:
>>>>
>>>>> 2 - Similar to Huygaa, I see value in keeping a core set of examples
>>>>> tested and maintained against head. At the same time I understand the
>>>>> value of a growing set of community-grown examples that are targeted
>>>>> against predefined versions of Beam and not necessarily updated at every
>>>>> release.
>>>>>
>>>>> On Wed, Aug 8, 2018 at 10:22 AM, Huygaa Batsaikhan 
>>>>> wrote:
>>>>>
>>>>>> 2 - I like the idea of having a separate repo where we can have more
>>>>>> freedom to check in examples. However, we benefit from having immediate
>>>>>> core examples in Beam for testing purposes.
>>>>>>
>>>>>> On Wed, Aug 8, 2018 at 9:38 AM David Cavazos 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi everyone!
>>>>>>>
>>>>>>> We discussed several options as well as some of the implications of
>>>>>>> each option. Please vote for your favorite option, and feel free to back
>>>>>>> it up with any reasons that make you feel that way.
>>>>>>>
>>>>>>> 1) Move *all* samples to a new *examples* repository
>>>>>>> 2) Move *some* samples to a new *examples* repository
>>>>>>> 3) Leave samples where they are
>>>>>>>
>>>>>>> Some implications to creating a new repository:
>>>>>>> - Every example would be independent from every other example, so
>>>>>>> tests can be run in parallel
>>>>>>> - Examples would now show how to use Beam *externally*
>>>>>>> - The examples repository would need a testing infrastructure
>>>>>>> - Decoupling makes examples easier to test on different versions
>>>>>>> - Easier to copy-paste an existing example and start from there,
>>>>>>> almost like a template
>>>>>>> - Smaller size for the core Beam library
>>>>>>> - Two different repositories to maintain
>>>>>>> - Versioning could mirror Beam's current version
>>>>>>>
>>>>>>> Link to proposal
>>>>>>> <https://docs.google.com/document/d/1vhcKJlP0qH1C7NZPDjohT2PUbOD-k71avv1CjEYapdw/edit?usp=sharing>
>>>>>>>
>>>>>>
>>>>>

Re: [VOTE] Community Examples Repository

2018-08-08 Thread Charles Chen

3 - We benefit from increased test coverage by having examples together
with the rest of the code.  As Robert mentions in the doc, hosting the Beam
examples in the main repository is the best way to keep the examples
visible, tested and maintained.  Given that we recently moved to a single
repository for the website since that previously caused a lot of pain, it
makes sense to be consistent here.

On Wed, Aug 8, 2018 at 10:27 AM Ahmet Altay  wrote:

> 2 - Similar to Huygaa, I see value in keeping a core set of examples
> tested and maintained against head. At the same time I understand the value
> of a growing set of community-grown examples that are targeted against
> predefined versions of Beam and not necessarily updated at every release.
>
> On Wed, Aug 8, 2018 at 10:22 AM, Huygaa Batsaikhan 
> wrote:
>
>> 2 - I like the idea of having a separate repo where we can have more
>> freedom to check in examples. However, we benefit from having immediate
>> core examples in Beam for testing purposes.
>>
>> On Wed, Aug 8, 2018 at 9:38 AM David Cavazos  wrote:
>>
>>> Hi everyone!
>>>
>>> We discussed several options as well as some of the implications of each
>>> option. Please vote for your favorite option, and feel free to back it up
>>> with any reasons that make you feel that way.
>>>
>>> 1) Move *all* samples to a new *examples* repository
>>> 2) Move *some* samples to a new *examples* repository
>>> 3) Leave samples where they are
>>>
>>> Some implications to creating a new repository:
>>> - Every example would be independent from every other example, so tests
>>> can be run in parallel
>>> - Examples would now show how to use Beam *externally*
>>> - The examples repository would need a testing infrastructure
>>> - Decoupling makes examples easier to test on different versions
>>> - Easier to copy-paste an existing example and start from there, almost
>>> like a template
>>> - Smaller size for the core Beam library
>>> - Two different repositories to maintain
>>> - Versioning could mirror Beam's current version
>>>
>>> Link to proposal
>>> 
>>>
>>
>

Re: Community Examples Repository

2018-08-03 Thread Charles Chen

We should separate out the decision for (1) whether examples should be
packaged separately upon release and (2) where the example will live
code-wise, i.e. whether we want another repo.  With respect to the first
item, I think the proposal needs more detail before we can decide here--for
example, if we separate out the packaging for the examples, we need to
change our build process and potentially release additional PyPI packages
and this should be thought about before we can make a decision.

On Fri, Aug 3, 2018 at 3:23 PM Pablo Estrada  wrote:

> Hello all,
> I see a number of mixed responses. I think it would be helpful to push for
> a decision by calling for a vote.
>
> Also, the proposal has a number of parts, so perhaps we could ask David
> and other contributors of the proposal to outline a couple of alternatives
> that we can all vote on. (e.g. #1 no examples repo, #2 all examples to new
> repo, #3 examples repo, but some examples remain in main repo).
>
> The outcome may be no change at all, or some change, but at least we'll
> have a definite decision from the community.
>
> Does that sound reasonable?
> -P.
>
> On Thu, Aug 2, 2018 at 11:09 AM Ankur Goenka  wrote:
>
>> I like the initiative but I feel that fragmenting the codebase will make
>> it harder to discover examples. Having examples in a separate repo makes it
>> easier to forget that examples should get the same love as the rest of the
>> codebase.
>> The other challenge is the tooling and integration, which is harder with
>> multiple repos.
>> It makes sense to isolate the examples and make them more obvious.
>> A sub project of examples as mentioned in the discussion might be
>> sufficient without having much overhead.
>>
>> Thanks,
>> Ankur
>>
>>
>> On Thu, Aug 2, 2018 at 10:52 AM Kai Jiang  wrote:
>>
>>> Agreed with Rui. We could also add more SQL examples (like, different
>>> IOs ) for everyone to get started with.
>>>
>>> Best,
>>> Kai
>>>
>>> On 2018/08/02 17:40:32, Rui Wang  wrote:
>>> > I might have missed it: do the examples to be moved include those that
>>> > are not under example/? For example there are some BeamSQL examples in
>>> > org/apache/beam/sdk/extensions/sql/example
>>> > <https://github.com/apache/beam/tree/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/example>.
>>> >
>>> >
>>> > It's better to keep the BeamSQL examples where they are because the
>>> > related API might still change.
>>> >
>>> > -Rui
>>> >
>>> > On Thu, Aug 2, 2018 at 8:58 AM Ahmet Altay  wrote:
>>> >
>>> > > Robert, I agree with you in general. However there is also a second
>>> > > motivation. There is an increase in new PRs that are coming to add new
>>> > > examples. This is great however the core code (including distributions)
>>> > > is not a great place to host such examples. An examples repo would help
>>> > > in this case. It could also serve as an entry point for new contributors.
>>> > >
>>> > >
>>> > >
>>> > > On Thu, Aug 2, 2018 at 12:40 AM, Robert Bradshaw <rober...@google.com>
>>> > > wrote:
>>> > >
>>> > >> I have to admit I'm generally -1 on moving examples to a separate
>>> > >> repository. In particular, I think it would actually inhibit the
>>> > >> stated goals of increasing visibility and better keeping them up to
>>> > >> date, and for all the reasons we just migrated the beam-site directory
>>> > >> in. It seems the primary motivation is that it's difficult in Java to
>>> > >> have a portion of the repo that depends on another as if it were
>>> > >> "external" (i.e. the way others would use Beam) rather than being a
>>> > >> sub-project of Beam. Is this not doable?
>>> > >> On Wed, Aug 1, 2018 at 10:59 PM Charles Chen wrote:
>>> > >> >
>>> > >> > I would also prefer that examples be linked to releases so that we can
>>> > >> > build and test them during development; i.e. if your commit breaks
>>> > >> > wordcount, we want to know right away so we can revert.  Perhaps we can
>>> > >> > keep these in the repo but more clearly modula

Re: Community Examples Repository

2018-08-01 Thread Charles Chen

I would also prefer that examples be linked to releases so that we can
build and test them during development; i.e. if your commit breaks
wordcount, we want to know right away so we can revert.  Perhaps we can
keep these in the repo but more clearly modularize the artifacts we release?

For the Python SDK, if we separate this out in any way, there is the
separate issue of dealing with namespace packages (which are kind of broken
and poorly supported:
https://github.com/pypa/python-packaging-user-guide/issues/265), if we want
to keep the examples under the apache_beam.examples module path.  See also
https://packaging.python.org/guides/packaging-namespace-packages/.
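
To make the namespace-package concern concrete, here is a minimal sketch of the pkgutil-style approach described in the packaging guide linked above. It builds two throwaway "distributions" in temporary directories that both contribute to one top-level name. The name demo_ns and the helper functions are hypothetical stand-ins for illustration, not anything in Beam. Note the constraint this implies: every distribution sharing the namespace has to ship the same namespace-only __init__.py, which is part of why splitting apache_beam.examples out is awkward.

```python
import importlib
import os
import sys
import tempfile

# pkgutil-style namespace boilerplate; each distribution that shares the
# namespace must ship this exact file as <namespace>/__init__.py.
NAMESPACE_INIT = (
    "__path__ = __import__('pkgutil').extend_path(__path__, __name__)\n"
)


def make_distribution(root, namespace, subpackage, body):
    """Write <root>/<namespace>/<subpackage>/__init__.py to disk."""
    ns_dir = os.path.join(root, namespace)
    sub_dir = os.path.join(ns_dir, subpackage)
    os.makedirs(sub_dir)
    with open(os.path.join(ns_dir, "__init__.py"), "w") as f:
        f.write(NAMESPACE_INIT)
    with open(os.path.join(sub_dir, "__init__.py"), "w") as f:
        f.write(body)


def demo():
    """Install two separate roots and import a subpackage from each."""
    core_root = tempfile.mkdtemp()
    examples_root = tempfile.mkdtemp()
    # "demo_ns" is a hypothetical shared top-level package name.
    make_distribution(core_root, "demo_ns", "core", "NAME = 'core'\n")
    make_distribution(examples_root, "demo_ns", "examples",
                      "NAME = 'examples'\n")
    sys.path[:0] = [core_root, examples_root]
    importlib.invalidate_caches()
    core = importlib.import_module("demo_ns.core")
    examples = importlib.import_module("demo_ns.examples")
    return core.NAME, examples.NAME
```

Calling demo() imports both subpackages even though they live under different roots; if either root shipped a regular __init__.py with real content instead of the boilerplate, the other portion would not be found.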

On Wed, Aug 1, 2018 at 9:29 PM j...@nanthrax.net  wrote:

> Hi,
>
> I have no problem with moving the examples to a dedicated repository.
> However, IMHO, we have to:
>
> 1. Keep a build of examples linked to latest core release/SNAPSHOT
> 2. Include the examples in the distribution (convenient for the users)
>
> On another topic, I think it would be better to avoid using Google Docs
> for this kind of discussion and to share directly on the mailing list (at
> least a summary/light details).
>
> Regards
> JB
>
> On Thursday, August 02, 2018 00:12 CEST, David Cavazos <
> dcava...@google.com> wrote:
>
>
> Hi everyone!
>
> We wanted to migrate the examples from the core repository to a new Beam
> community examples repository. As the number of examples grows, it makes
> sense to modularize and decouple the core functionality from the examples.
>
> We will also create some guidelines with the best practices for new
> examples to be submitted.
>
> For more details, feel free to take a look and comment on the proposal
> 
> .
>
> Cheers,
> David
>
>
>
>
>

Re: Community Examples Repository

2018-08-01 Thread Charles Chen

The examples we have right now serve both as examples to users and, along
with their unit tests, as tests of functionality.  If we move the examples
out, what is a good way to make sure that we continue to have visibility
and test coverage?  Can we address this in a section of the doc?

On Wed, Aug 1, 2018 at 3:12 PM David Cavazos  wrote:

> Hi everyone!
>
> We wanted to migrate the examples from the core repository to a new Beam
> community examples repository. As the number of examples grows, it makes
> sense to modularize and decouple the core functionality from the examples.
>
> We will also create some guidelines with the best practices for new
> examples to be submitted.
>
> For more details, feel free to take a look and comment on the proposal
> 
> .
>
> Cheers,
> David
>

Re: delayed emit (timer) in py-beam?

2018-07-30 Thread Charles Chen

Hey Austin,

This API is not yet implemented in the Python SDK.  I am working on this
feature:  the next step from my end is to finish a reference implementation
in the local DirectRunner.  As you note, the doc at
https://s.apache.org/beam-python-user-state-and-timers describes the design.

You can track progress on the mailing list thread here:
https://lists.apache.org/thread.html/51ba1a00027ad8635bc1d2c0df805ce873995170c75d6a08dfe21997@%3Cdev.beam.apache.org%3E

Best,
Charles

On Mon, Jul 30, 2018 at 3:34 PM Austin Bennett 
wrote:

> What's going on with timers and python?
>
> Am looking at building a pipeline (assuming another group in my company
> will grant access to the Kafka topic):
>
> Kafka -> beam -> have beam wait 24 hours -> do transform(s) and emit a
> record.  If I read things correctly that's not currently possible in python
> on beam.  What all is needed?  (trying to figure out whether that is
> something that I am capable of and there is room for me to help with).
> Looking for similar functionality to
> https://www.rabbitmq.com/blog/2015/04/16/scheduling-messages-with-rabbitmq/
> (though don't need alternate routing, nor is that example in python).
>
>
> For example, I see:
> https://beam.apache.org/blog/2017/08/28/timely-processing.html
>
> and tickets like:  https://issues.apache.org/jira/browse/BEAM-4594
>
>
>

Re: Proposal for Beam Python User State and Timer APIs

2018-06-20 Thread Charles Chen

An update on the implementation: I recently sent the user-facing
pipeline construction part of the API implementation out for review:
https://github.com/apache/beam/pull/5691.

On Tue, Jun 5, 2018 at 5:26 PM Charles Chen  wrote:

> Thanks everyone for contributing here.  We've reached rough consensus on
> the approach we should take with this API, and I've summarized this in the
> new "Community consensus" sections I added to the doc (
> https://s.apache.org/beam-python-user-state-and-timers).  I will begin
> initial implementation of this API soon.
>
> On Wed, May 23, 2018 at 8:08 PM Thomas Weise  wrote:
>
>> Nice proposal; it's exciting to see this about to be added to the SDK as
>> it enables a set of more complex use cases.
>>
>> I also think that some of the content can later be repurposed as user
>> documentation.
>>
>> Thanks,
>> Thomas
>>
>>
>> On Wed, May 23, 2018 at 11:49 AM, Charles Chen  wrote:
>>
>>> Thanks everyone for the detailed comments and discussions.  It looks
>>> like by now, we mostly agree with the requirements and overall direction
>>> needed for the API, though there is continuing discussion on specific
>>> details.  I want to highlight two new sections of the doc, which address
>>> some discussions that have come up:
>>>
>>>    - *Existing state and transactionality*: this section addresses how
>>>      we will address an existing transactionality inconsistency in the
>>>      existing Java API.
>>>      (https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ofyl9jspiz3b)
>>>    - *State for merging windows*: this section addresses how we will
>>>      deal with non-combinable state in conjunction with merging windows.
>>>      (https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ctxkcgabtzpy)
>>>
>>> Let me know any further comments and suggestions.
>>>
>>> On Tue, May 22, 2018 at 9:29 AM Kenneth Knowles  wrote:
>>>
>>>> Nice. I know that Java users have found it helpful to have this
>>>> lower-level way of writing pipelines when the high-level primitives don't
>>>> quite have the tight control they are looking for. I hope it will be a big
>>>> draw for Python, too.
>>>>
>>>> (commenting on the doc)
>>>>
>>>> Kenn
>>>>
>>>> On Mon, May 21, 2018 at 5:15 PM Charles Chen  wrote:
>>>>
>>>>> I want to share a proposal for adding user state and timer support to
>>>>> the Beam Python SDK and get the community's thoughts on how such an API
>>>>> should look: https://s.apache.org/beam-python-user-state-and-timers
>>>>>
>>>>> Let me know what you think and please add any comments and suggestions
>>>>> you may have.
>>>>>
>>>>> Best,
>>>>> Charles
>>>>>
>>>>
>>

Re: [VOTE] Apache Beam, version 2.5.0, release candidate #1

2018-06-15 Thread Charles Chen

Thank you and sorry for the delay.  Been testing the fix the past few
hours.  This CP PR fixes the issue:
https://github.com/apache/beam/pull/5658.

On Thu, Jun 14, 2018 at 10:25 PM Jean-Baptiste Onofré 
wrote:

> OK, I started the RC2, but I'm stopping the process to cut a new one.
>
> Is it ok from your side ?
>
> Regards
> JB
>
> On 15/06/2018 01:54, Charles Chen wrote:
> > Looks like there is something wrong with PR 5636
> > <https://github.com/apache/beam/pull/5636> which we cherry-picked
> > above.  It breaks leaderboard examples which previously passed.  I've
> > reopened the issue and will update this thread shortly.
> >
> > On Thu, Jun 14, 2018 at 12:55 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> >
> > > Sure, just in time ;)
> > >
> > > Regards
> > > JB
> > >
> > > On 14/06/2018 20:58, Charles Chen wrote:
> > > > Can you also merge the CP https://github.com/apache/beam/pull/5636 for
> > > > https://issues.apache.org/jira/browse/BEAM-4549?
> > > >
> > > > On Thu, Jun 14, 2018 at 6:52 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> > > >
> > > > FYI, I'm starting RC2 right now.
> > > >
> > > > Stay tuned !
> > > >
> > > > Regards
> > > > JB
> > > >
> > > > On 06/06/2018 10:44, Jean-Baptiste Onofré wrote:
> > > > > Hi everyone,
> > > > >
> > > > > Please review and vote on the release candidate #1 for the version
> > > > > 2.5.0, as follows:
> > > > >
> > > > > [ ] +1, Approve the release
> > > > > [ ] -1, Do not approve the release (please provide specific comments)
> > > > >
> > > > > NB: this is the first release using Gradle, so don't be too harsh ;) A
> > > > > PR about the release guide will follow thanks to this release.
> > > > >
> > > > > The complete staging area is available for your review, which includes:
> > > > > * JIRA release notes [1],
> > > > > * the official Apache source release to be deployed to dist.apache.org
> > > > > [2], which is signed with the key with fingerprint C8282E76 [3],
> > > > > * all artifacts to be deployed to the Maven Central Repository [4],
> > > > > * source code tag "v2.5.0-RC1" [5],
> > > > > * website pull request listing the release and publishing the API
> > > > > reference manual [6].
> > > > > * Java artifacts were built with Gradle 4.7 (wrapper) and OpenJDK/Oracle
> > > > > JDK 1.8.0_172 (Oracle Corporation 25.172-b11).
> > > > > * Python artifacts are deployed along with the source release to the
> > > > > dist.apache.org [2].
> > > > >
> > > > > The vote will be open for at least 72 hours. It is adopted by majority
> > > > > approval, with at least 3 PMC affirmative votes.
> > > > >
> > > > > Thanks,
> > > > > JB
> > > > >
> > > > > [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12342847
> > > > > [2] https://dist.apache.org/repos/dist/dev/beam/2.5.0/
> > > > > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> > > > > [4] https://repository.apache.org/content/repositories/orgapachebeam-1041/
> > > > > [5] https://github.com/apache/beam/tree/v2.5.0-RC1
> > > > > [6] https://github.com/apache/beam-site/pull/463
> > > > >
> > > >
> > > > --
> > > > Jean-Baptiste Onofré
> > > > jbono...@apache.org
> > > > http://blog.nanthrax.net
> > > > Talend - http://www.talend.com
> > > >
> > >
> > > --
> > > Jean-Baptiste Onofré
> > > jbono...@apache.org
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> > >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>

Re: [VOTE] Apache Beam, version 2.5.0, release candidate #1

2018-06-14 Thread Charles Chen

Looks like there is something wrong with PR 5636
<https://github.com/apache/beam/pull/5636> which we cherry-picked above.
It breaks leaderboard examples which previously passed.  I've reopened the
issue and will update this thread shortly.

On Thu, Jun 14, 2018 at 12:55 PM Jean-Baptiste Onofré 
wrote:

> Sure, just in time ;)
>
> Regards
> JB
>
> On 14/06/2018 20:58, Charles Chen wrote:
> > Can you also merge the CP https://github.com/apache/beam/pull/5636 for
> > https://issues.apache.org/jira/browse/BEAM-4549?
> >
> > On Thu, Jun 14, 2018 at 6:52 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> >
> > FYI, I'm starting RC2 right now.
> >
> > Stay tuned !
> >
> > Regards
> > JB
> >
> > On 06/06/2018 10:44, Jean-Baptiste Onofré wrote:
> > > Hi everyone,
> > >
> > > Please review and vote on the release candidate #1 for the version
> > > 2.5.0, as follows:
> > >
> > > [ ] +1, Approve the release
> > > [ ] -1, Do not approve the release (please provide specific comments)
> > >
> > > NB: this is the first release using Gradle, so don't be too harsh ;) A
> > > PR about the release guide will follow thanks to this release.
> > >
> > > The complete staging area is available for your review, which includes:
> > > * JIRA release notes [1],
> > > * the official Apache source release to be deployed to dist.apache.org
> > > [2], which is signed with the key with fingerprint C8282E76 [3],
> > > * all artifacts to be deployed to the Maven Central Repository [4],
> > > * source code tag "v2.5.0-RC1" [5],
> > > * website pull request listing the release and publishing the API
> > > reference manual [6].
> > > * Java artifacts were built with Gradle 4.7 (wrapper) and OpenJDK/Oracle
> > > JDK 1.8.0_172 (Oracle Corporation 25.172-b11).
> > > * Python artifacts are deployed along with the source release to the
> > > dist.apache.org [2].
> > >
> > > The vote will be open for at least 72 hours. It is adopted by majority
> > > approval, with at least 3 PMC affirmative votes.
> > >
> > > Thanks,
> > > JB
> > >
> > > [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12342847
> > > [2] https://dist.apache.org/repos/dist/dev/beam/2.5.0/
> > > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> > > [4] https://repository.apache.org/content/repositories/orgapachebeam-1041/
> > > [5] https://github.com/apache/beam/tree/v2.5.0-RC1
> > > [6] https://github.com/apache/beam-site/pull/463
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>

Re: [VOTE] Apache Beam, version 2.5.0, release candidate #1

2018-06-14 Thread Charles Chen

Can you also merge the CP https://github.com/apache/beam/pull/5636 for
https://issues.apache.org/jira/browse/BEAM-4549?

On Thu, Jun 14, 2018 at 6:52 AM Jean-Baptiste Onofré 
wrote:

> FYI, I'm starting RC2 right now.
>
> Stay tuned !
>
> Regards
> JB
>
> On 06/06/2018 10:44, Jean-Baptiste Onofré wrote:
> > Hi everyone,
> >
> > Please review and vote on the release candidate #1 for the version
> > 2.5.0, as follows:
> >
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> > NB: this is the first release using Gradle, so don't be too harsh ;) A
> > PR about the release guide will follow thanks to this release.
> >
> > The complete staging area is available for your review, which includes:
> > * JIRA release notes [1],
> > * the official Apache source release to be deployed to dist.apache.org
> > [2], which is signed with the key with fingerprint C8282E76 [3],
> > * all artifacts to be deployed to the Maven Central Repository [4],
> > * source code tag "v2.5.0-RC1" [5],
> > * website pull request listing the release and publishing the API
> > reference manual [6].
> > * Java artifacts were built with Gradle 4.7 (wrapper) and OpenJDK/Oracle
> > JDK 1.8.0_172 (Oracle Corporation 25.172-b11).
> > * Python artifacts are deployed along with the source release to the
> > dist.apache.org [2].
> >
> > The vote will be open for at least 72 hours. It is adopted by majority
> > approval, with at least 3 PMC affirmative votes.
> >
> > Thanks,
> > JB
> >
> > [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12342847
> > [2] https://dist.apache.org/repos/dist/dev/beam/2.5.0/
> > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> > [4] https://repository.apache.org/content/repositories/orgapachebeam-1041/
> > [5] https://github.com/apache/beam/tree/v2.5.0-RC1
> > [6] https://github.com/apache/beam-site/pull/463
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>

Re: [DISCUSS] Use Confluence wiki for non-user-facing stuff

2018-06-07 Thread Charles Chen

+1. It would be very helpful to have dev-facing walkthroughs / technical
documentation for relevant aspects of the codebase that aren't user-facing.

On Thu, Jun 7, 2018, 1:23 PM Kenneth Knowles  wrote:

> Hi all,
>
> I've been in half a dozen conversations recently about whether to have a
> wiki and what to use it for. Some things I've heard:
>
>  - "why is all this stuff that users don't care about here?"
>  - "can we have a lighter weight place to put technical references for
> contributors"
>
> So I want to consider as a community starting up our wiki. Ideas for what
> could go there:
>
>  - Collection of links to design docs like
> https://beam.apache.org/contribute/design-documents/
>  - Specialized walkthroughs like
> https://beam.apache.org/contribute/docker-images/
>  - Best-effort notes that just try to help out like
> https://beam.apache.org/contribute/intellij/
>  - Docs on in-progress stuff like
> https://beam.apache.org/documentation/runners/jstorm/
>  - Expanded instructions for committers, more than
> https://beam.apache.org/contribute/committer-guide/
>  - BIPs / summaries of collections of JIRA
>  - Docs sitting in markdown in the repo like
> https://github.com/apache/beam/blob/master/sdks/CONTAINERS.md and
> https://github.com/apache/beam-site/blob/asf-site/README.md (which will
> soon not be a toplevel README)
>
> What do you think?
>
> (a) should we do it?
> (b) what should go there?
> (c) what should not go there?
>
> Kenn
>

Re: Proposal for Beam Python User State and Timer APIs

2018-06-05 Thread Charles Chen

Thanks everyone for contributing here.  We've reached rough consensus on
the approach we should take with this API, and I've summarized this in the
new "Community consensus" sections I added to the doc (
https://s.apache.org/beam-python-user-state-and-timers).  I will begin
initial implementation of this API soon.

On Wed, May 23, 2018 at 8:08 PM Thomas Weise  wrote:

> Nice proposal; it's exciting to see this about to be added to the SDK as
> it enables a set of more complex use cases.
>
> I also think that some of the content can later be repurposed as user
> documentation.
>
> Thanks,
> Thomas
>
>
> On Wed, May 23, 2018 at 11:49 AM, Charles Chen  wrote:
>
>> Thanks everyone for the detailed comments and discussions.  It looks like
>> by now, we mostly agree with the requirements and overall direction needed
>> for the API, though there is continuing discussion on specific details.  I
>> want to highlight two new sections of the doc, which address some
>> discussions that have come up:
>>
>>    - *Existing state and transactionality*: this section addresses how
>>      we will address an existing transactionality inconsistency in the
>>      existing Java API.
>>      (https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ofyl9jspiz3b)
>>    - *State for merging windows*: this section addresses how we will
>>      deal with non-combinable state in conjunction with merging windows.
>>      (https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ctxkcgabtzpy)
>>
>> Let me know any further comments and suggestions.
>>
>> On Tue, May 22, 2018 at 9:29 AM Kenneth Knowles  wrote:
>>
>>> Nice. I know that Java users have found it helpful to have this
>>> lower-level way of writing pipelines when the high-level primitives don't
>>> quite have the tight control they are looking for. I hope it will be a big
>>> draw for Python, too.
>>>
>>> (commenting on the doc)
>>>
>>> Kenn
>>>
>>> On Mon, May 21, 2018 at 5:15 PM Charles Chen  wrote:
>>>
>>>> I want to share a proposal for adding user state and timer support to
>>>> the Beam Python SDK and get the community's thoughts on how such an API
>>>> should look: https://s.apache.org/beam-python-user-state-and-timers
>>>>
>>>> Let me know what you think and please add any comments and suggestions
>>>> you may have.
>>>>
>>>> Best,
>>>> Charles
>>>>
>>>
>

Re: Existing transactionality inconsistency in the Beam Java State API

2018-06-05 Thread Charles Chen

lementation strategy.
>>>>
>>>> Kenn
>>>>
>>>> On Thu, May 24, 2018 at 9:40 AM Ben Chambers 
>>>> wrote:
>>>>
>>>>> I think Kenn's second option accurately reflects my memory of the
>>>>> original intentions:
>>>>>
>>>>> 1. I remember we considered either using the Future interface or
>>>>> calling the ReadableState interface a future, and explicitly said "no,
>>>>> future implies asynchrony and that the value returned by `get` won't
>>>>> change over multiple calls, but we want the latest value each time". So, I
>>>>> remember us explicitly considering and rejecting Future, thus the name
>>>>> "ReadableState".
>>>>>
>>>>> 2. The intuition behind the implementation was analogous to a
>>>>> mutable-reference cell in languages like ML / Scheme / etc. The
>>>>> ReadableState is just a pointer to the reference cell. Calling read
>>>>> returns the value currently in the cell. If we have 100 ReadableStates
>>>>> pointing at the same cell, they all get the same value regardless of when
>>>>> they were created. This avoids needing to duplicate/snapshot values at any
>>>>> point in time.
>>>>>
>>>>> 3. ReadLater was added, as noted by Charles, to suggest prefetching
>>>>> the associated value. This was added after benchmarks showed 10x (if I
>>>>> remember correctly) performance improvements in things like
>>>>> GroupAlsoByWindows by minimizing round-trips asking for more state. The
>>>>> intuition being -- if we need to make an RPC to load one state value, we
>>>>> are better off making an RPC to load all the values we need.
>>>>>
>>>>> Overall, I too lean towards maintaining the second interpretation
>>>>> since it seems to be consistent and I believe we had additional reasons
>>>>> for preferring it over futures.
>>>>>
>>>>> Given the confusion, I think strengthening the class documentation
>>>>> makes sense -- I note the only hint of the current behavior is that
>>>>> ReadableState indicates it gets the *current* value (emphasis mine). We
>>>>> should emphasize that and perhaps even mention that the ReadableState
>>>>> should be understood as just a reference or handle to the underlying
>>>>> state, and thus its value will reflect the latest write.
>>>>>
>>>>> Charles, if it helps, the plan I remember regarding prefetching was
>>>>> something like:
>>>>>
>>>>> interface ReadableMapState<K, V> {
>>>>>   ReadableState<V> get(K key);
>>>>>   ReadableState<Iterable<Map.Entry<K, V>>> getIterable();
>>>>>   ReadableState<Map<K, V>> get();
>>>>>   // ... more things ...
>>>>> }
>>>>>
>>>>> Then prefetching a value is `mapState.get(key).readLater()` and
>>>>> prefetching the entire map is `mapState.get().readLater()`, etc.
>>>>>
>>>>> On Wed, May 23, 2018 at 7:13 PM Charles Chen  wrote:
>>>>>
>>>>>> Thanks Kenn.  I think there are two issues to highlight: (1) the API
>>>>>> should allow for some sort of prefetching / batching / background I/O for
>>>>>> state; and (2) it should be clear what the semantics are for reading
>>>>>> (e.g. so we don't have confusing read-after-write behavior).
>>>>>>
>>>>>> The approach I'm leaning towards for (1) is to allow a
>>>>>> state.prefetch() method (to prefetch a value, iterable or [entire] map
>>>>>> state) and maybe something like state.prefetch_key(key) to prefetch a
>>>>>> specific KV in the map.  Issue (2) seems to be okay in either of Kenn's
>>>>>> positions.
>>>>>>
>>>>>> On Wed, May 23, 2018 at 5:33 PM Robert Bradshaw 
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks for laying this out so well, Kenn. I'm also leaning towards
>>>>>>> the second option, despite its drawbacks. (In particular, readLater
>>>>>>> should not influence what's returned at read(), it's just a hint.)
>>>>>>>
>>>>>>> On Wed, May 23, 2018 at 4:43 PM Kenneth Knowles 
>

Re: [VOTE] Code Review Process

2018-06-01 Thread Charles Chen

+1

On Fri, Jun 1, 2018 at 11:20 AM Valentyn Tymofieiev 
wrote:

> +1
>
> On Fri, Jun 1, 2018 at 10:40 AM, Ahmet Altay  wrote:
>
>> +1
>>
>> On Fri, Jun 1, 2018 at 10:37 AM, Kenneth Knowles  wrote:
>>
>>> +1
>>>
>>> On Fri, Jun 1, 2018 at 10:25 AM Thomas Groh  wrote:
>>>
>>>> As we seem to largely have consensus in "Reducing Committer Load for
>>>> Code Reviews"[1], this is a vote to change the Beam policy on Code Reviews
>>>> to require that
>>>>
>>>> (1) At least one committer is involved with the code review, as either
>>>> a reviewer or as the author
>>>> (2) A contributor has approved the change
>>>>
>>>> prior to merging any change.
>>>>
>>>> This changes our policy from its current requirement that at least one
>>>> committer *who is not the author* has approved the change prior to merging.
>>>> We believe that changing this process will improve code review throughput,
>>>> reduce committer load, and engage more of the community in the code review
>>>> process.
>>>>
>>>> Please vote:
>>>> [ ] +1: Accept the above proposal to change the Beam code review/merge
>>>> policy
>>>> [ ] -1: Leave the Code Review policy unchanged
>>>>
>>>> Thanks,
>>>>
>>>> Thomas
>>>>
>>>> [1]
>>>> https://lists.apache.org/thread.html/7c1fde3884fbefacc252b6d4b434f9a9c2cf024f381654aa3e47df18@%3Cdev.beam.apache.org%3E
>>>>
>>>
>>
>

Re: [ANNOUNCEMENT] New committers, May 2018 edition!

2018-06-01 Thread Charles Chen

Congratulations everyone!

On Thu, May 31, 2018, 10:14 PM Pablo Estrada  wrote:

> Thanks to the PMC! Very humbled and excited to keep taking part in this
> great community.
> :)
> -P.
>
>
> On Thu, May 31, 2018, 10:10 PM Tim  wrote:
>
>> Congratulations!
>>
>>
>> Tim
>>
>> On 1 Jun 2018, at 07:05, Andrew Psaltis  wrote:
>>
>> Congrats!
>>
>> On Fri, Jun 1, 2018 at 12:26 AM, Thomas Weise  wrote:
>>
>>> Congrats!
>>>
>>>
>>> On Thu, May 31, 2018 at 9:25 PM, Alan Myrvold 
>>> wrote:
>>>
 Congrats Gris+Pablo+Jason. Well deserved.

 On Thu, May 31, 2018 at 9:15 PM Jason Kuster 
 wrote:

> Thank you to Davor and the PMC; I'm excited to be able to help Beam in
> this new capacity. Bring on the PRs. :D
>
> On Thu, May 31, 2018 at 8:55 PM Xin Wang 
> wrote:
>
>> Congrats!
>>
>> - Xin Wang
>>
>> 2018-06-01 11:52 GMT+08:00 Rui Wang :
>>
>>> Congrats!
>>>
>>> -Rui
>>>
>>> On Thu, May 31, 2018 at 8:23 PM Jean-Baptiste Onofré <
>>> j...@nanthrax.net> wrote:
>>>
 Congrats !

 Regards
 JB

 On 01/06/2018 04:08, Davor Bonaci wrote:
 > Please join me and the rest of Beam PMC in welcoming the following
 > contributors as our newest committers. They have significantly
 > contributed to the project in different ways, and we look forward
 to
 > many more contributions in the future.
 >
 > * Griselda Cuevas
 > * Pablo Estrada
 > * Jason Kuster
 >
 > (Apologies for a delayed announcement, and the lack of the usual
 > paragraph summarizing individual contributions.)
 >
 > Congratulations to all three! Welcome!

>>>
>>
>>
>> --
>> Thanks,
>> Xin
>>
>
>
> --
> ---
> Jason Kuster
> Apache Beam / Google Cloud Dataflow
>
> See something? Say something. go/jasonkuster-feedback
> 
>

>>>
>> --
> Got feedback? go/pabloem-feedback
> 
>

Re: Existing transactionality inconsistency in the Beam Java State API

2018-05-23 Thread Charles Chen

Thanks Kenn.  I think there are two issues to highlight: (1) the API should
allow for some sort of prefetching / batching / background I/O for state;
and (2) it should be clear what the semantics are for reading (e.g. so we
don't have confusing read after write behavior).

The approach I'm leaning towards for (1) is to allow a state.prefetch()
method (to prefetch a value, iterable or [entire] map state) and maybe
something like state.prefetch_key(key) to prefetch a specific KV in the
map.  Issue (2) seems to be okay in either of Kenn's positions.
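
The prefetching idea in (1) can be sketched in a few lines of Python. This
is only an illustration under stated assumptions: MapStateSketch and
FakeStateBackend are hypothetical names, not Beam classes; the method names
prefetch() and prefetch_key() are taken from the message above; and writes
are left out entirely. The point it shows is that a prefetch call is a hint
that costs no I/O by itself, and the first read then loads every hinted key
in one round trip.

```python
class FakeStateBackend:
    """Stands in for a remote state service and counts round trips."""

    def __init__(self, data):
        self._data = dict(data)
        self.round_trips = 0

    def fetch(self, keys):
        # One call is one simulated RPC, however many keys it carries.
        self.round_trips += 1
        return {k: self._data.get(k) for k in keys}

    def fetch_all(self):
        self.round_trips += 1
        return dict(self._data)


class MapStateSketch:
    """Hypothetical map state handle with prefetch hints (not a Beam API)."""

    def __init__(self, backend):
        self._backend = backend
        self._cache = {}
        self._pending = set()
        self._want_all = False
        self._have_all = False

    def prefetch_key(self, key):
        # A hint only: remember the key, do no I/O yet.
        if key not in self._cache:
            self._pending.add(key)

    def prefetch(self):
        # A hint that the entire map will be needed.
        self._want_all = True

    def get(self, key):
        if self._want_all and not self._have_all:
            self._cache.update(self._backend.fetch_all())
            self._have_all = True
            self._pending.clear()
        elif key not in self._cache and not self._have_all:
            # Load the requested key together with every hinted key.
            self._pending.add(key)
            self._cache.update(self._backend.fetch(sorted(self._pending)))
            self._pending.clear()
        return self._cache.get(key)


backend = FakeStateBackend({"a": 1, "b": 2, "c": 3})
state = MapStateSketch(backend)
state.prefetch_key("a")
state.prefetch_key("b")
values = (state.get("a"), state.get("b"))
trips_after_hinted_reads = backend.round_trips
state.get("c")  # not hinted, so it costs a second round trip
trips_after_unhinted_read = backend.round_trips
```

Both hinted keys arrive in a single simulated RPC, while the unhinted key
pays for its own.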

On Wed, May 23, 2018 at 5:33 PM Robert Bradshaw <rober...@google.com> wrote:

> Thanks for laying this out so well, Kenn. I'm also leaning towards the
> second option, despite its drawbacks. (In particular, readLater should
> not influence what's returned at read(), it's just a hint.)
>
> On Wed, May 23, 2018 at 4:43 PM Kenneth Knowles <k...@google.com> wrote:
>
>> Great idea to bring it to dev@. I think it is better to focus here than
>> long doc comment threads.
>>
>> I had strong opinions that I think were a bit confused and wrong. Sorry
>> for that. I stated this position:
>>
>>  - XYZState class is a handle to a mutable location
>>  - its methods like isEmpty() or contents() should return immutable
>> future values (implicitly means their contents are semantically frozen when
>> they are created)
>>  - the fact that you created the future is a hint that all necessary
>> fetching/computation should be kicked off
>>  - later forced with get()
>>  - when it was designed, pure async style was not a viable option
>>
>> I see now that the actual position of some of its original designers is:
>>
>>  - XYZState class is a view on a mutable location
>>  - its methods return new views on that mutable location
>>  - calling readLater() is a hint that some fetching/computation should be
>> kicked off
>>  - later read() will combine whatever readLater() did with additional
>> local info to give the current value
>>  - async style not applicable nor desirable as per Beam's focus on naive
>> straight-line coding + autoscaling
>>
>> These are both internally consistent I think. In fact, I like the second
>> perspective better than the one I have been promoting. There are some
>> weaknesses: readLater() is pretty tightly coupled to a particular
>> implementation style, and futures are decades old so you can get good APIs
>> and performance without inventing anything. But I still like the non-future
>> version a little better.
>>
>> Kenn
>>
>> On Wed, May 23, 2018 at 4:05 PM Charles Chen <c...@google.com> wrote:
>>
>>> During the design of the Beam Python State API, we noticed some
>>> transactionality inconsistencies in the existing Beam Java State API (these
>>> are the unresolved bugs BEAM-2980
>>> <https://issues.apache.org/jira/browse/BEAM-2980> and BEAM-2975
>>> <https://issues.apache.org/jira/browse/BEAM-2975>).  We are therefore
>>> having a discussion about this API which can have implications for its
>>> future development in all Beam languages:
>>> https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ofyl9jspiz3b
>>>
>>> If you have an opinion on the possible design approaches, it would be
>>> very helpful to bring up in the doc or on this thread.  Thanks!
>>>
>>> Best,
>>> Charles
>>>
>>
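
The two positions discussed in this thread differ in what a read returns
after a later write. A minimal Python sketch of that difference follows,
with the caveat that Cell, LiveView and FrozenFuture are hypothetical names
made up for illustration and are not part of any Beam SDK.

```python
class Cell:
    """A mutable location, standing in for one piece of state."""

    def __init__(self, value):
        self.value = value


class LiveView:
    """Second position: a view on the location; read() sees the latest write."""

    def __init__(self, cell):
        self._cell = cell

    def read_later(self):
        # Only a hint that a fetch may be started; it never changes the result.
        return self

    def read(self):
        return self._cell.value


class FrozenFuture:
    """First position: contents are frozen when the future is created."""

    def __init__(self, cell):
        self._snapshot = cell.value

    def get(self):
        return self._snapshot


cell = Cell(1)
view = LiveView(cell).read_later()
future = FrozenFuture(cell)
cell.value = 2  # a write that happens after both handles were created

live_result = view.read()      # reflects the write
frozen_result = future.get()   # still the value at creation time
```

Under the view semantics every handle on the same cell returns the same
value regardless of when it was created, so nothing has to be snapshotted.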

Existing transactionality inconsistency in the Beam Java State API

2018-05-23 Thread Charles Chen

During the design of the Beam Python State API, we noticed some
transactionality inconsistencies in the existing Beam Java State API (these
are the unresolved bugs BEAM-2980
<https://issues.apache.org/jira/browse/BEAM-2980> and BEAM-2975
<https://issues.apache.org/jira/browse/BEAM-2975>).  We are therefore
having a discussion about this API which can have implications for its
future development in all Beam languages:
https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ofyl9jspiz3b

If you have an opinion on the possible design approaches, it would be very
helpful to bring up in the doc or on this thread.  Thanks!

Best,
Charles

Re: Proposal for Beam Python User State and Timer APIs

2018-05-23 Thread Charles Chen

Thanks everyone for the detailed comments and discussions.  It looks like
by now, we mostly agree with the requirements and overall direction needed
for the API, though there is continuing discussion on specific details.  I
want to highlight two new sections of the doc, which address some
discussions that have come up:

   - *Existing state and transactionality*: this section describes how we
   will address a transactionality inconsistency in the existing Java API.
   (https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ofyl9jspiz3b)
   - *State for merging windows*: this section describes how we will deal
   with non-combinable state in conjunction with merging windows.
   (https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ctxkcgabtzpy)

Let me know if you have any further comments or suggestions.

On Tue, May 22, 2018 at 9:29 AM Kenneth Knowles <k...@google.com> wrote:

> Nice. I know that Java users have found it helpful to have this
> lower-level way of writing pipelines when the high-level primitives don't
> quite have the tight control they are looking for. I hope it will be a big
> draw for Python, too.
>
> (commenting on the doc)
>
> Kenn
>
> On Mon, May 21, 2018 at 5:15 PM Charles Chen <c...@google.com> wrote:
>
>> I want to share a proposal for adding user state and timer support to the
>> Beam Python SDK and get the community's thoughts on how such an API should
>> look: https://s.apache.org/beam-python-user-state-and-timers
>>
>> Let me know what you think and please add any comments and suggestions
>> you may have.
>>
>> Best,
>> Charles
>>
>

Proposal for Beam Python User State and Timer APIs

2018-05-21 Thread Charles Chen

I want to share a proposal for adding user state and timer support to the
Beam Python SDK and get the community's thoughts on how such an API should
look: https://s.apache.org/beam-python-user-state-and-timers

Let me know what you think and please add any comments and suggestions you
may have.

Best,
Charles

Re: [PROPOSAL] Preparing 2.5.0 release next week

2018-05-04 Thread Charles Chen

I have added https://issues.apache.org/jira/browse/BEAM-4236 as a blocker.

On Fri, May 4, 2018 at 1:19 PM Ahmet Altay  wrote:

> Hi JB,
>
> We found an issue related to using side inputs in streaming mode using
> python SDK. Charles is currently trying to find the root cause. Would you
> be able to give him some additional time to investigate the issue?
>
> Charles, do you have a JIRA issue on the blocker list?
>
> Thank you everyone for understanding.
>
> Ahmet
>
> On Fri, May 4, 2018 at 8:52 AM, Jean-Baptiste Onofré 
> wrote:
>
>> Hi
>>
>> I have a couple of PRs I would like to include. I would also like to take
>> the weekend for new builds and tests.
>>
>> If it works for everyone I propose to start the release process Tuesday.
>>
>> Thoughts ?
>>
>> Regards
>> JB
>> On May 4, 2018, at 17:49, Scott Wegner wrote:
>>>
>>> Hi JB, any idea when you will begin the release? Boyuan has a couple
>>> Python PRs [1] [2] that are ready to merge, but we'd like to wait until
>>> after the release branch is cut in case there is some performance
>>> regression.
>>>
>>> [1] https://github.com/apache/beam/pull/4741
>>> [2] https://github.com/apache/beam/pull/4925
>>>
>>> On Tue, May 1, 2018 at 9:25 AM Scott Wegner  wrote:
>>>
 Sounds good, thanks J.B. Feel free to ping if you need anything.

 On Mon, Apr 30, 2018 at 10:12 PM Jean-Baptiste Onofré 
 wrote:

> That's a good idea ! I think using Slack to ping/ask is a good way as
> it's async.
>
> Regards
> JB
>
> On 05/01/2018 06:51 AM, Reuven Lax wrote:
> > I think it makes sense to have someone who hadn't done the Gradle migration
> > run the release. However, would it make sense for someone who did work on
> > the migration to partner with you, JB? There may be issues that are simply
> > due to things that were not documented well. In that case the partner can
> > quickly help resolve them, and can then be the one who makes sure that the
> > documentation is updated.
> >
> > Reuven
> >
> > On Mon, Apr 30, 2018 at 9:36 PM Jean-Baptiste Onofré <
> j...@nanthrax.net
> > > wrote:
> >
> > Hi Scott,
> >
> > Thanks for the update. The Gradle build crashed on my machine (not related
> > to Gradle). I launched a new one.
> >
> > I volunteer to cut the release: I think I know Gradle decently, and even if
> > I didn't work on the Gradle "migration" during the last two weeks, I think
> > that's actually better: I have an "external" view on the latest changes.
> >
> > Thoughts ?
> >
> > Regards
> > JB
> >
> > On 05/01/2018 02:05 AM, Scott Wegner wrote:
> > > Welcome back JB!
> > >
> > > I just sent a separate update about Gradle [1]: the build migration is
> > > complete and the release documentation has been updated.
> > >
> > > I recommend we produce the 2.5.0 release using Gradle. Having a successful
> > > release should be the final validation before declaring the Gradle
> > > migration complete. So the sooner we can have a Gradle release, the sooner
> > > we can get back to a single build system :)
> > >
> > > If it would be helpful, I suggest that somebody who's been working on the
> > > Gradle migration manage the 2.5.0 release. That way if we encounter any
> > > issues from the build system, they should have sufficient expertise to fix
> > > them.
> > >
> > >
> > [1]
> https://lists.apache.org/thread.html/e543b3850bfc4950d57bc18624e1d4131324c6cf691fd10034947cad@%3Cdev.beam.apache.org%3E
>
> > >
> > > On Mon, Apr 30, 2018 at 11:38 AM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
> > >
> > > On Apr 30, 2018 at 19:39, "Jean-Baptiste Onofré" <j...@nanthrax.net> wrote:
> > >
> > > Hi guys,
> > >
> > > now that I'm back from vacation, I'm bringing the 2.5.0 release back to
> > > the table ;)
> > >
> > > This is also related to the current status of the build (Maven/Gradle).
> > >
> > > FYI, I'm going to start the Jira triage tomorrow and I launched a couple of
> > > builds on my machine (both Maven and Gradle) to get an update on the current
> >

Re: Pubsub on directrunner: direct_runner.py and transform_evaluator.py

2018-04-29 Thread Charles Chen

The write can be done as a normal ParDo / DoFn. The read needs to expose
some watermark logic, which at the time of writing wasn't available, since
no unbounded source API was available. We may be able to write the read /
source as a SplittableDoFn since that API was introduced as an unbounded
source API.
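
To make the asymmetry concrete, here is a rough Python sketch that does not
use Beam at all. The names write_all and UnboundedReaderSketch are
hypothetical, and the watermark rule is a simplification that assumes no
message arrives with a timestamp older than one already buffered. It shows
that the write side needs only per-element work, while the read side also
has to answer "how far has event time advanced?".

```python
import heapq
import itertools


def write_all(elements, publish):
    """Write side: per-element work only, the shape of a plain DoFn."""
    for element in elements:
        publish(element)


class UnboundedReaderSketch:
    """Hypothetical unbounded reader that also reports a watermark."""

    def __init__(self):
        self._pending = []            # min-heap of (timestamp, seq, payload)
        self._seq = itertools.count() # tie-breaker so payloads are never compared
        self._last_emitted = float("-inf")

    def receive(self, timestamp, payload):
        heapq.heappush(self._pending, (timestamp, next(self._seq), payload))

    def read(self):
        if not self._pending:
            return None
        timestamp, _, payload = heapq.heappop(self._pending)
        self._last_emitted = max(self._last_emitted, timestamp)
        return timestamp, payload

    def watermark(self):
        # Oldest buffered timestamp, or the last emitted one if nothing is buffered.
        if self._pending:
            return self._pending[0][0]
        return self._last_emitted


published = []
write_all(["a", "b"], published.append)

reader = UnboundedReaderSketch()
reader.receive(10, "x")
reader.receive(5, "y")
watermark_before = reader.watermark()
first = reader.read()
watermark_after = reader.watermark()
```

The runner needs something like watermark() to decide when windows can
close, which is the extra machinery the read path has to carry.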

On Sat, Apr 28, 2018, 5:56 AM Udi Meiri  wrote:

> Hi,
> I'm having trouble understanding why there's an extra level of indirection
> when doing pubsub reads via directrunner vs writes.
>
> For reads, we have these translations:
> beam_pubsub.ReadFromPubSub ->
> direct_runner._DirectReadFromPubSub ->
> transform_evaluator._PubSubReadEvaluator
>
> For writes, this is abbreviated:
> beam_pubsub.WriteStringsToPubSub ->
> _DirectWriteToPubSub
>
> What is the role of transform_evaluator._TransformEvaluator?
> Why do we need it for reads and not for writes?
>
>

Re: [VOTE] Release 2.4.0, release candidate #3

2018-03-19 Thread Charles Chen

+1.  Verified the Python Quickstart on local and Dataflow (Mac / Linux).
Also verified that the Mac / Linux wheels were built correctly with fast /
compiled Cython coder support.

On Mon, Mar 19, 2018 at 1:49 PM Robert Bradshaw  wrote:

> Thanks!
>
> BTW, in case anyone's wondering where the md5 files went, they're now
> discouraged: http://www.apache.org/dev/release-distribution#sigs-and-sums
>
>
> On Mon, Mar 19, 2018 at 12:53 PM Lukasz Cwik  wrote:
>
>> +1 (binding), verified Java quickstart on Apex local, DirectRunner,
>> Dataflow, Flink local, Spark local.
>>
>>
>> On Mon, Mar 19, 2018 at 3:54 AM Romain Manni-Bucau 
>> wrote:
>>
>>> -0 (because of the teardown issue which is still a blocker), otherwise
>>> spark/direct runners work in my projects
>>>
>>>
>>> Romain Manni-Bucau
>>> @rmannibucau | Blog | Old Blog | Github | LinkedIn | Book
>>>
>>> 2018-03-17 10:19 GMT+01:00 Robert Bradshaw :
>>>
>>>> Hi everyone,
>>>>
>>>> Please review and vote on the release candidate #3 for the version
>>>> 2.4.0, as follows:
>>>> [ ] +1, Approve the release
>>>> [ ] -1, Do not approve the release (please provide specific comments)
>>>>
>>>> The complete staging area is available for your review, which includes:
>>>> * JIRA release notes [1],
>>>> * the official Apache source release to be deployed to dist.apache.org
>>>> [2], which is signed with the key with fingerprint BDC9 89B0 1BD2 A463
>>>> 6010 A1CA 8F15 5E09 610D 69FB [3],
>>>> * all artifacts to be deployed to the Maven Central Repository [4],
>>>> * source code tag "v2.4.0-RC3" [5],
>>>> * website pull request listing the release and publishing the API
>>>> reference manual [6].
>>>> * Java artifacts were built with Maven 3.2.5 and OpenJDK 1.8.0_112.
>>>> * Python artifacts are deployed along with the source release to the
>>>> dist.apache.org [2].
>>>>
>>>> The validation spreadsheet is available at
>>>>
>>>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit?ts=5a1c7310#gid=1663314475
>>>>
>>>> The vote will be open for at least 72 hours. It is adopted by majority
>>>> approval, with at least 3 PMC affirmative votes.
>>>>
>>>> Thanks,
>>>> - Robert
>>>>
>>>> [1]
>>>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12342682&projectId=12319527
>>>> [2] https://dist.apache.org/repos/dist/dev/beam/2.4.0/
>>>> [3] https://dist.apache.org/repos/dist/dev/beam/KEYS
>>>> [4]
>>>> https://repository.apache.org/content/repositories/orgapachebeam-1031/
>>>> [5] https://github.com/apache/beam/tree/v2.4.0-RC3
>>>> [6] https://github.com/apache/beam-site/pull/398
>>>>
>>>
>>>

Re: [VOTE] Release 2.4.0, release candidate #2

2018-03-09 Thread Charles Chen

Thank you Valentyn for reporting this.  I have traced the issue back to
https://github.com/apache/beam/pull/4666, so I have sent out a PR to fix:
https://github.com/apache/beam/pull/4846.

On Fri, Mar 9, 2018 at 2:17 PM, Valentyn Tymofieiev 
wrote:

> -1.
>
> Checked Python Quickstarts (Passed) and Python MobileGaming on
> DirectRunner. I observe an issue in BQ sink for hourly teams score example:
>
> https://issues.apache.org/jira/browse/BEAM-3824
>
> On Fri, Mar 9, 2018 at 10:49 AM, Lukasz Cwik  wrote:
>
>> I checked that word count quickstarts (except Dataflow) worked for RC2 to
>> hopefully prevent an RC4.
>>
>> On Fri, Mar 9, 2018 at 10:29 AM, Robert Bradshaw 
>> wrote:
>>
>>> Thanks, Alan, for pointing this out. I see this now, and it looks like I
>>> need to finish building the dataflow workers so they have something to
>>> point to. I will do this and release an RC3 once that's ready.
>>>
>>> In the meantime, it'd be great if we could validate everything else about
>>> this RC such that when this one-line, dataflow-only change is out we won't
>>> have any further surprises. I see Luke went through the Java Quickstart
>>> examples, thanks!
>>>
>>>
>>> On Thu, Mar 8, 2018 at 3:48 PM Lukasz Cwik  wrote:
>>>
>>> > Yes, the release guide has a segment "Update release specific
>>> configurations" that has a tidbit about this.
>>>
>>> > On Thu, Mar 8, 2018 at 3:45 PM, Alan Myrvold 
>>> wrote:
>>>
>>> >> The dataflow java worker version wasn't updated on the branch as in
>>> >> past releases ... should it be?
>>> >> https://issues.apache.org/jira/browse/BEAM-3815
>>>
>>>
>>> >> On Thu, Mar 8, 2018 at 1:40 PM Romain Manni-Bucau <
>>> rmannibu...@gmail.com>
>>> wrote:
>>>
>>> >>> Can still be provided as a generic one (like an offset- or key-based
>>> >>> one) but good enough for now, right? It was just surprising not to see
>>> >>> it when checking the breakage.
>>>
>>> >>> On Mar 8, 2018 at 22:05, "Eugene Kirpichov" wrote:
>>> >>> All SDF-related method annotations in DoFn are marked @Experimental.
>>> >>> I guess that should apply to RestrictionTracker too, but I wouldn't be
>>> >>> too worried about that, since it only makes sense to use in the context
>>> >>> of those methods.
>>>
>>> >>> On Thu, Mar 8, 2018 at 12:36 PM Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>
>>>  Hmm, is the SDF API missing some @Experimental annotations then?
>>>
>>>  To clarify: for waitUntilFinish I'm fine with 2.4 as it is, but I can't
>>>  +1 or +0 since none of my tests pass reliably in the current state without
>>>  a retry strategy, making the call useless.
>>>
>>>  On Mar 8, 2018 at 21:02, "Reuven Lax" wrote:
>>>
>>> > Does Nexmark use SerializableCoder?
>>>
>>>
>>> > On Thu, Mar 8, 2018 at 10:42 AM Robert Bradshaw <
>>> rober...@google.com>
>>> wrote:
>>>
>>> >> The validation checklist spreadsheet is up at
>>> >> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit?ts=5a1c7310#gid=1663314475
>>>
>>> >> Regarding the direct runner regression on query 10, this is
>>> understandable given how mutation detection has been changed for
>>> serializable coders (and should be tracked, probably fixed by avoiding
>>> SerializableCoder). It should not affect other runners. Could you file a
>>> bug?
>>>
>>> >> Regarding waitUntilFinish, this is a bug but not a blocker--it's
>>> been this way since teardown was introduced. There are many nice-to-haves
>>> that one could merge from master to the release branch, but we've seen
>>> where that trend leads.
>>>
>>> >> Regarding the backwards incompatible changes in restriction tracker,
>>> >> this is (as I understand it) a change to the experimental SDF API.
>>> >> Eugene, do you want to comment on this?
>>>
>>>
>>>
>>> >> On Thu, Mar 8, 2018 at 2:07 AM Ismaël Mejía 
>>> wrote:
>>>
>>> >>> I confirm that the new release fixes both problems reported previously:
>>>
>>> >>> - python package name
>>> >>> - nexmark query 10 mutability issue with the direct runner.
>>>
>>> >>> One extra regression is that the fix produced a much longer
>>> >>> execution time on the query.
>>> >>> Not sure if it's a blocker, but worth tracking.
>>>
>>> >>> Query 10 - Batch/Bounded
>>> >>> Version  Runtime(sec)  Events(/sec)  Results
>>> >>> 2.3.0             3.6       27609.1        1
>>> >>> 2.4.0            30.8        3244.3        1
>>>
>>> >>> Query 10 - Streaming/Unbounded
>>> >>> Version  Runtime(sec)  Events(/sec)  Results
>>> >>> 2.3.0             6.3       15873.0        1
>>> >>> 2.4.0           101.1         989.4        1
>>>
>>> >>> On Thu, Mar 8, 2018 at 8:54 AM, Romain Manni-Bucau
>>> >>>  wrote:
>>> >>> > -1:
>>> >>> > a) still consider

Re: github reviews weirdness

2018-02-27 Thread Charles Chen

I noticed that GitHub sometimes has two "copies" of a comment thread--the
first copy, the one that appears first on the page with the original
commenter, is the only one that allows comments; a second "copy" is created
when people do reviews.  So maybe you can scroll up to find the right
"copy" of the thread to comment on.

On Tue, Feb 27, 2018, 11:34 PM Romain Manni-Bucau 
wrote:

> Hey guys,
>
> Noticed that on some PRs it is no longer possible to reply to review
> comments. This is quite bothersome because it splits the review comments
> and makes them quite hard to follow once you get more than one iteration.
>
> Anything important I'm missing here? Should I use another tool? If so, why
> not GitHub, which is mainstream? Is it a "test" thing?
>
>
> Romain Manni-Bucau
> @rmannibucau | Blog | Old Blog | Github | LinkedIn | Book
>

Re: Beam 2.4.0

2018-02-20 Thread Charles Chen

I would like to +1 the faster release cycle process JB and Robert have been
advocating and implementing, and thank JB for releasing 2.3.0 smoothly.
When we block for specific features and increase the time between releases,
we increase the urgency for PR authors to push for their change to go into
an upcoming release, which is a feedback loop that results in our releases
taking months instead of weeks.  We should however try to get pending PRs
wrapped up.

On Tue, Feb 20, 2018 at 2:15 PM Romain Manni-Bucau 
wrote:

> Kind of agree, but the rhythm was supposed to be 6 weeks IIRC; 2.3 is just
> out, so 1 week is a bit fast IMHO.
>
> On Feb 20, 2018 at 23:13, "Robert Bradshaw" wrote:
>
>> One of the main shifts that I think helped this release was explicitly
>> not being feature driven, rather releasing what's already in the
>> branch. That doesn't mean it's not a good call to action to try and
>> get long-pending PRs or similar wrapped up.
>>
>> On Tue, Feb 20, 2018 at 2:10 PM, Romain Manni-Bucau
>>  wrote:
>> > There are a lot of long-pending PRs; it would be good to merge them
>> > before 2.4. Some bring tests for the 2.3 release, which can be critical
>> > to include.
>> >
>> > Maybe we should list the PRs and JIRAs we want in before picking a date?
>> >
>> > On Feb 20, 2018 at 22:02, "Konstantinos Katsiapis" wrote:
>> >> +1 since tf.transform 0.6 depends on Beam 2.4 and Tensorflow 1.6 (and
>> >> the latter already has an RC out, so we will likely be blocked on Beam).
>> >>
>> >> On Tue, Feb 20, 2018 at 12:50 PM, Robert Bradshaw wrote:
>> >>>
>> >>> Now that Beam 2.3.0 went out (and in record time, kudos to all that
>> >>> made this happen!), it'd be great to keep the ball rolling for a
>> >>> similarly well-executed 2.4. A lot has gone in [1] since we made the
>> >>> 2.3 cut, and to keep our cadence up I would propose a time-based cut
>> >>> date early next week (say the 28th).
>> >>>
>> >>> I'll volunteer to do this release.
>> >>>
>> >>> [1] https://github.com/apache/beam/compare/release-2.3.0...master
>> >>
>> >>
>> >>
>> >>
>> >> --
>> >> Gus Katsiapis | Software Engineer | katsia...@google.com | 650-918-7487
>>
>

Re: A 15x speed-up in local Python DirectRunner execution

2018-02-16 Thread Charles Chen

I hope those interested have had time to test this out.  I have sent out
https://github.com/apache/beam/pull/4696 to switch to using this fast
runner as the default DirectRunner for local execution.  Let me know if
there are any concerns.

On Tue, Feb 13, 2018 at 12:17 PM Charles Chen <c...@google.com> wrote:

> This is now checked into master.  You can use it by setting
> --runner=SwitchingDirectRunner.  Please let us know if you run into any
> issues.
>
>
> On Thu, Feb 8, 2018 at 10:30 AM Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>
>> Very interesting! Sounds like a sane way for beam future and I'm very
>> happy it is consistent with the current Java experience: no need to
>> interlace runners at the end, it makes design, code and user experience way
>> better than trying to put everything in the direct runner :).
>>
>> On Feb 8, 2018 at 19:20, "María García Herrero" <mari...@google.com> wrote:
>>
>>> Amazing improvement, Charles.
>>> Thanks for the effort!
>>>
>>>
>>> On Thu, Feb 8, 2018 at 10:14 AM Eugene Kirpichov <kirpic...@google.com>
>>> wrote:
>>>
>>>> Sounds awesome, congratulations and thanks for making this happen!
>>>>
>>>> On Thu, Feb 8, 2018 at 10:07 AM Raghu Angadi <rang...@google.com>
>>>> wrote:
>>>>
>>>>> This is terrific news! Thanks Charles.
>>>>>
>>>>> On Wed, Feb 7, 2018 at 5:55 PM, Charles Chen <c...@google.com> wrote:
>>>>>
>>>>>> Local execution of Beam pipelines on the Python DirectRunner
>>>>>> currently suffers from performance issues, which makes it hard for
>>>>>> pipeline authors to iterate, especially on medium to large size
>>>>>> datasets.  We would like to optimize and make this a better experience
>>>>>> for Beam users.
>>>>>>
>>>>>> The FnApiRunner was written as a way of leveraging the portability
>>>>>> framework execution code path for local portability development. We've
>>>>>> found it also provides great speedups in batch execution with no user
>>>>>> changes required, so we propose to switch to use this runner by default
>>>>>> in batch pipelines.  For example, WordCount on the Shakespeare dataset
>>>>>> with a single CPU core now takes 50 seconds to run, compared to 12
>>>>>> minutes before; this is a 15x performance improvement that users can
>>>>>> get for free, with no user pipeline changes.
>>>>>>
>>>>>> The JIRA for this change is here (
>>>>>> https://issues.apache.org/jira/browse/BEAM-3644), and a candidate
>>>>>> patch is available here (https://github.com/apache/beam/pull/4634).
>>>>>> I have been working over the last month on making this an automatic
>>>>>> drop-in replacement for the current DirectRunner when applicable.
>>>>>> Before it becomes the default, you can try this runner now by manually
>>>>>> specifying apache_beam.runners.portability.fn_api_runner.FnApiRunner as
>>>>>> the runner.
>>>>>>
>>>>>> Even with this change, local Python pipeline execution can only
>>>>>> effectively use one core because of the Python GIL.  A natural next
>>>>>> step to further improve performance will be to refactor the FnApiRunner
>>>>>> to allow for multi-process execution.  This is being tracked here (
>>>>>> https://issues.apache.org/jira/browse/BEAM-3645).
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Charles
>>>>>>
>>>>>
>>>
>>> --
>>>
>>> Impact is the effect that wouldn’t have happened if you hadn’t done what you
>>> did.
>>>
>>>
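
As a quick check on the headline number, the two timings quoted in the
original announcement (12 minutes before, 50 seconds after) can be divided
directly. This is plain arithmetic on the figures given in the thread, not
a new measurement.

```python
before_seconds = 12 * 60  # "12 minutes before"
after_seconds = 50        # "now takes 50 seconds"

speedup = before_seconds / after_seconds
print(f"speedup: {speedup:.1f}x")
```

The ratio works out to a little over 14x, which the announcement rounds to
15x.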

Re: A 15x speed-up in local Python DirectRunner execution

2018-02-13 Thread Charles Chen

This is now checked into master.  You can use it by setting
--runner=SwitchingDirectRunner.  Please let us know if you run into any
issues.


On Thu, Feb 8, 2018 at 10:30 AM Romain Manni-Bucau <rmannibu...@gmail.com>
wrote:

> Very interesting! Sounds like a sane way for beam future and I'm very
> happy it is consistent with the current Java experience: no need to
> interlace runners at the end, it makes design, code and user experience way
> better than trying to put everything in the direct runner :).
>
> On Feb 8, 2018 at 19:20, "María García Herrero" <mari...@google.com> wrote:
>
>> Amazing improvement, Charles.
>> Thanks for the effort!
>>
>>
>> On Thu, Feb 8, 2018 at 10:14 AM Eugene Kirpichov <kirpic...@google.com>
>> wrote:
>>
>>> Sounds awesome, congratulations and thanks for making this happen!
>>>
>>> On Thu, Feb 8, 2018 at 10:07 AM Raghu Angadi <rang...@google.com> wrote:
>>>
>>>> This is terrific news! Thanks Charles.
>>>>
>>>> On Wed, Feb 7, 2018 at 5:55 PM, Charles Chen <c...@google.com> wrote:
>>>>
>>>>> Local execution of Beam pipelines on the Python DirectRunner currently
>>>>> suffers from performance issues, which makes it hard for pipeline authors
>>>>> to iterate, especially on medium to large size datasets.  We would like to
>>>>> optimize and make this a better experience for Beam users.
>>>>>
>>>>> The FnApiRunner was written as a way of leveraging the portability
>>>>> framework execution code path for local portability development. We've
>>>>> found it also provides great speedups in batch execution with no user
>>>>> changes required, so we propose to switch to use this runner by default in
>>>>> batch pipelines.  For example, WordCount on the Shakespeare dataset with a
>>>>> single CPU core now takes 50 seconds to run, compared to 12 minutes before;
>>>>> this is a 15x performance improvement that users can get for free,
>>>>> with no user pipeline changes.
>>>>>
>>>>> The JIRA for this change is here (
>>>>> https://issues.apache.org/jira/browse/BEAM-3644), and a candidate
>>>>> patch is available here (https://github.com/apache/beam/pull/4634). I
>>>>> have been working over the last month on making this an automatic drop-in
>>>>> replacement for the current DirectRunner when applicable.  Before it
>>>>> becomes the default, you can try this runner now by manually specifying
>>>>> apache_beam.runners.portability.fn_api_runner.FnApiRunner as the
>>>>> runner.
>>>>>
>>>>> Even with this change, local Python pipeline execution can only
>>>>> effectively use one core because of the Python GIL.  A natural next step to
>>>>> further improve performance will be to refactor the FnApiRunner to allow
>>>>> for multi-process execution.  This is being tracked here (
>>>>> https://issues.apache.org/jira/browse/BEAM-3645).
>>>>>
>>>>> Best,
>>>>>
>>>>> Charles
>>>>>
>>>>
>>
>> --
>>
>> Impact is the effect that wouldn’t have happened if you hadn’t done what you
>> did.
>>
>>

Proposal: build Python wheel distributions for Apache Beam releases

2018-02-12 Thread Charles Chen

Currently, Apache Beam distributes Python packages through pip and PyPI.
On PyPI, developers can release source tarballs, precompiled "wheel"
distributions for each platform, or both; a wheel is used when one is
available for the user's platform.  Currently, we only distribute the
source tarballs, so any user who installs Beam using "pip install
apache_beam" has to have a compiler and toolchain installed to take
advantage of Cython optimizations in Beam (which require compiled C code).
If such a compiler is not available, Beam is currently configured to
install anyway, but will use slower Python codepaths instead of the more
optimized ones (for example, for Coder encoding / decoding).
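
The fallback described above is commonly implemented with a guarded import.
The sketch below shows that general pattern only, not Beam's actual code; the
module name _fast_coder_impl and the function encode_varint are hypothetical
names chosen for this illustration.

```python
try:
    # Hypothetical compiled (Cython) extension; it would be present only if
    # it was built at install time or shipped inside a binary wheel.
    from _fast_coder_impl import encode_varint
    USING_COMPILED = True
except ImportError:
    USING_COMPILED = False

    def encode_varint(value):
        """Pure-Python fallback: encode a non-negative int as a varint."""
        if value < 0:
            raise ValueError("value must be non-negative")
        out = bytearray()
        while True:
            bits = value & 0x7F
            value >>= 7
            if value:
                out.append(bits | 0x80)  # high bit set: more bytes follow
            else:
                out.append(bits)
                return bytes(out)


print(USING_COMPILED, encode_varint(300).hex())
```

With a wheel that ships the compiled module, the first branch would be taken
and users would get the fast path without needing a compiler.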

I would like to propose that we start distributing binary wheel
distributions for our releases, for common platforms like Windows / Mac /
Linux.  We could potentially use a method similar to this one (
https://github.com/MacPython/cython-wheels) for building these wheel
distributions.  Thoughts?

Best,
Charles

Re: A 15x speed-up in local Python DirectRunner execution

2018-02-07 Thread Charles Chen

The existing DirectRunner will be needed for the foreseeable future since
it is currently the only local runner that supports streaming execution.

On Wed, Feb 7, 2018, 6:39 PM Pablo Estrada <pabl...@google.com> wrote:

> Very cool Charles! Have you considered whether you'll want to remove the
> direct runner code afterwards?
> Best
> -P.
>
> On Wed, Feb 7, 2018, 6:25 PM Lukasz Cwik <lc...@google.com> wrote:
>
>> That is pretty awesome.
>>
>> On Wed, Feb 7, 2018 at 5:55 PM, Charles Chen <c...@google.com> wrote:
>>
>>> Local execution of Beam pipelines on the Python DirectRunner currently
>>> suffers from performance issues, which makes it hard for pipeline authors
>>> to iterate, especially on medium to large size datasets.  We would like to
>>> optimize and make this a better experience for Beam users.
>>>
>>> The FnApiRunner was written as a way of leveraging the portability
>>> framework execution code path for local portability development. We've
>>> found it also provides great speedups in batch execution with no user
>>> changes required, so we propose to switch to use this runner by default in
>>> batch pipelines.  For example, WordCount on the Shakespeare dataset with a
>>> single CPU core now takes 50 seconds to run, compared to 12 minutes before;
>>> this is a 15x performance improvement that users can get for free, with
>>> no user pipeline changes.
>>>
>>> The JIRA for this change is here (
>>> https://issues.apache.org/jira/browse/BEAM-3644), and a candidate patch
>>> is available here (https://github.com/apache/beam/pull/4634). I have
>>> been working over the last month on making this an automatic drop-in
>>> replacement for the current DirectRunner when applicable.  Before it
>>> becomes the default, you can try this runner now by manually specifying
>>> apache_beam.runners.portability.fn_api_runner.FnApiRunner as the
>>> runner.
>>>
>>> Even with this change, local Python pipeline execution can only
>>> effectively use one core because of the Python GIL.  A natural next step to
>>> further improve performance will be to refactor the FnApiRunner to allow
>>> for multi-process execution.  This is being tracked here (
>>> https://issues.apache.org/jira/browse/BEAM-3645).
>>>
>>> Best,
>>>
>>> Charles
>>>
>>
>>
>
> --
> Got feedback? go/pabloem-feedback
> <https://goto.google.com/pabloem-feedback>
>

A 15x speed-up in local Python DirectRunner execution

2018-02-07 Thread Charles Chen

Local execution of Beam pipelines on the Python DirectRunner currently
suffers from performance issues, which makes it hard for pipeline authors
to iterate, especially on medium to large size datasets.  We would like to
optimize and make this a better experience for Beam users.

The FnApiRunner was written as a way of leveraging the portability
framework execution code path for local portability development. We've
found it also provides great speedups in batch execution with no user
changes required, so we propose to switch to use this runner by default in
batch pipelines.  For example, WordCount on the Shakespeare dataset with a
single CPU core now takes 50 seconds to run, compared to 12 minutes before;
this is a 15x performance improvement that users can get for free, with no
user pipeline changes.
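
As a quick check of the figure above, the ratio of the two quoted wall-clock
times can be computed directly (the variable names are only illustrative):

```python
# Wall-clock times quoted above for WordCount on the Shakespeare dataset.
old_seconds = 12 * 60  # 12 minutes on the previous DirectRunner
new_seconds = 50       # 50 seconds on the FnApiRunner

speedup = old_seconds / new_seconds
print(f"speed-up: {speedup:.1f}x")
```

This works out to about 14.4x, which the subject line rounds to 15x.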

The JIRA for this change is here (
https://issues.apache.org/jira/browse/BEAM-3644), and a candidate patch is
available here (https://github.com/apache/beam/pull/4634). I have been
working over the last month on making this an automatic drop-in replacement
for the current DirectRunner when applicable.  Before it becomes the
default, you can try this runner now by manually specifying
apache_beam.runners.portability.fn_api_runner.FnApiRunner as the runner.

Even with this change, local Python pipeline execution can only effectively
use one core because of the Python GIL.  A natural next step to further
improve performance will be to refactor the FnApiRunner to allow for
multi-process execution.  This is being tracked here (
https://issues.apache.org/jira/browse/BEAM-3645).
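
To illustrate why multi-process execution would help, here is a minimal
standard-library sketch; it is not FnApiRunner code.  CPU-bound word counting
is split across worker processes, each with its own interpreter and GIL, and
the partial counts are merged afterwards.  The shard layout and the function
names count_words and merge_counts are assumptions made for this example.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor


def count_words(lines):
    """Count words in one shard of input lines (runs in a worker process)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def merge_counts(partials):
    """Combine per-shard counters into a single result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


if __name__ == "__main__":
    shards = [
        ["to be or not to be"],
        ["that is the question"],
    ]
    # Each shard is handled by a separate process, so the counting work is
    # not serialized by a single interpreter's GIL.
    with ProcessPoolExecutor(max_workers=2) as pool:
        partials = list(pool.map(count_words, shards))
    print(merge_counts(partials))
```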

Best,

Charles

Re: Replacing Python DirectRunner apply_* hooks with PTransformOverrides

2018-02-02 Thread Charles Chen

Thanks Kenn.  We already do the Runner API roundtripping (I believe Robert
implemented this).  With this change, we would start doing exactly what
you're suggesting, where we apply overrides to a post-deserialization
pipeline.

On Thu, Feb 1, 2018 at 6:45 PM Kenneth Knowles <k...@google.com> wrote:

> +1 for removing apply_*
>
> For the Java SDK, removing specialized intercepts was an important first
> step towards the portability framework. I wonder if there is a way for the
> Python SDK to leapfrog, taking advantage of some of the lessons that Java
> learned a bit more painfully. Most pertinent I think is that if an SDK's
> role is to construct a pipeline and ship the proto to a runner (service)
> then overrides apply to a post-deserialization pipeline. The Java
> DirectRunner does a proto round-trip to avoid accidentally depending on
> things that are not really part of the pipeline. I would think this crisp
> abstraction enforcement would add even more value to Python.
>
> Kenn
>
> On Thu, Feb 1, 2018 at 5:21 PM, Charles Chen <c...@google.com> wrote:
>
>> In the Python DirectRunner, we currently use apply_* overrides to
>> override the operation of the default .expand() operation for certain
>> transforms. For example, GroupByKey has a special implementation in the
>> DirectRunner, so we use an apply_* override hook to replace the
>> implementation of GroupByKey.expand().
>>
>> However, this strategy has drawbacks. Because this override operation
>> happens eagerly during graph construction, the pipeline graph is
>> specialized and modified before a specific runner is bound to the
>> pipeline's execution. This makes the pipeline graph non-portable and blocks
>> full migration to using the Runner API pipeline representation in the
>> DirectRunner.
>>
>> By contrast, the SDK's PTransformOverride mechanism allows the expression
>> of matchers that operate on the unspecialized graph, replacing PTransforms
>> as necessary to produce a DirectRunner-specialized pipeline graph for
>> execution.
>>
>> I therefore propose to replace these eager apply_* overrides with
>> PTransformOverrides that operate on the completely constructed graph.
>>
>> The JIRA issue is https://issues.apache.org/jira/browse/BEAM-3566, and
>> I've prepared a candidate patch at
>> https://github.com/apache/incubator-beam/pull/4529.
>>
>> Best,
>> Charles
>>
>
>

Replacing Python DirectRunner apply_* hooks with PTransformOverrides

2018-02-01 Thread Charles Chen

In the Python DirectRunner, we currently use apply_* overrides to override
the operation of the default .expand() operation for certain transforms.
For example, GroupByKey has a special implementation in the DirectRunner,
so we use an apply_* override hook to replace the implementation of
GroupByKey.expand().

However, this strategy has drawbacks. Because this override operation
happens eagerly during graph construction, the pipeline graph is
specialized and modified before a specific runner is bound to the
pipeline's execution. This makes the pipeline graph non-portable and blocks
full migration to using the Runner API pipeline representation in the
DirectRunner.

By contrast, the SDK's PTransformOverride mechanism allows the expression
of matchers that operate on the unspecialized graph, replacing PTransforms
as necessary to produce a DirectRunner-specialized pipeline graph for
execution.

I therefore propose to replace these eager apply_* overrides with
PTransformOverrides that operate on the completely constructed graph.
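
The idea of a matcher plus a replacement applied to a finished graph can be
sketched with plain Python.  This is a toy model under stated assumptions, not
the SDK's PTransformOverride API: the classes Node and ToyOverride, the
function apply_overrides, and the name DirectGroupByKey are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A toy transform node; `name` identifies the implementation used."""
    name: str
    children: list = field(default_factory=list)


class ToyOverride:
    """Pairs a matcher with a replacement, applied after construction."""

    def __init__(self, match_name, replacement_name):
        self.match_name = match_name
        self.replacement_name = replacement_name

    def matches(self, node):
        return node.name == self.match_name

    def replacement(self, node):
        return Node(self.replacement_name, node.children)


def apply_overrides(node, overrides):
    """Return a runner-specialized copy; the input graph is left untouched."""
    children = [apply_overrides(child, overrides) for child in node.children]
    node = Node(node.name, children)
    for override in overrides:
        if override.matches(node):
            return override.replacement(node)
    return node


def names(node):
    """Flatten the graph into a pre-order list of transform names."""
    return [node.name] + [n for child in node.children for n in names(child)]


portable = Node("Root", [Node("Read"), Node("GroupByKey"), Node("Write")])
specialized = apply_overrides(
    portable, [ToyOverride("GroupByKey", "DirectGroupByKey")])

print(names(portable))
print(names(specialized))
```

Because the replacement happens on a copy of the completed graph, the
original stays unspecialized and could still be handed to a different runner.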

The JIRA issue is https://issues.apache.org/jira/browse/BEAM-3566, and I've
prepared a candidate patch at
https://github.com/apache/incubator-beam/pull/4529.

Best,
Charles

Re: Removing the PValueCache from the Beam Python DirectRunner

2018-01-25 Thread Charles Chen

Yes, that is correct.  The scope of the attached fix is for in-process
runners.  For remote runners, we should think about how to make PCollection
contents available after pipeline execution.  We may also need to better
design eager / interactive execution for that use case, since our current
use of eager mode is geared towards testing transforms locally.

On Thu, Jan 25, 2018 at 4:07 PM Robert Bradshaw <rober...@google.com> wrote:

> Sounds good. I assume there will still need to be runner-specific
> support for any runner that chooses to implement this (e.g. writing to
> remote files then reading them in?)
>
> On Thu, Jan 25, 2018 at 3:25 PM, Charles Chen <c...@google.com> wrote:
> > Currently, the Python SDK supports an eager execution mode.  For example, a
> > list can be directly passed into a PTransform to obtain its result:
> >
> > result = [1, 2, 3] | MyPTransform()
> >
> > To support this use, the Python DirectRunner has an option to cache its
> > intermediate results into a PValueCache.  The above line, when run,
> > implicitly creates an ephemeral pipeline and runs it with the DirectRunner.
> > This, however, adds a lot of complexity to the DirectRunner, and is not
> > generalizable to other in-process Python runners (like the in-process Python
> > FnApiRunner, which runs batch pipelines more efficiently than the current
> > Python DirectRunner).
> >
> > To improve this, I will be removing this DirectRunner-specific
> > implementation and adding functionality that allows all in-process Python
> > runners to be run in eager mode.
> >
> > Jira issue: https://issues.apache.org/jira/browse/BEAM-3537
> > Candidate fix: https://github.com/apache/beam/pull/4492
> >
> > Best,
> > Charles
>

Removing the PValueCache from the Beam Python DirectRunner

2018-01-25 Thread Charles Chen

Currently, the Python SDK supports an eager execution mode.  For example, a
list can be directly passed into a PTransform to obtain its result:

result = [1, 2, 3] | MyPTransform()

To support this use, the Python DirectRunner has an option to cache its
intermediate results into a PValueCache.  The above line, when run,
implicitly creates an ephemeral pipeline and runs it with the
DirectRunner.  This, however, adds a lot of complexity to the DirectRunner,
and is not generalizable to other in-process Python runners (like the
in-process Python FnApiRunner, which runs batch pipelines more efficiently
than the current Python DirectRunner).
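
For readers unfamiliar with how a plain list can appear on the left of "|",
the mechanism is Python's reflected-operator hook __ror__.  The toy classes
below are not Beam code and skip pipeline construction entirely; ToyTransform
and Double are hypothetical names used only to show the dispatch.

```python
class ToyTransform:
    """Stand-in for a PTransform; not part of the Beam API."""

    def expand(self, elements):
        raise NotImplementedError

    def __ror__(self, left):
        # Python calls this for `left | self` when `left` (here a list)
        # does not implement the `|` operator itself.
        # A real SDK would build and run an ephemeral pipeline here;
        # this toy version simply applies the transform in-process.
        return self.expand(list(left))


class Double(ToyTransform):
    def expand(self, elements):
        return [2 * x for x in elements]


result = [1, 2, 3] | Double()
print(result)
```

Keeping the eager path in the transform layer rather than in one runner is
what would let any in-process runner serve this use.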

To improve this, I will be removing this DirectRunner-specific
implementation and adding functionality that allows all in-process Python
runners to be run in eager mode.

Jira issue: https://issues.apache.org/jira/browse/BEAM-3537
Candidate fix: https://github.com/apache/beam/pull/4492

Best,
Charles

Re: Some interesting use case

2018-01-16 Thread Charles Chen

This sounds similar to the use case for tf.Transform, a library that
depends on Beam: https://github.com/tensorflow/transform

On Tue, Jan 16, 2018 at 5:51 PM Ron Gonzalez  wrote:

> Hi,
>   I was wondering if anyone has encountered or used Beam in the following
> manner:
>
>   1. During machine learning training, use Beam to create the event table.
> The flow may consist of some joins, aggregations, row-based
> transformations, etc...
>   2. Once the model is created, deploy the model to some scoring service
> via PMML (or some other scoring service).
>   3. Enable the SAME transformations used in #1 by using a separate engine
> but thereby guaranteeing that it will transform the data identically as the
> engine used in #1.
>
>   I think this is a pretty interesting use case where Beam is used to
> guarantee portability across engines and deployment (batch to true
> streaming, not micro-batch). What's not clear to me is how
> batch joins would translate during one-by-one scoring (probably lookups) or
> how aggregations would work, given that some kind of history would need to be
> stored (and how much is kept is configurable too).
>
>   Thoughts?
>
> Thanks,
> Ron
>

Re: [VOTE] Release 2.2.0, release candidate #3

2017-11-15 Thread Charles Chen

Could you send the command you used that produced this error?  I can't
reproduce it at the tip of the release-2.2.0 branch.

On Wed, Nov 15, 2017 at 5:34 AM Reuven Lax  wrote:

> I'm trying to do the last CP and cut RC4, but I'm getting a compilation
> failure in Python - "ImportError: No module named site"
>
> Did we possibly break the release branch on one of the Python CPs?
>
> Reuven
>
> On Sun, Nov 12, 2017 at 5:12 PM, Jean-Baptiste Onofré 
> wrote:
>
> > Hi Reuven,
> >
> > +1 for RC4, and don't worry: it's part of the process. I prefer to have a
> > long release process than a crappy release ;) That's exactly the purpose
> > of review & vote.
> > I definitely think that having releases more often will reduce this kind
> > of issue.
> >
> > Regards
> > JB
> >
> >
> > On 11/12/2017 09:04 AM, Reuven Lax wrote:
> >
> >> I definitely appreciate the frustration about how long this release is
> >> taking. It's verging on ridiculous at this point, and we need
> >> to fix some of the things that caused us to get to this state (for one
> >> thing our infrastructure was so busted at one point that Valentyn spent 2
> >> weeks trying to get one PR merged into the release branch).
> >>
> >> At this point, let's try and fix this Monday. Unfortunately this is not
> >> the sole issue requiring RC4. Python verification failed as well, and we
> >> need an RC4 regardless to merge those PRs. I'm hoping that RC4 is our final
> >> RC, and we can finish voting next week.
> >>
> >> Reuven
> >>
> >> On Sat, Nov 11, 2017 at 6:24 AM, Romain Manni-Bucau <rmannibu...@gmail.com>
> >> wrote:
> >>
> >> On Nov 11, 2017 at 09:52, "Jean-Baptiste Onofré" wrote:
> >>>
> >>> If the purpose is to release 2.2.1 in one week, why not just do an RC4?
> >>>
> >>> It's not a regression because WriteFiles is new and extends the previous
> >>> FileSource. So it could be considered a severe bug, especially in
> >>> WriteFiles, which is important.
> >>>
> >>>
> >>> Fair enough.
> >>>
> >>>
> >>> The core issue is the time we have already spent on this release: roughly 1
> >>> month!!! It's clearly too long, due to different causes.
> >>> When I did the previous releases, it took 3 or 4 days. That's clearly the
> >>> target since, as said, I would like to have a release pace of a release
> >>> every 6 weeks.
> >>>
> >>>
> >>>
> >>> Agree and this is why 2.2.0 must be out now IMHO. If you are confident
> >>> next week is sufficient just go ahead and ignore my comment but my point
> >>> was the same: it shouldn't last so long if there is no regression :(.
> >>>
> >>>
> >>>
> >>> Regards
> >>> JB
> >>>
> >>>
> >>> On 11/11/2017 08:41 AM, Romain Manni-Bucau wrote:
> >>>
> >>>> You can see it differently: is there a critical bug? Yes! Is there a
> >>>> regression? No!
> >>>>
> >>>> So no need to wait another week (keep in mind 2 days + 3 days of vote
> >>>> easily makes 1 working week). This vote could be closed already and next
> >>>> week 2.2.1 could fix this bug, no? Overall idea is to not hold the
> >>>> community more than needed if there is no regression compared to the last
> >>>> few releases.
> >>>>
> >>>> On Nov 11, 2017 at 07:46, "Jean-Baptiste Onofré" wrote:
> >>>>
> >>>>> -1 (binding)
> >>>>>
> >>>>> I agree with Eugene, data loss is severe.
> >>>>>
> >>>>> As Eugene seems confident to fix that quickly, I think it's worth
> >>>>> cutting an RC4.
> >>>>>
> >>>>> However, I would introduce a deadline. As I would like to propose a
> >>>>> release cycle of a release every 6 weeks (whatever it contains, but it's
> >>>>> really important to keep a regular pace in releases), a release should
> >>>>> be cut in a couple of days. So, maybe we can give ourselves 2 business
> >>>>> days to fix that and propose an RC4. Basically, if this issue is not
> >>>>> fixed by Tuesday night, then we move forward anyway.
> >>>>>
> >>>>> Regards
> >>>>> JB
> >>>>>
> >>>>> On 11/10/2017 07:42 PM, Eugene Kirpichov wrote:
> >>>>>
> >>>>>> Unfortunately I think I found a data loss bug - it was there since
> >>>>>> 2.0.0 but I think it's serious enough that delaying a fix until the
> >>>>>> next release would be irresponsible.
> >>>>>> See https://issues.apache.org/jira/browse/BEAM-3169
> >>>>>>
> >>>>>> On Thu, Nov 9, 2017 at 3:57 PM Robert Bradshaw wrote:
> >>>>>>
> >>>>>>> Our release notes look like nothing more than a query for the closed
> >>>>>>> jira issues. Do we have a top-level summary to highlight the big
> >>>>>>> ticket items in the release? And in particular somewhere to mention
> >>>>>>> that this is likely the last release to support Java 7 that'll get
> >>>>>>> widely read?
> >>>>>>>
> >>>>>>> On Thu, Nov 9, 2017 at 3:39 PM, Reuven Lax
>

Re: [DISCUSS] Move away from Apache Maven as build tool

2017-10-31 Thread Charles Chen

As a contributor to the Beam Python SDK, I noticed that many of the points
above regarding Maven and Gradle pertain mostly to Java SDK development.
For Python development, Maven is much less natural, and we end up just
shelling out to perform builds and tests.  For Python SDK (and upcoming Go
SDK) development, an option to use Bazel would be quite useful.

On Tue, Oct 31, 2017 at 10:42 AM Robert Bradshaw
 wrote:

> +1, Maven is both a build tool and a repository, and the latter is
> essential to keep. Both Gradle and Bazel can interface with this
> repository.
>
> I am, however, very supportive of moving away from Maven to a tool
> that supports correct incremental, hermetic, dependency-driven,
> multi-language, and hopefully fast builds for our own development.
>
> On Tue, Oct 31, 2017 at 10:00 AM, Kenneth Knowles
>  wrote:
> > Echoing what JB and Reuven said, we absolutely must provide maven central
> > artifacts for Java users, just as we provide pypi artifacts for Python
> > users.
> >
> > I see Maven as still a viable tool for single-module Java builds,
> > especially considering its rich plugin ecosystem.
> >
> > On Mon, Oct 30, 2017 at 11:27 PM, Reuven Lax 
> > wrote:
> >
> >> I think that's a very good point. No matter what build system we use for
> >> our own personal development, we still need to release Maven artifacts and
> >> releases as we need to support our users using Maven.
> >>
> >> On Mon, Oct 30, 2017 at 11:26 PM, Jean-Baptiste Onofré
> >> wrote:
> >>
> >> > Generally speaking, it's interesting to evaluate alternatives, especially
> >> > Gradle. My point is also to keep Maven artifacts and "releases" as most of
> >> > our users will use Maven.
> >> > For incremental build, afair, there's some enhancements on Maven but I
> >> > have to take a look.
> >> >
> >> > Regards
> >> > JB
> >> >
> >> > On Oct 31, 2017, 07:22, at 07:22, Eugene Kirpichov
> >> >  wrote:
> >> > >Hi!
> >> > >
> >> > >Many of these points sound valid, but AFAICT Maven doesn't really do
> >> > >incremental builds [1]. The best it can do is, it seems, recompile only
> >> > >changed files, but Java compilation is a tiny part of the overall build.
> >> > >
> >> > >Almost all time is taken by other plugins, such as unit testing or
> >> > >findbugs - and Maven does not seem to currently support features such as
> >> > >"do not rerun unit tests of a module if the code didn't change".
> >> > >
> >> > >The fact that the surefire plugin has existed for >11 years (version 2.0
> >> > >was released in 2006) and still doesn't have this feature makes me think
> >> > >that it's unlikely to be supported in the next few years either.
> >> > >
> >> > >I suspect most PRs affect a very small number of modules, so I think the
> >> > >performance advantage of a build system truly supporting incremental
> >> > >builds may be so overwhelming as to trump many other factors. Of course,
> >> > >we'd need to prototype and have hard numbers in hand to discuss this with
> >> > >more substance.
> >> > >
> >> > >[1]
> >> > >https://stackoverflow.com/questions/8918165/does-maven-support-incremental-builds
> >> > >
> >> > >On Mon, Oct 30, 2017 at 10:57 PM Romain Manni-Bucau
> >> > >wrote:
> >> > >
> >> > >> Hi
> >> > >>
> >> > >> Even if not a committer or even PMC, I'd like to mention a few points
> >> > >> from an external eye:
> >> > >>
> >> > >> - Maven stays the most common build tool and the easiest one for any
> >> > >> user. It means it is the best one to hope for contributions IMHO.
> >> > >> - Maven has incremental support but if there is any blocker the
> >> > >> community is probably ready to enhance it (has been done for compiler
> >> > >> plugin for instance)
> >> > >> - Gradle hides issues easily with its daemon so a build without daemon
> >> > >> is needed
> >> > >> - Gradle doesn't isolate plugins well enough so ensure your planned
> >> > >> plugins don't conflict
> >> > >> - Only Maven is correctly supported in mainstream and OS/free IDE
> >> > >>
> >> > >> These are the reasons why I think Maven is better - not even entering
> >> > >> into the ASF points.
> >> > >>
> >> > >> Now Maven is not perfect but some quick enhancements can be done:
> >> > >>
> >> > >> - A fast build profile can be created
> >> > >> - Takari scheduler can be used to enhance the parallel build
> >> > >> - Scripts can be provided to build a subpart of the project
> >> > >> - A beam extension can surely be done to optimize or compute the
> >> > >> reactors more easily based on module names
> >> > >>
> >> > >> Romain
> >> > >>
> >> > >> On Oct 31, 2017 at 06:42, "Jean-Baptiste Onofré" wrote:
> >> > >>
> >> > >> -0
> >> > >>
> >> > >> For

Re: Problem while upgrading lib

2017-10-03 Thread Charles Chen

Please also use the requirement "pip install apache_beam[gcp]" to pull in
appropriate Google Cloud dependencies, if needed.

On Tue, Oct 3, 2017 at 11:47 AM Ahmet Altay 
wrote:

> google-apitools dependency (which is required for GCS) does not work
> with oauth2client >= 4.0.0 [1]. Because of this Beam Python SDK also does
> not work with oauth2client >= 4.0.0 versions, and this is captured
> correctly in the setup.py [2].
>
> Ahmet
>
> [1]
>
> https://github.com/google/apitools/blob/7aff8d88960b669c9e946c938de5841c5f296f4f/setup.py#L32
> [2]
>
> https://github.com/apache/beam/blob/f9bc76364636b92239510f9e6bd242ea0ea62ac6/sdks/python/setup.py#L104
>
> On Tue, Sep 19, 2017 at 8:40 AM, Morand, Sebastien <
> sebastien.mor...@veolia.com> wrote:
>
> > Hi,
> >
> > No help on this?
> >
> > Regards,
> >
> > *Sébastien MORAND*
> > Team Lead Solution Architect
> > Technology & Operations / Digital Factory
> > Veolia - Group Information Systems & Technology (IS)
> > Cell.: +33 7 52 66 20 81 / Direct: +33 1 85 57 71 08
> > Bureau 0144C (Ouest)
> > 30, rue Madeleine-Vionnet - 93300 Aubervilliers, France
> > *www.veolia.com*
> >
> > On 15 September 2017 at 10:26, Morand, Sebastien <
> > sebastien.mor...@veolia.com> wrote:
> >
> > > Hi,
> > >
> > > I got a problem when I install the oauth2client>=4.0.0 version:
> > >
> > > >>> import apache_beam.io.gcp.gcsio
> > > Traceback (most recent call last):
> > >   File "<stdin>", line 1, in <module>
> > >   File "/home/ubuntu/workspace/tmp/beam_oauth/env/local/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py", line 53, in <module>
> > >     'Google Cloud Storage I/O not supported for this execution environment '
> > > ImportError: Google Cloud Storage I/O not supported for this execution environment (could not import storage API client).
> > > >>>
> > >
> > > Steps to reproduce:
> > > virtualenv env && source env/bin/activate && pip install
> > > 'apache_beam==2.1.0' 'oauth2client>=4.0.0' && echo "import
> > > apache_beam.io.gcp.gcsio"|python2
> > >
> > > What is going on? How can I make apache_beam work with
> > > oauth2client >= 4?
> > >
> > > Thanks in advance,
> > > Regards,
> > >
> > > *Sébastien MORAND*
> > > Team Lead Solution Architect
> > > Technology & Operations / Digital Factory
> > > Veolia - Group Information Systems & Technology (IS)
> > > Cell.: +33 7 52 66 20 81 / Direct: +33 1 85 57 71 08
> > > Bureau 0144C (Ouest)
> > > 30, rue Madeleine-Vionnet - 93300 Aubervilliers, France
> > > *www.veolia.com*
> > >
> >
>

Re: Proposal: Unbreak Beam Python 2.1.0 with 2.1.1 bugfix release

2017-09-19 Thread Charles Chen

There is an incompatibility between the apitools
<https://pypi.python.org/pypi/google-apitools> library and the newest
version of the six library <https://pypi.python.org/pypi/six/1.11.0>, on
which we have a dependency.  The underlying bug in apitools is fixed here
<https://github.com/google/apitools/pull/176>, but we can't benefit because
we have that dependency pinned.  The scope of this fix is to pin the "six"
library to the previous version.
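
A pin of this kind amounts to an upper bound on the dependency's version.  As
a rough illustration only, the check below compares dotted version strings as
integer tuples; the bound 1.11 is inferred from the six release linked above,
and the helper names are hypothetical.  Real installers apply the full
version-specifier rules rather than this simplification.

```python
def parse_version(text):
    """Turn a simple dotted version such as '1.10.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def is_allowed(installed, upper_bound):
    """True if `installed` is strictly below the pinned upper bound."""
    return parse_version(installed) < parse_version(upper_bound)


# six 1.11.0 is the release that exposed the apitools incompatibility,
# so the pin excludes it while still accepting 1.10.0.
print(is_allowed("1.10.0", "1.11"))
print(is_allowed("1.11.0", "1.11"))
```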

On Tue, Sep 19, 2017 at 8:59 PM Ben Chambers <bchamb...@apache.org> wrote:

> Any elaboration or jira issues describing what is broken? Any proposal for
> what changes need to happen to fix it?
>
> On Tue, Sep 19, 2017, 5:49 PM Chamikara Jayalath <chamik...@apache.org>
> wrote:
>
> > +1 for cutting 2.1.1 for Python SDK only.
> >
> > Thanks,
> > Cham
> >
> > On Tue, Sep 19, 2017 at 5:43 PM Robert Bradshaw
> > <rober...@google.com.invalid>
> > wrote:
> >
> > > +1. Right now anyone who follows our quickstart instructions or
> > > otherwise installs the latest release of apache_beam is broken.
> > >
> > > On Tue, Sep 19, 2017 at 2:05 PM, Charles Chen <c...@google.com.invalid>
> > > wrote:
> > > > > The latest version (2.1.0) of Beam Python (
> > > > > https://pypi.python.org/pypi/apache-beam) is broken due to a change in
> > > > > the "six" dependency (BEAM-2964
> > > > > <https://issues.apache.org/jira/browse/BEAM-2964>).  For instance,
> > > > > installing "apache-beam" in a clean environment and running "python -m
> > > > > apache_beam.examples.wordcount" results in a failure.  This issue is
> > > > > fixed at head with Robert's recent PR (
> > > > > https://github.com/apache/beam/pull/3865).
> > > > >
> > > > > I propose to cherry-pick this change on top of the 2.1.0 release branch
> > > > > (to form a new 2.1.1 release branch) and call a vote to release version
> > > > > 2.1.1 only for Beam Python.
> > > > >
> > > > > Alternatively, to preserve version alignment we could also re-release
> > > > > Beam Java 2.1.1 with the same code as 2.1.0 modulo the version bump.
> > > > > Thoughts?
> > > >
> > > > Best,
> > > > Charles
> > >
> >
>

Proposal: Unbreak Beam Python 2.1.0 with 2.1.1 bugfix release

2017-09-19 Thread Charles Chen

The latest version (2.1.0) of Beam Python (
https://pypi.python.org/pypi/apache-beam) is broken due to a change in the
"six" dependency (BEAM-2964
<https://issues.apache.org/jira/browse/BEAM-2964>).  For instance,
installing "apache-beam" in a clean environment and running "python -m
apache_beam.examples.wordcount" results in a failure.  This issue is fixed
at head with Robert's recent PR (https://github.com/apache/beam/pull/3865).

I propose to cherry-pick this change on top of the 2.1.0 release branch (to
form a new 2.1.1 release branch) and call a vote to release version 2.1.1
only for Beam Python.

Alternatively, to preserve version alignment we could also re-release Beam
Java 2.1.1 with the same code as 2.1.0 modulo the version bump.  Thoughts?

Best,
Charles

Streaming support available on Beam Python DirectRunner

2017-07-12 Thread Charles Chen

We recently checked in the last few changes needed to support streaming
pipelines on the Beam Python DirectRunner (BEAM-1265).  As of HEAD (1-2 weeks
ago) and the 2.1.0 RC, Python SDK users can now write their pipelines in
streaming mode and run them locally on their own machine.

Check out the streaming wordcount example here (streaming_wordcount.py)
and please kick the tires, try out the new functionality and report any
bugs you may encounter.  Use the "--streaming" PipelineOption to enable
this new functionality.

Currently, the I/Os supported are the TestStream and
Google Cloud PubSub I/O.  Chamikara is working on implementing
SplittableDoFn as the Python streaming source API so that it will be easy
to write new streaming sources.  Python streaming support for other runners
like Cloud Dataflow and Flink will be provided through the FnAPI (please
contact me if you would be interested in joining the Python Streaming Alpha
for Google Cloud Dataflow).

For reference, here are some of the relevant PRs checked in for this effort:

https://github.com/apache/beam/pull/3318
https://github.com/apache/beam/pull/3362
https://github.com/apache/beam/pull/3370
https://github.com/apache/beam/pull/3405
https://github.com/apache/beam/pull/3409
https://github.com/apache/beam/pull/3440
https://github.com/apache/beam/pull/3444
https://github.com/apache/beam/pull/3499

Best,
Charles
