Re: Design Proposal for python wheels build process

2018-07-16 Thread Ahmet Altay
Thank you Boyuan. This proposal looks good to me.

I agree that it would be great to agree on this before the start of the
next release. The reason is that we built wheel files for the past 2
releases but do not yet have a working process, and releases are slowed
down because of that. If we can address it sooner, we can avoid the same
issue for the next release.

The biggest open question here is creating a new repo for the purpose of
building the wheel files. (1) Is this a concern? (2) If it is not, who
could help us do this - would that be infra?

Ahmet

On Mon, Jul 16, 2018 at 2:47 PM, Boyuan Zhang  wrote:

> Hey all,
>
> After 2.4.0, we also publish python wheels as part of python artifacts. In
> order to make the release process easier, we need a standardized way to
> build and stage python wheels as a part of the release process.
>
> In this design proposal[1], I describe how we built python wheels for
> 2.4.0 & 2.5.0, and propose an approach for future releases. We have
> several action items in this discussion:
> 1. We need someone who has the permission to create a new repo under
> apache, example .
> 2. We need to decide whether we want to create a special account with beam
> committer permissions for the release process.
> 3. We need to choose a more suitable way to store credentials in travis.
>
> If we could make these decisions ASAP, then it's possible for the 2.6.0
> release to build python wheels using this new approach, which may save a
> lot of time in the release process.
>
> Thanks for all your attention and help! Looking forward to getting
> feedback from you.
>
> Boyuan
>
> [1] https://docs.google.com/document/d/1MRVFs48e6g7wORshr2UpuOVD_
> yTSJTbmR65_j8XbGek/edit?usp=sharing
>


Re: BigQueryIO.write and Wait.on

2018-07-16 Thread Eugene Kirpichov
Hi Carlos,

Any updates / roadblocks you hit?

On Tue, Jul 3, 2018 at 7:13 AM Eugene Kirpichov 
wrote:

> Awesome!! Thanks for the heads up, very exciting, this is going to make a
> lot of people happy :)
>
> On Tue, Jul 3, 2018, 3:40 AM Carlos Alonso  wrote:
>
>> + dev@beam.apache.org
>>
>> Just a quick email to let you know that I'm starting developing this.
>>
>> On Fri, Apr 20, 2018 at 10:30 PM Eugene Kirpichov 
>> wrote:
>>
>>> Hi Carlos,
>>>
>>> Thank you for expressing interest in taking this on! Let me give you a
>>> few pointers to start, and I'll be happy to help everywhere along the way.
>>>
>>> Basically we want BigQueryIO.write() to return something (e.g. a
>>> PCollection) that can be used as input to Wait.on().
>>> Currently it returns a WriteResult, which only contains a
>>> PCollection<TableRow> of failed inserts - that one can not be used
>>> directly; instead we should add another component to WriteResult that
>>> represents the result of successfully writing some data.
>>>
>>> Given that BQIO supports dynamic destination writes, I think it makes
>>> sense for that to be a PCollection<KV<DestinationT, ???>> so that in theory
>>> we could sequence different destinations independently (currently Wait.on()
>>> does not provide such a feature, but it could); and it will require
>>> changing WriteResult to be WriteResult<DestinationT>. As for what the "???"
>>> might be - it is something that represents the result of successfully
>>> writing a window of data. I think it can even be Void, or "?" (wildcard
>>> type) for now, until we figure out something better.
>>>
>>> Implementing this would require roughly the following work:
>>> - Add this PCollection<KV<DestinationT, ???>> to WriteResult
>>> - Modify the BatchLoads transform to provide it on both codepaths:
>>> expandTriggered() and expandUntriggered()
>>> ...- expandTriggered() itself writes via 2 codepaths: single-partition
>>> and multi-partition. Both need to be handled - we need to get a
>>> PCollection<KV<DestinationT, ???>> from each of them, and Flatten these two
>>> PCollections together to get the final result. The single-partition
>>> codepath (writeSinglePartition) under the hood already uses WriteTables
>>> that returns a KV so it's directly usable. The
>>> multi-partition codepath ends in WriteRenameTriggered - unfortunately, this
>>> codepath drops DestinationT along the way and will need to be refactored a
>>> bit to keep it until the end.
>>> ...- expandUntriggered() should be treated the same way.
>>> - Modify the StreamingWriteTables transform to provide it
>>> ...- Here also, the challenge is to propagate the DestinationT type all
>>> the way until the end of StreamingWriteTables - it will need to be
>>> refactored. After such a refactoring, returning a KV
>>> should be easy.
>>>
>>> Another challenge with all of this is backwards compatibility in terms
>>> of API and pipeline update.
>>> Pipeline update is much less of a concern for the BatchLoads codepath,
>>> because it's typically used in batch-mode pipelines that don't get updated.
>>> I would recommend to start with this, perhaps even with only the
>>> untriggered codepath (it is much more commonly used) - that will pave the
>>> way for future work.
>>>
>>> Hope this helps, please ask more if something is unclear!
>>>
>>> On Fri, Apr 20, 2018 at 12:48 AM Carlos Alonso 
>>> wrote:
>>>
 Hey Eugene!!

 I’d gladly take a stab at it, although I’m not sure how much available
 time I might have to put into it, but... yeah, let’s try it.

 Where should I begin? Is there a Jira issue or shall I file one?

 Thanks!
 On Thu, 12 Apr 2018 at 00:41, Eugene Kirpichov 
 wrote:

> Hi,
>
> Yes, you're both right - BigQueryIO.write() is currently not
> implemented in a way that it can be used with Wait.on(). It would 
> certainly
> be a welcome contribution to change this - many people expressed interest
> in specifically waiting for BigQuery writes. Is any of you interested in
> helping out?
>
> Thanks.
>
> On Fri, Apr 6, 2018 at 12:36 AM Carlos Alonso 
> wrote:
>
>> Hi Simon, I think your explanation was very accurate, at least to my
>> understanding. I'd also be interested in getting batch load result's
>> feedback on the pipeline... hopefully someone may suggest something,
>> otherwise we could propose submitting a Jira, or even better, a PR!! :)
>>
>> Thanks!
>>
>> On Thu, Apr 5, 2018 at 2:01 PM Simon Kitching <
>> simon.kitch...@unbelievable-machine.com> wrote:
>>
>>> Hi All,
>>>
>>> I need to write some data to BigQuery (batch-mode) and then send a
>>> Pubsub message to trigger further processing.
>>>
>>> I found this thread titled "Callbacks/other functions run after a
>>> PDone/output transform" on the user-list which was very relevant:
>>>
>>> https://lists.apache.org/thread.html/ddcdf93604396b1cbcacdff49aba60817dc90ee7c8434725ea0d26c0@%3Cuser.beam.apache.org%3E
>>>
>>> Thanks to the 

Re: CODEOWNERS for apache/beam repo

2018-07-16 Thread Eugene Kirpichov
We did not, but I think we should. So far, in 100% of the PRs I've
authored, the default functionality of CODEOWNERS did the wrong thing and I
had to fix something up manually.

On Mon, Jul 16, 2018 at 3:42 PM Andrew Pilloud  wrote:

> This sounds like a good plan. Did we want to rename the CODEOWNERS file to
> disable github's mass adding of reviewers while we figure this out?
>
> Andrew
>
> On Mon, Jul 16, 2018 at 10:20 AM Jean-Baptiste Onofré 
> wrote:
>
>> +1
>>
>> On 16 Jul 2018, at 19:17, Holden Karau  wrote:
>>>
>>> Ok if no one objects I'll create the INFRA ticket after OSCON and we can
>>> test it for a week and decide if it helps or hinders.
>>>
>>> On Mon, Jul 16, 2018, 7:12 PM Jean-Baptiste Onofré < j...@nanthrax.net>
>>> wrote:
>>>
 Agree to test it for a week.

 Regards
 JB
 On 16 Jul 2018, at 18:59, Holden Karau < holden.ka...@gmail.com> wrote:
>
> Would folks be OK with me asking infra to turn on blame based
> suggestions for Beam and trying it out for a week?
>
> On Mon, Jul 16, 2018, 6:53 PM Rafael Fernandez < rfern...@google.com>
> wrote:
>
>> +1 using blame -- nifty :)
>>
>> On Mon, Jul 16, 2018 at 2:31 AM Huygaa Batsaikhan < bat...@google.com>
>> wrote:
>>
>>> +1. This is great.
>>>
>>> On Sat, Jul 14, 2018 at 7:44 AM Udi Meiri < eh...@google.com>
>>> wrote:
>>>
 Mention bot looks cool, as it tries to guess the reviewer using
 blame.
 I've written a quick and dirty script that uses only CODEOWNERS.

 Its output looks like:
 $ python suggest_reviewers.py --pr 5940
 INFO:root:Selected reviewer @lukecwik for:
 /runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/PTransformMatchers.java
 (path_pattern: /runners/core-construction-java*)
 INFO:root:Selected reviewer @lukecwik for:
 /runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/SplittableParDoNaiveBounded.java
 (path_pattern: /runners/core-construction-java*)
 INFO:root:Selected reviewer @echauchot for:
 /runners/core-java/src/main/java/org/apache/beam/runners/core/SplittableParDoViaKeyedWorkItems.java
 (path_pattern: /runners/core-java*)
 INFO:root:Selected reviewer @lukecwik for:
 /runners/flink/build.gradle (path_pattern: */build.gradle*)
 INFO:root:Selected reviewer @lukecwik for:
 /runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkTransformOverrides.java
 (path_pattern: *.java)
 INFO:root:Selected reviewer @pabloem for:
 /runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java
 (path_pattern: /runners/google-cloud-dataflow-java*)
 INFO:root:Selected reviewer @lukecwik for:
 /sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/SplittableDoFnTest.java
 (path_pattern: /sdks/java/core*)
 Suggested reviewers: @echauchot, @lukecwik, @pabloem

 Script is in: https://github.com/apache/beam/pull/5951


 What does the community think? Do you prefer blame-based or
 rules-based reviewer suggestions?
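The rules-based matching Udi describes can be sketched in a few lines of Python. This is a hypothetical reconstruction, not the actual script in the PR; the CODEOWNERS entries below are illustrative stand-ins for the real file, and `fnmatch` globbing only approximates GitHub's matching rules:

```python
import fnmatch
import logging

logging.basicConfig(level=logging.INFO)

# Illustrative CODEOWNERS-style rules (pattern -> GitHub handle).
# As in GitHub's CODEOWNERS, the LAST matching pattern takes
# precedence, so generic rules are listed first.
CODEOWNERS = [
    ("*.java", "@lukecwik"),
    ("*/build.gradle*", "@lukecwik"),
    ("/runners/core-construction-java*", "@lukecwik"),
    ("/runners/core-java*", "@echauchot"),
]


def suggest_reviewers(changed_files):
    """Return a sorted list of suggested reviewers for a set of paths."""
    reviewers = set()
    for path in changed_files:
        owner = matched = None
        for pattern, handle in CODEOWNERS:
            if fnmatch.fnmatch(path, pattern):
                owner, matched = handle, pattern  # last match wins
        if owner:
            logging.info("Selected reviewer %s for: %s (path_pattern: %s)",
                         owner, path, matched)
            reviewers.add(owner)
    return sorted(reviewers)


print(suggest_reviewers([
    "/runners/core-java/src/main/java/org/apache/beam/runners/core/SplittableParDoViaKeyedWorkItems.java",
    "/runners/flink/build.gradle",
]))
```

With these toy rules, the core-java file resolves to @echauchot (the more specific rule comes last) and the gradle file to @lukecwik, mirroring the precedence behavior of the real CODEOWNERS mechanism.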

 On Fri, Jul 13, 2018 at 11:13 AM Holden Karau <
 hol...@pigscanfly.ca> wrote:

> I'm looking at something similar in the Spark project, and while
> it's now archived by FB it seems like something like
> https://github.com/facebookarchive/mention-bot might do what we
> want. I'm going to spin up a version on my K8 cluster and see if I 
> can ask
> infra to add a webhook and if it works for Spark we could ask INFRA 
> to add
> a second webhook for Beam. (Or if the Beam folks are more interested 
> in
> experimenting I can do Beam first as a smaller project and roll with 
> that).
>
> Let me know :)
>
> On Fri, Jul 13, 2018 at 10:53 AM, Eugene Kirpichov <
> kirpic...@google.com> wrote:
>
>> Sounds reasonable for now, thanks!
>> It's unfortunate that Github's CODEOWNERS feature appears to be
>> effectively unusable for Beam but I'd hope that Github might pay 
>> attention
>> and fix things if we submit feedback, with us being one of the most 
>> active
>> Apache projects - did anyone do this yet / planning to?
>>
>> On Fri, Jul 13, 2018 at 10:23 AM Udi Meiri < eh...@google.com>
>> wrote:
>>
>>> While I like the idea of having a CODEOWNERS file, the Github
>>> implementation is lacking:
>>> 1. Reviewers are automatically assigned at each push.
>>> 2. Reviewer assignment can be excessive (e.g. 5 reviewers in
>>> Eugene's PR 5940).
>> 3. Non-committers aren't assigned as reviewers.

Re: An update on Eugene

2018-07-16 Thread Thomas Weise
Eugene,

Thanks for all your contributions to the project and especially the
leadership in the IOs and SDF area.

Congrats and best wishes for your new opportunity.

Thomas



On Mon, Jul 16, 2018 at 4:00 PM Ismaël Mejía  wrote:

> I am sad to read this, but at the same happy for you and your future
> adventures Eugene.
>
> Thanks a lot for all the work you have done in this project, all the
> work on SDF, the improvements on composability and the File-based IOs,
> and of course for your reviews that really helped improve the quality
> in many areas of the project as well as other areas that I probably
> forget. In general it has been really nice to see the way you grew in
> the open source side of the project too, and your presence will be
> definitely missed.
>
> Best wishes for the work on the new programming model and hope to hear
> back from you in the future.
>


Re: An update on Eugene

2018-07-16 Thread Ismaël Mejía
I am sad to read this, but at the same happy for you and your future
adventures Eugene.

Thanks a lot for all the work you have done in this project, all the
work on SDF, the improvements on composability and the File-based IOs,
and of course for your reviews that really helped improve the quality
in many areas of the project as well as other areas that I probably
forget. In general it has been really nice to see the way you grew in
the open source side of the project too, and your presence will be
definitely missed.

Best wishes for the work on the new programming model and hope to hear
back from you in the future.


Re: CODEOWNERS for apache/beam repo

2018-07-16 Thread Andrew Pilloud
This sounds like a good plan. Did we want to rename the CODEOWNERS file to
disable github's mass adding of reviewers while we figure this out?

Andrew

On Mon, Jul 16, 2018 at 10:20 AM Jean-Baptiste Onofré 
wrote:

> +1
>
> On 16 Jul 2018, at 19:17, Holden Karau  wrote:
>>
>> Ok if no one objects I'll create the INFRA ticket after OSCON and we can
>> test it for a week and decide if it helps or hinders.
>>
>> On Mon, Jul 16, 2018, 7:12 PM Jean-Baptiste Onofré < j...@nanthrax.net>
>> wrote:
>>
>>> Agree to test it for a week.
>>>
>>> Regards
>>> JB
>>> On 16 Jul 2018, at 18:59, Holden Karau < holden.ka...@gmail.com> wrote:

 Would folks be OK with me asking infra to turn on blame based
 suggestions for Beam and trying it out for a week?

 On Mon, Jul 16, 2018, 6:53 PM Rafael Fernandez < rfern...@google.com>
 wrote:

> +1 using blame -- nifty :)
>
> On Mon, Jul 16, 2018 at 2:31 AM Huygaa Batsaikhan < bat...@google.com>
> wrote:
>
>> +1. This is great.
>>
>> On Sat, Jul 14, 2018 at 7:44 AM Udi Meiri < eh...@google.com> wrote:
>>
>>> Mention bot looks cool, as it tries to guess the reviewer using
>>> blame.
>>> I've written a quick and dirty script that uses only CODEOWNERS.
>>>
>>> Its output looks like:
>>> $ python suggest_reviewers.py --pr 5940
>>> INFO:root:Selected reviewer @lukecwik for:
>>> /runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/PTransformMatchers.java
>>> (path_pattern: /runners/core-construction-java*)
>>> INFO:root:Selected reviewer @lukecwik for:
>>> /runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/SplittableParDoNaiveBounded.java
>>> (path_pattern: /runners/core-construction-java*)
>>> INFO:root:Selected reviewer @echauchot for:
>>> /runners/core-java/src/main/java/org/apache/beam/runners/core/SplittableParDoViaKeyedWorkItems.java
>>> (path_pattern: /runners/core-java*)
>>> INFO:root:Selected reviewer @lukecwik for:
>>> /runners/flink/build.gradle (path_pattern: */build.gradle*)
>>> INFO:root:Selected reviewer @lukecwik for:
>>> /runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkTransformOverrides.java
>>> (path_pattern: *.java)
>>> INFO:root:Selected reviewer @pabloem for:
>>> /runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java
>>> (path_pattern: /runners/google-cloud-dataflow-java*)
>>> INFO:root:Selected reviewer @lukecwik for:
>>> /sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/SplittableDoFnTest.java
>>> (path_pattern: /sdks/java/core*)
>>> Suggested reviewers: @echauchot, @lukecwik, @pabloem
>>>
>>> Script is in: https://github.com/apache/beam/pull/5951
>>>
>>>
>>> What does the community think? Do you prefer blame-based or
>>> rules-based reviewer suggestions?
>>>
>>> On Fri, Jul 13, 2018 at 11:13 AM Holden Karau < hol...@pigscanfly.ca>
>>> wrote:
>>>
 I'm looking at something similar in the Spark project, and while
 it's now archived by FB it seems like something like
 https://github.com/facebookarchive/mention-bot might do what we
 want. I'm going to spin up a version on my K8 cluster and see if I can 
 ask
 infra to add a webhook and if it works for Spark we could ask INFRA to 
 add
 a second webhook for Beam. (Or if the Beam folks are more interested in
 experimenting I can do Beam first as a smaller project and roll with 
 that).

 Let me know :)

 On Fri, Jul 13, 2018 at 10:53 AM, Eugene Kirpichov <
 kirpic...@google.com> wrote:

> Sounds reasonable for now, thanks!
> It's unfortunate that Github's CODEOWNERS feature appears to be
> effectively unusable for Beam but I'd hope that Github might pay 
> attention
> and fix things if we submit feedback, with us being one of the most 
> active
> Apache projects - did anyone do this yet / planning to?
>
> On Fri, Jul 13, 2018 at 10:23 AM Udi Meiri < eh...@google.com>
> wrote:
>
>> While I like the idea of having a CODEOWNERS file, the Github
>> implementation is lacking:
>> 1. Reviewers are automatically assigned at each push.
>> 2. Reviewer assignment can be excessive (e.g. 5 reviewers in
>> Eugene's PR 5940).
>> 3. Non-committers aren't assigned as reviewers.
>> 4. Non-committers can't change the list of reviewers.
>>
>> I propose renaming the file to disable the auto-reviewer
>> assignment feature.
>> In its place I'll add a script that suggests reviewers.
>>
>> On Fri, Jul 13, 2018 at 9:09 AM Udi Meiri < eh...@google.com>
>> wrote:

Re: Beam Dependency Ownership

2018-07-16 Thread Yifan Zou
Thanks all for taking ownership of Beam dependencies! This is an important
step to keep our dependencies healthy and up-to-date.
We will close this thread now. The next step would be integrating the
ownership information into the Beam codebase and implementing a tool to
create and manage the JIRA tickets of Beam deps in order to track the
upgrade process.

Thank you.

Regards.
Yifan

On Mon, Jul 9, 2018 at 10:24 AM Yifan Zou  wrote:

> If you haven't already, please take a look at the Beam SDK Dependency
> Ownership
> and
> sign up with any dependencies that you are familiar with. In case anyone
> missed it, there is a second tab for the Python SDK.
>
> Thanks.
>
> Yifan
>
> On Thu, Jun 28, 2018 at 6:37 AM Tim Robertson 
> wrote:
>
>> Thanks for this Yifan,
>> I've added my name to all Hadoop-related dependencies, Solr, and
>> Elasticsearch.
>>
>>
>>
>> On Thu, Jun 28, 2018 at 3:28 PM, Etienne Chauchot 
>> wrote:
>>
>>> I've added myself and @Tim Robertson on elasticsearchIO related deps.
>>>
>>> Etienne
>>>
>>> Le mercredi 27 juin 2018 à 14:05 -0700, Chamikara Jayalath a écrit :
>>>
>>> It's mentioned under "Dependency declarations may identify owners that
>>> are responsible for upgrading respective dependencies". Feel free to update
>>> if you think more details should be added to it. I think it'll be easier if
>>> we transfer data in spreadsheet to comments close to dependency
>>> declarations instead of maintaining the spreadsheet (after we collect the
>>> data). Otherwise we'll have to put an extra effort to make sure that the
>>> spreadsheet, BeamModulePlugin, and Python setup.py are in sync. We can
>>> decide on the exact format of the comment to make sure that automated tool
>>> can easily parse the comment.
>>>
>>> - Cham
>>>
>>> On Wed, Jun 27, 2018 at 1:45 PM Yifan Zou  wrote:
>>>
>>> Thanks Scott, I will supplement the missing packages to the spreadsheet.
>>> And we expect this to be kept up to date as the Beam project grows.
>>> Shall we mention this in the Dependency Guide page
>>> , @Chamikara Jayalath
>>> ?
>>>
>>> On Wed, Jun 27, 2018 at 11:17 AM Scott Wegner  wrote:
>>>
>>> Thanks for kicking off this process Yifan-- I'll add my name to some
>>> dependencies I'm familiar with.
>>>
>>> Do you expect this to be a one-time process, or will we maintain the
>>> owners over time? If we will maintain this list, it would be easier to keep
>>> it up-to-date if it was closer to the code. i.e. perhaps each dependency
>>> registration in the Gradle BeamModulePlugin [1] should include a list of
>>> owners.
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/master/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L325
>>>
>>> On Wed, Jun 27, 2018 at 8:52 AM Yifan Zou  wrote:
>>>
>>> Hi all,
>>>
>>> We now have the automated detections for Beam dependency updates and
>>> sending a weekly report to dev mailing list. In order to address the
>>> updates in time, we want to find owners for all dependencies of Beam, and
>>> finally, Jira bugs will be automatically created and assigned to the owners
>>> if actions need to be taken. We also welcome non-owners to upgrade
>>> dependency packages, but only owners will receive the Jira tickets.
>>>
>>> Please review the spreadsheet Beam SDK Dependency Ownership
>>> 
>>>  and
>>> sign up if you are familiar with any Beam dependencies and are willing
>>> to take charge of them. It is definitely fine for a single package to
>>> have multiple owners. The more owners we have, the more help we will get
>>> to keep Beam dependencies in a healthy state.
>>>
>>> Thank you :)
>>>
>>> Regards.
>>> Yifan
>>>
>>>
>>> https://docs.google.com/spreadsheets/d/12NN3vPqFTBQtXBc0fg4sFIb9c_mgst0IDePB_0Ui8kE/edit?ts=5b32bec1#gid=0
>>>
>>>
>>


Re: Let's start getting rid of BoundedSource

2018-07-16 Thread Eugene Kirpichov
Hey all,

The PR https://github.com/apache/beam/pull/5940 was merged, and now all
runners at "master" support bounded-per-element SDFs!
Thanks +Ismaël Mejía  for the reviews.
I have updated the Capability Matrix as well:
https://beam.apache.org/documentation/runners/capability-matrix/


On Mon, Jul 16, 2018 at 7:56 AM Jean-Baptiste Onofré 
wrote:

> Hi guys,
>
> I think the purpose of SDF is to simplify writing BoundedSource-like IOs.
>
> I agree that extended @SplitRestriction is a good approach.
>
> Regards
> JB
>
> On 16/07/2018 16:52, Eugene Kirpichov wrote:
> > Hi Etienne - thanks for catching this; indeed, I somehow missed that
> > actually several runners do this same thing - it seemed to me as
> > something that can be done in user code (because it involves combining
> > estimated size + split in pretty much the same way), but I'm not so
> > sure: even though many runners have a "desired parallelism" option or
> > alike, it's not all of them, so we can't use such an option universally.
> >
> > Maybe then the right thing to do is to:
> > - Use bounded SDFs for these
> > - Change SDF @SplitRestriction API to take a desired number of splits as
> > a parameter, and introduce an API @EstimateOutputSizeBytes(element)
> > valid only on bounded SDFs
> > - Add some plumbing to the standard bounded SDF expansion so that
> > different runners can compute that parameter differently, the two
> > standard ways being "split into given number of splits" or "split based
> > on the sub-linear formula of estimated size".
> >
> > I think this would work, though this is somewhat more work than I
> > anticipated. Any alternative ideas?
> >
> > On Mon, Jul 16, 2018 at 3:07 AM Etienne Chauchot  > > wrote:
> >
> > Hi,
> > thanks Eugene for analyzing and sharing that.
> > I have one comment inline
> >
> > Etienne
> >
> > Le dimanche 15 juillet 2018 à 14:20 -0700, Eugene Kirpichov a écrit :
> >> Hey beamers,
> >>
> >> I've always wondered whether the BoundedSource implementations in
> >> the Beam SDK are worth their complexity, or whether they rather
> >> could be converted to the much easier to code ParDo style, which
> >> is also more modular and allows you to very easily implement
> >> readAll().
> >>
> >> There's a handful: file-based sources, BigQuery, Bigtable, HBase,
> >> Elasticsearch, MongoDB, Solr and a couple more.
> >>
> >> Curiously enough, BoundedSource vs. ParDo matters *only* on
> >> Dataflow, because AFAICT Dataflow is the only runner that cares
> >> about the things that BoundedSource can do and ParDo can't:
> >> - size estimation (used to choose an initial number of workers)
> >> [ok, Flink calls the function to return statistics, but doesn't
> >> seem to do anything else with it]
> > => Spark uses size estimation to set desired bundle size with
> > something like desiredBundleSize = estimatedSize /
> > nbOfWorkersConfigured (partitions)
> > See
> >
> https://github.com/apache/beam/blob/a5634128d194161aebc8d03229fdaa1066cf7739/runners/spark/src/main/java/org/apache/beam/runners/spark/io/SourceRDD.java#L101
> >
> >
> >> - splitting into bundles of given size (Dataflow chooses the
> >> number of bundles to create based on a simple formula that's not
> >> entirely unlike K*sqrt(size))
> >> - liquid sharding (splitAtFraction())
> >>
> >> If Dataflow didn't exist, there'd be no reason at all to use
> >> BoundedSource. So the question "which ones can be converted to
> >> ParDo" is really "which ones are used on Dataflow in ways that
> >> make these functions matter". Previously, my conservative
> >> assumption was that the answer is "all of them", but turns out
> >> this is not so.
> >>
> >> Liquid sharding always matters; if the source is liquid-shardable,
> >> for now we have to keep it a source (until SDF gains liquid
> >> sharding - which should happen in a quarter or two I think).
> >>
> >> Choosing number of bundles to split into is easily done in SDK
> >> code, see https://github.com/apache/beam/pull/5886 for example;
> >> DatastoreIO does something similar.
> >>
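The two bundle-sizing heuristics discussed in this thread - Spark's "estimated size divided by configured parallelism" and Dataflow's sub-linear formula "not entirely unlike K*sqrt(size)" - can be contrasted with a small sketch. The constant K and the worker count below are made-up illustrative values, not what either runner actually uses:

```python
import math


def spark_style_bundles(estimated_size_bytes, num_workers):
    """Spark-runner style: desiredBundleSize = estimatedSize / numWorkers,
    so the bundle count tracks the configured parallelism."""
    desired_bundle_size = max(1, estimated_size_bytes // num_workers)
    return math.ceil(estimated_size_bytes / desired_bundle_size)


def dataflow_style_bundles(estimated_size_bytes, k=0.01):
    """Sub-linear heuristic in the spirit of K*sqrt(size): the bundle
    count grows with the square root of the input, not linearly."""
    return max(1, int(k * math.sqrt(estimated_size_bytes)))


for size_gb in (1, 100):
    size = size_gb * 10**9
    print(size_gb, "GB ->",
          spark_style_bundles(size, num_workers=10), "vs",
          dataflow_style_bundles(size), "bundles")
```

The point of the comparison: the Spark-style count stays fixed at the worker count as input grows, while the square-root heuristic scales the split count with data size - which is why a runner-specific parameter to @SplitRestriction, as proposed above, would be needed.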
> >> The remaining thing to analyze is, when does initial scaling
> >> matter. So as a member of the Dataflow team, I analyzed statistics
> >> of production Dataflow jobs in the past month. I can not share my
> >> queries nor the data, because they are proprietary to Google - so
> >> I am sharing just the general methodology and conclusions, because
> >> they matter to the Beam community. I looked at a few criteria,
> >> such as:
> >> - The job should be not too short and not too long: if it's too
> >> short then scaling couldn't have kicked in much at all; if it's
> >> too long then dynamic autoscaling would have been sufficient.
> >> - The job should use, at peak, at least a handful of workers
> >> (otherwise 

Design Proposal for python wheels build process

2018-07-16 Thread Boyuan Zhang
Hey all,

After 2.4.0, we also publish python wheels as part of python artifacts. In
order to make the release process easier, we need a standardized way to
build and stage python wheels as a part of the release process.

In this design proposal[1], I describe how we built python wheels for 2.4.0
& 2.5.0, and propose an approach for future releases. We have several
action items in this discussion:
1. We need someone who has the permission to create a new repo under
apache, example .
2. We need to decide whether we want to create a special account with beam
committer permissions for the release process.
3. We need to choose a more suitable way to store credentials in travis.

If we could make these decisions ASAP, then it's possible for the 2.6.0
release to build python wheels using this new approach, which may save a
lot of time in the release process.

Thanks for all your attention and help! Looking forward to getting
feedback from you.

Boyuan

[1]
https://docs.google.com/document/d/1MRVFs48e6g7wORshr2UpuOVD_yTSJTbmR65_j8XbGek/edit?usp=sharing
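For action item 3, one common pattern is to keep credentials out of the repository entirely and inject them as encrypted Travis environment variables. The following .travis.yml fragment is a hypothetical sketch only - the tool choice (cibuildwheel), the variable names, and the upload step are assumptions for illustration, not the proposal's actual configuration:

```yaml
# Hypothetical sketch - names and steps are illustrative, not the
# actual beam-wheels configuration.
language: python
matrix:
  include:
    - os: linux
      python: "2.7"
    - os: osx
      language: generic
install:
  - pip install cibuildwheel twine
script:
  # Build manylinux / macOS wheels into wheelhouse/.
  - cibuildwheel --output-dir wheelhouse
deploy:
  provider: script
  # PYPI_USER / PYPI_PASSWORD would be set as encrypted Travis
  # environment variables in the repo settings, never committed.
  script: twine upload wheelhouse/*
  on:
    tags: true
```

Encrypted environment variables (or Travis's `secure:` encrypted values) avoid both committing credentials and sharing a committer account's password in plain text, which bears on action item 2 as well.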


Git Bisect broken on the beam repo after the beam-site merge

2018-07-16 Thread Andrew Pilloud
It appears that as of https://github.com/apache/beam/pull/5641 we have two
root commits on the beam repo. This is breaking git bisect, as it expects
the repo to only have one initial commit. (git bisect is always following the
doc tree to the root instead of going down the source tree.) I'm going to
split up my bisect to before and after this merge to find the root cause of
the bug I'm tracking down, but I wonder if anyone knows an easier way to do
this? Is there something I'm doing wrong?

Andrew
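A quick way to confirm the two-root situation, and the usual workaround of constraining bisect to one history, can be sketched as follows. The toy repo below reconstructs the situation with empty commits; the bad/good hashes and path arguments in the final comment are placeholders:

```shell
# Build a toy repo with two root commits, like beam after the
# beam-site merge.
set -e
dir=$(mktemp -d)
cd "$dir"
git init -q .
main=$(git symbolic-ref --short HEAD)
git -c user.email=a@b -c user.name=t commit -q --allow-empty -m "code root"
git checkout -q --orphan site
git -c user.email=a@b -c user.name=t commit -q --allow-empty -m "site root"
git checkout -q "$main"
git -c user.email=a@b -c user.name=t merge -q \
    --allow-unrelated-histories -m "merge site" site

# A healthy single-history repo prints one hash here; this prints two,
# which is what sends bisect down the wrong ancestry.
git rev-list --max-parents=0 HEAD

# Workaround: restrict bisect to commits touching the source tree, e.g.
#   git bisect start <bad> <good> -- sdks/ runners/
# so commits that only touch the merged-in docs history are skipped.
```

Path-limiting `git bisect start` this way keeps the walk on commits that actually modified the code under suspicion, which sidesteps the second root without rewriting history.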


Re: An update on Eugene

2018-07-16 Thread Jesse Anderson
Thanks for all your work!

On Mon, Jul 16, 2018, 9:17 PM Eugene Kirpichov  wrote:

> Hi beamers,
>
> After 5.5 years working on data processing systems at Google, several of
> these years working on Dataflow and Beam, I am moving on to do something
> new (also at Google) in the area of programming models for machine
> learning. Anybody who worked with me closely knows how much I love building
> programming models, so I could not pass up on the opportunity to build a
> new one - I expect to have a lot of fun there!
>
> On the new team we very much plan to make things open-source when the time
> is right, and make use of Beam, just as TensorFlow does - so I will stay in
> touch with the community, and I expect that we will still work together on
> some things. However, Beam will no longer be the main focus of my work.
>
> I made the decision a couple of months ago and have spent the time since
> then getting things into a good state and handing over the community
> efforts in which I have played a particularly active role - they are in
> very capable hands:
> - Robert Bradshaw and Ankur Goenka on Google side are taking charge of
> Portable Runners (e.g. the Portable Flink runner).
> - Luke Cwik will be in charge of the future of Splittable DoFn. Ismael
> Mejia has also been involved in the effort and actively helping, and I
> believe he continues to do so.
> - The Beam IO ecosystem in general is in very good shape (perhaps the best
> in the industry) and does not need a lot of constant direction; and it has
> a great community (thanks JB, Ismael, Etienne and many others!) - however,
> on Google side, Chamikara Jayalath will take it over.
>
> It was a great pleasure working with you all. My last day formally on Beam
> will be this coming Friday, then I'll take a couple weeks of vacation and
> jump right in on the new team.
>
> Of course, if my involvement in something is necessary, I'm still
> available on all the same channels as always (email, Slack, Hangouts) -
> but, in general, please contact the folks mentioned above instead of me
> about the respective matters from now on.
>
> Thanks!
>


An update on Eugene

2018-07-16 Thread Eugene Kirpichov
Hi beamers,

After 5.5 years working on data processing systems at Google, several of
these years working on Dataflow and Beam, I am moving on to do something
new (also at Google) in the area of programming models for machine
learning. Anybody who worked with me closely knows how much I love building
programming models, so I could not pass up on the opportunity to build a
new one - I expect to have a lot of fun there!

On the new team we very much plan to make things open-source when the time
is right, and make use of Beam, just as TensorFlow does - so I will stay in
touch with the community, and I expect that we will still work together on
some things. However, Beam will no longer be the main focus of my work.

I made the decision a couple of months ago and have spent the time since
then getting things into a good state and handing over the community
efforts in which I have played a particularly active role - they are in
very capable hands:
- Robert Bradshaw and Ankur Goenka on Google side are taking charge of
Portable Runners (e.g. the Portable Flink runner).
- Luke Cwik will be in charge of the future of Splittable DoFn. Ismael
Mejia has also been involved in the effort and actively helping, and I
believe he continues to do so.
- The Beam IO ecosystem in general is in very good shape (perhaps the best
in the industry) and does not need a lot of constant direction; and it has
a great community (thanks JB, Ismael, Etienne and many others!) - however,
on Google side, Chamikara Jayalath will take it over.

It was a great pleasure working with you all. My last day formally on Beam
will be this coming Friday, then I'll take a couple weeks of vacation and
jump right in on the new team.

Of course, if my involvement in something is necessary, I'm still available
on all the same channels as always (email, Slack, Hangouts) - but, in
general, please contact the folks mentioned above instead of me about the
respective matters from now on.

Thanks!


Re: CODEOWNERS for apache/beam repo

2018-07-16 Thread Jean-Baptiste Onofré
+1



On 16 Jul 2018 at 19:17, Holden Karau wrote:
>
> Ok if no one objects I'll create the INFRA ticket after OSCON and we
> can test it for a week and decide if it helps or hinders.

Re: CODEOWNERS for apache/beam repo

2018-07-16 Thread Holden Karau
Ok if no one objects I'll create the INFRA ticket after OSCON and we can
test it for a week and decide if it helps or hinders.

On Mon, Jul 16, 2018, 7:12 PM Jean-Baptiste Onofré  wrote:

> Agree to test it for a week.
>
> Regards
> JB
> On 16 Jul 2018 at 18:59, Holden Karau wrote:
>>
>> Would folks be OK with me asking infra to turn on blame based suggestions
>> for Beam and trying it out for a week?
>>
>> On Mon, Jul 16, 2018, 6:53 PM Rafael Fernandez < rfern...@google.com>
>> wrote:
>>
>>> +1 using blame -- nifty :)
>>>
>>> On Mon, Jul 16, 2018 at 2:31 AM Huygaa Batsaikhan < bat...@google.com>
>>> wrote:
>>>
 +1. This is great.

 On Sat, Jul 14, 2018 at 7:44 AM Udi Meiri < eh...@google.com> wrote:

> Mention bot looks cool, as it tries to guess the reviewer using blame.
> I've written a quick and dirty script that uses only CODEOWNERS.
>
> Its output looks like:
> $ python suggest_reviewers.py --pr 5940
> INFO:root:Selected reviewer @lukecwik for:
> /runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/PTransformMatchers.java
> (path_pattern: /runners/core-construction-java*)
> INFO:root:Selected reviewer @lukecwik for:
> /runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/SplittableParDoNaiveBounded.java
> (path_pattern: /runners/core-construction-java*)
> INFO:root:Selected reviewer @echauchot for:
> /runners/core-java/src/main/java/org/apache/beam/runners/core/SplittableParDoViaKeyedWorkItems.java
> (path_pattern: /runners/core-java*)
> INFO:root:Selected reviewer @lukecwik for: /runners/flink/build.gradle
> (path_pattern: */build.gradle*)
> INFO:root:Selected reviewer @lukecwik for:
> /runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkTransformOverrides.java
> (path_pattern: *.java)
> INFO:root:Selected reviewer @pabloem for:
> /runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java
> (path_pattern: /runners/google-cloud-dataflow-java*)
> INFO:root:Selected reviewer @lukecwik for:
> /sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/SplittableDoFnTest.java
> (path_pattern: /sdks/java/core*)
> Suggested reviewers: @echauchot, @lukecwik, @pabloem
>
> Script is in: https://github.com/apache/beam/pull/5951
>
>
> What does the community think? Do you prefer blame-based or
> rules-based reviewer suggestions?
>
> On Fri, Jul 13, 2018 at 11:13 AM Holden Karau < hol...@pigscanfly.ca>
> wrote:
>
>> I'm looking at something similar in the Spark project, and while it's
>> now archived by FB it seems like something like
>> https://github.com/facebookarchive/mention-bot might do what we
>> want. I'm going to spin up a version on my K8 cluster and see if I can 
>> ask
>> infra to add a webhook and if it works for Spark we could ask INFRA to 
>> add
>> a second webhook for Beam. (Or if the Beam folks are more interested in
>> experimenting I can do Beam first as a smaller project and roll with 
>> that).
>>
>> Let me know :)
>>
>> On Fri, Jul 13, 2018 at 10:53 AM, Eugene Kirpichov <
>> kirpic...@google.com> wrote:
>>
>>> Sounds reasonable for now, thanks!
>>> It's unfortunate that Github's CODEOWNERS feature appears to be
>>> effectively unusable for Beam but I'd hope that Github might pay 
>>> attention
>>> and fix things if we submit feedback, with us being one of the most 
>>> active
>>> Apache projects - did anyone do this yet / planning to?
>>>
>>> On Fri, Jul 13, 2018 at 10:23 AM Udi Meiri < eh...@google.com>
>>> wrote:
>>>
 While I like the idea of having a CODEOWNERS file, the Github
 implementation is lacking:
 1. Reviewers are automatically assigned at each push.
 2. Reviewer assignment can be excessive (e.g. 5 reviewers in
 Eugene's PR 5940).
 3. Non-committers aren't assigned as reviewers.
 4. Non-committers can't change the list of reviewers.

 I propose renaming the file to disable the auto-reviewer assignment
 feature.
 In its place I'll add a script that suggests reviewers.

 On Fri, Jul 13, 2018 at 9:09 AM Udi Meiri < eh...@google.com>
 wrote:

> Hi Etienne,
>
> Yes you could be as precise as you want. The paths I listed are
> just suggestions. :)
>
>
> On Fri, Jul 13, 2018 at 1:12 AM Jean-Baptiste Onofré <
> j...@nanthrax.net> wrote:
>
>> Hi,
>>
>> I think it's already do-able just providing the expected path.
>>
>> It's a good idea especially for the core.
>>
>> Regards
>> JB
>>
>> On 13/07/2018 09:51, Etienne Chauchot 

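For reference, the kind of pattern matching such a CODEOWNERS-driven script relies on can be sketched in a few lines of Python. The OWNERS mapping and file paths below are hypothetical examples, not the actual suggest_reviewers.py code:

```python
from fnmatch import fnmatch

# Hypothetical CODEOWNERS-style mapping. A leading "/" anchors a pattern
# at the repository root; fnmatch's "*" also matches path separators.
OWNERS = {
    "/runners/core-construction-java*": ["@lukecwik"],
    "/runners/core-java*": ["@echauchot"],
    "*/build.gradle*": ["@lukecwik"],
    "*.java": ["@lukecwik"],
}

def suggest_reviewers(changed_files):
    """Return every owner whose pattern matches at least one changed file."""
    reviewers = set()
    for path in changed_files:
        for pattern, owners in OWNERS.items():
            if fnmatch(path, pattern):
                reviewers.update(owners)
    return sorted(reviewers)

print(suggest_reviewers([
    "/runners/core-java/src/main/java/org/apache/beam/runners/core/metrics/Foo.java",
    "/runners/flink/build.gradle",
]))  # ['@echauchot', '@lukecwik']
```

Since every entry is just another glob, a more precise pattern such as the runners/core-java path down to core/metrics that Etienne asked about would simply be one more line in the map.
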
Re: CODEOWNERS for apache/beam repo

2018-07-16 Thread Jean-Baptiste Onofré
Agree to test it for a week.

Regards
JB

On 16 Jul 2018 at 18:59, Holden Karau wrote:
>Would folks be OK with me asking infra to turn on blame based
>suggestions
>for Beam and trying it out for a week?
>
>On Mon, Jul 16, 2018, 6:53 PM Rafael Fernandez 
>wrote:
>
>> +1 using blame -- nifty :)
>>
>> On Mon, Jul 16, 2018 at 2:31 AM Huygaa Batsaikhan 
>> wrote:
>>
>>> +1. This is great.
>>>
>>> On Sat, Jul 14, 2018 at 7:44 AM Udi Meiri  wrote:
>>>
 Mention bot looks cool, as it tries to guess the reviewer using
>blame.
 I've written a quick and dirty script that uses only CODEOWNERS.

 Its output looks like:
 $ python suggest_reviewers.py --pr 5940
 INFO:root:Selected reviewer @lukecwik for:

>/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/PTransformMatchers.java
 (path_pattern: /runners/core-construction-java*)
 INFO:root:Selected reviewer @lukecwik for:

>/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/SplittableParDoNaiveBounded.java
 (path_pattern: /runners/core-construction-java*)
 INFO:root:Selected reviewer @echauchot for:

>/runners/core-java/src/main/java/org/apache/beam/runners/core/SplittableParDoViaKeyedWorkItems.java
 (path_pattern: /runners/core-java*)
 INFO:root:Selected reviewer @lukecwik for:
>/runners/flink/build.gradle
 (path_pattern: */build.gradle*)
 INFO:root:Selected reviewer @lukecwik for:

>/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkTransformOverrides.java
 (path_pattern: *.java)
 INFO:root:Selected reviewer @pabloem for:

>/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java
 (path_pattern: /runners/google-cloud-dataflow-java*)
 INFO:root:Selected reviewer @lukecwik for:

>/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/SplittableDoFnTest.java
 (path_pattern: /sdks/java/core*)
 Suggested reviewers: @echauchot, @lukecwik, @pabloem

 Script is in: https://github.com/apache/beam/pull/5951


 What does the community think? Do you prefer blame-based or
>rules-based
 reviewer suggestions?

 On Fri, Jul 13, 2018 at 11:13 AM Holden Karau
>
 wrote:

> I'm looking at something similar in the Spark project, and while
>it's
> now archived by FB it seems like something like
> https://github.com/facebookarchive/mention-bot might do what we
>want.
> I'm going to spin up a version on my K8 cluster and see if I can
>ask infra
> to add a webhook and if it works for Spark we could ask INFRA to
>add a
> second webhook for Beam. (Or if the Beam folks are more interested
>in
> experimenting I can do Beam first as a smaller project and roll
>with that).
>
> Let me know :)
>
> On Fri, Jul 13, 2018 at 10:53 AM, Eugene Kirpichov <
> kirpic...@google.com> wrote:
>
>> Sounds reasonable for now, thanks!
>> It's unfortunate that Github's CODEOWNERS feature appears to be
>> effectively unusable for Beam but I'd hope that Github might pay
>attention
>> and fix things if we submit feedback, with us being one of the
>most active
>> Apache projects - did anyone do this yet / planning to?
>>
>> On Fri, Jul 13, 2018 at 10:23 AM Udi Meiri 
>wrote:
>>
>>> While I like the idea of having a CODEOWNERS file, the Github
>>> implementation is lacking:
>>> 1. Reviewers are automatically assigned at each push.
>>> 2. Reviewer assignment can be excessive (e.g. 5 reviewers in
>Eugene's
>>> PR 5940).
>>> 3. Non-committers aren't assigned as reviewers.
>>> 4. Non-committers can't change the list of reviewers.
>>>
>>> I propose renaming the file to disable the auto-reviewer
>assignment
>>> feature.
>>> In its place I'll add a script that suggests reviewers.
>>>
>>> On Fri, Jul 13, 2018 at 9:09 AM Udi Meiri 
>wrote:
>>>
 Hi Etienne,

 Yes you could be as precise as you want. The paths I listed are
>just
 suggestions. :)


 On Fri, Jul 13, 2018 at 1:12 AM Jean-Baptiste Onofré <
 j...@nanthrax.net> wrote:

> Hi,
>
> I think it's already do-able just providing the expected path.
>
> It's a good idea especially for the core.
>
> Regards
> JB
>
> On 13/07/2018 09:51, Etienne Chauchot wrote:
> > Hi Udi,
> >
> > I also have a question, related to what Eugene asked : I see
>that
> the
> > code paths are the ones of the modules. Can we be more
>precise
> than that
> > to assign reviewers ? As an example, I added myself to
>runner/core
> > because I wanted to take a look at the PRs related to
> > runner/core/metrics but I'm getting assigned to all
>runner-core
> 

Re: CODEOWNERS for apache/beam repo

2018-07-16 Thread Holden Karau
Would folks be OK with me asking infra to turn on blame-based suggestions
for Beam and trying it out for a week?

On Mon, Jul 16, 2018, 6:53 PM Rafael Fernandez  wrote:

> +1 using blame -- nifty :)
>
> On Mon, Jul 16, 2018 at 2:31 AM Huygaa Batsaikhan 
> wrote:
>
>> +1. This is great.
>>
>> On Sat, Jul 14, 2018 at 7:44 AM Udi Meiri  wrote:
>>
>>> Mention bot looks cool, as it tries to guess the reviewer using blame.
>>> I've written a quick and dirty script that uses only CODEOWNERS.
>>>
>>> Its output looks like:
>>> $ python suggest_reviewers.py --pr 5940
>>> INFO:root:Selected reviewer @lukecwik for:
>>> /runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/PTransformMatchers.java
>>> (path_pattern: /runners/core-construction-java*)
>>> INFO:root:Selected reviewer @lukecwik for:
>>> /runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/SplittableParDoNaiveBounded.java
>>> (path_pattern: /runners/core-construction-java*)
>>> INFO:root:Selected reviewer @echauchot for:
>>> /runners/core-java/src/main/java/org/apache/beam/runners/core/SplittableParDoViaKeyedWorkItems.java
>>> (path_pattern: /runners/core-java*)
>>> INFO:root:Selected reviewer @lukecwik for: /runners/flink/build.gradle
>>> (path_pattern: */build.gradle*)
>>> INFO:root:Selected reviewer @lukecwik for:
>>> /runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkTransformOverrides.java
>>> (path_pattern: *.java)
>>> INFO:root:Selected reviewer @pabloem for:
>>> /runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java
>>> (path_pattern: /runners/google-cloud-dataflow-java*)
>>> INFO:root:Selected reviewer @lukecwik for:
>>> /sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/SplittableDoFnTest.java
>>> (path_pattern: /sdks/java/core*)
>>> Suggested reviewers: @echauchot, @lukecwik, @pabloem
>>>
>>> Script is in: https://github.com/apache/beam/pull/5951
>>>
>>>
>>> What does the community think? Do you prefer blame-based or rules-based
>>> reviewer suggestions?
>>>
>>> On Fri, Jul 13, 2018 at 11:13 AM Holden Karau 
>>> wrote:
>>>
 I'm looking at something similar in the Spark project, and while it's
 now archived by FB it seems like something like
 https://github.com/facebookarchive/mention-bot might do what we want.
 I'm going to spin up a version on my K8 cluster and see if I can ask infra
 to add a webhook and if it works for Spark we could ask INFRA to add a
 second webhook for Beam. (Or if the Beam folks are more interested in
 experimenting I can do Beam first as a smaller project and roll with that).

 Let me know :)

 On Fri, Jul 13, 2018 at 10:53 AM, Eugene Kirpichov <
 kirpic...@google.com> wrote:

> Sounds reasonable for now, thanks!
> It's unfortunate that Github's CODEOWNERS feature appears to be
> effectively unusable for Beam but I'd hope that Github might pay attention
> and fix things if we submit feedback, with us being one of the most active
> Apache projects - did anyone do this yet / planning to?
>
> On Fri, Jul 13, 2018 at 10:23 AM Udi Meiri  wrote:
>
>> While I like the idea of having a CODEOWNERS file, the Github
>> implementation is lacking:
>> 1. Reviewers are automatically assigned at each push.
>> 2. Reviewer assignment can be excessive (e.g. 5 reviewers in Eugene's
>> PR 5940).
>> 3. Non-committers aren't assigned as reviewers.
>> 4. Non-committers can't change the list of reviewers.
>>
>> I propose renaming the file to disable the auto-reviewer assignment
>> feature.
>> In its place I'll add a script that suggests reviewers.
>>
>> On Fri, Jul 13, 2018 at 9:09 AM Udi Meiri  wrote:
>>
>>> Hi Etienne,
>>>
>>> Yes you could be as precise as you want. The paths I listed are just
>>> suggestions. :)
>>>
>>>
>>> On Fri, Jul 13, 2018 at 1:12 AM Jean-Baptiste Onofré <
>>> j...@nanthrax.net> wrote:
>>>
 Hi,

 I think it's already do-able just providing the expected path.

 It's a good idea especially for the core.

 Regards
 JB

 On 13/07/2018 09:51, Etienne Chauchot wrote:
 > Hi Udi,
 >
 > I also have a question, related to what Eugene asked : I see that
 the
 > code paths are the ones of the modules. Can we be more precise
 than that
 > to assign reviewers ? As an example, I added myself to runner/core
 > because I wanted to take a look at the PRs related to
 > runner/core/metrics but I'm getting assigned to all runner-core
 PRs. Can
 > we specify paths like
 >
 runners/core-java/src/main/java/org/apache/beam/runners/core/metrics ?
 > I know it is a bit too precise so a bit risky, but in that
 particular
 > case, I 

Re: CODEOWNERS for apache/beam repo

2018-07-16 Thread Andrew Pilloud
I personally like blame-based suggestions. The downside is that you
effectively become an owner of anything you touch. Most of the time,
blame-based suggestions will return multiple candidates. Could we use the
CODEOWNERS file to filter down the suggestions?

Andrew
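One way to implement that filtering — sketched here with made-up handles; neither mention-bot nor the suggest script is known to work exactly this way — is to intersect the blame candidates with the CODEOWNERS set, falling back to blame alone when the intersection is empty:

```python
def filter_by_codeowners(blame_candidates, codeowners):
    """Keep blame-based suggestions that are also listed code owners.

    Falls back to the unfiltered blame list when the intersection is
    empty, so a PR never ends up with zero suggested reviewers.
    """
    filtered = [c for c in blame_candidates if c in codeowners]
    return filtered or list(blame_candidates)

# Hypothetical data: blame proposes three people, CODEOWNERS lists two.
blame = ["@lukecwik", "@echauchot", "@pabloem"]
owners = {"@echauchot", "@pabloem"}
print(filter_by_codeowners(blame, owners))  # ['@echauchot', '@pabloem']
```
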

On Mon, Jul 16, 2018 at 2:31 AM Huygaa Batsaikhan  wrote:

> +1. This is great.
>
> On Sat, Jul 14, 2018 at 7:44 AM Udi Meiri  wrote:
>
>> Mention bot looks cool, as it tries to guess the reviewer using blame.
>> I've written a quick and dirty script that uses only CODEOWNERS.
>>
>> Its output looks like:
>> $ python suggest_reviewers.py --pr 5940
>> INFO:root:Selected reviewer @lukecwik for:
>> /runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/PTransformMatchers.java
>> (path_pattern: /runners/core-construction-java*)
>> INFO:root:Selected reviewer @lukecwik for:
>> /runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/SplittableParDoNaiveBounded.java
>> (path_pattern: /runners/core-construction-java*)
>> INFO:root:Selected reviewer @echauchot for:
>> /runners/core-java/src/main/java/org/apache/beam/runners/core/SplittableParDoViaKeyedWorkItems.java
>> (path_pattern: /runners/core-java*)
>> INFO:root:Selected reviewer @lukecwik for: /runners/flink/build.gradle
>> (path_pattern: */build.gradle*)
>> INFO:root:Selected reviewer @lukecwik for:
>> /runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkTransformOverrides.java
>> (path_pattern: *.java)
>> INFO:root:Selected reviewer @pabloem for:
>> /runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java
>> (path_pattern: /runners/google-cloud-dataflow-java*)
>> INFO:root:Selected reviewer @lukecwik for:
>> /sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/SplittableDoFnTest.java
>> (path_pattern: /sdks/java/core*)
>> Suggested reviewers: @echauchot, @lukecwik, @pabloem
>>
>> Script is in: https://github.com/apache/beam/pull/5951
>>
>>
>> What does the community think? Do you prefer blame-based or rules-based
>> reviewer suggestions?
>>
>> On Fri, Jul 13, 2018 at 11:13 AM Holden Karau 
>> wrote:
>>
>>> I'm looking at something similar in the Spark project, and while it's
>>> now archived by FB it seems like something like
>>> https://github.com/facebookarchive/mention-bot might do what we want.
>>> I'm going to spin up a version on my K8 cluster and see if I can ask infra
>>> to add a webhook and if it works for Spark we could ask INFRA to add a
>>> second webhook for Beam. (Or if the Beam folks are more interested in
>>> experimenting I can do Beam first as a smaller project and roll with that).
>>>
>>> Let me know :)
>>>
>>> On Fri, Jul 13, 2018 at 10:53 AM, Eugene Kirpichov >> > wrote:
>>>
 Sounds reasonable for now, thanks!
 It's unfortunate that Github's CODEOWNERS feature appears to be
 effectively unusable for Beam but I'd hope that Github might pay attention
 and fix things if we submit feedback, with us being one of the most active
 Apache projects - did anyone do this yet / planning to?

 On Fri, Jul 13, 2018 at 10:23 AM Udi Meiri  wrote:

> While I like the idea of having a CODEOWNERS file, the Github
> implementation is lacking:
> 1. Reviewers are automatically assigned at each push.
> 2. Reviewer assignment can be excessive (e.g. 5 reviewers in Eugene's
> PR 5940).
> 3. Non-committers aren't assigned as reviewers.
> 4. Non-committers can't change the list of reviewers.
>
> I propose renaming the file to disable the auto-reviewer assignment
> feature.
> In its place I'll add a script that suggests reviewers.
>
> On Fri, Jul 13, 2018 at 9:09 AM Udi Meiri  wrote:
>
>> Hi Etienne,
>>
>> Yes you could be as precise as you want. The paths I listed are just
>> suggestions. :)
>>
>>
>> On Fri, Jul 13, 2018 at 1:12 AM Jean-Baptiste Onofré 
>> wrote:
>>
>>> Hi,
>>>
>>> I think it's already do-able just providing the expected path.
>>>
>>> It's a good idea especially for the core.
>>>
>>> Regards
>>> JB
>>>
>>> On 13/07/2018 09:51, Etienne Chauchot wrote:
>>> > Hi Udi,
>>> >
>>> > I also have a question, related to what Eugene asked : I see that
>>> the
>>> > code paths are the ones of the modules. Can we be more precise
>>> than that
>>> > to assign reviewers ? As an example, I added myself to runner/core
>>> > because I wanted to take a look at the PRs related to
>>> > runner/core/metrics but I'm getting assigned to all runner-core
>>> PRs. Can
>>> > we specify paths like
>>> >
>>> runners/core-java/src/main/java/org/apache/beam/runners/core/metrics ?
>>> > I know it is a bit too precise so a bit risky, but in that
>>> particular
>>> > case, I doubt that the path will change.
>>> >
>>> > Etienne
>>> >
>>> > Le jeudi 12 

Re: Beam site test/merge Jenkins issue

2018-07-16 Thread Andrew Pilloud
This is a really persistent flap in the website build. I would guess it
hits 80% of the time, if you do enough builds it will eventually succeed. I
opened an issue on it a while back:
https://issues.apache.org/jira/browse/BEAM-4686

Andrew

On Mon, Jul 16, 2018 at 5:05 AM Jean-Baptiste Onofré 
wrote:

> Hi,
>
> let me take a look, it's maybe the client key auth which failing.
>
> Regards
> JB
>
> On 16/07/2018 13:02, Alexey Romanenko wrote:
> > Hi,
> >
> > From time to time, I observe *gpg key* issues in Jenkins job when I try
> > to test/merge Beam site PR.
> > For example:
> > https://builds.apache.org/job/beam_PreCommit_Website_Stage/1192/console
> >
> > It says the following:
> > gpg: keyserver communications error: keyserver helper general error
> > gpg: keyserver communications error: unknown pubkey algorithm
> > gpg: keyserver receive failed: unknown pubkey algorithm
> >
> > Is it a known problem and how can I overcome this?
> >
> > Thank you,
> > Alexey
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [DISCUSS] Use Confluence wiki for non-user-facing stuff

2018-07-16 Thread Andrew Pilloud
Your doc looks good to me. It looks like only one question remains: should
it be a Confluence or a GitHub wiki. I see other Apache projects using both,
so it seems like either one is possible with support of the Beam community.
It might be time to call a vote on this?

Andrew

On Fri, Jul 13, 2018 at 9:14 PM Mikhail Gryzykhin  wrote:

> Hello everyone, it's Mikhail, and I'm here to revive this long-sleeping
> thread.
>
> I have summarized the discussion above into a design/proposal document.
>
> The initial proposal is what I consider the best approach, but it is open
> to change.
>
> Please comment on the following topics:
> 1. Other engines you have in mind.
> 2. Whether you have access to configure the corresponding engine.
> 3. General ideas.
>
> Since this is a long-desired change, please be active.
>
> --Mikhail
>
>
>
> On Tue, Jun 12, 2018 at 5:24 PM Griselda Cuevas  wrote:
>
>>
>> Hi Everyone,
>>
>>
>> (a) should we do it? -- I like the idea of having a wiki, yes. Mainly to
>> differentiate the documentation we cater to users and the one we cater to
>> contributors. For things like user examples, and more demo-y content, I'd
>> suggest we still host it in the Website.
>>
>> (b) what should go there? -- The ultimate purpose of the wiki should be
>> to host everything needed to a) get started (with official documentation)
>> and b) how to get the most out of Beam (here is where I see things like
>> what Robert suggested could fit, tips, tricks and other cool things created
>> by and for our contributors.)
>>
>> (c) what should not go there? -- Any demos, examples or showcases. I
>> think that material should be either embedded or linked (listed) in the
>> website.
>>
>> To summarize, I'd like the wiki to be a knowledge collection for *people
>> who contribute* to the project and the website the collection of
>> information that allows *someone to make the decision to use Beam* (or
>> join the community).
>>
>> When we are ready to vote on the creation of a wiki, I'd like to propose
>> that the first thing we document there is the Beam Improvement plan along
>> side with a concrete "Get Started Contributing to Beam" cheatsheet.
>>
>> WDYT?
>>
>>
>> On Tue, 12 Jun 2018 at 09:34, Alexey Romanenko 
>> wrote:
>>
>>> +1 for having Wiki for devs and users.
>>>
>>> Even though editing interface is not so native and obvious (comparing to
>>> Google docs), but, at least, it will be already put in one place and should
>>> be much more easy to search and discover.
>>>
>>> The only my concern about Wiki (based on using it in other different
>>> projects) that, in course of time, the information becomes outdated and
>>> weak structured which makes this not so valuable and even deceptive.
>>>
>>> WBR,
>>> Alexey
>>>
>>> On 12 Jun 2018, at 18:01, Robert Bradshaw  wrote:
>>>
>>> On Mon, Jun 11, 2018 at 2:40 PM Kenneth Knowles  wrote:
>>>
 OK, yea, that all makes sense to me. Like this?

  - site/documentation: writing just for users
  - site/contribute: basic stuff as-is, writing for users to entice
 them, links to the next...
  - wiki/contributors: contributors writing just for each other

 And you also have

  - wiki/users: users writing for users

 That's interesting.

>>>
>>> Yep. We don't have to start wiki/users right away, but it could be
>>> useful down the line.
>>>
>>>
>>>
 On Mon, Jun 11, 2018 at 2:30 PM Robert Bradshaw 
 wrote:

> On Fri, Jun 8, 2018 at 2:18 PM Kenneth Knowles  wrote:
>
>
>> I disagree strongly here - I don't think the wiki will have
>> appropriate polish for users. Even if carefully polished I don't think 
>> the
>> presentation style is right, and it is not flexible. Power users will 
>> find
>> it, of course.
>>
>
> I wasn't imagining a wiki as a platform for developers to author
> documentation, rather a place for users to author content for other users
> (tips and tricks, handy PTransforms, etc.) at a much lower bar than
> expecting users to go in and update our documentation. I agree with the
> goal of not (further) fragmenting our documentation.
>
> As for mixing contributor vs. user information on the same site, I
> think it's valuable to have some integration and treat the two as a
> continuum (after all, our (direct) users are already developers) and
> consider it an asset to have a "contribute" heading right in the main 
> site.
> (Perhaps, if it's confusing, we could move it all the way to the right.) I
> don't think we'll be doing ourselves a favor by blinding copying all the
> existing docs to a wiki. That being said I think it makes sense to start
> playing with using a wiki, and see how much value that adds on top of what
> we already have.
>
>
>>
>>
>>> On Fri, Jun 

Re: Let's start getting rid of BoundedSource

2018-07-16 Thread Jean-Baptiste Onofré
Hi guys,

I think the whole purpose of SDF is to simplify this kind of BoundedSource-like writing.

I agree that an extended @SplitRestriction is a good approach.

Regards
JB

On 16/07/2018 16:52, Eugene Kirpichov wrote:
> Hi Etienne - thanks for catching this; indeed, I somehow missed that
> actually several runners do this same thing - it seemed to me as
> something that can be done in user code (because it involves combining
> estimated size + split in pretty much the same way), but I'm not so
> sure: even though many runners have a "desired parallelism" option or
> alike, it's not all of them, so we can't use such an option universally.
> 
> Maybe then the right thing to do is to:
> - Use bounded SDFs for these
> - Change SDF @SplitRestriction API to take a desired number of splits as
> a parameter, and introduce an API @EstimateOutputSizeBytes(element)
> valid only on bounded SDFs
> - Add some plumbing to the standard bounded SDF expansion so that
> different runners can compute that parameter differently, the two
> standard ways being "split into given number of splits" or "split based
> on the sub-linear formula of estimated size".
> 
> I think this would work, though this is somewhat more work than I
> anticipated. Any alternative ideas?
> 
> On Mon, Jul 16, 2018 at 3:07 AM Etienne Chauchot  > wrote:
> 
> Hi,
> thanks Eugene for analyzing and sharing that.
> I have one comment inline
> 
> Etienne
> 
> Le dimanche 15 juillet 2018 à 14:20 -0700, Eugene Kirpichov a écrit :
>> Hey beamers,
>>
>> I've always wondered whether the BoundedSource implementations in
>> the Beam SDK are worth their complexity, or whether they rather
>> could be converted to the much easier to code ParDo style, which
>> is also more modular and allows you to very easily implement
>> readAll().
>>
>> There's a handful: file-based sources, BigQuery, Bigtable, HBase,
>> Elasticsearch, MongoDB, Solr and a couple more.
>>
>> Curiously enough, BoundedSource vs. ParDo matters *only* on
>> Dataflow, because AFAICT Dataflow is the only runner that cares
>> about the things that BoundedSource can do and ParDo can't:
>> - size estimation (used to choose an initial number of workers)
>> [ok, Flink calls the function to return statistics, but doesn't
>> seem to do anything else with it]
> => Spark uses size estimation to set desired bundle size with
> something like desiredBundleSize = estimatedSize /
> nbOfWorkersConfigured (partitions)
> See
> 
> https://github.com/apache/beam/blob/a5634128d194161aebc8d03229fdaa1066cf7739/runners/spark/src/main/java/org/apache/beam/runners/spark/io/SourceRDD.java#L101
> 
> 
>> - splitting into bundles of given size (Dataflow chooses the
>> number of bundles to create based on a simple formula that's not
>> entirely unlike K*sqrt(size))
>> - liquid sharding (splitAtFraction())
>>
>> If Dataflow didn't exist, there'd be no reason at all to use
>> BoundedSource. So the question "which ones can be converted to
>> ParDo" is really "which ones are used on Dataflow in ways that
>> make these functions matter". Previously, my conservative
>> assumption was that the answer is "all of them", but turns out
>> this is not so.
>>
>> Liquid sharding always matters; if the source is liquid-shardable,
>> for now we have to keep it a source (until SDF gains liquid
>> sharding - which should happen in a quarter or two I think).
>>
>> Choosing number of bundles to split into is easily done in SDK
>> code, see https://github.com/apache/beam/pull/5886 for example;
>> DatastoreIO does something similar.
>>
>> The remaining thing to analyze is, when does initial scaling
>> matter. So as a member of the Dataflow team, I analyzed statistics
>> of production Dataflow jobs in the past month. I can not share my
>> queries nor the data, because they are proprietary to Google - so
>> I am sharing just the general methodology and conclusions, because
>> they matter to the Beam community. I looked at a few criteria,
>> such as:
>> - The job should be not too short and not too long: if it's too
>> short then scaling couldn't have kicked in much at all; if it's
>> too long then dynamic autoscaling would have been sufficient.
>> - The job should use, at peak, at least a handful of workers
>> (otherwise means it wasn't used in settings where much scaling
>> happened)
>> After a couple more rounds of narrowing-down, with some
>> hand-checking that the results and criteria so far make sense, I
>> ended up with nothing - no jobs that would have suffered a serious
>> performance regression if their BoundedSource had not supported
>> initial size estimation [of course, except for the
>> liquid-shardable ones].
>>
>> Based on this, I would like to propose to convert the following
>> 

Re: Let's start getting rid of BoundedSource

2018-07-16 Thread Eugene Kirpichov
Hi Etienne - thanks for catching this; indeed, I somehow missed that
several runners actually do this same thing - it seemed to me like something
that could be done in user code (because it involves combining estimated size
+ split in pretty much the same way), but I'm not so sure: even though many
runners have a "desired parallelism" option or the like, not all of them do,
so we can't use such an option universally.

Maybe then the right thing to do is to:
- Use bounded SDFs for these
- Change SDF @SplitRestriction API to take a desired number of splits as a
parameter, and introduce an API @EstimateOutputSizeBytes(element) valid
only on bounded SDFs
- Add some plumbing to the standard bounded SDF expansion so that different
runners can compute that parameter differently, the two standard ways being
"split into given number of splits" or "split based on the sub-linear
formula of estimated size".

I think this would work, though this is somewhat more work than I
anticipated. Any alternative ideas?
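As a rough illustration, the two standard ways of computing the parameter could boil down to something like the following (plain Python pseudologic; the function name and the constant k are made up for illustration, not part of any SDF API):

```python
import math

def desired_num_splits(estimated_size_bytes, desired_parallelism=None, k=0.001):
    """Pick a split count the way a runner might.

    Two standard strategies: "split into a given number of splits" (when
    the runner exposes a desired-parallelism option), or "split based on
    a sub-linear formula of estimated size" (roughly K*sqrt(size); k here
    is purely illustrative, not a value any runner actually uses).
    """
    if desired_parallelism is not None:
        return max(1, desired_parallelism)
    return max(1, int(k * math.sqrt(estimated_size_bytes)))
```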

On Mon, Jul 16, 2018 at 3:07 AM Etienne Chauchot 
wrote:

> Hi,
> thanks Eugene for analyzing and sharing that.
> I have one comment inline
>
> Etienne
>
> Le dimanche 15 juillet 2018 à 14:20 -0700, Eugene Kirpichov a écrit :
>
> Hey beamers,
>
> I've always wondered whether the BoundedSource implementations in the Beam
> SDK are worth their complexity, or whether they rather could be converted
> to the much easier to code ParDo style, which is also more modular and
> allows you to very easily implement readAll().
>
> There's a handful: file-based sources, BigQuery, Bigtable, HBase,
> Elasticsearch, MongoDB, Solr and a couple more.
>
> Curiously enough, BoundedSource vs. ParDo matters *only* on Dataflow,
> because AFAICT Dataflow is the only runner that cares about the things that
> BoundedSource can do and ParDo can't:
> - size estimation (used to choose an initial number of workers) [ok, Flink
> calls the function to return statistics, but doesn't seem to do anything
> else with it]
>
> => Spark uses size estimation to set desired bundle size with something
> like desiredBundleSize = estimatedSize / nbOfWorkersConfigured (partitions)
> See
> https://github.com/apache/beam/blob/a5634128d194161aebc8d03229fdaa1066cf7739/runners/spark/src/main/java/org/apache/beam/runners/spark/io/SourceRDD.java#L101
>
>
> - splitting into bundles of given size (Dataflow chooses the number of
> bundles to create based on a simple formula that's not entirely unlike
> K*sqrt(size))
> - liquid sharding (splitAtFraction())
>
> If Dataflow didn't exist, there'd be no reason at all to use
> BoundedSource. So the question "which ones can be converted to ParDo" is
> really "which ones are used on Dataflow in ways that make these functions
> matter". Previously, my conservative assumption was that the answer is "all
> of them", but turns out this is not so.
>
> Liquid sharding always matters; if the source is liquid-shardable, for now
> we have to keep it a source (until SDF gains liquid sharding - which should
> happen in a quarter or two I think).
>
> Choosing number of bundles to split into is easily done in SDK code, see
> https://github.com/apache/beam/pull/5886 for example; DatastoreIO does
> something similar.
>
> The remaining thing to analyze is, when does initial scaling matter. So as
> a member of the Dataflow team, I analyzed statistics of production Dataflow
> jobs in the past month. I can not share my queries nor the data, because
> they are proprietary to Google - so I am sharing just the general
> methodology and conclusions, because they matter to the Beam community. I
> looked at a few criteria, such as:
> - The job should be not too short and not too long: if it's too short then
> scaling couldn't have kicked in much at all; if it's too long then dynamic
> autoscaling would have been sufficient.
> - The job should use, at peak, at least a handful of workers (otherwise
> means it wasn't used in settings where much scaling happened)
> After a couple more rounds of narrowing-down, with some hand-checking that
> the results and criteria so far make sense, I ended up with nothing - no
> jobs that would have suffered a serious performance regression if their
> BoundedSource had not supported initial size estimation [of course, except
> for the liquid-shardable ones].
>
> Based on this, I would like to propose to convert the following
> BoundedSource-based IOs to ParDo-based, and while we're at it, probably
> also add readAll() versions (not necessarily in exactly the same PR):
> - ElasticsearchIO
> - SolrIO
> - MongoDbIO
> - MongoDbGridFSIO
> - CassandraIO
> - HCatalogIO
> - HadoopInputFormatIO
> - UnboundedToBoundedSourceAdapter (already have a PR in progress for this
> one)
> These would not translate to a single ParDo - rather, they'd translate to
> ParDo(estimate size and split according to the formula), Reshuffle,
> ParDo(read data) - or possibly to a bounded SDF doing roughly the same
> (luckily after 

Beam Dependency Check Report (2018-07-16)

2018-07-16 Thread Apache Jenkins Server

High Priority Dependency Updates Of Beam Python SDK:

  Dependency Name       | Current Version | Latest Version | Release Date Of The Current Used Version | Release Date Of The Latest Release
  google-cloud-bigquery | 0.25.0          | 1.4.0          | 2017-06-26 | 2018-07-16
  google-cloud-core     | 0.25.0          | 0.28.1         | 2018-06-07 | 2018-06-07
  google-cloud-pubsub   | 0.26.0          | 0.35.4         | 2017-06-26 | 2018-06-08
  ply                   | 3.8             | 3.11           | 2018-06-07 | 2018-06-07

High Priority Dependency Updates Of Beam Java SDK:

  Dependency Name | Current Version | Latest Version | Release Date Of The Current Used Version | Release Date Of The Latest Release
  org.assertj:assertj-core | 2.5.0 | 3.10.0 | 2016-07-03 | 2018-05-11
  com.google.auto.service:auto-service | 1.0-rc2 | 1.0-rc4 | 2018-06-25 | 2017-12-11
  biz.aQute:bndlib | 1.43.0 | 2.0.0.20130123-133441 | 2018-06-25 | 2018-06-25
  org.apache.cassandra:cassandra-all | 3.9 | 3.11.2 | 2016-09-26 | 2018-02-14
  org.apache.commons:commons-dbcp2 | 2.1.1 | 2.5.0 | 2015-08-02 | 2018-07-16
  de.flapdoodle.embed:de.flapdoodle.embed.mongo | 1.50.1 | 2.1.1 | 2015-12-11 | 2018-06-25
  de.flapdoodle.embed:de.flapdoodle.embed.process | 1.50.1 | 2.0.5 | 2015-12-11 | 2018-06-25
  org.apache.derby:derby | 10.12.1.1 | 10.14.2.0 | 2015-10-10 | 2018-05-03
  org.apache.derby:derbyclient | 10.12.1.1 | 10.14.2.0 | 2015-10-10 | 2018-05-03
  org.apache.derby:derbynet | 10.12.1.1 | 10.14.2.0 | 2015-10-10 | 2018-05-03
  org.elasticsearch:elasticsearch | 5.6.3 | 6.3.1 | 2017-10-06 | 2018-07-09
  org.elasticsearch:elasticsearch-hadoop | 5.0.0 | 6.3.1 | 2016-10-26 | 2018-07-09
  org.elasticsearch.client:elasticsearch-rest-client | 5.6.3 | 6.3.1 | 2017-10-06 | 2018-07-09
  com.google.errorprone:error_prone_annotations | 2.1.2 | 2.3.1 | None | 2018-07-16
  com.alibaba:fastjson | 1.2.12 | 1.2.47 | 2016-05-21 | 2018-03-15
  org.elasticsearch.test:framework | 5.6.3 | 6.3.1 | 2017-10-06 | 2018-07-09
  org.freemarker:freemarker | 2.3.25-incubating | 2.3.28 | 2016-06-14 | 2018-03-30
  net.ltgt.gradle:gradle-apt-plugin | 0.13 | 0.17 | 2017-11-01 | 2018-06-25
  com.commercehub.gradle.plugin:gradle-avro-plugin | 0.11.0 | 0.14.2 | 2018-01-30 | 2018-06-06
  gradle.plugin.com.palantir.gradle.docker:gradle-docker | 0.13.0 | 0.20.1 | 2017-04-05 | 2018-07-09
  com.github.ben-manes:gradle-versions-plugin | 0.17.0 | 0.20.0 | 2018-06-06 | 2018-06-25
  org.codehaus.groovy:groovy-all | 2.4.13 | 3.0.0-alpha-3 | 2017-11-22 | 2018-06-26
  io.grpc:grpc-context | 1.12.0 | 1.13.2 | None | 2018-07-16
  io.grpc:grpc-protobuf | 1.12.0 | 1.13.2 | None | 2018-07-16
  io.grpc:grpc-testing | 1.12.0 | 1.13.2 | None | 2018-07-16
  com.google.code.gson:gson | 2.7 | 2.8.5 | None | 2018-07-16
  com.google.guava:guava | 20.0 | 25.1-jre | None | 2018-07-16
  org.apache.hbase:hbase-common | 1.2.6 | 2.0.1 | 2017-05-29 | 2018-06-25
  org.apache.hbase:hbase-hadoop-compat | 1.2.6 | 2.0.1 | 2017-05-29 | 2018-06-25
  org.apache.hbase:hbase-hadoop2-compat | 1.2.6 | 2.0.1 | 2017-05-29 | 2018-06-25
  org.apache.hbase:hbase-server | 1.2.6 | 2.0.1 | 2017-05-29 | 2018-06-25
  org.apache.hbase:hbase-shaded-client | 1.2.6 | 2.0.1 | 2017-05-29 | 2018-06-25
  org.apache.hbase:hbase-shaded-server | 1.2.6 | 2.0.0-alpha2 | 2017-05-29 | 2018-05-31
  org.apache.hive:hive-cli | 2.1.0 | 3.1.0.3.0.0.0-1634 | 2016-06-16 | 2018-07-16
  org.apache.hive:hive-common | 2.1.0 | 3.1.0.3.0.0.0-1634 | 2016-06-16 | 2018-07-16

Re: Beam site test/merge Jenkins issue

2018-07-16 Thread Jean-Baptiste Onofré
Hi,

let me take a look; it may be the client key auth which is failing.

Regards
JB

On 16/07/2018 13:02, Alexey Romanenko wrote:
> Hi,
> 
> From time to time, I observe *gpg key* issues in Jenkins job when I try
> to test/merge Beam site PR.
> For example:
> https://builds.apache.org/job/beam_PreCommit_Website_Stage/1192/console
> 
> It says the following:
> gpg: keyserver communications error: keyserver helper general error
> gpg: keyserver communications error: unknown pubkey algorithm
> gpg: keyserver receive failed: unknown pubkey algorithm
> 
> Is it a known problem and how can I overcome it?
> 
> Thank you,
> Alexey

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Beam site test/merge Jenkins issue

2018-07-16 Thread Alexey Romanenko
Hi,

From time to time, I observe gpg key issues in Jenkins job when I try to 
test/merge Beam site PR.
For example:
https://builds.apache.org/job/beam_PreCommit_Website_Stage/1192/console 


It says the following:
gpg: keyserver communications error: keyserver helper general error
gpg: keyserver communications error: unknown pubkey algorithm
gpg: keyserver receive failed: unknown pubkey algorithm

Is it a known problem and how can I overcome it?

Thank you,
Alexey

Re: Let's start getting rid of BoundedSource

2018-07-16 Thread Etienne Chauchot
Hi,
thanks Eugene for analyzing and sharing that.
I have one comment inline
Etienne
Le dimanche 15 juillet 2018 à 14:20 -0700, Eugene Kirpichov a écrit :
> Hey beamers,
> I've always wondered whether the BoundedSource implementations in the Beam 
> SDK are worth their complexity, or whether
> they rather could be converted to the much easier to code ParDo style, which 
> is also more modular and allows you to
> very easily implement readAll().
> 
> There's a handful: file-based sources, BigQuery, Bigtable, HBase, 
> Elasticsearch, MongoDB, Solr and a couple more.
> 
> Curiously enough, BoundedSource vs. ParDo matters *only* on Dataflow, because 
> AFAICT Dataflow is the only runner that
> cares about the things that BoundedSource can do and ParDo can't:
> - size estimation (used to choose an initial number of workers) [ok, Flink 
> calls the function to return statistics,
> but doesn't seem to do anything else with it]
=> Spark uses size estimation to set desired bundle size with something like
desiredBundleSize = estimatedSize / nbOfWorkersConfigured (partitions). See
https://github.com/apache/beam/blob/a5634128d194161aebc8d03229fdaa1066cf7739/runners/spark/src/main/java/org/apache/beam/runners/spark/io/SourceRDD.java#L101
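In plain terms, the Spark-runner formula above amounts to the following (a minimal sketch; the function name is made up, and "nb_of_workers_configured" stands for the configured partition count, not a real API):

```python
def desired_bundle_size(estimated_size_bytes, nb_of_workers_configured):
    # desiredBundleSize = estimatedSize / nbOfWorkersConfigured,
    # clamped so that a tiny source still yields at least 1 byte per bundle.
    return max(1, estimated_size_bytes // max(1, nb_of_workers_configured))
```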

> - splitting into bundles of given size (Dataflow chooses the number of 
> bundles to create based on a simple formula
> that's not entirely unlike K*sqrt(size))
> - liquid sharding (splitAtFraction())
> 
> If Dataflow didn't exist, there'd be no reason at all to use BoundedSource. 
> So the question "which ones can be
> converted to ParDo" is really "which ones are used on Dataflow in ways that 
> make these functions matter". Previously,
> my conservative assumption was that the answer is "all of them", but turns 
> out this is not so.
> 
> Liquid sharding always matters; if the source is liquid-shardable, for now we 
> have to keep it a source (until SDF
> gains liquid sharding - which should happen in a quarter or two I think).
> 
> Choosing number of bundles to split into is easily done in SDK code, see 
> https://github.com/apache/beam/pull/5886 for
> example; DatastoreIO does something similar.
> 
> The remaining thing to analyze is, when does initial scaling matter. So as a 
> member of the Dataflow team, I analyzed
> statistics of production Dataflow jobs in the past month. I can not share my 
> queries nor the data, because they are
> proprietary to Google - so I am sharing just the general methodology and 
> conclusions, because they matter to the Beam
> community. I looked at a few criteria, such as:
> - The job should be not too short and not too long: if it's too short then 
> scaling couldn't have kicked in much at
> all; if it's too long then dynamic autoscaling would have been sufficient.
> - The job should use, at peak, at least a handful of workers (otherwise means 
> it wasn't used in settings where much
> scaling happened)
> After a couple more rounds of narrowing-down, with some hand-checking that 
> the results and criteria so far make sense,
> I ended up with nothing - no jobs that would have suffered a serious 
> performance regression if their BoundedSource had
> not supported initial size estimation [of course, except for the 
> liquid-shardable ones].
> 
> Based on this, I would like to propose to convert the following 
> BoundedSource-based IOs to ParDo-based, and while
> we're at it, probably also add readAll() versions (not necessarily in exactly 
> the same PR):
> - ElasticsearchIO
> - SolrIO
> - MongoDbIO
> - MongoDbGridFSIO
> - CassandraIO
> - HCatalogIO
> - HadoopInputFormatIO
> - UnboundedToBoundedSourceAdapter (already have a PR in progress for this one)
> These would not translate to a single ParDo - rather, they'd translate to 
> ParDo(estimate size and split according to
> the formula), Reshuffle, ParDo(read data) - or possibly to a bounded SDF 
> doing roughly the same (luckily after
> https://github.com/apache/beam/pull/5940 all runners at master will support
> bounded SDF so this is safe compatibility-wise).
> Pretty much like DatastoreIO does.
> 
> I would like to also propose to change the IO authoring guide 
> https://beam.apache.org/documentation/io/authoring-overv
> iew/#when-to-implement-using-the-source-api to basically say "Never implement 
> a new BoundedSource unless you can
> support liquid sharding". And add a utility for computing a desired number of 
> splits.
> 
> There might be some more details here to iron out, but I wanted to check with 
> the community that this overall makes
> sense.
> 
> Thanks.
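The proposed ParDo(estimate size and split) -> Reshuffle -> ParDo(read data) shape can be mimicked over a toy list "source" to make the data flow concrete (illustrative only; no Beam APIs involved, and all names here are made up):

```python
import random

# Toy "source": a list of integers; a split is a (start, end) range.
DATA = list(range(100))

def split_source(data, desired_bundle_size):
    # ParDo(estimate size and split): emit one restriction per bundle.
    n = len(data)
    return [(i, min(i + desired_bundle_size, n))
            for i in range(0, n, desired_bundle_size)]

def read_split(data, split):
    # ParDo(read data): read a single restriction.
    start, end = split
    return data[start:end]

def expand(data, desired_bundle_size):
    splits = split_source(data, desired_bundle_size)
    random.shuffle(splits)  # stands in for Reshuffle's rebalancing
    out = []
    for s in splits:
        out.extend(read_split(data, s))
    return out
```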

Re: CODEOWNERS for apache/beam repo

2018-07-16 Thread Huygaa Batsaikhan
+1. This is great.

On Sat, Jul 14, 2018 at 7:44 AM Udi Meiri  wrote:

> Mention bot looks cool, as it tries to guess the reviewer using blame.
> I've written a quick and dirty script that uses only CODEOWNERS.
>
> Its output looks like:
> $ python suggest_reviewers.py --pr 5940
> INFO:root:Selected reviewer @lukecwik for:
> /runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/PTransformMatchers.java
> (path_pattern: /runners/core-construction-java*)
> INFO:root:Selected reviewer @lukecwik for:
> /runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/SplittableParDoNaiveBounded.java
> (path_pattern: /runners/core-construction-java*)
> INFO:root:Selected reviewer @echauchot for:
> /runners/core-java/src/main/java/org/apache/beam/runners/core/SplittableParDoViaKeyedWorkItems.java
> (path_pattern: /runners/core-java*)
> INFO:root:Selected reviewer @lukecwik for: /runners/flink/build.gradle
> (path_pattern: */build.gradle*)
> INFO:root:Selected reviewer @lukecwik for:
> /runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkTransformOverrides.java
> (path_pattern: *.java)
> INFO:root:Selected reviewer @pabloem for:
> /runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java
> (path_pattern: /runners/google-cloud-dataflow-java*)
> INFO:root:Selected reviewer @lukecwik for:
> /sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/SplittableDoFnTest.java
> (path_pattern: /sdks/java/core*)
> Suggested reviewers: @echauchot, @lukecwik, @pabloem
>
> Script is in: https://github.com/apache/beam/pull/5951
>
>
> What does the community think? Do you prefer blame-based or rules-based
> reviewer suggestions?
>
> On Fri, Jul 13, 2018 at 11:13 AM Holden Karau 
> wrote:
>
>> I'm looking at something similar in the Spark project, and while it's now
>> archived by FB it seems like something like
>> https://github.com/facebookarchive/mention-bot might do what we want.
>> I'm going to spin up a version on my K8 cluster and see if I can ask infra
>> to add a webhook and if it works for Spark we could ask INFRA to add a
>> second webhook for Beam. (Or if the Beam folks are more interested in
>> experimenting I can do Beam first as a smaller project and roll with that).
>>
>> Let me know :)
>>
>> On Fri, Jul 13, 2018 at 10:53 AM, Eugene Kirpichov 
>> wrote:
>>
>>> Sounds reasonable for now, thanks!
>>> It's unfortunate that Github's CODEOWNERS feature appears to be
>>> effectively unusable for Beam but I'd hope that Github might pay attention
>>> and fix things if we submit feedback, with us being one of the most active
>>> Apache projects - did anyone do this yet / planning to?
>>>
>>> On Fri, Jul 13, 2018 at 10:23 AM Udi Meiri  wrote:
>>>
 While I like the idea of having a CODEOWNERS file, the Github
 implementation is lacking:
 1. Reviewers are automatically assigned at each push.
 2. Reviewer assignment can be excessive (e.g. 5 reviewers in Eugene's
 PR 5940).
 3. Non-committers aren't assigned as reviewers.
 4. Non-committers can't change the list of reviewers.

 I propose renaming the file to disable the auto-reviewer assignment
 feature.
 In its place I'll add a script that suggests reviewers.

 On Fri, Jul 13, 2018 at 9:09 AM Udi Meiri  wrote:

> Hi Etienne,
>
> Yes you could be as precise as you want. The paths I listed are just
> suggestions. :)
>
>
> On Fri, Jul 13, 2018 at 1:12 AM Jean-Baptiste Onofré 
> wrote:
>
>> Hi,
>>
>> I think it's already do-able just providing the expected path.
>>
>> It's a good idea especially for the core.
>>
>> Regards
>> JB
>>
>> On 13/07/2018 09:51, Etienne Chauchot wrote:
>> > Hi Udi,
>> >
>> > I also have a question, related to what Eugene asked : I see that
>> the
>> > code paths are the ones of the modules. Can we be more precise than
>> that
>> > to assign reviewers ? As an example, I added myself to runner/core
>> > because I wanted to take a look at the PRs related to
>> > runner/core/metrics but I'm getting assigned to all runner-core
>> PRs. Can
>> > we specify paths like
>> >
>> runners/core-java/src/main/java/org/apache/beam/runners/core/metrics ?
>> > I know it is a bit too precise so a bit risky, but in that
>> particular
>> > case, I doubt that the path will change.
>> >
>> > Etienne
>> >
>> > Le jeudi 12 juillet 2018 à 16:49 -0700, Eugene Kirpichov a écrit :
>> >> Hi Udi,
>> >>
>> >> I see that the PR was merged - thanks! However it seems to have
>> some
>> >> unintended effects.
>> >>
>> >> On my PR https://github.com/apache/beam/pull/5940 , I assigned a
>> >> reviewer manually, but the moment I pushed a new commit, it
>> >> auto-assigned a lot of other people to it, and I had to remove
>> them.
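A rules-based suggester like the one in Udi's script boils down to glob matching against the CODEOWNERS patterns (a minimal sketch assuming GitHub's last-match-wins semantics; the rules and paths below are made up for illustration):

```python
import fnmatch

def suggest_reviewers(changed_files, rules):
    """rules: ordered (path_pattern, owners) pairs, as read from CODEOWNERS.

    As in GitHub's CODEOWNERS, the last pattern matching a file wins;
    the suggestion set is the union over all changed files.
    """
    reviewers = set()
    for path in changed_files:
        selected = ()
        for pattern, owners in rules:
            # Prefix-style patterns ("/runners/core-java*") match as-is;
            # others ("*.java") get a trailing wildcard so they match anywhere.
            glob = pattern if pattern.endswith('*') else pattern + '*'
            if fnmatch.fnmatch(path, glob):
                selected = owners
        reviewers.update(selected)
    return sorted(reviewers)
```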

Build failed in Jenkins: beam_Release_Gradle_NightlySnapshot #102

2018-07-16 Thread Apache Jenkins Server
See 


Changes:

[github] [BEAM-4432] Adding Sources to produce Synthetic output for Batch

--
[...truncated 17.81 MB...]
:beam-sdks-java-maven-archetypes-starter:compileTestJava (Thread[Daemon 
worker,5,main]) completed. Took 0.001 secs.
:beam-sdks-java-maven-archetypes-starter:processTestResources (Thread[Daemon 
worker,5,main]) started.

> Task :beam-sdks-java-maven-archetypes-starter:processTestResources UP-TO-DATE
Build cache key for task 
':beam-sdks-java-maven-archetypes-starter:processTestResources' is 
f74f3200edf284b276c50da93794d928
Caching disabled for task 
':beam-sdks-java-maven-archetypes-starter:processTestResources': Caching has 
not been enabled for the task
Skipping task ':beam-sdks-java-maven-archetypes-starter:processTestResources' 
as it is up-to-date.
:beam-sdks-java-maven-archetypes-starter:processTestResources (Thread[Daemon 
worker,5,main]) completed. Took 0.002 secs.
:beam-sdks-java-maven-archetypes-starter:testClasses (Thread[Daemon 
worker,5,main]) started.

> Task :beam-sdks-java-maven-archetypes-starter:testClasses UP-TO-DATE
Skipping task ':beam-sdks-java-maven-archetypes-starter:testClasses' as it has 
no actions.
:beam-sdks-java-maven-archetypes-starter:testClasses (Thread[Daemon 
worker,5,main]) completed. Took 0.0 secs.
:beam-sdks-java-maven-archetypes-starter:shadowTestJar (Thread[Daemon 
worker,5,main]) started.

> Task :beam-sdks-java-maven-archetypes-starter:shadowTestJar
Build cache key for task 
':beam-sdks-java-maven-archetypes-starter:shadowTestJar' is 
8410a2e3a1dbdd4b9419ea22776f2bc6
Caching disabled for task 
':beam-sdks-java-maven-archetypes-starter:shadowTestJar': Caching has not been 
enabled for the task
Task ':beam-sdks-java-maven-archetypes-starter:shadowTestJar' is not up-to-date 
because:
  No history is available.
***
GRADLE SHADOW STATS

Total Jars: 1 (includes project)
Total Time: 0.0s [0ms]
Average Time/Jar: 0.0s [0.0ms]
***
:beam-sdks-java-maven-archetypes-starter:shadowTestJar (Thread[Daemon 
worker,5,main]) completed. Took 0.006 secs.
:beam-sdks-java-maven-archetypes-starter:sourcesJar (Thread[Daemon 
worker,5,main]) started.

> Task :beam-sdks-java-maven-archetypes-starter:sourcesJar
file or directory 
'
 not found
Build cache key for task ':beam-sdks-java-maven-archetypes-starter:sourcesJar' 
is a106f15937cacfee668e25636b705e03
Caching disabled for task 
':beam-sdks-java-maven-archetypes-starter:sourcesJar': Caching has not been 
enabled for the task
Task ':beam-sdks-java-maven-archetypes-starter:sourcesJar' is not up-to-date 
because:
  No history is available.
file or directory 
'
 not found
:beam-sdks-java-maven-archetypes-starter:sourcesJar (Thread[Daemon 
worker,5,main]) completed. Took 0.003 secs.
:beam-sdks-java-maven-archetypes-starter:testSourcesJar (Thread[Daemon 
worker,5,main]) started.

> Task :beam-sdks-java-maven-archetypes-starter:testSourcesJar
file or directory 
'
 not found
Build cache key for task 
':beam-sdks-java-maven-archetypes-starter:testSourcesJar' is 
58715d6b8e221cace68f230ccfd69fd4
Caching disabled for task 
':beam-sdks-java-maven-archetypes-starter:testSourcesJar': Caching has not been 
enabled for the task
Task ':beam-sdks-java-maven-archetypes-starter:testSourcesJar' is not 
up-to-date because:
  No history is available.
file or directory 
'
 not found
:beam-sdks-java-maven-archetypes-starter:testSourcesJar (Thread[Daemon 
worker,5,main]) completed. Took 0.003 secs.
:beam-sdks-java-nexmark:generatePomFileForMavenJavaPublication (Thread[Daemon 
worker,5,main]) started.

> Task :beam-sdks-java-nexmark:generatePomFileForMavenJavaPublication
Build cache key for task 
':beam-sdks-java-nexmark:generatePomFileForMavenJavaPublication' is 
e88836a5bca732f78522d2de5a70d4e6
Caching disabled for task 
':beam-sdks-java-nexmark:generatePomFileForMavenJavaPublication': Caching has 
not been enabled for the task
Task ':beam-sdks-java-nexmark:generatePomFileForMavenJavaPublication' is not 
up-to-date because:
  Task.upToDateWhen is false.
:beam-sdks-java-nexmark:generatePomFileForMavenJavaPublication (Thread[Daemon 
worker,5,main]) completed. Took 0.011 secs.
:beam-sdks-java-nexmark:compileJava (Thread[Daemon worker,5,main]) started.

> Task :beam-sdks-java-nexmark:compileJava
Build cache key for task ':beam-sdks-java-nexmark:compileJava' is