Re: Let's start getting rid of BoundedSource

2018-07-17 Thread Robert Bradshaw
On Sun, Jul 15, 2018 at 2:20 PM Eugene Kirpichov 
wrote:

> Hey beamers,
>
> I've always wondered whether the BoundedSource implementations in the Beam
> SDK are worth their complexity, or whether they could instead be converted
> to the much easier-to-code ParDo style, which is also more modular and
> allows you to very easily implement readAll().
>
> There's a handful: file-based sources, BigQuery, Bigtable, HBase,
> Elasticsearch, MongoDB, Solr and a couple more.
>
> Curiously enough, BoundedSource vs. ParDo matters *only* on Dataflow,
> because AFAICT Dataflow is the only runner that cares about the things that
> BoundedSource can do and ParDo can't:
> - size estimation (used to choose an initial number of workers) [ok, Flink
> calls the function to return statistics, but doesn't seem to do anything
> else with it]
> - splitting into bundles of given size (Dataflow chooses the number of
> bundles to create based on a simple formula that's not entirely unlike
> K*sqrt(size))
> - liquid sharding (splitAtFraction())
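As a rough illustration of what liquid sharding asks of a source, a splitAtFraction-style operation over a simple offset range can be sketched as follows. This is a hypothetical stdlib-only model of the contract, not Beam's actual RangeTracker API:

```python
def split_at_fraction(start, stop, consumed, fraction):
    """Attempt to split the range [start, stop) at `fraction` of its extent.

    Returns (new_stop_for_primary, residual_range), or None when the
    proposed split point has already been consumed (too late to split)
    or would be degenerate. Hypothetical sketch, not Beam's RangeTracker.
    """
    split = start + int((stop - start) * fraction)
    if split <= consumed or split >= stop:
        return None
    # The primary keeps [start, split); the residual [split, stop)
    # can be handed to another worker.
    return split, (split, stop)
```

The runner calls this while the read is in progress; if the worker has already read past the proposed point, the split is simply rejected.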
>
> If Dataflow didn't exist, there'd be no reason at all to use
> BoundedSource. So the question "which ones can be converted to ParDo" is
> really "which ones are used on Dataflow in ways that make these functions
> matter". Previously, my conservative assumption was that the answer is "all
> of them", but turns out this is not so.
>
> Liquid sharding always matters; if the source is liquid-shardable, for now
> we have to keep it a source (until SDF gains liquid sharding - which should
> happen in a quarter or two I think).
>
> Choosing number of bundles to split into is easily done in SDK code, see
> https://github.com/apache/beam/pull/5886 for example; DatastoreIO does
> something similar.
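A utility for choosing a split count from a size estimate, as done in the PR above, might look roughly like this. The constants are purely illustrative; the thread only describes Dataflow's real choice as a formula "not entirely unlike K*sqrt(size)":

```python
import math

def desired_num_splits(estimated_size_bytes,
                       bytes_per_split=64 << 20,  # hypothetical 64 MiB target
                       max_splits=1000):          # hypothetical cap
    """Choose a bundle count from an estimated input size.

    Illustrative sketch only -- not Dataflow's actual formula.
    """
    if estimated_size_bytes <= 0:
        return 1
    by_size = math.ceil(estimated_size_bytes / bytes_per_split)
    return max(1, min(by_size, max_splits))
```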
>
> The remaining thing to analyze is when initial scaling matters. As a
> member of the Dataflow team, I analyzed statistics of production Dataflow
> jobs in the past month. I cannot share my queries or the data, because
> they are proprietary to Google - so I am sharing just the general
> methodology and conclusions, because they matter to the Beam community. I
> looked at a few criteria, such as:
> - The job should be not too short and not too long: if it's too short then
> scaling couldn't have kicked in much at all; if it's too long then dynamic
> autoscaling would have been sufficient.
> - The job should use, at peak, at least a handful of workers (otherwise it
> means the job wasn't used in settings where much scaling happened)
> After a couple more rounds of narrowing-down, with some hand-checking that
> the results and criteria so far make sense, I ended up with nothing - no
> jobs that would have suffered a serious performance regression if their
> BoundedSource had not supported initial size estimation [of course, except
> for the liquid-shardable ones].
>
> Based on this, I would like to propose to convert the following
> BoundedSource-based IOs to ParDo-based, and while we're at it, probably
> also add readAll() versions (not necessarily in exactly the same PR):
> - ElasticsearchIO
> - SolrIO
> - MongoDbIO
> - MongoDbGridFSIO
> - CassandraIO
> - HCatalogIO
> - HadoopInputFormatIO
> - UnboundedToBoundedSourceAdapter (already have a PR in progress for this
> one)
> These would not translate to a single ParDo - rather, they'd translate to
> ParDo(estimate size and split according to the formula), Reshuffle,
> ParDo(read data) - or possibly to a bounded SDF doing roughly the same
> (luckily after https://github.com/apache/beam/pull/5940 all runners at
> master will support bounded SDF so this is safe compatibility-wise). Pretty
> much like DatastoreIO does.
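The ParDo(split) → Reshuffle → ParDo(read) shape described above can be mimicked with plain functions. This is a stdlib-only sketch of the pattern over a hypothetical key range, not actual Beam transforms:

```python
import random

def split_into_ranges(start, stop, num_splits):
    """ParDo #1: split the keyspace into roughly equal ranges up front."""
    step = max(1, (stop - start) // num_splits)
    return [(lo, min(lo + step, stop)) for lo in range(start, stop, step)]

def reshuffle(ranges):
    """Stand-in for Reshuffle: breaks fusion so ranges can be
    redistributed across workers before the expensive read."""
    shuffled = list(ranges)
    random.shuffle(shuffled)
    return shuffled

def read_range(key_range):
    """ParDo #2: do the actual read for one range (here, each offset
    stands in for a record)."""
    lo, hi = key_range
    return list(range(lo, hi))

# Wire the three stages together, as the proposed IO expansion would.
records = [rec
           for key_range in reshuffle(split_into_ranges(0, 100, 4))
           for rec in read_range(key_range)]
```

The Reshuffle in the middle is what lets the runner rebalance the (cheap) split descriptions before the (expensive) reads happen.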
>

I think there is some value in having a single API that one implements,
rather than having every IO manually implement the ParDo + Reshuffle +
ParDo pattern. Until these convert over to SDFs, I'm not sure there's a net
win to manually converting them to ParDos rather than automatically
executing BoundedSources as ParDos. (There's also the sizing hooks that
have been pointed out, and even for those that don't yet support liquid
sharding, it'd be nice if, when someone wants to add liquid sharding to an
IO, they don't have to go in and massively restructure things.)

But perhaps I'm underestimating the complexity of using the BoundedSource
API vs. manually writing a sequence of ParDos. Or is it that existing
BoundedSources don't lend themselves well to readAll? (Would a readAll that
extends PTransform<PCollection<BoundedSource<T>>, PCollection<T>> be that
hard? Specific sources could implement a nicer ReadAll that is built on
this.)
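The readAll shape under discussion — turning a collection of read descriptions into records — can be sketched with plain functions. This is a hypothetical stdlib-only model; the real version would be a PTransform over a PCollection of sources:

```python
def read_all(read_specs, read_one):
    """Flatten the results of reading each spec, mirroring the shape of a
    hypothetical ReadAll transform (spec -> records, then flatten)."""
    out = []
    for spec in read_specs:
        out.extend(read_one(spec))
    return out

# Each "spec" here is a (start, stop) range; read_one expands it to records.
specs = [(0, 3), (3, 5)]
records = read_all(specs, lambda s: list(range(s[0], s[1])))
```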

> I would like to also propose to change the IO authoring guide
> https://beam.apache.org/documentation/io/authoring-overview/#when-to-implement-using-the-source-api
>  to
> basically say "Never implement a new BoundedSource unless you can support
> liquid sharding". And add a utility for computing a desired number of
> splits.
>
> There might be some more details here to iron out, but I wanted to check
> with the community that this overall makes sense.
>
> 

Re: [PROPOSAL] Prepare Beam 2.6.0 release

2018-07-17 Thread Jean-Baptiste Onofré
Hi Pablo,

I'm investigating this issue, but it's a somewhat long process.

So, I propose you start the release process by cutting the branch,
and then I will create a cherry-pick PR for this one.

Regards
JB

On 17/07/2018 20:19, Pablo Estrada wrote:
> Checking once more:
> What does the community think we should do
> about https://issues.apache.org/jira/browse/BEAM-4750 ? Should I bump it
> to 2.7.0?
> Best
> -P.
> 
> On Fri, Jul 13, 2018 at 5:15 PM Ahmet Altay  > wrote:
> 
> Update:  https://issues.apache.org/jira/browse/BEAM-4784 is not a
> release blocker, details in the JIRA issue.
> 
> On Fri, Jul 13, 2018 at 11:12 AM, Thomas Weise  > wrote:
> 
> Can one of our Python experts please take a look
> at https://issues.apache.org/jira/browse/BEAM-4784 and advise if
> this should be addressed for the release?
> 
> Thanks,
> Thomas
> 
> 
> On Fri, Jul 13, 2018 at 11:02 AM Ahmet Altay  > wrote:
> 
> 
> 
> On Fri, Jul 13, 2018 at 10:48 AM, Pablo Estrada
> mailto:pabl...@google.com>> wrote:
> 
> Hi all,
> I've triaged most of the issues marked for the 2.6.0 release. I've
> identified two that need a decision / attention:
> 
> - https://issues.apache.org/jira/browse/BEAM-4417 -
> Bigquery IO Numeric Datatype Support. Cham is not
> available to fix this at the moment, but this is a
> critical issue. Is anyone able to tackle this / should
> we bump this to next release?
> 
> 
> I bumped this to the next release. I think Cham will be the
> best person to address it when he is back. And with the
> regular release cadence, it would not be delayed by much.
>  
> 
> 
> - https://issues.apache.org/jira/browse/BEAM-4750 -
> Performance degradation due to some safeguards in
> beam-sdks-java-core. JB, are you looking to fix this?
> Should we bump? I had the impression that it was an easy
> fix, but I'm not sure.
> 
> If you're aware of any other issue that needs to be
> included as a release blocker, please report it to me.
> Best
> -P.
> 
> On Thu, Jul 12, 2018 at 2:15 AM Etienne Chauchot
> mailto:echauc...@apache.org>> wrote:
> 
> +1,
> 
> Thanks for volunteering, Pablo, and thanks also for catching
> the tickets that I forgot to close :)
> 
> Etienne
> 
> Le mercredi 11 juillet 2018 à 12:55 -0700, Alan
> Myrvold a écrit :
>> +1 Thanks for volunteering, Pablo
>>
>> On Wed, Jul 11, 2018 at 11:49 AM Jason Kuster
>> > > wrote:
>>> +1 sounds great
>>>
>>> On Wed, Jul 11, 2018 at 11:06 AM Thomas Weise
>>> mailto:t...@apache.org>> wrote:
 +1

 Thanks for volunteering, Pablo!

 On Mon, Jul 9, 2018 at 9:56 PM Jean-Baptiste
 Onofré >>> > wrote:
> +1
>
> I planned to send the proposal as well ;)
>
> Regards
> JB
>
> On 09/07/2018 23:16, Pablo Estrada wrote:
> > Hello everyone!
> >
> > As per the previously agreed-upon schedule
> for Beam releases, the
> > process for the 2.6.0 Beam release should
> start on July 17th.
> >
> > I volunteer to perform this release. 
> >
> > Here is the schedule that I have in mind:
> >
> > - We start triaging JIRA issues this week.
> > - I will cut a release branch on July 17.
> > - After July 17, any blockers will need to be
> cherry-picked into the
> > release branch.
> > - As soon as tests look good, and blockers
> have been addressed, I will
> > perform the other release tasks.
> >
> > Does that seem reasonable to the community?
> >
> > Best
> > -P.
>  

Re: Vendoring / Shading Protobuf and gRPC

2018-07-17 Thread Thomas Weise
Thanks, the classpath order matters indeed.

Still not able to run RemoteExecutionTest, but I was able to get the Flink
portable test to work by adding the following to the *top* of the
dependency list of *beam-runners-flink_2.11_test*

vendor/sdks-java-extensions-protobuf/build/libs/beam-vendor-sdks-java-extensions-protobuf-2.6.0-SNAPSHOT.jar
model/fn-execution/build/libs/beam-model-fn-execution-2.6.0-SNAPSHOT.jar


On Tue, Jul 17, 2018 at 6:00 PM Ankur Goenka  wrote:

> Yes, I am able to run it.
>
> For tests, you also need to add dependencies to
> ":beam-runners-java-fn-execution/beam-runners-java-fn-execution_*test*"
> module.
>
> Also, I only added
> :beam-model-job-management-2.6.0-SNAPSHOT.jar
> :beam-model-fn-execution-2.6.0-SNAPSHOT.jar
> to the dependencies manually, so I'm not sure whether you want to add
> io.grpc:grpc-core:1.12.0 and com.google.protobuf:protobuf-java:3.5.1 to
> the dependencies.
>
> Note, you need to move them up in the dependencies list.
>
>
> On Tue, Jul 17, 2018 at 5:54 PM Thomas Weise  wrote:
>
>> Are you able to
>> run org.apache.beam.runners.fnexecution.control.RemoteExecutionTest from
>> within Intellij ?
>>
>> I can get the compile errors to disappear by adding
>> beam-model-job-management-2.6.0-SNAPSHOT.jar, io.grpc:grpc-core:1.12.0
>> and com.google.protobuf:protobuf-java:3.5.1
>>
>> Running the test still fails since other dependencies are missing.
>>
>>
>> On Tue, Jul 17, 2018 at 4:02 PM Ankur Goenka  wrote:
>>
>>> For reference:
>>> I was able to make IntelliJ work with master by doing the following steps:
>>>
>>>1. Remove module :beam:vendor-sdks-java-extensions-protobuf from
>>>intellij.
>>>2. Adding
>>>
>>> :beam-model-fn-execution/build/libs/beam-model-fn-execution-2.6.0-SNAPSHOT.jar
>>>and 
>>> :beam-model-job-management/build/libs/beam-model-job-management-2.6.0-SNAPSHOT.jar
>>>to the appropriate modules at the top of the dependency list.
>>>
>>>
>>> On Tue, Jul 17, 2018 at 2:29 PM Thomas Weise  wrote:
>>>
 Adding the external jar in Intellij (2018.1) currently fails due to a
 duplicate source directory (sdks/java/extensions/protobuf/src/main/java).

 The build as such also fails, with:  error: warnings found and -Werror
 specified

 Ismaël found removing
 https://github.com/apache/beam/blob/master/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L538
 as a workaround.


 On Thu, Jul 12, 2018 at 1:55 PM Ismaël Mejía  wrote:

> Seems reasonable, but why exactly might we need the model (or protobuf
> related things) in the future in the SDK? Wasn’t it supposed to be
> translated into the Pipeline proto representation via the runners (in
> which case the dep would reside on the runner side)?
> On Thu, Jul 12, 2018 at 2:50 AM Lukasz Cwik  wrote:
> >
> > Got a fix[1] for Andrew's issue, which turned out to be a release
> blocker since it broke performing the release. Also fixed several minor
> things like javadoc that were wrong with the release. Solving it allowed 
> me
> to do the publishing in parallel and cut the release time from 20+ mins to
> 8 mins on my machine.
> >
> > 1: https://github.com/apache/beam/pull/5936
> >
> > On Wed, Jul 11, 2018 at 3:51 PM Andrew Pilloud 
> wrote:
> >>
> >> We discussed this in person, sounds like my issue is known and will
> be fixed shortly. I'm running builds with '-Ppublishing' because I need to
> generate release artifacts for bundling the Beam SQL shell with the Google
> Cloud SDK. Hope to eventually just use the Beam release, but we are
> currently cutting a release off master every week to quickly iterate on 
> bug
> fixes.
> >>
> >> Andrew
> >>
> >> On Wed, Jul 11, 2018 at 1:39 PM Lukasz Cwik 
> wrote:
> >>>
> >>> Andrew, to my knowledge it seems as though you're running into
> BEAM-4744, is there a reason you need to specify -Ppublishing?
> >>>
> >>> No particular reason to use ByteString within ByteKey and
> TextSource. Note that we currently do shade away protobuf in 
> sdks/java/core
> so we could either migrate to using a vendored version or re-implement the
> functionality to not use ByteString. Note that sdks/java/core can now
> depend on the model/* classes and perform the Pipeline -> Proto
> translation as this will be needed to support portability efforts so I
> would prefer just migrating to use the vendored versions of the code. 
> Filed
> BEAM-4766.
> >>>
> >>> As for the IO module, I was referring to the upstream
> bigtable/bigquery/... libraries vended by Google. If they trimmed their 
> API
> surface to not expose gRPC or protobuf, then we wouldn't have to worry
> about having the shading logic within sdks/java/io/google-cloud-platform. 
> I
> know that this will be impossible for some connectors without backwards

Re: Vendoring / Shading Protobuf and gRPC

2018-07-17 Thread Ankur Goenka
Yes, I am able to run it.

For tests, you also need to add dependencies to
":beam-runners-java-fn-execution/beam-runners-java-fn-execution_*test*"
module.

Also, I only added
:beam-model-job-management-2.6.0-SNAPSHOT.jar
:beam-model-fn-execution-2.6.0-SNAPSHOT.jar
to the dependencies manually, so I'm not sure whether you want to add
io.grpc:grpc-core:1.12.0 and com.google.protobuf:protobuf-java:3.5.1 to the
dependencies.

Note, you need to move them up in the dependencies list.


On Tue, Jul 17, 2018 at 5:54 PM Thomas Weise  wrote:

> Are you able to
> run org.apache.beam.runners.fnexecution.control.RemoteExecutionTest from
> within Intellij ?
>
> I can get the compile errors to disappear by adding
> beam-model-job-management-2.6.0-SNAPSHOT.jar, io.grpc:grpc-core:1.12.0
> and com.google.protobuf:protobuf-java:3.5.1
>
> Running the test still fails since other dependencies are missing.
>
>
> On Tue, Jul 17, 2018 at 4:02 PM Ankur Goenka  wrote:
>
>> For reference:
>> I was able to make IntelliJ work with master by doing the following steps:
>>
>>1. Remove module :beam:vendor-sdks-java-extensions-protobuf from
>>intellij.
>>2. Adding
>>
>> :beam-model-fn-execution/build/libs/beam-model-fn-execution-2.6.0-SNAPSHOT.jar
>>and 
>> :beam-model-job-management/build/libs/beam-model-job-management-2.6.0-SNAPSHOT.jar
>>to the appropriate modules at the top of the dependency list.
>>
>>
>> On Tue, Jul 17, 2018 at 2:29 PM Thomas Weise  wrote:
>>
>>> Adding the external jar in Intellij (2018.1) currently fails due to a
>>> duplicate source directory (sdks/java/extensions/protobuf/src/main/java).
>>>
>>> The build as such also fails, with:  error: warnings found and -Werror
>>> specified
>>>
>>> Ismaël found removing
>>> https://github.com/apache/beam/blob/master/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L538
>>> as a workaround.
>>>
>>>
>>> On Thu, Jul 12, 2018 at 1:55 PM Ismaël Mejía  wrote:
>>>
 Seems reasonable, but why exactly might we need the model (or protobuf
 related things) in the future in the SDK? Wasn’t it supposed to be
 translated into the Pipeline proto representation via the runners (in
 which case the dep would reside on the runner side)?
 On Thu, Jul 12, 2018 at 2:50 AM Lukasz Cwik  wrote:
 >
 > Got a fix[1] for Andrew's issue, which turned out to be a release
 blocker since it broke performing the release. Also fixed several minor
 things like javadoc that were wrong with the release. Solving it allowed me
 to do the publishing in parallel and cut the release time from 20+ mins to
 8 mins on my machine.
 >
 > 1: https://github.com/apache/beam/pull/5936
 >
 > On Wed, Jul 11, 2018 at 3:51 PM Andrew Pilloud 
 wrote:
 >>
 >> We discussed this in person, sounds like my issue is known and will
 be fixed shortly. I'm running builds with '-Ppublishing' because I need to
 generate release artifacts for bundling the Beam SQL shell with the Google
 Cloud SDK. Hope to eventually just use the Beam release, but we are
 currently cutting a release off master every week to quickly iterate on bug
 fixes.
 >>
 >> Andrew
 >>
 >> On Wed, Jul 11, 2018 at 1:39 PM Lukasz Cwik 
 wrote:
 >>>
 >>> Andrew, to my knowledge it seems as though you're running into
 BEAM-4744, is there a reason you need to specify -Ppublishing?
 >>>
 >>> No particular reason to use ByteString within ByteKey and
 TextSource. Note that we currently do shade away protobuf in sdks/java/core
 so we could either migrate to using a vendored version or re-implement the
 functionality to not use ByteString. Note that sdks/java/core can now
 depend on the model/* classes and perform the Pipeline -> Proto
 translation as this will be needed to support portability efforts so I
 would prefer just migrating to use the vendored versions of the code. Filed
 BEAM-4766.
 >>>
 >>> As for the IO module, I was referring to the upstream
 bigtable/bigquery/... libraries vended by Google. If they trimmed their API
 surface to not expose gRPC or protobuf, then we wouldn't have to worry
 about having the shading logic within sdks/java/io/google-cloud-platform. I
 know that this will be impossible for some connectors without backwards
 incompatible changes since they exposed protobuf on their API surface. I
 know that Chamikara was looking to shade this away in the
 sdks/java/io/google-cloud-platform but only had limited success in the 
 past.
 >>>
 >>> On Wed, Jul 11, 2018 at 1:14 PM Ismaël Mejía 
 wrote:
 
  This is great news in particular for runners (Spark) where the
 leaking of some grpc subdependencies caused stability issues and required
 extra shading. Great !
 
  About the other modules
 
  > Note, these are the following modules that still depend on
 protobuf 

Re: Vendoring / Shading Protobuf and gRPC

2018-07-17 Thread Thomas Weise
Are you able to
run org.apache.beam.runners.fnexecution.control.RemoteExecutionTest from
within Intellij ?

I can get the compile errors to disappear by adding
beam-model-job-management-2.6.0-SNAPSHOT.jar, io.grpc:grpc-core:1.12.0
and com.google.protobuf:protobuf-java:3.5.1

Running the test still fails since other dependencies are missing.


On Tue, Jul 17, 2018 at 4:02 PM Ankur Goenka  wrote:

> For reference:
> I was able to make IntelliJ work with master by doing the following steps:
>
>1. Remove module :beam:vendor-sdks-java-extensions-protobuf from
>intellij.
>2. Adding
>
> :beam-model-fn-execution/build/libs/beam-model-fn-execution-2.6.0-SNAPSHOT.jar
>and 
> :beam-model-job-management/build/libs/beam-model-job-management-2.6.0-SNAPSHOT.jar
>to the appropriate modules at the top of the dependency list.
>
>
> On Tue, Jul 17, 2018 at 2:29 PM Thomas Weise  wrote:
>
>> Adding the external jar in Intellij (2018.1) currently fails due to a
>> duplicate source directory (sdks/java/extensions/protobuf/src/main/java).
>>
>> The build as such also fails, with:  error: warnings found and -Werror
>> specified
>>
>> Ismaël found removing
>> https://github.com/apache/beam/blob/master/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L538
>> as a workaround.
>>
>>
>> On Thu, Jul 12, 2018 at 1:55 PM Ismaël Mejía  wrote:
>>
>>> Seems reasonable, but why exactly might we need the model (or protobuf
>>> related things) in the future in the SDK? Wasn’t it supposed to be
>>> translated into the Pipeline proto representation via the runners (in
>>> which case the dep would reside on the runner side)?
>>> On Thu, Jul 12, 2018 at 2:50 AM Lukasz Cwik  wrote:
>>> >
>>> > Got a fix[1] for Andrew's issue, which turned out to be a release
>>> blocker since it broke performing the release. Also fixed several minor
>>> things like javadoc that were wrong with the release. Solving it allowed me
>>> to do the publishing in parallel and cut the release time from 20+ mins to
>>> 8 mins on my machine.
>>> >
>>> > 1: https://github.com/apache/beam/pull/5936
>>> >
>>> > On Wed, Jul 11, 2018 at 3:51 PM Andrew Pilloud 
>>> wrote:
>>> >>
>>> >> We discussed this in person, sounds like my issue is known and will
>>> be fixed shortly. I'm running builds with '-Ppublishing' because I need to
>>> generate release artifacts for bundling the Beam SQL shell with the Google
>>> Cloud SDK. Hope to eventually just use the Beam release, but we are
>>> currently cutting a release off master every week to quickly iterate on bug
>>> fixes.
>>> >>
>>> >> Andrew
>>> >>
>>> >> On Wed, Jul 11, 2018 at 1:39 PM Lukasz Cwik  wrote:
>>> >>>
>>> >>> Andrew, to my knowledge it seems as though you're running into
>>> BEAM-4744, is there a reason you need to specify -Ppublishing?
>>> >>>
>>> >>> No particular reason to use ByteString within ByteKey and
>>> TextSource. Note that we currently do shade away protobuf in sdks/java/core
>>> so we could either migrate to using a vendored version or re-implement the
>>> functionality to not use ByteString. Note that sdks/java/core can now
>>> depend on the model/* classes and perform the Pipeline -> Proto
>>> translation as this will be needed to support portability efforts so I
>>> would prefer just migrating to use the vendored versions of the code. Filed
>>> BEAM-4766.
>>> >>>
>>> >>> As for the IO module, I was referring to the upstream
>>> bigtable/bigquery/... libraries vended by Google. If they trimmed their API
>>> surface to not expose gRPC or protobuf, then we wouldn't have to worry
>>> about having the shading logic within sdks/java/io/google-cloud-platform. I
>>> know that this will be impossible for some connectors without backwards
>>> incompatible changes since they exposed protobuf on their API surface. I
>>> know that Chamikara was looking to shade this away in the
>>> sdks/java/io/google-cloud-platform but only had limited success in the past.
>>> >>>
>>> >>> On Wed, Jul 11, 2018 at 1:14 PM Ismaël Mejía 
>>> wrote:
>>> 
>>>  This is great news in particular for runners (Spark) where the
>>> leaking of some grpc subdependencies caused stability issues and required
>>> extra shading. Great !
>>> 
>>>  About the other modules
>>> 
>>>  > Note, these are the following modules that still depend on
>>> protobuf that are shaded away and could move to use a vendored variant of
>>> protobuf:
>>>  > * sdks/java/core
>>>  > * sdks/java/extensions/sql
>>> 
>>>  For sdks/java/core the dependency on protobuf seems to be minor;
>>> from a quick look it seems that it is only used to import ByteString in two
>>> classes: ByteKey and TextSource, so hopefully we can rewrite both and get
>>> rid of the dependency altogether (making core smaller, which is always a
>>> win).
>>>  Can we file a JIRA for this, or am I missing other reasons to depend on
>>> protobuf in core?
>>> 
>>>  For sdks/java/extensions/sql I don’t know if I am missing

Re: Vendoring / Shading Protobuf and gRPC

2018-07-17 Thread Ankur Goenka
For reference:
I was able to make IntelliJ work with master by doing the following steps:

   1. Remove module :beam:vendor-sdks-java-extensions-protobuf from
   intellij.
   2. Adding
   
:beam-model-fn-execution/build/libs/beam-model-fn-execution-2.6.0-SNAPSHOT.jar
   and 
:beam-model-job-management/build/libs/beam-model-job-management-2.6.0-SNAPSHOT.jar
   to the appropriate modules at the top of the dependency list.


On Tue, Jul 17, 2018 at 2:29 PM Thomas Weise  wrote:

> Adding the external jar in Intellij (2018.1) currently fails due to a
> duplicate source directory (sdks/java/extensions/protobuf/src/main/java).
>
> The build as such also fails, with:  error: warnings found and -Werror
> specified
>
> Ismaël found removing
> https://github.com/apache/beam/blob/master/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L538
> as a workaround.
>
>
> On Thu, Jul 12, 2018 at 1:55 PM Ismaël Mejía  wrote:
>
>> Seems reasonable, but why exactly might we need the model (or protobuf
>> related things) in the future in the SDK? Wasn’t it supposed to be
>> translated into the Pipeline proto representation via the runners (in
>> which case the dep would reside on the runner side)?
>> On Thu, Jul 12, 2018 at 2:50 AM Lukasz Cwik  wrote:
>> >
>> > Got a fix[1] for Andrew's issue, which turned out to be a release blocker
>> since it broke performing the release. Also fixed several minor things like
>> javadoc that were wrong with the release. Solving it allowed me to do the
>> publishing in parallel and cut the release time from 20+ mins to 8 mins on
>> my machine.
>> >
>> > 1: https://github.com/apache/beam/pull/5936
>> >
>> > On Wed, Jul 11, 2018 at 3:51 PM Andrew Pilloud 
>> wrote:
>> >>
>> >> We discussed this in person, sounds like my issue is known and will be
>> fixed shortly. I'm running builds with '-Ppublishing' because I need to
>> generate release artifacts for bundling the Beam SQL shell with the Google
>> Cloud SDK. Hope to eventually just use the Beam release, but we are
>> currently cutting a release off master every week to quickly iterate on bug
>> fixes.
>> >>
>> >> Andrew
>> >>
>> >> On Wed, Jul 11, 2018 at 1:39 PM Lukasz Cwik  wrote:
>> >>>
>> >>> Andrew, to my knowledge it seems as though you're running into
>> BEAM-4744, is there a reason you need to specify -Ppublishing?
>> >>>
>> >>> No particular reason to use ByteString within ByteKey and
>> TextSource. Note that we currently do shade away protobuf in sdks/java/core
>> so we could either migrate to using a vendored version or re-implement the
>> functionality to not use ByteString. Note that sdks/java/core can now
>> depend on the model/* classes and perform the Pipeline -> Proto
>> translation as this will be needed to support portability efforts so I
>> would prefer just migrating to use the vendored versions of the code. Filed
>> BEAM-4766.
>> >>>
>> >>> As for the IO module, I was referring to the upstream
>> bigtable/bigquery/... libraries vended by Google. If they trimmed their API
>> surface to not expose gRPC or protobuf, then we wouldn't have to worry
>> about having the shading logic within sdks/java/io/google-cloud-platform. I
>> know that this will be impossible for some connectors without backwards
>> incompatible changes since they exposed protobuf on their API surface. I
>> know that Chamikara was looking to shade this away in the
>> sdks/java/io/google-cloud-platform but only had limited success in the past.
>> >>>
>> >>> On Wed, Jul 11, 2018 at 1:14 PM Ismaël Mejía 
>> wrote:
>> 
>>  This is great news in particular for runners (Spark) where the
>> leaking of some grpc subdependencies caused stability issues and required
>> extra shading. Great !
>> 
>>  About the other modules
>> 
>>  > Note, these are the following modules that still depend on
>> protobuf that are shaded away and could move to use a vendored variant of
>> protobuf:
>>  > * sdks/java/core
>>  > * sdks/java/extensions/sql
>> 
>>  For sdks/java/core the dependency on protobuf seems to be minor;
>> from a quick look it seems that it is only used to import ByteString in two
>> classes: ByteKey and TextSource, so hopefully we can rewrite both and get
>> rid of the dependency altogether (making core smaller, which is always a
>> win).
>>  Can we file a JIRA for this, or am I missing other reasons to depend on
>> protobuf in core?
>> 
>>  For sdks/java/extensions/sql I don’t know if I am missing something,
>> but I don’t see any code use of protobuf, and I doubt that Calcite uses
>> protobuf, so maybe it is there just because it was leaking from somewhere
>> else in Beam; we should check this first.
>> 
>>  > These modules expose protobuf because it is part of the API
>> surface:
>>  > * sdks/java/extensions/protobuf
>>  > * sdks/java/io/google-cloud-platform (I believe that gRPC could be
>> shaded here but preferrably the IO module would do it so we wouldn't have
>> this 

Re: Live coding & reviewing adventures

2018-07-17 Thread Holden Karau
Sure! I’ll respond with a pip freeze when I land.

On Tue, Jul 17, 2018 at 2:28 PM Suneel Marthi  wrote:

> Could you publish the Python transitive deps somewhere that has the
> Beam-Flink runner working?
>
> On Tue, Jul 17, 2018 at 5:26 PM, Holden Karau 
> wrote:
>
>> And I've got an hour to kill @ SFO today so at some of the suggestions
>> from folks I'm going to do a more user focused one trying getting the TFT
>> demo to work with the portable flink runner (hopefully) -
>> https://www.youtube.com/watch?v=wL9mvQeN36E
>>
>> On Fri, Jul 13, 2018 at 11:54 AM, Holden Karau 
>> wrote:
>>
>>> Hi folks! I've been doing some live coding in my other projects and I
>>> figured I'd do some with Apache Beam as well.
>>>
>>> Today @ 3pm pacific I'm going to be doing some impromptu exploration of better
>>> review tooling possibilities (looking at forking spark-pr-dashboard for
>>> other projects like beam and setting up mentionbot to work with ASF infra)
>>> - https://www.youtube.com/watch?v=ff8_jbzC8JI
>>>
>>> Next week (Thursday the 19th at 2pm pacific) I'm going to be working on
>>> trying to get easier dependency management for the Python portable runner
>>> in place - https://www.youtube.com/watch?v=Sv0XhS2pYqA
>>>
>>> If you're interested in seeing more of the development process, I hope you
>>> will join me :)
>>>
>>> P.S.
>>>
>>> You can also follow on twitch which does a better job of notifications
>>> https://www.twitch.tv/holdenkarau
>>>
>>> Also, one of the other things I do is "live reviews" of PRs, but they are
>>> generally opt-in and I don't have enough opt-ins from the Beam community to
>>> do live reviews in Beam. If you work on Beam and would be OK with me doing
>>> a live-streamed review of your PRs, let me know (if you're curious what
>>> they look like, you can see some of them here in Spark land
>>> 
>>> ).
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
> --
Twitter: https://twitter.com/holdenkarau


Re: Live coding & reviewing adventures

2018-07-17 Thread Holden Karau
Yup, I’ve got a fork of that (I skip re-uploading the input data files because
of conference WiFi) and store to GCS so I can see the output.

Didn’t finish everything in today’s stream because the WiFi was a bit
flaky during the first container build and stalled apt, but I’ll do a
follow-up to finish it up.

On Tue, Jul 17, 2018 at 2:34 PM Ankur Goenka  wrote:

> +1
> For reference here is a sample job
> https://github.com/axelmagn/model-analysis/blob/axelmagn-hacks/examples/chicago_taxi/preprocess_flink.sh
>
> Also, one quick heads-up: the output file will be created in the Docker
> container if you use the local file system.
>
>
>
>
> On Tue, Jul 17, 2018 at 2:27 PM Holden Karau  wrote:
>
>> And I've got an hour to kill @ SFO today so at some of the suggestions
>> from folks I'm going to do a more user focused one trying getting the TFT
>> demo to work with the portable flink runner (hopefully) -
>> https://www.youtube.com/watch?v=wL9mvQeN36E
>>
>> On Fri, Jul 13, 2018 at 11:54 AM, Holden Karau 
>> wrote:
>>
>>> Hi folks! I've been doing some live coding in my other projects and I
>>> figured I'd do some with Apache Beam as well.
>>>
>>> Today @ 3pm pacific I'm going to be doing some impromptu exploration of better
>>> review tooling possibilities (looking at forking spark-pr-dashboard for
>>> other projects like beam and setting up mentionbot to work with ASF infra)
>>> - https://www.youtube.com/watch?v=ff8_jbzC8JI
>>>
>>> Next week (Thursday the 19th at 2pm pacific) I'm going to be working on
>>> trying to get easier dependency management for the Python portable runner
>>> in place - https://www.youtube.com/watch?v=Sv0XhS2pYqA
>>>
>>> If you're interested in seeing more of the development process, I hope you
>>> will join me :)
>>>
>>> P.S.
>>>
>>> You can also follow on twitch which does a better job of notifications
>>> https://www.twitch.tv/holdenkarau
>>>
>>> Also one of the other thing I do is "live reviews" of PRs but they are
>>> generally opt-in and I don't have enough opt-ins from the Beam community to
>>> do live reviews in Beam, if you work on Beam and would be OK with me doing
>>> a live streamed review of your PRs let me know (if your curious to what
>>> they look like you can see some of them here in Spark land
>>> 
>>> ).
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
> --
Twitter: https://twitter.com/holdenkarau


Re: No JVM - new Runner?

2018-07-17 Thread Austin Bennett
Hi Henning,

Helped me unearth: https://s.apache.org/beam-job-api

and can dig into:
https://github.com/apache/beam/tree/master/model

Additionally, see embedded ->


On Tue, Jul 17, 2018, 12:02 PM Henning Rohde  wrote:

> There are essentially 2 complementary portability API surfaces that you'd
> need to implement: job management incl. job submission and execution as
> well as some worker deployment plumbing specific to the runner. Note that
> the source of truth is the model protos -- the design docs linked from
> https://beam.apache.org/contribute/portability/ and (even more so) the
> website guides are not always up to date.
>

Sounds about right :-)

Haven't done anything with grpc/proto, so gotta begin digging through that.




>
> Currently, all runners are in Java and share numerous components and
> utilities. A non-JVM runner would have to build all that from scratch --
> although, as you mention, if you're using Go or Python the corresponding
> SDKs likely have many pieces that can be reused. A minor potential hiccup
> is that gRPC/protobuf is not natively supported everywhere, so you may end
> up interoperating with the C versions of the libraries if you pick a
> non-supported language. A separate challenge regardless of the language
> is how directly the Beam model and primitives map to the engine.
>
> All that said, I think it's definitely feasible to do something
> interesting. Are you specifically thinking of a Go Wallaroo runner?
>

That was my initial thought (a Go-based runner, given the likely ease of
some compatibility as well as potential wider reusability), though there are
certainly things to consider (like using Pony, given that's the core of
Wallaroo).  Much of this is beyond what I could implement on my own, so a
lot of it is homework for me; eventually I'd rely on both the runner and
Beam communities for guidance, should this be well received all around and
ever come to fruition.  Really, a lot of homework for me at this point!

Ultimately, it's easiest for me to rely on Beam's Python SDK for my
ML-focused workflows (when writing Beam -- though I'm eagerly awaiting full
Python 3 support, so that's potentially a more impactful immediate area to
contribute), so there's that compatibility to consider too.

Thanks,
>  Henning
>

Thanks!
Austin



>
> On Tue, Jul 17, 2018 at 9:26 AM Austin Bennett <
> whatwouldausti...@gmail.com> wrote:
>
>> Sweet; that led me to: https://beam.apache.org/contribute/runner-guide/#the-runner-api (which I can't believe I missed).
>>
>>
>>
>> On Tue, Jul 17, 2018 at 9:21 AM, Jean-Baptiste Onofré 
>> wrote:
>>
>>> Hi Austin,
>>>
>>> If your runner provide the gRPC portabality layer (allowing any SDK to
>>> "interact" with the runner), it will work no matter how the runner is
>>> implemented (JVM or not).
>>>
>>> However, it means that you will have to mimic the Runner API for the
>>> translation.
>>>
>>> Regards
>>> JB
>>>
>>> On 17/07/2018 18:19, Austin Bennett wrote:
>>> > Hi Beam Devs,
>>> >
>>> > I still don't quite understand:
>>> >
>>> > "Apache Beam provides a portable API layer for building sophisticated
>>> > data-parallel processing pipelines that may be executed across a
>>> > diversity of execution engines, or /runners/."
>>> >
>>> > (from https://beam.apache.org/documentation/runners/capability-matrix/
>>> )
>>> >
>>> > And specifically, close reading
>>> > of: https://beam.apache.org/contribute/portability/
>>> >
>>> > What if I'd like to implement a runner that is non-JVM?  Though would
>>> > leverage the Python and Go SDKs?  Specifically, thinking of:
>>> >  https://www.wallaroolabs.com (I am out in NY meeting with friends
>>> there
>>> > later this week, and wanted to get a sense of, feasibility, work
>>> > involved, etc -- to propose that we add a new Wallaroo runner).
>>> >
>>> > Is there a way to keep java out of the mix completely and still work
>>> > with Beam on a non JVM runner (seems maybe eventually, but what about
>>> > currently/near future)?
>>> >
>>> > Any input, thoughts, ideas, other pages or info to explore -- all
>>> > appreciated; thanks!
>>> > Austin
>>> >
>>> >
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>
>>


Re: Live coding & reviewing adventures

2018-07-17 Thread Ankur Goenka
+1
For reference here is a sample job
https://github.com/axelmagn/model-analysis/blob/axelmagn-hacks/examples/chicago_taxi/preprocess_flink.sh

Also one quick heads-up: the output file will be created in the Docker
container if you use the local file system.




On Tue, Jul 17, 2018 at 2:27 PM Holden Karau  wrote:

> And I've got an hour to kill @ SFO today so at some of the suggestions
> from folks I'm going to do a more user focused one trying getting the TFT
> demo to work with the portable flink runner (hopefully) -
> https://www.youtube.com/watch?v=wL9mvQeN36E
>
> On Fri, Jul 13, 2018 at 11:54 AM, Holden Karau 
> wrote:
>
>> Hi folks! I've been doing some live coding in my other projects and I
>> figured I'd do some with Apache Beam as well.
>>
>> Today @ 3pm pacific I'm going be doing some impromptu exploration better
>> review tooling possibilities (looking at forking spark-pr-dashboard for
>> other projects like beam and setting up mentionbot to work with ASF infra)
>> - https://www.youtube.com/watch?v=ff8_jbzC8JI
>>
>> Next week (Thursday the 19th at 2pm pacific) I'm going to be working on
>> trying to get easier dependency management for the Python portable runner
>> in place - https://www.youtube.com/watch?v=Sv0XhS2pYqA
>>
>> If your interested in seeing more of the development process I hope you
>> will join me :)
>>
>> P.S.
>>
>> You can also follow on twitch which does a better job of notifications
>> https://www.twitch.tv/holdenkarau
>>
>> Also one of the other thing I do is "live reviews" of PRs but they are
>> generally opt-in and I don't have enough opt-ins from the Beam community to
>> do live reviews in Beam, if you work on Beam and would be OK with me doing
>> a live streamed review of your PRs let me know (if your curious to what
>> they look like you can see some of them here in Spark land
>> 
>> ).
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>


Re: Vendoring / Shading Protobuf and gRPC

2018-07-17 Thread Thomas Weise
Adding the external jar in IntelliJ (2018.1) currently fails due to a
duplicate source directory (sdks/java/extensions/protobuf/src/main/java).

The build itself also fails, with: error: warnings found and -Werror
specified

Ismaël found that removing
https://github.com/apache/beam/blob/master/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L538
works as a workaround.


On Thu, Jul 12, 2018 at 1:55 PM Ismaël Mejía  wrote:

> Seems reasonable, but why exactly may we need the model (or protobuf
> related things) in the future in the SDK ? wasn’t it supposed to be
> translated into the Pipeline proto representation via the runners (and
> in this case the dep reside in the runner side) ?
> On Thu, Jul 12, 2018 at 2:50 AM Lukasz Cwik  wrote:
> >
> > Got a fix[1] for Andrews issue which turned out to be a release blocker
> since it broke performing the release. Also fixed several minor things like
> javadoc that were wrong with the release. Solving it allowed me to do the
> publishing in parallel and cut the release time from 20+ mins to 8 mins on
> my machine.
> >
> > 1: https://github.com/apache/beam/pull/5936
> >
> > On Wed, Jul 11, 2018 at 3:51 PM Andrew Pilloud 
> wrote:
> >>
> >> We discussed this in person, sounds like my issue is known and will be
> fixed shortly. I'm running builds with '-Ppublishing' because I need to
> generate release artifacts for bundling the Beam SQL shell with the Google
> Cloud SDK. Hope to eventually just use the Beam release, but we are
> currently cutting a release off master every week to quickly iterate on bug
> fixes.
> >>
> >> Andrew
> >>
> >> On Wed, Jul 11, 2018 at 1:39 PM Lukasz Cwik  wrote:
> >>>
> >>> Andrew, to my knowledge it seems as though your running into
> BEAM-4744, is there a reason you need to specify -Ppublishing?
> >>>
> >>> No particular reason to using ByteString within ByteKey and
> TextSource. Note that we currently do shade away protobuf in sdks/java/core
> so we could either migrate to using a vendored version or re-implement the
> functionality to not use ByteString. Note that sdks/java/core can now
> dependend on the model/* classes and perform the Pipeline -> Proto
> translation as this will be needed to support portability efforts so I
> would prefer just migrating to use the vendored versions of the code. Filed
> BEAM-4766.
> >>>
> >>> As for the IO module, I was referring to the upstream
> bigtable/bigquery/... libraries vended by Google. If they trimmed their API
> surface to not expose gRPC or protobuf, then we wouldn't have to worry
> about having the shading logic within sdks/java/io/google-cloud-platform. I
> know that this will be impossible for some connectors without backwards
> incompatible changes since they exposed protobuf on their API surface. I
> know that Chamikara was looking to shade this away in the
> sdks/java/io/google-cloud-platform but only had limited success in the past.
> >>>
> >>> On Wed, Jul 11, 2018 at 1:14 PM Ismaël Mejía 
> wrote:
> 
>  This is great news in particular for runners (Spark) where the
> leaking of some grpc subdependencies caused stability issues and required
> extra shading. Great !
> 
>  About the other modules
> 
>  > Note, these are the following modules that still depend on protobuf
> that are shaded away and could move to use a vendored variant of protobuf:
>  > * sdks/java/core
>  > * sdks/java/extensions/sql
> 
>  For sdks/java/core the dependency in protobuf seems to be minor, from
> a quick look it seems that it is only used to import ByteString in two
> classes: ByteKey and TextSource so hopefully we can rewrite both and get
> rid of the dependency altogether (making core smaller which is always a
> win).
>  Can we fill a JIRA for this or do I miss other reasons to depend on
> protobuf in core?
> 
>  For sdks/java/extensions/sql I don’t know if I am missing something,
> but I don’t see any code use of protobuf and I doubt that calcite uses
> protobuf so maybe it is there just because it was leaking from somewhere
> else in Beam, we should better check this first.
> 
>  > These modules expose protobuf because it is part of the API surface:
>  > * sdks/java/extensions/protobuf
>  > * sdks/java/io/google-cloud-platform (I believe that gRPC could be
> shaded here but preferrably the IO module would do it so we wouldn't have
> this maintenance burden.)
> 
>  Can you please elaborate on ‘but preferrably the IO module would do
> it so we wouldn't have this maintenance burden’. I remember there was an
> issue when running the examples in the spark runner examples because of
> sdks/java/io/google-cloud-platform leaking netty via gRPC (BEAM-3519) [Note
> that this is hidden at this moment because of pure luck Spark 2.3.x and
> Beam are aligned on netty version but this can change in the future so
> hopefully this can be shaded/controlled].
> 
>  On Wed, Jul 11, 2018 at 8:55 PM Andrew Pilloud 
> wrote:
> 

Re: Live coding & reviewing adventures

2018-07-17 Thread Suneel Marthi
Could you publish the Python transitive deps someplace that has the
Beam-Flink runner working?

On Tue, Jul 17, 2018 at 5:26 PM, Holden Karau  wrote:

> And I've got an hour to kill @ SFO today so at some of the suggestions
> from folks I'm going to do a more user focused one trying getting the TFT
> demo to work with the portable flink runner (hopefully) -
> https://www.youtube.com/watch?v=wL9mvQeN36E
>
> On Fri, Jul 13, 2018 at 11:54 AM, Holden Karau 
> wrote:
>
>> Hi folks! I've been doing some live coding in my other projects and I
>> figured I'd do some with Apache Beam as well.
>>
>> Today @ 3pm pacific I'm going be doing some impromptu exploration better
>> review tooling possibilities (looking at forking spark-pr-dashboard for
>> other projects like beam and setting up mentionbot to work with ASF infra)
>> - https://www.youtube.com/watch?v=ff8_jbzC8JI
>>
>> Next week (Thursday the 19th at 2pm pacific) I'm going to be working on
>> trying to get easier dependency management for the Python portable runner
>> in place - https://www.youtube.com/watch?v=Sv0XhS2pYqA
>>
>> If your interested in seeing more of the development process I hope you
>> will join me :)
>>
>> P.S.
>>
>> You can also follow on twitch which does a better job of notifications
>> https://www.twitch.tv/holdenkarau
>>
>> Also one of the other thing I do is "live reviews" of PRs but they are
>> generally opt-in and I don't have enough opt-ins from the Beam community to
>> do live reviews in Beam, if you work on Beam and would be OK with me doing
>> a live streamed review of your PRs let me know (if your curious to what
>> they look like you can see some of them here in Spark land
>> 
>> ).
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>


Re: Live coding & reviewing adventures

2018-07-17 Thread Holden Karau
And I've got an hour to kill @ SFO today, so per some of the suggestions
from folks I'm going to do a more user-focused one, trying to get the TFT
demo to work with the portable Flink runner (hopefully) -
https://www.youtube.com/watch?v=wL9mvQeN36E

On Fri, Jul 13, 2018 at 11:54 AM, Holden Karau  wrote:

> Hi folks! I've been doing some live coding in my other projects and I
> figured I'd do some with Apache Beam as well.
>
> Today @ 3pm pacific I'm going be doing some impromptu exploration better
> review tooling possibilities (looking at forking spark-pr-dashboard for
> other projects like beam and setting up mentionbot to work with ASF infra)
> - https://www.youtube.com/watch?v=ff8_jbzC8JI
>
> Next week (Thursday the 19th at 2pm pacific) I'm going to be working on
> trying to get easier dependency management for the Python portable runner
> in place - https://www.youtube.com/watch?v=Sv0XhS2pYqA
>
> If your interested in seeing more of the development process I hope you
> will join me :)
>
> P.S.
>
> You can also follow on twitch which does a better job of notifications
> https://www.twitch.tv/holdenkarau
>
> Also one of the other thing I do is "live reviews" of PRs but they are
> generally opt-in and I don't have enough opt-ins from the Beam community to
> do live reviews in Beam, if you work on Beam and would be OK with me doing
> a live streamed review of your PRs let me know (if your curious to what
> they look like you can see some of them here in Spark land
> 
> ).
>
> --
> Twitter: https://twitter.com/holdenkarau
>



-- 
Twitter: https://twitter.com/holdenkarau


Re: [PROPOSAL] Prepare Beam 2.6.0 release

2018-07-17 Thread Lukasz Cwik
+1 for cutting the release branch and merging in any necessary PRs.

On Tue, Jul 17, 2018 at 11:33 AM Ahmet Altay  wrote:

> JB said that he is looking at it (
> https://lists.apache.org/thread.html/1e329318a4dafe27b9ff304d9460d05d8966e5ceebaf4ebfb948e2b8@%3Cdev.beam.apache.org%3E
> )
>
> I propose cutting the release branch as proposed (July 17); and applying
> any fixes related to the issue to the branch.
>
> On Tue, Jul 17, 2018 at 11:19 AM, Pablo Estrada 
> wrote:
>
>> Checking once more:
>> What does the community think we should do about
>> https://issues.apache.org/jira/browse/BEAM-4750 ? Should I bump it to
>> 2.7.0?
>> Best
>> -P.
>>
>> On Fri, Jul 13, 2018 at 5:15 PM Ahmet Altay  wrote:
>>
>>> Update:  https://issues.apache.org/jira/browse/BEAM-4784 is not a
>>> release blocker, details in the JIRA issue.
>>>
>>> On Fri, Jul 13, 2018 at 11:12 AM, Thomas Weise  wrote:
>>>
 Can one of our Python experts please take a look at
 https://issues.apache.org/jira/browse/BEAM-4784 and advise if this
 should be addressed for the release?

 Thanks,
 Thomas


 On Fri, Jul 13, 2018 at 11:02 AM Ahmet Altay  wrote:

>
>
> On Fri, Jul 13, 2018 at 10:48 AM, Pablo Estrada 
> wrote:
>
>> Hi all,
>> I've triaged most issues marked for 2.6.0 release. I've localized two
>> that need a decision / attention:
>>
>> - https://issues.apache.org/jira/browse/BEAM-4417 - Bigquery IO
>> Numeric Datatype Support. Cham is not available to fix this at the 
>> moment,
>> but this is a critical issue. Is anyone able to tackle this / should we
>> bump this to next release?
>>
>
> I bumped this to the next release. I think Cham will be the best
> person to address it when he is back. And with the regular release 
> cadence,
> it would not be delayed by much.
>
>
>>
>> - https://issues.apache.org/jira/browse/BEAM-4750 - Performance
>> degradation due to some safeguards in beam-sdks-java-core. JB, are you
>> looking to fix this? Should we bump? I had the impression that it was an
>> easy fix, but I'm not sure.
>>
>> If you're aware of any other issue that needs to be included as a
>> release blocker, please report it to me.
>> Best
>> -P.
>>
>> On Thu, Jul 12, 2018 at 2:15 AM Etienne Chauchot <
>> echauc...@apache.org> wrote:
>>
>>> +1,
>>>
>>> Thanks for volunteering Pablo, thanks also to have caught tickets
>>> that I forgot to close :)
>>>
>>> Etienne
>>>
>>> Le mercredi 11 juillet 2018 à 12:55 -0700, Alan Myrvold a écrit :
>>>
>>> +1 Thanks for volunteering, Pablo
>>>
>>> On Wed, Jul 11, 2018 at 11:49 AM Jason Kuster <
>>> jasonkus...@google.com> wrote:
>>>
>>> +1 sounds great
>>>
>>> On Wed, Jul 11, 2018 at 11:06 AM Thomas Weise 
>>> wrote:
>>>
>>> +1
>>>
>>> Thanks for volunteering, Pablo!
>>>
>>> On Mon, Jul 9, 2018 at 9:56 PM Jean-Baptiste Onofré 
>>> wrote:
>>>
>>> +1
>>>
>>> I planned to send the proposal as well ;)
>>>
>>> Regards
>>> JB
>>>
>>> On 09/07/2018 23:16, Pablo Estrada wrote:
>>> > Hello everyone!
>>> >
>>> > As per the previously agreed-upon schedule for Beam releases, the
>>> > process for the 2.6.0 Beam release should start on July 17th.
>>> >
>>> > I volunteer to perform this release.
>>> >
>>> > Here is the schedule that I have in mind:
>>> >
>>> > - We start triaging JIRA issues this week.
>>> > - I will cut a release branch on July 17.
>>> > - After July 17, any blockers will need to be cherry-picked into
>>> the
>>> > release branch.
>>> > - As soon as tests look good, and blockers have been addressed, I
>>> will
>>> > perform the other release tasks.
>>> >
>>> > Does that seem reasonable to the community?
>>> >
>>> > Best
>>> > -P.
>>> > --
>>> > Got feedback? go/pabloem-feedback
>>> 
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>>
>>>
>>> --
>> Got feedback? go/pabloem-feedback
>> 
>>
>
>
>>> --
>> Got feedback? go/pabloem-feedback
>> 
>>
>
>


Re: No JVM - new Runner?

2018-07-17 Thread Henning Rohde
There are essentially 2 complementary portability API surfaces that you'd
need to implement: job management incl. job submission and execution as
well as some worker deployment plumbing specific to the runner. Note that
the source of truth is the model protos -- the design docs linked from
https://beam.apache.org/contribute/portability/ and (even more so) the
website guides are not always up to date.

Currently, all runners are in Java and share numerous components and
utilities. A non-JVM runner would have to build all that from scratch --
although, as you mention, if you're using Go or Python the corresponding
SDKs likely have many pieces that can be reused. A minor potential hiccup
is that gRPC/protobuf is not natively supported everywhere, so you may end
up interoperating with the C versions of the libraries if you pick a
non-supported language. A separate challenge regardless of the language is
how directly the Beam model and primitives map to the engine.

All that said, I think it's definitely feasible to do something
interesting. Are you specifically thinking of a Go Wallaroo runner?

Thanks,
 Henning
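To make the job-management half of that surface concrete, here is a rough, purely illustrative sketch. The method names follow the Beam JobService proto (Prepare, Run, GetState, Cancel), but the class, state strings, and in-memory bookkeeping are made up for illustration -- a real runner would implement these as gRPC handlers generated from the model protos:

```python
# Hypothetical, minimal stand-in for the job-management surface a new
# runner would back. Method names mirror the JobService proto; the
# in-memory state machine below just shows the shape of the work.
import uuid


class InMemoryJobService:
    """Toy stand-in for a runner's JobService implementation."""

    def __init__(self):
        self._jobs = {}  # preparation/job id -> state string

    def prepare(self, pipeline_proto, job_name):
        # Phase 1: validate/stage the pipeline, hand back a token.
        preparation_id = '%s-%s' % (job_name, uuid.uuid4().hex[:8])
        self._jobs[preparation_id] = 'PREPARED'
        return preparation_id

    def run(self, preparation_id):
        # Phase 2: translate the pipeline to the engine and launch it.
        if self._jobs.get(preparation_id) != 'PREPARED':
            raise ValueError('unknown preparation id')
        self._jobs[preparation_id] = 'RUNNING'
        return preparation_id  # a real service returns a distinct job id

    def get_state(self, job_id):
        return self._jobs[job_id]

    def cancel(self, job_id):
        self._jobs[job_id] = 'CANCELLED'
        return self._jobs[job_id]


service = InMemoryJobService()
token = service.prepare(pipeline_proto=None, job_name='wordcount')
job_id = service.run(token)
print(service.get_state(job_id))  # RUNNING
```

The second surface Henning mentions -- worker deployment plumbing -- is engine-specific and not modeled here.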

On Tue, Jul 17, 2018 at 9:26 AM Austin Bennett 
wrote:

> Sweet; that led me to:
> https://beam.apache.org/contribute/runner-guide/#the-runner-api (which I
> can't believe I missed).
>
>
>
> On Tue, Jul 17, 2018 at 9:21 AM, Jean-Baptiste Onofré 
> wrote:
>
>> Hi Austin,
>>
>> If your runner provide the gRPC portabality layer (allowing any SDK to
>> "interact" with the runner), it will work no matter how the runner is
>> implemented (JVM or not).
>>
>> However, it means that you will have to mimic the Runner API for the
>> translation.
>>
>> Regards
>> JB
>>
>> On 17/07/2018 18:19, Austin Bennett wrote:
>> > Hi Beam Devs,
>> >
>> > I still don't quite understand:
>> >
>> > "Apache Beam provides a portable API layer for building sophisticated
>> > data-parallel processing pipelines that may be executed across a
>> > diversity of execution engines, or /runners/."
>> >
>> > (from https://beam.apache.org/documentation/runners/capability-matrix/)
>> >
>> > And specifically, close reading
>> > of: https://beam.apache.org/contribute/portability/
>> >
>> > What if I'd like to implement a runner that is non-JVM?  Though would
>> > leverage the Python and Go SDKs?  Specifically, thinking of:
>> >  https://www.wallaroolabs.com (I am out in NY meeting with friends
>> there
>> > later this week, and wanted to get a sense of, feasibility, work
>> > involved, etc -- to propose that we add a new Wallaroo runner).
>> >
>> > Is there a way to keep java out of the mix completely and still work
>> > with Beam on a non JVM runner (seems maybe eventually, but what about
>> > currently/near future)?
>> >
>> > Any input, thoughts, ideas, other pages or info to explore -- all
>> > appreciated; thanks!
>> > Austin
>> >
>> >
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>
>


Re: [PROPOSAL] Prepare Beam 2.6.0 release

2018-07-17 Thread Ahmet Altay
JB said that he is looking at it (
https://lists.apache.org/thread.html/1e329318a4dafe27b9ff304d9460d05d8966e5ceebaf4ebfb948e2b8@%3Cdev.beam.apache.org%3E
)

I propose cutting the release branch as proposed (July 17); and applying
any fixes related to the issue to the branch.

On Tue, Jul 17, 2018 at 11:19 AM, Pablo Estrada  wrote:

> Checking once more:
> What does the community think we should do about
> https://issues.apache.org/jira/browse/BEAM-4750 ? Should I bump it to 2.7.0?
> Best
> -P.
>
> On Fri, Jul 13, 2018 at 5:15 PM Ahmet Altay  wrote:
>
>> Update:  https://issues.apache.org/jira/browse/BEAM-4784 is not a
>> release blocker, details in the JIRA issue.
>>
>> On Fri, Jul 13, 2018 at 11:12 AM, Thomas Weise  wrote:
>>
>>> Can one of our Python experts please take a look at
>>> https://issues.apache.org/jira/browse/BEAM-4784 and advise if this
>>> should be addressed for the release?
>>>
>>> Thanks,
>>> Thomas
>>>
>>>
>>> On Fri, Jul 13, 2018 at 11:02 AM Ahmet Altay  wrote:
>>>


 On Fri, Jul 13, 2018 at 10:48 AM, Pablo Estrada 
 wrote:

> Hi all,
> I've triaged most issues marked for 2.6.0 release. I've localized two
> that need a decision / attention:
>
> - https://issues.apache.org/jira/browse/BEAM-4417 - Bigquery IO
> Numeric Datatype Support. Cham is not available to fix this at the moment,
> but this is a critical issue. Is anyone able to tackle this / should we
> bump this to next release?
>

 I bumped this to the next release. I think Cham will be the best person
 to address it when he is back. And with the regular release cadence, it
 would not be delayed by much.


>
> - https://issues.apache.org/jira/browse/BEAM-4750 - Performance
> degradation due to some safeguards in beam-sdks-java-core. JB, are you
> looking to fix this? Should we bump? I had the impression that it was an
> easy fix, but I'm not sure.
>
> If you're aware of any other issue that needs to be included as a
> release blocker, please report it to me.
> Best
> -P.
>
> On Thu, Jul 12, 2018 at 2:15 AM Etienne Chauchot 
> wrote:
>
>> +1,
>>
>> Thanks for volunteering Pablo, thanks also to have caught tickets
>> that I forgot to close :)
>>
>> Etienne
>>
>> Le mercredi 11 juillet 2018 à 12:55 -0700, Alan Myrvold a écrit :
>>
>> +1 Thanks for volunteering, Pablo
>>
>> On Wed, Jul 11, 2018 at 11:49 AM Jason Kuster 
>> wrote:
>>
>> +1 sounds great
>>
>> On Wed, Jul 11, 2018 at 11:06 AM Thomas Weise  wrote:
>>
>> +1
>>
>> Thanks for volunteering, Pablo!
>>
>> On Mon, Jul 9, 2018 at 9:56 PM Jean-Baptiste Onofré 
>> wrote:
>>
>> +1
>>
>> I planned to send the proposal as well ;)
>>
>> Regards
>> JB
>>
>> On 09/07/2018 23:16, Pablo Estrada wrote:
>> > Hello everyone!
>> >
>> > As per the previously agreed-upon schedule for Beam releases, the
>> > process for the 2.6.0 Beam release should start on July 17th.
>> >
>> > I volunteer to perform this release.
>> >
>> > Here is the schedule that I have in mind:
>> >
>> > - We start triaging JIRA issues this week.
>> > - I will cut a release branch on July 17.
>> > - After July 17, any blockers will need to be cherry-picked into the
>> > release branch.
>> > - As soon as tests look good, and blockers have been addressed, I
>> will
>> > perform the other release tasks.
>> >
>> > Does that seem reasonable to the community?
>> >
>> > Best
>> > -P.
>> > --
>> > Got feedback? go/pabloem-feedback
>> 
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>>
>>
>> --
> Got feedback? go/pabloem-feedback
> 
>


>> --
> Got feedback? go/pabloem-feedback
> 
>


Re: [PROPOSAL] Prepare Beam 2.6.0 release

2018-07-17 Thread Pablo Estrada
Checking once more:
What does the community think we should do about
https://issues.apache.org/jira/browse/BEAM-4750 ? Should I bump it to 2.7.0?
Best
-P.

On Fri, Jul 13, 2018 at 5:15 PM Ahmet Altay  wrote:

> Update:  https://issues.apache.org/jira/browse/BEAM-4784 is not a release
> blocker, details in the JIRA issue.
>
> On Fri, Jul 13, 2018 at 11:12 AM, Thomas Weise  wrote:
>
>> Can one of our Python experts please take a look at
>> https://issues.apache.org/jira/browse/BEAM-4784 and advise if this
>> should be addressed for the release?
>>
>> Thanks,
>> Thomas
>>
>>
>> On Fri, Jul 13, 2018 at 11:02 AM Ahmet Altay  wrote:
>>
>>>
>>>
>>> On Fri, Jul 13, 2018 at 10:48 AM, Pablo Estrada 
>>> wrote:
>>>
 Hi all,
 I've triaged most issues marked for 2.6.0 release. I've localized two
 that need a decision / attention:

 - https://issues.apache.org/jira/browse/BEAM-4417 - Bigquery IO
 Numeric Datatype Support. Cham is not available to fix this at the moment,
 but this is a critical issue. Is anyone able to tackle this / should we
 bump this to next release?

>>>
>>> I bumped this to the next release. I think Cham will be the best person
>>> to address it when he is back. And with the regular release cadence, it
>>> would not be delayed by much.
>>>
>>>

 - https://issues.apache.org/jira/browse/BEAM-4750 - Performance
 degradation due to some safeguards in beam-sdks-java-core. JB, are you
 looking to fix this? Should we bump? I had the impression that it was an
 easy fix, but I'm not sure.

 If you're aware of any other issue that needs to be included as a
 release blocker, please report it to me.
 Best
 -P.

 On Thu, Jul 12, 2018 at 2:15 AM Etienne Chauchot 
 wrote:

> +1,
>
> Thanks for volunteering Pablo, thanks also to have caught tickets that
> I forgot to close :)
>
> Etienne
>
> Le mercredi 11 juillet 2018 à 12:55 -0700, Alan Myrvold a écrit :
>
> +1 Thanks for volunteering, Pablo
>
> On Wed, Jul 11, 2018 at 11:49 AM Jason Kuster 
> wrote:
>
> +1 sounds great
>
> On Wed, Jul 11, 2018 at 11:06 AM Thomas Weise  wrote:
>
> +1
>
> Thanks for volunteering, Pablo!
>
> On Mon, Jul 9, 2018 at 9:56 PM Jean-Baptiste Onofré 
> wrote:
>
> +1
>
> I planned to send the proposal as well ;)
>
> Regards
> JB
>
> On 09/07/2018 23:16, Pablo Estrada wrote:
> > Hello everyone!
> >
> > As per the previously agreed-upon schedule for Beam releases, the
> > process for the 2.6.0 Beam release should start on July 17th.
> >
> > I volunteer to perform this release.
> >
> > Here is the schedule that I have in mind:
> >
> > - We start triaging JIRA issues this week.
> > - I will cut a release branch on July 17.
> > - After July 17, any blockers will need to be cherry-picked into the
> > release branch.
> > - As soon as tests look good, and blockers have been addressed, I
> will
> > perform the other release tasks.
> >
> > Does that seem reasonable to the community?
> >
> > Best
> > -P.
> > --
> > Got feedback? go/pabloem-feedback
> 
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
>
>
> --
 Got feedback? go/pabloem-feedback
 

>>>
>>>
> --
Got feedback? go/pabloem-feedback


Re: CODEOWNERS for apache/beam repo

2018-07-17 Thread Udi Meiri
+1 to generating the file.
I'll go ahead and file a PR to remove CODEOWNERS
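The rules-based matching behind the suggest_reviewers.py output quoted below can be sketched with nothing but the stdlib. The patterns and owners here are made up for illustration, and real CODEOWNERS semantics are richer (e.g. last matching rule wins); this toy follows the sample output in letting the first matching rule decide per file:

```python
# Toy rules-based reviewer suggester in the spirit of the quoted
# suggest_reviewers.py output. Rules and owners are illustrative only.
import fnmatch

# (path_pattern, owner) pairs, roughly CODEOWNERS-shaped.
RULES = [
    ('/runners/core-construction-java*', '@lukecwik'),
    ('/runners/core-java*', '@echauchot'),
    ('/runners/google-cloud-dataflow-java*', '@pabloem'),
    ('*.java', '@lukecwik'),  # catch-all for remaining Java files
]


def suggest_reviewers(changed_files):
    """Return the sorted set of owners whose rule matches a changed file."""
    reviewers = set()
    for path in changed_files:
        for pattern, owner in RULES:
            if fnmatch.fnmatch(path, pattern):
                reviewers.add(owner)
                break  # first matching rule decides, as in the sample output
    return sorted(reviewers)


files = [
    '/runners/core-java/src/main/java/org/apache/beam/runners/core/X.java',
    '/runners/google-cloud-dataflow-java/src/main/java/Y.java',
    '/runners/flink/src/main/java/Z.java',
]
print(suggest_reviewers(files))  # ['@echauchot', '@lukecwik', '@pabloem']
```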

On Tue, Jul 17, 2018 at 9:28 AM Holden Karau  wrote:

> So it doesn’t support doing that right now, although if we find it’s a
> problem we can specify an exclude file with folks who haven’t contributed
> in the past year. Would people want me to generate that first?
>
> On Tue, Jul 17, 2018 at 10:22 AM Ismaël Mejía  wrote:
>
>> Is there a way to put inactive people as not reviewers for the blame
>> case? I think it can be useful considering that a good amount of our
>> committers are not active at the moment and auto-assigning reviews to
>> them seem like a waste of energy/time.
>> On Tue, Jul 17, 2018 at 1:58 AM Eugene Kirpichov 
>> wrote:
>> >
>> > We did not, but I think we should. So far, in 100% of the PRs I've
>> authored, the default functionality of CODEOWNERS did the wrong thing and I
>> had to fix something up manually.
>> >
>> > On Mon, Jul 16, 2018 at 3:42 PM Andrew Pilloud 
>> wrote:
>> >>
>> >> This sounds like a good plan. Did we want to rename the CODEOWNERS
>> file to disable github's mass adding of reviewers while we figure this out?
>> >>
>> >> Andrew
>> >>
>> >> On Mon, Jul 16, 2018 at 10:20 AM Jean-Baptiste Onofré 
>> wrote:
>> >>>
>> >>> +1
>> >>>
>> >>> Le 16 juil. 2018, à 19:17, Holden Karau  a
>> écrit:
>> 
>>  Ok if no one objects I'll create the INFRA ticket after OSCON and we
>> can test it for a week and decide if it helps or hinders.
>> 
>>  On Mon, Jul 16, 2018, 7:12 PM Jean-Baptiste Onofré < j...@nanthrax.net>
>> wrote:
>> >
>> > Agree to test it for a week.
>> >
>> > Regards
>> > JB
>> > Le 16 juil. 2018, à 18:59, Holden Karau < holden.ka...@gmail.com>
>> a écrit:
>> >>
>> >> Would folks be OK with me asking infra to turn on blame based
>> suggestions for Beam and trying it out for a week?
>> >>
>> >> On Mon, Jul 16, 2018, 6:53 PM Rafael Fernandez <
>> rfern...@google.com> wrote:
>> >>>
>> >>> +1 using blame -- nifty :)
>> >>>
>> >>> On Mon, Jul 16, 2018 at 2:31 AM Huygaa Batsaikhan <
>> bat...@google.com> wrote:
>> 
>>  +1. This is great.
>> 
>>  On Sat, Jul 14, 2018 at 7:44 AM Udi Meiri < eh...@google.com>
>> wrote:
>> >
>> > Mention bot looks cool, as it tries to guess the reviewer using
>> blame.
>> > I've written a quick and dirty script that uses only CODEOWNERS.
>> >
>> > Its output looks like:
>> > $ python suggest_reviewers.py --pr 5940
>> > INFO:root:Selected reviewer @lukecwik for:
>> /runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/PTransformMatchers.java
>> (path_pattern: /runners/core-construction-java*)
>> > INFO:root:Selected reviewer @lukecwik for:
>> /runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/SplittableParDoNaiveBounded.java
>> (path_pattern: /runners/core-construction-java*)
>> > INFO:root:Selected reviewer @echauchot for:
>> /runners/core-java/src/main/java/org/apache/beam/runners/core/SplittableParDoViaKeyedWorkItems.java
>> (path_pattern: /runners/core-java*)
>> > INFO:root:Selected reviewer @lukecwik for:
>> /runners/flink/build.gradle (path_pattern: */build.gradle*)
>> > INFO:root:Selected reviewer @lukecwik for:
>> /runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkTransformOverrides.java
>> (path_pattern: *.java)
>> > INFO:root:Selected reviewer @pabloem for:
>> /runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java
>> (path_pattern: /runners/google-cloud-dataflow-java*)
>> > INFO:root:Selected reviewer @lukecwik for:
>> /sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/SplittableDoFnTest.java
>> (path_pattern: /sdks/java/core*)
>> > Suggested reviewers: @echauchot, @lukecwik, @pabloem
>> >
>> > Script is in: https://github.com/apache/beam/pull/5951
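For readers curious how the rules-based matching works, here is a minimal sketch using only the standard library. The patterns below are illustrative stand-ins, not Beam's actual CODEOWNERS entries, and — as in GitHub's CODEOWNERS — the last matching pattern wins:

```python
import fnmatch

# Illustrative CODEOWNERS-style rules (pattern, owner), ordered from
# least to most specific; these are NOT Beam's real entries.
RULES = [
    ("*.java", "@lukecwik"),
    ("*/build.gradle*", "@lukecwik"),
    ("/runners/core-java*", "@echauchot"),
    ("/runners/google-cloud-dataflow-java*", "@pabloem"),
]

def suggest_reviewers(changed_files, rules=RULES):
    """Collect the owner of the last matching rule for each changed path."""
    reviewers = set()
    for path in changed_files:
        owner = None
        for pattern, candidate in rules:
            # fnmatch's '*' crosses '/' boundaries, which is what we want
            # for prefix-style patterns like /runners/core-java*.
            if fnmatch.fnmatch(path, pattern):
                owner = candidate  # last matching rule wins
        if owner:
            reviewers.add(owner)
    return reviewers
```

Run against a PR's changed file list, this yields a deduplicated reviewer set much like the script output above.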
>> >
>> >
>> > What does the community think? Do you prefer blame-based or
>> rules-based reviewer suggestions?
>> >
>> > On Fri, Jul 13, 2018 at 11:13 AM Holden Karau <
>> hol...@pigscanfly.ca> wrote:
>> >>
>> >> I'm looking at something similar in the Spark project, and
>> while it's now archived by FB it seems like something like
>> https://github.com/facebookarchive/mention-bot might do what we want.
>> I'm going to spin up a version on my K8 cluster and see if I can ask infra
>> to add a webhook and if it works for Spark we could ask INFRA to add a
>> second webhook for Beam. (Or if the Beam folks are more interested in
>> experimenting I can do Beam first as a smaller project and roll with that).
>> >>
>> >> Let me know :)
>> >>
>> >> On Fri, Jul 13, 2018 at 10:53 AM, Eugene Kirpichov <
>> kirpic...@google.com> wrote:
>> >>>
>> >>> 

Re: Let's start getting rid of BoundedSource

2018-07-17 Thread Eugene Kirpichov
On Tue, Jul 17, 2018 at 2:49 AM Etienne Chauchot 
wrote:

> Hi Eugene
>
> On Monday, July 16, 2018 at 07:52 -0700, Eugene Kirpichov wrote:
>
> Hi Etienne - thanks for catching this; indeed, I somehow missed that
> actually several runners do this same thing - it seemed to me as something
> that can be done in user code (because it involves combining estimated size
> + split in pretty much the same way),
>
>
> When you say "user code", you mean IO writer code as opposed to runner
> code, right?
>
Correct: "user code" is what happens in the SDK or the user pipeline.


>
>
> but I'm not so sure: even though many runners have a "desired parallelism"
> option or alike, it's not all of them, so we can't use such an option
> universally.
>
>
> Agree, cannot be universal
>
>
> Maybe then the right thing to do is to:
> - Use bounded SDFs for these
> - Change SDF @SplitRestriction API to take a desired number of splits as a
> parameter, and introduce an API @EstimateOutputSizeBytes(element) valid
> only on bounded SDFs
>
> Agree with the idea, but EstimateOutputSize must return the size of the
> dataset, not of an element.
>
Please recall that the element here is e.g. a filename, or name of a
BigTable table, or something like that - i.e. the element describes the
dataset, and the restriction describes what part of the dataset.

If e.g. we have a PCollection of filenames and apply a ReadTextFn
SDF to it, and want the runner to know the total size of all files - the
runner could insert some transforms to apply EstimateOutputSize to each
element and Sum.globally() them.
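Outside of Beam terms, the shape of that expansion is simple: apply a per-element size estimator, then sum globally. A minimal sketch, where estimate_output_size_bytes is a hypothetical stand-in for the proposed @EstimateOutputSizeBytes and file size stands in for output size:

```python
import os

def estimate_output_size_bytes(filename):
    # Hypothetical per-element estimator: the element is a filename, and
    # the file's size approximates the size of the data it will produce.
    return os.path.getsize(filename)

def estimate_total_size(elements, estimator=estimate_output_size_bytes):
    # What the runner-inserted transforms would compute: estimate each
    # element, then Sum.globally() the results.
    return sum(estimator(e) for e in elements)
```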


> On some runners, each worker is set to a given amount of heap. Thus, it is
> important that a runner could evaluate the size of the whole dataset to
> determine the size of each split (to fit in memory of the workers) and thus
> tell the bounded SDF the number of desired splits.
>
> - Add some plumbing to the standard bounded SDF expansion so that
> different runners can compute that parameter differently, the two standard
> ways being "split into given number of splits" or "split based on the
> sub-linear formula of estimated size".
>
> I think this would work, though this is somewhat more work than I
> anticipated. Any alternative ideas?
>
> +1 It will be very similar for an IO developer (@EstimateOutputSizeBytes
> will be similar to source.getEstimatedSizeBytes(),
> and @SplitRestriction(desiredSplits) similar to
> source.split(desiredBundleSize))
>
Yeah I'm not sure this is actually a good thing that these APIs end up so
similar to the old ones - I was hoping we could come up with something
better - but seems like there's no viable alternative at this point :)


>
> Etienne
>
>
> On Mon, Jul 16, 2018 at 3:07 AM Etienne Chauchot 
> wrote:
>
> Hi,
> thanks Eugene for analyzing and sharing that.
> I have one comment inline
>
> Etienne
>
> On Sunday, July 15, 2018 at 14:20 -0700, Eugene Kirpichov wrote:
>
> Hey beamers,
>
> I've always wondered whether the BoundedSource implementations in the Beam
> SDK are worth their complexity, or whether they rather could be converted
> to the much easier to code ParDo style, which is also more modular and
> allows you to very easily implement readAll().
>
> There's a handful: file-based sources, BigQuery, Bigtable, HBase,
> Elasticsearch, MongoDB, Solr and a couple more.
>
> Curiously enough, BoundedSource vs. ParDo matters *only* on Dataflow,
> because AFAICT Dataflow is the only runner that cares about the things that
> BoundedSource can do and ParDo can't:
> - size estimation (used to choose an initial number of workers) [ok, Flink
> calls the function to return statistics, but doesn't seem to do anything
> else with it]
>
> => Spark uses size estimation to set desired bundle size with something
> like desiredBundleSize = estimatedSize / nbOfWorkersConfigured (partitions)
> See
> https://github.com/apache/beam/blob/a5634128d194161aebc8d03229fdaa1066cf7739/runners/spark/src/main/java/org/apache/beam/runners/spark/io/SourceRDD.java#L101
>
>
> - splitting into bundles of given size (Dataflow chooses the number of
> bundles to create based on a simple formula that's not entirely unlike
> K*sqrt(size))
> - liquid sharding (splitAtFraction())
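The two splitting heuristics mentioned in this thread can be written down as plain functions; the constant K below is an illustrative assumption, not Dataflow's actual value:

```python
import math

def spark_style_bundle_size(estimated_size_bytes, num_workers):
    # Spark's SourceRDD approach: one bundle (partition) per configured
    # worker, i.e. desiredBundleSize = estimatedSize / nbOfWorkersConfigured.
    return max(1, estimated_size_bytes // num_workers)

def dataflow_style_bundle_count(estimated_size_bytes, k=0.01):
    # "Not entirely unlike K*sqrt(size)": bundle count grows sub-linearly
    # with input size; K here is made up for illustration.
    return max(1, int(k * math.sqrt(estimated_size_bytes)))
```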
>
> If Dataflow didn't exist, there'd be no reason at all to use
> BoundedSource. So the question "which ones can be converted to ParDo" is
> really "which ones are used on Dataflow in ways that make these functions
> matter". Previously, my conservative assumption was that the answer is "all
> of them", but turns out this is not so.
>
> Liquid sharding always matters; if the source is liquid-shardable, for now
> we have to keep it a source (until SDF gains liquid sharding - which should
> happen in a quarter or two I think).
>
> Choosing number of bundles to split into is easily done in SDK code, see
> https://github.com/apache/beam/pull/5886 for example; DatastoreIO does
> something similar.
>
> The remaining thing to analyze is, when does initial scaling matter. 

Re: BiqQueryIO.write and Wait.on

2018-07-17 Thread Eugene Kirpichov
Hmm, I think this approach has some complications:
- Using JobStatus makes it tied to using BigQuery batch load jobs, but the
return type ought to be the same regardless of which method of writing is
used (including potential future BigQuery APIs - they are evolving), or how
many BigQuery load jobs are involved in writing a given window (it can be
multiple).
- Returning a success/failure indicator makes it prone to users ignoring
the failure: the default behavior should be that, if the pipeline succeeds,
that means all data was successfully written - if users want different
error handling, e.g. a deadletter queue, they should have to specify it
explicitly.

I would recommend to return a PCollection of a type that's invariant to
which load method is used (streaming writes, load jobs, multiple load jobs
etc.). If it's unclear what type that should be, you could introduce an
empty type e.g. "class BigQueryWriteResult {}" just for the sake of
signaling success, and later add something to it.
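A sketch of that recommendation (the names and stubbed write paths are assumptions for illustration, not the actual BigQueryIO internals): the signal type stays the same no matter which write path ran, so downstream sequencing code never depends on the load method.

```python
class BigQueryWriteResult:
    """Empty marker emitted once a window of data has been written.

    Deliberately invariant to the write method (streaming inserts, one
    load job, multiple load jobs), so fields can be added later without
    breaking downstream consumers.
    """

def write_window(rows, method):
    # Stubbed-out write paths; whichever one runs, the caller receives
    # the same signal type for use with Wait.on-style sequencing.
    if method == "load_jobs":
        pass  # ...run one or more BigQuery load jobs...
    else:
        pass  # ...use streaming inserts...
    return BigQueryWriteResult()
```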

On Tue, Jul 17, 2018 at 12:30 AM Carlos Alonso  wrote:

> All good so far. I've been a bit side tracked but more or less I have the
> idea of using the JobStatus as part of the collection so that not only the
> completion is signaled, but also the result (success/failure) can be
> accessed, how does it sound?
>
> Regards
>
> On Tue, Jul 17, 2018 at 3:07 AM Eugene Kirpichov 
> wrote:
>
>> Hi Carlos,
>>
>> Any updates / roadblocks you hit?
>>
>>
>> On Tue, Jul 3, 2018 at 7:13 AM Eugene Kirpichov 
>> wrote:
>>
>>> Awesome!! Thanks for the heads up, very exciting, this is going to make
>>> a lot of people happy :)
>>>
>>> On Tue, Jul 3, 2018, 3:40 AM Carlos Alonso  wrote:
>>>
 + dev@beam.apache.org

 Just a quick email to let you know that I'm starting developing this.

 On Fri, Apr 20, 2018 at 10:30 PM Eugene Kirpichov 
 wrote:

> Hi Carlos,
>
> Thank you for expressing interest in taking this on! Let me give you a
> few pointers to start, and I'll be happy to help everywhere along the way.
>
> Basically we want BigQueryIO.write() to return something (e.g. a
> PCollection) that can be used as input to Wait.on().
> Currently it returns a WriteResult, which only contains a
> PCollection of failed inserts - that one can not be used
> directly, instead we should add another component to WriteResult that
> represents the result of successfully writing some data.
>
> Given that BQIO supports dynamic destination writes, I think it makes
> sense for that to be a PCollection<KV<DestinationT, ???>> so that in theory
> we could sequence different destinations independently (currently 
> Wait.on()
> does not provide such a feature, but it could); and it will require
> changing WriteResult to be WriteResult<DestinationT>. As for what the "???"
> might be - it is something that represents the result of successfully
> writing a window of data. I think it can even be Void, or "?" (wildcard
> type) for now, until we figure out something better.
>
> Implementing this would require roughly the following work:
> - Add this PCollection<KV<DestinationT, ???>> to WriteResult
> - Modify the BatchLoads transform to provide it on both codepaths:
> expandTriggered() and expandUntriggered()
> ...- expandTriggered() itself writes via 2 codepaths: single-partition
> and multi-partition. Both need to be handled - we need to get a
> PCollection<KV<DestinationT, ???>> from each of them, and Flatten these two
> PCollections together to get the final result. The single-partition
> codepath (writeSinglePartition) under the hood already uses WriteTables
> that returns a KV so it's directly usable. The
> multi-partition codepath ends in WriteRenameTriggered - unfortunately, 
> this
> codepath drops DestinationT along the way and will need to be refactored a
> bit to keep it until the end.
> ...- expandUntriggered() should be treated the same way.
> - Modify the StreamingWriteTables transform to provide it
> ...- Here also, the challenge is to propagate the DestinationT type
> all the way until the end of StreamingWriteTables - it will need to be
> refactored. After such a refactoring, returning a KV
> should be easy.
>
> Another challenge with all of this is backwards compatibility in terms
> of API and pipeline update.
> Pipeline update is much less of a concern for the BatchLoads codepath,
> because it's typically used in batch-mode pipelines that don't get 
> updated.
> I would recommend to start with this, perhaps even with only the
> untriggered codepath (it is much more commonly used) - that will pave the
> way for future work.
>
> Hope this helps, please ask more if something is unclear!
>
> On Fri, Apr 20, 2018 at 12:48 AM Carlos Alonso 
> wrote:
>
>> Hey Eugene!!
>>
>> I’d gladly take a stab on it although I’m not sure how much available
>> time I 

Re: CODEOWNERS for apache/beam repo

2018-07-17 Thread Holden Karau
So it doesn’t support doing that right now, although if we find it’s a
problem we can specify an exclude file with folks who haven’t contributed
in the past year. Would people want me to generate that first?

On Tue, Jul 17, 2018 at 10:22 AM Ismaël Mejía  wrote:

> Is there a way to mark inactive people as non-reviewers for the blame
> case? I think it would be useful considering that a good number of our
> committers are not active at the moment and auto-assigning reviews to
> them seems like a waste of energy/time.
> On Tue, Jul 17, 2018 at 1:58 AM Eugene Kirpichov 
> wrote:
> >
> > We did not, but I think we should. So far, in 100% of the PRs I've
> authored, the default functionality of CODEOWNERS did the wrong thing and I
> had to fix something up manually.
> >
> > On Mon, Jul 16, 2018 at 3:42 PM Andrew Pilloud 
> wrote:
> >>
> >> This sounds like a good plan. Did we want to rename the CODEOWNERS file
> to disable github's mass adding of reviewers while we figure this out?
> >>
> >> Andrew
> >>
> >> On Mon, Jul 16, 2018 at 10:20 AM Jean-Baptiste Onofré 
> wrote:
> >>>
> >>> +1
> >>>
>  On 16 Jul 2018 at 19:17, Holden Karau wrote:
> 
>  Ok if no one objects I'll create the INFRA ticket after OSCON and we
> can test it for a week and decide if it helps or hinders.
> 
>  On Mon, Jul 16, 2018, 7:12 PM Jean-Baptiste Onofré < j...@nanthrax.net>
> wrote:
> >
> > Agree to test it for a week.
> >
> > Regards
> > JB
> > On 16 Jul 2018 at 18:59, Holden Karau < holden.ka...@gmail.com> wrote:
> >>
> >> Would folks be OK with me asking infra to turn on blame based
> suggestions for Beam and trying it out for a week?
> >>
> >> On Mon, Jul 16, 2018, 6:53 PM Rafael Fernandez <
> rfern...@google.com> wrote:
> >>>
> >>> +1 using blame -- nifty :)
> >>>
> >>> On Mon, Jul 16, 2018 at 2:31 AM Huygaa Batsaikhan <
> bat...@google.com> wrote:
> 
>  +1. This is great.
> 
>  On Sat, Jul 14, 2018 at 7:44 AM Udi Meiri < eh...@google.com>
> wrote:
> >
> > Mention bot looks cool, as it tries to guess the reviewer using
> blame.
> > I've written a quick and dirty script that uses only CODEOWNERS.
> >
> > Its output looks like:
> > $ python suggest_reviewers.py --pr 5940
> > INFO:root:Selected reviewer @lukecwik for:
> /runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/PTransformMatchers.java
> (path_pattern: /runners/core-construction-java*)
> > INFO:root:Selected reviewer @lukecwik for:
> /runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/SplittableParDoNaiveBounded.java
> (path_pattern: /runners/core-construction-java*)
> > INFO:root:Selected reviewer @echauchot for:
> /runners/core-java/src/main/java/org/apache/beam/runners/core/SplittableParDoViaKeyedWorkItems.java
> (path_pattern: /runners/core-java*)
> > INFO:root:Selected reviewer @lukecwik for:
> /runners/flink/build.gradle (path_pattern: */build.gradle*)
> > INFO:root:Selected reviewer @lukecwik for:
> /runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkTransformOverrides.java
> (path_pattern: *.java)
> > INFO:root:Selected reviewer @pabloem for:
> /runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java
> (path_pattern: /runners/google-cloud-dataflow-java*)
> > INFO:root:Selected reviewer @lukecwik for:
> /sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/SplittableDoFnTest.java
> (path_pattern: /sdks/java/core*)
> > Suggested reviewers: @echauchot, @lukecwik, @pabloem
> >
> > Script is in: https://github.com/apache/beam/pull/5951
> >
> >
> > What does the community think? Do you prefer blame-based or
> rules-based reviewer suggestions?
> >
> > On Fri, Jul 13, 2018 at 11:13 AM Holden Karau <
> hol...@pigscanfly.ca> wrote:
> >>
> >> I'm looking at something similar in the Spark project, and
> while it's now archived by FB it seems like something like
> https://github.com/facebookarchive/mention-bot might do what we want. I'm
> going to spin up a version on my K8 cluster and see if I can ask infra to
> add a webhook and if it works for Spark we could ask INFRA to add a second
> webhook for Beam. (Or if the Beam folks are more interested in
> experimenting I can do Beam first as a smaller project and roll with that).
> >>
> >> Let me know :)
> >>
> >> On Fri, Jul 13, 2018 at 10:53 AM, Eugene Kirpichov <
> kirpic...@google.com> wrote:
> >>>
> >>> Sounds reasonable for now, thanks!
> >>> It's unfortunate that Github's CODEOWNERS feature appears to
> be effectively unusable for Beam but I'd hope that Github might pay
> attention and fix things if we submit feedback, with us being one of 

Re: No JVM - new Runner?

2018-07-17 Thread Austin Bennett
Sweet; that led me to:
https://beam.apache.org/contribute/runner-guide/#the-runner-api (which I
can't believe I missed).



On Tue, Jul 17, 2018 at 9:21 AM, Jean-Baptiste Onofré 
wrote:

> Hi Austin,
>
> If your runner provides the gRPC portability layer (allowing any SDK to
> "interact" with the runner), it will work no matter how the runner is
> implemented (JVM or not).
>
> However, it means that you will have to mimic the Runner API for the
> translation.
>
> Regards
> JB
>
> On 17/07/2018 18:19, Austin Bennett wrote:
> > Hi Beam Devs,
> >
> > I still don't quite understand:
> >
> > "Apache Beam provides a portable API layer for building sophisticated
> > data-parallel processing pipelines that may be executed across a
> > diversity of execution engines, or /runners/."
> >
> > (from https://beam.apache.org/documentation/runners/capability-matrix/)
> >
> > And specifically, close reading
> > of: https://beam.apache.org/contribute/portability/
> >
> > What if I'd like to implement a runner that is non-JVM?  Though would
> > leverage the Python and Go SDKs?  Specifically, thinking of:
> >  https://www.wallaroolabs.com (I am out in NY meeting with friends there
> > later this week, and wanted to get a sense of, feasibility, work
> > involved, etc -- to propose that we add a new Wallaroo runner).
> >
> > Is there a way to keep java out of the mix completely and still work
> > with Beam on a non JVM runner (seems maybe eventually, but what about
> > currently/near future)?
> >
> > Any input, thoughts, ideas, other pages or info to explore -- all
> > appreciated; thanks!
> > Austin
> >
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: No JVM - new Runner?

2018-07-17 Thread Jean-Baptiste Onofré
Hi Austin,

If your runner provides the gRPC portability layer (allowing any SDK to
"interact" with the runner), it will work no matter how the runner is
implemented (JVM or not).

However, it means that you will have to mimic the Runner API for the
translation.

Regards
JB

On 17/07/2018 18:19, Austin Bennett wrote:
> Hi Beam Devs,
> 
> I still don't quite understand:
> 
> "Apache Beam provides a portable API layer for building sophisticated
> data-parallel processing pipelines that may be executed across a
> diversity of execution engines, or /runners/."
> 
> (from https://beam.apache.org/documentation/runners/capability-matrix/)
> 
> And specifically, close reading
> of: https://beam.apache.org/contribute/portability/
> 
> What if I'd like to implement a runner that is non-JVM?  Though would
> leverage the Python and Go SDKs?  Specifically, thinking of:
>  https://www.wallaroolabs.com (I am out in NY meeting with friends there
> later this week, and wanted to get a sense of, feasibility, work
> involved, etc -- to propose that we add a new Wallaroo runner).  
> 
> Is there a way to keep java out of the mix completely and still work
> with Beam on a non JVM runner (seems maybe eventually, but what about
> currently/near future)?  
> 
> Any input, thoughts, ideas, other pages or info to explore -- all
> appreciated; thanks!
> Austin
> 
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


No JVM - new Runner?

2018-07-17 Thread Austin Bennett
Hi Beam Devs,

I still don't quite understand:

"Apache Beam provides a portable API layer for building sophisticated
data-parallel processing pipelines that may be executed across a diversity
of execution engines, or *runners*."

(from https://beam.apache.org/documentation/runners/capability-matrix/)

And specifically, close reading of:
https://beam.apache.org/contribute/portability/

What if I'd like to implement a runner that is non-JVM?  Though would
leverage the Python and Go SDKs?  Specifically, thinking of:
https://www.wallaroolabs.com (I am out in NY meeting with friends there
later this week, and wanted to get a sense of, feasibility, work involved,
etc -- to propose that we add a new Wallaroo runner).

Is there a way to keep java out of the mix completely and still work with
Beam on a non JVM runner (seems maybe eventually, but what about
currently/near future)?

Any input, thoughts, ideas, other pages or info to explore -- all
appreciated; thanks!
Austin


Re: Performance testing documentation - suggestions request

2018-07-17 Thread Łukasz Gajowy
Hi all,

just a quick FYI: improved documentation was merged to the Beam's site.

Thanks to everyone involved!

Thanks,
Łukasz

On Wed, 30 May 2018 at 14:25, Łukasz Gajowy wrote:

> Hi,
>
> the Performance Testing Framework has been an ongoing effort for some time
> now. As I noticed (and received signals from the community), it is getting
> more popular (this is good), but besides changing commands from mvn to
> gradle ones, it was not updated for a very long time (this is not good at
> all!). It's about time to document it properly. I created a JIRA for this[1].
>
> Before we resolve the JIRA, I thought it would be good to ask you about
> biggest pain points in the current documentation[2] (or in the Performance
> Testing Framework itself) - what do you find the most lacking? Maybe there
> are some sections you'd like to add (e.g. "Future plans of development")?
> Such feedback would be very helpful in providing fine-tuned documentation
> so it is much appreciated. Feel free to comment here or in the comments
> section of the ticket. Thanks in advance!
>
> [1]: https://issues.apache.org/jira/browse/BEAM-4430
> [2]:
> https://beam.apache.org/documentation/io/testing/#i-o-transform-integration-tests
>
> Best regards,
> Łukasz
>


Re: Beam site test/merge Jenkins issue

2018-07-17 Thread Alexey Romanenko
Andrew, oops, sorry, I missed this Jira issue. Thank you for pointing out.

Btw, today it worked better - I’ve managed to merge a PR on the very first
try =) I don’t know if it was luck or if it was already fixed.

Alexey

> On 16 Jul 2018, at 18:50, Andrew Pilloud  wrote:
> 
> This is a really persistent flap in the website build. I would guess it hits 
> 80% of the time, if you do enough builds it will eventually succeed. I opened 
> an issue on it a while back: https://issues.apache.org/jira/browse/BEAM-4686 
> 
> 
> Andrew
> 
> On Mon, Jul 16, 2018 at 5:05 AM Jean-Baptiste Onofré  > wrote:
> Hi,
> 
> let me take a look, it's maybe the client key auth which failing.
> 
> Regards
> JB
> 
> On 16/07/2018 13:02, Alexey Romanenko wrote:
> > Hi,
> > 
> > From time to time, I observe *gpg key* issues in Jenkins job when I try
> > to test/merge Beam site PR.
> > For example:
> > https://builds.apache.org/job/beam_PreCommit_Website_Stage/1192/console 
> > 
> > 
> > It says the following:
> > gpg: keyserver communications error: keyserver helper general error
> > gpg: keyserver communications error: unknown pubkey algorithm
> > gpg: keyserver receive failed: unknown pubkey algorithm
> > 
> > Is it a known problem, and how can I overcome this?
> > 
> > Thank you,
> > Alexey
> 
> -- 
> Jean-Baptiste Onofré
> jbono...@apache.org 
> http://blog.nanthrax.net 
> Talend - http://www.talend.com 



Re: An update on Eugene

2018-07-17 Thread Łukasz Gajowy
Eugene,

thank you, good luck and have lots of fun with new challenges!


On Tue, 17 Jul 2018 at 14:31, Alexey Romanenko wrote:

> Eugene,
>
> Thank you for your work and good luck with new challenges!
>
> Alexey
>
> > On 17 Jul 2018, at 12:10, Etienne Chauchot  wrote:
> >
> > Hi Eugene,
> >
> > I'm sad also to hear this, you've always been a very valuable team
> member of the Beam community. We will miss you. Thanks for everything
> you've done and all the good things you brought to Beam.
> >
> > I wish you very happy and successful new adventures !
> > Hope to have updates in the future.
> >
> > Etienne
> >
> >> On Tuesday, July 17, 2018 at 01:00 +0200, Ismaël Mejía wrote:
> >> I am sad to read this, but at the same happy for you and your future
> >> adventures Eugene.
> >>
> >> Thanks a lot for all the work you have done in this project, all the
> >> work on SDF, the improvements on composability and the File-based IOs,
> >> and of course for your reviews that really helped improve the quality
> >> in many areas of the project as well as other areas that I probably
> >> forget. In general it has been really nice to see the way you grew in
> >> the open source side of the project too, and your presence will be
> >> definitely missed.
> >>
> >> Best wishes for the work on the new programming model and hope to hear
> >> back from you in the future.
> >>
>
>


Re: An update on Eugene

2018-07-17 Thread Alexey Romanenko
Eugene, 

Thank you for your work and good luck with new challenges!

Alexey

> On 17 Jul 2018, at 12:10, Etienne Chauchot  wrote:
> 
> Hi Eugene,
> 
> I'm sad also to hear this, you've always been a very valuable team member of 
> the Beam community. We will miss you. Thanks for everything you've done and 
> all the good things you brought to Beam.
> 
> I wish you very happy and successful new adventures !
> Hope to have updates in the future.
> 
> Etienne
> 
> On Tuesday, July 17, 2018 at 01:00 +0200, Ismaël Mejía wrote:
>> I am sad to read this, but at the same happy for you and your future
>> adventures Eugene.
>> 
>> Thanks a lot for all the work you have done in this project, all the
>> work on SDF, the improvements on composability and the File-based IOs,
>> and of course for your reviews that really helped improve the quality
>> in many areas of the project as well as other areas that I probably
>> forget. In general it has been really nice to see the way you grew in
>> the open source side of the project too, and your presence will be
>> definitely missed.
>> 
>> Best wishes for the work on the new programming model and hope to hear
>> back from you in the future.
>> 



Re: Performance issue in Beam 2.4 onwards

2018-07-17 Thread Jean-Baptiste Onofré
Hi,

I'm on it, still investigating/digging.

Regards
JB

On 17/07/2018 09:44, Ismaël Mejía wrote:
> Given that the 2.6.0 cut is supposed to be today (or in the next few days),
> what is the status on this? Has it been identified / reverted? Or is there
> any other plan?
> On Tue, Jul 10, 2018 at 2:50 PM Reuven Lax  wrote:
>>
>> If we added something slow to the core library in order to better test 
>> DirectRunner, that does sound like an unfortunate bug.
>>
>> On Mon, Jul 9, 2018 at 11:21 PM Vojtech Janota  
>> wrote:
>>>
>>> Hi guys,
>>>
>>> Thank you for all of your feedback. I have created relevant issue in JIRA: 
>>> https://issues.apache.org/jira/browse/BEAM-4750
>>>
>>> @Lukasz: me mentioning the DirectRunner was somewhat unfortunate - the 
>>> bottleneck was introduced into the core library and so Flink and Spark 
>>> runners would be impacted too
>>>
>>> Thanks,
>>> Vojta
>>>
>>> On Mon, Jul 9, 2018 at 5:48 PM, Lukasz Cwik  wrote:

 Instead of reverting/working around specific checks/tests that the 
 DirectRunner is doing, have you considered using one of the other runners 
 like Flink or Spark with a local execution cluster. You won't hit the 
 validation/verification bottlenecks that DirectRunner specifically imposes.

 On Mon, Jul 9, 2018 at 8:46 AM Jean-Baptiste Onofré  
 wrote:
>
> Thanks for the update Eugene.
>
> @Vojta: do you mind to create a Jira ? I will tackle a fix for that.
>
> Regards
> JB
>
> On 09/07/2018 17:33, Eugene Kirpichov wrote:
>> Hi -
>>
>> If I remember correctly, the reason for this change was to ensure that
>> the state is encodable at all. Prior to the change, there had been
>> situations where the coder specified on a state cell is buggy, absent or
>> set incorrectly (due to some issue in coder inference), but direct
>> runner did not detect this because it never tried to encode the state
>> cells - this would have blown up in any distributed runner.
>>
>> I think it should be possible to relax this and clone only values being
>> added to the state, rather than cloning the whole state on copy(). I
>> don't have time to work on this change myself, but I can review a PR if
>> someone else does.
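The trade-off under discussion — copying by reference versus round-tripping each value through its coder — can be illustrated in miniature; pickle stands in for a Beam coder here, purely as an assumption for illustration:

```python
import pickle

def clone_via_coder(value):
    # Encode/decode round trip, analogous to cloning a state value with
    # its coder: it catches unencodable values (the bug class the direct
    # runner wants to expose) but pays a serialization on every copy.
    return pickle.loads(pickle.dumps(value))

def clone_by_reference(value):
    # Plain assignment: free, but never exercises the coder, so a buggy
    # or missing coder only blows up later on a distributed runner.
    return value
```

Relaxing the check as suggested would mean applying the coder round trip only to values newly added to state, while the untouched remainder is copied by reference.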
>>
>> On Mon, Jul 9, 2018 at 8:28 AM Jean-Baptiste Onofré > > wrote:
>>
>> Hi Vojta,
>>
>> I fully agree, that's why it makes sense to wait Eugene's feedback.
>>
>> I remember we had some performance regression on the direct runner
>> identified thanks to Nexmark, but it has been addressed by reverting 
>> a
>> change.
>>
>> Good catch anyway !
>>
>> Regards
>> JB
>>
>> On 09/07/2018 17:20, Vojtech Janota wrote:
>> > Hi Reuven,
>> >
>> > I'm not really complaining about DirectRunner. In fact it seems to
>> me as
>> > if what previously was considered as part of the "expensive extra
>> > checks" done by the DirectRunner is now done within the
>> > beam-runners-core-java library. Considering that all objects 
>> involved
>> > are immutable (in our case at least) and simple assignment is
>> > sufficient, the serialization-deserialization really seems as 
>> unwanted
>> > and hugely expensive correctness check. If there was a problem with
>> > identity copy, wasn't DirectRunner supposed to reveal it?
>> >
>> > Regards,
>> > Vojta
>> >
>> > On Mon, Jul 9, 2018 at 4:46 PM, Reuven Lax > 
>> > >> wrote:
>> >
>> > Hi Vojita,
>> >
>> > One problem is that the DirectRunner is designed for testing, 
>> not
>> > for performance. The DirectRunner currently does many
>> > purposely-inefficient things, the point of which is to better
>> expose
>> > potential bugs in tests. For example, the DirectRunner will
>> randomly
>> > shuffle the order of PCollections to ensure that your code
>> does not
>> > rely on ordering.  All of this adds cost, because the current
>> runner
>> > is designed for testing. There have been requests in the past
>> for an
>> > "optimized" local runner, however we don't currently have such
>> a thing.
>> >
>> > In this case, using coders to clone values is more correct. In 
>> a
>> > distributed environment using encode/decode is the only way to
>> copy
>> > values, and the DirectRunner is trying to ensure that your 
>> code is
>> > correct in a distributed environment.
>> >
>> > Reuven
>> >
>> > On Mon, Jul 9, 2018 at 

Re: An update on Eugene

2018-07-17 Thread Etienne Chauchot
Hi Eugene, 

I'm also sad to hear this; you've always been a very valuable team member of
the Beam community. We will miss you.
Thanks for everything you've done and all the good things you brought to Beam.

I wish you very happy and successful new adventures!

Hope to have updates in the future.

Etienne

On Tuesday, July 17, 2018 at 01:00 +0200, Ismaël Mejía wrote:
> I am sad to read this, but at the same happy for you and your future
> adventures Eugene.
> 
> Thanks a lot for all the work you have done in this project, all the
> work on SDF, the improvements on composability and the File-based IOs,
> and of course for your reviews that really helped improve the quality
> in many areas of the project as well as other areas that I probably
> forget. In general it has been really nice to see the way you grew in
> the open source side of the project too, and your presence will be
> definitely missed.
> 
> Best wishes for the work on the new programming model and hope to hear
> back from you in the future.
> 

Re: Let's start getting rid of BoundedSource

2018-07-17 Thread Etienne Chauchot
Hi Eugene
On Monday, July 16, 2018 at 07:52 -0700, Eugene Kirpichov wrote:
> Hi Etienne - thanks for catching this; indeed, I somehow missed that actually 
> several runners do this same thing - it
> seemed to me like something that can be done in user code (because it involves 
> combining estimated size + split in
> pretty much the same way), 

When you say "user code", you mean IO writer code as opposed to runner code, 
right?


> but I'm not so sure: even though many runners have a "desired parallelism" 
> option or alike, it's not all of them, so
> we can't use such an option universally.

Agree, cannot be universal
> Maybe then the right thing to do is to:
> - Use bounded SDFs for these
> - Change SDF @SplitRestriction API to take a desired number of splits as a 
> parameter, and introduce an API
> @EstimateOutputSizeBytes(element) valid only on bounded SDFs
Agree with the idea, but @EstimateOutputSizeBytes must return the size of the 
dataset, not of an element. On some runners, each worker is set to a given 
amount of heap. Thus, it is important that a runner can evaluate the size of 
the whole dataset to determine the size of each split (to fit in the workers' 
memory) and thus tell the bounded SDF the number of desired splits.
> - Add some plumbing to the standard bounded SDF expansion so that different 
> runners can compute that parameter
> differently, the two standard ways being "split into given number of splits" 
> or "split based on the sub-linear formula
> of estimated size".
> 
> I think this would work, though this is somewhat more work than I 
> anticipated. Any alternative ideas?
+1 It will be very similar for an IO developer (@EstimateOutputSizeBytes will 
be similar to
source.getEstimatedSizeBytes(), and @SplitRestriction(desiredSplits) similar to 
source.split(desiredBundleSize))
Etienne
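
As a rough illustration of the API shape proposed above, here is a sketch of what a bounded SDF's split-restriction logic could do with a desired-splits parameter: divide an offset range into roughly equal sub-ranges. The class and method names are invented for illustration; this is not the existing Beam SDF API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: splitting an offset-range restriction [from, to)
// into at most desiredSplits roughly equal sub-ranges, as a
// @SplitRestriction(desiredSplits) method (proposed in this thread,
// not an existing Beam annotation) might do.
public class RestrictionSplit {
  static class OffsetRange {
    final long from, to;
    OffsetRange(long from, long to) { this.from = from; this.to = to; }
  }

  static List<OffsetRange> split(OffsetRange r, int desiredSplits) {
    List<OffsetRange> out = new ArrayList<>();
    long size = r.to - r.from;
    // Ceiling division so we never produce more than desiredSplits ranges.
    long step = Math.max(1, (size + desiredSplits - 1) / desiredSplits);
    for (long start = r.from; start < r.to; start += step) {
      out.add(new OffsetRange(start, Math.min(start + step, r.to)));
    }
    return out;
  }

  public static void main(String[] args) {
    for (OffsetRange r : split(new OffsetRange(0, 100), 3)) {
      System.out.println(r.from + ".." + r.to);
    }
  }
}
```

A runner that knows the whole dataset's estimated size and its workers' heap could derive `desiredSplits = estimatedSizeBytes / maxBytesPerWorker` before calling such a method.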
> On Mon, Jul 16, 2018 at 3:07 AM Etienne Chauchot  wrote:
> > Hi, 
> > thanks Eugene for analyzing and sharing that.
> > I have one comment inline
> > 
> > Etienne
> > 
On Sunday, July 15, 2018 at 14:20 -0700, Eugene Kirpichov wrote:
> > > Hey beamers,
> > > I've always wondered whether the BoundedSource implementations in the 
> > > Beam SDK are worth their complexity, or
> > > whether they rather could be converted to the much easier to code ParDo 
> > > style, which is also more modular and
> > > allows you to very easily implement readAll().
> > > 
> > > There's a handful: file-based sources, BigQuery, Bigtable, HBase, 
> > > Elasticsearch, MongoDB, Solr and a couple more.
> > > 
> > > Curiously enough, BoundedSource vs. ParDo matters *only* on Dataflow, 
> > > because AFAICT Dataflow is the only runner
> > > that cares about the things that BoundedSource can do and ParDo can't:
> > > - size estimation (used to choose an initial number of workers) [ok, 
> > > Flink calls the function to return
> > > statistics, but doesn't seem to do anything else with it]
> > 
> > => Spark uses size estimation to set desired bundle size with something 
> > like desiredBundleSize = estimatedSize /
> > nbOfWorkersConfigured (partitions)
> > See 
> > https://github.com/apache/beam/blob/a5634128d194161aebc8d03229fdaa1066cf7739/runners/spark/src/main/java/org/apache/beam/runners/spark/io/SourceRDD.java#L101
> > 
> > 
> > > - splitting into bundles of given size (Dataflow chooses the number of 
> > > bundles to create based on a simple formula
> > > that's not entirely unlike K*sqrt(size))
> > > - liquid sharding (splitAtFraction())
> > > 
> > > If Dataflow didn't exist, there'd be no reason at all to use 
> > > BoundedSource. So the question "which ones can be
> > > converted to ParDo" is really "which ones are used on Dataflow in ways 
> > > that make these functions matter".
> > > Previously, my conservative assumption was that the answer is "all of 
> > > them", but turns out this is not so.
> > > 
> > > Liquid sharding always matters; if the source is liquid-shardable, for 
> > > now we have to keep it a source (until SDF
> > > gains liquid sharding - which should happen in a quarter or two I think).
> > > 
> > > Choosing number of bundles to split into is easily done in SDK code, see 
> > > https://github.com/apache/beam/pull/5886 
> > > for example; DatastoreIO does something similar.
> > > 
> > > The remaining thing to analyze is, when does initial scaling matter. So 
> > > as a member of the Dataflow team, I
> > > analyzed statistics of production Dataflow jobs in the past month. I 
> > > cannot share my queries or the data, 
> > > because they are proprietary to Google - so I am sharing just the general 
> > > methodology and conclusions, because
> > > they matter to the Beam community. I looked at a few criteria, such as:
> > > - The job should be not too short and not too long: if it's too short 
> > > then scaling couldn't have kicked in much at
> > > all; if it's too long then dynamic autoscaling would have been sufficient.
> > > - The job should use, at peak, at least a handful of 
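
The two splitting heuristics discussed in this thread (Dataflow's bundle count "not entirely unlike K*sqrt(size)", and Spark's desiredBundleSize = estimatedSize / nbOfWorkersConfigured) can be sketched as plain helper functions. The constant K and the clamping bounds below are made-up values for illustration, not Dataflow's actual parameters.

```java
// Illustrative sketch (not Beam or Dataflow code) of the two bundle-splitting
// heuristics mentioned in this thread. K and the min/max clamps are assumed
// values for demonstration only.
public class SplitHeuristics {
  // Dataflow-style: number of bundles roughly K * sqrt(size), clamped.
  static long desiredNumBundles(long estimatedSizeBytes) {
    final double K = 0.02;                      // assumed tuning constant
    long n = (long) (K * Math.sqrt((double) estimatedSizeBytes));
    return Math.max(1, Math.min(n, 1000));      // assumed clamping bounds
  }

  // Spark-style (see SourceRDD): bundle size = estimated size / parallelism.
  static long desiredBundleSizeBytes(long estimatedSizeBytes, int numPartitions) {
    return Math.max(1, estimatedSizeBytes / Math.max(1, numPartitions));
  }

  public static void main(String[] args) {
    System.out.println(desiredNumBundles(1L << 30));          // ~1 GiB input
    System.out.println(desiredBundleSizeBytes(1L << 30, 8));  // 8 partitions
  }
}
```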

Re: CODEOWNERS for apache/beam repo

2018-07-17 Thread Ismaël Mejía
Is there a way to put inactive people as not reviewers for the blame
case? I think it can be useful considering that a good amount of our
committers are not active at the moment and auto-assigning reviews to
them seems like a waste of energy/time.
On Tue, Jul 17, 2018 at 1:58 AM Eugene Kirpichov  wrote:
>
> We did not, but I think we should. So far, in 100% of the PRs I've authored, 
> the default functionality of CODEOWNERS did the wrong thing and I had to fix 
> something up manually.
>
> On Mon, Jul 16, 2018 at 3:42 PM Andrew Pilloud  wrote:
>>
>> This sounds like a good plan. Did we want to rename the CODEOWNERS file to 
>> disable github's mass adding of reviewers while we figure this out?
>>
>> Andrew
>>
>> On Mon, Jul 16, 2018 at 10:20 AM Jean-Baptiste Onofré  
>> wrote:
>>>
>>> +1
>>>
>>> On Jul 16, 2018, at 19:17, Holden Karau  wrote:

 Ok if no one objects I'll create the INFRA ticket after OSCON and we can 
 test it for a week and decide if it helps or hinders.

 On Mon, Jul 16, 2018, 7:12 PM Jean-Baptiste Onofré < j...@nanthrax.net> 
 wrote:
>
> Agree to test it for a week.
>
> Regards
> JB
> On Jul 16, 2018, at 18:59, Holden Karau < holden.ka...@gmail.com> wrote:
>>
>> Would folks be OK with me asking infra to turn on blame based 
>> suggestions for Beam and trying it out for a week?
>>
>> On Mon, Jul 16, 2018, 6:53 PM Rafael Fernandez < rfern...@google.com> 
>> wrote:
>>>
>>> +1 using blame -- nifty :)
>>>
>>> On Mon, Jul 16, 2018 at 2:31 AM Huygaa Batsaikhan < bat...@google.com> 
>>> wrote:

 +1. This is great.

 On Sat, Jul 14, 2018 at 7:44 AM Udi Meiri < eh...@google.com> wrote:
>
> Mention bot looks cool, as it tries to guess the reviewer using blame.
> I've written a quick and dirty script that uses only CODEOWNERS.
>
> Its output looks like:
> $ python suggest_reviewers.py --pr 5940
> INFO:root:Selected reviewer @lukecwik for: 
> /runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/PTransformMatchers.java
>  (path_pattern: /runners/core-construction-java*)
> INFO:root:Selected reviewer @lukecwik for: 
> /runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/SplittableParDoNaiveBounded.java
>  (path_pattern: /runners/core-construction-java*)
> INFO:root:Selected reviewer @echauchot for: 
> /runners/core-java/src/main/java/org/apache/beam/runners/core/SplittableParDoViaKeyedWorkItems.java
>  (path_pattern: /runners/core-java*)
> INFO:root:Selected reviewer @lukecwik for: 
> /runners/flink/build.gradle (path_pattern: */build.gradle*)
> INFO:root:Selected reviewer @lukecwik for: 
> /runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkTransformOverrides.java
>  (path_pattern: *.java)
> INFO:root:Selected reviewer @pabloem for: 
> /runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java
>  (path_pattern: /runners/google-cloud-dataflow-java*)
> INFO:root:Selected reviewer @lukecwik for: 
> /sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/SplittableDoFnTest.java
>  (path_pattern: /sdks/java/core*)
> Suggested reviewers: @echauchot, @lukecwik, @pabloem
>
> Script is in: https://github.com/apache/beam/pull/5951
>
>
> What does the community think? Do you prefer blame-based or 
> rules-based reviewer suggestions?
>
> On Fri, Jul 13, 2018 at 11:13 AM Holden Karau < hol...@pigscanfly.ca> 
> wrote:
>>
>> I'm looking at something similar in the Spark project, and while 
>> it's now archived by FB it seems like something like  
>> https://github.com/facebookarchive/mention-bot might do what we 
>> want. I'm going to spin up a version on my K8 cluster and see if I 
>> can ask infra to add a webhook and if it works for Spark we could 
>> ask INFRA to add a second webhook for Beam. (Or if the Beam folks 
>> are more interested in experimenting I can do Beam first as a 
>> smaller project and roll with that).
>>
>> Let me know :)
>>
>> On Fri, Jul 13, 2018 at 10:53 AM, Eugene Kirpichov 
>>  wrote:
>>>
>>> Sounds reasonable for now, thanks!
>>> It's unfortunate that Github's CODEOWNERS feature appears to be 
>>> effectively unusable for Beam but I'd hope that Github might pay 
>>> attention and fix things if we submit feedback, with us being one 
>>> of the most active Apache projects - did anyone do this yet / 
>>> planning to?
>>>
>>> On Fri, Jul 13, 2018 at 10:23 AM Udi Meiri < 
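
A rules-based matcher like the one in Udi's script can be sketched as follows. The patterns and owners are copied from the sample output above; the matching logic (prefix matching for "/dir*" rules, suffix matching for "*.ext" rules, first rule wins) is an assumption for illustration, not the actual script.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch of rules-based reviewer suggestion from CODEOWNERS-style
// patterns. The matching here is a simplification; GitHub's real CODEOWNERS
// semantics (last matching rule wins, full glob support) differ.
public class SuggestReviewers {
  static final Map<String, String> RULES = new LinkedHashMap<>();
  static {
    // Patterns taken from the sample script output in this thread.
    RULES.put("/runners/core-construction-java*", "@lukecwik");
    RULES.put("/runners/core-java*", "@echauchot");
    RULES.put("/runners/google-cloud-dataflow-java*", "@pabloem");
    RULES.put("*.java", "@lukecwik");  // catch-all extension rule
  }

  static String ownerFor(String file) {
    for (Map.Entry<String, String> rule : RULES.entrySet()) {
      String p = rule.getKey();
      boolean hit = p.startsWith("/")
          ? file.startsWith(p.substring(0, p.length() - 1))  // "/dir*" prefix rule
          : file.endsWith(p.substring(1));                   // "*.ext" suffix rule
      if (hit) { return rule.getValue(); }  // first matching rule wins
    }
    return null;  // no rule matched
  }

  public static void main(String[] args) {
    String[] files = {
      "/runners/core-java/src/main/java/Foo.java",
      "/runners/flink/src/main/java/Bar.java",
    };
    Set<String> reviewers = new TreeSet<>();
    for (String f : files) reviewers.add(ownerFor(f));
    System.out.println("Suggested reviewers: " + reviewers);
  }
}
```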

Build failed in Jenkins: beam_Release_Gradle_NightlySnapshot #103

2018-07-17 Thread Apache Jenkins Server
See 


Changes:

[iemejia] Rename COMBINE_TRANSFORM_URN to the more proper

[kirpichov] Adds a naive implementation of bounded SDFs

[echauchot] Move echauchot from runner/core to runner/core/metrics in CODEOWNERS

[kirpichov] Makes SplittableDoFnTest exercise both bounded and unbounded SDFs.

[kirpichov] Supports bounded SDF in all runners.

[ryan.blake.williams] send PutArtifactResponse in 
BeamFileSystemArtifactStagingService

[aaltay] [BEAM-2810] fastavro integration test (#5862)

[kirpichov] Improve error messages on Spark and Flink SDF translation

[qinyeli] Interactive Beam -- yields Read Trans after apply()

--
[...truncated 18.36 MB...]
:beam-sdks-java-maven-archetypes-starter:compileTestJava (Thread[Task worker 
for ':' Thread 12,5,main]) completed. Took 0.001 secs.
:beam-sdks-java-maven-archetypes-starter:processTestResources (Thread[Task 
worker for ':' Thread 12,5,main]) started.

> Task :beam-sdks-java-maven-archetypes-starter:processTestResources UP-TO-DATE
Build cache key for task 
':beam-sdks-java-maven-archetypes-starter:processTestResources' is 
f74f3200edf284b276c50da93794d928
Caching disabled for task 
':beam-sdks-java-maven-archetypes-starter:processTestResources': Caching has 
not been enabled for the task
Skipping task ':beam-sdks-java-maven-archetypes-starter:processTestResources' 
as it is up-to-date.
:beam-sdks-java-maven-archetypes-starter:processTestResources (Thread[Task 
worker for ':' Thread 12,5,main]) completed. Took 0.002 secs.
:beam-sdks-java-maven-archetypes-starter:testClasses (Thread[Task worker for 
':' Thread 12,5,main]) started.

> Task :beam-sdks-java-maven-archetypes-starter:testClasses UP-TO-DATE
Skipping task ':beam-sdks-java-maven-archetypes-starter:testClasses' as it has 
no actions.
:beam-sdks-java-maven-archetypes-starter:testClasses (Thread[Task worker for 
':' Thread 12,5,main]) completed. Took 0.0 secs.
:beam-sdks-java-maven-archetypes-starter:shadowTestJar (Thread[Task worker for 
':' Thread 12,5,main]) started.

> Task :beam-sdks-java-maven-archetypes-starter:shadowTestJar
Build cache key for task 
':beam-sdks-java-maven-archetypes-starter:shadowTestJar' is 
b439dd96e32545ee68d96fd5ec040d99
Caching disabled for task 
':beam-sdks-java-maven-archetypes-starter:shadowTestJar': Caching has not been 
enabled for the task
Task ':beam-sdks-java-maven-archetypes-starter:shadowTestJar' is not up-to-date 
because:
  No history is available.
***
GRADLE SHADOW STATS

Total Jars: 1 (includes project)
Total Time: 0.0s [0ms]
Average Time/Jar: 0.0s [0.0ms]
***
:beam-sdks-java-maven-archetypes-starter:shadowTestJar (Thread[Task worker for 
':' Thread 12,5,main]) completed. Took 0.007 secs.
:beam-sdks-java-maven-archetypes-starter:sourcesJar (Thread[Task worker for ':' 
Thread 12,5,main]) started.

> Task :beam-sdks-java-maven-archetypes-starter:sourcesJar
file or directory 
'
 not found
Build cache key for task ':beam-sdks-java-maven-archetypes-starter:sourcesJar' 
is a106f15937cacfee668e25636b705e03
Caching disabled for task 
':beam-sdks-java-maven-archetypes-starter:sourcesJar': Caching has not been 
enabled for the task
Task ':beam-sdks-java-maven-archetypes-starter:sourcesJar' is not up-to-date 
because:
  No history is available.
file or directory 
'
 not found
:beam-sdks-java-maven-archetypes-starter:sourcesJar (Thread[Task worker for ':' 
Thread 12,5,main]) completed. Took 0.003 secs.
:beam-sdks-java-maven-archetypes-starter:testSourcesJar (Thread[Task worker for 
':' Thread 12,5,main]) started.

> Task :beam-sdks-java-maven-archetypes-starter:testSourcesJar
file or directory 
'
 not found
Build cache key for task 
':beam-sdks-java-maven-archetypes-starter:testSourcesJar' is 
58715d6b8e221cace68f230ccfd69fd4
Caching disabled for task 
':beam-sdks-java-maven-archetypes-starter:testSourcesJar': Caching has not been 
enabled for the task
Task ':beam-sdks-java-maven-archetypes-starter:testSourcesJar' is not 
up-to-date because:
  No history is available.
file or directory 
'
 not found
:beam-sdks-java-maven-archetypes-starter:testSourcesJar (Thread[Task worker for 
':' Thread 12,5,main]) completed. Took 0.005 secs.
:beam-sdks-java-nexmark:generatePomFileForMavenJavaPublication (Thread[Task 
worker for ':' Thread 3,5,main]) started.

> Task :beam-sdks-java-nexmark:generatePomFileForMavenJavaPublication

Re: Performance issue in Beam 2.4 onwards

2018-07-17 Thread Ismaël Mejía
Given that the 2.6.0 cut is supposed to be today (or in the next few days), what
is the status on this? Has it been identified / reverted, or is there
any other plan?
On Tue, Jul 10, 2018 at 2:50 PM Reuven Lax  wrote:
>
> If we added something slow to the core library in order to better test 
> DirectRunner, that does sound like an unfortunate bug.
>
> On Mon, Jul 9, 2018 at 11:21 PM Vojtech Janota  wrote:
>>
>> Hi guys,
>>
>> Thank you for all of your feedback. I have created relevant issue in JIRA: 
>> https://issues.apache.org/jira/browse/BEAM-4750
>>
>> @Lukasz: me mentioning the DirectRunner was somewhat unfortunate - the 
>> bottleneck was introduced into the core library and so Flink and Spark 
>> runners would be impacted too
>>
>> Thanks,
>> Vojta
>>
>> On Mon, Jul 9, 2018 at 5:48 PM, Lukasz Cwik  wrote:
>>>
>>> Instead of reverting/working around specific checks/tests that the 
>>> DirectRunner is doing, have you considered using one of the other runners 
>>> like Flink or Spark with a local execution cluster? You won't hit the 
>>> validation/verification bottlenecks that DirectRunner specifically imposes.
>>>
>>> On Mon, Jul 9, 2018 at 8:46 AM Jean-Baptiste Onofré  
>>> wrote:

 Thanks for the update Eugene.

 @Vojta: do you mind creating a Jira? I will tackle a fix for that.

 Regards
 JB

 On 09/07/2018 17:33, Eugene Kirpichov wrote:
 > Hi -
 >
 > If I remember correctly, the reason for this change was to ensure that
 > the state is encodable at all. Prior to the change, there had been
 > situations where the coder specified on a state cell is buggy, absent or
 > set incorrectly (due to some issue in coder inference), but direct
 > runner did not detect this because it never tried to encode the state
 > cells - this would have blown up in any distributed runner.
 >
 > I think it should be possible to relax this and clone only values being
 > added to the state, rather than cloning the whole state on copy(). I
 > don't have time to work on this change myself, but I can review a PR if
 > someone else does.
 >
 > On Mon, Jul 9, 2018 at 8:28 AM Jean-Baptiste Onofré >>> > > wrote:
 >
 > Hi Vojta,
 >
 > I fully agree, that's why it makes sense to wait Eugene's feedback.
 >
 > I remember we had some performance regression on the direct runner
 > identified thanks to Nexmark, but it has been addressed by reverting 
 > a
 > change.
 >
 > Good catch anyway !
 >
 > Regards
 > JB
 >
 > On 09/07/2018 17:20, Vojtech Janota wrote:
 > > Hi Reuven,
 > >
 > > I'm not really complaining about DirectRunner. In fact it seems to
 > me as
 > > if what previously was considered as part of the "expensive extra
 > > checks" done by the DirectRunner is now done within the
 > > beam-runners-core-java library. Considering that all objects 
 > involved
 > > are immutable (in our case at least) and simple assignment is
 > > sufficient, the serialization-deserialization really seems like an 
 > > unwanted
 > > and hugely expensive correctness check. If there was a problem with
 > > identity copy, wasn't DirectRunner supposed to reveal it?
 > >
 > > Regards,
 > > Vojta
 > >
 > > On Mon, Jul 9, 2018 at 4:46 PM, Reuven Lax >>> > 
 > > >> wrote:
 > >
 > > Hi Vojta,
 > >
 > > One problem is that the DirectRunner is designed for testing, 
 > not
 > > for performance. The DirectRunner currently does many
 > > purposely-inefficient things, the point of which is to better
 > expose
 > > potential bugs in tests. For example, the DirectRunner will
 > randomly
 > > shuffle the order of PCollections to ensure that your code
 > does not
 > > rely on ordering.  All of this adds cost, because the current
 > runner
 > > is designed for testing. There have been requests in the past
 > for an
 > > "optimized" local runner, however we don't currently have such
 > a thing.
 > >
 > > In this case, using coders to clone values is more correct. In 
 > a
 > > distributed environment using encode/decode is the only way to
 > copy
 > > values, and the DirectRunner is trying to ensure that your 
 > code is
 > > correct in a distributed environment.
 > >
 > > Reuven
 > >
 > > On Mon, Jul 9, 2018 at 7:22 AM Vojtech Janota
 > > mailto:vojta.jan...@gmail.com>
 > >> 
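
The coder-based cloning discussed in this thread (round-tripping a value through its encoded form, rather than copying by reference) can be sketched independently of Beam. The toy "coder" below is a stand-in for Beam's Coder API, invented for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in sketch (not Beam's Coder API): cloning a value by round-tripping
// it through its encoded form, as the DirectRunner does to catch buggy or
// missing coders that a simple reference copy would hide.
public class CoderClone {
  // A toy "coder" for List<String>: newline-delimited UTF-8.
  static byte[] encode(List<String> value) {
    return String.join("\n", value).getBytes(StandardCharsets.UTF_8);
  }

  static List<String> decode(byte[] bytes) {
    String s = new String(bytes, StandardCharsets.UTF_8);
    return new ArrayList<>(Arrays.asList(s.split("\n", -1)));
  }

  // Encode/decode is the only copy strategy guaranteed to behave the same
  // way in a distributed environment - which is why the runner uses it,
  // even though it is far more expensive than an identity copy.
  static List<String> clone(List<String> value) {
    return decode(encode(value));
  }

  public static void main(String[] args) {
    List<String> original = Arrays.asList("a", "b", "c");
    List<String> copy = clone(original);
    // Equal content, but a distinct object - mutations cannot leak across.
    System.out.println(copy.equals(original) && copy != original);
  }
}
```

Relaxing the check as Eugene suggests would mean calling `clone` only on values being added to state, instead of re-encoding the whole state on every copy().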

Re: BiqQueryIO.write and Wait.on

2018-07-17 Thread Carlos Alonso
All good so far. I've been a bit sidetracked, but more or less I have the
idea of using the JobStatus as part of the collection so that not only the
completion is signaled, but also the result (success/failure) can be
accessed. How does that sound?

Regards
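
A minimal sketch of that shape: a per-destination result carrying the terminal job status, so a downstream stage can both wait on completion and branch on success/failure. The names (WriteOutcome, JobStatus) are invented for illustration; this is not the actual BigQueryIO WriteResult API.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the idea above: a write result exposing, per
// destination, the terminal status of its load job. In Beam terms this would
// correspond to a PCollection of KV<DestinationT, JobStatus> usable both as a
// Wait.on() signal and for success/failure handling.
public class WriteOutcome<DestinationT> {
  public enum JobStatus { SUCCEEDED, FAILED }

  private final Map<DestinationT, JobStatus> perDestination;

  public WriteOutcome(Map<DestinationT, JobStatus> perDestination) {
    this.perDestination = new LinkedHashMap<>(perDestination);
  }

  // One entry per destination, available only once that destination's
  // window of writes has completed.
  public Map<DestinationT, JobStatus> getPerDestination() {
    return Collections.unmodifiableMap(perDestination);
  }

  public boolean allSucceeded() {
    return perDestination.values().stream().allMatch(s -> s == JobStatus.SUCCEEDED);
  }
}
```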

On Tue, Jul 17, 2018 at 3:07 AM Eugene Kirpichov 
wrote:

> Hi Carlos,
>
> Any updates / roadblocks you hit?
>
>
> On Tue, Jul 3, 2018 at 7:13 AM Eugene Kirpichov 
> wrote:
>
>> Awesome!! Thanks for the heads up, very exciting, this is going to make a
>> lot of people happy :)
>>
>> On Tue, Jul 3, 2018, 3:40 AM Carlos Alonso  wrote:
>>
>>> + dev@beam.apache.org
>>>
>>> Just a quick email to let you know that I'm starting developing this.
>>>
>>> On Fri, Apr 20, 2018 at 10:30 PM Eugene Kirpichov 
>>> wrote:
>>>
 Hi Carlos,

 Thank you for expressing interest in taking this on! Let me give you a
 few pointers to start, and I'll be happy to help everywhere along the way.

 Basically we want BigQueryIO.write() to return something (e.g. a
 PCollection) that can be used as input to Wait.on().
 Currently it returns a WriteResult, which only contains a
 PCollection of failed inserts - that one can not be used
 directly, instead we should add another component to WriteResult that
 represents the result of successfully writing some data.

 Given that BQIO supports dynamic destination writes, I think it makes
 sense for that to be a PCollection<KV<DestinationT, ???>> so that in theory
 we could sequence different destinations independently (currently Wait.on()
 does not provide such a feature, but it could); and it will require
 changing WriteResult to be WriteResult<DestinationT>. As for what the "???"
 might be - it is something that represents the result of successfully
 writing a window of data. I think it can even be Void, or "?" (wildcard
 type) for now, until we figure out something better.

 Implementing this would require roughly the following work:
 - Add this PCollection<KV<DestinationT, ???>> to WriteResult
 - Modify the BatchLoads transform to provide it on both codepaths:
 expandTriggered() and expandUntriggered()
 ...- expandTriggered() itself writes via 2 codepaths: single-partition
 and multi-partition. Both need to be handled - we need to get a
 PCollection<KV<DestinationT, ???>> from each of them, and Flatten these two
 PCollections together to get the final result. The single-partition
 codepath (writeSinglePartition) under the hood already uses WriteTables
 that returns a KV<DestinationT, ???> so it's directly usable. The
 multi-partition codepath ends in WriteRenameTriggered - unfortunately, this
 codepath drops DestinationT along the way and will need to be refactored a
 bit to keep it until the end.
 ...- expandUntriggered() should be treated the same way.
 - Modify the StreamingWriteTables transform to provide it
 ...- Here also, the challenge is to propagate the DestinationT type all
 the way until the end of StreamingWriteTables - it will need to be
 refactored. After such a refactoring, returning a KV<DestinationT, ???>
 should be easy.

 Another challenge with all of this is backwards compatibility in terms
 of API and pipeline update.
 Pipeline update is much less of a concern for the BatchLoads codepath,
 because it's typically used in batch-mode pipelines that don't get updated.
 I would recommend to start with this, perhaps even with only the
 untriggered codepath (it is much more commonly used) - that will pave the
 way for future work.

 Hope this helps, please ask more if something is unclear!

 On Fri, Apr 20, 2018 at 12:48 AM Carlos Alonso 
 wrote:

> Hey Eugene!!
>
> I’d gladly take a stab on it although I’m not sure how much available
> time I might have to put into but... yeah, let’s try it.
>
> Where should I begin? Is there a Jira issue or shall I file one?
>
> Thanks!
> On Thu, 12 Apr 2018 at 00:41, Eugene Kirpichov 
> wrote:
>
>> Hi,
>>
>> Yes, you're both right - BigQueryIO.write() is currently not
>> implemented in a way that it can be used with Wait.on(). It would 
>> certainly
>> be a welcome contribution to change this - many people expressed interest
>> in specifically waiting for BigQuery writes. Is any of you interested in
>> helping out?
>>
>> Thanks.
>>
>> On Fri, Apr 6, 2018 at 12:36 AM Carlos Alonso 
>> wrote:
>>
>>> Hi Simon, I think your explanation was very accurate, at least to my
>>> understanding. I'd also be interested in getting batch load result's
>>> feedback on the pipeline... hopefully someone may suggest something,
>>> otherwise we could propose submitting a Jira, or even better, a PR!! :)
>>>
>>> Thanks!
>>>
>>> On Thu, Apr 5, 2018 at 2:01 PM Simon Kitching <
>>> simon.kitch...@unbelievable-machine.com> wrote:
>>>
 Hi All,

 I need to