date:20170906

Re: Beam 2.2.0 release

2017-09-06 Thread Jean-Baptiste Onofré


It sounds good to me.

By the way, you will need my help to complete the release process (as you need 
some permissions that you don't have).


Regards
JB

On 09/07/2017 01:00 AM, Reuven Lax wrote:

It sounds like SQL is still not in, and there are a couple of other PRs
that people have requested in 2.2.0. I am mostly out next week, so let's
set September 18 as a target date for cutting the first RC. That should
hopefully give plenty of time to get SQL and the remaining PRs merged into
master.

Reuven

On Thu, Aug 31, 2017 at 3:04 PM, Mingmin Xu  wrote:


Add https://issues.apache.org/jira/browse/BEAM-2833 which is a blocker to
merge DSL_SQL. There may be something wrong in the back-end(maybe
RunnerApi) to handle parametered CustomCoder in TestPipeline.

On Thu, Aug 31, 2017 at 10:38 AM, Jean-Baptiste Onofré 
wrote:


Fair enough.

That's fine for me.

Regards
JB

On Aug 31, 2017, 19:03, at 19:03, Steve Niemitz 
wrote:

I'll chime in as a user who would love to see 2.2.0 sooner than later,
specifically for the file IO Eugene mentioned.  We're using the AvroIO
enhancements extensively, but I am hesitant to run from HEAD in master
in
production.

On Thu, Aug 31, 2017 at 12:42 PM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:


There are a lot of users including very large production customers

who have

been asking specifically for the features that are in 2.2.0 (most of

them

accumulated while 2.1.0 was being iterated on) - mostly I'm referring

to

the vastly improved file IO - and they have been hesitant to use Beam

at

HEAD in production. I think the slight unusualness of having a

release

published soon after the previous release is a small price to pay for
helping those users :)

On Wed, Aug 30, 2017, 11:30 PM Jean-Baptiste Onofré 
wrote:


As we released 2.1.0 couple of weeks ago, it could sound weird to

the

users to
do a 2.2.0 so fast. If we have a blocking issue, we can do a 2.1.1

If

it's

new
features, why not having a release pace in October (2.2.0) ?

Thoughts ?

Regards
JB

On 08/31/2017 08:27 AM, Eugene Kirpichov wrote:

I'd suggest to do 2.2.0 as quickly as possible, and target 2.3.0

for

October. I don't see a reason to delay 2.2.0 until October:

there's a

huge

amount of features worth releasing between when 2.1.0 was cut and

the

current HEAD.

On Wed, Aug 30, 2017 at 10:18 PM Jean-Baptiste Onofré

 wrote:


Any chance to get the RedisIO in this release?
[BEAM-1017] Add RedisIO #1687

Its not my PR but ll be happy to assist if there is anything I

can

do

to

help.

On 30 Aug 2017 22:46, "Daniel Ribeiro"



dependency.

It

is
currently very outdated (v1-rev10-1.22.0, which was released

over a

year

ago

).


On Wed, Aug 30, 2017 at 1:27 PM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:


Thanks Ismael. I've marked these two issues for fix in

2.2.0.

Definitely

agree that at least the first one must be fixed.

Here's the current burndown list


https://issues.apache.org/jira/projects/BEAM/versions/12341044 -

we

should

clean it up.

On Wed, Aug 30, 2017 at 1:20 PM Ismaël Mejía



wrote:



The current master has accumulated a good amount of nice

features

since 2.1.0 so a new release is welcomed. I have two

JIRAs/PR

that

I

think are important to check/solve before the cut:

BEAM-2516 (this is a regression on the performance of

Direct

runner

on

Java). We had never really defined if a performance

regression is

critical to be a blocker. I executed WordCount with the

kinglear.txt

(170KB) file in version 2.1.0 vs the current 2.2.0-SNAPSHOT

and I

found that the execution time passed from 5s to 126s. So

maybe we

need

to review this one before the release. I can understand if

others

consider this a minor issue because the Direct runner is

not

supposed

to be used for production, but this performance regression

can

cause

a

bad impression for a casual user starting with Beam.

BEAM-2790 (fix reading from Amazon S3 via

HadoopFileSystem). I

think

this one is a nice to have. I am not sure that I can tackle

it

for

the

wednesday cut. I’m OOO until the beginning of next week,

but

maybe

someone else can take a look. In the worst case this is not

a

release

blocker but definitely a really

Re: [PROPOSAL] FileIO.write: a modular replacement for FileBasedSink/WriteFiles

2017-09-06 Thread Jean-Baptiste Onofré


Fantastic.

Big +1 for this.

Regards
JB

On 09/07/2017 03:44 AM, Eugene Kirpichov wrote:

Hi,

Please take a look at the following proposal.

I believe, together with the (already available) FileIO.match() and
FileIO.readMatches() this proposal will empower Beam users to address all
use cases of file-based IO I'm aware of - which makes me quite excited.

http://s.apache.org/fileio-write

*We propose a new API for writing files in Beam: FileIO.write(). It is more
modular and cleaner to code against than FileBasedSink, and aims to
completely replace it.*

*FileIO.write() lets an IO author implement only logic and configuration
specific to a particular file format (e.g. Avro) and automatically get all
format-agnostic features, such as sharding, cleanup, windowed writes,
DynamicDestinations, compression, returning the successfully written
filenames, etc.*

TL;DR:

FileIO.write(FileSink { open(dest), write(input), close() })
   .to(input → dest)
   .withFilenamePolicy(dest → prefix, shard pattern)
   .withEverythingElse() // like in WriteFiles



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

[PROPOSAL] FileIO.write: a modular replacement for FileBasedSink/WriteFiles

2017-09-06 Thread Eugene Kirpichov

Hi,

Please take a look at the following proposal.

I believe, together with the (already available) FileIO.match() and
FileIO.readMatches() this proposal will empower Beam users to address all
use cases of file-based IO I'm aware of - which makes me quite excited.

http://s.apache.org/fileio-write

*We propose a new API for writing files in Beam: FileIO.write(). It is more
modular and cleaner to code against than FileBasedSink, and aims to
completely replace it.*

*FileIO.write() lets an IO author implement only logic and configuration
specific to a particular file format (e.g. Avro) and automatically get all
format-agnostic features, such as sharding, cleanup, windowed writes,
DynamicDestinations, compression, returning the successfully written
filenames, etc.*

TL;DR:

FileIO.write(FileSink { open(dest), write(input), close() })
  .to(input → dest)
  .withFilenamePolicy(dest → prefix, shard pattern)
  .withEverythingElse() // like in WriteFiles

Re: Beam 2.2.0 release

2017-09-06 Thread Mingmin Xu

Regarding to BEAM-2833, a PR is there for review(
https://github.com/apache/beam/pull/3803). Once it's completed, I'll move
forward for SQL merge.

Mingmin

On Wed, Sep 6, 2017 at 4:00 PM, Reuven Lax  wrote:

> It sounds like SQL is still not in, and there are a couple of other PRs
> that people have requested in 2.2.0. I am mostly out next week, so let's
> set September 18 as a target date for cutting the first RC. That should
> hopefully give plenty of time to get SQL and the remaining PRs merged into
> master.
>
> Reuven
>
> On Thu, Aug 31, 2017 at 3:04 PM, Mingmin Xu  wrote:
>
> > Add https://issues.apache.org/jira/browse/BEAM-2833 which is a blocker
> to
> > merge DSL_SQL. There may be something wrong in the back-end(maybe
> > RunnerApi) to handle parametered CustomCoder in TestPipeline.
> >
> > On Thu, Aug 31, 2017 at 10:38 AM, Jean-Baptiste Onofré 
> > wrote:
> >
> > > Fair enough.
> > >
> > > That's fine for me.
> > >
> > > Regards
> > > JB
> > >
> > > On Aug 31, 2017, 19:03, at 19:03, Steve Niemitz 
> > > wrote:
> > > >I'll chime in as a user who would love to see 2.2.0 sooner than later,
> > > >specifically for the file IO Eugene mentioned.  We're using the AvroIO
> > > >enhancements extensively, but I am hesitant to run from HEAD in master
> > > >in
> > > >production.
> > > >
> > > >On Thu, Aug 31, 2017 at 12:42 PM, Eugene Kirpichov <
> > > >kirpic...@google.com.invalid> wrote:
> > > >
> > > >> There are a lot of users including very large production customers
> > > >who have
> > > >> been asking specifically for the features that are in 2.2.0 (most of
> > > >them
> > > >> accumulated while 2.1.0 was being iterated on) - mostly I'm
> referring
> > > >to
> > > >> the vastly improved file IO - and they have been hesitant to use
> Beam
> > > >at
> > > >> HEAD in production. I think the slight unusualness of having a
> > > >release
> > > >> published soon after the previous release is a small price to pay
> for
> > > >> helping those users :)
> > > >>
> > > >> On Wed, Aug 30, 2017, 11:30 PM Jean-Baptiste Onofré <
> j...@nanthrax.net>
> > > >> wrote:
> > > >>
> > > >> > As we released 2.1.0 couple of weeks ago, it could sound weird to
> > > >the
> > > >> > users to
> > > >> > do a 2.2.0 so fast. If we have a blocking issue, we can do a 2.1.1
> > > >If
> > > >> it's
> > > >> > new
> > > >> > features, why not having a release pace in October (2.2.0) ?
> > > >> >
> > > >> > Thoughts ?
> > > >> >
> > > >> > Regards
> > > >> > JB
> > > >> >
> > > >> > On 08/31/2017 08:27 AM, Eugene Kirpichov wrote:
> > > >> > > I'd suggest to do 2.2.0 as quickly as possible, and target 2.3.0
> > > >for
> > > >> > > October. I don't see a reason to delay 2.2.0 until October:
> > > >there's a
> > > >> > huge
> > > >> > > amount of features worth releasing between when 2.1.0 was cut
> and
> > > >the
> > > >> > > current HEAD.
> > > >> > >
> > > >> > > On Wed, Aug 30, 2017 at 10:18 PM Jean-Baptiste Onofré
> > > > > > >> >
> > > >> > > wrote:
> > > >> > >
> > > >> > >> With a 2.2.0 in October, I think we can try to move forward on
> > > >> RedisIO.
> > > >> > >>
> > > >> > >> I'm now back from vacation and I will resume the work on this
> > > >IO.
> > > >> > >>
> > > >> > >> Regards
> > > >> > >> JB
> > > >> > >>
> > > >> > >> On 08/30/2017 11:27 PM, Eugene Kirpichov wrote:
> > > >> > >>> RedisIO in 2.2.0 is very unlikely. There's still a lot of
> > > >review
> > > >> > >> remaining
> > > >> > >>> last time I checked on the PR.
> > > >> > >>>
> > > >> > >>> On Wed, Aug 30, 2017 at 2:24 PM Vilhelm von Ehrenheim <
> > > >> > >>> vonehrenh...@gmail.com> wrote:
> > > >> > >>>
> > > >> >  Any chance to get the RedisIO in this release?
> > > >> >  [BEAM-1017] Add RedisIO #1687
> > > >> > 
> > > >> >  Its not my PR but ll be happy to assist if there is anything
> I
> > > >can
> > > >> do
> > > >> > to
> > > >> >  help.
> > > >> > 
> > > >> >  On 30 Aug 2017 22:46, "Daniel Ribeiro"
> > > >>  > > >> > >
> > > >> >  wrote:
> > > >> > 
> > > >> > > It would be great to get a bump on pubsub
> > > >> > > 
> > > >> > dependency.
> > > >> > >> It
> > > >> > > is
> > > >> > > currently very outdated (v1-rev10-1.22.0, which was released
> > > >over a
> > > >> > >> year
> > > >> > > ago
> > > >> > >  > > >> > > api-services-pubsub/v1-rev10-1.22.0/>
> > > >> > > ).
> > > >> > >
> > > >> > >
> > > >> > > On Wed, Aug 30, 2017 at 1:27 PM, Eugene Kirpichov <
> > > >> > > kirpic...@google.com.invalid> wrote:
> > > >> > >
> > > >> > >> Thanks Ismael. I've marked these two issues for fix in
> > > >2.2.0.
> > > >> >  Definitely
> > > >> > >> agree that at least the first one must be fixed.
> > > >> > >>

Re: Beam 2.2.0 release

2017-09-06 Thread Reuven Lax

It sounds like SQL is still not in, and there are a couple of other PRs
that people have requested in 2.2.0. I am mostly out next week, so let's
set September 18 as a target date for cutting the first RC. That should
hopefully give plenty of time to get SQL and the remaining PRs merged into
master.

Reuven

On Thu, Aug 31, 2017 at 3:04 PM, Mingmin Xu  wrote:

> Add https://issues.apache.org/jira/browse/BEAM-2833 which is a blocker to
> merge DSL_SQL. There may be something wrong in the back-end(maybe
> RunnerApi) to handle parametered CustomCoder in TestPipeline.
>
> On Thu, Aug 31, 2017 at 10:38 AM, Jean-Baptiste Onofré 
> wrote:
>
> > Fair enough.
> >
> > That's fine for me.
> >
> > Regards
> > JB
> >
> > On Aug 31, 2017, 19:03, at 19:03, Steve Niemitz 
> > wrote:
> > >I'll chime in as a user who would love to see 2.2.0 sooner than later,
> > >specifically for the file IO Eugene mentioned.  We're using the AvroIO
> > >enhancements extensively, but I am hesitant to run from HEAD in master
> > >in
> > >production.
> > >
> > >On Thu, Aug 31, 2017 at 12:42 PM, Eugene Kirpichov <
> > >kirpic...@google.com.invalid> wrote:
> > >
> > >> There are a lot of users including very large production customers
> > >who have
> > >> been asking specifically for the features that are in 2.2.0 (most of
> > >them
> > >> accumulated while 2.1.0 was being iterated on) - mostly I'm referring
> > >to
> > >> the vastly improved file IO - and they have been hesitant to use Beam
> > >at
> > >> HEAD in production. I think the slight unusualness of having a
> > >release
> > >> published soon after the previous release is a small price to pay for
> > >> helping those users :)
> > >>
> > >> On Wed, Aug 30, 2017, 11:30 PM Jean-Baptiste Onofré 
> > >> wrote:
> > >>
> > >> > As we released 2.1.0 couple of weeks ago, it could sound weird to
> > >the
> > >> > users to
> > >> > do a 2.2.0 so fast. If we have a blocking issue, we can do a 2.1.1
> > >If
> > >> it's
> > >> > new
> > >> > features, why not having a release pace in October (2.2.0) ?
> > >> >
> > >> > Thoughts ?
> > >> >
> > >> > Regards
> > >> > JB
> > >> >
> > >> > On 08/31/2017 08:27 AM, Eugene Kirpichov wrote:
> > >> > > I'd suggest to do 2.2.0 as quickly as possible, and target 2.3.0
> > >for
> > >> > > October. I don't see a reason to delay 2.2.0 until October:
> > >there's a
> > >> > huge
> > >> > > amount of features worth releasing between when 2.1.0 was cut and
> > >the
> > >> > > current HEAD.
> > >> > >
> > >> > > On Wed, Aug 30, 2017 at 10:18 PM Jean-Baptiste Onofré
> > > > >> >
> > >> > > wrote:
> > >> > >
> > >> > >> With a 2.2.0 in October, I think we can try to move forward on
> > >> RedisIO.
> > >> > >>
> > >> > >> I'm now back from vacation and I will resume the work on this
> > >IO.
> > >> > >>
> > >> > >> Regards
> > >> > >> JB
> > >> > >>
> > >> > >> On 08/30/2017 11:27 PM, Eugene Kirpichov wrote:
> > >> > >>> RedisIO in 2.2.0 is very unlikely. There's still a lot of
> > >review
> > >> > >> remaining
> > >> > >>> last time I checked on the PR.
> > >> > >>>
> > >> > >>> On Wed, Aug 30, 2017 at 2:24 PM Vilhelm von Ehrenheim <
> > >> > >>> vonehrenh...@gmail.com> wrote:
> > >> > >>>
> > >> >  Any chance to get the RedisIO in this release?
> > >> >  [BEAM-1017] Add RedisIO #1687
> > >> > 
> > >> >  Its not my PR but ll be happy to assist if there is anything I
> > >can
> > >> do
> > >> > to
> > >> >  help.
> > >> > 
> > >> >  On 30 Aug 2017 22:46, "Daniel Ribeiro"
> > >>  > >> > >
> > >> >  wrote:
> > >> > 
> > >> > > It would be great to get a bump on pubsub
> > >> > > 
> > >> > dependency.
> > >> > >> It
> > >> > > is
> > >> > > currently very outdated (v1-rev10-1.22.0, which was released
> > >over a
> > >> > >> year
> > >> > > ago
> > >> > >  > >> > > api-services-pubsub/v1-rev10-1.22.0/>
> > >> > > ).
> > >> > >
> > >> > >
> > >> > > On Wed, Aug 30, 2017 at 1:27 PM, Eugene Kirpichov <
> > >> > > kirpic...@google.com.invalid> wrote:
> > >> > >
> > >> > >> Thanks Ismael. I've marked these two issues for fix in
> > >2.2.0.
> > >> >  Definitely
> > >> > >> agree that at least the first one must be fixed.
> > >> > >>
> > >> > >> Here's the current burndown list
> > >> > >>
> > >https://issues.apache.org/jira/projects/BEAM/versions/12341044 -
> > >> we
> > >> > > should
> > >> > >> clean it up.
> > >> > >>
> > >> > >> On Wed, Aug 30, 2017 at 1:20 PM Ismaël Mejía
> > >
> > >> >  wrote:
> > >> > >>
> > >> > >>> The current master has accumulated a good amount of nice
> > >features
> > >> > >>> since 2.1.0 so a new release is welcomed. I have two
> > >JIRAs/PR
> > >>

Re: Generating data stream for testing

2017-09-06 Thread Kenneth Knowles

You may find GenerateSequence can serve your needs:
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/GenerateSequence.java

On Wed, Sep 6, 2017 at 2:18 PM, Thomas Weise  wrote:

> Hi Eugene,
>
> TestStream is great for functional testing. I was looking for a way to
> continuously generate data instead of specifying it upfront as collection,
> hence my question regarding the UnboundedSource hierarchy.
>
> Thomas
>
>
> On Tue, Sep 5, 2017 at 10:09 AM, Eugene Kirpichov <
> kirpic...@google.com.invalid> wrote:
>
> > Hi, did you look at TestStream?
> >
> > On Tue, Sep 5, 2017, 9:37 AM Thomas Weise  wrote:
> >
> > > Hi,
> > >
> > > I'm looking for the suitable starting point to create an unbounded
> source
> > > for testing of a pipeline. The source should generate data (let's say
> > > KV), but can also inject watermarks.
> > >
> > > I see couple implementations like TestCountingSource used for runner
> > > testing, is the starting point for users UnboundedSource?
> > >
> > > Thanks,
> > > Thomas
> > >
> >
>

Re: Generating data stream for testing

2017-09-06 Thread Thomas Weise

Hi Eugene,

TestStream is great for functional testing. I was looking for a way to
continuously generate data instead of specifying it upfront as collection,
hence my question regarding the UnboundedSource hierarchy.

Thomas


On Tue, Sep 5, 2017 at 10:09 AM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:

> Hi, did you look at TestStream?
>
> On Tue, Sep 5, 2017, 9:37 AM Thomas Weise  wrote:
>
> > Hi,
> >
> > I'm looking for the suitable starting point to create an unbounded source
> > for testing of a pipeline. The source should generate data (let's say
> > KV), but can also inject watermarks.
> >
> > I see couple implementations like TestCountingSource used for runner
> > testing, is the starting point for users UnboundedSource?
> >
> > Thanks,
> > Thomas
> >
>

Re: Using side inputs in any user code via thread-local side input accessor

2017-09-06 Thread Eugene Kirpichov

Hi,

On Wed, Sep 6, 2017 at 10:55 AM Kenneth Knowles 
wrote:

> On Wed, Sep 6, 2017 at 8:15 AM, Eugene Kirpichov <
> kirpic...@google.com.invalid> wrote:
>
> >
> > The differences are:
> > - The proposal in the doc allows wiring different side inputs to the same
> > Supplier, but I'm not convinced that this is important - you can just as
> > easily call the constructor of your DoFn passing different
> > PCollectionView's for it to capture.
> >
>
> I disagree with this bit about it being "just as easy". Passing the needed
> PCollectionViews to your constructor (or even having a constructor) is a
> pain. Every time I have to do it, it adds a ton of boilerplate that feels
> like pure noise. To make a DoFn reusable it must be made into a named class
> with a constructor, versus inlined with no constructor.

Hm, why? You can have the DoFn be an anonymous class capturing the
PCollectionView into a @SideInput field as a closure.


> A generous analogy
> is is that it is "just" manual closure conversion/currying, changing
> f(side, main) to f(side)(main). But in practice in Beam the second one has
> much more boilerplate.
>
> Also, Beam is worse. We present the user with higher-order functions, which
> is where the actual annoyance comes in. When you want to pardo(f) you have
> to write pardo(f(side))(side, main). Your proposal is to support
> pardo(f(side))(main) and mine is to support pardo(f)(side, main). I still
> propose that we support both (as they get implemented). If you buy in to my
> analogy, then there's decades of precedent and the burden of proof falls
> heavily on whoever doesn't want to support both.
>
I see your point. I think the proposal is compatible with what you're
suggesting too - in DoFn we could have @SideInput *parameters* of type
PCollectionView, with the same semantics as a field.


>
> - My proposal allows getting rid of .withSideInputs() entirely, because the
> > DoFn captures the PCollectionView so you don't need to specify it
> > explicitly for wiring.
> >
>
> I've decided to change to full +1 (whatever that means compared to 0.75 :-)
> to adding support for @SideInput fields, because the benefits outweigh this
> failure mode:
>
> new DoFn {
>   // forgot the annotation
>   private final PCollectionView whatever;
>
>   @ProcessElement public void process(...) {
> whatever.get(); // crash during execution
>   }
> }
>
> But ideas to mitigate that would be cool.

Hm, can't think of anything less hacky than "prohibit having fields of type
PCollectionView that are not public, final, and annotated with @SideInput"
- not sure we'd want to go down this road. I suppose a good error message
in .get() would be sufficient, saying "Did you forget to specify a
requirement for this side input via .withSideInputs() or by annotating the
field as @SideInput" or something like that.

>


> Kenn
>
>
> >
> > On Wed, Sep 6, 2017 at 6:03 AM Lukasz Cwik 
> > wrote:
> >
> > > My concern with the proposal is not the specifics of how it will work
> and
> > > more about it being yet another way on how our API is to be used even
> > > though we have a proposal [1] of an API style we were working towards
> in
> > > Java and Python. I would rather re-open that discussion now about what
> we
> > > want that API to look like for our major features and work towards
> > > consistency (or not if there is a strong argument as to why some
> feature
> > > should have a different style).
> > >
> > > 1: https://s.apache.org/a-new-dofn
> > >
> > > On Wed, Sep 6, 2017 at 12:22 AM, Kenneth Knowles
>  > >
> > > wrote:
> > >
> > > > +0.75 because I'd like to bring up invalid pipelines.
> > > >
> > > > I had proposed side inputs as parameters to DoFn in
> > > > https://s.apache.org/a-new-dofn (specifically at [1]) so the only
> > place
> > > > they are specified is in the graph construction, making the DoFn more
> > > > reusable and errors impossible. I've actually been noodling my way
> > > towards
> > > > this in a branch :-)
> > > >
> > > > Eugene's proposal is a sort of converse, where the side inputs are
> > values
> > > > captured in the closure and not parameters, yet the only place they
> are
> > > > specified is in the DoFn.
> > > >
> > > > I see no conflict between these two. It is very natural to have both
> > the
> > > > capability to accept parameters and the ability to capture variables
> in
> > > the
> > > > closure. Supporting both is totally standard in up-to-date
> programming
> > > > languages.
> > > >
> > > > Today we have the worse of both worlds: PCollectionView behaves as
> > > > something captured in the closure/constructor, but must still be
> > > explicitly
> > > > wired up.
> > > >
> > > > But if I use PCollectionView.get() and have not wired it up in any
> way,
> > > > what happens? Just like today, you can try to .sideInput(...) a thing
> > > that
> > > > is not available. With side inputs as parameters, this is not
>

Re: Using side inputs in any user code via thread-local side input accessor

2017-09-06 Thread Kenneth Knowles

On Wed, Sep 6, 2017 at 8:15 AM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:

>
> The differences are:
> - The proposal in the doc allows wiring different side inputs to the same
> Supplier, but I'm not convinced that this is important - you can just as
> easily call the constructor of your DoFn passing different
> PCollectionView's for it to capture.
>

I disagree with this bit about it being "just as easy". Passing the needed
PCollectionViews to your constructor (or even having a constructor) is a
pain. Every time I have to do it, it adds a ton of boilerplate that feels
like pure noise. To make a DoFn reusable it must be made into a named class
with a constructor, versus inlined with no constructor. A generous analogy
is is that it is "just" manual closure conversion/currying, changing
f(side, main) to f(side)(main). But in practice in Beam the second one has
much more boilerplate.

Also, Beam is worse. We present the user with higher-order functions, which
is where the actual annoyance comes in. When you want to pardo(f) you have
to write pardo(f(side))(side, main). Your proposal is to support
pardo(f(side))(main) and mine is to support pardo(f)(side, main). I still
propose that we support both (as they get implemented). If you buy in to my
analogy, then there's decades of precedent and the burden of proof falls
heavily on whoever doesn't want to support both.

- My proposal allows getting rid of .withSideInputs() entirely, because the
> DoFn captures the PCollectionView so you don't need to specify it
> explicitly for wiring.
>

I've decided to change to full +1 (whatever that means compared to 0.75 :-)
to adding support for @SideInput fields, because the benefits outweigh this
failure mode:

new DoFn {
  // forgot the annotation
  private final PCollectionView whatever;

  @ProcessElement public void process(...) {
whatever.get(); // crash during execution
  }
}

But ideas to mitigate that would be cool.

Kenn


>
> On Wed, Sep 6, 2017 at 6:03 AM Lukasz Cwik 
> wrote:
>
> > My concern with the proposal is not the specifics of how it will work and
> > more about it being yet another way on how our API is to be used even
> > though we have a proposal [1] of an API style we were working towards in
> > Java and Python. I would rather re-open that discussion now about what we
> > want that API to look like for our major features and work towards
> > consistency (or not if there is a strong argument as to why some feature
> > should have a different style).
> >
> > 1: https://s.apache.org/a-new-dofn
> >
> > On Wed, Sep 6, 2017 at 12:22 AM, Kenneth Knowles  >
> > wrote:
> >
> > > +0.75 because I'd like to bring up invalid pipelines.
> > >
> > > I had proposed side inputs as parameters to DoFn in
> > > https://s.apache.org/a-new-dofn (specifically at [1]) so the only
> place
> > > they are specified is in the graph construction, making the DoFn more
> > > reusable and errors impossible. I've actually been noodling my way
> > towards
> > > this in a branch :-)
> > >
> > > Eugene's proposal is a sort of converse, where the side inputs are
> values
> > > captured in the closure and not parameters, yet the only place they are
> > > specified is in the DoFn.
> > >
> > > I see no conflict between these two. It is very natural to have both
> the
> > > capability to accept parameters and the ability to capture variables in
> > the
> > > closure. Supporting both is totally standard in up-to-date programming
> > > languages.
> > >
> > > Today we have the worse of both worlds: PCollectionView behaves as
> > > something captured in the closure/constructor, but must still be
> > explicitly
> > > wired up.
> > >
> > > But if I use PCollectionView.get() and have not wired it up in any way,
> > > what happens? Just like today, you can try to .sideInput(...) a thing
> > that
> > > is not available. With side inputs as parameters, this is not possible.
> > If
> > > you want to treat them as captured in a closure, while avoiding errors,
> > it
> > > seems like you might need to do some low-level magic, like the
> > > serialization-based detection that Luke has suggested before (there are
> > > known downsides that we haven't explored, like overcapture).
> > >
> > > Kenn
> > >
> > > [1]
> > > https://docs.google.com/document/d/1ClmQ6LqdnfseRzeSw3SL68DAO1f8j
> > > sWBL2FfzWErlbw/edit#heading=h.1budnm7l01ko
> > >
> > >
> > > On Tue, Sep 5, 2017 at 11:24 PM, Eugene Kirpichov <
> > > kirpic...@google.com.invalid> wrote:
> > >
> > > > Hm, I guess you're right - for outputs it could be indeed quite
> > valuable
> > > to
> > > > output to them without plumbing (e.g. outputting errors). Could be
> done
> > > > perhaps via TupleTag.output()? (assuming the same TupleTag can not be
> > > > reused to tag multiple PCollection's)
> > > >
> > > > For now I sent a PR for side input support
> > > > https://github.com/apache/beam/pull/3814 .
> > > >
> > > > On Tue, Sep 5, 2017 at

Re: Using side inputs in any user code via thread-local side input accessor

2017-09-06 Thread Eugene Kirpichov

Hi Luke,
I think my proposal is very similar in style to the one in A New DoFn
,
so I disagree that it is introducing a divergence from that discussion.
That code proposes:

class MyNewDoFn extends NewDoFn {
  @ProcessElement
  public void process(
  ProcessContext context,
  @SideInput Supplier frob,
  @SideInput Supplier bizzle) {
… frob.get() … bizzle.get() …
  }
}

PCollection input = …

(and it still proposes .withSideInputs() for wiring)

The similarities are:
- Using @SideInput to declare that something is a side input (this code
proposes that on a parameter, my code proposes it on a field, because
fields allow retaining a compile-time fixed signature for interface methods
- less of a concern for DoFn, but big concern for other user callbacks,
which are not considered in that proposal)
- Accessing via .get() - the code proposes using a Supplier, but I don't
think that's materially different from calling .get() on the
PCollectionView directly.

The differences are:
- The proposal in the doc allows wiring different side inputs to the same
Supplier, but I'm not convinced that this is important - you can just as
easily call the constructor of your DoFn passing different
PCollectionView's for it to capture.
- My proposal allows getting rid of .withSideInputs() entirely, because the
DoFn captures the PCollectionView so you don't need to specify it
explicitly for wiring.

On Wed, Sep 6, 2017 at 6:03 AM Lukasz Cwik  wrote:

> My concern with the proposal is not the specifics of how it will work and
> more about it being yet another way on how our API is to be used even
> though we have a proposal [1] of an API style we were working towards in
> Java and Python. I would rather re-open that discussion now about what we
> want that API to look like for our major features and work towards
> consistency (or not if there is a strong argument as to why some feature
> should have a different style).
>
> 1: https://s.apache.org/a-new-dofn
>
> On Wed, Sep 6, 2017 at 12:22 AM, Kenneth Knowles 
> wrote:
>
> > +0.75 because I'd like to bring up invalid pipelines.
> >
> > I had proposed side inputs as parameters to DoFn in
> > https://s.apache.org/a-new-dofn (specifically at [1]) so the only place
> > they are specified is in the graph construction, making the DoFn more
> > reusable and errors impossible. I've actually been noodling my way
> towards
> > this in a branch :-)
> >
> > Eugene's proposal is a sort of converse, where the side inputs are values
> > captured in the closure and not parameters, yet the only place they are
> > specified is in the DoFn.
> >
> > I see no conflict between these two. It is very natural to have both the
> > capability to accept parameters and the ability to capture variables in
> the
> > closure. Supporting both is totally standard in up-to-date programming
> > languages.
> >
> > Today we have the worse of both worlds: PCollectionView behaves as
> > something captured in the closure/constructor, but must still be
> explicitly
> > wired up.
> >
> > But if I use PCollectionView.get() and have not wired it up in any way,
> > what happens? Just like today, you can try to .sideInput(...) a thing
> that
> > is not available. With side inputs as parameters, this is not possible.
> If
> > you want to treat them as captured in a closure, while avoiding errors,
> it
> > seems like you might need to do some low-level magic, like the
> > serialization-based detection that Luke has suggested before (there are
> > known downsides that we haven't explored, like overcapture).
> >
> > Kenn
> >
> > [1]
> > https://docs.google.com/document/d/1ClmQ6LqdnfseRzeSw3SL68DAO1f8j
> > sWBL2FfzWErlbw/edit#heading=h.1budnm7l01ko
> >
> >
> > On Tue, Sep 5, 2017 at 11:24 PM, Eugene Kirpichov <
> > kirpic...@google.com.invalid> wrote:
> >
> > > Hm, I guess you're right - for outputs it could be indeed quite
> valuable
> > to
> > > output to them without plumbing (e.g. outputting errors). Could be done
> > > perhaps via TupleTag.output()? (assuming the same TupleTag can not be
> > > reused to tag multiple PCollection's)
> > >
> > > For now I sent a PR for side input support
> > > https://github.com/apache/beam/pull/3814 .
> > >
> > > On Tue, Sep 5, 2017 at 9:52 PM Lukasz Cwik 
> > > wrote:
> > >
> > > > I disagree, state may not care where it is used as well since a
> person
> > > may
> > > > call a function which needs to store/retrieve state and instead of
> > having
> > > > the DoFn declare the StateSpec and then pass in the state
> > implementation
> > > > down into the function everywhere. Similarly for outputs, the
> internal
> > > > functions could take the TupleTag and request an output manager or
> take
> > > an
> > > > "output" reference which give functions the ability to produce output
> > > >

Re: Using side inputs in any user code via thread-local side input accessor

2017-09-06 Thread Lukasz Cwik

My concern with the proposal is not the specifics of how it will work and
more about it being yet another way on how our API is to be used even
though we have a proposal [1] of an API style we were working towards in
Java and Python. I would rather re-open that discussion now about what we
want that API to look like for our major features and work towards
consistency (or not if there is a strong argument as to why some feature
should have a different style).

1: https://s.apache.org/a-new-dofn

On Wed, Sep 6, 2017 at 12:22 AM, Kenneth Knowles 
wrote:

> +0.75 because I'd like to bring up invalid pipelines.
>
> I had proposed side inputs as parameters to DoFn in
> https://s.apache.org/a-new-dofn (specifically at [1]) so the only place
> they are specified is in the graph construction, making the DoFn more
> reusable and errors impossible. I've actually been noodling my way towards
> this in a branch :-)
>
> Eugene's proposal is a sort of converse, where the side inputs are values
> captured in the closure and not parameters, yet the only place they are
> specified is in the DoFn.
>
> I see no conflict between these two. It is very natural to have both the
> capability to accept parameters and the ability to capture variables in the
> closure. Supporting both is totally standard in up-to-date programming
> languages.
>
> Today we have the worse of both worlds: PCollectionView behaves as
> something captured in the closure/constructor, but must still be explicitly
> wired up.
>
> But if I use PCollectionView.get() and have not wired it up in any way,
> what happens? Just like today, you can try to .sideInput(...) a thing that
> is not available. With side inputs as parameters, this is not possible. If
> you want to treat them as captured in a closure, while avoiding errors, it
> seems like you might need to do some low-level magic, like the
> serialization-based detection that Luke has suggested before (there are
> known downsides that we haven't explored, like overcapture).
>
> Kenn
>
> [1]
> https://docs.google.com/document/d/1ClmQ6LqdnfseRzeSw3SL68DAO1f8j
> sWBL2FfzWErlbw/edit#heading=h.1budnm7l01ko
>
>
> On Tue, Sep 5, 2017 at 11:24 PM, Eugene Kirpichov <
> kirpic...@google.com.invalid> wrote:
>
> > Hm, I guess you're right - for outputs it could be indeed quite valuable
> to
> > output to them without plumbing (e.g. outputting errors). Could be done
> > perhaps via TupleTag.output()? (assuming the same TupleTag can not be
> > reused to tag multiple PCollection's)
> >
> > For now I sent a PR for side input support
> > https://github.com/apache/beam/pull/3814 .
> >
> > On Tue, Sep 5, 2017 at 9:52 PM Lukasz Cwik 
> > wrote:
> >
> > > I disagree, state may not care where it is used as well since a person
> > may
> > > call a function which needs to store/retrieve state and instead of
> having
> > > the DoFn declare the StateSpec and then pass in the state
> implementation
> > > down into the function everywhere. Similarly for outputs, the internal
> > > functions could take the TupleTag and request an output manager or take
> > an
> > > "output" reference which give functions the ability to produce output
> > > directly without needing to pass everything that is needed to be output
> > > back to the caller.
> > >
> > > On Tue, Sep 5, 2017 at 9:23 PM, Eugene Kirpichov <
> > > kirpic...@google.com.invalid> wrote:
> > >
> > > > Hm, I think of these things (state, side outputs etc.), only side
> > inputs
> > > > make sense to access in arbitrary user callbacks without explicit
> > > knowledge
> > > > of the surrounding transform - so only side inputs can be implicit
> like
> > > > this.
> > > >
> > > > Ultimately we'll probably end up removing ProcessContext, and keeping
> > > only
> > > > annotations (on fields / methods / parameters). In that world, a
> field
> > > > annotation could be used (like per my previous email) to statically
> > > specify
> > > > which side inputs will be needed - while the value could still be
> > > accessed
> > > > via .get(), just like state cells are accessed via .read() and
> > .write():
> > > > i.e., #get() is not a new method of access.
> > > >
> > > > Overall, it seems like I should proceed with the idea. I filed
> > > > https://issues.apache.org/jira/browse/BEAM-2844.
> > > >
> > > > On Tue, Sep 5, 2017 at 9:08 PM Lukasz Cwik  >
> > > > wrote:
> > > >
> > > > > For API consistency reasons, it would be good if we did this
> > > holistically
> > > > > and expanded this approach to state, side outputs, ... so that a
> > person
> > > > can
> > > > > always call Something.get() to return something that they can
> access
> > > > > implementation wise. It will be confusing for our users to have
> many
> > > > > variations in our style of how all these concepts are used
> > > > (ProcessContext
> > > > > / Annotations / #get())
> > > > >
> > > > > On Tue, Sep 5, 2017 at 8:08 AM, Eugene Kirpichov <
> > > > >

Side inputs, state, outputs

2017-09-06 Thread Kenneth Knowles

The scope of the side input proposal expanded to include discussion of
state and outputs. I didn't want to pollute the other thread with this, but
I do want to emphasize what is different about these.

PCollectionView: read-only, time-evolving, per window, automatic GC based
on WindowMappingFn(s), independent of key and transform, with only
data-dependency for happens-before. It is a value, basically. Adding a
get() method with indirection to some mutated-into-place context might not
be too bad.

State: read/write, per transform+key+window, trivial concurrency control,
automatic GC. It is a naive mutable field, basically. Instead of StateSpec
separate from State, you could probably put the read/write methods on the
declaration and hack a mutated-into-place context. A bit gross but maybe
there's a usable API in that idea somewhere.

Output: write-only, obviously comes from a particular transform. It is a
control-inverted return value, basically. Omitted from the new DoFn
proposal for simplicity, but make sense as intrinsic read-only fields of
the DoFn if they are not dynamic. One could imagine designs in which a
PCollection is just a Receiver/Consumer so you can capture it in your
closure and write to it without plumbing (and likely with concurrency
control) but that's entirely unexplored AFAIK.

So I agree with Eugene's initial point that only side inputs make sense
independent of transform, until more investigation has taken place.

Kenn


On Tue, Sep 5, 2017 at 11:24 PM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:

> Hm, I guess you're right - for outputs it could be indeed quite valuable to
> output to them without plumbing (e.g. outputting errors). Could be done
> perhaps via TupleTag.output()? (assuming the same TupleTag can not be
> reused to tag multiple PCollection's)
>
> For now I sent a PR for side input support
> https://github.com/apache/beam/pull/3814 .
>
> On Tue, Sep 5, 2017 at 9:52 PM Lukasz Cwik 
> wrote:
>
> > I disagree, state may not care where it is used as well since a person
> may
> > call a function which needs to store/retrieve state and instead of having
> > the DoFn declare the StateSpec and then pass in the state implementation
> > down into the function everywhere. Similarly for outputs, the internal
> > functions could take the TupleTag and request an output manager or take
> an
> > "output" reference which give functions the ability to produce output
> > directly without needing to pass everything that is needed to be output
> > back to the caller.
> >
> > On Tue, Sep 5, 2017 at 9:23 PM, Eugene Kirpichov <
> > kirpic...@google.com.invalid> wrote:
> >
> > > Hm, I think of these things (state, side outputs etc.), only side
> inputs
> > > make sense to access in arbitrary user callbacks without explicit
> > knowledge
> > > of the surrounding transform - so only side inputs can be implicit like
> > > this.
> > >
> > > Ultimately we'll probably end up removing ProcessContext, and keeping
> > only
> > > annotations (on fields / methods / parameters). In that world, a field
> > > annotation could be used (like per my previous email) to statically
> > specify
> > > which side inputs will be needed - while the value could still be
> > accessed
> > > via .get(), just like state cells are accessed via .read() and
> .write():
> > > i.e., #get() is not a new method of access.
> > >
> > > Overall, it seems like I should proceed with the idea. I filed
> > > https://issues.apache.org/jira/browse/BEAM-2844.
> > >
> > > On Tue, Sep 5, 2017 at 9:08 PM Lukasz Cwik 
> > > wrote:
> > >
> > > > For API consistency reasons, it would be good if we did this
> > holistically
> > > > and expanded this approach to state, side outputs, ... so that a
> person
> > > can
> > > > always call Something.get() to return something that they can access
> > > > implementation wise. It will be confusing for our users to have many
> > > > variations in our style of how all these concepts are used
> > > (ProcessContext
> > > > / Annotations / #get())
> > > >
> > > > On Tue, Sep 5, 2017 at 8:08 AM, Eugene Kirpichov <
> > > > kirpic...@google.com.invalid> wrote:
> > > >
> > > > > Also, I think my approach is compatible with annotations and future
> > > > removal
> > > > > of .withSideInputs if we annotate a field:
> > > > > final PCollectionView foo = ...;
> > > > >
> > > > > class MyDoFn {
> > > > >   @SideInput
> > > > >   PCollectionView foo = foo;
> > > > >
> > > > >   ...foo.get()...
> > > > > }
> > > > >
> > > > > We can extract the accessed views from the DoFn instance using
> > > > reflection.
> > > > > Still not compatible with lambdas, but compatible automatically
> with
> > > all
> > > > > anonymous classes.
> > > > >
> > > > > On Tue, Sep 5, 2017, 8:02 AM Eugene Kirpichov <
> kirpic...@google.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Luke,
> > > > > >
> > > > > > I know this (annotations) is the pattern we were considering for
> > side
> >

Re: Using side inputs in any user code via thread-local side input accessor

2017-09-06 Thread Kenneth Knowles

+0.75 because I'd like to bring up invalid pipelines.

I had proposed side inputs as parameters to DoFn in
https://s.apache.org/a-new-dofn (specifically at [1]) so the only place
they are specified is in the graph construction, making the DoFn more
reusable and errors impossible. I've actually been noodling my way towards
this in a branch :-)

Eugene's proposal is a sort of converse, where the side inputs are values
captured in the closure and not parameters, yet the only place they are
specified is in the DoFn.

I see no conflict between these two. It is very natural to have both the
capability to accept parameters and the ability to capture variables in the
closure. Supporting both is totally standard in up-to-date programming
languages.

Today we have the worse of both worlds: PCollectionView behaves as
something captured in the closure/constructor, but must still be explicitly
wired up.

But if I use PCollectionView.get() and have not wired it up in any way,
what happens? Just like today, you can try to .sideInput(...) a thing that
is not available. With side inputs as parameters, this is not possible. If
you want to treat them as captured in a closure, while avoiding errors, it
seems like you might need to do some low-level magic, like the
serialization-based detection that Luke has suggested before (there are
known downsides that we haven't explored, like overcapture).

Kenn

[1]
https://docs.google.com/document/d/1ClmQ6LqdnfseRzeSw3SL68DAO1f8jsWBL2FfzWErlbw/edit#heading=h.1budnm7l01ko


On Tue, Sep 5, 2017 at 11:24 PM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:

> Hm, I guess you're right - for outputs it could be indeed quite valuable to
> output to them without plumbing (e.g. outputting errors). Could be done
> perhaps via TupleTag.output()? (assuming the same TupleTag can not be
> reused to tag multiple PCollection's)
>
> For now I sent a PR for side input support
> https://github.com/apache/beam/pull/3814 .
>
> On Tue, Sep 5, 2017 at 9:52 PM Lukasz Cwik 
> wrote:
>
> > I disagree, state may not care where it is used as well since a person
> may
> > call a function which needs to store/retrieve state and instead of having
> > the DoFn declare the StateSpec and then pass in the state implementation
> > down into the function everywhere. Similarly for outputs, the internal
> > functions could take the TupleTag and request an output manager or take
> an
> > "output" reference which give functions the ability to produce output
> > directly without needing to pass everything that is needed to be output
> > back to the caller.
> >
> > On Tue, Sep 5, 2017 at 9:23 PM, Eugene Kirpichov <
> > kirpic...@google.com.invalid> wrote:
> >
> > > Hm, I think of these things (state, side outputs etc.), only side
> inputs
> > > make sense to access in arbitrary user callbacks without explicit
> > knowledge
> > > of the surrounding transform - so only side inputs can be implicit like
> > > this.
> > >
> > > Ultimately we'll probably end up removing ProcessContext, and keeping
> > only
> > > annotations (on fields / methods / parameters). In that world, a field
> > > annotation could be used (like per my previous email) to statically
> > specify
> > > which side inputs will be needed - while the value could still be
> > accessed
> > > via .get(), just like state cells are accessed via .read() and
> .write():
> > > i.e., #get() is not a new method of access.
> > >
> > > Overall, it seems like I should proceed with the idea. I filed
> > > https://issues.apache.org/jira/browse/BEAM-2844.
> > >
> > > On Tue, Sep 5, 2017 at 9:08 PM Lukasz Cwik 
> > > wrote:
> > >
> > > > For API consistency reasons, it would be good if we did this
> > holistically
> > > > and expanded this approach to state, side outputs, ... so that a
> person
> > > can
> > > > always call Something.get() to return something that they can access
> > > > implementation wise. It will be confusing for our users to have many
> > > > variations in our style of how all these concepts are used
> > > (ProcessContext
> > > > / Annotations / #get())
> > > >
> > > > On Tue, Sep 5, 2017 at 8:08 AM, Eugene Kirpichov <
> > > > kirpic...@google.com.invalid> wrote:
> > > >
> > > > > Also, I think my approach is compatible with annotations and future
> > > > removal
> > > > > of .withSideInputs if we annotate a field:
> > > > > final PCollectionView foo = ...;
> > > > >
> > > > > class MyDoFn {
> > > > >   @SideInput
> > > > >   PCollectionView foo = foo;
> > > > >
> > > > >   ...foo.get()...
> > > > > }
> > > > >
> > > > > We can extract the accessed views from the DoFn instance using
> > > > reflection.
> > > > > Still not compatible with lambdas, but compatible automatically
> with
> > > all
> > > > > anonymous classes.
> > > > >
> > > > > On Tue, Sep 5, 2017, 8:02 AM Eugene Kirpichov <
> kirpic...@google.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Luke,
> > > > > >
> > > > > > I know this

Re: Using side inputs in any user code via thread-local side input accessor

2017-09-06 Thread Eugene Kirpichov

Hm, I guess you're right - for outputs it could be indeed quite valuable to
output to them without plumbing (e.g. outputting errors). Could be done
perhaps via TupleTag.output()? (assuming the same TupleTag can not be
reused to tag multiple PCollection's)

For now I sent a PR for side input support
https://github.com/apache/beam/pull/3814 .

On Tue, Sep 5, 2017 at 9:52 PM Lukasz Cwik  wrote:

> I disagree, state may not care where it is used as well since a person may
> call a function which needs to store/retrieve state and instead of having
> the DoFn declare the StateSpec and then pass in the state implementation
> down into the function everywhere. Similarly for outputs, the internal
> functions could take the TupleTag and request an output manager or take an
> "output" reference which give functions the ability to produce output
> directly without needing to pass everything that is needed to be output
> back to the caller.
>
> On Tue, Sep 5, 2017 at 9:23 PM, Eugene Kirpichov <
> kirpic...@google.com.invalid> wrote:
>
> > Hm, I think of these things (state, side outputs etc.), only side inputs
> > make sense to access in arbitrary user callbacks without explicit
> knowledge
> > of the surrounding transform - so only side inputs can be implicit like
> > this.
> >
> > Ultimately we'll probably end up removing ProcessContext, and keeping
> only
> > annotations (on fields / methods / parameters). In that world, a field
> > annotation could be used (like per my previous email) to statically
> specify
> > which side inputs will be needed - while the value could still be
> accessed
> > via .get(), just like state cells are accessed via .read() and .write():
> > i.e., #get() is not a new method of access.
> >
> > Overall, it seems like I should proceed with the idea. I filed
> > https://issues.apache.org/jira/browse/BEAM-2844.
> >
> > On Tue, Sep 5, 2017 at 9:08 PM Lukasz Cwik 
> > wrote:
> >
> > > For API consistency reasons, it would be good if we did this
> holistically
> > > and expanded this approach to state, side outputs, ... so that a person
> > can
> > > always call Something.get() to return something that they can access
> > > implementation wise. It will be confusing for our users to have many
> > > variations in our style of how all these concepts are used
> > (ProcessContext
> > > / Annotations / #get())
> > >
> > > On Tue, Sep 5, 2017 at 8:08 AM, Eugene Kirpichov <
> > > kirpic...@google.com.invalid> wrote:
> > >
> > > > Also, I think my approach is compatible with annotations and future
> > > removal
> > > > of .withSideInputs if we annotate a field:
> > > > final PCollectionView foo = ...;
> > > >
> > > > class MyDoFn {
> > > >   @SideInput
> > > >   PCollectionView foo = foo;
> > > >
> > > >   ...foo.get()...
> > > > }
> > > >
> > > > We can extract the accessed views from the DoFn instance using
> > > reflection.
> > > > Still not compatible with lambdas, but compatible automatically with
> > all
> > > > anonymous classes.
> > > >
> > > > On Tue, Sep 5, 2017, 8:02 AM Eugene Kirpichov 
> > > > wrote:
> > > >
> > > > > Hi Luke,
> > > > >
> > > > > I know this (annotations) is the pattern we were considering for
> side
> > > > > inputs, but I no longer think it is the best way to access them.
> > > > > Annotations help getting rid of the .withSideInputs() call, but
> this
> > is
> > > > > where their advantage ends.
> > > > >
> > > > > The advantages of the proposed approach are that it automatically
> > works
> > > > > with all existing callback or lambda code. No need to further
> develop
> > > the
> > > > > reflection machinery to support side input annotations - and
> > especially
> > > > to
> > > > > support arbitrary user interfaces, no need to change existing
> > > transforms,
> > > > > no need for transform authors to even know that the machinery
> exists
> > to
> > > > > make side inputs usable in their transforms (and no need for
> authors
> > to
> > > > > think about whether or not they should support side inputs).
> > > > >
> > > > > Moreover, like Reuven says, annotations don't work with lambdas at
> > all:
> > > > > creating a lambda with a flexible set of annotation arguments
> appears
> > > to
> > > > be
> > > > > currently impossible, and even capturing the annotations on
> arguments
> > > of
> > > > a
> > > > > lambda is I believe also impossible because the Java compiler drops
> > > them
> > > > in
> > > > > the generated class or method handle.
> > > > >
> > > > > On Tue, Sep 5, 2017 at 6:57 AM Lukasz Cwik
>  > >
> > > > > wrote:
> > > > >
> > > > >> I believe we should follow the pattern that state uses and add a
> > type
> > > > >> annotation to link the side input definition to its usage
> directly.
> > > This
> > > > >> would allow us to know that the side input was definitely being
> > > accessed
> > > > >> and perform validation during graph construction for any used

Re: Beam 2.2.0 release

Re: [PROPOSAL] FileIO.write: a modular replacement for FileBasedSink/WriteFiles

[PROPOSAL] FileIO.write: a modular replacement for FileBasedSink/WriteFiles

Re: Beam 2.2.0 release

Re: Beam 2.2.0 release

Re: Generating data stream for testing

Re: Generating data stream for testing

Re: Using side inputs in any user code via thread-local side input accessor

Re: Using side inputs in any user code via thread-local side input accessor

Re: Using side inputs in any user code via thread-local side input accessor

Re: Using side inputs in any user code via thread-local side input accessor

Side inputs, state, outputs

Re: Using side inputs in any user code via thread-local side input accessor

Re: Using side inputs in any user code via thread-local side input accessor

14 matches

Site Navigation

Mail list logo

Footer information