Re: Requiring PTransform to set a coder on its resulting collections

2017-08-11 Thread Robert Bradshaw
On Thu, Aug 10, 2017 at 5:06 PM, Reuven Lax wrote: > Interestingly I've seen examples of PTransforms where the transform itself > is unable to easily set its own coder. This happens when the transform is > parametrized in such a way that its ouput coder is not

Re: Requiring PTransform to set a coder on its resulting collections

2017-08-10 Thread Reuven Lax
Interestingly I've seen examples of PTransforms where the transform itself is unable to easily set its own coder. This happens when the transform is parametrized in such a way that its ouput coder is not determinable except by the caller of the PTransform. The caller can of course pass a coder

Re: Requiring PTransform to set a coder on its resulting collections

2017-08-10 Thread Eugene Kirpichov
I've updated the guidance in PTransform Style Guide on setting coders https://beam.apache.org/contribute/ptransform-style-guide/#coders according to this discussion. https://github.com/apache/beam-site/pull/279 On Thu, Aug 3, 2017 at 6:27 PM Robert Bradshaw wrote: >

Re: Requiring PTransform to set a coder on its resulting collections

2017-08-03 Thread Eugene Kirpichov
https://github.com/apache/beam/pull/3649 has landed. The main contribution of this PR is deprecating PTransform.getDefaultOutputCoder(). Next steps are to get rid of all setCoder() calls in the SDK, and deprecate setCoder(). Nearly all setCoder() calls (perhaps simply all?) I found are on the

Re: Requiring PTransform to set a coder on its resulting collections

2017-08-01 Thread Reuven Lax
One interesting wrinkle: I'm about to propose a set of semantics for snapshotting/in-place updating pipelines. Part of this proposal is the ability to write pipelines to "upgrade" snapshots to make them compatible with new graphs. This relies on the ability to have two separate coders for the same

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-27 Thread Kenneth Knowles
Another thought on this: setting a custom coder to support a special data distribution is likely often a property of the input to the pipeline. So setting a coder during pipeline construction - more generally, when writing a composite transform for reuse - you might not actually have the needed

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-27 Thread Kenneth Knowles
On Thu, Jul 27, 2017 at 11:18 AM, Thomas Groh wrote: > introduce a > new, specialized type to represent the restricted > (alternatively-distributed?) data. The TypeDescriptor for this type can map > to the specialized coder, without having to perform a significant

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-27 Thread Jean-Baptiste Onofré
Fully agree Thomas, I think it's way better. Regards JB On 07/27/2017 08:00 PM, Thomas Groh wrote: +1 on getting rid of setCoder; just from a Java SDK perspective, my ideal world contains PCollections which don't have a user-visible way to mutate them. My preference would be to use

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-27 Thread Thomas Groh
I think there's a simpler solution than encoding to byte[]: introduce a new, specialized type to represent the restricted (alternatively-distributed?) data. The TypeDescriptor for this type can map to the specialized coder, without having to perform a significant degree of potentially wasted

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-27 Thread Reuven Lax
I tend to agree with Robert - it would be unfortunate if a single TypeDescrictor was forced to have the same encoding all through the pipeline. However it's also unfortunate if this flexibility impacted every part of the programming model. I also think that our experience has been that "large

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-27 Thread Thomas Groh
+1 on getting rid of setCoder; just from a Java SDK perspective, my ideal world contains PCollections which don't have a user-visible way to mutate them. My preference would be to use TypeDescriptors everywhere within Pipeline construction (where possible), and utilize the CoderRegistry

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-27 Thread Robert Bradshaw
On Thu, Jul 27, 2017 at 10:04 AM, Kenneth Knowles wrote: > On Thu, Jul 27, 2017 at 2:22 AM, Lukasz Cwik > wrote: >> >> Ken/Robert, I believe users will want the ability to set the output coder >> because coders may have intrinsic properties

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-27 Thread Jean-Baptiste Onofré
Hi, That's an interesting thread and I was wondering the relationship between type descriptor and coder for a while ;) Today, in a PCollection, we can set the coder and we also have a getTypeDescriptor(). It sounds weird to me: it should be one or the other. Basically, if the Coder is not

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-26 Thread Kenneth Knowles
On Wed, Jul 26, 2017 at 8:58 PM, Eugene Kirpichov < kirpic...@google.com.invalid> wrote: > Hmm, yes, I just noticed that PCollection has a setTypeDescriptor() method, > and I wonder how much will break if all call sites of setCoder() will call > setTypeDescriptor() instead - i.e. how far are we

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-26 Thread Eugene Kirpichov
Hmm, yes, I just noticed that PCollection has a setTypeDescriptor() method, and I wonder how much will break if all call sites of setCoder() will call setTypeDescriptor() instead - i.e. how far are we from the ideal state of having a coder inferrable for every sufficiently concrete type

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-26 Thread Kenneth Knowles
+1 but maybe go ever further On Tue, Jul 25, 2017 at 8:25 PM, Eugene Kirpichov < kirpic...@google.com.invalid> wrote: > Hello, > > I've worked on a few different things recently and ran repeatedly into the > same issue: that we do not have clear guidance on who should set the Coder > on a

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-26 Thread Eugene Kirpichov
Okay, first PR is in review https://github.com/apache/beam/pull/3649 On Wed, Jul 26, 2017 at 11:58 AM Robert Bradshaw wrote: > +1, I'm a huge fan of moving this direction. Right now there's also > the ugliness that setCoder() may be called any number of times before

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-26 Thread Robert Bradshaw
+1, I'm a huge fan of moving this direction. Right now there's also the ugliness that setCoder() may be called any number of times before a PCollection is used (the last setter winning) but is an error to call it once it has been used (and here "used" is not clear--if a PCollection is returned

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-26 Thread Mingmin Xu
Second that 'it's responsibility of the transform'. For the case when a PTransform doesn't have enough information(PTransform developer should have the knowledge), I would prefer a strict way so users won't forget to call withSomethingCoder(), like - a Coder is required to new the PTransform; - or

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-26 Thread Eugene Kirpichov
Hm, can you elaborate? I'm not sure how this relates to my suggestion, the gist of which is "PTransform's should set the coder on all of their outputs, and the user should never have to .setCoder() on a PCollection obtained from a PTransform" On Wed, Jul 26, 2017 at 7:38 AM Lukasz Cwik

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-26 Thread Lukasz Cwik
I'm split between our current one pass model of pipeline construction and a two pass model where all information is gathered and then PTransform expansions are performed. On Tue, Jul 25, 2017 at 8:25 PM, Eugene Kirpichov < kirpic...@google.com.invalid> wrote: > Hello, > > I've worked on a few

Requiring PTransform to set a coder on its resulting collections

2017-07-25 Thread Eugene Kirpichov
Hello, I've worked on a few different things recently and ran repeatedly into the same issue: that we do not have clear guidance on who should set the Coder on a PCollection: is it responsibility of the PTransform that outputs it, or is it responsibility of the user, or is it sometimes one and