On Thu, Aug 10, 2017 at 5:06 PM, Reuven Lax wrote:
> Interestingly I've seen examples of PTransforms where the transform itself
> is unable to easily set its own coder. This happens when the transform is
> parametrized in such a way that its ouput coder is not
Interestingly I've seen examples of PTransforms where the transform itself
is unable to easily set its own coder. This happens when the transform is
parametrized in such a way that its ouput coder is not determinable except
by the caller of the PTransform. The caller can of course pass a coder
I've updated the guidance in PTransform Style Guide on setting coders
https://beam.apache.org/contribute/ptransform-style-guide/#coders
according to this discussion.
https://github.com/apache/beam-site/pull/279
On Thu, Aug 3, 2017 at 6:27 PM Robert Bradshaw
wrote:
>
https://github.com/apache/beam/pull/3649 has landed. The main contribution
of this PR is deprecating PTransform.getDefaultOutputCoder().
Next steps are to get rid of all setCoder() calls in the SDK, and deprecate
setCoder().
Nearly all setCoder() calls (perhaps simply all?) I found are on the
One interesting wrinkle: I'm about to propose a set of semantics for
snapshotting/in-place updating pipelines. Part of this proposal is the
ability to write pipelines to "upgrade" snapshots to make them compatible
with new graphs. This relies on the ability to have two separate coders for
the same
Another thought on this: setting a custom coder to support a special data
distribution is likely often a property of the input to the pipeline. So
setting a coder during pipeline construction - more generally, when writing
a composite transform for reuse - you might not actually have the needed
On Thu, Jul 27, 2017 at 11:18 AM, Thomas Groh
wrote:
> introduce a
> new, specialized type to represent the restricted
> (alternatively-distributed?) data. The TypeDescriptor for this type can map
> to the specialized coder, without having to perform a significant
Fully agree Thomas, I think it's way better.
Regards
JB
On 07/27/2017 08:00 PM, Thomas Groh wrote:
+1 on getting rid of setCoder; just from a Java SDK perspective, my ideal
world contains PCollections which don't have a user-visible way to mutate
them.
My preference would be to use
I think there's a simpler solution than encoding to byte[]: introduce a
new, specialized type to represent the restricted
(alternatively-distributed?) data. The TypeDescriptor for this type can map
to the specialized coder, without having to perform a significant degree of
potentially wasted
I tend to agree with Robert - it would be unfortunate if a single
TypeDescrictor was forced to have the same encoding all through the
pipeline. However it's also unfortunate if this flexibility impacted every
part of the programming model. I also think that our experience has been
that "large
+1 on getting rid of setCoder; just from a Java SDK perspective, my ideal
world contains PCollections which don't have a user-visible way to mutate
them.
My preference would be to use TypeDescriptors everywhere within Pipeline
construction (where possible), and utilize the CoderRegistry
On Thu, Jul 27, 2017 at 10:04 AM, Kenneth Knowles
wrote:
> On Thu, Jul 27, 2017 at 2:22 AM, Lukasz Cwik
> wrote:
>>
>> Ken/Robert, I believe users will want the ability to set the output coder
>> because coders may have intrinsic properties
Hi,
That's an interesting thread and I was wondering the relationship between type
descriptor and coder for a while ;)
Today, in a PCollection, we can set the coder and we also have a
getTypeDescriptor(). It sounds weird to me: it should be one or the other.
Basically, if the Coder is not
On Wed, Jul 26, 2017 at 8:58 PM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:
> Hmm, yes, I just noticed that PCollection has a setTypeDescriptor() method,
> and I wonder how much will break if all call sites of setCoder() will call
> setTypeDescriptor() instead - i.e. how far are we
Hmm, yes, I just noticed that PCollection has a setTypeDescriptor() method,
and I wonder how much will break if all call sites of setCoder() will call
setTypeDescriptor() instead - i.e. how far are we from the ideal state of
having a coder inferrable for every sufficiently concrete type
+1 but maybe go ever further
On Tue, Jul 25, 2017 at 8:25 PM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:
> Hello,
>
> I've worked on a few different things recently and ran repeatedly into the
> same issue: that we do not have clear guidance on who should set the Coder
> on a
Okay, first PR is in review https://github.com/apache/beam/pull/3649
On Wed, Jul 26, 2017 at 11:58 AM Robert Bradshaw
wrote:
> +1, I'm a huge fan of moving this direction. Right now there's also
> the ugliness that setCoder() may be called any number of times before
+1, I'm a huge fan of moving this direction. Right now there's also
the ugliness that setCoder() may be called any number of times before
a PCollection is used (the last setter winning) but is an error to
call it once it has been used (and here "used" is not clear--if a
PCollection is returned
Second that 'it's responsibility of the transform'. For the case when a
PTransform doesn't have enough information(PTransform developer should have
the knowledge), I would prefer a strict way so users won't forget to call
withSomethingCoder(), like
- a Coder is required to new the PTransform;
- or
Hm, can you elaborate? I'm not sure how this relates to my suggestion, the
gist of which is "PTransform's should set the coder on all of their
outputs, and the user should never have to .setCoder() on a PCollection
obtained from a PTransform"
On Wed, Jul 26, 2017 at 7:38 AM Lukasz Cwik
I'm split between our current one pass model of pipeline construction and a
two pass model where all information is gathered and then PTransform
expansions are performed.
On Tue, Jul 25, 2017 at 8:25 PM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:
> Hello,
>
> I've worked on a few
Hello,
I've worked on a few different things recently and ran repeatedly into the
same issue: that we do not have clear guidance on who should set the Coder
on a PCollection: is it responsibility of the PTransform that outputs it,
or is it responsibility of the user, or is it sometimes one and
22 matches
Mail list logo