Re: Questions about coders

Eugene Kirpichov Wed, 30 Nov 2016 16:53:57 -0800

Thanks Kenn! This is very helpful.

On Wed, Nov 30, 2016 at 4:13 PM Kenneth Knowles <k...@google.com.invalid>
wrote:


> On Wed, Nov 30, 2016 at 3:52 PM, Eugene Kirpichov <
> kirpic...@google.com.invalid> wrote:
>
> > Hello,
> >
> > Do we have anywhere a set of recommendations for developing new coders?
> I'm
> > confused by a couple of things:
> >
> > - Why are coders serialized by JSON serialization instead of by regular
> > Java serialization, unlike other elements of the pipeline such as DoFn's
> > and transforms?
> >
>
> Big picture: a Coder is actually an agreed-upon binary format that should
> be cross-language when possible. It (the coder itself) needs to be able to
> be deserialized/implemented by any SDK. Alluded to a bit in
> https://s.apache.org/beam-runner-api but not developed at length. There is
> also some fragmentary bits moving in this direction (the encodingId which
> should be a URN in Beam).
>
> It is still very convenient to use Java serialization which is why we
> provided CustomCoder, but this presents a portability barrier. We have
> options here: We could remove it, or we could have language-specific coders
> that cannot be used for PCollections consumed by diverse languages. Maybe
> there are other good options.
>
>
>
> > - Which should one inherit from: CustomCoder, AtomicCoder,
> StandardCoder? I
> > looked at their direct subclasses and didn't see a clear distinction.
> Seems
> > like, when the encoded type is not parameterized then it's CustomCoder,
> and
> > when it's parameterized then it's StandardCoder? [but then again
> > CustomCoder isolates you from JSON weirdness, and this seems useful for
> > non-atomic coders too]
> >
>
> Within the SDK, one should not inherit from CustomCoder. It is bulky and
> non-portable. If a coder has component coders then it should inherit from
> StandardCoder. If not, then AtomicCoder adds convenience.
>
> These APIs are not perfect. For coder inference, the component coders are
> really expected to correspond to type variables. For example, List<T> has a
> component coder that is a Coder<T>. But for the purposes of construction
> and update-compatibility, component coders should include *all* coders that
> can be passed in that would modify the overall binary format, whether or
> not they correspond to a type variable, for example
> FullWindowedValueCoder<T> accepts not just a Coder<T> but also a
> Coder<BoundedWindow>.
>
> There is work to be done here to support both uses of "component" coders,
> probably by separating them entirely.
>
>
> - Which methods are necessary to implement? E.g. should I implement
> > verifyDeterministic? Should I implement the "byte size observer" methods?
> >
>
> You should implement verifyDeterministic if your coder is deterministic so
> it can be used for th ekeys in GroupByKey.
>
>
> I'm actually even more confused by the hierarchy between Coder =>
> > StandardCoder => DeterministicStandardCoder => AtomicCoder =>
> CustomCoder.
> > DeterministicStandardCoder implements verifyDeterministic(), but it has
> > subclasses that override this method...
> >
>
> This is broken. It is for backwards compatibility reasons, I believe.
> Certainly DeterministicStandardCoder is a bit questionable, as it ignores
> components, and also overriding it so that subclasses cannot safely be used
> as DeterministicStandardCoder is wrong.
>
> Kenn
>

Re: Questions about coders

Reply via email to