On Wed, Nov 30, 2016 at 3:52 PM, Eugene Kirpichov <
[email protected]> wrote:

> Hello,
>
> Do we have anywhere a set of recommendations for developing new coders? I'm
> confused by a couple of things:
>
> - Why are coders serialized by JSON serialization instead of by regular
> Java serialization, unlike other elements of the pipeline such as DoFn's
> and transforms?
>

Big picture: a Coder is actually an agreed-upon binary format that should
be cross-language when possible. It (the coder itself) needs to be able to
be deserialized/implemented by any SDK. Alluded to a bit in
https://s.apache.org/beam-runner-api but not developed at length. There is
also some fragmentary bits moving in this direction (the encodingId which
should be a URN in Beam).

It is still very convenient to use Java serialization which is why we
provided CustomCoder, but this presents a portability barrier. We have
options here: We could remove it, or we could have language-specific coders
that cannot be used for PCollections consumed by diverse languages. Maybe
there are other good options.



> - Which should one inherit from: CustomCoder, AtomicCoder, StandardCoder? I
> looked at their direct subclasses and didn't see a clear distinction. Seems
> like, when the encoded type is not parameterized then it's CustomCoder, and
> when it's parameterized then it's StandardCoder? [but then again
> CustomCoder isolates you from JSON weirdness, and this seems useful for
> non-atomic coders too]
>

Within the SDK, one should not inherit from CustomCoder. It is bulky and
non-portable. If a coder has component coders then it should inherit from
StandardCoder. If not, then AtomicCoder adds convenience.

These APIs are not perfect. For coder inference, the component coders are
really expected to correspond to type variables. For example, List<T> has a
component coder that is a Coder<T>. But for the purposes of construction
and update-compatibility, component coders should include *all* coders that
can be passed in that would modify the overall binary format, whether or
not they correspond to a type variable, for example
FullWindowedValueCoder<T> accepts not just a Coder<T> but also a
Coder<BoundedWindow>.

There is work to be done here to support both uses of "component" coders,
probably by separating them entirely.


- Which methods are necessary to implement? E.g. should I implement
> verifyDeterministic? Should I implement the "byte size observer" methods?
>

You should implement verifyDeterministic if your coder is deterministic so
it can be used for th ekeys in GroupByKey.


I'm actually even more confused by the hierarchy between Coder =>
> StandardCoder => DeterministicStandardCoder => AtomicCoder => CustomCoder.
> DeterministicStandardCoder implements verifyDeterministic(), but it has
> subclasses that override this method...
>

This is broken. It is for backwards compatibility reasons, I believe.
Certainly DeterministicStandardCoder is a bit questionable, as it ignores
components, and also overriding it so that subclasses cannot safely be used
as DeterministicStandardCoder is wrong.

Kenn

Reply via email to