On Wed, Nov 30, 2016 at 3:52 PM, Eugene Kirpichov < [email protected]> wrote:
> Hello, > > Do we have anywhere a set of recommendations for developing new coders? I'm > confused by a couple of things: > > - Why are coders serialized by JSON serialization instead of by regular > Java serialization, unlike other elements of the pipeline such as DoFn's > and transforms? > Big picture: a Coder is actually an agreed-upon binary format that should be cross-language when possible. It (the coder itself) needs to be able to be deserialized/implemented by any SDK. Alluded to a bit in https://s.apache.org/beam-runner-api but not developed at length. There is also some fragmentary bits moving in this direction (the encodingId which should be a URN in Beam). It is still very convenient to use Java serialization which is why we provided CustomCoder, but this presents a portability barrier. We have options here: We could remove it, or we could have language-specific coders that cannot be used for PCollections consumed by diverse languages. Maybe there are other good options. > - Which should one inherit from: CustomCoder, AtomicCoder, StandardCoder? I > looked at their direct subclasses and didn't see a clear distinction. Seems > like, when the encoded type is not parameterized then it's CustomCoder, and > when it's parameterized then it's StandardCoder? [but then again > CustomCoder isolates you from JSON weirdness, and this seems useful for > non-atomic coders too] > Within the SDK, one should not inherit from CustomCoder. It is bulky and non-portable. If a coder has component coders then it should inherit from StandardCoder. If not, then AtomicCoder adds convenience. These APIs are not perfect. For coder inference, the component coders are really expected to correspond to type variables. For example, List<T> has a component coder that is a Coder<T>. But for the purposes of construction and update-compatibility, component coders should include *all* coders that can be passed in that would modify the overall binary format, whether or not they correspond to a type variable, for example FullWindowedValueCoder<T> accepts not just a Coder<T> but also a Coder<BoundedWindow>. There is work to be done here to support both uses of "component" coders, probably by separating them entirely. - Which methods are necessary to implement? E.g. should I implement > verifyDeterministic? Should I implement the "byte size observer" methods? > You should implement verifyDeterministic if your coder is deterministic so it can be used for th ekeys in GroupByKey. I'm actually even more confused by the hierarchy between Coder => > StandardCoder => DeterministicStandardCoder => AtomicCoder => CustomCoder. > DeterministicStandardCoder implements verifyDeterministic(), but it has > subclasses that override this method... > This is broken. It is for backwards compatibility reasons, I believe. Certainly DeterministicStandardCoder is a bit questionable, as it ignores components, and also overriding it so that subclasses cannot safely be used as DeterministicStandardCoder is wrong. Kenn
