Thanks Kenn! This is very helpful. On Wed, Nov 30, 2016 at 4:13 PM Kenneth Knowles <k...@google.com.invalid> wrote:
> On Wed, Nov 30, 2016 at 3:52 PM, Eugene Kirpichov < > kirpic...@google.com.invalid> wrote: > > > Hello, > > > > Do we have anywhere a set of recommendations for developing new coders? > I'm > > confused by a couple of things: > > > > - Why are coders serialized by JSON serialization instead of by regular > > Java serialization, unlike other elements of the pipeline such as DoFn's > > and transforms? > > > > Big picture: a Coder is actually an agreed-upon binary format that should > be cross-language when possible. It (the coder itself) needs to be able to > be deserialized/implemented by any SDK. Alluded to a bit in > https://s.apache.org/beam-runner-api but not developed at length. There is > also some fragmentary bits moving in this direction (the encodingId which > should be a URN in Beam). > > It is still very convenient to use Java serialization which is why we > provided CustomCoder, but this presents a portability barrier. We have > options here: We could remove it, or we could have language-specific coders > that cannot be used for PCollections consumed by diverse languages. Maybe > there are other good options. > > > > > - Which should one inherit from: CustomCoder, AtomicCoder, > StandardCoder? I > > looked at their direct subclasses and didn't see a clear distinction. > Seems > > like, when the encoded type is not parameterized then it's CustomCoder, > and > > when it's parameterized then it's StandardCoder? [but then again > > CustomCoder isolates you from JSON weirdness, and this seems useful for > > non-atomic coders too] > > > > Within the SDK, one should not inherit from CustomCoder. It is bulky and > non-portable. If a coder has component coders then it should inherit from > StandardCoder. If not, then AtomicCoder adds convenience. > > These APIs are not perfect. For coder inference, the component coders are > really expected to correspond to type variables. For example, List<T> has a > component coder that is a Coder<T>. But for the purposes of construction > and update-compatibility, component coders should include *all* coders that > can be passed in that would modify the overall binary format, whether or > not they correspond to a type variable, for example > FullWindowedValueCoder<T> accepts not just a Coder<T> but also a > Coder<BoundedWindow>. > > There is work to be done here to support both uses of "component" coders, > probably by separating them entirely. > > > - Which methods are necessary to implement? E.g. should I implement > > verifyDeterministic? Should I implement the "byte size observer" methods? > > > > You should implement verifyDeterministic if your coder is deterministic so > it can be used for th ekeys in GroupByKey. > > > I'm actually even more confused by the hierarchy between Coder => > > StandardCoder => DeterministicStandardCoder => AtomicCoder => > CustomCoder. > > DeterministicStandardCoder implements verifyDeterministic(), but it has > > subclasses that override this method... > > > > This is broken. It is for backwards compatibility reasons, I believe. > Certainly DeterministicStandardCoder is a bit questionable, as it ignores > components, and also overriding it so that subclasses cannot safely be used > as DeterministicStandardCoder is wrong. > > Kenn >