I kind of agree with all of that, which brings me to the interesting point of this topic: why are coders enforced so strictly if they are not actually used most of the time (a flat processor chain, to caricature it)?
Shouldn't it be relaxed a bit and only enforced at split or shuffle points?

On 30 Jan 2018 at 22:09, "Ben Chambers" <bchamb...@apache.org> wrote:

> It sounds like in your specific case you're saying that the same encoding
> can be viewed by the Java type system in two different ways. For instance,
> if you have an object Person that is convertible to JSON using Jackson,
> then that JSON encoding can be viewed as either a Person or a Map<String,
> Object> looking at the JSON fields. In that case, there needs to be some
> kind of "view change" transform to change the type of the PCollection.
>
> I'm not sure an untyped API would be better here. Requiring the "view
> change" to be explicit means we can ensure the types are compatible, and
> it also makes it very clear when this kind of change is desired.
>
> Some background on Coders that may be relevant:
>
> It might help to think about Coders as the specification of how elements
> in a PCollection are encoded if/when the runner needs to. If you are
> trying to read JSON or XML records from a source, that is part of the
> source transform (reading JSON or XML records) and not part of the
> collection produced by the transform.
>
> Consider further -- even if you read XML records from a source, you likely
> *wouldn't* want to use an XML Coder for those records within the pipeline,
> as every time the pipeline needed to serialize them you would produce a
> much larger amount of data (XML is not an efficient/compact encoding).
> Instead, you likely want to read XML records from the source and then
> encode them within the pipeline using something more efficient, then
> convert them to something more readable but possibly less efficient before
> they exit the pipeline at a sink.
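[Editor's sketch] Ben's "view change" idea can be illustrated outside of Beam with a toy example. All names below are hypothetical, and the hand-rolled JSON stands in for a Jackson-based coder: the same JSON encoding is re-interpreted as a Map view with a single encode/decode pair, which is the explicit conversion step he describes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy sketch of an explicit "view change" (illustrative names, not Beam API):
// the same JSON encoding can be decoded either as a Person or as a Map view,
// so one explicit re-interpretation step replaces a typed intermediate hop.
public class ViewChangeSketch {

    static final class Person {
        final String name;
        final String id;
        Person(String name, String id) { this.name = name; this.id = id; }

        // Hand-rolled JSON encoding standing in for a Jackson-based coder.
        String toJson() {
            return "{\"name\":\"" + name + "\",\"id\":\"" + id + "\"}";
        }
    }

    // "Map view" decoder: reads the same JSON as key/value pairs.
    // Naive parse, good enough for the flat {"k":"v",...} shape above.
    static Map<String, String> jsonToMap(String json) {
        Map<String, String> map = new LinkedHashMap<>();
        String body = json.substring(1, json.length() - 1);
        for (String pair : body.split(",")) {
            String[] kv = pair.split(":");
            map.put(kv[0].replace("\"", ""), kv[1].replace("\"", ""));
        }
        return map;
    }

    // The explicit "view change": Person -> JSON -> Map, one encode and one
    // decode, instead of Person -> JSON -> Map -> JSON -> Map.
    static Map<String, String> viewAsMap(Person p) {
        return jsonToMap(p.toJson());
    }

    public static void main(String[] args) {
        Person p = new Person("Ada", "42");
        System.out.println(viewAsMap(p)); // prints {name=Ada, id=42}
    }
}
```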
>
> On Tue, Jan 30, 2018 at 12:23 PM Kenneth Knowles <k...@google.com> wrote:
>
>> Ah, this is a point that Robert brings up quite often: one reason we put
>> coders on PCollections instead of doing that work in PTransforms is that
>> the runner (plus SDK harness) can automatically serialize only when
>> necessary. So in Beam the thing you want to happen is already the
>> default. There are some corner cases when you get to the portability
>> framework, but I am pretty sure it already works this way. If you show
>> which parts of your example are PTransforms and which are PCollections,
>> it might show where we can fix things.
>>
>> On Tue, Jan 30, 2018 at 12:17 PM, Romain Manni-Bucau <
>> rmannibu...@gmail.com> wrote:
>>
>>> Indeed,
>>>
>>> I'll take a simple example to keep this short.
>>> I have a source emitting Person objects ({name:...,id:...}) serialized
>>> with Jackson as JSON.
>>> Then my pipeline processes them with a DoFn taking a Map<String,
>>> String>. Here I set the coder to read the JSON as a map.
>>>
>>> However, a Map<String, String> is not a Person, so my pipeline needs an
>>> intermediate step to convert one into the other, which bakes a useless
>>> serialization round trip into the design.
>>>
>>> If you check the chain you have: Person -> JSON -> Map<String, String>
>>> -> JSON -> Map<String, String>, whereas Person -> JSON -> Map<String,
>>> String> is fully sufficient because the two JSON encodings are
>>> equivalent in this example.
>>>
>>> In other words, if one coder's output is readable as another coder's
>>> input, Java's strong typing doesn't know about it and can enforce some
>>> fake steps.
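[Editor's sketch] The redundant chain Romain describes can be made concrete with a small counter-based toy (nothing here is Beam API; the encode/decode helpers are stand-ins for coders): the forced intermediate step doubles the serialization work compared to the direct chain.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Toy illustration of the two chains (all names hypothetical, not Beam API):
// Person -> JSON -> Map -> JSON -> Map costs two encode/decode pairs, while
// Person -> JSON -> Map costs one, since both steps speak the same JSON.
public class RoundTripCost {

    static final AtomicInteger encodes = new AtomicInteger();
    static final AtomicInteger decodes = new AtomicInteger();

    static String encodePerson(String name, String id) {
        encodes.incrementAndGet();
        return "{\"name\":\"" + name + "\",\"id\":\"" + id + "\"}";
    }

    static Map<String, String> decodeToMap(String json) {
        decodes.incrementAndGet();
        Map<String, String> map = new LinkedHashMap<>();
        String body = json.substring(1, json.length() - 1);
        for (String pair : body.split(",")) {
            String[] kv = pair.split(":");
            map.put(kv[0].replace("\"", ""), kv[1].replace("\"", ""));
        }
        return map;
    }

    static String encodeMap(Map<String, String> map) {
        encodes.incrementAndGet();
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : map.entrySet()) {
            if (!first) sb.append(",");
            sb.append("\"").append(e.getKey()).append("\":\"")
              .append(e.getValue()).append("\"");
            first = false;
        }
        return sb.append("}").toString();
    }

    // The chain with the forced intermediate step: 2 encodes + 2 decodes.
    static Map<String, String> longChain(String name, String id) {
        Map<String, String> intermediate = decodeToMap(encodePerson(name, id));
        // the "fake step": re-encode and re-decode just to satisfy the types
        return decodeToMap(encodeMap(intermediate));
    }

    // The direct chain: 1 encode + 1 decode.
    static Map<String, String> shortChain(String name, String id) {
        return decodeToMap(encodePerson(name, id));
    }

    public static void main(String[] args) {
        longChain("Ada", "42");
        System.out.println("long chain:  " + encodes.get() + " encodes / "
            + decodes.get() + " decodes");
        encodes.set(0);
        decodes.set(0);
        shortChain("Ada", "42");
        System.out.println("short chain: " + encodes.get() + " encode  / "
            + decodes.get() + " decode");
    }
}
```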
>>>
>>> Romain Manni-Bucau
>>> @rmannibucau <https://twitter.com/rmannibucau> | Blog
>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>> <http://rmannibucau.wordpress.com> | Github
>>> <https://github.com/rmannibucau> | LinkedIn
>>> <https://www.linkedin.com/in/rmannibucau>
>>>
>>> 2018-01-30 21:07 GMT+01:00 Kenneth Knowles <k...@google.com>:
>>>
>>>> I'm not sure I understand your question. Can you explain more?
>>>>
>>>> On Tue, Jan 30, 2018 at 11:50 AM, Romain Manni-Bucau <
>>>> rmannibu...@gmail.com> wrote:
>>>>
>>>>> Hi guys,
>>>>>
>>>>> I just encountered an issue with the pipeline API and wondered if you
>>>>> had thought about it.
>>>>>
>>>>> It can happen that Coders are compatible with each other. A simple
>>>>> example is that a text coder like JSON or XML will be able to read
>>>>> plain text. However, with the pipeline API you can't support this
>>>>> directly, and you force the user to go through an intermediate typed
>>>>> state.
>>>>>
>>>>> Is there already a way to avoid these useless round trips?
>>>>>
>>>>> Said otherwise: how should coder transitivity be handled?
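[Editor's sketch] As a closing illustration of the transitivity question, here is a hypothetical sketch of what a coder-compatibility probe could look like (no such API exists in Beam in this thread; all names are invented): a downstream coder B can follow an upstream coder A if it successfully decodes a sample that A encoded.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical coder-compatibility probe (nothing here is Beam API):
// two coders are considered compatible when the consumer can decode
// what the producer encoded, checked against a sample value.
public class CoderCompat {

    interface Coder<T> {
        byte[] encode(T value);
        T decode(byte[] bytes);
    }

    // A JSON-ish text coder for a flat key/value record.
    static final Coder<Map<String, String>> MAP_JSON = new Coder<Map<String, String>>() {
        public byte[] encode(Map<String, String> m) {
            StringBuilder sb = new StringBuilder("{");
            boolean first = true;
            for (Map.Entry<String, String> e : m.entrySet()) {
                if (!first) sb.append(",");
                sb.append("\"").append(e.getKey()).append("\":\"")
                  .append(e.getValue()).append("\"");
                first = false;
            }
            return sb.append("}").toString().getBytes(StandardCharsets.UTF_8);
        }
        public Map<String, String> decode(byte[] bytes) {
            String json = new String(bytes, StandardCharsets.UTF_8);
            Map<String, String> m = new LinkedHashMap<>();
            String body = json.substring(1, json.length() - 1);
            for (String pair : body.split(",")) {
                String[] kv = pair.split(":");
                m.put(kv[0].replace("\"", ""), kv[1].replace("\"", ""));
            }
            return m;
        }
    };

    // A plain text coder: any bytes decode as a string, so it can read the
    // JSON coder's output -- the "text coders are compatible" case above.
    static final Coder<String> TEXT = new Coder<String>() {
        public byte[] encode(String s) { return s.getBytes(StandardCharsets.UTF_8); }
        public String decode(byte[] b) { return new String(b, StandardCharsets.UTF_8); }
    };

    // A fixed-width integer coder: rejects anything that is not 4 bytes.
    static final Coder<Integer> INT4 = new Coder<Integer>() {
        public byte[] encode(Integer v) {
            return ByteBuffer.allocate(4).putInt(v).array();
        }
        public Integer decode(byte[] b) {
            if (b.length != 4) throw new IllegalArgumentException("expected 4 bytes");
            return ByteBuffer.wrap(b).getInt();
        }
    };

    // Probe: can `consumer` read what `producer` wrote for this sample?
    static <A, B> boolean canRead(Coder<A> producer, Coder<B> consumer, A sample) {
        try {
            consumer.decode(producer.encode(sample));
            return true;
        } catch (RuntimeException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Map<String, String> sample = new LinkedHashMap<>();
        sample.put("name", "Ada");
        System.out.println(canRead(MAP_JSON, TEXT, sample)); // prints true
        System.out.println(canRead(MAP_JSON, INT4, sample)); // prints false
    }
}
```

This only checks one sample, so it is a runtime probe rather than the static guarantee the Java type system gives; that gap is exactly the tension the thread is about.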