Hmm, this starts to smell like the old question of "how to enforce runner constraints without enforcing too much" :(.
Anyway, that is enough for me on this topic. Thanks for the clarifications and reminders, guys.

On 30 Jan 2018 at 22:29, "Reuven Lax" <re...@google.com> wrote:

> Where the split points are depends on the runner. Runners are free to
> split at any point (and often do, to prevent cycles from appearing in the
> graph).
>
> On Tue, Jan 30, 2018 at 1:27 PM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>
>> I kind of agree with all of that, which brings me to the interesting
>> point of this topic: why are coders enforced so strictly if they are not
>> used most of the time - in a flat processor chain, to caricature it?
>>
>> Shouldn't it be relaxed a bit and only enforced at split or shuffle points?
>>
>> On 30 Jan 2018 at 22:09, "Ben Chambers" <bchamb...@apache.org> wrote:
>>
>>> It sounds like in your specific case the same encoding can be viewed by
>>> the Java type system in two different ways. For instance, if you have an
>>> object Person that is convertible to JSON using Jackson, then that JSON
>>> encoding can be viewed as either a Person or a Map<String, Object> of
>>> the JSON fields. In that case, there needs to be some kind of "view
>>> change" transform to change the type of the PCollection.
>>>
>>> I'm not sure an untyped API would be better here. Requiring the "view
>>> change" to be explicit means we can ensure the types are compatible, and
>>> it also makes it very clear when this kind of change is desired.
>>>
>>> Some background on Coders that may be relevant:
>>>
>>> It might help to think about Coders as the specification of how elements
>>> in a PCollection are encoded if/when the runner needs to. If you are
>>> trying to read JSON or XML records from a source, that is part of the
>>> source transform (reading JSON or XML records) and not part of the
>>> collection produced by the transform.
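[Editor's note: Ben's "view change" idea can be sketched without Beam at all. The point is that the same bytes admit two typed views, and changing the view is an explicit decode rather than a new encoding. Everything here is hypothetical and dependency-free: `Person`, the flat-JSON helpers (no escaping), and the class name are made up; a real pipeline would use Jackson and Beam's Coder API instead of hand-rolled string handling.]

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ViewChange {
    // Hypothetical stand-in for the Person type discussed in the thread.
    record Person(String name, String id) {}

    // Encode a Person as flat JSON (no escaping; a sketch, not a real coder).
    static String encode(Person p) {
        return "{\"name\":\"" + p.name() + "\",\"id\":\"" + p.id() + "\"}";
    }

    // Decode the same bytes as a Map<String, String>: the explicit "view change".
    static Map<String, String> decodeAsMap(String json) {
        Map<String, String> out = new LinkedHashMap<>();
        String body = json.substring(1, json.length() - 1); // strip the braces
        for (String pair : body.split(",")) {
            String[] kv = pair.split(":");
            out.put(kv[0].replace("\"", ""), kv[1].replace("\"", ""));
        }
        return out;
    }

    public static void main(String[] args) {
        Person p = new Person("Ada", "1");
        String wire = encode(p);                       // one encoding of p
        Map<String, String> view = decodeAsMap(wire);  // same bytes, different type
        System.out.println(wire);
        System.out.println(view.get("name") + " " + view.get("id"));
    }
}
```

The type change happens entirely at decode time; no second wire format is introduced, which is what makes the explicit view change cheap.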
>>> Consider further -- even if you read XML records from a source, you
>>> likely *wouldn't* want to use an XML Coder for those records within the
>>> pipeline, as every time the pipeline needed to serialize them you would
>>> produce much larger amounts of data (XML is not an efficient/compact
>>> encoding). Instead, you likely want to read XML records from the source,
>>> encode them within the pipeline using something more efficient, and then
>>> convert them to something more readable but possibly less efficient
>>> before they exit the pipeline at a sink.
>>>
>>> On Tue, Jan 30, 2018 at 12:23 PM Kenneth Knowles <k...@google.com> wrote:
>>>
>>>> Ah, this is a point that Robert brings up quite often: one reason we
>>>> put coders on PCollections instead of doing that work in PTransforms is
>>>> that the runner (plus SDK harness) can automatically serialize only
>>>> when necessary. So in Beam the thing you want to happen is already the
>>>> default. There are some corner cases when you get to the portability
>>>> framework, but I am pretty sure it already works this way. If you show
>>>> what is a PTransform and what is a PCollection in your example, it
>>>> might show where we can fix things.
>>>>
>>>> On Tue, Jan 30, 2018 at 12:17 PM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>>
>>>>> Indeed,
>>>>>
>>>>> I'll take a trivial example to keep it short. I have a source emitting
>>>>> Person objects ({name:...,id:...}) serialized with Jackson as JSON.
>>>>> Then my pipeline processes them with a DoFn taking a Map<String,
>>>>> String>, so I set the coder to read the JSON as a map.
>>>>>
>>>>> However, a Map<String, String> is not a Person, so my pipeline needs
>>>>> an intermediate step to convert one into the other and has, by design,
>>>>> a useless serialization round trip.
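[Editor's note: Kenn's point -- that putting coders on PCollections lets the runner serialize only when necessary -- can be illustrated with a dependency-free sketch. Here the `encode`/`decode` helpers stand in for a Coder and simply count how often an element crosses a byte boundary; nothing below is Beam API, and the transform functions are made up.]

```java
import java.util.List;
import java.util.function.Function;

public class FusionSketch {
    static int encodeCalls = 0;

    // Stand-in coder: counts how often an element is actually serialized.
    static byte[] encode(String s) { encodeCalls++; return s.getBytes(); }
    static String decode(byte[] b) { return new String(b); }

    public static void main(String[] args) {
        Function<String, String> upper = String::toUpperCase; // first "transform"
        Function<String, String> exclaim = s -> s + "!";      // second "transform"
        List<String> input = List.of("a", "b");

        // Naive execution: the coder runs between every pair of steps.
        for (String e : input) {
            String mid = decode(encode(upper.apply(e)));
            decode(encode(exclaim.apply(mid)));
        }
        int naive = encodeCalls;

        // Fused execution: steps are chained in memory; the coder only runs
        // at the stage boundary (a shuffle, a sink, or a real split point).
        encodeCalls = 0;
        for (String e : input) {
            encode(exclaim.apply(upper.apply(e)));
        }
        int fused = encodeCalls;

        System.out.println("naive encodes: " + naive + ", fused encodes: " + fused);
    }
}
```

With two elements and two steps, the naive path pays four encodes while the fused path pays two -- the coder is a specification of *how* to serialize, not a promise that serialization happens at every step.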
>>>>> If you check the chain you have: Person -> JSON -> Map<String, String>
>>>>> -> JSON -> Map<String, String>, whereas Person -> JSON -> Map<String,
>>>>> String> is fully sufficient because the two JSON encodings are
>>>>> equivalent in this example.
>>>>>
>>>>> In other words, when one coder's output is readable by another coder's
>>>>> input, Java's strong typing doesn't know about it and can force some
>>>>> fake steps.
>>>>>
>>>>> Romain Manni-Bucau
>>>>> @rmannibucau <https://twitter.com/rmannibucau> | Blog
>>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>>> <http://rmannibucau.wordpress.com> | Github
>>>>> <https://github.com/rmannibucau> | LinkedIn
>>>>> <https://www.linkedin.com/in/rmannibucau>
>>>>>
>>>>> 2018-01-30 21:07 GMT+01:00 Kenneth Knowles <k...@google.com>:
>>>>>
>>>>>> I'm not sure I understand your question. Can you explain more?
>>>>>>
>>>>>> On Tue, Jan 30, 2018 at 11:50 AM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi guys,
>>>>>>>
>>>>>>> I just encountered an issue with the pipeline API and wondered if
>>>>>>> you had thought about it.
>>>>>>>
>>>>>>> It can happen that Coders are compatible with each other. A simple
>>>>>>> example: a text coder like JSON or XML will be able to read text.
>>>>>>> However, with the pipeline API you can't support this directly; the
>>>>>>> user is forced through an intermediate state to stay typed.
>>>>>>>
>>>>>>> Is there already a way to avoid these useless round trips?
>>>>>>>
>>>>>>> Said otherwise: how do we handle coder transitivity?
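[Editor's note: the two chains Romain compares can be made concrete with a toy, dependency-free sketch. The flat-JSON helpers are illustrative stand-ins for Jackson (no escaping, flat string maps only) and `ChainSketch` is a made-up name; the only point is that the forced round trip costs one extra serialization per element compared with the direct Person -> JSON -> Map path.]

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ChainSketch {
    static int serializations = 0;

    // Toy JSON encoder for a flat string map (no escaping; illustration only).
    static String toJson(Map<String, String> m) {
        serializations++;
        StringBuilder sb = new StringBuilder("{");
        m.forEach((k, v) -> sb.append("\"").append(k).append("\":\"").append(v).append("\","));
        if (sb.charAt(sb.length() - 1) == ',') sb.setLength(sb.length() - 1);
        return sb.append("}").toString();
    }

    static Map<String, String> fromJson(String json) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String pair : json.substring(1, json.length() - 1).split(",")) {
            String[] kv = pair.split(":");
            out.put(kv[0].replace("\"", ""), kv[1].replace("\"", ""));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> person = new LinkedHashMap<>();
        person.put("name", "Ada");
        person.put("id", "1");

        // Forced chain: Person -> JSON -> Map -> JSON -> Map (two serializations).
        Map<String, String> viaRoundTrip = fromJson(toJson(fromJson(toJson(person))));
        System.out.println("serializations with round trip: " + serializations);

        serializations = 0;
        // Sufficient chain: Person -> JSON -> Map (one serialization).
        Map<String, String> direct = fromJson(toJson(person));
        System.out.println("serializations direct: " + serializations);
    }
}
```

Both chains produce the same map; only the serialization count differs, which is exactly the "fake step" the thread is about.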