Well, I guess it was a wording issue more than anything else. That said, it is not true for all runners, so this may still need some more love later, but I don't have a solution for it yet. I just wondered whether a better way to solve it already existed.
On Jan 30, 2018, 22:36, "Reuven Lax" <re...@google.com> wrote:

> The point isn't runner constraints; the point is that runners might make
> decisions to break fusion at unexpected points (e.g. the decision might be
> made because the runner has profile data about previous runs of the
> pipeline and knows it should break fusion at that point). The SDK has no
> good way of knowing what those decisions will be, so it needs to
> conservatively assume they could happen anywhere.
>
> On Tue, Jan 30, 2018 at 1:31 PM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>
>> Hmm, this starts to smell like the old question "how to enforce runner
>> constraints without enforcing too much" :(.
>>
>> Anyway, that is enough for me on this topic.
>> Thanks for the clarification and reminders, guys.
>>
>> On Jan 30, 2018, 22:29, "Reuven Lax" <re...@google.com> wrote:
>>
>>> Where the split points are depends on the runner. Runners are free to
>>> split at any point (and often do, to prevent cycles from appearing in
>>> the graph).
>>>
>>> On Tue, Jan 30, 2018 at 1:27 PM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>
>>>> I kind of agree with all of that, and it brings me to the interesting
>>>> point of this topic: why are coders enforced so strictly if they are
>>>> not used most of the time (a flat processor chain, to caricature it)?
>>>>
>>>> Shouldn't this be relaxed a bit and only enforced at split or shuffle
>>>> points?
>>>>
>>>> On Jan 30, 2018, 22:09, "Ben Chambers" <bchamb...@apache.org> wrote:
>>>>
>>>>> It sounds like in your specific case you're saying that the same
>>>>> encoding can be viewed by the Java type system in two different ways.
>>>>> For instance, if you have an object Person that is convertible to JSON
>>>>> using Jackson, then that JSON encoding can be viewed either as a Person
>>>>> or as a Map<String, Object> of the JSON fields. In that case, there
>>>>> needs to be some kind of "view change" transform to change the type of
>>>>> the PCollection.
>>>>> I'm not sure an untyped API would be better here. Requiring the "view
>>>>> change" to be explicit means we can ensure the types are compatible,
>>>>> and it also makes it very clear when this kind of change is desired.
>>>>>
>>>>> Some background on Coders that may be relevant:
>>>>>
>>>>> It might help to think about Coders as the specification of how
>>>>> elements in a PCollection are encoded if/when the runner needs to
>>>>> encode them. If you are trying to read JSON or XML records from a
>>>>> source, that is part of the source transform (reading JSON or XML
>>>>> records) and not part of the collection produced by the transform.
>>>>>
>>>>> Consider further: even if you read XML records from a source, you
>>>>> likely *wouldn't* want to use an XML Coder for those records within
>>>>> the pipeline, since every time the pipeline needed to serialize them
>>>>> you would produce a much larger amount of data (XML is not an
>>>>> efficient/compact encoding). Instead, you likely want to read XML
>>>>> records from the source, encode them within the pipeline using
>>>>> something more efficient, and then convert them to something more
>>>>> readable but possibly less efficient before they exit the pipeline at
>>>>> a sink.
>>>>>
>>>>> On Tue, Jan 30, 2018 at 12:23 PM Kenneth Knowles <k...@google.com> wrote:
>>>>>
>>>>>> Ah, this is a point that Robert brings up quite often: one reason we
>>>>>> put coders on PCollections instead of doing that work in PTransforms
>>>>>> is that the runner (plus SDK harness) can automatically serialize
>>>>>> only when necessary. So the default in Beam is that the thing you
>>>>>> want to happen is already done. There are some corner cases once you
>>>>>> get to the portability framework, but I am pretty sure it already
>>>>>> works this way. If you show what is a PTransform and what is a
>>>>>> PCollection in your example, it might show where we can fix things.
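[Editorial note: Ben's point about encoding size can be illustrated with a small stdlib-only sketch. The two encoders below are toy stand-ins invented for this illustration, not real Beam Coder implementations: the same record is much larger under a verbose XML-style text encoding than under a compact one, and the runner pays that cost on every materialization.]

```java
// Toy stand-in encoders (not Beam API): the same record under a verbose
// XML-style encoding vs. a compact delimited encoding.
public class EncodingSizeSketch {

    // Verbose encoding, similar in spirit to keeping records as XML text.
    static String xmlEncode(String name, String id) {
        return "<person><name>" + name + "</name><id>" + id + "</id></person>";
    }

    // Compact encoding, similar in spirit to an efficient in-pipeline coder.
    static String compactEncode(String name, String id) {
        return name + "|" + id;
    }

    // Both encodings carry the same information; decode the compact form.
    static String[] compactDecode(String payload) {
        return payload.split("\\|");
    }
}
```

Both forms are lossless for this record, so re-encoding inside the pipeline with the compact coder loses nothing while shrinking every serialization.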
>>>>>> On Tue, Jan 30, 2018 at 12:17 PM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>>>>
>>>>>>> Indeed,
>>>>>>>
>>>>>>> I'll take a simple example to keep it short.
>>>>>>> I have a source emitting Person objects ({name:...,id:...})
>>>>>>> serialized with Jackson as JSON.
>>>>>>> Then my pipeline processes them with a DoFn taking a Map<String,
>>>>>>> String>, so I set the coder to read the JSON as a map.
>>>>>>>
>>>>>>> However, a Map<String, String> is not a Person, so my pipeline
>>>>>>> needs an intermediate step to convert one into the other and has,
>>>>>>> by design, a useless serialization round trip.
>>>>>>>
>>>>>>> If you check the chain you have: Person -> JSON -> Map<String,
>>>>>>> String> -> JSON -> Map<String, String>, whereas Person -> JSON ->
>>>>>>> Map<String, String> would be fully sufficient, because the JSON
>>>>>>> encodings are equivalent in this example.
>>>>>>>
>>>>>>> In other words, if one coder's output is readable as another
>>>>>>> coder's input, Java's strong typing doesn't know about it and can
>>>>>>> force some redundant steps.
>>>>>>>
>>>>>>> Romain Manni-Bucau
>>>>>>> @rmannibucau <https://twitter.com/rmannibucau> | Blog
>>>>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>>>>> <http://rmannibucau.wordpress.com> | Github
>>>>>>> <https://github.com/rmannibucau> | LinkedIn
>>>>>>> <https://www.linkedin.com/in/rmannibucau>
>>>>>>>
>>>>>>> 2018-01-30 21:07 GMT+01:00 Kenneth Knowles <k...@google.com>:
>>>>>>>
>>>>>>>> I'm not sure I understand your question. Can you explain more?
>>>>>>>>
>>>>>>>> On Tue, Jan 30, 2018 at 11:50 AM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi guys,
>>>>>>>>>
>>>>>>>>> I just encountered an issue with the pipeline API and wondered
>>>>>>>>> if you had thought about it.
>>>>>>>>>
>>>>>>>>> It can happen that Coders are compatible with each other. A
>>>>>>>>> simple example is that a text coder like JSON or XML will be
>>>>>>>>> able to read text.
>>>>>>>>> However, with the pipeline API you can't support this directly;
>>>>>>>>> you have to force the user to go through an intermediate typed
>>>>>>>>> state.
>>>>>>>>>
>>>>>>>>> Is there already a way to avoid these useless round trips?
>>>>>>>>>
>>>>>>>>> Put differently: how should coder transitivity be handled?
>>>>>>>>>
>>>>>>>>> Romain Manni-Bucau
>>>>>>>>> @rmannibucau <https://twitter.com/rmannibucau> | Blog
>>>>>>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>>>>>>> <http://rmannibucau.wordpress.com> | Github
>>>>>>>>> <https://github.com/rmannibucau> | LinkedIn
>>>>>>>>> <https://www.linkedin.com/in/rmannibucau>
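[Editorial note: the redundant chain Romain describes can be made concrete with a stdlib-only sketch. ToyJsonCoder below is a hypothetical stand-in for a Jackson-based coder, not Beam API; it only handles flat string maps. It shows that Person -> JSON -> Map -> JSON -> Map yields exactly the same map as the direct Person -> JSON -> Map decode, so the middle encode/decode pair is a pure serialization round trip.]

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical toy coder standing in for a Jackson-based JSON coder (not
// Beam API). It encodes flat string maps as {"k":"v",...} and decodes them
// back, which is enough to demonstrate the round-trip equivalence.
public class ToyJsonCoder {

    static String toJson(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("{");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (sb.length() > 1) sb.append(',');
            sb.append('"').append(e.getKey()).append("\":\"").append(e.getValue()).append('"');
        }
        return sb.append('}').toString();
    }

    static Map<String, String> fromJson(String json) {
        Map<String, String> fields = new LinkedHashMap<>();
        String body = json.substring(1, json.length() - 1);
        if (body.isEmpty()) return fields;
        for (String pair : body.split(",")) {
            String[] kv = pair.split(":");
            fields.put(kv[0].replace("\"", ""), kv[1].replace("\"", ""));
        }
        return fields;
    }
}
```

Since fromJson(toJson(m)) returns m unchanged, inserting the extra Map -> JSON -> Map hop changes nothing except the cost of one serialization round trip, which is exactly the redundancy discussed in the thread.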