The point isn't runner constraints; the point is that runners might make
decisions to break fusion at unexpected points (e.g. the decision might be
made because the runner has profile data about previous runs of the
pipeline, and knows it should break it at that point). The SDK has no good
way of knowing what those decisions will be, so it needs to conservatively
assume it could happen anywhere.
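
To make that concrete, here is a toy sketch in plain Java. Nothing in it is a
Beam class: FusionBreakSketch, run, demoStages and the breakAfter set are all
hypothetical names for illustration. It models a chain of stages in which a
"runner" may break fusion after any stage, and every break forces an
encode/decode round trip. Since the break set is only known at run time, an
encoding has to be available at every boundary.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.UnaryOperator;

// Toy model only: none of these types are Beam classes.
class FusionBreakSketch {

    /** Hypothetical stand-in for the coder of a String collection. */
    static byte[] encode(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Final value plus the number of encode/decode round trips performed. */
    static final class Result {
        final String value;
        final int roundTrips;

        Result(String value, int roundTrips) {
            this.value = value;
            this.roundTrips = roundTrips;
        }
    }

    /**
     * Runs the stages in order. After stage i, if the "runner" chose to break
     * fusion there (i is in breakAfter), the element goes through the coder.
     */
    static Result run(String input, List<UnaryOperator<String>> stages,
                      Set<Integer> breakAfter) {
        String value = input;
        int roundTrips = 0;
        for (int i = 0; i < stages.size(); i++) {
            value = stages.get(i).apply(value);
            if (breakAfter.contains(i)) {
                value = decode(encode(value));
                roundTrips++;
            }
        }
        return new Result(value, roundTrips);
    }

    /** Three trivial stages used for the demonstration. */
    static List<UnaryOperator<String>> demoStages() {
        List<UnaryOperator<String>> stages = new ArrayList<>();
        stages.add(s -> s.trim());
        stages.add(s -> s + "|parsed");
        stages.add(s -> s + "|formatted");
        return stages;
    }

    public static void main(String[] args) {
        // Fully fused: the coder is never used.
        Result fused = run(" x ", demoStages(), new HashSet<Integer>());
        // The "runner" decided to break after stages 0 and 1.
        Result broken = run(" x ", demoStages(),
                new HashSet<Integer>(Arrays.asList(0, 1)));
        System.out.println(fused.value + " round trips: " + fused.roundTrips);
        System.out.println(broken.value + " round trips: " + broken.roundTrips);
    }
}
```

Both runs produce the same value; only the number of round trips differs, and
the code that builds the stage list has no say in that number.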

On Tue, Jan 30, 2018 at 1:31 PM, Romain Manni-Bucau <rmannibu...@gmail.com>
wrote:

> Hmm, this is starting to smell like the old question "how to enforce runner
> constraints without enforcing too much" :(.
>
> Anyway, that is enough for me for this topic.
> Thanks for the clarification and reminders guys.
>
> On Jan 30, 2018 at 22:29, "Reuven Lax" <re...@google.com> wrote:
>
>> Where the split points are depends on the runner. Runners are free to
>> split at any point (and often do to prevent cycles from appearing in the
>> graph).
>>
>> On Tue, Jan 30, 2018 at 1:27 PM, Romain Manni-Bucau <
>> rmannibu...@gmail.com> wrote:
>>
>>> I mostly agree with all of that, which brings me to the interesting point
>>> of this topic: why are coders enforced so strictly if they are unused most
>>> of the time (a flat processor chain, to caricature it)?
>>>
>>> Shouldn't this be relaxed a bit, with coders enforced only at split or
>>> shuffle points?
>>>
>>>
>>> On Jan 30, 2018 at 22:09, "Ben Chambers" <bchamb...@apache.org> wrote:
>>>
>>>> It sounds like in your specific case you're saying that the same
>>>> encoding can be viewed by the Java type system in two different ways. For
>>>> instance, if you have an object Person that is convertible to JSON using
>>>> Jackson, then that JSON encoding can be viewed as either a Person or a
>>>> Map<String, Object> of the JSON fields. In that case, there needs
>>>> to be some kind of "view change" transform to change the type of the
>>>> PCollection.
>>>>
>>>> I'm not sure an untyped API would be better here. Requiring the "view
>>>> change" be explicit means we can ensure the types are compatible, and also
>>>> makes it very clear when this kind of change is desired.
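
As a rough illustration of an explicit "view change", here is a minimal
plain-Java sketch. It is not Beam code: Person, PERSON_TO_MAP and
MAP_TO_PERSON are hypothetical names, and the field names are taken from the
Person example in this thread. The conversion is an ordinary typed function
between the two views, so the compiler checks both ends and no JSON round
trip is involved.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Toy model only: none of these types are Beam or Jackson classes.
class ViewChangeSketch {

    /** Hypothetical element type with the two fields from the thread. */
    static final class Person {
        final String name;
        final long id;

        Person(String name, long id) {
            this.name = name;
            this.id = id;
        }
    }

    /** Explicit, typed view change: Person -> Map of its fields. */
    static final Function<Person, Map<String, Object>> PERSON_TO_MAP = p -> {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("name", p.name);
        fields.put("id", p.id);
        return fields;
    };

    /** The reverse view change: Map of fields -> Person. */
    static final Function<Map<String, Object>, Person> MAP_TO_PERSON = fields ->
            new Person((String) fields.get("name"),
                       ((Number) fields.get("id")).longValue());

    public static void main(String[] args) {
        Map<String, Object> view = PERSON_TO_MAP.apply(new Person("Ada", 1L));
        System.out.println(view);
    }
}
```

In a pipeline, a function like this would presumably sit inside an
element-wise transform between the two collections, which is what makes the
type change visible.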
>>>>
>>>> Some background on Coders that may be relevant:
>>>>
>>>> It might help to think about Coders as the specification of how
>>>> elements in a PCollection are encoded if/when the runner needs to encode
>>>> them. If you are trying to read JSON or XML records from a source, that is
>>>> part of the source transform (reading JSON or XML records) and not part of
>>>> the collection produced by the transform.
>>>>
>>>> Consider further -- even if you read XML records from a source, you
>>>> likely *wouldn't* want to use an XML Coder for those records within the
>>>> pipeline, as every time the pipeline needed to serialize them you would
>>>> produce much larger amounts of data (XML is not an efficient/compact
>>>> encoding). Instead, you likely want to read XML records from the source and
>>>> then encode those within the pipeline using something more efficient. You
>>>> can then convert them to something more readable but possibly less
>>>> efficient before they exit the pipeline at a sink.
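
A small sketch of the size difference, in plain Java with only the standard
library. The record shape (a name and an id), the hand-built XML string, and
the DataOutputStream layout are all assumptions chosen for illustration, not
what any Beam coder actually writes. The XML helper does no escaping, so it is
only suitable for this demo.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Toy comparison only: neither method is a Beam coder.
class EncodingSizeSketch {

    /** Text encoding: tag names are repeated in every single record. */
    static byte[] encodeAsXml(String name, long id) {
        String xml = "<person><name>" + name + "</name><id>" + id
                + "</id></person>";
        return xml.getBytes(StandardCharsets.UTF_8);
    }

    /** Binary encoding: length-prefixed name followed by an 8-byte id. */
    static byte[] encodeCompact(String name, long id) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(name);
            out.writeLong(id);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println("xml bytes:     " + encodeAsXml("Ada", 42L).length);
        System.out.println("compact bytes: " + encodeCompact("Ada", 42L).length);
    }
}
```

The gap comes from the markup, which is paid again on every element each time
the pipeline has to serialize.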
>>>>
>>>> On Tue, Jan 30, 2018 at 12:23 PM Kenneth Knowles <k...@google.com>
>>>> wrote:
>>>>
>>>>> Ah, this is a point that Robert brings up quite often: one reason we
>>>>> put coders on PCollections instead of doing that work in PTransforms is
>>>>> that the runner (plus SDK harness) can automatically serialize only when
>>>>> necessary. So the default in Beam is that the thing you want to happen is
>>>>> already done. There are some corner cases when you get to the portability
>>>>> framework, but I am pretty sure it already works this way. If you show
>>>>> which parts of your example are PTransforms and which are PCollections,
>>>>> it might show where we can fix things.
>>>>>
>>>>> On Tue, Jan 30, 2018 at 12:17 PM, Romain Manni-Bucau <
>>>>> rmannibu...@gmail.com> wrote:
>>>>>
>>>>>> Indeed,
>>>>>>
>>>>>> I'll take a silly example to keep it short.
>>>>>> I have a source emitting Person objects ({name:...,id:...})
>>>>>> serialized with Jackson as JSON.
>>>>>> Then my pipeline processes them with a DoFn taking a Map<String,
>>>>>> String>. Here I set the coder to read JSON as a map.
>>>>>>
>>>>>> However, a Map<String, String> is not a Person, so my pipeline needs an
>>>>>> intermediate step to convert one into the other, and the design therefore
>>>>>> includes a useless serialization round trip.
>>>>>>
>>>>>> If you check the chain, you have: Person -> JSON -> Map<String,
>>>>>> String> -> JSON -> Map<String, String>, whereas Person -> JSON ->
>>>>>> Map<String, String> is sufficient because the JSON is equivalent in
>>>>>> this example.
>>>>>>
>>>>>> In other words, if one coder's output is readable as another coder's
>>>>>> input, Java's strong typing doesn't know about it and can force some
>>>>>> artificial steps.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Romain Manni-Bucau
>>>>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>>>> <http://rmannibucau.wordpress.com> | Github
>>>>>> <https://github.com/rmannibucau> | LinkedIn
>>>>>> <https://www.linkedin.com/in/rmannibucau>
>>>>>>
>>>>>> 2018-01-30 21:07 GMT+01:00 Kenneth Knowles <k...@google.com>:
>>>>>>
>>>>>>> I'm not sure I understand your question. Can you explain more?
>>>>>>>
>>>>>>> On Tue, Jan 30, 2018 at 11:50 AM, Romain Manni-Bucau <
>>>>>>> rmannibu...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi guys,
>>>>>>>>
>>>>>>>> I just encountered an issue with the pipeline API and wondered if you
>>>>>>>> had thought about it.
>>>>>>>>
>>>>>>>> It can happen that Coders are compatible with each other. A simple
>>>>>>>> example: a text coder such as JSON or XML will be able to read text.
>>>>>>>> However, the pipeline API can't support this directly and
>>>>>>>> forces the user to go through an intermediate typed state.
>>>>>>>>
>>>>>>>> Is there already a way to avoid these useless round trips?
>>>>>>>>
>>>>>>>> Put another way: how should coder transitivity be handled?
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>
