Hmm, this starts to smell like the old question of "how to enforce runner constraints without enforcing too much" :(.
Anyway, that is enough for me on this topic. Thanks for the clarifications and reminders, guys.

On 30 Jan 2018 at 22:29, "Reuven Lax" <re...@google.com> wrote:

> Where the split points are depends on the runner. Runners are free to
> split at any point (and often do, to prevent cycles from appearing in the
> graph).
>
> On Tue, Jan 30, 2018 at 1:27 PM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>
>> I kind of agree with all of that, which brings me to the interesting
>> point of this topic: why are coders enforced so strictly if they are not
>> used most of the time - in a flat processor chain, to caricature it?
>>
>> Shouldn't it be relaxed a bit and only enforced at split or shuffle points?
>>
>> On 30 Jan 2018 at 22:09, "Ben Chambers" <bchamb...@apache.org> wrote:
>>
>>> It sounds like in your specific case the same encoding can be viewed by
>>> the Java type system in two different ways. For instance, if you have an
>>> object Person that is convertible to JSON using Jackson, then that JSON
>>> encoding can be viewed as either a Person or a Map<String, Object> of
>>> the JSON fields. In that case, there needs to be some kind of "view
>>> change" transform to change the type of the PCollection.
>>>
>>> I'm not sure an untyped API would be better here. Requiring the "view
>>> change" to be explicit means we can ensure the types are compatible, and
>>> it also makes it very clear when this kind of change is desired.
>>>
>>> Some background on Coders that may be relevant:
>>>
>>> It might help to think about Coders as the specification of how elements
>>> in a PCollection are encoded if/when the runner needs to. If you are
>>> trying to read JSON or XML records from a source, that is part of the
>>> source transform (reading JSON or XML records) and not part of the
>>> collection produced by the transform.
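[Editor's note: Ben's "view change" idea can be sketched without Beam at all. The point is that the same bytes admit two typed views, and changing the view is an explicit decode rather than a new encoding. Everything here is hypothetical and dependency-free: `Person`, the flat-JSON helpers (no escaping), and the class name are made up; a real pipeline would use Jackson and Beam's Coder API instead of hand-rolled string handling.]

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ViewChange {
    // Hypothetical stand-in for the Person type discussed in the thread.
    record Person(String name, String id) {}

    // Encode a Person as flat JSON (no escaping; a sketch, not a real coder).
    static String encode(Person p) {
        return "{\"name\":\"" + p.name() + "\",\"id\":\"" + p.id() + "\"}";
    }

    // Decode the same bytes as a Map<String, String>: the explicit "view change".
    static Map<String, String> decodeAsMap(String json) {
        Map<String, String> out = new LinkedHashMap<>();
        String body = json.substring(1, json.length() - 1); // strip the braces
        for (String pair : body.split(",")) {
            String[] kv = pair.split(":");
            out.put(kv[0].replace("\"", ""), kv[1].replace("\"", ""));
        }
        return out;
    }

    public static void main(String[] args) {
        Person p = new Person("Ada", "1");
        String wire = encode(p);                       // one encoding of p
        Map<String, String> view = decodeAsMap(wire);  // same bytes, different type
        System.out.println(wire);
        System.out.println(view.get("name") + " " + view.get("id"));
    }
}
```

The type change happens entirely at decode time; no second wire format is introduced, which is what makes the explicit view change cheap.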
>>> Consider further -- even if you read XML records from a source, you
>>> likely *wouldn't* want to use an XML Coder for those records within the
>>> pipeline, as every time the pipeline needed to serialize them you would
>>> produce much larger amounts of data (XML is not an efficient/compact
>>> encoding). Instead, you likely want to read XML records from the source,
>>> encode them within the pipeline using something more efficient, and then
>>> convert them to something more readable but possibly less efficient
>>> before they exit the pipeline at a sink.
>>>
>>> On Tue, Jan 30, 2018 at 12:23 PM Kenneth Knowles <k...@google.com> wrote:
>>>
>>>> Ah, this is a point that Robert brings up quite often: one reason we
>>>> put coders on PCollections instead of doing that work in PTransforms is
>>>> that the runner (plus SDK harness) can automatically serialize only
>>>> when necessary. So in Beam the thing you want to happen is already the
>>>> default. There are some corner cases when you get to the portability
>>>> framework, but I am pretty sure it already works this way. If you show
>>>> what is a PTransform and what is a PCollection in your example, it
>>>> might show where we can fix things.
>>>>
>>>> On Tue, Jan 30, 2018 at 12:17 PM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>>
>>>>> Indeed,
>>>>>
>>>>> I'll take a trivial example to keep it short. I have a source emitting
>>>>> Person objects ({name:...,id:...}) serialized with Jackson as JSON.
>>>>> Then my pipeline processes them with a DoFn taking a Map<String,
>>>>> String>, so I set the coder to read the JSON as a map.
>>>>>
>>>>> However, a Map<String, String> is not a Person, so my pipeline needs
>>>>> an intermediate step to convert one into the other and has, by design,
>>>>> a useless serialization round trip.
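[Editor's note: Kenn's point -- that putting coders on PCollections lets the runner serialize only when necessary -- can be illustrated with a dependency-free sketch. Here the `encode`/`decode` helpers stand in for a Coder and simply count how often an element crosses a byte boundary; nothing below is Beam API, and the transform functions are made up.]

```java
import java.util.List;
import java.util.function.Function;

public class FusionSketch {
    static int encodeCalls = 0;

    // Stand-in coder: counts how often an element is actually serialized.
    static byte[] encode(String s) { encodeCalls++; return s.getBytes(); }
    static String decode(byte[] b) { return new String(b); }

    public static void main(String[] args) {
        Function<String, String> upper = String::toUpperCase; // first "transform"
        Function<String, String> exclaim = s -> s + "!";      // second "transform"
        List<String> input = List.of("a", "b");

        // Naive execution: the coder runs between every pair of steps.
        for (String e : input) {
            String mid = decode(encode(upper.apply(e)));
            decode(encode(exclaim.apply(mid)));
        }
        int naive = encodeCalls;

        // Fused execution: steps are chained in memory; the coder only runs
        // at the stage boundary (a shuffle, a sink, or a real split point).
        encodeCalls = 0;
        for (String e : input) {
            encode(exclaim.apply(upper.apply(e)));
        }
        int fused = encodeCalls;

        System.out.println("naive encodes: " + naive + ", fused encodes: " + fused);
    }
}
```

With two elements and two steps, the naive path pays four encodes while the fused path pays two -- the coder is a specification of *how* to serialize, not a promise that serialization happens at every step.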
>>>>> If you check the chain you have: Person -> JSON -> Map<String, String>
>>>>> -> JSON -> Map<String, String>, whereas Person -> JSON -> Map<String,
>>>>> String> is fully sufficient because the two JSON encodings are
>>>>> equivalent in this example.
>>>>>
>>>>> In other words, when one coder's output is readable by another coder's
>>>>> input, Java's strong typing doesn't know about it and can force some
>>>>> fake steps.
>>>>>
>>>>> Romain Manni-Bucau
>>>>> @rmannibucau <https://twitter.com/rmannibucau> | Blog
>>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>>> <http://rmannibucau.wordpress.com> | Github
>>>>> <https://github.com/rmannibucau> | LinkedIn
>>>>> <https://www.linkedin.com/in/rmannibucau>
>>>>>
>>>>> 2018-01-30 21:07 GMT+01:00 Kenneth Knowles <k...@google.com>:
>>>>>
>>>>>> I'm not sure I understand your question. Can you explain more?
>>>>>>
>>>>>> On Tue, Jan 30, 2018 at 11:50 AM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi guys,
>>>>>>>
>>>>>>> I just encountered an issue with the pipeline API and wondered if
>>>>>>> you had thought about it.
>>>>>>>
>>>>>>> It can happen that Coders are compatible with each other. A simple
>>>>>>> example: a text coder like JSON or XML will be able to read text.
>>>>>>> However, with the pipeline API you can't support this directly; the
>>>>>>> user is forced through an intermediate state to stay typed.
>>>>>>>
>>>>>>> Is there already a way to avoid these useless round trips?
>>>>>>>
>>>>>>> Said otherwise: how do we handle coder transitivity?
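[Editor's note: the two chains Romain compares can be made concrete with a toy, dependency-free sketch. The flat-JSON helpers are illustrative stand-ins for Jackson (no escaping, flat string maps only) and `ChainSketch` is a made-up name; the only point is that the forced round trip costs one extra serialization per element compared with the direct Person -> JSON -> Map path.]

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ChainSketch {
    static int serializations = 0;

    // Toy JSON encoder for a flat string map (no escaping; illustration only).
    static String toJson(Map<String, String> m) {
        serializations++;
        StringBuilder sb = new StringBuilder("{");
        m.forEach((k, v) -> sb.append("\"").append(k).append("\":\"").append(v).append("\","));
        if (sb.charAt(sb.length() - 1) == ',') sb.setLength(sb.length() - 1);
        return sb.append("}").toString();
    }

    static Map<String, String> fromJson(String json) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String pair : json.substring(1, json.length() - 1).split(",")) {
            String[] kv = pair.split(":");
            out.put(kv[0].replace("\"", ""), kv[1].replace("\"", ""));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> person = new LinkedHashMap<>();
        person.put("name", "Ada");
        person.put("id", "1");

        // Forced chain: Person -> JSON -> Map -> JSON -> Map (two serializations).
        Map<String, String> viaRoundTrip = fromJson(toJson(fromJson(toJson(person))));
        System.out.println("serializations with round trip: " + serializations);

        serializations = 0;
        // Sufficient chain: Person -> JSON -> Map (one serialization).
        Map<String, String> direct = fromJson(toJson(person));
        System.out.println("serializations direct: " + serializations);
    }
}
```

Both chains produce the same map; only the serialization count differs, which is exactly the "fake step" the thread is about.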