I mostly agree with all of that, and it brings me to the interesting point
of this topic: why are coders enforced so strictly if they are not used most
of the time (a flat processor chain, to caricature it)?

Shouldn't they be relaxed a bit and only enforced at split or shuffle points?


On 30 Jan 2018 at 22:09, "Ben Chambers" <bchamb...@apache.org> wrote:

> It sounds like in your specific case you're saying that the same encoding
> can be viewed by the Java type system in two different ways. For instance, if
> you have an object Person that is convertible to JSON using Jackson, then
> that JSON encoding can be viewed as either a Person or a Map<String,
> Object> of the JSON fields. In that case, there needs to be some kind of
> "view change" transform to change the type of the
> PCollection.
>
> I'm not sure an untyped API would be better here. Requiring the "view
> change" be explicit means we can ensure the types are compatible, and also
> makes it very clear when this kind of change is desired.
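A minimal sketch of such an explicit view change, using an invented key=value text format in place of Jackson JSON (plain Java stdlib only; the class and format are hypothetical, not Beam or Jackson APIs): the same encoded bytes can be decoded as either a Person or a Map<String, String>.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy stand-in for a Jackson-style encoding: the same bytes can be
// decoded either as a Person or as a Map<String, String>.
public class ViewChange {
    record Person(String name, String id) {}

    // Encode a Person as simple "key=value" lines (invented format).
    static String encode(Person p) {
        return "name=" + p.name() + "\n" + "id=" + p.id() + "\n";
    }

    // View 1: decode the bytes back into a Person.
    static Person decodeAsPerson(String data) {
        Map<String, String> m = decodeAsMap(data);
        return new Person(m.get("name"), m.get("id"));
    }

    // View 2: decode the very same bytes as a generic Map.
    static Map<String, String> decodeAsMap(String data) {
        Map<String, String> m = new LinkedHashMap<>();
        for (String line : data.split("\n")) {
            int eq = line.indexOf('=');
            m.put(line.substring(0, eq), line.substring(eq + 1));
        }
        return m;
    }

    public static void main(String[] args) {
        String bytes = encode(new Person("Ada", "42"));
        // One encoding, two typed views -- the "view change" is explicit.
        System.out.println(decodeAsPerson(bytes));
        System.out.println(decodeAsMap(bytes));
    }
}
```

Making the view change an explicit step is what lets the type system check that both views are compatible with the one underlying encoding.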
>
> Some background on Coders that may be relevant:
>
> It might help to think about Coders as the specification of how
> elements in a PCollection are encoded if/when the runner needs to. If you
> are trying to read JSON or XML records from a source, that is part of the
> source transform (reading JSON or XML records) and not part of the
> collection produced by the transform.
>
> Consider further -- even if you read XML records from a source, you likely
> *wouldn't* want to use an XML Coder for those records within the pipeline,
> as every time the pipeline needed to serialize them you would produce much
> larger amounts of data (XML is not an efficient/compact encoding). Instead,
> you likely want to read XML records from the source and then encode those
> within the pipeline using something more efficient. Then convert them to
> something more readable but possibly less-efficient before they exit the
> pipeline at a sink.
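The size argument can be made concrete with a quick sketch (plain Java stdlib only; the DataOutputStream layout is an invented stand-in for an efficient coder, not an actual Beam coder):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch: the same logical record encoded two ways. An XML-style text
// encoding is considerably larger than a compact binary one, which is
// why you would re-encode records inside the pipeline.
public class EncodingCost {
    static byte[] asXml(String name, long id) {
        return ("<person><name>" + name + "</name><id>" + id + "</id></person>")
                .getBytes();
    }

    static byte[] asBinary(String name, long id) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeUTF(name); // 2-byte length prefix + UTF-8 bytes
        out.writeLong(id);  // fixed 8 bytes
        out.flush();
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("xml    : " + asXml("Ada", 42L).length + " bytes");
        System.out.println("binary : " + asBinary("Ada", 42L).length + " bytes");
    }
}
```

Every shuffle or checkpoint multiplies that per-record difference, so the internal coder choice matters even when the source and sink formats are fixed.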
>
> On Tue, Jan 30, 2018 at 12:23 PM Kenneth Knowles <k...@google.com> wrote:
>
>> Ah, this is a point that Robert brings up quite often: one reason we put
>> coders on PCollections instead of doing that work in PTransforms is that
>> the runner (plus SDK harness) can automatically serialize only when
>> necessary. So the default in Beam is that the thing you want to happen is
>> already done. There are some corner cases when you get to the portability
>> framework but I am pretty sure it already works this way. If you show what
>> is a PTransform and PCollection in your example it might show where we can
>> fix things.
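That point can be modeled with a toy fused chain in plain Java (no Beam APIs; stream operations stand in for fused transforms, and String conversion stands in for a coder -- all names are illustrative):

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Toy model of why coders live on PCollections rather than in transforms:
// fused steps hand objects along in memory, and encoding happens only
// when the runner must materialize the data (e.g. at a shuffle). Purely
// illustrative; real Beam runners make this decision internally.
public class FusedChain {
    static String run(List<String> input) {
        Function<String, String> trim = String::trim;
        Function<String, Integer> parse = Integer::parseInt;

        // Two "transforms" fused together: objects flow directly,
        // no coder is invoked between them.
        List<Integer> inMemory = input.stream()
                .map(trim.andThen(parse))
                .collect(Collectors.toList());

        // The "coder" is applied once, only at the materialization point.
        return inMemory.stream()
                .map(String::valueOf)
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(" 1 ", " 2 "))); // prints 1,2
    }
}
```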
>>
>> On Tue, Jan 30, 2018 at 12:17 PM, Romain Manni-Bucau <
>> rmannibu...@gmail.com> wrote:
>>
>>> Indeed,
>>>
>>> I'll take a deliberately simple example to keep it short.
>>> I have a source emitting Person objects ({name:...,id:...}) serialized
>>> with Jackson as JSON.
>>> Then my pipeline processes them with a DoFn taking a Map<String,
>>> String>, so I set the coder to read the JSON as a map.
>>>
>>> However, a Map<String, String> is not a Person, so my pipeline needs an
>>> intermediate step to convert one into the other, and the design includes
>>> a useless serialization round trip.
>>>
>>> If you check the chain you get: Person -> JSON -> Map<String, String>
>>> -> JSON -> Map<String, String>, whereas Person -> JSON -> Map<String,
>>> String> is fully sufficient, because the two JSON encodings are
>>> equivalent in this example.
>>>
>>> In other words, if one coder's output is readable by another coder's
>>> input, Java's strong typing doesn't know about it and can force some
>>> artificial steps.
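The extra round trip can be counted in a toy model (plain Java stdlib; the key=value "encoding" is an invented stand-in for JSON, and no Beam APIs are involved):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Counts coder invocations: the type-driven chain
// Person -> bytes -> Map -> bytes -> Map costs one extra round trip
// compared with reading the Person coder's output directly as a Map.
public class RoundTrips {
    static int coderCalls = 0;

    static String encode(Map<String, String> m) {
        coderCalls++;
        return m.toString(); // e.g. {name=Ada}
    }

    static Map<String, String> decode(String s) {
        coderCalls++;
        Map<String, String> m = new LinkedHashMap<>();
        for (String kv : s.substring(1, s.length() - 1).split(", ")) {
            int eq = kv.indexOf('=');
            m.put(kv.substring(0, eq), kv.substring(eq + 1));
        }
        return m;
    }

    public static void main(String[] args) {
        Map<String, String> person = Map.of("name", "Ada");

        // Chain forced by strong typing: encode, decode, encode, decode.
        decode(encode(decode(encode(person))));
        int redundant = coderCalls;

        coderCalls = 0;
        // Direct chain: the Map coder reads the Person coder's output as-is.
        decode(encode(person));
        System.out.println(redundant + " coder calls vs " + coderCalls); // 4 vs 2
    }
}
```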
>>>
>>>
>>>
>>> Romain Manni-Bucau
>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>> <http://rmannibucau.wordpress.com> | Github
>>> <https://github.com/rmannibucau> | LinkedIn
>>> <https://www.linkedin.com/in/rmannibucau>
>>>
>>> 2018-01-30 21:07 GMT+01:00 Kenneth Knowles <k...@google.com>:
>>>
>>>> I'm not sure I understand your question. Can you explain more?
>>>>
>>>> On Tue, Jan 30, 2018 at 11:50 AM, Romain Manni-Bucau <
>>>> rmannibu...@gmail.com> wrote:
>>>>
>>>>> Hi guys,
>>>>>
>>>>> just encountered an issue with the pipeline API and wondered if you
>>>>> thought about it.
>>>>>
>>>>> It can happen that Coders are compatible with each other. A simple
>>>>> example is that a text coder, like a JSON or XML one, is able to read
>>>>> plain text. However, with the pipeline API you can't support this
>>>>> directly; you force the user to go through an intermediate state just
>>>>> to satisfy the typing.
>>>>>
>>>>> Is there already a way to avoid these useless round trips?
>>>>>
>>>>> Put otherwise: how can coder transitivity be handled?
>>>>>
>>>>> Romain Manni-Bucau
>>>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>>> <http://rmannibucau.wordpress.com> | Github
>>>>> <https://github.com/rmannibucau> | LinkedIn
>>>>> <https://www.linkedin.com/in/rmannibucau>
>>>>>
>>>>
>>>>
>>>
>>
