Sadly yes, and why the PR is actually WIP. As mentionned it modifies it and
requires some updates in other languages and the standard_coders.yml file
(I didn't find how this file was generated).
Since coders must be about volatile data I don't think it is a big deal to
change it though.


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>

2018-02-04 17:34 GMT+01:00 Reuven Lax <re...@google.com>:

> One question - does this change the actual byte encoding of elements?
> We've tried hard not to do that so far for reasons of compatibility.
>
> Reuven
>
> On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>
>> Hi guys,
>>
>> I submitted a PR on coders to enhance 1. the user experience 2. the
>> determinism and handling of coders.
>>
>> 1. the user experience is linked to what i sent some days ago: close
>> handling of the streams from a coder code. Long story short I add a
>> SkipCloseCoder which can decorate a coder and just wraps the stream (input
>> or output) in flavors skipping close() calls. This avoids to do it by
>> default (which had my preference if you read the related thread but not the
>> one of everybody) but also makes the usage of a coder with this issue easy
>> since the of() of the coder just wraps itself in this delagating coder.
>>
>> 2. this one is more nasty and mainly concerns IterableLikeCoders. These
>> ones use this kind of algorithm (keep in mind they work on a list):
>>
>> writeSize()
>> for all element e {
>>     elementCoder.write(e)
>> }
>> writeMagicNumber() // this one depends the size
>>
>> The decoding is symmetric so I bypass it here.
>>
>> Indeed all these writes (reads) are done on the same stream. Therefore it
>> assumes you read as much bytes than you write...which is a huge assumption
>> for a coder which should by contract assume it can read the stream...as a
>> stream (until -1).
>>
>> The idea of the fix is to change this encoding to this kind of algorithm:
>>
>> writeSize()
>> for all element e {
>>     writeElementByteCount(e)
>>     elementCoder.write(e)
>> }
>> writeMagicNumber() // still optionally
>>
>> This way on the decode size you can wrap the stream by element to enforce
>> the limitation of the byte count.
>>
>> Side note: this indeed enforce a limitation due to java byte limitation
>> but if you check coder code it is already here at the higher level so it is
>> not a big deal for now.
>>
>> In terms of implementation it uses a LengthAwareCoder which delegates to
>> another coder the encoding and just adds the byte count before the actual
>> serialization. Not perfect but should be more than enough in terms of
>> support and perf for beam if you think real pipelines (we try to avoid
>> serializations or it is done on some well known points where this algo
>> should be enough...worse case it is not a huge overhead, mainly just some
>> memory overhead).
>>
>>
>> The PR is available at https://github.com/apache/beam/pull/4594. If you
>> check you will see I put it "WIP". The main reason is that it changes the
>> encoding format for containers (lists, iterable, ...) and therefore breaks
>> python/go/... tests and the standard_coders.yml definition. Some help on
>> that would be very welcomed.
>>
>> Technical side note if you wonder: UnownedInputStream doesn't even allow
>> to mark the stream so there is no real fast way to read the stream as fast
>> as possible with standard buffering strategies and to support this
>> automatic IterableCoder wrapping which is implicit. In other words, if beam
>> wants to support any coder, including the ones not requiring to write the
>> size of the output - most of the codecs - then we need to change the way it
>> works to something like that which does it for the user which doesn't know
>> its coder got wrapped.
>>
>> Hope it makes sense, if not, don't hesitate to ask questions.
>>
>> Happy end of week-end.
>>
>> Romain Manni-Bucau
>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>> <https://rmannibucau.metawerx.net/> | Old Blog
>> <http://rmannibucau.wordpress.com> | Github
>> <https://github.com/rmannibucau> | LinkedIn
>> <https://www.linkedin.com/in/rmannibucau> | Book
>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>
>
>

Reply via email to