Could you describe the 2nd issue in a bit more detail, maybe with a short example? LengthAwareCoder in the PR adds another buffer copy (BufferedElementCountingOutputStream already has an extra buffer copy).
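To make the buffer-copy concern concrete, here is a rough sketch of what any length-prefixing wrapper has to do per element (illustrative names only, not the actual code from the PR, and using a plain 4-byte int prefix just for the sketch): because the byte count must be written before the element's bytes, the element has to be encoded into a temporary in-memory buffer first.

// Illustrative sketch only - not the PR's LengthAwareCoder. It shows why a
// length prefix forces an extra copy: the element must be encoded into a
// temporary buffer first, since its byte count has to be written before
// its bytes.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.coders.CustomCoder;

public class LengthPrefixingCoderSketch<T> extends CustomCoder<T> {
  private final Coder<T> delegate;

  public LengthPrefixingCoderSketch(Coder<T> delegate) {
    this.delegate = delegate;
  }

  @Override
  public void encode(T value, OutputStream outStream) throws IOException {
    // Extra copy #1: encode into an in-memory buffer just to learn the size.
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    delegate.encode(value, buffer);
    byte[] bytes = buffer.toByteArray();   // extra copy #2: buffer -> byte[]
    DataOutputStream out = new DataOutputStream(outStream);
    out.writeInt(bytes.length);            // the byte-count prefix
    out.write(bytes);                      // only now hit the real stream
    out.flush();
  }

  @Override
  public T decode(InputStream inStream) throws IOException {
    // The decode side mirrors it: read exactly `length` bytes so the delegate
    // can read "until -1" without consuming the next element's bytes.
    DataInputStream in = new DataInputStream(inStream);
    int length = in.readInt();
    byte[] bytes = new byte[length];
    in.readFully(bytes);
    return delegate.decode(new ByteArrayInputStream(bytes));
  }
}

On top of whatever BufferedElementCountingOutputStream already buffers, that per-element round trip through a byte[] is the additional copy I mean. A similar sketch for the close-skipping decoration (issue 1) is at the bottom, after the quoted thread.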
On Mon, Feb 5, 2018 at 10:34 AM, Romain Manni-Bucau <[email protected]> wrote:

> Would this work for everyone - I can update the PR if so:
>
> If coder is not built in
>     Prefix with byte size
> Else
>     Current behavior
>
> ?
>
> On Feb 5, 2018 19:21, "Romain Manni-Bucau" <[email protected]> wrote:
>
>> Answered inline, but I want to highlight that Beam is a portable API on top of well-known vendor APIs which have friendly shortcuts. So the background here is to make Beam at least user friendly.
>>
>> I'm fine if the outcome of the discussion is that the coder concept is wrong or something like that, but I'm not fine with saying we don't want to solve an API issue, not to say a bug, of a project whose added value is its API.
>>
>> I understand the perf concern, which must be balanced against the fact that deserialization is not used for each step/transform, and that the coder API is currently already intrusive and heavy for devs, yet not usable by most existing codecs out there. Even some JAXB or plain XML flavors don't work with it :(.
>>
>> On Feb 5, 2018 18:46, "Robert Bradshaw" <[email protected]> wrote:
>>
>> On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau <[email protected]> wrote:
>> > Hi guys,
>> >
>> > I submitted a PR on coders to enhance 1. the user experience and 2. the determinism and handling of coders.
>> >
>> > 1. The user experience part is linked to what I sent some days ago: close handling of the streams from coder code. Long story short, I add a SkipCloseCoder which can decorate a coder and just wraps the stream (input or output) in flavors skipping close() calls. This avoids doing it by default (which had my preference if you read the related thread, but not everybody's), and it also makes using a coder with this issue easy, since the coder's of() just wraps itself in this delegating coder.
>> >
>> > 2. This one is nastier and mainly concerns IterableLikeCoders. These use this kind of algorithm (keep in mind they work on a list):
>> >
>> > writeSize()
>> > for all elements e {
>> >     elementCoder.write(e)
>> > }
>> > writeMagicNumber() // this one depends on the size
>> >
>> > The decoding is symmetric, so I skip it here.
>> >
>> > Indeed, all these writes (reads) are done on the same stream. Therefore it assumes you read as many bytes as you write... which is a huge assumption for a coder which should, by contract, be able to read the stream... as a stream (until -1).
>> >
>> > The idea of the fix is to change this encoding to this kind of algorithm:
>> >
>> > writeSize()
>> > for all elements e {
>> >     writeElementByteCount(e)
>> >     elementCoder.write(e)
>> > }
>> > writeMagicNumber() // still optional
>>
>> Regardless of the backwards incompatibility issues, I'm unconvinced that prefixing every element with its length is a good idea. It can lead to blow-up in size (e.g. a list of ints, and it should be noted that containers with lots of elements bias towards having small elements), and writeElementByteCount(e) could also be very inefficient for many types e (e.g. a list of lists).
>>
>> What is your proposal then, Robert? The current restriction is clearly a blocker for portability, users, and determinism, and it is unsafe and only checkable at runtime, so not something we should keep.
>>
>> The alternative I thought about was to forbid implicit coders, but that doesn't help users.
>> >
>> > This way, on the decode side, you can wrap the stream per element to enforce the limitation of the byte count.
>> >
>> > Side note: this indeed enforces a limitation due to the Java byte-count limitation, but if you check the coder code it is already there at the higher level, so it is not a big deal for now.
>> >
>> > In terms of implementation, it uses a LengthAwareCoder which delegates the encoding to another coder and just adds the byte count before the actual serialization. Not perfect, but it should be more than enough in terms of support and perf for Beam if you think of real pipelines (we try to avoid serializations, or they are done at some well-known points where this algorithm should be enough... worst case it is not a huge overhead, mainly just some memory overhead).
>> >
>> > The PR is available at https://github.com/apache/beam/pull/4594. If you check it you will see I marked it "WIP". The main reason is that it changes the encoding format for containers (lists, iterables, ...) and therefore breaks the python/go/... tests and the standard_coders.yml definition. Some help on that would be very welcome.
>> >
>> > Technical side note if you wonder: UnownedInputStream doesn't even allow marking the stream, so there is no real way to read the stream as fast as possible with standard buffering strategies while supporting this automatic, implicit IterableCoder wrapping. In other words, if Beam wants to support any coder, including the ones that don't need to write the size of their output - most codecs - then we need to change the way it works to something like this, which does it for the user, who doesn't know their coder got wrapped.
>> >
>> > Hope it makes sense; if not, don't hesitate to ask questions.
>> >
>> > Happy end of weekend.
>> >
>> > Romain Manni-Bucau
>> > @rmannibucau | Blog | Old Blog | Github | LinkedIn | Book
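Coming back to issue 1 from the quoted thread, the close-skipping decoration could look roughly like the sketch below (again illustrative only, not the PR's SkipCloseCoder): the streams handed to the delegate simply turn close() into a no-op, so a codec that insists on closing its stream - the JAXB/XML cases mentioned above - does not close the runner-owned one.

// Illustrative sketch only - not the PR's SkipCloseCoder. The delegate coder
// may call close() on "its" stream; the wrappers below swallow that call so
// the runner-owned stream stays open.
import java.io.FilterInputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.coders.CustomCoder;

public class SkipCloseCoderSketch<T> extends CustomCoder<T> {
  private final Coder<T> delegate;

  public SkipCloseCoderSketch(Coder<T> delegate) {
    this.delegate = delegate;
  }

  @Override
  public void encode(T value, OutputStream outStream) throws IOException {
    delegate.encode(value, new FilterOutputStream(outStream) {
      @Override
      public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len); // forward bulk writes; FilterOutputStream's default is byte-by-byte
      }

      @Override
      public void close() throws IOException {
        flush(); // swallow close(): the underlying stream is not ours to close
      }
    });
  }

  @Override
  public T decode(InputStream inStream) throws IOException {
    return delegate.decode(new FilterInputStream(inStream) {
      @Override
      public void close() {
        // no-op: leave the runner-owned stream open
      }
    });
  }
}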
