From a brief reading of this discussion: if I understand correctly, we want
something to help deal with libraries that assume that they own the stream
(e.g. some common xml or json parsers), when using them in a context where
they don't (inside a Coder).

Setting aside the question of "why would one even use an xml or json
library in a coder" (coders should be efficient, and the wire format of a
coder is not intended to be readable by anything except this exact coder
itself), and the point that ideally users would simply never write new
coders (I'm hoping schemas can move us in that direction) - I think we just
want an adapter for input and output streams that handles this, placed in
front of both reading and writing when using the library.

One way to build such an adapter is length prefixing:
- output stream collects all bytes into a byte array, and when closed,
writes the array length-prefixed
- input stream reads the length, and then lets the consumer read only that
many bytes
This adds an extra copy (extra GC pressure + extra maximum memory usage). I
suppose there has to be some price for integrating with a misbehaving
library, but I, like some others in this thread, would not be comfortable
having this overhead unconditionally even when it's not needed.
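Sketched in Java, such a pair of adapters might look like the following. The class names and the 4-byte int prefix are illustrative choices for this sketch, not an existing Beam API:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Buffers everything the library writes, then emits it length-prefixed on close().
class LengthPrefixingOutputStream extends OutputStream {
  private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
  private final DataOutputStream target;

  LengthPrefixingOutputStream(OutputStream target) {
    this.target = new DataOutputStream(target);
  }

  @Override public void write(int b) { buffer.write(b); }
  @Override public void write(byte[] b, int off, int len) { buffer.write(b, off, len); }

  @Override public void close() throws IOException {
    byte[] bytes = buffer.toByteArray();
    target.writeInt(bytes.length);  // length prefix
    target.write(bytes);
    // Deliberately does NOT close the underlying stream.
  }
}

// Reads the length prefix, then serves at most that many bytes,
// so the library sees EOF exactly at the element boundary.
class LengthLimitedInputStream extends InputStream {
  private final InputStream source;
  private int remaining;

  LengthLimitedInputStream(InputStream source) throws IOException {
    this.source = source;
    this.remaining = new DataInputStream(source).readInt();
  }

  @Override public int read() throws IOException {
    if (remaining <= 0) return -1;  // EOF as far as the library can tell
    int b = source.read();
    if (b >= 0) remaining--;
    return b;
  }
}
```

The extra copy mentioned above is the `ByteArrayOutputStream` buffer: the whole element is materialized in memory before its length is known.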

There are other ways to nest streams with considerably less overhead: e.g.
chunked length-prefixing (also copying, but less), or escaping (having an
"escape" byte, e.g. 0xFF, and when the input contains this byte, writing
0xFF 0xFF). However, I think we still don't need to apply them
unconditionally - I see nothing wrong with applying them on a case-by-case
basis, only when your coder is actually using a misbehaving library.
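For illustration, the escaping variant could be sketched like this. The class names and the choice of 0xFF 0x00 as the end-of-value terminator are assumptions made for this sketch, not anything defined in Beam:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Escaping writer: a literal 0xFF in the data is doubled (0xFF 0xFF),
// and the value is terminated with the sequence 0xFF 0x00.
class EscapingOutputStream extends OutputStream {
  private final OutputStream target;
  EscapingOutputStream(OutputStream target) { this.target = target; }

  @Override public void write(int b) throws IOException {
    if ((b & 0xFF) == 0xFF) target.write(0xFF);  // escape by doubling
    target.write(b);
  }

  @Override public void close() throws IOException {
    target.write(0xFF);  // terminator: escape byte followed by 0x00
    target.write(0x00);
    // Does not close the underlying stream.
  }
}

// Unescaping reader: undoes the doubling and reports EOF at the terminator,
// leaving the rest of the underlying stream untouched.
class UnescapingInputStream extends InputStream {
  private final InputStream source;
  private boolean done = false;
  UnescapingInputStream(InputStream source) { this.source = source; }

  @Override public int read() throws IOException {
    if (done) return -1;
    int b = source.read();
    if (b != 0xFF) return b;
    int next = source.read();
    if (next == 0xFF) return 0xFF;  // doubled escape -> literal 0xFF
    done = true;                    // 0xFF 0x00 marks end of this value
    return -1;
  }
}
```

Unlike length prefixing, this streams byte-by-byte with no buffering, at the cost of up to 2x expansion on data that is mostly 0xFF bytes.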

On Mon, Feb 5, 2018 at 11:00 AM Robert Bradshaw <rober...@google.com> wrote:

> Just to clarify, the issue is that for some types (byte array being
> the simplest) one needs to know the length of the data in order to
> decode it from the stream. In particular, the claim is that many
> libraries out there that do encoding/decoding assume they can gather
> this information from the end of the stream and so don't explicitly
> record it. For nested values, someone needs to record these lengths.
> Note that in the Fn API, nearly everything is nested, as the elements
> are sent as a large byte stream of concatenated encoded elements.
>
> Your proposed solution is to require all container coders (though I
> think your PR only considers IterableLikeCoder, there's others, and
> there's the Elements proto itself) to prefix element encodings with
> sizes so it can give truncated streams on decoding. I think this
> places an undue burden (and code redundancy in) container coders, and
> disallows optimization on those coders that don't need to be length
> prefixed (and note that *prefixing* with length is not the only way to
> delimit a stream, we shouldn't impose that restriction as well).
> Instead, I'd keep things the way they are, but offer a new Coder
> subclass that users can subclass if they want to write an "easy" Coder
> that does the delimiting for them (on encode and decode). We would
> point users to this as a good option for writing custom coders in the
> easiest way possible, and it keeps the current Coder API the same.
>
> On Mon, Feb 5, 2018 at 10:21 AM, Romain Manni-Bucau
> <rmannibu...@gmail.com> wrote:
> > Answered inline, but I want to highlight that Beam is a portable API on
> > top of well-known vendor APIs which have friendly shortcuts. So the
> > background here is to make Beam at least user friendly.
> >
> > I'm fine if the outcome of the discussion is that the coder concept is
> > wrong or something like that, but I'm not fine with saying we don't want
> > to solve an API issue, not to say bug, of a project which has its API as
> > added value.
> >
> > I understand the perf concern, which must be balanced with the fact that
> > deserialization is not used for each step/transform, and that currently
> > the coder API is already intrusive and heavy for devs but also not usable
> > by most existing codecs out there. Even some JAXB or plain XML flavors
> > don't work with it :(.
> >
> >
> > On 5 Feb 2018 at 18:46, "Robert Bradshaw" <rober...@google.com> wrote:
> >
> > On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau
> > <rmannibu...@gmail.com> wrote:
> >> Hi guys,
> >>
> >> I submitted a PR on coders to enhance 1. the user experience 2. the
> >> determinism and handling of coders.
> >>
> >> 1. The user experience is linked to what I sent some days ago: close
> >> handling of the streams from coder code. Long story short, I add a
> >> SkipCloseCoder which can decorate a coder and just wraps the stream
> >> (input or output) in flavors skipping close() calls. This avoids doing
> >> it by default (which had my preference if you read the related thread,
> >> but not everybody's) but also makes the usage of a coder with this issue
> >> easy, since the of() of the coder just wraps itself in this delegating
> >> coder.
> >>
> >> 2. This one is more nasty and mainly concerns IterableLikeCoders. These
> >> ones use this kind of algorithm (keep in mind they work on a list):
> >>
> >> writeSize()
> >> for all element e {
> >>     elementCoder.write(e)
> >> }
> >> writeMagicNumber() // this one depends on the size
> >>
> >> The decoding is symmetric so I skip it here.
> >>
> >> Indeed all these writes (reads) are done on the same stream. Therefore
> >> it assumes you read as many bytes as you write... which is a huge
> >> assumption for a coder which should by contract assume it can read the
> >> stream... as a stream (until -1).
> >>
> >> The idea of the fix is to change this encoding to this kind of
> >> algorithm:
> >>
> >> writeSize()
> >> for all element e {
> >>     writeElementByteCount(e)
> >>     elementCoder.write(e)
> >> }
> >> writeMagicNumber() // still optional
> >
> > Regardless of the backwards incompatibility issues, I'm unconvinced
> > that prefixing every element with its length is a good idea. It can
> > lead to blow-up in size (e.g. a list of ints, and it should be noted
> > that containers with lots of elements bias towards having small
> > elements) and also writeElementByteCount(e) could be very inefficient
> > for many types e (e.g. a list of lists).
> >
> >
> > What is your proposal then, Robert? The current restriction is clearly a
> > blocker for portability, users, and determinism, and is unsafe and only
> > checkable at runtime, so not something we should want to keep.
> >
> > An alternative I thought about was to forbid implicit coders, but it
> > doesn't help users.
> >
> >
> >
> >> This way, on the decode side, you can wrap the stream per element to
> >> enforce the limitation of the byte count.
> >>
> >> Side note: this indeed enforces a limitation due to Java's byte count
> >> limits, but if you check the coder code it is already there at the
> >> higher level, so it is not a big deal for now.
> >>
> >> In terms of implementation it uses a LengthAwareCoder which delegates
> >> the encoding to another coder and just adds the byte count before the
> >> actual serialization. Not perfect, but it should be more than enough in
> >> terms of support and perf for Beam if you think of real pipelines (we
> >> try to avoid serializations, or they are done at some well-known points
> >> where this algorithm should be enough... worst case it is not a huge
> >> overhead, mainly just some memory overhead).
> >>
> >>
> >> The PR is available at https://github.com/apache/beam/pull/4594. If you
> >> check it you will see I marked it "WIP". The main reason is that it
> >> changes the encoding format for containers (lists, iterables, ...) and
> >> therefore breaks python/go/... tests and the standard_coders.yml
> >> definition. Some help on that would be very welcome.
> >>
> >> Technical side note if you wonder: UnownedInputStream doesn't even
> >> allow marking the stream, so there is no real fast way to read the
> >> stream as fast as possible with standard buffering strategies while
> >> supporting this automatic IterableCoder wrapping, which is implicit. In
> >> other words, if Beam wants to support any coder, including the ones not
> >> requiring to write the size of the output - most of the codecs - then we
> >> need to change the way it works to something like that, which does it
> >> for the user, who doesn't know their coder got wrapped.
> >>
> >> Hope it makes sense, if not, don't hesitate to ask questions.
> >>
> >> Happy end of week-end.
> >>
> >> Romain Manni-Bucau
> >> @rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book
> >
> >
>