Does it mean we would change the implicit resolution? Do you see it being backward compatible? If so, it sounds like a good solution.
On Feb 5, 2018 at 20:36, "Kenneth Knowles" <k...@google.com> wrote:
> TL;DR: creating _new_ coders is not a problem. If you have a new idea for an
> encoding, you can build it alongside and users can use it. We also need
> data migration, and this is probably the easy way to be ready for that.
>
> We made a pretty big mistake in our naming of ListCoder, SetCoder, and
> IterableLikeCoder, because they make users think it is the
> only/best/canonical encoding. We did it right with e.g. VarLongCoder and
> BigEndianLongCoder. There is a default, but it is just a default.
>
> We actually already need "SetIterableLikeCoder" (aka SetCoder) and perhaps
> "LexicallySortedBytesSetCoder" so we can change coder inference to ask for
> a deterministic coder when it is needed, instead of first asking for "any"
> coder and then crashing when we get the wrong type.
>
> Kenn
>
> On Mon, Feb 5, 2018 at 11:00 AM, Robert Bradshaw <rober...@google.com>
> wrote:
>
>> Just to clarify, the issue is that for some types (a byte array being
>> the simplest) one needs to know the length of the data in order to
>> decode it from the stream. In particular, the claim is that many
>> libraries out there that do encoding/decoding assume they can gather
>> this information from the end of the stream, and so don't explicitly
>> record it. For nested values, someone needs to record these lengths.
>> Note that in the Fn API, nearly everything is nested, as the elements
>> are sent as a large byte stream of concatenated encoded elements.
>>
>> Your proposed solution is to require all container coders (though I
>> think your PR only considers IterableLikeCoder; there are others, and
>> there's the Elements proto itself) to prefix element encodings with
>> sizes so it can give truncated streams on decoding.
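Robert's point above can be illustrated without Beam at all. The sketch below is a hypothetical stand-alone example (none of these method names are real Beam APIs): a decoder that reads until end-of-stream works when it owns the whole stream, but loses element boundaries once elements are concatenated, whereas a length-prefixed variant does not.

```java
import java.io.*;

public class NestedDecodingSketch {
    // The style many third-party codecs use: consume the whole stream on
    // decode, assuming "my data ends at EOF".
    static byte[] readAllUntilEof(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            buf.write(b);
        }
        return buf.toByteArray();
    }

    // Length-prefixed variant: records how many bytes belong to this
    // element, so later elements on the same stream stay untouched.
    static void writePrefixed(byte[] value, DataOutputStream out) throws IOException {
        out.writeInt(value.length);
        out.write(value);
    }

    static byte[] readPrefixed(DataInputStream in) throws IOException {
        byte[] value = new byte[in.readInt()];
        in.readFully(value);
        return value;
    }

    public static void main(String[] args) throws IOException {
        // Two elements concatenated on one stream, as in the Fn API.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        writePrefixed(new byte[]{1, 2, 3}, out);
        writePrefixed(new byte[]{4, 5}, out);

        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        System.out.println(readPrefixed(in).length); // 3: first element only
        System.out.println(readPrefixed(in).length); // 2: second element intact

        // Without the prefix, a read-to-EOF decode swallows both elements:
        InputStream raw = new ByteArrayInputStream(new byte[]{1, 2, 3, 4, 5});
        System.out.println(readAllUntilEof(raw).length); // 5: boundary lost
    }
}
```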
>> I think this
>> places an undue burden on (and code redundancy in) container coders, and
>> disallows optimization for those coders that don't need to be length
>> prefixed (and note that *prefixing* with length is not the only way to
>> delimit a stream; we shouldn't impose that restriction either).
>> Instead, I'd keep things the way they are, but offer a new Coder
>> subclass that users can subclass if they want to write an "easy" Coder
>> that does the delimiting for them (on encode and decode). We would
>> point users to this as a good option for writing custom coders in the
>> easiest way possible, and it keeps the current Coder API the same.
>>
>> On Mon, Feb 5, 2018 at 10:21 AM, Romain Manni-Bucau
>> <rmannibu...@gmail.com> wrote:
>> > Answered inline, but I want to highlight that Beam is a portable API
>> > on top of well-known vendor APIs which have friendly shortcuts. So the
>> > background here is to make Beam at least user friendly.
>> >
>> > I'm fine if the outcome of the discussion is that the coder concept is
>> > wrong or something like that, but I'm not fine with saying we don't
>> > want to solve an API issue, not to say a bug, in a project whose API
>> > is its added value.
>> >
>> > I understand the perf concern, which must be balanced against the fact
>> > that deserialization is not used for each step/transform, and that the
>> > coder API is already intrusive and heavy for devs, but also not usable
>> > by most existing codecs out there. Even some JAXB or plain XML flavors
>> > don't work with it :(.
>> >
>> > On Feb 5, 2018 at 18:46, "Robert Bradshaw" <rober...@google.com>
>> > wrote:
>> >
>> > On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau
>> > <rmannibu...@gmail.com> wrote:
>> >> Hi guys,
>> >>
>> >> I submitted a PR on coders to enhance 1. the user experience, 2. the
>> >> determinism and handling of coders.
>> >>
>> >> 1. The user experience part is linked to what I sent some days ago:
>> >> close handling of the streams from coder code.
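The "easy Coder subclass" Robert suggests could look roughly like the following. This is a hedged sketch, not Beam code: the names DelimitedCoder and BytesCoder are hypothetical, and the real Coder API has more methods. The point is only the shape: subclasses get a stream that spans exactly one element, so read-until-EOF logic becomes safe, and the base class owns the delimiting.

```java
import java.io.*;

// Sketch of a delimiting base class: subclasses implement encode/decode
// against a stream they may freely consume to EOF, and the base class
// handles the length prefix so values nest safely on a shared stream.
abstract class DelimitedCoder<T> {
    abstract void encodeBody(T value, OutputStream out) throws IOException;

    // 'in' only spans this element; reading to EOF is safe here.
    abstract T decodeBody(InputStream in) throws IOException;

    final void encode(T value, OutputStream out) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        encodeBody(value, buf);                      // buffer to learn the size
        new DataOutputStream(out).writeInt(buf.size());
        buf.writeTo(out);
    }

    final T decode(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        byte[] body = new byte[data.readInt()];
        data.readFully(body);
        // Hand the subclass a stream that ends exactly at this element.
        return decodeBody(new ByteArrayInputStream(body));
    }
}

// Example subclass written in the naive "my data ends at EOF" style,
// which the base class makes safe for nesting.
class BytesCoder extends DelimitedCoder<byte[]> {
    @Override void encodeBody(byte[] value, OutputStream out) throws IOException {
        out.write(value);
    }

    @Override byte[] decodeBody(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) buf.write(b);
        return buf.toByteArray();
    }
}
```

With this shape, two values encoded back to back on one stream decode cleanly, even though the subclass itself never records a length.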
>> >> Long story short, I add a
>> >> SkipCloseCoder which can decorate a coder and just wraps the stream
>> >> (input or output) in flavors that skip close() calls. This avoids
>> >> doing it by default (which was my preference if you read the related
>> >> thread, but not everybody's), but also makes using a coder with this
>> >> issue easy, since the of() of the coder just wraps itself in this
>> >> delegating coder.
>> >>
>> >> 2. This one is nastier and mainly concerns IterableLikeCoders. These
>> >> use this kind of algorithm (keep in mind they work on a list):
>> >>
>> >> writeSize()
>> >> for all elements e {
>> >>   elementCoder.write(e)
>> >> }
>> >> writeMagicNumber() // this one depends on the size
>> >>
>> >> The decoding is symmetric, so I skip it here.
>> >>
>> >> Indeed, all these writes (reads) are done on the same stream.
>> >> Therefore it assumes you read as many bytes as you write... which is
>> >> a huge assumption for a coder which should, by contract, assume it
>> >> can read the stream... as a stream (until -1).
>> >>
>> >> The idea of the fix is to change this encoding to this kind of
>> >> algorithm:
>> >>
>> >> writeSize()
>> >> for all elements e {
>> >>   writeElementByteCount(e)
>> >>   elementCoder.write(e)
>> >> }
>> >> writeMagicNumber() // still optional
>> >
>> > Regardless of the backwards incompatibility issues, I'm unconvinced
>> > that prefixing every element with its length is a good idea. It can
>> > lead to a blow-up in size (e.g. a list of ints; it should be noted
>> > that containers with lots of elements bias towards having small
>> > elements), and writeElementByteCount(e) could also be very
>> > inefficient for many types e (e.g. a list of lists).
>> >
>> > What is your proposal then, Robert? The current restriction is
>> > clearly a blocker for portability, users, and determinism, and is
>> > unsafe and only checkable at runtime, so not something we should
>> > seek to keep.
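The size blow-up Robert objects to can be made concrete. The sketch below is illustrative only (the helper names are not Beam code, and it uses a fixed 4-byte count where the PR might use a varint): it encodes a 1000-element list of single-byte values under the current scheme and under the proposed per-element-count scheme.

```java
import java.io.*;
import java.util.*;

public class ContainerEncodingSketch {
    // Current scheme: size, then elements back to back.
    static byte[] encodePlain(List<Byte> values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(values.size());               // writeSize()
        for (byte v : values) out.writeByte(v);    // elementCoder.write(e)
        return bytes.toByteArray();
    }

    // Proposed scheme: each element additionally prefixed with its byte count.
    static byte[] encodeCounted(List<Byte> values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(values.size());               // writeSize()
        for (byte v : values) {
            out.writeInt(1);                       // writeElementByteCount(e)
            out.writeByte(v);                      // elementCoder.write(e)
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        List<Byte> values = new ArrayList<>();
        for (int i = 0; i < 1000; i++) values.add((byte) i);
        System.out.println(encodePlain(values).length);   // 4 + 1000 = 1004
        System.out.println(encodeCounted(values).length); // 4 + 5 * 1000 = 5004
    }
}
```

Here the counted encoding is 5x larger; a varint count would shrink the prefix to one byte per element but still double the size for single-byte elements, which is the "containers bias towards small elements" concern.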
>> > Alternative I thought about was to forbid implicit coders, but it
>> > doesn't help users.
>> >
>> >> This way, on the decode side, you can wrap the stream per element to
>> >> enforce the limitation of the byte count.
>> >>
>> >> Side note: this indeed enforces a limitation due to Java's byte
>> >> limitation, but if you check the coder code it is already there at
>> >> the higher level, so it is not a big deal for now.
>> >>
>> >> In terms of implementation, it uses a LengthAwareCoder which
>> >> delegates the encoding to another coder and just adds the byte count
>> >> before the actual serialization. Not perfect, but it should be more
>> >> than enough in terms of support and perf for Beam if you think of
>> >> real pipelines (we try to avoid serializations, or they are done at
>> >> some well-known points where this algorithm should be enough...
>> >> worst case it is not a huge overhead, mainly just some memory
>> >> overhead).
>> >>
>> >> The PR is available at https://github.com/apache/beam/pull/4594. If
>> >> you check it you will see I marked it "WIP". The main reason is that
>> >> it changes the encoding format for containers (lists, iterables,
>> >> ...) and therefore breaks the python/go/... tests and the
>> >> standard_coders.yml definition. Some help on that would be very
>> >> welcome.
>> >>
>> >> Technical side note, if you wonder: UnownedInputStream doesn't even
>> >> allow marking the stream, so there is no real fast way to read the
>> >> stream as fast as possible with standard buffering strategies and to
>> >> support this automatic IterableCoder wrapping, which is implicit. In
>> >> other words, if Beam wants to support any coder, including the ones
>> >> not requiring writing the size of the output - most codecs - then we
>> >> need to change the way it works to something like this, which does
>> >> it for the user, who doesn't know their coder got wrapped.
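The "wrap the stream per element" idea above amounts to a bounded view over the shared stream: the element coder sees end-of-stream after its declared byte count, so its read-until-EOF logic stays within bounds. A minimal sketch (the class name is hypothetical, not the PR's actual code):

```java
import java.io.*;

// An InputStream view that exposes at most 'remaining' bytes of the
// underlying stream, then reports end-of-stream, leaving the rest of the
// shared stream untouched for the next element.
class BoundedInputStream extends InputStream {
    private final InputStream delegate;
    private long remaining;

    BoundedInputStream(InputStream delegate, long limit) {
        this.delegate = delegate;
        this.remaining = limit;
    }

    @Override public int read() throws IOException {
        if (remaining <= 0) return -1;
        int b = delegate.read();
        if (b != -1) remaining--;
        return b;
    }

    @Override public int read(byte[] buf, int off, int len) throws IOException {
        if (remaining <= 0) return -1;
        int n = delegate.read(buf, off, (int) Math.min(len, remaining));
        if (n > 0) remaining -= n;
        return n;
    }

    @Override public void close() {
        // Deliberately does not close the shared stream; this mirrors the
        // concern behind UnownedInputStream / SkipCloseCoder in the thread.
    }
}
```

Given a shared stream of {1, 2, 3, 4, 5} and a recorded byte count of 3 for the first element, a read-to-EOF decode through the bounded view consumes exactly 3 bytes and leaves {4, 5} for the next element.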
>> >>
>> >> Hope it makes sense; if not, don't hesitate to ask questions.
>> >>
>> >> Happy end of weekend.
>> >>
>> >> Romain Manni-Bucau
>> >> @rmannibucau | Blog | Old Blog | Github | LinkedIn | Book