Could you describe the 2nd issue in a bit more detail, maybe with a short example? LengthAwareCoder in the PR adds another buffer copy (BufferedElementCountingOutputStream already has an extra buffer copy).
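To make the buffer-copy concern concrete, here is a rough sketch of what any length-prefixing wrapper has to do per element (illustrative names only, not the actual code from the PR, and using a plain 4-byte int prefix just for the sketch): because the byte count must be written before the element's bytes, the element has to be encoded into a temporary in-memory buffer first.

// Illustrative sketch only - not the PR's LengthAwareCoder. It shows why a
// length prefix forces an extra copy: the element must be encoded into a
// temporary buffer first, since its byte count has to be written before
// its bytes.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.coders.CustomCoder;

public class LengthPrefixingCoderSketch<T> extends CustomCoder<T> {
  private final Coder<T> delegate;

  public LengthPrefixingCoderSketch(Coder<T> delegate) {
    this.delegate = delegate;
  }

  @Override
  public void encode(T value, OutputStream outStream) throws IOException {
    // Extra copy #1: encode into an in-memory buffer just to learn the size.
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    delegate.encode(value, buffer);
    byte[] bytes = buffer.toByteArray();   // extra copy #2: buffer -> byte[]
    DataOutputStream out = new DataOutputStream(outStream);
    out.writeInt(bytes.length);            // the byte-count prefix
    out.write(bytes);                      // only now hit the real stream
    out.flush();
  }

  @Override
  public T decode(InputStream inStream) throws IOException {
    // The decode side mirrors it: read exactly `length` bytes so the delegate
    // can read "until -1" without consuming the next element's bytes.
    DataInputStream in = new DataInputStream(inStream);
    int length = in.readInt();
    byte[] bytes = new byte[length];
    in.readFully(bytes);
    return delegate.decode(new ByteArrayInputStream(bytes));
  }
}

On top of whatever BufferedElementCountingOutputStream already buffers, that per-element round trip through a byte[] is the additional copy I mean. A similar sketch for the close-skipping decoration (issue 1) is at the bottom, after the quoted thread.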
On Mon, Feb 5, 2018 at 10:34 AM, Romain Manni-Bucau <[email protected]> wrote:

> Would this work for everyone - I can update the PR if so:
>
> If coder is not built in
>     Prefix with byte size
> Else
>     Current behavior
>
> ?
>
> On Feb 5, 2018 19:21, "Romain Manni-Bucau" <[email protected]> wrote:
>
>> Answered inline, but I want to highlight that Beam is a portable API on top of well-known vendor APIs which have friendly shortcuts. So the background here is to make Beam at least user friendly.
>>
>> I'm fine if the outcome of the discussion is that the coder concept is wrong or something like that, but I'm not fine with saying we don't want to solve an API issue, not to say a bug, of a project whose added value is its API.
>>
>> I understand the perf concern, which must be balanced against the fact that deserialization is not used for each step/transform, and that the coder API is currently already intrusive and heavy for devs, yet not usable by most existing codecs out there. Even some JAXB or plain XML flavors don't work with it :(.
>>
>> On Feb 5, 2018 18:46, "Robert Bradshaw" <[email protected]> wrote:
>>
>> On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau <[email protected]> wrote:
>> > Hi guys,
>> >
>> > I submitted a PR on coders to enhance 1. the user experience and 2. the determinism and handling of coders.
>> >
>> > 1. The user experience part is linked to what I sent some days ago: close handling of the streams from coder code. Long story short, I add a SkipCloseCoder which can decorate a coder and just wraps the stream (input or output) in flavors skipping close() calls. This avoids doing it by default (which had my preference if you read the related thread, but not everybody's), and it also makes using a coder with this issue easy, since the coder's of() just wraps itself in this delegating coder.
>> >
>> > 2. This one is nastier and mainly concerns IterableLikeCoders. These use this kind of algorithm (keep in mind they work on a list):
>> >
>> > writeSize()
>> > for all elements e {
>> >     elementCoder.write(e)
>> > }
>> > writeMagicNumber() // this one depends on the size
>> >
>> > The decoding is symmetric, so I skip it here.
>> >
>> > Indeed, all these writes (reads) are done on the same stream. Therefore it assumes you read as many bytes as you write... which is a huge assumption for a coder which should, by contract, be able to read the stream... as a stream (until -1).
>> >
>> > The idea of the fix is to change this encoding to this kind of algorithm:
>> >
>> > writeSize()
>> > for all elements e {
>> >     writeElementByteCount(e)
>> >     elementCoder.write(e)
>> > }
>> > writeMagicNumber() // still optional
>>
>> Regardless of the backwards incompatibility issues, I'm unconvinced that prefixing every element with its length is a good idea. It can lead to blow-up in size (e.g. a list of ints, and it should be noted that containers with lots of elements bias towards having small elements), and writeElementByteCount(e) could also be very inefficient for many types e (e.g. a list of lists).
>>
>> What is your proposal then, Robert? The current restriction is clearly a blocker for portability, users, and determinism, and it is unsafe and only checkable at runtime, so not something we should keep.
>>
>> The alternative I thought about was to forbid implicit coders, but that doesn't help users.
>> >
>> > This way, on the decode side, you can wrap the stream per element to enforce the limitation of the byte count.
>> >
>> > Side note: this indeed enforces a limitation due to the Java byte-count limitation, but if you check the coder code it is already there at the higher level, so it is not a big deal for now.
>> >
>> > In terms of implementation, it uses a LengthAwareCoder which delegates the encoding to another coder and just adds the byte count before the actual serialization. Not perfect, but it should be more than enough in terms of support and perf for Beam if you think of real pipelines (we try to avoid serializations, or they are done at some well-known points where this algorithm should be enough... worst case it is not a huge overhead, mainly just some memory overhead).
>> >
>> > The PR is available at https://github.com/apache/beam/pull/4594. If you check it you will see I marked it "WIP". The main reason is that it changes the encoding format for containers (lists, iterables, ...) and therefore breaks the python/go/... tests and the standard_coders.yml definition. Some help on that would be very welcome.
>> >
>> > Technical side note if you wonder: UnownedInputStream doesn't even allow marking the stream, so there is no real way to read the stream as fast as possible with standard buffering strategies while supporting this automatic, implicit IterableCoder wrapping. In other words, if Beam wants to support any coder, including the ones that don't need to write the size of their output - most codecs - then we need to change the way it works to something like this, which does it for the user, who doesn't know their coder got wrapped.
>> >
>> > Hope it makes sense; if not, don't hesitate to ask questions.
>> >
>> > Happy end of weekend.
>> >
>> > Romain Manni-Bucau
>> > @rmannibucau | Blog | Old Blog | Github | LinkedIn | Book
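Coming back to issue 1 from the quoted thread, the close-skipping decoration could look roughly like the sketch below (again illustrative only, not the PR's SkipCloseCoder): the streams handed to the delegate simply turn close() into a no-op, so a codec that insists on closing its stream - the JAXB/XML cases mentioned above - does not close the runner-owned one.

// Illustrative sketch only - not the PR's SkipCloseCoder. The delegate coder
// may call close() on "its" stream; the wrappers below swallow that call so
// the runner-owned stream stays open.
import java.io.FilterInputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.coders.CustomCoder;

public class SkipCloseCoderSketch<T> extends CustomCoder<T> {
  private final Coder<T> delegate;

  public SkipCloseCoderSketch(Coder<T> delegate) {
    this.delegate = delegate;
  }

  @Override
  public void encode(T value, OutputStream outStream) throws IOException {
    delegate.encode(value, new FilterOutputStream(outStream) {
      @Override
      public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len); // forward bulk writes; FilterOutputStream's default is byte-by-byte
      }

      @Override
      public void close() throws IOException {
        flush(); // swallow close(): the underlying stream is not ours to close
      }
    });
  }

  @Override
  public T decode(InputStream inStream) throws IOException {
    return delegate.decode(new FilterInputStream(inStream) {
      @Override
      public void close() {
        // no-op: leave the runner-owned stream open
      }
    });
  }
}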
