Does it mean we would change the implicit resolution? Do you see it being backward compatible? If so, it sounds like a good solution.
On Feb 5, 2018 at 20:36, "Kenneth Knowles" <k...@google.com> wrote:
> TL;DR: creating _new_ coders is not a problem. If you have a new idea for an
> encoding, you can build it alongside and users can use it. We also need
> data migration, and this is probably the easy way to be ready for that.
>
> We made a pretty big mistake in our naming of ListCoder, SetCoder, and
> IterableLikeCoder, because they make users think it is the
> only/best/canonical encoding. We did it right with e.g. VarLongCoder and
> BigEndianLongCoder. There is a default, but it is just a default.
>
> We actually already need "SetIterableLikeCoder" (aka SetCoder) and perhaps
> "LexicallySortedBytesSetCoder" so we can change coder inference to ask for
> a deterministic coder when it is needed, instead of first asking for "any"
> coder and then crashing when we get the wrong type.
>
> Kenn
>
> On Mon, Feb 5, 2018 at 11:00 AM, Robert Bradshaw <rober...@google.com>
> wrote:
>
>> Just to clarify, the issue is that for some types (a byte array being
>> the simplest) one needs to know the length of the data in order to
>> decode it from the stream. In particular, the claim is that many
>> libraries out there that do encoding/decoding assume they can gather
>> this information from the end of the stream, and so don't explicitly
>> record it. For nested values, someone needs to record these lengths.
>> Note that in the Fn API, nearly everything is nested, as the elements
>> are sent as a large byte stream of concatenated encoded elements.
>>
>> Your proposed solution is to require all container coders (though I
>> think your PR only considers IterableLikeCoder; there are others, and
>> there's the Elements proto itself) to prefix element encodings with
>> sizes so it can give truncated streams on decoding.
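Robert's point above can be illustrated without Beam at all. The sketch below is a hypothetical stand-alone example (none of these method names are real Beam APIs): a decoder that reads until end-of-stream works when it owns the whole stream, but loses element boundaries once elements are concatenated, whereas a length-prefixed variant does not.

```java
import java.io.*;

public class NestedDecodingSketch {
    // The style many third-party codecs use: consume the whole stream on
    // decode, assuming "my data ends at EOF".
    static byte[] readAllUntilEof(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            buf.write(b);
        }
        return buf.toByteArray();
    }

    // Length-prefixed variant: records how many bytes belong to this
    // element, so later elements on the same stream stay untouched.
    static void writePrefixed(byte[] value, DataOutputStream out) throws IOException {
        out.writeInt(value.length);
        out.write(value);
    }

    static byte[] readPrefixed(DataInputStream in) throws IOException {
        byte[] value = new byte[in.readInt()];
        in.readFully(value);
        return value;
    }

    public static void main(String[] args) throws IOException {
        // Two elements concatenated on one stream, as in the Fn API.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        writePrefixed(new byte[]{1, 2, 3}, out);
        writePrefixed(new byte[]{4, 5}, out);

        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        System.out.println(readPrefixed(in).length); // 3: first element only
        System.out.println(readPrefixed(in).length); // 2: second element intact

        // Without the prefix, a read-to-EOF decode swallows both elements:
        InputStream raw = new ByteArrayInputStream(new byte[]{1, 2, 3, 4, 5});
        System.out.println(readAllUntilEof(raw).length); // 5: boundary lost
    }
}
```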
>> I think this
>> places an undue burden on (and code redundancy in) container coders, and
>> disallows optimization for those coders that don't need to be length
>> prefixed (and note that *prefixing* with length is not the only way to
>> delimit a stream; we shouldn't impose that restriction either).
>> Instead, I'd keep things the way they are, but offer a new Coder
>> subclass that users can subclass if they want to write an "easy" Coder
>> that does the delimiting for them (on encode and decode). We would
>> point users to this as a good option for writing custom coders in the
>> easiest way possible, and it keeps the current Coder API the same.
>>
>> On Mon, Feb 5, 2018 at 10:21 AM, Romain Manni-Bucau
>> <rmannibu...@gmail.com> wrote:
>> > Answered inline, but I want to highlight that Beam is a portable API
>> > on top of well-known vendor APIs which have friendly shortcuts. So the
>> > background here is to make Beam at least user friendly.
>> >
>> > I'm fine if the outcome of the discussion is that the coder concept is
>> > wrong or something like that, but I'm not fine with saying we don't
>> > want to solve an API issue, not to say a bug, in a project whose API
>> > is its added value.
>> >
>> > I understand the perf concern, which must be balanced against the fact
>> > that deserialization is not used for each step/transform, and that the
>> > coder API is already intrusive and heavy for devs, but also not usable
>> > by most existing codecs out there. Even some JAXB or plain XML flavors
>> > don't work with it :(.
>> >
>> > On Feb 5, 2018 at 18:46, "Robert Bradshaw" <rober...@google.com>
>> > wrote:
>> >
>> > On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau
>> > <rmannibu...@gmail.com> wrote:
>> >> Hi guys,
>> >>
>> >> I submitted a PR on coders to enhance 1. the user experience, 2. the
>> >> determinism and handling of coders.
>> >>
>> >> 1. The user experience part is linked to what I sent some days ago:
>> >> close handling of the streams from coder code.
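The "easy Coder subclass" Robert suggests could look roughly like the following. This is a hedged sketch, not Beam code: the names DelimitedCoder and BytesCoder are hypothetical, and the real Coder API has more methods. The point is only the shape: subclasses get a stream that spans exactly one element, so read-until-EOF logic becomes safe, and the base class owns the delimiting.

```java
import java.io.*;

// Sketch of a delimiting base class: subclasses implement encode/decode
// against a stream they may freely consume to EOF, and the base class
// handles the length prefix so values nest safely on a shared stream.
abstract class DelimitedCoder<T> {
    abstract void encodeBody(T value, OutputStream out) throws IOException;

    // 'in' only spans this element; reading to EOF is safe here.
    abstract T decodeBody(InputStream in) throws IOException;

    final void encode(T value, OutputStream out) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        encodeBody(value, buf);                      // buffer to learn the size
        new DataOutputStream(out).writeInt(buf.size());
        buf.writeTo(out);
    }

    final T decode(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        byte[] body = new byte[data.readInt()];
        data.readFully(body);
        // Hand the subclass a stream that ends exactly at this element.
        return decodeBody(new ByteArrayInputStream(body));
    }
}

// Example subclass written in the naive "my data ends at EOF" style,
// which the base class makes safe for nesting.
class BytesCoder extends DelimitedCoder<byte[]> {
    @Override void encodeBody(byte[] value, OutputStream out) throws IOException {
        out.write(value);
    }

    @Override byte[] decodeBody(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) buf.write(b);
        return buf.toByteArray();
    }
}
```

With this shape, two values encoded back to back on one stream decode cleanly, even though the subclass itself never records a length.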
>> >> Long story short, I add a
>> >> SkipCloseCoder which can decorate a coder and just wraps the stream
>> >> (input or output) in flavors that skip close() calls. This avoids
>> >> doing it by default (which was my preference if you read the related
>> >> thread, but not everybody's), but also makes using a coder with this
>> >> issue easy, since the of() of the coder just wraps itself in this
>> >> delegating coder.
>> >>
>> >> 2. This one is nastier and mainly concerns IterableLikeCoders. These
>> >> use this kind of algorithm (keep in mind they work on a list):
>> >>
>> >> writeSize()
>> >> for all elements e {
>> >>   elementCoder.write(e)
>> >> }
>> >> writeMagicNumber() // this one depends on the size
>> >>
>> >> The decoding is symmetric, so I skip it here.
>> >>
>> >> Indeed, all these writes (reads) are done on the same stream.
>> >> Therefore it assumes you read as many bytes as you write... which is
>> >> a huge assumption for a coder which should, by contract, assume it
>> >> can read the stream... as a stream (until -1).
>> >>
>> >> The idea of the fix is to change this encoding to this kind of
>> >> algorithm:
>> >>
>> >> writeSize()
>> >> for all elements e {
>> >>   writeElementByteCount(e)
>> >>   elementCoder.write(e)
>> >> }
>> >> writeMagicNumber() // still optional
>> >
>> > Regardless of the backwards incompatibility issues, I'm unconvinced
>> > that prefixing every element with its length is a good idea. It can
>> > lead to a blow-up in size (e.g. a list of ints; it should be noted
>> > that containers with lots of elements bias towards having small
>> > elements), and writeElementByteCount(e) could also be very
>> > inefficient for many types e (e.g. a list of lists).
>> >
>> > What is your proposal then, Robert? The current restriction is
>> > clearly a blocker for portability, users, and determinism, and is
>> > unsafe and only checkable at runtime, so not something we should
>> > seek to keep.
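The size blow-up Robert objects to can be made concrete. The sketch below is illustrative only (the helper names are not Beam code, and it uses a fixed 4-byte count where the PR might use a varint): it encodes a 1000-element list of single-byte values under the current scheme and under the proposed per-element-count scheme.

```java
import java.io.*;
import java.util.*;

public class ContainerEncodingSketch {
    // Current scheme: size, then elements back to back.
    static byte[] encodePlain(List<Byte> values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(values.size());               // writeSize()
        for (byte v : values) out.writeByte(v);    // elementCoder.write(e)
        return bytes.toByteArray();
    }

    // Proposed scheme: each element additionally prefixed with its byte count.
    static byte[] encodeCounted(List<Byte> values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(values.size());               // writeSize()
        for (byte v : values) {
            out.writeInt(1);                       // writeElementByteCount(e)
            out.writeByte(v);                      // elementCoder.write(e)
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        List<Byte> values = new ArrayList<>();
        for (int i = 0; i < 1000; i++) values.add((byte) i);
        System.out.println(encodePlain(values).length);   // 4 + 1000 = 1004
        System.out.println(encodeCounted(values).length); // 4 + 5 * 1000 = 5004
    }
}
```

Here the counted encoding is 5x larger; a varint count would shrink the prefix to one byte per element but still double the size for single-byte elements, which is the "containers bias towards small elements" concern.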
>> > Alternative I thought about was to forbid implicit coders, but it
>> > doesn't help users.
>> >
>> >> This way, on the decode side, you can wrap the stream per element to
>> >> enforce the limitation of the byte count.
>> >>
>> >> Side note: this indeed enforces a limitation due to Java's byte
>> >> limitation, but if you check the coder code it is already there at
>> >> the higher level, so it is not a big deal for now.
>> >>
>> >> In terms of implementation, it uses a LengthAwareCoder which
>> >> delegates the encoding to another coder and just adds the byte count
>> >> before the actual serialization. Not perfect, but it should be more
>> >> than enough in terms of support and perf for Beam if you think of
>> >> real pipelines (we try to avoid serializations, or they are done at
>> >> some well-known points where this algorithm should be enough...
>> >> worst case it is not a huge overhead, mainly just some memory
>> >> overhead).
>> >>
>> >> The PR is available at https://github.com/apache/beam/pull/4594. If
>> >> you check it you will see I marked it "WIP". The main reason is that
>> >> it changes the encoding format for containers (lists, iterables,
>> >> ...) and therefore breaks the python/go/... tests and the
>> >> standard_coders.yml definition. Some help on that would be very
>> >> welcome.
>> >>
>> >> Technical side note, if you wonder: UnownedInputStream doesn't even
>> >> allow marking the stream, so there is no real fast way to read the
>> >> stream as fast as possible with standard buffering strategies and to
>> >> support this automatic IterableCoder wrapping, which is implicit. In
>> >> other words, if Beam wants to support any coder, including the ones
>> >> not requiring writing the size of the output - most codecs - then we
>> >> need to change the way it works to something like this, which does
>> >> it for the user, who doesn't know their coder got wrapped.
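The "wrap the stream per element" idea above amounts to a bounded view over the shared stream: the element coder sees end-of-stream after its declared byte count, so its read-until-EOF logic stays within bounds. A minimal sketch (the class name is hypothetical, not the PR's actual code):

```java
import java.io.*;

// An InputStream view that exposes at most 'remaining' bytes of the
// underlying stream, then reports end-of-stream, leaving the rest of the
// shared stream untouched for the next element.
class BoundedInputStream extends InputStream {
    private final InputStream delegate;
    private long remaining;

    BoundedInputStream(InputStream delegate, long limit) {
        this.delegate = delegate;
        this.remaining = limit;
    }

    @Override public int read() throws IOException {
        if (remaining <= 0) return -1;
        int b = delegate.read();
        if (b != -1) remaining--;
        return b;
    }

    @Override public int read(byte[] buf, int off, int len) throws IOException {
        if (remaining <= 0) return -1;
        int n = delegate.read(buf, off, (int) Math.min(len, remaining));
        if (n > 0) remaining -= n;
        return n;
    }

    @Override public void close() {
        // Deliberately does not close the shared stream; this mirrors the
        // concern behind UnownedInputStream / SkipCloseCoder in the thread.
    }
}
```

Given a shared stream of {1, 2, 3, 4, 5} and a recorded byte count of 3 for the first element, a read-to-EOF decode through the bounded view consumes exactly 3 bytes and leaves {4, 5} for the next element.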
>> >>
>> >> Hope it makes sense; if not, don't hesitate to ask questions.
>> >>
>> >> Happy end of weekend.
>> >>
>> >> Romain Manni-Bucau
>> >> @rmannibucau | Blog | Old Blog | Github | LinkedIn | Book