Re: coder evolutions?

Jean-Baptiste Onofré Sun, 04 Feb 2018 11:41:49 -0800

I would prefer to use Jira, with "wish"/"ideas", and adding Beam 3.0.0 version.


WDYT ?

Regards
JB

On 02/04/2018 07:55 PM, Reuven Lax wrote:
> Do we have a good place to track the items for Beam 3.0, or is Jira the best
> place? Romain has a good point - if this gets forgotten when we do Beam 3.0,
> then we're stuck waiting around till Beam 4.0.
> 
> Reuven
> 
> On Sun, Feb 4, 2018 at 9:27 AM, Jean-Baptiste Onofré <[email protected]
> <mailto:[email protected]>> wrote:
> 
>     That's a good point. In the roadmap for Beam 3, I think it makes sense to 
> add a
>     point about this.
> 
>     Regards
>     JB
> 
>     On 02/04/2018 06:18 PM, Eugene Kirpichov wrote:
>     > I think doing a change that would break pipeline update for every 
> single user of
>     > Flink and Dataflow needs to be postponed until a next major version. 
> Pipeline
>     > update is a very frequently used feature, especially by the largest 
> users. We've
>     > had those users get significantly upset even when we accidentally broke 
> update
>     > compatibility for some special cases of individual transforms; breaking 
> it
>     > intentionally and project-wide is too extreme to be justified by the 
> benefits of
>     > the current change.
>     >
>     > That said, I think concerns about coder APIs are reasonable, and it is
>     > unfortunate that we effectively can't make changes to them right now. 
> It would
>     > be great if in the next major version we were better prepared for 
> evolution of
>     > coders, e.g. by having coders support a version marker or something 
> like that,
>     > with an API for detecting the version of data on wire and reading or 
> writing
>     > data of an old version. Such a change (introducing versioning) would 
> also, of
>     > course, be incompatible and would need to be postponed until a major 
> version -
>     > but, at least, subsequent changes wouldn't.
>     >
>     > ...And as I was typing this email, seems that this is what the thread 
> already
>     > came to!
>     >
>     > On Sun, Feb 4, 2018 at 9:16 AM Romain Manni-Bucau 
> <[email protected] <mailto:[email protected]>
>     > <mailto:[email protected] <mailto:[email protected]>>> wrote:
>     >
>     >     I like this idea of migration support at coder level. It would 
> require to
>     >     add a metadata in all outputs which would represent the version 
> then coders
>     >     can handle the logic properly depending the version - we can assume 
> a coder
>     >     dev upgrade the version when he breaks the representation I hope ;).
>     >     With this: no runner impact at all :).
>     >
>     >
>     >     Romain Manni-Bucau
>     >     @rmannibucau <https://twitter.com/rmannibucau
>     <https://twitter.com/rmannibucau>> |  Blog
>     >     <https://rmannibucau.metawerx.net/
>     <https://rmannibucau.metawerx.net/>> | Old Blog
>     >     <http://rmannibucau.wordpress.com 
> <http://rmannibucau.wordpress.com>>
>     | Github
>     >     <https://github.com/rmannibucau <https://github.com/rmannibucau>> |
>     LinkedIn
>     >     <https://www.linkedin.com/in/rmannibucau
>     <https://www.linkedin.com/in/rmannibucau>> | Book
>     >   
>      
> <https://www.packtpub.com/application-development/java-ee-8-high-performance 
> <https://www.packtpub.com/application-development/java-ee-8-high-performance>>
>     >
>     >     2018-02-04 18:09 GMT+01:00 Reuven Lax <[email protected] 
> <mailto:[email protected]>
>     >     <mailto:[email protected] <mailto:[email protected]>>>:
>     >
>     >         It would already break quite a number of users at this point.
>     >
>     >         I think what we should be doing is moving forward on the 
> snapshot/update
>     >         proposal. That proposal actually provides a way forward when 
> coders
>     >         change (it proposes a way to map an old snapshot to one using 
> the new
>     >         coder, so changes to coders in the future will be much easier 
> to make.
>     >         However much of the implementation for this will likely be at 
> the runner
>     >         level, not the SDK level.
>     >
>     >         Reuven
>     >
>     >         On Sun, Feb 4, 2018 at 9:04 AM, Romain Manni-Bucau
>     >         <[email protected] <mailto:[email protected]>
>     <mailto:[email protected] <mailto:[email protected]>>> wrote:
>     >
>     >             I fully understand that, and this is one of the reason 
> managing to
>     >             solve these issues is very important and ASAP. My 
> conclusion is that
>     >             we must break it now to avoid to do it later when usage 
> will be way
>     >             more developped - I would be very happy to be wrong on that 
> point -
>     >             so I started this PR and this thread. We can postpone it 
> but it
>     >             would break later so for probably more users.
>     >
>     >
>     >             Romain Manni-Bucau
>     >             @rmannibucau <https://twitter.com/rmannibucau
>     <https://twitter.com/rmannibucau>> |  Blog
>     >             <https://rmannibucau.metawerx.net/
>     <https://rmannibucau.metawerx.net/>> | Old Blog
>     >             <http://rmannibucau.wordpress.com
>     <http://rmannibucau.wordpress.com>> | Github
>     >             <https://github.com/rmannibucau
>     <https://github.com/rmannibucau>> | LinkedIn
>     >             <https://www.linkedin.com/in/rmannibucau
>     <https://www.linkedin.com/in/rmannibucau>> | Book
>     >           
>      
> <https://www.packtpub.com/application-development/java-ee-8-high-performance 
> <https://www.packtpub.com/application-development/java-ee-8-high-performance>>
>     >
>     >             2018-02-04 17:49 GMT+01:00 Reuven Lax <[email protected] 
> <mailto:[email protected]>
>     >             <mailto:[email protected] <mailto:[email protected]>>>:
>     >
>     >                 Unfortunately several runners (at least Flink and 
> Dataflow)
>     >                 support in-place update of streaming pipelines as a key 
> feature,
>     >                 and changing coder format breaks this. This is a very 
> important
>     >                 feature of both runners, and we should endeavor not to 
> break them.
>     >
>     >                 In-place snapshot and update is also a top-level Beam 
> proposal
>     >                 that was received positively, though neither of those 
> runners
>     >                 yet implement the proposed interface.
>     >
>     >                 Reuven
>     >
>     >                 On Sun, Feb 4, 2018 at 8:44 AM, Romain Manni-Bucau
>     >                 <[email protected] <mailto:[email protected]>
>     <mailto:[email protected] <mailto:[email protected]>>> wrote:
>     >
>     >                     Sadly yes, and why the PR is actually WIP. As 
> mentionned it
>     >                     modifies it and requires some updates in other 
> languages and
>     >                     the standard_coders.yml file (I didn't find how 
> this file
>     >                     was generated).
>     >                     Since coders must be about volatile data I don't 
> think it is
>     >                     a big deal to change it though.
>     >
>     >
>     >                     Romain Manni-Bucau
>     >                     @rmannibucau <https://twitter.com/rmannibucau
>     <https://twitter.com/rmannibucau>> |  Blog
>     >                     <https://rmannibucau.metawerx.net/
>     <https://rmannibucau.metawerx.net/>> | Old Blog
>     >                     <http://rmannibucau.wordpress.com
>     <http://rmannibucau.wordpress.com>> | Github
>     >                     <https://github.com/rmannibucau
>     <https://github.com/rmannibucau>> | LinkedIn
>     >                     <https://www.linkedin.com/in/rmannibucau
>     <https://www.linkedin.com/in/rmannibucau>> | Book
>     >                   
>      
> <https://www.packtpub.com/application-development/java-ee-8-high-performance 
> <https://www.packtpub.com/application-development/java-ee-8-high-performance>>
>     >
>     >                     2018-02-04 17:34 GMT+01:00 Reuven Lax 
> <[email protected] <mailto:[email protected]>
>     >                     <mailto:[email protected] 
> <mailto:[email protected]>>>:
>     >
>     >                         One question - does this change the actual byte 
> encoding
>     >                         of elements? We've tried hard not to do that so 
> far for
>     >                         reasons of compatibility.
>     >
>     >                         Reuven
>     >
>     >                         On Sun, Feb 4, 2018 at 6:44 AM, Romain 
> Manni-Bucau
>     >                         <[email protected]
>     <mailto:[email protected]> <mailto:[email protected]
>     <mailto:[email protected]>>>
>     >                         wrote:
>     >
>     >                             Hi guys,
>     >
>     >                             I submitted a PR on coders to enhance 1. 
> the user
>     >                             experience 2. the determinism and handling 
> of
>     coders.
>     >
>     >                             1. the user experience is linked to what i
>     sent some
>     >                             days ago: close handling of the streams 
> from a
>     coder
>     >                             code. Long story short I add a 
> SkipCloseCoder
>     which
>     >                             can decorate a coder and just wraps the 
> stream
>     >                             (input or output) in flavors skipping 
> close()
>     calls.
>     >                             This avoids to do it by default (which had 
> my
>     >                             preference if you read the related thread 
> but not
>     >                             the one of everybody) but also makes the 
> usage
>     of a
>     >                             coder with this issue easy since the of() 
> of the
>     >                             coder just wraps itself in this delagating 
> coder.
>     >
>     >                             2. this one is more nasty and mainly 
> concerns
>     >                             IterableLikeCoders. These ones use this 
> kind of
>     >                             algorithm (keep in mind they work on a 
> list):
>     >
>     >                             writeSize()
>     >                             for all element e {
>     >                                 elementCoder.write(e)
>     >                             }
>     >                             writeMagicNumber() // this one depends the 
> size
>     >
>     >                             The decoding is symmetric so I bypass it 
> here.
>     >
>     >                             Indeed all these writes (reads) are done on
>     the same
>     >                             stream. Therefore it assumes you read as 
> much
>     bytes
>     >                             than you write...which is a huge assumption 
> for a
>     >                             coder which should by contract assume it 
> can read
>     >                             the stream...as a stream (until -1).
>     >
>     >                             The idea of the fix is to change this 
> encoding to
>     >                             this kind of algorithm:
>     >
>     >                             writeSize()
>     >                             for all element e {
>     >                                 writeElementByteCount(e)
>     >                                 elementCoder.write(e)
>     >                             }
>     >                             writeMagicNumber() // still optionally
>     >
>     >                             This way on the decode size you can wrap the
>     stream
>     >                             by element to enforce the limitation of the
>     byte count.
>     >
>     >                             Side note: this indeed enforce a limitation 
> due to
>     >                             java byte limitation but if you check coder
>     code it
>     >                             is already here at the higher level so it 
> is not a
>     >                             big deal for now.
>     >
>     >                             In terms of implementation it uses a
>     >                             LengthAwareCoder which delegates to another 
> coder
>     >                             the encoding and just adds the byte count
>     before the
>     >                             actual serialization. Not perfect but should
>     be more
>     >                             than enough in terms of support and perf for
>     beam if
>     >                             you think real pipelines (we try to avoid
>     >                             serializations or it is done on some well 
> known
>     >                             points where this algo should be
>     enough...worse case
>     >                             it is not a huge overhead, mainly just some 
> memory
>     >                             overhead).
>     >
>     >
>     >                             The PR is available
>     >                             at https://github.com/apache/beam/pull/4594
>     <https://github.com/apache/beam/pull/4594>. If you
>     >                             check you will see I put it "WIP". The main 
> reason
>     >                             is that it changes the encoding format for
>     >                             containers (lists, iterable, ...) and 
> therefore
>     >                             breaks python/go/... tests and the
>     >                             standard_coders.yml definition. Some help 
> on that
>     >                             would be very welcomed.
>     >
>     >                             Technical side note if you
>     >                             wonder: UnownedInputStream doesn't even 
> allow to
>     >                             mark the stream so there is no real fast way
>     to read
>     >                             the stream as fast as possible with standard
>     >                             buffering strategies and to support this 
> automatic
>     >                             IterableCoder wrapping which is implicit. 
> In other
>     >                             words, if beam wants to support any coder,
>     including
>     >                             the ones not requiring to write the size of 
> the
>     >                             output - most of the codecs - then we need 
> to
>     change
>     >                             the way it works to something like that 
> which does
>     >                             it for the user which doesn't know its 
> coder got
>     >                             wrapped.
>     >
>     >                             Hope it makes sense, if not, don't hesitate 
> to ask
>     >                             questions.
>     >
>     >                             Happy end of week-end.
>     >
>     >                             Romain Manni-Bucau
>     >                             @rmannibucau 
> <https://twitter.com/rmannibucau
>     <https://twitter.com/rmannibucau>> |
>     >                              Blog <https://rmannibucau.metawerx.net/
>     <https://rmannibucau.metawerx.net/>> | Old Blog
>     >                             <http://rmannibucau.wordpress.com
>     <http://rmannibucau.wordpress.com>> | Github
>     >                             <https://github.com/rmannibucau
>     <https://github.com/rmannibucau>> | LinkedIn
>     >                             <https://www.linkedin.com/in/rmannibucau
>     <https://www.linkedin.com/in/rmannibucau>> | Book
>     >                           
>      
> <https://www.packtpub.com/application-development/java-ee-8-high-performance 
> <https://www.packtpub.com/application-development/java-ee-8-high-performance>>
>     >
>     >
>     >
>     >
>     >
>     >
>     >
> 
>     --
>     Jean-Baptiste Onofré
>     [email protected] <mailto:[email protected]>
>     http://blog.nanthrax.net
>     Talend - http://www.talend.com
> 
> 

-- 
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: coder evolutions?

Reply via email to