Thanks for the references, Luke! I thought there might have been prior
discussions, so this thread could be a good place to consolidate them.
> Dataflow also has an update feature, but it's limited by the fact
> that Beam does not have a good concept of Coder evolution. As a
> result we try very hard never to change Coders,
Trying not to break Coders is a fair approach and could work fine for
Beam itself, if the Coders were designed really carefully. But what
about custom Coders users may have written? AvroCoder or ProtoCoder
would be good candidates for forward compatibility, but even these do
not have migration functionality built in.
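
To make that concrete, here is a minimal sketch of what built-in
migration could look like for a custom Coder. To be clear, this is not
an existing Beam feature; the version prefix, the User type, and the
fallback value for old payloads are all made up for illustration:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.beam.sdk.coders.CustomCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.coders.VarIntCoder;

// Minimal value type for the example.
class User {
  final String name;
  final int age;
  User(String name, int age) { this.name = name; this.age = age; }
}

// Hypothetical: a format-version prefix lets a newer pipeline decode
// state that was written by an older one. Beam has no such convention.
public class VersionedUserCoder extends CustomCoder<User> {
  private static final int CURRENT_VERSION = 2;

  @Override
  public void encode(User value, OutputStream out) throws IOException {
    VarIntCoder.of().encode(CURRENT_VERSION, out);
    StringUtf8Coder.of().encode(value.name, out);
    VarIntCoder.of().encode(value.age, out); // field added in version 2
  }

  @Override
  public User decode(InputStream in) throws IOException {
    int version = VarIntCoder.of().decode(in);
    String name = StringUtf8Coder.of().decode(in);
    // Migration path: version-1 payloads predate the age field.
    int age = version >= 2 ? VarIntCoder.of().decode(in) : -1;
    return new User(name, age);
  }
}

The catch, of course, is that adding the version prefix is itself a
breaking change to the encoding, which is exactly why retrofitting
evolution onto existing Coders is hard.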
Is schema evolution already part of SchemaCoder? It's definitely a
good candidate for evolution, because a schema provides a structured
view into the data a Coder encodes. How to actually perform the
evolution, though, still looks like an open question.
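
To sketch why schemas are a natural fit, using Beam's existing
Schema/Row API (the record layout and the manual re-projection below
are invented for the example; nothing in SchemaCoder does this
automatically today):

import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.Schema.FieldType;
import org.apache.beam.sdk.values.Row;

public class SchemaEvolutionSketch {
  public static void main(String[] args) {
    // Version 1 of a user record.
    Schema v1 = Schema.builder().addStringField("name").build();

    // Version 2 adds a field. Making it nullable gives a natural
    // evolution rule: data written under v1 simply reads as null here.
    Schema v2 = Schema.builder()
        .addStringField("name")
        .addNullableField("age", FieldType.INT32)
        .build();

    // A row checkpointed by the old pipeline version.
    Row oldRow = Row.withSchema(v1).addValue("alice").build();

    // Hypothetical migration step: re-project the old row onto the
    // new schema, filling the added field with null.
    Row migrated = Row.withSchema(v2)
        .addValue(oldRow.getString("name"))
        .addValue(null)
        .build();

    System.out.println(migrated);
  }
}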
-Max
On 09.05.19 18:56, Reuven Lax wrote:
Dataflow also has an update feature, but it's limited by the fact that
Beam does not have a good concept of Coder evolution. As a result we
try very hard never to change Coders, which sometimes makes
development of parts of Beam much more difficult. I think Beam would
benefit greatly from having a first-class concept of Coder evolution.
BTW for schemas there is a natural way of defining evolution that can be
handled by SchemaCoder.
On Wed, May 8, 2019 at 12:58 PM Lukasz Cwik <[email protected]> wrote:
There was a thread about coder update in the past here[1]. Also,
Reuven sent out a doc[2] about pipeline drain and update which was
discussed in this thread[3]. I believe there have been more
references to pipeline update in other threads when people tried to
change coder encodings in the past as well.
Reuven/Dan are the best contacts for how this works inside of Google,
the limitations, and other ideas that have been proposed.
1: https://lists.apache.org/thread.html/f3b2daa740075cc39dc04f08d1eaacfcc2d550cecc04e27be024cf52@%3Cdev.beam.apache.org%3E
2: https://docs.google.com/document/d/1NExwHlj-2q2WUGhSO4jTu8XGhDPmm3cllSN8IMmWci8/edit#
3: https://lists.apache.org/thread.html/37c9aa4aa5011801a7f060bf3f53e687e539fa6154cc9f1c544d4f7a@%3Cdev.beam.apache.org%3E
On Wed, May 8, 2019 at 11:45 AM Maximilian Michels <[email protected]> wrote:
Hi,
I'm looking into updating the Flink Runner to Flink version 1.8. Since
version 1.7 Flink has a new optional interface for Coder evolution*.
When a Flink pipeline is checkpointed, CoderSnapshots are written out
alongside the checkpointed data. When the pipeline is restored from
that checkpoint, the CoderSnapshots are restored and used to
reinstantiate the Coders.
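
For reference, the hook looks roughly like this. MyRecord and
MyRecordSerializer are placeholders for a user type and its
serializer, and the versioning scheme is invented for the example:

import java.io.IOException;
import org.apache.flink.api.common.typeutils.TypeSerializer;
import org.apache.flink.api.common.typeutils.TypeSerializerSchemaCompatibility;
import org.apache.flink.api.common.typeutils.TypeSerializerSnapshot;
import org.apache.flink.core.memory.DataInputView;
import org.apache.flink.core.memory.DataOutputView;

// The snapshot is written into the checkpoint next to the data; on
// restore it is read back first and used to rebuild the old serializer.
public class MyRecordSerializerSnapshot
    implements TypeSerializerSnapshot<MyRecord> {

  private int formatVersion;

  public MyRecordSerializerSnapshot() {} // invoked on restore

  public MyRecordSerializerSnapshot(int formatVersion) {
    this.formatVersion = formatVersion;  // invoked at checkpoint time
  }

  @Override
  public int getCurrentVersion() {
    return 1; // version of the snapshot format itself
  }

  @Override
  public void writeSnapshot(DataOutputView out) throws IOException {
    out.writeInt(formatVersion); // persisted alongside the data
  }

  @Override
  public void readSnapshot(int readVersion, DataInputView in,
      ClassLoader userCodeClassLoader) throws IOException {
    formatVersion = in.readInt(); // restored before any data is read
  }

  @Override
  public TypeSerializer<MyRecord> restoreSerializer() {
    // Reinstantiate a serializer that understands the old format.
    return new MyRecordSerializer(formatVersion);
  }

  @Override
  public TypeSerializerSchemaCompatibility<MyRecord> resolveSchemaCompatibility(
      TypeSerializer<MyRecord> newSerializer) {
    // The interesting part; sketched after the list below.
    return TypeSerializerSchemaCompatibility.compatibleAsIs();
  }
}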
Furthermore, there is a compatibility and migration check between the
old and the new Coder. This allows determining whether

- The serializer did not change or is compatible (ok)
- The serialization format of the coder changed (ok after migration)
- The coder needs to be reconfigured and we know how to do that based
  on the old version (ok after reconfiguration)
- The coder is incompatible (error)
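
Those four outcomes map directly onto Flink's
TypeSerializerSchemaCompatibility results. Completing the snapshot
sketch above, resolveSchemaCompatibility might look like this (the
version comparisons and the formatVersion() accessor are again made
up):

@Override
public TypeSerializerSchemaCompatibility<MyRecord> resolveSchemaCompatibility(
    TypeSerializer<MyRecord> newSerializer) {
  if (!(newSerializer instanceof MyRecordSerializer)) {
    // The coder is incompatible (error).
    return TypeSerializerSchemaCompatibility.incompatible();
  }
  int newVersion = ((MyRecordSerializer) newSerializer).formatVersion();
  if (newVersion == formatVersion) {
    // The serializer did not change or is compatible (ok).
    return TypeSerializerSchemaCompatibility.compatibleAsIs();
  }
  if (newVersion > formatVersion) {
    // The serialization format changed; Flink reads old bytes with
    // restoreSerializer() and re-encodes them with the new serializer.
    return TypeSerializerSchemaCompatibility.compatibleAfterMigration();
  }
  // The new serializer can be reconfigured based on the old version
  // (supported since Flink 1.8).
  return TypeSerializerSchemaCompatibility.compatibleWithReconfiguredSerializer(
      new MyRecordSerializer(formatVersion));
}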
I was wondering about the Coder evolution story in Beam. The current
state is that checkpointed Beam pipelines are only guaranteed to run
with the same Beam version and pipeline version. A newer version of
either might break the checkpoint format without any way to migrate
the state.
Should we start thinking about supporting Coder evolution in Beam?
Thanks,
Max
* Coders are called TypeSerializers in Flink land. The interface is
TypeSerializerSnapshot.