Matthieu, how would that work with enums? I think it may make sense for enums to add a special value for it.
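As a rough sketch of what such a special value could look like for enums: the reader's enum declares one symbol to fall back on when the writer used a symbol the reader does not know. The function name `resolve_enum_symbol`, the `fallback` parameter, and the `UNKNOWN` symbol are hypothetical and only illustrate the idea; they are not part of Avro.

```python
def resolve_enum_symbol(written_symbol, reader_symbols, fallback=None):
    """Map a symbol written with the writer's enum onto the reader's enum.

    `fallback` is a hypothetical, explicitly declared catch-all symbol.
    Without it, an unknown symbol is an error.
    """
    if written_symbol in reader_symbols:
        return written_symbol
    if fallback is None:
        raise ValueError(f"unknown enum symbol: {written_symbol!r}")
    if fallback not in reader_symbols:
        raise ValueError(f"fallback {fallback!r} is not a reader symbol")
    return fallback


# The reader only knows CAR and TRUCK, plus the explicit catch-all symbol.
reader_symbols = ["UNKNOWN", "CAR", "TRUCK"]

known = resolve_enum_symbol("CAR", reader_symbols, fallback="UNKNOWN")
unknown = resolve_enum_symbol("BUS", reader_symbols, fallback="UNKNOWN")
```

Because the fallback has to be declared in the reader's schema, a reader that does not opt in still gets an error for unknown symbols.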
rb

On Sat, Apr 16, 2016 at 1:23 PM, Matthieu Monsch wrote:

> > I think that substituting different data for unrecognized branches in
> > a union isn't the way to fix the problem, but I have a proposal for a
> > way to fix it that looks more like you'd expect in the OO example
> > above: by adding the Vehicle class to your read schema.
> >
> > Right now, unions are first resolved by full class name as required by
> > the spec. But after that, we have some additional rules to match
> > schemas. These rules are how we reconcile name differences from
> > situations like writing with a generic class and reading with a
> > specific class. I'm proposing you use a catch-all class (the
> > superclass) with fields that are in all of the union's branches, and
> > we update schema resolution to allow it.
>
> That sounds good. The only thing I’d add is to make this catch-all
> behavior explicit in the schema (similar to how field defaults must be
> explicitly added).
>
> To help fix another common writer evolution issue, we could also add a
> similar catch-all for `enum`s (optional, to be explicitly specified in
> the schema).
>
> -Matthieu
>
> > On Apr 12, 2016, at 2:41 PM, Ryan Blue wrote:
> >
> > Yacine,
> >
> > Thanks for the extra information. Sorry for my delay in replying; I
> > wanted time to think about your suggestion.
> >
> > I see what you mean that you can think of a union as the superclass of
> > its options. The reflect object model has test code that does just
> > that, where classes B and C inherit from A and the schema for A is
> > created as a union of B and C. But I don't think that your suggestion
> > aligns with the expectations of object-oriented design. Maybe that's
> > an easier way to present my concern:
> >
> > Say I have a class, Vehicle, with subclasses Car and Truck. I have
> > applications that work with my dataset, the vehicles that my company
> > owns, and we buy a bus.
> > If I add a Bus class, what normally happens is that an older
> > application can work with it. A maintenance tracker would call
> > getVehicles and can still get the lastMaintenanceDate for my Bus, even
> > though it doesn't know about buses. But what you suggest is that it is
> > replaced with a default, say null, in cases like this.
> >
> > I think that the problem is that Avro has no equivalent concept of
> > inheritance. There's only one way to represent it for what you need
> > right now, like Matthieu suggested. I think that substituting
> > different data for unrecognized branches in a union isn't the way to
> > fix the problem, but I have a proposal for a way to fix it that looks
> > more like you'd expect in the OO example above: by adding the Vehicle
> > class to your read schema.
> >
> > Right now, unions are first resolved by full class name as required by
> > the spec. But after that, we have some additional rules to match
> > schemas. These rules are how we reconcile name differences from
> > situations like writing with a generic class and reading with a
> > specific class. I'm proposing you use a catch-all class (the
> > superclass) with fields that are in all of the union's branches, and
> > we update schema resolution to allow it. So you'd have...
> >
> > record Vehicle {
> >   long lastMaintenanceDate;
> > }
> >
> > record Car {
> >   long lastMaintenanceDate;
> >   int maxChildSeats;
> > }
> >
> > record Truck {
> >   long lastMaintenanceDate;
> >   long lastWashedDate;
> > }
> >
> > Write schema: [null, Car, Truck]
> > Read schema: [null, Car, Vehicle]
> >
> > The catch-all class could provide the behavior you want, without
> > silently giving you incorrect data.
> >
> > Does that sound like a reasonable solution for your use case?
> >
> > rb
> >
> > On Tue, Mar 29, 2016 at 10:03 PM, Matthieu Monsch wrote:
> >
> >> Hi Yacine,
> >>
> >> I believe Ryan was mentioning that if you start from a schema
> >> `["null", "Car"]` then rather than add a new bus branch to the union,
> >> you could update the car’s schema to include new bus fields. For
> >> example (using IDL notation):
> >>
> >>> // Initial approach
> >>> // Evolved to union { null, Car, Bus }
> >>> record Car {
> >>>   string vin;
> >>> }
> >>> record Bus {
> >>>   string vin;
> >>>   int capacity;
> >>> }
> >>>
> >>> // Alternative approach
> >>> // Here evolution wouldn't change the union { null, Vehicle }
> >>> // Note also that this is actually a compatible evolution from the
> >>> // first approach, even under current rules (using aliases for name
> >>> // changes).
> >>> record Vehicle {
> >>>   string vin; // (Would be optional if not all branches had it.)
> >>>   union { null, int } capacity; // The new field is added here.
> >>> }
> >>
> >> Pushing this further, you could also directly start upfront with just
> >> a “Vehicle” record schema with an optional `car` field and add new
> >> optional fields as necessary (for example a `bus` field). This gives
> >> you much greater flexibility for evolving readers and writers
> >> separately (but you lose the ability to enforce that at most one
> >> field is ever present in the record).
> >>
> >> I agree that Avro’s current evolution rules are more geared towards
> >> reader evolution: evolving writers while keeping readers constant is
> >> difficult (namely because they don't support adding branches to
> >> unions or symbols to enums). However, adding a completely new and
> >> separate “compatibility mode” has a high complexity cost, both for
> >> implementors (the separate classes you mention) and users (who must
> >> invoke them specifically). It would be best to keep evolution rules
> >> universal.
> >>
> >> Maybe we could extend the logic behind field defaults?
> >> Enum schemas could simply contain a default symbol attribute. For
> >> unions, it’s a bit trickier, but we should be able to get by with a
> >> default branch attribute which would allow evolution if and only if
> >> the newly added branches can be read as the default branch. Assuming
> >> the attribute names don’t already exist, this would be compatible
> >> with the current behavior.
> >>
> >> I think this use-case is common enough that it’d be worth exploring
> >> (for example, Rest.li [1] adds an `$UNKNOWN` symbol to somewhat
> >> address this issue, though unfortunately not completely [2]).
> >>
> >> -Matthieu
> >>
> >> [1] https://github.com/linkedin/rest.li
> >> [2] https://github.com/linkedin/rest.li/wiki/Snapshots-and-Resource-Compatibility-Checking#why-is-adding-to-an-enum-considered-backwards-incompatible
> >>
> >>> On Mar 29, 2016, at 2:40 AM, Yacine Benabderrahmane wrote:
> >>>
> >>> Hi Ryan,
> >>>
> >>> Just a little bump ^^. Could you (or anyone else) please give me
> >>> some feedback?
> >>> Thanks in advance.
> >>>
> >>> Regards,
> >>> Yacine.
> >>>
> >>> 2016-03-21 17:36 GMT+01:00 Yacine Benabderrahmane:
> >>>
> >>>> Hi Ryan,
> >>>>
> >>>> Thank you for giving feedback. I will try in the following to give
> >>>> you some more details about the problem being addressed.
> >>>>
> >>>> But before that, just a brief reminder of the context. Avro has been
> >>>> chosen in this project (and by many other ones for sure) especially
> >>>> for a very important feature: enabling forward and backward
> >>>> compatibility management through the schema life-cycle. Our
> >>>> development model involves intensive usage of this feature, and many
> >>>> heavy developments are made in parallel streams inside feature teams
> >>>> that share the same schema, provided the evolution of the latter
> >>>> complies with the stated compatibility rules.
> >>>> This implies that all the entities supported by the Avro schema must
> >>>> support two-way compatibility, including unions. However, in the
> >>>> special case of the union, this two-way compatibility is not well
> >>>> supported by the current rules. Let me explain the basis of our
> >>>> point of view; it remains quite simple.
> >>>>
> >>>> The use case is to have, for example,
> >>>> - a first union version A:
> >>>>     { "name": "Vehicle",
> >>>>       "type": ["null", "Car"] }
> >>>> - a new version of it B:
> >>>>     { "name": "Vehicle",
> >>>>       "type": ["null", "Car", "Bus"] }
> >>>> To be forward compatible, an evolution of the union schema must
> >>>> guarantee that an old reader reading with A can read the data
> >>>> written with the new schema B. Getting an error just means that the
> >>>> forward compatibility feature is broken. But that is not what
> >>>> happens today, and the current behavior is not suitable, because the
> >>>> old reader has a correct schema and this schema has evolved
> >>>> naturally to version B to incorporate a new Vehicle type. Not
> >>>> knowing this new type must not produce an error, but just give the
> >>>> reader a default value, which means: "Either the data is not there
> >>>> or you do not know how to handle it".
> >>>>
> >>>> This reasoning keeps in mind that, in object-oriented modeling, a
> >>>> union field is seen as a class member with the most generic type
> >>>> ("Any" in Scala or "Object" in Java 5+...). Therefore, it is natural
> >>>> for a modeler / programmer to handle the case of not getting the
> >>>> expected types by using some default value of a known type. To give
> >>>> a more complete specification, the new mode of compatibility has to
> >>>> impose one rule: the union default value must not change through
> >>>> versions and the corresponding type must be placed at the top of the
> >>>> types list.
> >>>> This is much easier for development streams to handle, because it is
> >>>> addressed once and for all at the very beginning of the schema
> >>>> life-cycle, than obliging a number of teams, some of which are no
> >>>> longer in place, to update their whole code just because another dev
> >>>> team has deployed a new version of the union in the schema.
> >>>>
> >>>> Now, to be backward compatible, the reader with B must always be
> >>>> able to read data written with schema A, even if the type included
> >>>> in the data is not known, in which case it gets the default value
> >>>> and not an error.
> >>>>
> >>>> I understand that getting an error could make sense when the
> >>>> requested field is not present. However, this behavior:
> >>>>
> >>>> - is very restrictive: it obliges the old reader to update its code
> >>>>   to integrate the new schema, which it may be unable to do for many
> >>>>   reasons: the development stream of the next delivery is not
> >>>>   finished, or not engaged, or not even planned - in the case of old
> >>>>   and stable code
> >>>> - breaks the forward compatibility feature: the older reader is not
> >>>>   able to read the new version of the union without getting an error
> >>>> - breaks the backward compatibility feature: the new reader is not
> >>>>   able to read an old version containing unknown types of the union
> >>>>   without getting an error
> >>>>
> >>>> By the way, what exactly do you mean by "pushing evolution lower"
> >>>> and "update the record"? Could you please give me an example of the
> >>>> trick you are talking about?
> >>>>
> >>>> Just to be a bit more precise, we are not aiming to use a "trick".
> >>>> This life-cycle management should be included in a standard so as to
> >>>> keep the software development clean, production safe and compliant
> >>>> with a complex product road-map.
> >>>>
> >>>> Finally, you seem to be concerned by the "significant change to the
> >>>> current evolution rules". Well, we do not actually change these
> >>>> rules; they stay just the same. All we are proposing is to introduce
> >>>> a *mode* where the rules of union compatibility change. This mode
> >>>> has a minimal, thin impact on the existing classes, without any
> >>>> change in their behavior; all the logic of the new compatibility
> >>>> mode is implemented by new classes that must be invoked
> >>>> specifically. But you would see it better in the code patch.
> >>>>
> >>>> Looking forward to reading your feedback and answers.
> >>>>
> >>>> Regards,
> >>>> Yacine.
> >>>>
> >>>> 2016-03-17 19:00 GMT+01:00 Ryan Blue:
> >>>>
> >>>>> Hi Yacine,
> >>>>>
> >>>>> Thanks for the proposal. It sounds interesting, but I want to make
> >>>>> sure there's a clear use case for this because it's a significant
> >>>>> change to the current evolution rules. Right now we guarantee that
> >>>>> a reader will get an error if the data has an unknown union branch
> >>>>> rather than getting a default value. I think that makes sense: if
> >>>>> the reader is requesting a field, it should get the actual datum
> >>>>> for it rather than a default because it doesn't know how to handle
> >>>>> it.
> >>>>>
> >>>>> Could you give us an example use case that requires this new logic?
> >>>>>
> >>>>> I just want to make sure we can't solve your problem another way.
> >>>>> For example, pushing evolution lower in the schema usually does the
> >>>>> trick: rather than having ["null", "RecordV1"] => ["null",
> >>>>> "RecordV1", "RecordV2"], it is usually better to update the record
> >>>>> so that older readers can ignore the new fields.
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> rb
> >>>>>
> >>>>> On Mon, Mar 14, 2016 at 7:30 AM, Yacine Benabderrahmane wrote:
> >>>>>
> >>>>>> Hi all,
> >>>>>>
> >>>>>> In order to provide a solution to the union schema evolution
> >>>>>> problem, as raised earlier in the thread "add a type to a union
> >>>>>> <http://search-hadoop.com/m/F2svI1IXrQS1bIFgU1/union+evolution&subj=add+a+type+to+a+union>"
> >>>>>> of the user mailing list, we decided, for the needs of the
> >>>>>> reactive architecture we have implemented for one of our clients,
> >>>>>> to implement an evolution of the compatibility principle of Avro
> >>>>>> when using unions. As a reminder, the question was how to handle
> >>>>>> the case where a reader, using an old version of a schema that
> >>>>>> includes a union, reads some data written with a new version of
> >>>>>> the schema where a type has been added to the union.
> >>>>>>
> >>>>>> As answered by Martin Kleppmann in that thread, one way to handle
> >>>>>> this kind of evolution (a new version of the schema adds a new
> >>>>>> type in a union) would be to ensure that all the development
> >>>>>> streams have integrated the new schema B before deploying it in
> >>>>>> the IT schema referential.
> >>>>>> However, in big structures involving strongly uncorrelated teams
> >>>>>> (from the product life-cycle point of view), this approach appears
> >>>>>> to be quite impracticable, causing production stream congestion,
> >>>>>> blocking behavior between teams, and a bunch of other
> >>>>>> unwanted-counter-agile-/-reactive-phenomena...
> >>>>>>
> >>>>>> Therefore, we had to implement a new *compatibility* *mode* for
> >>>>>> the unions, while taking care to comply with the following rules:
> >>>>>>
> >>>>>> 1. Clear rules of compatibility are stated and integrated for
> >>>>>>    this compatibility mode
> >>>>>> 2. The standard Avro behavior must be kept intact
> >>>>>> 3. All the evolution implementation must be done without
> >>>>>>    introducing any regression (all existing tests of the Avro
> >>>>>>    stack must succeed)
> >>>>>> 4. The code impact on the Avro stack must be minimized
> >>>>>>
> >>>>>> Just to give you a very brief overview (as I don't know if this is
> >>>>>> actually the place for a full detailed description), the evolution
> >>>>>> addresses the typical problem where two development streams use
> >>>>>> the same schema but in different versions, in the case described
> >>>>>> shortly as follows:
> >>>>>>
> >>>>>> - The first development stream, called "DevA", uses version A of a
> >>>>>>   schema which integrates a union referencing two types, say
> >>>>>>   "null" and "string". The default value is set to null.
> >>>>>> - The second development team, called "DevB", uses version B,
> >>>>>>   which is an evolution of version A, as it adds a reference to a
> >>>>>>   new type in the former union, say "long" (which makes it "null",
> >>>>>>   "string" and "long")
> >>>>>> - When schema B is deployed on the schema referential (in our
> >>>>>>   case, the IO Confluent Schema Registry) after version A:
> >>>>>>   - The stream "DevA" must be able to read with schema A, even if
> >>>>>>     the data has been written using schema B with the type "long"
> >>>>>>     in the union. In the latter case, the read value is the union
> >>>>>>     default value
> >>>>>>   - The stream "DevB" must be able to read/write with schema B,
> >>>>>>     even if it writes the data using the type "long" in the union
> >>>>>>
> >>>>>> The evolution that we implemented for this mode includes some
> >>>>>> rules that are based on the principles stated in the Avro
> >>>>>> documentation.
> >>>>>> It is even more powerful than shown in the few lines above, as it
> >>>>>> enables the readers to get the default value of the union if the
> >>>>>> schema used for reading does not contain the type used by the
> >>>>>> writer in the union. This achieves a new mode of forward /
> >>>>>> backward compatibility. This evolution is working perfectly for
> >>>>>> now, and should be in production in the next few weeks. We have
> >>>>>> also made an evolution of the IO Confluent Schema Registry stack
> >>>>>> to support it, again in a transparent manner (we also intend to
> >>>>>> contribute to this stack in a second / parallel step).
> >>>>>>
> >>>>>> With the aim of contributing this new compatibility mode for
> >>>>>> unions to the Avro stack, I have some questions about the
> >>>>>> procedure:
> >>>>>>
> >>>>>> 1. How should I submit the contribution proposal? Should I
> >>>>>>    directly provide a patch in JIRA and dive into the details
> >>>>>>    right there?
> >>>>>> 2. The base version of this evolution is 1.7.7; is it eligible
> >>>>>>    for contribution evaluation anyway?
> >>>>>>
> >>>>>> Thanks in advance, looking forward to hearing from you and giving
> >>>>>> you more details.
> >>>>>>
> >>>>>> Kind Regards,
> >>>>>> --
> >>>>>> *Yacine Benabderrahmane*
> >>>>>> Architect
> >>>>>> *OCTO Technology*
> >>>>>> <http://www.octo.com>
> >>>>>> -----------------------------------------------
> >>>>>> Tel : +33 6 10 88 25 98
> >>>>>> 50 avenue des Champs Elysées
> >>>>>> 75008 PARIS
> >>>>>> www.octo.com
> >>>>>>
> >>>>>
> >>>>> --
> >>>>> Ryan Blue
> >>>>> Software Engineer
> >>>>> Netflix
> >>>>>
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>

--
Ryan Blue
Software Engineer
Netflix
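
The catch-all proposal from the thread (write schema [null, Car, Truck], read schema [null, Car, Vehicle]) can be sketched in a few lines. This is only an illustration of the idea under simplifying assumptions, not Avro's resolution code: records are plain dicts, schemas are reduced to lists of field names, the null branch is left out, and `resolve_branch` and its `catch_all` parameter are hypothetical names.

```python
# Reader-side record schemas, reduced to their field names.
READER_RECORDS = {
    "Car": ["lastMaintenanceDate", "maxChildSeats"],
    "Vehicle": ["lastMaintenanceDate"],  # catch-all: fields shared by all branches
}


def resolve_branch(writer_name, datum, reader_records, catch_all=None):
    """Choose the reader branch for a datum written as `writer_name`.

    A branch with the same name wins, as with name-based resolution.
    Otherwise the explicitly declared catch-all record is used, but only
    if the written datum carries every field the catch-all requires.
    """
    if writer_name in reader_records:
        target = writer_name
    elif catch_all is not None and all(
        field in datum for field in reader_records[catch_all]
    ):
        target = catch_all
    else:
        raise ValueError(f"no reader branch for {writer_name!r}")
    # Project the datum onto the fields the reader asked for.
    return target, {field: datum[field] for field in reader_records[target]}


# A Truck written by a newer writer, read by a reader that has no Truck branch.
truck = {"lastMaintenanceDate": 1460000000, "lastWashedDate": 1460500000}
branch, value = resolve_branch("Truck", truck, READER_RECORDS, catch_all="Vehicle")
```

Here the older reader still sees the real `lastMaintenanceDate` of the truck instead of a substituted default, and a reader that declares no catch-all keeps getting an error for unknown branches.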
