[ https://issues.apache.org/jira/browse/AVRO-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13607122#comment-13607122 ]
Scott Carey commented on AVRO-1124:
-----------------------------------

After some time off working on other priorities, I have returned to work on this. We have been running a version of this internally for ~4 months (not too different from the November patch). It is missing a few key features I am adding now to make subject configuration flexible and allow for configuration of custom or built-in validation rules. I will be updating this ticket with my latest work this week, and plan to push it to completion/contribution this month.

> RESTful service for holding schemas
> -----------------------------------
>
>                  Key: AVRO-1124
>                  URL: https://issues.apache.org/jira/browse/AVRO-1124
>              Project: Avro
>           Issue Type: New Feature
>             Reporter: Jay Kreps
>             Assignee: Jay Kreps
>          Attachments: AVRO-1124-draft.patch, AVRO-1124.patch
>
> Motivation: It is nice to be able to pass around data in serialized form but still know the exact schema that was used to serialize it. The overhead of storing the schema with each record is too high unless the individual records are very large. There are workarounds for some common cases: in the case of files a schema can be stored once with a file of many records, amortizing the per-record cost, and in the case of RPC the schema can be negotiated ahead of time and used for many requests. For other uses, though, it is nice to be able to pass a reference to a given schema using a small id and allow this to be looked up. Since only a small number of schemas are likely to be active for a given data source, these can easily be cached, so the number of remote lookups is very small (one per active schema version).
>
> Basically this would consist of two things:
> 1. A simple REST service that stores and retrieves schemas
> 2. Some helper java code for fetching and caching schemas for people using the registry
>
> We have used something like this at LinkedIn for a few years now, and it would be nice to standardize this facility to be able to build up common tooling around it. This proposal will be based on what we have, but we can change it as ideas come up.
>
> The facilities this provides are super simple: basically you can register a schema, which gives back a unique id for it, or you can query for a schema. There is almost no code, and nothing very complex. The contract is that before emitting/storing a record you must first publish its schema to the registry or know that it has already been published (by checking your cache of published schemas). When reading, you check your cache and, if you don't find the id/schema pair there, you query the registry to look it up (see the sketch below). I will explain some of the nuances in more detail below.
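> As a minimal sketch of that read-path caching contract (the SchemaRegistryClient interface and its method names here are made up for illustration and are not part of this proposal):
>
> {code:java}
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.concurrent.ConcurrentMap;
>
> import org.apache.avro.Schema;
>
> // Hypothetical client-side view of the registry, for illustration only.
> interface SchemaRegistryClient {
>   Schema fetchSchema(String group, String id);   // e.g. GET /schemas/<group>/<id>
>   String register(String group, Schema schema);  // e.g. POST, returns the assigned id
> }
>
> // Read-path helper: one remote lookup per active schema version, everything
> // else is served from the local id=>schema cache.
> class CachingSchemaLookup {
>   private final SchemaRegistryClient client;
>   private final ConcurrentMap<String, Schema> idToSchema =
>       new ConcurrentHashMap<String, Schema>();
>
>   CachingSchemaLookup(SchemaRegistryClient client) {
>     this.client = client;
>   }
>
>   Schema getSchema(String group, String id) {
>     String key = group + "/" + id;
>     Schema schema = idToSchema.get(key);
>     if (schema == null) {
>       schema = client.fetchSchema(group, id);  // remote call only on a cache miss
>       idToSchema.putIfAbsent(key, schema);
>     }
>     return schema;
>   }
> }
> {code}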
> An added benefit of such a repository is that it makes a few other things possible:
> 1. A graphical browser of the various data types that are currently used and all their previous forms.
> 2. Automatic enforcement of compatibility rules. Data is always compatible in the sense that the reader will always deserialize it (since they are using the same schema as the writer), but this does not mean it is compatible with the expectations of the reader. For example, if an int field is changed to a string, that will almost certainly break anyone relying on that field. This definition of compatibility can differ for different use cases and should likely be pluggable.
>
> Here is a description of one of our uses of this facility at LinkedIn. We use this to retain a schema with "log" data end-to-end, from the producing app to various real-time consumers as well as a set of resulting Avro files in Hadoop. This schema metadata can then be used to auto-create hive tables (or add new fields to existing tables), or to infer pig fields, all without manual intervention. One important definition of compatibility that is nice to enforce is compatibility with historical data for a given "table". Log data is usually loaded in an append-only manner, so if someone changes an int field in a particular data set to be a string, tools like pig or hive that expect static columns will be unusable. Even using plain-vanilla map/reduce, processing data where columns and types change willy-nilly is painful. However, the person emitting this kind of data may not know all the details of compatible schema evolution. We use the schema repository to validate that any change made to a schema doesn't violate the compatibility model, and reject the update if it does. We do this check both at run time and as part of the ant task that generates specific record code (as an early warning).
>
> Some details to consider:
>
> Deployment
> This can just be programmed against the servlet API and deployed as a standard war. You have lots of instances and load balance traffic over them.
>
> Persistence
> The storage needs are not very heavy. The clients are expected to cache the id=>schema mapping, and the server can cache as well. Even after several years of heavy use we have <50k schemas, each of which is pretty small. I think this part can be made pluggable and we can provide a jdbc- and a file-based implementation, as these don't require outlandish dependencies. People can easily plug in their favorite key-value store thingy if they like by implementing the right plugin interface. Actual reads will virtually always be cached in memory so this is not too important.
>
> Group
> In order to get the "latest" schema or handle compatibility enforcement on changes there has to be some way to group a set of schemas together and reason about the ordering of changes over these. I am going to call the grouping the "group". In our usage it is always the table or topic to which the schema is associated. For most of our usage the group name also happens to be the Record name, as all of our schemas are records and our default is to have these match. There are use cases, though, where a single schema is used for multiple topics, each of which is modeled independently. The proposal is not to enforce a particular convention but just to expose the group designator in the API. It would be possible to make the concept of group optional, but I can't come up with an example where that would be useful.
>
> Compatibility
> There are really different requirements for different use cases on what is considered an allowable change. Likewise it is useful to be able to extend this to have other kinds of checks (for example, in retrospect, I really wish we had required doc fields to be present so we could require documentation of fields as well as naming conventions). There can be some kind of general pluggable interface for this, like
>   SchemaChangeValidator.isValidChange(currentLatest, proposedNew)
> A reasonable implementation can be provided that does checks based on the rules in http://avro.apache.org/docs/current/spec.html#Schema+Resolution. By default no checks need to be done. Ideally you should be able to have more than one policy (say one treatment for database schemas, one for logging event schemas, and one which does no checks at all). I can't imagine a need for more than a handful of these, which would be statically configured (db_policy=com.mycompany.DBSchemaChangePolicy, noop=org.apache.avro.NoOpPolicy, ...). Each group can configure the policy it wants to be used going forward, with the default being none. A sketch of what such an interface might look like is below.
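> For illustration only (the interface name comes from the description above; the no-op class and its placement are made up, not an existing Avro API):
>
> {code:java}
> import org.apache.avro.Schema;
>
> // Pluggable compatibility policy; implementations are chosen per group via
> // static configuration (e.g. db_policy=com.mycompany.DBSchemaChangePolicy).
> interface SchemaChangeValidator {
>   /**
>    * Returns true if replacing the current latest schema of a group with the
>    * proposed new schema is an acceptable change under this policy.
>    */
>   boolean isValidChange(Schema currentLatest, Schema proposedNew);
> }
>
> // Default policy: no checks at all.
> class NoOpChangeValidator implements SchemaChangeValidator {
>   public boolean isValidChange(Schema currentLatest, Schema proposedNew) {
>     return true;
>   }
> }
>
> // A stricter policy would walk the two schemas and apply the Schema Resolution
> // rules from the spec (matching names/aliases, promotable types, defaults for
> // added fields, etc.) and return false on any violation.
> {code}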
> Security and Authentication
> There isn't any of this. The assumption is that this service is not publicly available and those accessing it are honest (though perhaps accident prone). These are just schemas, after all.
>
> Ids
> There are a couple of questions about how we make the ids that represent the schemas:
> 1. Are they sequential (1, 2, 3, ...) or hash based? If hash based, what is a sufficient collision probability?
> 2. Are they global or per-group? That is, if I know the id do I also need to know the group to look up the schema?
> 3. What kind of change triggers a new id? E.g. if I update a doc field does that give a new id? If not then that doc field will not be stored.
>
> For the id generation there are various options:
> - A sequential integer
> - AVRO-1006 creates a schema-specific 64-bit hash.
> - Our current implementation at LinkedIn uses the MD5 of the schema as the id.
>
> Our current implementation at LinkedIn uses the MD5 of the schema text after removing whitespace. The additional attributes like doc fields (and a few we made up) are actually important to us and we want them maintained (we add metadata fields of our own). This does mean we have some updates that generate a new schema id but don't cause a very meaningful semantic change to the schema (say because someone tweaked their doc string), but this doesn't hurt anything and it is nice to have the exact schema text represented. An example of using these metadata fields is using the schema doc fields as the hive column doc fields.
>
> The id is actually just a unique identifier, and the id generation algorithm can be made pluggable if there is a real trade-off. In retrospect I don't think using the md5 is good because it is 16 bytes, which for a small message is bulkier than needed. Since the id is retained with each message, size is a concern.
>
> The AVRO-1006 fingerprint is super cool, but I have a couple concerns (possibly just due to misunderstanding):
> 1. It seems to produce a 64-bit id. For a large number of schemas, 64 bits makes collisions unlikely but not unthinkable. Whether or not this matters depends on whether schemas are versioned per group or globally. If they are per group it may be okay, since most groups should only have a few hundred schema versions at most. If they are global I think it will be a problem. Probabilities for collision are given here under the assumption of perfect uniformity of the hash (it may be worse, but can't be better): http://en.wikipedia.org/wiki/Birthday_attack. If we did have a collision we would be dead in the water, since our data would be unreadable. If this becomes a standard mechanism for storing schemas, people will run into this problem.
> 2. Even 64 bits is a bit bulky. Since this id needs to be stored with every row, size is a concern, though a minor one.
> 3. The notion of equivalence seems to throw away many things in the schema (doc, attributes, etc). This is unfortunate. One nice thing about avro is you can add your own made-up attributes to the schema since it is just JSON. This acts as a kind of poor man's metadata repository. It would be nice to have these maintained rather than discarded.
>
> It is possible that I am misunderstanding the fingerprint scheme, though, so please correct me.
>
> My personal preference would be to use a sequential id per group. The main reason I like this is that the id doubles as the version number, i.e. my_schema/4 is the 4th version of the my_schema record/group. Persisted data then only needs to store the varint encoding of the version number, which is generally going to be 1 byte for a few hundred schema updates. The string my_schema/4 acts as a global id for this. This does allow per-group sharding for id generation, but sharding seems unlikely to be needed here. A 50GB database would store 52 million schemas. 52 million schemas "should be enough for anyone". :-)
>
> Probably the easiest thing would be to just make the id generation scheme pluggable (a sketch is below). That would kind of satisfy everyone, and, as a side benefit, give us at LinkedIn a gradual migration path off our MD5-based ids. In this case ids would basically be opaque url-safe strings from the point of view of the repository, and users could munge this id and encode it as they like.
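> To make that concrete, here is a minimal sketch of such a plugin point; the SchemaIdGenerator interface is an assumption, the MD5 variant mirrors the LinkedIn approach described above, and the whitespace stripping is deliberately naive (it would also mangle whitespace inside doc strings, so a real normalization would be more careful):
>
> {code:java}
> import java.math.BigInteger;
> import java.security.MessageDigest;
> import java.util.concurrent.atomic.AtomicLong;
>
> // Hypothetical plugin point: given the group and schema text, produce an
> // opaque url-safe id string.
> interface SchemaIdGenerator {
>   String idFor(String group, String schemaText) throws Exception;
> }
>
> // Sequential ids: the id doubles as the version number, so my_schema/4 is the
> // 4th registered version of my_schema. (A real implementation would persist
> // and recover a counter per group.)
> class SequentialIdGenerator implements SchemaIdGenerator {
>   private final AtomicLong counter = new AtomicLong(0);
>   public String idFor(String group, String schemaText) {
>     return Long.toString(counter.incrementAndGet());
>   }
> }
>
> // MD5 of the schema text after removing whitespace, hex-encoded
> // (16 bytes, so bulkier than a small sequential id).
> class Md5IdGenerator implements SchemaIdGenerator {
>   public String idFor(String group, String schemaText) throws Exception {
>     String normalized = schemaText.replaceAll("\\s", "");
>     byte[] digest = MessageDigest.getInstance("MD5").digest(normalized.getBytes("UTF-8"));
>     return String.format("%032x", new BigInteger(1, digest));
>   }
> }
> {code}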
> APIs
> Here are the proposed APIs. This tacitly assumes ids are per-group, but the change is pretty minor if not.
>
> Get a schema by id
> GET /schemas/<group>/<id>
> If the schema exists the response code will be 200 and the response body will be the schema text.
> If it doesn't exist the response will be 404.
>
> GET /schemas
> Produces a list of group names, one per line.
>
> GET /schemas/<group>
> Produces a list of versions for the given group, one per line.
>
> GET /schemas/<group>/latest
> If the group exists the response code will be 200 and the response body will be the schema text of the last registered schema.
> If the group doesn't exist the response code will be 404.
>
> Register a schema
> POST /schemas/groups/<group_name>
> Parameters:
>   schema=<text of schema>
>   compatibility_model=XYZ
>   force_override=(true|false)
> There are a few cases:
> If the group exists and the change is incompatible with the current latest, the server response code will be 403 (forbidden) UNLESS the force_override flag is set, in which case no check will be made.
> If the server doesn't have an implementation corresponding to the given compatibility model key it will give response code 400.
> If the group does not exist it will be created with the given schema (and compatibility model).
> If the group exists and this schema has already been registered, the server returns response code 200 and the id already assigned to that schema.
> If the group exists, but this schema hasn't been registered, and the compatibility checks pass, then the response code will be 200 and it will store the schema and return the id of the schema.
> The force_override flag allows registering an incompatible schema. We have found that sometimes you know "for sure" that your change is okay and just want to damn the torpedoes and charge ahead. This would be intended for manual rather than programmatic usage.
>
> Intended Usage
> Let's assume we are implementing a put and get API, as a database would have, using this registry; there is no substantial difference for a messaging-style API. Here are the details of how this works. Say you have two methods:
>   void put(table, key, record)
>   Record get(table, key)
> Put is expected to do the following under the covers:
> 1. Check the record's schema against a local cache of schema=>id to get the schema id
> 2. If it is not found then register it with the schema registry, get back a schema id, and add this pair to the cache
> 3. Store the serialized record bytes and schema id
> Get is expected to do the following:
> 1. Retrieve the serialized record bytes and schema id from the store
> 2. Check a local cache to see if this schema is known for this schema id
> 3. If not, fetch the schema by id from the schema registry
> 4. Deserialize the record using the schema and return it
> A rough sketch of this flow follows.
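> Purely as an illustration (the SchemaRegistryClient interface from the earlier sketch, the storage helpers, and the StoredRow holder are all assumptions, and error handling is omitted):
>
> {code:java}
> import java.io.ByteArrayOutputStream;
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.concurrent.ConcurrentMap;
>
> import org.apache.avro.Schema;
> import org.apache.avro.generic.GenericDatumReader;
> import org.apache.avro.generic.GenericDatumWriter;
> import org.apache.avro.generic.GenericRecord;
> import org.apache.avro.io.BinaryDecoder;
> import org.apache.avro.io.BinaryEncoder;
> import org.apache.avro.io.DecoderFactory;
> import org.apache.avro.io.EncoderFactory;
>
> class RegistryAwareStore {
>   private final SchemaRegistryClient registry;  // hypothetical client, see earlier sketch
>   private final ConcurrentMap<Schema, String> schemaToId = new ConcurrentHashMap<Schema, String>();
>   private final ConcurrentMap<String, Schema> idToSchema = new ConcurrentHashMap<String, Schema>();
>
>   RegistryAwareStore(SchemaRegistryClient registry) {
>     this.registry = registry;
>   }
>
>   void put(String table, byte[] key, GenericRecord record) throws Exception {
>     Schema schema = record.getSchema();
>     String id = schemaToId.get(schema);                   // 1. check local schema=>id cache
>     if (id == null) {
>       id = registry.register(table, schema);              // 2. register on miss and cache the pair
>       schemaToId.putIfAbsent(schema, id);
>     }
>     ByteArrayOutputStream out = new ByteArrayOutputStream();
>     BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
>     new GenericDatumWriter<GenericRecord>(schema).write(record, enc);
>     enc.flush();
>     storeRow(table, key, id, out.toByteArray());          // 3. store schema id + serialized bytes
>   }
>
>   GenericRecord get(String table, byte[] key) throws Exception {
>     StoredRow row = fetchRow(table, key);                 // 1. read back schema id + bytes
>     Schema schema = idToSchema.get(row.schemaId);         // 2. check local id=>schema cache
>     if (schema == null) {
>       schema = registry.fetchSchema(table, row.schemaId); // 3. fetch the schema by id on miss
>       idToSchema.putIfAbsent(row.schemaId, schema);
>     }
>     BinaryDecoder dec = DecoderFactory.get().binaryDecoder(row.bytes, null);
>     return new GenericDatumReader<GenericRecord>(schema).read(null, dec);  // 4. deserialize
>   }
>
>   // The storage layer itself is out of scope for this sketch.
>   static class StoredRow { String schemaId; byte[] bytes; }
>   private void storeRow(String table, byte[] key, String schemaId, byte[] bytes) {
>     throw new UnsupportedOperationException("storage layer not shown");
>   }
>   private StoredRow fetchRow(String table, byte[] key) {
>     throw new UnsupportedOperationException("storage layer not shown");
>   }
> }
> {code}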
> Code Layout
> Where to put this code? Contrib package? Elsewhere? Someone should tell me...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira