+1, great to see this proposal coming out.

- Sijie

On Fri, Jan 26, 2018 at 12:57 PM, David Rusek <d...@streaml.io> wrote:

> https://gist.github.com/mgodave/b265250a685f3574166ae617462ea4f9
>
> -------
>
> * **Status**: Proposal
> * **Author**: Dave Rusek - Streamlio
> * **Pull Request**: See Below
> * **Mailing List discussion**:
>
> ## Motivation
>
> Data flowing through a messaging system is typically untyped. Data flows
> end to end as bytes, and only the producers and consumers are aware of the
> type and structure of the data. This requires systems to coordinate
> out-of-band and makes it hard for other systems to discover useful data on
> which they can operate. Schema registries help to alleviate these problems
> by providing a centralized storage area for structural definitions of
> system data. With a centralized repository, systems producing data can
> communicate the structure of that data to downstream systems.
>
> This document is a proposal to build a schema registry service tightly
> integrated with Pulsar's topic hierarchy. This schema integration is an
> opt-in feature and will not affect existing or future properties, clusters,
> namespaces, or topics that do not choose to take advantage of it.
> If, however, an administrator chooses to use this functionality, it will
> serve as a self-describing integrity check for data in the system and will
> allow integrations between Pulsar and other systems that are able to
> discover and take advantage of this type information.
>
> ## Design
>
> ### Data Model
>
> ```protobuf
> message Schema {
>   enum Format {
>     AVRO = 0;
>     JSON = 1;
>     PROTOBUF = 2;
>     THRIFT = 3;
>   }
>
>   enum State {
>     STAGED = 1;
>     ACTIVE = 2;
>   }
>
>   optional string name = 1;
>   optional int32 version = 2;
>   optional Format format = 3;
>   optional State state = 4;
>   optional string modified_user = 5;
>   optional string modified_time = 6;
> }
> ```
>
> ### Storing Schema Data
>
> Schema data will be stored alongside message data in BookKeeper. Much like
> a managed ledger, schema entries will be stored as an append-only, ordered
> list of entries. Schema entries occupy a BookKeeper ledger, and a topic
> with an associated schema will require a ZooKeeper node. Topics without any
> associated schema data will incur no overhead.
>
> [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/1)
>
> ### Serving Schema Data
>
> Serving schemas from the Pulsar brokers would allow us to take advantage of
> the topic ownership routing logic to co-locate a schema with its topic, as
> well as ensure a single owner per schema ledger in the case of the
> Streamlio schema registry. Such an arrangement would serve both reads and
> writes through the same broker. This will require a new admin API to expose
> the schema data model as a collection of REST resources.
>
> ```java
> @GET @Path("/{property}/{cluster}/{namespace}/{topic}/schema")
> @GET @Path("/{property}/{cluster}/{namespace}/{topic}/schema/{version}")
> @DELETE @Path("/{property}/{cluster}/{namespace}/{topic}/schema")
> @POST @Path("/{property}/{cluster}/{namespace}/{topic}/schema")
> ```
>
> [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/2)
>
> ## Changes
>
> * Implement a Schema Repository in Pulsar brokers
>   [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/1)
> * Add Schema resources to broker admin API
>   [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/2)
> * Extend client/server binary protocol to expose schema to client
>   [PR](https://github.com/apache/incubator-pulsar/pull/1112)
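
For anyone following along, here is a minimal in-memory sketch of the append-only, versioned schema log described under "Storing Schema Data", with one method per REST operation in the proposed admin API. It is only an illustration and not the code in the staged PRs: the class name `SchemaLogSketch`, the `definition` payload field, and the choice to model DELETE as an appended tombstone are all assumptions, and a plain list stands in for the BookKeeper ledger.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical in-memory stand-in for a per-topic schema ledger.
// A real implementation would append to BookKeeper instead of a List.
final class SchemaLogSketch {
    enum Format { AVRO, JSON, PROTOBUF, THRIFT }

    // One immutable log entry. "definition" is an assumed payload field;
    // "deleted" marks a tombstone (also an assumption of this sketch).
    static final class Entry {
        final int version;
        final Format format;
        final String definition;
        final boolean deleted;

        Entry(int version, Format format, String definition, boolean deleted) {
            this.version = version;
            this.format = format;
            this.definition = definition;
            this.deleted = deleted;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    // POST .../schema: append a new entry; versions count up from 1.
    synchronized int post(Format format, String definition) {
        int version = entries.size() + 1;
        entries.add(new Entry(version, format, definition, false));
        return version;
    }

    // DELETE .../schema: append a tombstone instead of rewriting history.
    synchronized int delete() {
        int version = entries.size() + 1;
        entries.add(new Entry(version, null, null, true));
        return version;
    }

    // GET .../schema: the most recent entry, unless it is a tombstone.
    synchronized Optional<Entry> latest() {
        return entries.isEmpty() ? Optional.empty() : get(entries.size());
    }

    // GET .../schema/{version}: a specific entry, if present and not a tombstone.
    synchronized Optional<Entry> get(int version) {
        if (version < 1 || version > entries.size()) {
            return Optional.empty();
        }
        Entry entry = entries.get(version - 1);
        return entry.deleted ? Optional.empty() : Optional.of(entry);
    }

    public static void main(String[] args) {
        SchemaLogSketch log = new SchemaLogSketch();
        log.post(Format.AVRO, "{\"type\": \"string\"}");
        int v2 = log.post(Format.JSON, "{\"type\": \"object\"}");
        System.out.println("latest version: " + log.latest().get().version);
        log.delete();
        System.out.println("schema present after delete: " + log.latest().isPresent());
        System.out.println("version " + v2 + " still readable: " + log.get(v2).isPresent());
    }
}
```

Because nothing is ever rewritten, earlier versions stay readable after a delete in this sketch. Whether the real registry should behave that way is an open design question rather than something the proposal states.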