The concept of a schema registry is good. I have some questions, concerns, and comments.

1) Access control. There may be a need for some topics to remain private, hidden from users who don't have permissions on them. The existence of such a topic (or its schema) should not be disclosed by discovery. That is, unauthorized requests (probes) should return 404 (Not Found), not 401 (Unauthorized) or 403 (Forbidden). Please ensure this in the implementation.

2) Isn't the Schema message definition (the meta-schema) best left to each particular installation? Shouldn't Pulsar define a Schema as just a few base fields plus key-value pairs? I can think of many different fields in addition to name, version, format, state, and the two modified_* fields. Every time someone needs to add something to the meta-schema, it will require a protocol change. The list of optional fields in the current definition is arbitrary. There is nothing particular about those fields that requires them to be enumerated. The optional nature of all those fields points to the same issue: the list is very subjective and will be subject to all sorts of additions and deletions. Isn't that list better implemented as key-value properties? What is the rationale for this specific set of fields?

3) Use of ZooKeeper as a repo. I speak from experience, as I run some of the largest Pulsar clusters in existence. I have a significant disagreement with using ZooKeeper as the meta repo for the Schema. Even if the actual Schema is stored in BookKeeper ledgers, using a ZK node per schema increases ZK load and reduces the scalability of Pulsar. ZK nodes have a significant impact on Pulsar's scalability limits, and I mean the impact from the very existence of a ZK node, not the reads and writes on that node. I understand this feature is optional, but that does not solve the underlying issue. Pulsar should be moving towards reducing ZK usage, not increasing it. We should be working to reduce even the existing usage of ZK wherever possible.

We should not be building a feature that forces a tradeoff between using it and scalability. I would like to use this feature, but as it stands, it is going to reduce the working limit of my clusters by 15-20%. That is definitely not a good thing.

Joe

On Mon, Feb 5, 2018 at 11:04 AM, Sijie Guo <guosi...@gmail.com> wrote:

> +1 great to see this proposal coming out.
>
> - Sijie
>
> On Fri, Jan 26, 2018 at 12:57 PM, David Rusek <d...@streaml.io> wrote:
>
> > https://gist.github.com/mgodave/b265250a685f3574166ae617462ea4f9
> >
> > -------
> >
> > * **Status**: Proposal
> > * **Author**: Dave Rusek - Streamlio
> > * **Pull Request**: See Below
> > * **Mailing List discussion**:
> >
> > ## Motivation
> >
> > Data flowing through a messaging system is typically untyped. Data flows
> > from end to end as bytes, and only the producers and consumers are aware of
> > the type and structure of the data. This requires systems to coordinate
> > out-of-band and makes it hard for other systems to discover useful data on
> > which they can operate. Schema registries help to alleviate these problems
> > by providing a centralized storage area for structural definitions of system
> > data. By having a centralized storage repository, systems producing data to
> > the system can communicate to downstream systems the structure of the data
> > being produced.
> >
> > This document is a proposal to build a schema registry service tightly
> > integrated with Pulsar's topic hierarchy. This schema integration is an
> > opt-in feature and will not affect existing or future properties, clusters,
> > namespaces, or topics that do not choose to take advantage.
> > If, however, an administrator chooses to use this functionality, then it
> > will serve as a self-describing integrity check for data in the system as
> > well as allow integrations between Pulsar and other systems that are able to
> > discover and take advantage of this type information.
> >
> > ## Design
> >
> > ### Data Model
> >
> > ```protobuf
> > message Schema {
> >   enum Format {
> >     AVRO = 0;
> >     JSON = 1;
> >     PROTOBUF = 2;
> >     THRIFT = 3;
> >   }
> >
> >   enum State {
> >     STAGED = 1;
> >     ACTIVE = 2;
> >   }
> >
> >   optional string name = 1;
> >   optional int32 version = 2;
> >   optional Format format = 3;
> >   optional State state = 4;
> >   optional string modified_user = 5;
> >   optional string modified_time = 6;
> > }
> > ```
> >
> > ### Storing Schema Data
> >
> > Schema data will be stored alongside message data in BookKeeper. Much like a
> > managed ledger, schema entries will be stored as an append-only, ordered
> > list of entries. Schema entries occupy a BookKeeper ledger, and a topic with
> > an associated schema will require a ZooKeeper node. Topics without any
> > associated schema data will incur no overhead.
> >
> > [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/1)
> >
> > ### Serving Schema Data
> >
> > Serving schemas from the Pulsar brokers would allow us to take advantage of
> > the topic ownership routing logic to co-locate a schema with its topic as
> > well as ensure a single owner per schema ledger in the case of the Streamlio
> > schema registry. Such an arrangement would serve both reads and writes
> > through the same broker. This will require a new admin API to expose the
> > schema data model as a collection of REST resources.
> >
> > ```java
> > @GET @Path("/{property}/{cluster}/{namespace}/{topic}/schema")
> > @GET @Path("/{property}/{cluster}/{namespace}/{topic}/schema/{version}")
> > @DELETE @Path("/{property}/{cluster}/{namespace}/{topic}/schema")
> > @POST @Path("/{property}/{cluster}/{namespace}/{topic}/schema")
> > ```
> >
> > [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/2)
> >
> > # Changes
> >
> > * Implement a Schema Repository in Pulsar brokers
> >   [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/1)
> > * Add Schema resources to broker admin API
> >   [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/2)
> > * Extend client/server binary protocol to expose schema to client
> >   [PR](https://github.com/apache/incubator-pulsar/pull/1112)
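
Regarding the schema GET endpoints quoted above and the access-control concern in point 1, here is a minimal sketch of the lookup behavior being asked for. It is plain Java with no Pulsar or JAX-RS dependencies. The class `SchemaLookupSketch`, its methods, and the per-topic reader set are all hypothetical names made up for illustration; they are not Pulsar's authorization model or admin API. The only point is that "topic has no schema" and "caller may not see this topic" take the same code path and yield the same status.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch only: not Pulsar's admin API or authorization code.
class SchemaLookupSketch {
    static final int OK = 200;
    static final int NOT_FOUND = 404;

    // topic -> users allowed to read that topic's schema (illustrative model)
    private final Map<String, Set<String>> readersByTopic = new HashMap<>();

    // Record that a topic has a schema and who may read it.
    void register(String topic, String... readers) {
        Set<String> allowed = new HashSet<>();
        for (String reader : readers) {
            allowed.add(reader);
        }
        readersByTopic.put(topic, allowed);
    }

    // Status a schema GET would return for this caller and topic.
    // A missing topic and an unauthorized caller are deliberately
    // indistinguishable: both get 404, never 401 or 403.
    int statusFor(String caller, String topic) {
        Set<String> allowed = readersByTopic.get(topic);
        if (allowed == null || !allowed.contains(caller)) {
            return NOT_FOUND;
        }
        return OK;
    }

    public static void main(String[] args) {
        SchemaLookupSketch registry = new SchemaLookupSketch();
        registry.register("orders", "alice");
        System.out.println(registry.statusFor("alice", "orders"));          // authorized reader
        System.out.println(registry.statusFor("mallory", "orders"));        // exists, caller lacks access
        System.out.println(registry.statusFor("mallory", "no-such-topic")); // does not exist
    }
}
```

The last two calls return the same status, so a probe cannot tell a private topic from a nonexistent one. A real implementation would also have to keep response bodies and timing from leaking the difference, which this sketch does not attempt.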