It's not clear from the proposal whether a schema is enforced per namespace or whether each individual topic has its own schema.
Regarding the use of ZooKeeper, I also agree with Joe. We should avoid using ZooKeeper to store such information, since it limits our ability to scale beyond a point. We could create a namespace-specific ledger to store such data.

Andrews.

On Tue, Feb 6, 2018 at 5:29 AM, Joe F <j...@apache.org> wrote:

> The concept of a schema registry is good. I have some questions, concerns
> and comments
>
> 1) Access control
> There may be a need for some topics to remain private to users that don't
> have permissions on that topic. The existence of such topics (or their
> schemas) should not be disclosed by discoveries. That is, unauthorized
> requests (probes) should return 404 (does not exist) and not 401
> (forbidden). Please ensure this in the implementation.
>
>
> 2) Isn't the Schema message definition (the meta Schema) best left to each
> particular installation? Shouldn't Pulsar define just a Schema as base
> fields and key value pairs? I mean, I can think of many different fields in
> addition to name, version, format, state and mods. Every time someone needs
> to add something to the meta Schema, it will require a protocol change.
> The list of optional fields in the current definition is arbitrary. There
> is nothing particular about those fields that requires that they be
> enumerated. And the optional nature of all those fields indicates this very
> same issue - that this list of fields is very subjective and will be
> subject to all sorts of additions and deletions. Isn't that list better
> implemented as key value properties? What is the rationale for this
> specific set of fields?
>
> 3) Use of Zookeeper as a repo.
>
> I speak from experience, as I run some of the largest Pulsar clusters in
> existence. I have a significant disagreement with using Zookeeper as the
> meta repo for the Schema. Even if the actual Schema is stored in Bookkeeper
> ledgers, using a ZK node for Schema is an increase in ZK load, and reduces
> the scalability of Pulsar.
> ZK nodes have a significant impact on Pulsar scalability limits. And I
> mean the impact from the very existence of a ZK node, not the read/writes
> on that node.
>
> I understand this feature is optional. But that does not solve the
> underlying issue. Pulsar should be moving towards reducing ZK usage, not
> increasing it. We should be working to reduce even the existing usage of
> ZK, wherever it is possible.
>
> We should not be building a feature which requires a tradeoff between
> using that feature and scalability. I would like to use this feature,
> but as it is, it is going to reduce the working limit of my clusters
> by 15-20%. That is definitely not a good thing.
>
> Joe
>
>
> On Mon, Feb 5, 2018 at 11:04 AM, Sijie Guo <guosi...@gmail.com> wrote:
>
>> +1 great to see this proposal coming out.
>>
>> - Sijie
>>
>> On Fri, Jan 26, 2018 at 12:57 PM, David Rusek <d...@streaml.io> wrote:
>>
>> > https://gist.github.com/mgodave/b265250a685f3574166ae617462ea4f9
>> >
>> > -------
>> >
>> > * **Status**: Proposal
>> > * **Author**: Dave Rusek - Streamlio
>> > * **Pull Request**: See Below
>> > * **Mailing List discussion**:
>> >
>> >
>> > ## Motivation
>> >
>> > Data flowing through a messaging system is typically untyped. Data flows
>> > from end-to-end as bytes and only the producers and consumers are aware
>> > of the type and structure of the data. This requires systems to
>> > coordinate out-of-band and makes it hard for other systems to discover
>> > useful data on which they can operate. Schema registries help to
>> > alleviate these problems by providing a centralized storage area for
>> > structural definitions of system data. By having a centralized storage
>> > repository, systems producing data to the system can communicate to
>> > downstream systems the structure of the data being produced.
>> >
>> > This document is a proposal to build a schema registry service tightly
>> > integrated with Pulsar's topic hierarchy.
>> > This schema integration is an opt-in feature and will not affect
>> > existing or future properties, clusters, namespaces, or topics that do
>> > not choose to take advantage of it. If, however, an administrator
>> > chooses to use this functionality, then it will serve as a
>> > self-describing integrity check for data in the system as well as allow
>> > integrations between Pulsar and other systems that are able to discover
>> > and take advantage of this type information.
>> >
>> > ## Design
>> >
>> > ### Data Model
>> >
>> > ```protobuf
>> > message Schema {
>> >   enum Format {
>> >     AVRO = 0;
>> >     JSON = 1;
>> >     PROTOBUF = 2;
>> >     THRIFT = 3;
>> >   }
>> >
>> >   enum State {
>> >     STAGED = 1;
>> >     ACTIVE = 2;
>> >   }
>> >
>> >   optional string name = 1;
>> >   optional int32 version = 2;
>> >   optional Format format = 3;
>> >   optional State state = 4;
>> >   optional string modified_user = 5;
>> >   optional string modified_time = 6;
>> > }
>> > ```
>> >
>> > ### Storing Schema Data
>> >
>> > Schema data will be stored alongside message data in BookKeeper. Much
>> > like a managed ledger, schema entries will be stored as an append-only,
>> > ordered list of entries. Schema entries occupy a BookKeeper ledger, and
>> > a topic with an associated schema will require a ZooKeeper node. Topics
>> > without any associated schema data will incur no overhead.
>> >
>> > [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/1)
>> >
>> > ### Serving Schema Data
>> >
>> > Serving schemas from the Pulsar brokers would allow us to take advantage
>> > of the topic ownership routing logic to co-locate a schema with its
>> > topic as well as ensure a single owner per schema ledger in the case of
>> > the Streamlio schema registry. Such an arrangement would serve both
>> > reads and writes through the same broker. This will require a new admin
>> > API to expose the schema data model as a collection of REST resources.
>> >
>> > ```java
>> > @GET @Path("/{property}/{cluster}/{namespace}/{topic}/schema")
>> > @GET @Path("/{property}/{cluster}/{namespace}/{topic}/schema/{version}")
>> > @DELETE @Path("/{property}/{cluster}/{namespace}/{topic}/schema")
>> > @POST @Path("/{property}/{cluster}/{namespace}/{topic}/schema")
>> > ```
>> >
>> > [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/2)
>> >
>> > # Changes
>> >
>> > * Implement a Schema Repository in Pulsar brokers
>> >   [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/1)
>> > * Add Schema resources to broker admin API
>> >   [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/2)
>> > * Extend client/server binary protocol to expose schema to client
>> >   [PR](https://github.com/apache/incubator-pulsar/pull/1112)
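
For reference, the four admin resources in the quoted proposal share one path template, with an optional version segment. Here is a minimal Java sketch of how a caller might assemble those paths. `SchemaPaths` and its two methods are hypothetical names made up for illustration; they are not part of Pulsar or of the staged PRs. The sketch also assumes the path segments need no URL encoding.

```java
import java.util.StringJoiner;

// Hypothetical helper (not a Pulsar API): builds the resource paths listed
// in the proposal's "Serving Schema Data" section.
final class SchemaPaths {
    private SchemaPaths() {}

    // /{property}/{cluster}/{namespace}/{topic}/schema
    // Used by the GET (latest), DELETE and POST resources.
    static String schema(String property, String cluster,
                         String namespace, String topic) {
        // Assumption: segments are plain names and need no URL encoding.
        return new StringJoiner("/", "/", "/schema")
                .add(property)
                .add(cluster)
                .add(namespace)
                .add(topic)
                .toString();
    }

    // /{property}/{cluster}/{namespace}/{topic}/schema/{version}
    // Used by the GET resource for one specific version.
    static String schemaVersion(String property, String cluster,
                                String namespace, String topic, int version) {
        return schema(property, cluster, namespace, topic) + "/" + version;
    }

    public static void main(String[] args) {
        // Example values are placeholders, not real Pulsar resources.
        System.out.println(schema("my-property", "us-west", "my-ns", "my-topic"));
        System.out.println(schemaVersion("my-property", "us-west", "my-ns", "my-topic", 3));
    }
}
```

Since the topic is part of every path, the API as written addresses a schema per topic; whether enforcement can also be configured per namespace is still the open question above.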