Re: [DISCUSS] PIP 14: Pulsar Schema Registry

Joe F Tue, 06 Feb 2018 05:30:07 -0800

The concept of a schema registry is good. I have some questions, concerns
and comments.

1) Access control
There may be a need for some topics to remain hidden from users that don't
have permissions on them. The existence of such a topic (or its
schema) should not be disclosed by discovery. That is, unauthorized
requests (probes) should return 404 (not found), not 401 (unauthorized)
or 403 (forbidden). Please ensure this in the implementation.
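
A minimal sketch of the behavior I am asking for, not the proposed
implementation. The class name, the in-memory ACL map and the role strings
are all hypothetical; the only point is that "no such topic" and "no
permission" take the same branch, so a probe cannot tell them apart.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: hide the existence of schemas from unauthorized callers. */
final class SchemaLookup {
    static final int OK = 200;
    static final int NOT_FOUND = 404;

    // topic -> roles allowed to see the topic and its schema (illustrative data only)
    private final Map<String, Set<String>> acl;

    SchemaLookup(Map<String, Set<String>> acl) {
        this.acl = acl;
    }

    /** Returns the HTTP status a schema lookup should produce for the given role. */
    int statusFor(String role, String topic) {
        Set<String> allowed = acl.get(topic);
        // A missing topic and an unauthorized caller are handled identically,
        // so the response never reveals whether the topic exists.
        if (allowed == null || !allowed.contains(role)) {
            return NOT_FOUND;
        }
        return OK;
    }
}
```

The response body and timing would also have to match in both cases for the
topic to be truly undiscoverable; the sketch only covers the status code.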

2) Isn't the Schema message definition (the meta schema) best left to each
particular installation? Shouldn't Pulsar define a Schema as just a few base
fields plus key-value pairs? I can think of many different fields in
addition to name, version, format, state and the modified_* fields. Every
time someone needs to add something to the meta schema, it will require a
protocol change. The list of optional fields in the current definition is
arbitrary. There is nothing particular about those fields that requires
them to be enumerated. And the optional nature of all those fields points
to this very same issue: the list is subjective and will be subject to all
sorts of additions and deletions. Isn't that list better implemented as
key-value properties? What is the rationale for this specific set of
fields?
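
To make the alternative concrete, here is a rough sketch of the shape I have
in mind, written as a plain Java value class rather than a protocol
definition. Treating name and version as the base fields is my assumption
for illustration; everything else (format, state, owner and so on) becomes
a free-form property that an installation can add without a protocol change.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: a small fixed core plus free-form key-value properties. */
final class SchemaInfo {
    private final String name;   // assumed base field
    private final int version;   // assumed base field
    private final Map<String, String> properties;

    SchemaInfo(String name, int version, Map<String, String> properties) {
        this.name = name;
        this.version = version;
        // Defensive copy so later changes to the caller's map do not leak in.
        this.properties = Collections.unmodifiableMap(new LinkedHashMap<>(properties));
    }

    String name() {
        return name;
    }

    int version() {
        return version;
    }

    /** Returns the property value, or the fallback if this installation did not set it. */
    String property(String key, String fallback) {
        return properties.getOrDefault(key, fallback);
    }
}
```

An installation that cares about, say, an owning team would set an "owner"
property; one that does not simply never sets it, and readers fall back to
a default.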

3) Use of ZooKeeper as a repo.

I speak from experience, as I run some of the largest Pulsar clusters in
existence. I have a significant disagreement with using ZooKeeper as the
meta repo for the Schema. Even if the actual Schema is stored in BookKeeper
ledgers, using a ZK node per Schema increases ZK load and reduces
the scalability of Pulsar. ZK nodes have a significant impact on Pulsar
scalability limits. And I mean the impact from the very existence of a ZK
node, not the reads/writes on that node.

I understand this feature is optional. But that does not solve the
underlying issue. Pulsar should be moving towards reducing ZK usage, not
increasing it. We should be working to reduce even the existing usage of
ZK, wherever it is possible.

We should not be building a feature which requires a tradeoff between
using that feature and scalability. I would like to use this feature,
but as it is, it is going to reduce the working limit of my clusters
by 15-20%. That is definitely not a good thing.

Joe

On Mon, Feb 5, 2018 at 11:04 AM, Sijie Guo <guosi...@gmail.com> wrote:

> +1 great to see this proposal coming out.
>
> - Sijie
>
> On Fri, Jan 26, 2018 at 12:57 PM, David Rusek <d...@streaml.io> wrote:
>
> > https://gist.github.com/mgodave/b265250a685f3574166ae617462ea4f9
> >
> > -------
> >
> >  * **Status**: Proposal
> >  * **Author**: Dave Rusek - Streamlio
> >  * **Pull Request**: See Below
> >  * **Mailing List discussion**:
> >
> >
> > ## Motivation
> >
> > Data flowing through a messaging system is typically untyped. Data flows
> > end-to-end as bytes, and only the producers and consumers are aware of
> > the type and structure of the data. This requires systems to coordinate
> > out-of-band and makes it hard for other systems to discover useful data
> > on which they can operate. Schema registries help to alleviate these
> > problems by providing a centralized storage area for structural
> > definitions of system data. With a centralized storage repository,
> > systems producing data can communicate the structure of that data to
> > downstream systems.
> >
> > This document is a proposal to build a schema registry service tightly
> > integrated with Pulsar's topic hierarchy. This schema integration is an
> > opt-in feature and will not affect existing or future properties,
> > clusters, namespaces, or topics that do not choose to take advantage of
> > it. If, however, an administrator chooses to use this functionality, then
> > it will serve as a self-describing integrity check for data in the
> > system, as well as allow integrations between Pulsar and other systems
> > that are able to discover and take advantage of this type information.
> >
> > ## Design
> >
> > ### Data Model
> >
> > ```protobuf
> > message Schema {
> >     enum Format {
> >         AVRO = 0;
> >         JSON = 1;
> >         PROTOBUF = 2;
> >         THRIFT = 3;
> >     }
> >
> >     enum State {
> >         STAGED = 1;
> >         ACTIVE = 2;
> >     }
> >
> >     optional string name = 1;
> >     optional int32 version = 2;
> >     optional Format format = 3;
> >     optional State state = 4;
> >     optional string modified_user = 5;
> >     optional string modified_time = 6;
> > }
> > ```
> >
> > ### Storing Schema Data
> >
> > Schema data will be stored alongside message data in BookKeeper. Much
> > like a managed ledger, schema entries will be stored as an append-only,
> > ordered list of entries. Schema entries occupy a BookKeeper ledger, and a
> > topic with an associated schema will require a ZooKeeper node. Topics
> > without any associated schema data will incur no overhead.
> >
> > [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/1)
> >
> > ### Serving Schema Data
> >
> > Serving schemas from the Pulsar brokers would allow us to take advantage
> > of the topic ownership routing logic to co-locate a schema with its
> > topic, as well as ensure a single owner per schema ledger in the case of
> > the Streamlio schema registry. Such an arrangement would serve both reads
> > and writes through the same broker. This will require a new admin API to
> > expose the schema data model as a collection of REST resources.
> >
> > ```java
> > @GET @Path("/{property}/{cluster}/{namespace}/{topic}/schema")
> > @GET @Path("/{property}/{cluster}/{namespace}/{topic}/schema/{version}")
> > @DELETE @Path("/{property}/{cluster}/{namespace}/{topic}/schema")
> > @POST @Path("/{property}/{cluster}/{namespace}/{topic}/schema")
> > ```
> >
> > [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/2)
> >
> > # Changes
> >
> > * Implement a Schema Repository in Pulsar brokers
> >   [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/1)
> > * Add Schema resources to broker admin API
> >   [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/2)
> > * Extend client/server binary protocol to expose schema to client
> >   [PR](https://github.com/apache/incubator-pulsar/pull/1112)
> >
>
