Re: [DISCUSS] PIP 14: Pulsar Schema Registry

Sahaya Andrews Tue, 06 Feb 2018 09:54:12 -0800

It's not clear from the proposal whether a schema is enforced per
namespace or whether each individual topic has its own schema.


Regarding the use of ZooKeeper, I also agree with Joe. We should avoid
using ZooKeeper to store such information since it limits our ability
to scale beyond a point.

We could create a namespace-specific ledger to store such data.
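
As a rough sketch of the idea (this is not Pulsar or BookKeeper code; the class
`NamespaceSchemaLog`, its methods, and the in-memory list standing in for a
ledger are all hypothetical), schema revisions for every topic in a namespace
could be appended to a single ordered log for that namespace:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical in-memory stand-in for "one append-only schema ledger per namespace".
// A real implementation would append to a BookKeeper ledger instead of a List.
final class NamespaceSchemaLog {

    // One immutable entry per schema revision of a topic.
    static final class Entry {
        final String topic;
        final int version;
        final String definition;

        Entry(String topic, int version, String definition) {
            this.topic = topic;
            this.version = version;
            this.definition = definition;
        }
    }

    // Namespace name -> ordered, append-only list of schema entries.
    private final Map<String, List<Entry>> ledgers = new HashMap<>();

    // Appends a new revision for the topic and returns its version (1-based per topic).
    synchronized int append(String namespace, String topic, String definition) {
        List<Entry> ledger = ledgers.computeIfAbsent(namespace, ns -> new ArrayList<>());
        int version = 1;
        for (Entry e : ledger) {
            if (e.topic.equals(topic)) {
                version = e.version + 1; // later entries always carry higher versions
            }
        }
        ledger.add(new Entry(topic, version, definition));
        return version;
    }

    // The latest schema of a topic is the last matching entry in the namespace log.
    synchronized Optional<Entry> latest(String namespace, String topic) {
        List<Entry> ledger = ledgers.get(namespace);
        if (ledger == null) {
            return Optional.empty();
        }
        for (int i = ledger.size() - 1; i >= 0; i--) {
            if (ledger.get(i).topic.equals(topic)) {
                return Optional.of(ledger.get(i));
            }
        }
        return Optional.empty();
    }

    // Number of logs in use: grows with namespaces, not with topics.
    synchronized int ledgerCount() {
        return ledgers.size();
    }
}
```

In this sketch the number of logs grows with the number of namespaces that
use schemas rather than with the number of topics, at the cost of scanning
the namespace log to find a topic's latest schema.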

Andrews.

On Tue, Feb 6, 2018 at 5:29 AM, Joe F <j...@apache.org> wrote:
> The concept of a schema registry is good.  I have some questions, concerns
> and comments
>
> 1) Access control
> There may be a need for some topics to remain hidden from users that don't
> have permissions on that topic.  The existence of such topics (or their
> schemas) should not be disclosed by discovery.  That is, unauthorized
> requests (probes) should return 404 (does not exist) and not 401
> (unauthorized) or 403 (forbidden). Please ensure this in the implementation.
>
>
> 2) Isn't the Schema message definition (the meta Schema) best left to each
> particular installation? Shouldn't Pulsar define a Schema as just base
> fields and key-value pairs? I mean, I can think of many different fields in
> addition to name, version, format, state and mods. Every time someone needs
> to add something to the meta Schema, it will require a protocol change.
> The list of optional fields in the current definition is arbitrary. There
> is nothing particular about those fields that requires that they be
> enumerated. And the optional nature of all those fields indicates this very
> same issue: that this list of fields is very subjective and will be
> subject to all sorts of additions and deletions. Isn't that list better
> implemented as key-value properties?  What is the rationale for this
> specific set of fields?
>
> 3) Use of ZooKeeper as a repo.
>
> I speak from experience, as I run some of the largest Pulsar clusters in
> existence. I have a significant disagreement with using ZooKeeper as the
> meta repo for the Schema. Even if the actual Schema is stored in BookKeeper
> ledgers, using a ZK node for Schema is an increase in ZK load, and reduces
> the scalability of Pulsar. ZK nodes have a significant impact on Pulsar
> scalability limits. And I mean the impact from the very existence of a ZK
> node, not the reads/writes on that node.
>
> I understand this feature is optional. But that does not solve the
> underlying issue. Pulsar should be moving towards reducing ZK usage, not
> increasing it. We should be working to reduce even the existing usage of
> ZK, wherever it is possible.
>
> We should not be building a feature which requires a tradeoff between
> using that feature and scalability.  I would like to use this feature,
> but as it is, it is going to reduce the working limit of my clusters
> by 15-20%.  That is definitely not a good thing.
>
> Joe
>
>
> On Mon, Feb 5, 2018 at 11:04 AM, Sijie Guo <guosi...@gmail.com> wrote:
>
>> +1 great to see this proposal coming out.
>>
>> - Sijie
>>
>> On Fri, Jan 26, 2018 at 12:57 PM, David Rusek <d...@streaml.io> wrote:
>>
>> > https://gist.github.com/mgodave/b265250a685f3574166ae617462ea4f9
>> >
>> > -------
>> >
>> >  * **Status**: Proposal
>> >  * **Author**: Dave Rusek - Streamlio
>> >  * **Pull Request**: See Below
>> >  * **Mailing List discussion**:
>> >
>> >
>> > ## Motivation
>> >
>> > Data flowing through a messaging system is typically untyped. Data flows
>> > from end-to-end as bytes and only the producers and consumers are aware
>> > of the type and structure of the data. This requires systems to
>> > coordinate out-of-band and makes it hard for other systems to discover
>> > useful data on which they can operate. Schema registries help to
>> > alleviate these problems by providing a centralized storage area for
>> > structural definitions of system data. By having a centralized storage
>> > repository, systems producing data to the system can communicate to
>> > downstream systems the structure of the data being produced.
>> >
>> > This document is a proposal to build a schema registry service tightly
>> > integrated with Pulsar's topic hierarchy. This schema integration is an
>> > opt-in feature and will not affect existing or future properties,
>> > clusters, namespaces, or topics that do not choose to take advantage of
>> > it. If, however, an administrator chooses to use this functionality, then
>> > it will serve as a self-describing integrity check for data in the system
>> > as well as allow integrations between Pulsar and other systems that are
>> > able to discover and take advantage of this type information.
>> >
>> > ## Design
>> >
>> > ### Data Model
>> >
>> > ```protobuf
>> > message Schema {
>> >     enum Format {
>> >         AVRO = 0;
>> >         JSON = 1;
>> >         PROTOBUF = 2;
>> >         THRIFT = 3;
>> >     }
>> >
>> >     enum State {
>> >         STAGED = 1;
>> >         ACTIVE = 2;
>> >     }
>> >
>> >     optional string name = 1;
>> >     optional int32 version = 2;
>> >     optional Format format = 3;
>> >     optional State state = 4;
>> >     optional string modified_user = 5;
>> >     optional string modified_time = 6;
>> > }
>> > ```
>> >
>> > ### Storing Schema Data
>> >
>> > Schema data will be stored alongside message data in BookKeeper. Much
>> > like a managed ledger, schema entries will be stored as an append-only,
>> > ordered list of entries. Schema entries occupy a BookKeeper ledger, and a
>> > topic with an associated schema will require a ZooKeeper node. Topics
>> > without any associated schema data will incur no overhead.
>> >
>> > [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/1)
>> >
>> > ### Serving Schema Data
>> >
>> > Serving schemas from the Pulsar brokers would allow us to take advantage
>> > of the topic ownership routing logic to co-locate a schema with its topic
>> > as well as ensure a single owner per schema ledger in the case of the
>> > Streamlio schema registry. Such an arrangement would serve both reads and
>> > writes through the same broker. This will require a new admin API to
>> > expose the schema data model as a collection of REST resources.
>> >
>> > ```java
>> > @GET @Path("/{property}/{cluster}/{namespace}/{topic}/schema")
>> > @GET @Path("/{property}/{cluster}/{namespace}/{topic}/schema/{version}")
>> > @DELETE @Path("/{property}/{cluster}/{namespace}/{topic}/schema")
>> > @POST @Path("/{property}/{cluster}/{namespace}/{topic}/schema")
>> > ```
>> >
>> > [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/2)
>> >
>> > ## Changes
>> >
>> > * Implement a Schema Repository in Pulsar brokers [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/1)
>> > * Add Schema resources to broker admin API [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/2)
>> > * Extend client/server binary protocol to expose schema to client [PR](https://github.com/apache/incubator-pulsar/pull/1112)
>> >
>>
