+1 great to see this proposal coming out.

- Sijie

On Fri, Jan 26, 2018 at 12:57 PM, David Rusek <d...@streaml.io> wrote:

> https://gist.github.com/mgodave/b265250a685f3574166ae617462ea4f9
>
> -------
>
>  * **Status**: Proposal
>  * **Author**: Dave Rusek - Streamlio
>  * **Pull Request**: See Below
>  * **Mailing List discussion**:
>
>
> ## Motivation
>
> Data flowing through a messaging system is typically untyped. Data flows
> end to end as bytes, and only the producers and consumers are aware of the
> type and structure of the data. This requires systems to coordinate
> out-of-band and makes it hard for other systems to discover useful data on
> which they can operate. Schema registries help to alleviate these problems
> by providing a centralized storage area for structural definitions of
> system data. With a centralized repository, systems producing data can
> communicate the structure of that data to downstream systems.
>
> This document is a proposal to build a schema registry service tightly
> integrated with Pulsar's topic hierarchy. This schema integration is an
> opt-in feature and will not affect existing or future properties,
> clusters, namespaces, or topics that do not choose to take advantage of it.
> If, however, an administrator chooses to use this functionality, it will
> serve as a self-describing integrity check for data in the system and will
> allow integrations between Pulsar and other systems that are able to
> discover and take advantage of this type information.
>
> ## Design
>
> ### Data Model
>
> ```protobuf
> message Schema {
>     enum Format {
>         AVRO = 0;
>         JSON = 1;
>         PROTOBUF = 2;
>         THRIFT = 3;
>     }
>
>     enum State {
>         STAGED = 1;
>         ACTIVE = 2;
>     }
>
>     optional string name = 1;
>     optional int32 version = 2;
>     optional Format format = 3;
>     optional State state = 4;
>     optional string modified_user = 5;
>     optional string modified_time = 6;
> }
> ```
>
> ### Storing Schema Data
>
> Schema data will be stored alongside message data in BookKeeper. Much like
> a managed ledger, schema entries will be stored as an append-only, ordered
> list of entries. Schema entries occupy a BookKeeper ledger, and a topic
> with an associated schema will require a ZooKeeper node. Topics without any
> associated schema data will incur no overhead.
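
To make the append-only, ordered storage idea concrete, here is a minimal in-memory sketch. It is not taken from the staged PR: the class name `SchemaLog` and its methods are hypothetical, the list stands in for a BookKeeper ledger, and deriving the version from the entry's position is an assumption rather than something the proposal states.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical in-memory stand-in for a per-topic schema ledger.
// A real implementation would append entries to a BookKeeper ledger.
final class SchemaLog {
    // Ordered, append-only entries. Assumption: version = index + 1.
    private final List<String> entries = new ArrayList<>();

    // Appends a schema definition and returns its 1-based version.
    synchronized int append(String definition) {
        entries.add(definition);
        return entries.size();
    }

    // Returns the definition stored under the given version, if any.
    synchronized Optional<String> get(int version) {
        if (version < 1 || version > entries.size()) {
            return Optional.empty();
        }
        return Optional.of(entries.get(version - 1));
    }

    // Returns the most recently appended definition, if any.
    synchronized Optional<String> latest() {
        return get(entries.size());
    }
}
```

Entries are never rewritten or removed in this sketch, so an older version stays readable after a newer one is appended, and a topic that never creates a `SchemaLog` pays nothing.
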
>
> [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/1)
>
> ### Serving Schema Data
>
> Serving schemas from the Pulsar brokers would allow us to take advantage of
> the topic ownership routing logic to co-locate a schema with its topic, as
> well as ensure a single owner per schema ledger in the case of the
> Streamlio schema registry. Such an arrangement would serve both reads and
> writes through the same broker. This will require a new admin API to expose
> the schema data model as a collection of REST resources.
>
> ```java
> @GET @Path("/{property}/{cluster}/{namespace}/{topic}/schema")
> @GET @Path("/{property}/{cluster}/{namespace}/{topic}/schema/{version}")
> @DELETE @Path("/{property}/{cluster}/{namespace}/{topic}/schema")
> @POST @Path("/{property}/{cluster}/{namespace}/{topic}/schema")
> ```
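
As a small illustration of how a client might address these resources, the sketch below builds the relative paths from the four topic coordinates. The helper `SchemaPaths` is hypothetical and not part of any Pulsar API; it assumes the path segments need no URL encoding and leaves out any admin base URL or prefix, since the proposal does not specify one.

```java
// Hypothetical helper that builds the relative schema resource paths
// listed in the proposal. Assumes segments need no URL encoding.
final class SchemaPaths {
    private SchemaPaths() {}

    // Path of a topic's schema resource (used with GET, POST, and DELETE).
    static String schema(String property, String cluster,
                         String namespace, String topic) {
        // The leading empty element yields the leading "/".
        return String.join("/", "", property, cluster, namespace, topic, "schema");
    }

    // Path of one specific schema version (used with GET).
    static String schemaVersion(String property, String cluster,
                                String namespace, String topic, int version) {
        return schema(property, cluster, namespace, topic) + "/" + version;
    }
}
```

For example, `SchemaPaths.schemaVersion("p", "c", "ns", "t", 3)` yields `/p/c/ns/t/schema/3`.
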
>
> [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/2)
>
> ## Changes
>
> * Implement a Schema Repository in Pulsar brokers [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/1)
> * Add Schema resources to broker admin API [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/2)
> * Extend client/server binary protocol to expose schema to client [PR](https://github.com/apache/incubator-pulsar/pull/1112)
>
