https://gist.github.com/mgodave/b265250a685f3574166ae617462ea4f9

-------

 * **Status**: Proposal
 * **Author**: Dave Rusek - Streamlio
 * **Pull Request**: See Below
 * **Mailing List discussion**:


## Motivation

Data flowing through a messaging system is typically untyped. It travels
end-to-end as bytes, and only the producers and consumers know its type and
structure. This forces systems to coordinate out-of-band and makes it hard for
other systems to discover useful data on which they can operate. Schema
registries help alleviate these problems by providing centralized storage for
the structural definitions of system data. With a centralized repository,
systems producing data can communicate the structure of that data to
downstream systems.

This document proposes a schema registry service tightly integrated with
Pulsar's topic hierarchy. Schema integration is an opt-in feature and will not
affect existing or future properties, clusters, namespaces, or topics that do
not choose to use it. If, however, an administrator chooses to use this
functionality, it will serve as a self-describing integrity check for data in
the system and allow integrations between Pulsar and other systems that can
discover and take advantage of this type information.

## Design

### Data Model

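```protobuf
// The `optional` labels below imply proto2 syntax.
syntax = "proto2";

message Schema {
    // Serialization format in which the schema definition is written.
    enum Format {
        AVRO = 0;
        JSON = 1;
        PROTOBUF = 2;
        THRIFT = 3;
    }

    // Lifecycle state of a schema entry.
    enum State {
        STAGED = 1;
        ACTIVE = 2;
    }

    optional string name = 1;
    optional int32 version = 2;
    optional Format format = 3;
    optional State state = 4;
    // User who last modified the schema, and when.
    optional string modified_user = 5;
    optional string modified_time = 6;
}
```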

### Storing Schema Data

Schema data will be stored alongside message data in BookKeeper. Much like a
managed ledger, schema entries will be stored as an append-only, ordered list
of entries. Schema entries occupy a BookKeeper ledger, and a topic with an
associated schema will require a ZooKeeper node. Topics without any associated
schema data will incur no overhead.
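
To make the append-only, ordered storage model concrete, here is a minimal
in-memory sketch in Java. It is not the BookKeeper-backed implementation from
the staged PR: the class `SchemaLogSketch`, its methods, and the choice to use
the 1-based position in the log as the schema version are all assumptions made
for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical in-memory stand-in for a per-topic schema ledger. */
final class SchemaLogSketch {

    /** One immutable entry in the log. */
    static final class Entry {
        final int version;
        final String format;      // e.g. "AVRO", mirroring Schema.Format
        final String definition;  // the schema text itself (assumed field)

        Entry(int version, String format, String definition) {
            this.version = version;
            this.format = format;
            this.definition = definition;
        }
    }

    // Entries are only ever appended, never updated or removed.
    private final List<Entry> entries = new ArrayList<>();

    /**
     * Appends a schema and returns its version. Assumption: the version is
     * the entry's 1-based position in the log.
     */
    synchronized int append(String format, String definition) {
        int version = entries.size() + 1;
        entries.add(new Entry(version, format, definition));
        return version;
    }

    /** Returns the most recently appended entry, if any. */
    synchronized Optional<Entry> latest() {
        if (entries.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(entries.get(entries.size() - 1));
    }

    /** Returns the entry for a given version, if it exists. */
    synchronized Optional<Entry> get(int version) {
        if (version < 1 || version > entries.size()) {
            return Optional.empty();
        }
        return Optional.of(entries.get(version - 1));
    }
}
```

Because entries are never rewritten, older versions stay readable after a new
schema is appended, which is the property the ledger-based design relies on.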

[Staged PR](https://github.com/mgodave/incubator-pulsar/pull/1)

### Serving Schema Data

Serving schemas from the Pulsar brokers would allow us to take advantage of
the topic ownership routing logic to co-locate a schema with its topic, as
well as ensure a single owner per schema ledger in the case of the Streamlio
schema registry. Such an arrangement would serve both reads and writes through
the same broker. This will require a new admin API to expose the schema data
model as a collection of REST resources.

```java
@GET @Path("/{property}/{cluster}/{namespace}/{topic}/schema")
@GET @Path("/{property}/{cluster}/{namespace}/{topic}/schema/{version}")
@DELETE @Path("/{property}/{cluster}/{namespace}/{topic}/schema")
@POST @Path("/{property}/{cluster}/{namespace}/{topic}/schema")
```
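
As a rough illustration of how a client might address these resources, the
sketch below builds the relative paths from the listing above. The helper
class `SchemaPaths` and its method names are hypothetical. The proposal does
not specify a base URL or any encoding rules, so the sketch assumes each
segment is already URL-safe.

```java
/** Hypothetical helper that builds the relative schema resource paths. */
final class SchemaPaths {
    private SchemaPaths() {}

    /** Path of a topic's schema resource (the GET, POST and DELETE routes). */
    static String forTopic(String property, String cluster,
                           String namespace, String topic) {
        // Assumption: segments are already URL-safe; no encoding is applied.
        return "/" + String.join("/", property, cluster, namespace, topic, "schema");
    }

    /** Path of one specific schema version (the versioned GET route). */
    static String forVersion(String property, String cluster,
                             String namespace, String topic, int version) {
        return forTopic(property, cluster, namespace, topic) + "/" + version;
    }
}
```

For example, `SchemaPaths.forVersion("my-property", "us-west", "my-ns", "my-topic", 2)`
yields `/my-property/us-west/my-ns/my-topic/schema/2`.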

[Staged PR](https://github.com/mgodave/incubator-pulsar/pull/2)

## Changes

* Implement a Schema Repository in Pulsar brokers [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/1)
* Add Schema resources to broker admin API [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/2)
* Extend client/server binary protocol to expose schema to client [PR](https://github.com/apache/incubator-pulsar/pull/1112)
