[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352355#comment-15352355 ]

Ryan Blue commented on AVRO-1704:
---------------------------------

[~nielsbasjes], sorry it's taken so long for me to get back to you on this.

On the spec:
* I think we should go with the header: 0xC3 0x01. The first byte makes it 
easily recognizable, as you suggest, and meets my requirement of minimizing the 
number of non-Avro payloads that match. Using 0x01 makes it easy to see the 
version and will prevent programs from confusing payloads with text, as Doug 
suggests. (There's a framing sketch just after this list.)
* I don't see much value in reserving space in the second byte. I don't think 
there will be many formats for serializing Avro payloads and I don't think we 
will have problems with collision.
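
To make that framing concrete, here's a minimal sketch of what an encoder 
would put in front of the datum bytes. The 8-byte Rabin fingerprint and its 
little-endian byte order are my assumptions here, not settled spec:

{code:lang=java}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch only: frame a binary-encoded datum with the proposed 2-byte
// header and an 8-byte schema fingerprint.
class HeaderSketch {
  static final byte MAGIC = (byte) 0xC3; // recognizable marker byte
  static final byte VERSION = 0x01;      // version; 0xC3 0x01 is not valid UTF-8

  static byte[] frame(long fingerprint, byte[] encodedDatum) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(MAGIC);
    out.write(VERSION);
    // assumption: the fingerprint is written little-endian after the header
    out.write(ByteBuffer.allocate(8)
        .order(ByteOrder.LITTLE_ENDIAN)
        .putLong(fingerprint)
        .array());
    out.write(encodedDatum);
    return out.toByteArray();
  }
}
{code}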

I've had a look at your patch and there's a lot in there: an update to the 
spec, an implementation, an XOR demo, changes to Schema hashing, support for 
the Specific data model, and static default classes. I think it would be 
easier to get this in if we break the work up into separate patches, pull 
requests, or issues.

I also think we should simplify the API a bit. I'd like to keep it small and 
grow it as we need to, to keep maintenance and compatibility simple. For 
example, SchemaStorage has open and close methods that are only used in a 
test. I'd rather not add life-cycle methods like those unless the life-cycle 
of a SchemaStorage needs to be managed by Avro. To that end, I propose the 
following API:

{code:lang=java}
interface SchemaStore {
  Schema findByFingerprint(long fingerprint);
}
{code}
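
A SchemaStore implementation can be as simple as a map from fingerprint to 
schema. Here's a possible in-memory version, assuming the fingerprints are the 
64-bit Rabin parsing fingerprints Avro already exposes through 
SchemaNormalization:

{code:lang=java}
import java.util.HashMap;
import java.util.Map;
import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;

// Possible in-memory store: register known schemas up front, look them
// up by 64-bit Rabin fingerprint when decoding.
class MapSchemaStore implements SchemaStore {
  private final Map<Long, Schema> byFingerprint = new HashMap<>();

  void add(Schema schema) {
    byFingerprint.put(SchemaNormalization.parsingFingerprint64(schema), schema);
  }

  @Override
  public Schema findByFingerprint(long fingerprint) {
    return byFingerprint.get(fingerprint); // null if the schema is unknown
  }
}
{code}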

I also think that the message API should be focused around a datum and a buffer 
or stream. The data model (GenericData instance) and other things can be passed 
in to create it and then reused for efficiency. I've actually implemented this 
already for a project that stores Avro-encoded payloads in Parquet so I've 
[adapted that implementation|https://github.com/apache/avro/pull/103] to look 
up fingerprints from a SchemaStore. The API is split into encoder and decoder 
sides because they have separate concerns: for the encoder, how to manage 
buffers; for the decoder, how to resolve schemas and reuse datums.

{code:lang=java}
interface DatumEncoder<D> {
  // constructed with (GenericData model, Schema schema, boolean copyBuffer)
  ByteBuffer encode(D datum); // if copyBuffer was true, this is a new buffer
  void encode(D datum, OutputStream stream);
}

interface DatumDecoder<D> {
  // constructed with (GenericData model, Schema schema, SchemaStore store)
  D decode(ByteBuffer buffer);
  D decode(ByteBuffer buffer, D reuseDatum);
  D decode(InputStream stream);
  D decode(InputStream stream, D reuseDatum);
}
{code}
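
For illustration, a round trip through the "binary" pair might look like the 
following. The BinaryDatumEncoder and BinaryDatumDecoder class names are 
placeholders for whatever the implementations end up being called, not names 
from the branch:

{code:lang=java}
import java.nio.ByteBuffer;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

// Placeholder round trip: encode a record with the header and fingerprint,
// then decode it by resolving the writer schema through the store.
class RoundTripSketch {
  static GenericRecord roundTrip(Schema schema, GenericRecord record,
                                 SchemaStore store) {
    GenericData model = GenericData.get();

    DatumEncoder<GenericRecord> encoder =
        new BinaryDatumEncoder<>(model, schema, true /* copyBuffer */);
    ByteBuffer payload = encoder.encode(record);

    DatumDecoder<GenericRecord> decoder =
        new BinaryDatumDecoder<>(model, schema, store);
    return decoder.decode(payload);
  }
}
{code}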

My branch is broken into a few commits. The first two are bug fixes; the 
third is [the DatumEncoder implementation, 
d91b905|https://github.com/apache/avro/pull/103/commits/d91b90544f4486a72da8d3ff5b81dfc3c79d7c2f] 
and the fourth is [support for the Specific data model, 
7fa75aa|https://github.com/apache/avro/pull/103/commits/7fa75aab405c6460077d7cc7e403c664cce84431], 
based on your patch.

I'd like to hear what you think of the DatumEncoder API in that branch. It 
implements a few things that I think we'll need, like datum reuse, and it 
reuses encoders, DatumWriters, and buffers. It implements two encoder/decoder 
pairs: "raw", which is just the datum bytes, and "binary", which adds the 
header and schema lookup. It definitely needs some improvements, like more 
thorough tests and better naming, such as Doug's suggestion to use "message".
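
As a sketch of how the "binary" decoder could tell whether a payload carries 
the header, a hypothetical check (not code from the branch) might be:

{code:lang=java}
import java.nio.ByteBuffer;

// Hypothetical guard for the "binary" decoder: verify the 2-byte marker
// without consuming the buffer, before reading the 8-byte fingerprint.
class HeaderCheck {
  static boolean hasHeader(ByteBuffer buffer) {
    return buffer.remaining() >= 10  // 2-byte header + 8-byte fingerprint
        && buffer.get(buffer.position()) == (byte) 0xC3
        && buffer.get(buffer.position() + 1) == (byte) 0x01;
  }
}
{code}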

> Standardized format for encoding messages with Avro
> ---------------------------------------------------
>
>                 Key: AVRO-1704
>                 URL: https://issues.apache.org/jira/browse/AVRO-1704
>             Project: Avro
>          Issue Type: Improvement
>            Reporter: Daniel Schierbeck
>            Assignee: Niels Basjes
>         Attachments: AVRO-1704-2016-05-03-Unfinished.patch, 
> AVRO-1704-20160410.patch
>
>
> I'm currently using the Datafile format for encoding messages that are 
> written to Kafka and Cassandra. This seems rather wasteful:
> 1. I only encode a single record at a time, so there's no need for sync 
> markers and other metadata related to multi-record files.
> 2. The entire schema is inlined every time.
> However, the Datafile format is the only one that has been standardized, 
> meaning that I can read and write data with minimal effort across the various 
> languages in use in my organization. If there was a standardized format for 
> encoding single values that was optimized for out-of-band schema transfer, I 
> would much rather use that.
> I think the necessary pieces of the format would be:
> 1. A format version number.
> 2. A schema fingerprint type identifier, i.e. Rabin, MD5, SHA256, etc.
> 3. The actual schema fingerprint (according to the type.)
> 4. Optional metadata map.
> 5. The encoded datum.
> The language libraries would implement a MessageWriter that would encode 
> datums in this format, as well as a MessageReader that, given a SchemaStore, 
> would be able to decode datums. The reader would decode the fingerprint and 
> ask its SchemaStore to return the corresponding writer's schema.
> The idea is that SchemaStore would be an abstract interface that allowed 
> library users to inject custom backends. A simple, file system based one 
> could be provided out of the box.


