[ 
https://issues.apache.org/jira/browse/AVRO-27?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Massie updated AVRO-27:
----------------------------

    Description: 
Object Container Files could use a 1 byte sync marker (set to zero) using 
zig-zag and COBS encoding within blocks to efficiently escape zeros from the 
record data.

h4. Zig-Zag encoding

With zig-zag encoding only the value of 0 (zero) gets encoded into a value with 
a single zero byte.  This property means that we can write any non-zero zig-zag 
long inside a block within concern for creating an unintentional sync byte. 

h4. COBS encoding

We'll use COBS encoding to ensure that all zeros are escaped inside the block 
payload.  You can read http://www.sigcomm.org/sigcomm97/papers/p062.pdf for the 
details about COBS encoding.

h1. Block Format

All blocks start and end with a sync byte (set to zero) with a 
type-length-value format internally as follows:

|| name || format || length in bytes || value || meaning ||
| sync | byte | 1 | always 0 (zero) | The sync byte serves as a clear marker 
for the start of a block |
| type | zig-zag long | variable | must be non-zero | The type field expresses 
whether the block is for _metadata_ or _normal_ data. |
| length | zig-zag long | variable | must be non-zero | The length field 
expresses the number of bytes until the next record (including the cobs code 
and sync byte).  Useful for skipping ahead to the next block. |
| cobs_code | byte | 1 | see COBS code table below | Used in escaping zeros 
from the block payload |
| payload | cobs-encoded | Greater than or equal to zero | all non-zero bytes | 
The payload of the block |
| sync | byte | 1 | always 0 (zero) | The sync byte serves as a clear marker 
for the end of the block |

h2. COBS code table 

|| Code || Followed by || Meaning | 
| 0x00 | (not applicable) | (not allowed ) |
| 0x01 | nothing | Empty payload followed by the closing sync byte |
| 0x02 | one data byte | The single data byte, followed by the closing sync 
byte | 
| 0x03 | two data bytes | The pair of data bytes, followed by the closing sync 
byte |
| 0x04 | three data bytes | The three data bytes, followed by the closing sync 
byte |
| n | (n-1) data bytes | The (n-1) data bytes, followed by the closing sync 
byte |
| 0xFD | 252 data bytes | The 252 data bytes, followed by the closing sync byte 
|
| 0xFE | 253 data bytes | The 253 data bytes, followed by the closing sync byte 
|
| 0xFF | 254 data bytes | The 254 data bytes *not* followed by a zero. |
(taken from http://www.sigcomm.org/sigcomm97/papers/p062.pdf)

h1. Encoding

Only the block writer needs to perform byte-by-byte processing to encode the 
block.  The overhead for COBS encoding is very small in terms of the in-memory 
state required.

h1. Decoding

Block readers are not required to do as much byte-by-byte processing as a 
writer.  The reader could (for example) find a _metadata_ block by doing the 
following:

# Search for a zero byte in the file which marks the start of a record
# Read and zig-zag decode the _type_ of the block
#* If the block is _normal_ data, read the _length_, seek ahead to the next 
block and goto step #2 again
#* If the block is a _metadata_ block, cobs decode the data

  was:We could use a zero byte delimited record boundaries for Avro using UTF8, 
zig-zag longs and COBS encoding to efficiently escape zeros from the record data

        Summary: Consistent Overhead Byte Stuffing (COBS) encoded block format 
for Object Container Files  (was: Zero byte delimited container)

> Consistent Overhead Byte Stuffing (COBS) encoded block format for Object 
> Container Files
> ----------------------------------------------------------------------------------------
>
>                 Key: AVRO-27
>                 URL: https://issues.apache.org/jira/browse/AVRO-27
>             Project: Avro
>          Issue Type: New Feature
>          Components: spec
>            Reporter: Matt Massie
>
> Object Container Files could use a 1 byte sync marker (set to zero) using 
> zig-zag and COBS encoding within blocks to efficiently escape zeros from the 
> record data.
> h4. Zig-Zag encoding
> With zig-zag encoding only the value of 0 (zero) gets encoded into a value 
> with a single zero byte.  This property means that we can write any non-zero 
> zig-zag long inside a block within concern for creating an unintentional sync 
> byte. 
> h4. COBS encoding
> We'll use COBS encoding to ensure that all zeros are escaped inside the block 
> payload.  You can read http://www.sigcomm.org/sigcomm97/papers/p062.pdf for 
> the details about COBS encoding.
> h1. Block Format
> All blocks start and end with a sync byte (set to zero) with a 
> type-length-value format internally as follows:
> || name || format || length in bytes || value || meaning ||
> | sync | byte | 1 | always 0 (zero) | The sync byte serves as a clear marker 
> for the start of a block |
> | type | zig-zag long | variable | must be non-zero | The type field 
> expresses whether the block is for _metadata_ or _normal_ data. |
> | length | zig-zag long | variable | must be non-zero | The length field 
> expresses the number of bytes until the next record (including the cobs code 
> and sync byte).  Useful for skipping ahead to the next block. |
> | cobs_code | byte | 1 | see COBS code table below | Used in escaping zeros 
> from the block payload |
> | payload | cobs-encoded | Greater than or equal to zero | all non-zero bytes 
> | The payload of the block |
> | sync | byte | 1 | always 0 (zero) | The sync byte serves as a clear marker 
> for the end of the block |
> h2. COBS code table 
> || Code || Followed by || Meaning | 
> | 0x00 | (not applicable) | (not allowed ) |
> | 0x01 | nothing | Empty payload followed by the closing sync byte |
> | 0x02 | one data byte | The single data byte, followed by the closing sync 
> byte | 
> | 0x03 | two data bytes | The pair of data bytes, followed by the closing 
> sync byte |
> | 0x04 | three data bytes | The three data bytes, followed by the closing 
> sync byte |
> | n | (n-1) data bytes | The (n-1) data bytes, followed by the closing sync 
> byte |
> | 0xFD | 252 data bytes | The 252 data bytes, followed by the closing sync 
> byte |
> | 0xFE | 253 data bytes | The 253 data bytes, followed by the closing sync 
> byte |
> | 0xFF | 254 data bytes | The 254 data bytes *not* followed by a zero. |
> (taken from http://www.sigcomm.org/sigcomm97/papers/p062.pdf)
> h1. Encoding
> Only the block writer needs to perform byte-by-byte processing to encode the 
> block.  The overhead for COBS encoding is very small in terms of the 
> in-memory state required.
> h1. Decoding
> Block readers are not required to do as much byte-by-byte processing as a 
> writer.  The reader could (for example) find a _metadata_ block by doing the 
> following:
> # Search for a zero byte in the file which marks the start of a record
> # Read and zig-zag decode the _type_ of the block
> #* If the block is _normal_ data, read the _length_, seek ahead to the next 
> block and goto step #2 again
> #* If the block is a _metadata_ block, cobs decode the data

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to