[ 
https://issues.apache.org/jira/browse/AVRO-27?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Carey updated AVRO-27:
----------------------------

    Attachment: COWSCodec2.java
                COWSCodec.java
                COBSCodec2.java

Test COBS / COWS / COLS codecs.  First batch of files.  These three files are 
described as follows:

COBSCodec2.java  -- minor modification of the previous version for an improved 
testing loop.  Also modified to test in batch with the other new additions.

COWSCodec.java  -- first, hack-ish version of a COBS-like encoding that works 
in 4 byte chunks.  This version uses ByteBuffer.asIntBuffer(), and does all 
copies with the default nio 'copy from position() to limit()' behavior.  This 
turns out to be slow.  asIntBuffer does not have optimal copy operation as can 
be seen in the slow decode.

COWSCodec2.java -- re-implimented using ByteBuffer.getInt() and putInt().  
Significantly faster.

Three more files after this and a set of benchmarks on Linux with recent JRE's.

The point of all this is experimentation and optimization.  Although this 
specific JIRA may not become relevant -- the results of this investigation may 
be useful in other contexts as well.

> Consistent Overhead Byte Stuffing (COBS) encoded block format for Object 
> Container Files
> ----------------------------------------------------------------------------------------
>
>                 Key: AVRO-27
>                 URL: https://issues.apache.org/jira/browse/AVRO-27
>             Project: Avro
>          Issue Type: New Feature
>          Components: spec
>            Reporter: Matt Massie
>         Attachments: COBSCodec.java, COBSCodec2.java, COWSCodec.java, 
> COWSCodec2.java
>
>
> Object Container Files could use a 1 byte sync marker (set to zero) using 
> zig-zag and COBS encoding within blocks to efficiently escape zeros from the 
> record data.
> h4. Zig-Zag encoding
> With zig-zag encoding only the value of 0 (zero) gets encoded into a value 
> with a single zero byte.  This property means that we can write any non-zero 
> zig-zag long inside a block within concern for creating an unintentional sync 
> byte. 
> h4. COBS encoding
> We'll use COBS encoding to ensure that all zeros are escaped inside the block 
> payload.  You can read http://www.sigcomm.org/sigcomm97/papers/p062.pdf for 
> the details about COBS encoding.
> h1. Block Format
> All blocks start and end with a sync byte (set to zero) with a 
> type-length-value format internally as follows:
> || name || format || length in bytes || value || meaning ||
> | sync | byte | 1 | always 0 (zero) | The sync byte serves as a clear marker 
> for the start of a block |
> | type | zig-zag long | variable | must be non-zero | The type field 
> expresses whether the block is for _metadata_ or _normal_ data. |
> | length | zig-zag long | variable | must be non-zero | The length field 
> expresses the number of bytes until the next record (including the cobs code 
> and sync byte).  Useful for skipping ahead to the next block. |
> | cobs_code | byte | 1 | see COBS code table below | Used in escaping zeros 
> from the block payload |
> | payload | cobs-encoded | Greater than or equal to zero | all non-zero bytes 
> | The payload of the block |
> | sync | byte | 1 | always 0 (zero) | The sync byte serves as a clear marker 
> for the end of the block |
> h2. COBS code table 
> || Code || Followed by || Meaning | 
> | 0x00 | (not applicable) | (not allowed ) |
> | 0x01 | nothing | Empty payload followed by the closing sync byte |
> | 0x02 | one data byte | The single data byte, followed by the closing sync 
> byte | 
> | 0x03 | two data bytes | The pair of data bytes, followed by the closing 
> sync byte |
> | 0x04 | three data bytes | The three data bytes, followed by the closing 
> sync byte |
> | n | (n-1) data bytes | The (n-1) data bytes, followed by the closing sync 
> byte |
> | 0xFD | 252 data bytes | The 252 data bytes, followed by the closing sync 
> byte |
> | 0xFE | 253 data bytes | The 253 data bytes, followed by the closing sync 
> byte |
> | 0xFF | 254 data bytes | The 254 data bytes *not* followed by a zero. |
> (taken from http://www.sigcomm.org/sigcomm97/papers/p062.pdf)
> h1. Encoding
> Only the block writer needs to perform byte-by-byte processing to encode the 
> block.  The overhead for COBS encoding is very small in terms of the 
> in-memory state required.
> h1. Decoding
> Block readers are not required to do as much byte-by-byte processing as a 
> writer.  The reader could (for example) find a _metadata_ block by doing the 
> following:
> # Search for a zero byte in the file which marks the start of a record
> # Read and zig-zag decode the _type_ of the block
> #* If the block is _normal_ data, read the _length_, seek ahead to the next 
> block and goto step #2 again
> #* If the block is a _metadata_ block, cobs decode the data

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to