[jira] Updated: (AVRO-27) Consistent Overhead Byte Stuffing (COBS) encoded block format for Object Container Files

Scott Carey (JIRA) Mon, 25 May 2009 23:17:21 -0700

     [ 
https://issues.apache.org/jira/browse/AVRO-27?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Scott Carey updated AVRO-27:
----------------------------

    Attachment: COLSCodec2.java

So, aligned access is important --  However, the JVM 's JIT can not guarantee 
it on a ByteBuffer or byte[], but can on a LongBuffer or long[].  Here are 
results on my laptop akin to the above, but with a COLSCodec2 that uses a 
LongBuffer rather than a ByteBuffer + getLong()/putLong().

COLSCodec, one zero word every 1 words
Encoding at 939.8201 MB/sec
Decoding at 980.54034 MB/sec
COLSCodec, one zero word every 10 words
Encoding at 822.7025 MB/sec
Decoding at 1188.7073 MB/sec
COLSCodec, one zero word every 1000 words
Encoding at 1104.4512 MB/sec
Decoding at 1429.9589 MB/sec


Unfortunately, for anything reading/writing from the network or a file, byte 
streams and arrays are the only option.  And as demonstrated before, 
asLongBuffer or asIntBuffer is not optimized and fairly restrictive.  This 
seems to indicate that in the future, there is more that the JVM can do, or 
there are Java APIs that could be made so that the JIT can easily detect data 
alignment and be more efficient.

> Consistent Overhead Byte Stuffing (COBS) encoded block format for Object 
> Container Files
> ----------------------------------------------------------------------------------------
>
>                 Key: AVRO-27
>                 URL: https://issues.apache.org/jira/browse/AVRO-27
>             Project: Avro
>          Issue Type: New Feature
>          Components: spec
>            Reporter: Matt Massie
>         Attachments: COBSCodec.java, COBSCodec2.java, COBSPerfTest.java, 
> COLSCodec.java, COLSCodec2.java, COWSCodec.java, COWSCodec2.java, 
> COWSCodec3.java
>
>
> Object Container Files could use a 1 byte sync marker (set to zero) using 
> zig-zag and COBS encoding within blocks to efficiently escape zeros from the 
> record data.
> h4. Zig-Zag encoding
> With zig-zag encoding only the value of 0 (zero) gets encoded into a value 
> with a single zero byte.  This property means that we can write any non-zero 
> zig-zag long inside a block within concern for creating an unintentional sync 
> byte. 
> h4. COBS encoding
> We'll use COBS encoding to ensure that all zeros are escaped inside the block 
> payload.  You can read http://www.sigcomm.org/sigcomm97/papers/p062.pdf for 
> the details about COBS encoding.
> h1. Block Format
> All blocks start and end with a sync byte (set to zero) with a 
> type-length-value format internally as follows:
> || name || format || length in bytes || value || meaning ||
> | sync | byte | 1 | always 0 (zero) | The sync byte serves as a clear marker 
> for the start of a block |
> | type | zig-zag long | variable | must be non-zero | The type field 
> expresses whether the block is for _metadata_ or _normal_ data. |
> | length | zig-zag long | variable | must be non-zero | The length field 
> expresses the number of bytes until the next record (including the cobs code 
> and sync byte).  Useful for skipping ahead to the next block. |
> | cobs_code | byte | 1 | see COBS code table below | Used in escaping zeros 
> from the block payload |
> | payload | cobs-encoded | Greater than or equal to zero | all non-zero bytes 
> | The payload of the block |
> | sync | byte | 1 | always 0 (zero) | The sync byte serves as a clear marker 
> for the end of the block |
> h2. COBS code table 
> || Code || Followed by || Meaning | 
> | 0x00 | (not applicable) | (not allowed ) |
> | 0x01 | nothing | Empty payload followed by the closing sync byte |
> | 0x02 | one data byte | The single data byte, followed by the closing sync 
> byte | 
> | 0x03 | two data bytes | The pair of data bytes, followed by the closing 
> sync byte |
> | 0x04 | three data bytes | The three data bytes, followed by the closing 
> sync byte |
> | n | (n-1) data bytes | The (n-1) data bytes, followed by the closing sync 
> byte |
> | 0xFD | 252 data bytes | The 252 data bytes, followed by the closing sync 
> byte |
> | 0xFE | 253 data bytes | The 253 data bytes, followed by the closing sync 
> byte |
> | 0xFF | 254 data bytes | The 254 data bytes *not* followed by a zero. |
> (taken from http://www.sigcomm.org/sigcomm97/papers/p062.pdf)
> h1. Encoding
> Only the block writer needs to perform byte-by-byte processing to encode the 
> block.  The overhead for COBS encoding is very small in terms of the 
> in-memory state required.
> h1. Decoding
> Block readers are not required to do as much byte-by-byte processing as a 
> writer.  The reader could (for example) find a _metadata_ block by doing the 
> following:
> # Search for a zero byte in the file which marks the start of a record
> # Read and zig-zag decode the _type_ of the block
> #* If the block is _normal_ data, read the _length_, seek ahead to the next 
> block and goto step #2 again
> #* If the block is a _metadata_ block, cobs decode the data

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (AVRO-27) Consistent Overhead Byte Stuffing (COBS) encoded block format for Object Container Files

Reply via email to