[ 
https://issues.apache.org/jira/browse/AVRO-27?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707471#action_12707471
 ] 

Matt Massie commented on AVRO-27:
---------------------------------

The suspense was just killing me so I had to get some benchmarks myself.  

Scott, I'll be interested to see if you have similar results over the weekend.

I rewrote the LBL code to use ByteBuffers instead of the ArrayByteList from the 
older Apache Commons Primitives library.  The new API looks like...

{code}
public static void decode(ByteBuffer src, int from, int to, ByteBuffer dest) 
throws IOException
public static void encode(ByteBuffer src, int from, int to, ByteBuffer dest)
{code}

I chose ByteBuffers because I didn't want to realloc new byte arrays but 
instead operate on the same byte array for each test.  

My test results are the average of 10 runs over a 64 MB ByteBuffer on my 
MacBook Pro:

{noformat}
  Model Name:   MacBook Pro
  Model Identifier:     MacBookPro5,1
  Processor Name:       Intel Core 2 Duo
  Processor Speed:      2.4 GHz
  Number Of Processors: 1
  Total Number Of Cores:        2
  L2 Cache:     3 MB
  Memory:       4 GB
  Bus Speed:    1.07 GHz
{noformat}

Since my test wasn't multithreaded... only one core was used.

My tests verified that the byte array wasn't altered by the encoding/decoding 
process (there were no failures).
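
For anyone who wants to reproduce this kind of measurement, here is a rough sketch of a harness, not the benchmark code itself. It assumes a hypothetical `ThroughputSketch` class, fills a buffer with one zero every `period` bytes, and times a stand-in workload (a plain buffer copy) where a real run would call the encoder or decoder.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical harness for illustration; names and structure are assumptions.
final class ThroughputSketch {

    /** Fills buf with non-zero bytes, with a zero at every period-th position (period <= 0: no zeros). */
    static void fill(ByteBuffer buf, int period) {
        for (int i = 0; i < buf.capacity(); i++) {
            boolean zero = period > 0 && (i + 1) % period == 0;
            buf.put(i, zero ? (byte) 0 : (byte) 0x55);
        }
    }

    /** Counts the zero bytes in buf using absolute reads. */
    static int countZeros(ByteBuffer buf) {
        int zeros = 0;
        for (int i = 0; i < buf.capacity(); i++) {
            if (buf.get(i) == 0) zeros++;
        }
        return zeros;
    }

    /** Average MB/sec over `runs` invocations of `work`, each processing `bytes` bytes. */
    static double averageMBPerSec(long bytes, int runs, Runnable work) {
        double total = 0;
        for (int r = 0; r < runs; r++) {
            long start = System.nanoTime();
            work.run();
            long elapsedNanos = Math.max(1, System.nanoTime() - start);
            total += (bytes / (1024.0 * 1024.0)) / (elapsedNanos / 1e9);
        }
        return total / runs;
    }

    public static void main(String[] args) {
        int size = 1 << 20; // 1 MB for a quick run; the tests above used 64 MB
        ByteBuffer src = ByteBuffer.allocate(size);
        ByteBuffer dest = ByteBuffer.allocate(size);
        fill(src, 10); // one zero every 10 bytes
        byte[] before = Arrays.copyOf(src.array(), size);

        // Stand-in workload: a real benchmark would encode or decode here.
        double rate = averageMBPerSec(size, 10, () -> {
            dest.clear();
            dest.put(src.duplicate());
        });

        // Check that the source was not altered and the copy matches it.
        if (!Arrays.equals(before, src.array())) throw new AssertionError("source altered");
        if (!Arrays.equals(before, dest.array())) throw new AssertionError("copy differs");
        System.out.printf("Stand-in copy at %.2f MB/sec%n", rate);
    }
}
```

Passing `fill(src, 1)` gives the all-zeros case, and `fill(src, 0)` gives a buffer with no zeros at all.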

These numbers are meant to be ballpark values, since my MacBook wasn't exactly 
"quiet" during the tests... I was cranking some Radiohead on iTunes.

One of the factors that can affect the speed of COBS is the number of zeros you 
need to encode/decode.  In the worst case, you are encoding nothing but zeros.  
In that case, you'll essentially be replacing all zeros with ones.
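
To make the worst case concrete, here is a minimal sketch of plain COBS as described in the SIGCOMM paper, working on byte arrays rather than ByteBuffers. It is not the benchmarked code: the class name `CobsSketch` and its method signatures are assumptions for illustration, and it does not write the surrounding sync bytes.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Illustrative COBS codec; names and signatures are assumptions, not the posted API.
final class CobsSketch {

    /** Encodes src so that the result contains no zero bytes. */
    static byte[] encode(byte[] src) {
        // Worst-case size: one code byte per 254 data bytes, plus the leading code byte.
        byte[] out = new byte[src.length + src.length / 254 + 1];
        int codeIdx = 0; // where the current group's code byte will be stored
        int w = 1;       // next write position
        int code = 1;    // number of non-zero bytes in the current group, plus one
        for (byte b : src) {
            if (b == 0) {
                out[codeIdx] = (byte) code; // close the group; the zero is implied
                codeIdx = w++;
                code = 1;
            } else {
                out[w++] = b;
                if (++code == 0xFF) {       // 254 data bytes: close with no implied zero
                    out[codeIdx] = (byte) code;
                    codeIdx = w++;
                    code = 1;
                }
            }
        }
        out[codeIdx] = (byte) code;
        return Arrays.copyOf(out, w);
    }

    /** Reverses encode(); rejects input that contains a zero byte or is truncated. */
    static byte[] decode(byte[] enc) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(enc.length);
        int i = 0;
        while (i < enc.length) {
            int code = enc[i++] & 0xFF;
            if (code == 0) throw new IllegalArgumentException("zero byte in COBS data");
            if (i + code - 1 > enc.length) throw new IllegalArgumentException("truncated group");
            out.write(enc, i, code - 1);
            i += code - 1;
            // A code below 0xFF implies a zero, except after the final group.
            if (code != 0xFF && i < enc.length) out.write(0);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Four zeros become five ones: every zero turns into a code byte of 0x01.
        System.out.println(Arrays.toString(encode(new byte[4]))); // prints [1, 1, 1, 1, 1]
    }
}
```

With nothing but zeros, every input byte closes a group, so the encoder writes a code byte per input byte and the decoder handles one group per output byte.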

*The results from this worst case (nothing but zeros) are as follows...*

Encoding at 38.22 MB/sec
Decoding at 17.85 MB/sec

*If we have one zero every 10 bytes...*

Encoding at 57.26 MB/sec
Decoding at 151.91 MB/sec

*If you have one zero every 100 bytes...*

Encoding at 74.81 MB/sec
Decoding at 846.56 MB/sec

*If you have one zero every 1000 bytes...*

Encoding at 73.70 MB/sec
Decoding at 1128.75 MB/sec

*If you have one zero every 10,000 bytes...*

Encoding at 74.40 MB/sec
Decoding at 1118.88 MB/sec

*If you have no zeros at all...*

Encoding at 73.98 MB/sec
Decoding at 1151.08 MB/sec

So it looks to me like... even with native Java code... we'll be able to push 
~100MB/sec - 200MB/sec... (except for the worst case where we have 64MB of 
zeros).

I'll post my code to this Jira so others can point and laugh.  :)


> Consistent Overhead Byte Stuffing (COBS) encoded block format for Object 
> Container Files
> ----------------------------------------------------------------------------------------
>
>                 Key: AVRO-27
>                 URL: https://issues.apache.org/jira/browse/AVRO-27
>             Project: Avro
>          Issue Type: New Feature
>          Components: spec
>            Reporter: Matt Massie
>
> Object Container Files could use a 1 byte sync marker (set to zero) using 
> zig-zag and COBS encoding within blocks to efficiently escape zeros from the 
> record data.
> h4. Zig-Zag encoding
> With zig-zag encoding only the value of 0 (zero) gets encoded into a value 
> with a single zero byte.  This property means that we can write any non-zero 
> zig-zag long inside a block without concern for creating an unintentional sync 
> byte. 
> h4. COBS encoding
> We'll use COBS encoding to ensure that all zeros are escaped inside the block 
> payload.  You can read http://www.sigcomm.org/sigcomm97/papers/p062.pdf for 
> the details about COBS encoding.
> h1. Block Format
> All blocks start and end with a sync byte (set to zero) with a 
> type-length-value format internally as follows:
> || name || format || length in bytes || value || meaning ||
> | sync | byte | 1 | always 0 (zero) | The sync byte serves as a clear marker 
> for the start of a block |
> | type | zig-zag long | variable | must be non-zero | The type field 
> expresses whether the block is for _metadata_ or _normal_ data. |
> | length | zig-zag long | variable | must be non-zero | The length field 
> expresses the number of bytes until the next record (including the cobs code 
> and sync byte).  Useful for skipping ahead to the next block. |
> | cobs_code | byte | 1 | see COBS code table below | Used in escaping zeros 
> from the block payload |
> | payload | cobs-encoded | Greater than or equal to zero | all non-zero bytes 
> | The payload of the block |
> | sync | byte | 1 | always 0 (zero) | The sync byte serves as a clear marker 
> for the end of the block |
> h2. COBS code table 
> || Code || Followed by || Meaning ||
> | 0x00 | (not applicable) | (not allowed) |
> | 0x01 | nothing | Empty payload followed by the closing sync byte |
> | 0x02 | one data byte | The single data byte, followed by the closing sync 
> byte | 
> | 0x03 | two data bytes | The pair of data bytes, followed by the closing 
> sync byte |
> | 0x04 | three data bytes | The three data bytes, followed by the closing 
> sync byte |
> | n | (n-1) data bytes | The (n-1) data bytes, followed by the closing sync 
> byte |
> | 0xFD | 252 data bytes | The 252 data bytes, followed by the closing sync 
> byte |
> | 0xFE | 253 data bytes | The 253 data bytes, followed by the closing sync 
> byte |
> | 0xFF | 254 data bytes | The 254 data bytes *not* followed by a zero. |
> (taken from http://www.sigcomm.org/sigcomm97/papers/p062.pdf)
> h1. Encoding
> Only the block writer needs to perform byte-by-byte processing to encode the 
> block.  The overhead for COBS encoding is very small in terms of the 
> in-memory state required.
> h1. Decoding
> Block readers are not required to do as much byte-by-byte processing as a 
> writer.  The reader could (for example) find a _metadata_ block by doing the 
> following:
> # Search for a zero byte in the file which marks the start of a record
> # Read and zig-zag decode the _type_ of the block
> #* If the block is _normal_ data, read the _length_, seek ahead to the next 
> block and goto step #2 again
> #* If the block is a _metadata_ block, cobs decode the data
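
The zig-zag property that the quoted description relies on can be checked with a small sketch. It assumes Avro's usual long encoding (zig-zag followed by a 7-bits-per-byte variable-length form, low-order group first); the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;

// Illustrative zig-zag varint encoder; names are assumptions.
final class ZigZagSketch {

    /** Zig-zag encodes n, then writes it as a variable-length integer. */
    static byte[] encodeLong(long n) {
        long z = (n << 1) ^ (n >> 63); // small magnitudes map to small unsigned values
        ByteArrayOutputStream out = new ByteArrayOutputStream(10);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80)); // continuation byte, high bit set
            z >>>= 7;
        }
        out.write((int) z); // final byte, high bit clear
        return out.toByteArray();
    }

    /** Returns true if any byte of the array is zero. */
    static boolean containsZeroByte(byte[] bytes) {
        for (byte b : bytes) {
            if (b == 0) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsZeroByte(encodeLong(0)));  // prints true
        System.out.println(containsZeroByte(encodeLong(64))); // prints false
    }
}
```

Continuation bytes always have the high bit set, and the final byte holds the most significant non-zero group, so only the value 0 produces a zero byte. That is why the _type_ and _length_ fields must be non-zero.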

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
