[ 
https://issues.apache.org/jira/browse/AVRO-27?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706754#action_12706754
 ] 

Scott Carey commented on AVRO-27:
---------------------------------

An outsider here -- I've got an idea for avoiding the performance pitfalls of 
COBS' byte-by-byte nature, and as I thought it through I spotted many other 
opportunities for enhancement, since larger chunks afford many more bits in 
the code that can be used for things other than the length of the following 
literal chunk.

h2. Proposal -- COLS, a modification of COBS 
h3. (for greater performance and extensibility for large data streams) 

Java is particularly bad at byte-by-byte operations.  The COBS paper clearly 
indicates it was designed for stuffing data through embedded systems and 
networks such as telephone lines, where byte-by-byte processing of the whole 
payload is already mandatory.

Doing so here would be a performance bottleneck in Java.  Some simple tests 
can be constructed to prove or disprove this claim.
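
For instance, a crude test along these lines (hypothetical class and method 
names -- not part of any proposed API) would compare a byte-wise scan for zero 
bytes, the COBS predicate, against a long-wise scan for zero longs, the COLS 
predicate:

{code}
import java.nio.ByteBuffer;
import java.nio.LongBuffer;

// Crude micro-benchmark sketch: scan for a zero one byte at a time (the COBS
// predicate) versus one long at a time (the COLS predicate).  Hypothetical
// names; the timing methodology is deliberately simple.
public class ZeroScanBench {

    // Returns the index of the first zero byte, or -1.
    static int byteWiseScan(ByteBuffer buf) {
        for (int i = 0; i < buf.limit(); i++) {
            if (buf.get(i) == 0) return i;
        }
        return -1;
    }

    // Returns the byte offset of the first zero long, or -1.
    static int longWiseScan(LongBuffer buf) {
        for (int i = 0; i < buf.limit(); i++) {
            if (buf.get(i) == 0L) return i * 8;
        }
        return -1;
    }

    public static void main(String[] args) {
        ByteBuffer data = ByteBuffer.allocate(64 * 1024 * 1024);
        for (int i = 0; i < data.capacity(); i++) {
            data.put(i, (byte) ((i % 255) + 1));  // no zeros: worst-case scan
        }
        long t0 = System.nanoTime();
        byteWiseScan(data);
        long t1 = System.nanoTime();
        longWiseScan(data.asLongBuffer());
        long t2 = System.nanoTime();
        System.out.println("byte-wise: " + (t1 - t0) / 1000000 + " ms, "
            + "long-wise: " + (t2 - t1) / 1000000 + " ms");
    }
}
{code}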

I propose that rather than COBS, one use COLS or COWS -- that is, Consistent 
Overhead Long Stuffing or Consistent Overhead Word Stuffing.

This would be inefficient if we expect most payloads to be small (less than 
256 bytes), but I suspect most Hadoop-related payloads will be large, and 
often very large.

I favor stuffing Longs rather than Ints, since most systems will soon be 
running 64 bit JVMs.  Sun's next JRE release has Object Pointer Compression, 
which makes the memory overhead of a 64 bit JVM very small compared to a 32 
bit JVM, and performance is generally better than the 32 bit JVM's due to 
native 64 bit operations and more registers (for x86-64 at least).
http://blog.juma.me.uk/2008/10/14/32-bit-or-64-bit-jvm-how-about-a-hybrid/

I will describe the proposal below assuming a translation of COBS to COLS, 
from 1 byte at a time to 8 bytes at a time encoding.  However, it is clear 
that a 4 byte variant is very similar and may be preferable.

h3. Proposed Changes -- Simple Block format with COLS

|| name || format || length in bytes || value || meaning ||
| sync | long | *8* | *0L* | The sync *long* serves as a clear marker for the start of a block |
| type | byte | 1 | non-zero | The type field expresses whether the block is for _metadata_ or _normal_ data. *note - if this is only ever going to be a binary flag, it can be packed into the length or sequence number as a sign bit.* *However, it is decoding-performance critical to keep the non-COLS header 8 byte aligned.* |
| block sequence number | 3 byte unsigned int | 3 | 0 - 2^24 | The block sequence number -- a client can use this to resume a stream from the last successful block. This may not be needed if the metadata blocks take care of this. |
| length | 4 byte signed int | 4 | *>= 0* | The length field expresses the number of bytes of COLS_payload data. Useful for skipping ahead to the next block. |
| COLS_payload | *COLS* | *length as above* | see COLS description below | The data in this block, encoded. |

The above would cap the stream length at 2GB * 16M = 2^31 * 2^24 = 2^55 bytes 
= 32PB.  There is room to increase this significantly by taking bits from the 
type and giving those to the block count.  2GB blocks are rather unlikely for 
now however -- as are multi-PB streams.

h4. Discussion
* The entire stream would need to be 8 byte aligned in order to process it 
cleanly with something like java.nio.LongBuffer.  This would include metadata 
blocks.
* The sequence is assumed to be in network-order.   Endianness can be handled 
and is not discussed in detail here.
* The type can likely be encoded in a single bit in the block sequence number 
or length field.  If more than two types of blocks are expected, more bits can 
be reserved for future use.
* The length can be stored as the number of longs rather than bytes (bytes / 8) 
since the COLS payload is a multiple of 8 bytes.
* The COLS payload here differs from the original proposal.  It will have an 
entire COBS-like stream, with possibly many COLS code markers (at least one per 
0L value in the block data).
* One may want to have both the encoded length above, and the decoded length 
(or a checksum) as extra data validation.  Perhaps even 4 types:  METADATA, 
METADATA_CSUM, NORMAL, NORMAL_CSUM -- where  the ordinary variants store the 
length (fast, but less reliable) and the _CSUM variants store a checksum 
(slower, but highly reliable).
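
To make the header layout concrete, here is a hypothetical sketch of how a 
reader might pull apart the fixed 16 byte header in the table above (ignoring 
the optional tweaks just discussed; class and field names are mine, not part 
of the proposal):

{code}
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch: parse the 16-byte block header laid out as in the
// table above: 8-byte sync (0L), 1-byte type, 3-byte sequence, 4-byte length.
public final class BlockHeader {
    final byte type;
    final int sequence;   // 24-bit unsigned block sequence number
    final int length;     // bytes of COLS payload

    BlockHeader(byte type, int sequence, int length) {
        this.type = type;
        this.sequence = sequence;
        this.length = length;
    }

    static BlockHeader read(ByteBuffer buf) {
        buf.order(ByteOrder.BIG_ENDIAN);  // network order, per the proposal
        if (buf.getLong() != 0L) {
            throw new IllegalStateException("missing sync marker");
        }
        long word = buf.getLong();        // type + sequence + length in one long
        byte type = (byte) (word >>> 56);
        int sequence = (int) ((word >>> 32) & 0xFFFFFF);
        int length = (int) word;          // low 32 bits, signed, >= 0
        if (type == 0 || length < 0) {
            throw new IllegalStateException("corrupt block header");
        }
        return new BlockHeader(type, sequence, length);
    }
}
{code}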

h3. Basic COBS to COLS description
COBS describes a byte-by-byte encoding whose output contains no zero bytes: a 
set of codes is used to encode runs of data that do not contain a zero byte. 
All codes but one have an implicit trailing zero.  The last code group is 
assumed to have no implicit zero regardless of the code.

COLS is a simple extension of this scheme to 64 bit chunks.  In its base form, 
it does nothing more than work with larger chunks:

|| COLS Code (Long, 8 bytes) || Followed by || Meaning ||
| 0L | N/A | (not allowed) |
| 1L | nothing | A single zero Long |
| 2L | one long (8 bytes) | The single data long, followed by a trailing zero long \* |
| 3L | two longs (16 bytes) | The pair of data longs, followed by a trailing zero long \* |
| nL | (n-1) longs | The (n-1) longs, followed by a trailing zero long \* |
| MAX \*\* | (MAX - 1) longs | The (MAX - 1) longs, with no trailing zero |

\* The last code in the sequence (which can be identified by the length header 
or a 0L indicating the start of the next block) does NOT have an implicit 
trailing zero. 
\*\* MAX needs to be chosen, and can't realistically be very large since 
encoding requires an arraycopy of size (MAX -1) * 8

The COLS_payload has multiple COLS Code entries (and literals), up to the 
length specified in the header (where a 0L should then occur).
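
A minimal sketch of the base decoding loop, under the table and footnotes 
above (hypothetical names; assumes the payload length in longs is known from 
the block header):

{code}
import java.nio.LongBuffer;

// Sketch of base COLS decoding: read a code long, copy (code - 1) literal
// longs, then emit the implicit trailing zero unless the code was MAX or was
// the last code in the payload.  Hypothetical names.
static void decodeCols(LongBuffer in, LongBuffer out, int payloadLongs, long MAX) {
    int remaining = payloadLongs;
    while (remaining > 0) {
        long code = in.get();
        int literals = (int) (code - 1);
        for (int i = 0; i < literals; i++) {   // a bulk get/put in practice
            out.put(in.get());
        }
        remaining -= 1 + literals;
        boolean last = (remaining == 0);
        if (code != MAX && !last) {
            out.put(0L);                       // the implicit trailing zero
        }
    }
}
{code}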

However -- there are drawbacks to using such a large chunk without other 
modifications from COBS:
# 64 bits is far too large for a length field.  For encoding, a COBS code 
block must fit in RAM, and for performance should probably fit in half an L2 
cache.  For decoding, however, the COLS code length is irrelevant.
# If the size of the data encoded is not a multiple of 8 bytes, we need a 
mechanism to encode that up to 7 trailing bytes should be truncated (3 bits).
# For most blocks, the overhead will be exactly 8 bytes (unless the block has a 
trailing 0L).
# Very long data streams without a zero Long are unlikely, so very large chunk 
lengths are not very useful. 

There are also benefits, however.  The above suggests that most of the 8 byte 
COLS code block space is not needed to encode length.  Much can be done with 
this!
Some thoughts:
* The 3 bits needed to define the truncation behavior can be stored in the COLS 
code.
* The overhead can be reduced, by encoding short trailing sequences into the 
upper bits rather than purely truncating -- e.g. you can append 2 bytes instead 
of truncating 6.
* Rudimentary run-length encoding or other lightweight compression can be done 
with the extra bits (completely encoder-optional).
* We can remove the requirement that most codes have an implicit trailing zero, 
and encode that in one of the extra bits.

If only the lower 2 bytes of an 8 byte COLS code represent the size (MAX = 
2^16 - 1), then the max literal size is 512KB - 16B.  If we remove the 
implicit trailing zero, an encoder can optionally encode smaller literal 
sequences (perhaps for performance, or for compression).
What can be done with the remaining 48 bits?
Some ideas:
# The highest 4 bytes can represent data to append to the literal.  In this 
way, half of the size overhead of the encoding is removed.  This should 
generally only apply to the last COLS code in the block (for performance 
reasons, and to maintain 8 byte alignment on all arraycopy operations), but it 
is encoder-optional.
# The next bit represents whether the COLS block has an implicit 0L appended.
# A bit can be used to signify endianness (this might be a better fit for the 
block header or stream metadata -- detecting zeros works without knowing the 
endianness).
# The next three bits can represent how much data is truncated from, or 
appended to, the literal (before the optional implicit 0L):

|| value || meaning ||
| 000 | do not truncate or append |
| 100 | append all 4 leading bytes in the COLS code after the literal |
| 111 | append the first 3 leading bytes in the COLS code after the literal |
| 110 | append the first 2 leading bytes in the COLS code after the literal |
| 101 | append the leading byte in the COLS code after the literal |
| 011 | truncate the last 3 bytes of the literal |
| 010 | truncate the last 2 bytes of the literal |
| 001 | truncate the last byte of the literal |

This leaves us with 12 bits.  I propose that these be used for rudimentary 
(optional) compression:

* Option A:
** Run length only -- the 12 bits represent the number of times to repeat the 
literal.  Or 4 bits are the number of COLS chunks backwards (including this 
one) to repeat, and 8 bits is the number of repeats.  Or ... some other form of 
emitting copies of entire COLS chunks.
* Option B:
** Some form of LZ-like compression that copies in 8 byte chunks -- 4 bits 
represent the number of Longs to copy (so the max match size is 15 * 8 bytes), 
and 8 bits represent the number of Longs backwards (from the end of this COLS 
chunk) at which to begin that copy (up to 2KB); see the sketch below.  Because 
of the truncation/append feature, this is not constrained to 8-byte-aligned 
copies on the output, but the encoded format is entirely 8 byte aligned and 
all copies are multiples of 8 bytes.  I would not be surprised if this were as 
fast as LZO or faster, since it is very similar but operates in a chunkier 
fashion.  Compression ratios would not be that great, but as with most similar 
algorithms the encoder can do more work to search for matches.  Decoding 
uncompressed data should be essentially free (if the 4 bits are 0, do nothing 
-- and most COLS blocks would be fairly large, so this check does not occur 
that frequently).
* Option C:
** Reserve those 12 bits for future use / research

Alternatively, one to four of the extra bytes used for the "append" feature 
can be reassigned to provide more than 12 bits of compression metadata.
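
For concreteness, here is a hypothetical decode-side sketch of the Option B 
copy; it takes the 12 compression bits as a plain int, however they end up 
packed into the code word:

{code}
// Hypothetical sketch of the Option B copy: the high 4 of the 12 bits give
// the match length in longs (up to 15), the low 8 bits give the backwards
// distance in longs from the end of this COLS chunk (up to 255 longs ~ 2KB).
static void applyLzCopy(long[] out, int outEnd, int compressionBits) {
    int matchLongs = (compressionBits >>> 8) & 0xF;
    int distLongs  = compressionBits & 0xFF;
    if (matchLongs == 0) return;               // uncompressed: essentially free
    int src = outEnd - distLongs;
    for (int i = 0; i < matchLongs; i++) {
        out[outEnd + i] = out[src + i];        // overlap allowed, RLE-style
    }
}
{code}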

So, with the above modifications, the COLS code looks like this:

The COLS code is 8 bytes.  The low 16 bits encode its basic meaning.
An 8 byte COLS code cannot be 0L.
|| Code & 0xFFFF (low 2 bytes) || Followed by || Meaning || 
| 0x0000 | N/A | (not allowed)|
| 0x0001 | nothing | A single zero Long |
| 0x0002 | one long (8 bytes) | The single data long |
| 0x0003 | two longs (16 bytes) | The pair of data longs |
| n | (n-1) longs |  The (n-1) longs |
| 0xFFFF | 2^16 - 2 longs | 2^16 - 2 longs |

The next portion determines the state of truncation or appending.
Two options are listed -- truncation only, and truncation/appending.  The 
appending could be up to 5 bytes if we squeeze all the rest of the space.  The 
example below allows up to 4 bytes appended or 3 bytes truncated.

{code}appendCode = (Code >> 28) & 0xF;{code}

|| appendCode & 0x7 || Append (+) or truncate (-) || From || truncate-only option ||
| 0x0 | 0 | nothing | 0 |
| 0x1 | (-)1 | nothing | (-)1 |
| 0x2 | (-)2 | nothing | (-)2 |
| 0x3 | (-)3 | nothing | (-)3 |
| 0x4 | (+)1 | Code >>> 56 | (-)4 |
| 0x5 | (+)2 | Code >>> 48 | (-)5 |
| 0x6 | (+)3 | Code >>> 40 | (-)6 |
| 0x7 | (+)4 | Code >>> 32 | (-)7 |

It may be wiser to choose an option between these.  If 3 bytes are chosen as 
the max arbitrary append length, with 4 truncated, 20 bits are left for other 
purposes, rather than 12.  The average COLS chunk would be one byte larger.

|| AppendCode & 0x8 || Append 0L ||
| 0 | do not append 0L (8 zero bytes) |
| 1 | do append 0L (8 zero bytes) | 
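
Putting the pieces together, a decoder might unpack one extended COLS code 
word as below.  This is a sketch under the layout assumed above (low 16 bits 
of size, 12 compression bits, the 4 bit appendCode, and the high 4 bytes of 
append data); all names are hypothetical:

{code}
// Hypothetical sketch: unpack one extended COLS code word.
// Layout assumed: bits 0-15 size, bits 16-27 compression, bits 28-31 append
// code, bits 32-63 bytes to append.
static void decodeCodeWord(long code) {
    int size         = (int) (code & 0xFFFF);         // 0x0000 is illegal
    int compression  = (int) ((code >>> 16) & 0xFFF); // optional compression bits
    int appendCode   = (int) ((code >>> 28) & 0xF);
    int literalLongs = size - 1;

    int adjust;          // bytes appended (+) or truncated (-) after the literal
    long appendBytes = 0;
    switch (appendCode & 0x7) {
        case 0x0: adjust =  0; break;
        case 0x1: adjust = -1; break;
        case 0x2: adjust = -2; break;
        case 0x3: adjust = -3; break;
        case 0x4: adjust =  1; appendBytes = code >>> 56; break; // leading byte
        case 0x5: adjust =  2; appendBytes = code >>> 48; break; // 2 leading bytes
        case 0x6: adjust =  3; appendBytes = code >>> 40; break; // 3 leading bytes
        default:  adjust =  4; appendBytes = code >>> 32; break; // all 4 bytes
    }
    boolean implicitZero = (appendCode & 0x8) != 0;
    // ... copy literalLongs longs, truncate or append per adjust/appendBytes,
    // apply the compression bits, then emit 0L if implicitZero is set.
}
{code}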
 
h2. Encoding
The writer would perform processing in 8 byte chunks until the end of the 
block, where some byte-by-byte processing would occur.  Compression options 
would be entirely the writer's choice.
The state overhead can be very low or large, at the writer's whim.  Larger 
COLS chunk sizes require more state (and larger arraycopies), and any 
compression option adds state overhead.
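
A minimal sketch of that writer loop, for base COLS only -- no append, 
truncation, or compression options.  Names are hypothetical, and the caller is 
assumed to size the output for the worst case of one extra long per MAX - 1 
literals:

{code}
// Sketch of base COLS encoding: gather runs of non-zero longs, emit
// (run + 1) as the code followed by the literal run.  A zero long that ends
// a short run is consumed as that code's implicit trailing zero; runs that
// hit MAX - 1 are emitted with code MAX and no implicit zero.
static int encodeCols(long[] in, long[] out, int MAX) {
    int o = 0, i = 0;
    while (true) {
        int start = i;
        while (i < in.length && in[i] != 0L && (i - start) < MAX - 1) {
            i++;
        }
        int run = i - start;
        out[o++] = run + 1;                        // the COLS code
        System.arraycopy(in, start, out, o, run);  // the literal run
        o += run;
        if (i == in.length) break;                 // last code: no implicit zero
        if (in[i] == 0L && run < MAX - 1) i++;     // zero consumed by the code
    }
    return o;                                      // encoded length in longs
}
{code}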

h2. Decoding
Decoding in all circumstances reads data in 8 byte chunks.  Copies occur in 8 
byte chunks, 8 byte aligned, except at the end of a block whose payload is not 
a multiple of 8 bytes.  An encoder can cause copy destinations (but not 
sources) to be misaligned if certain special options (compression) are used or 
if intentionally misaligned encoding is done.  Generally, an encoder can 
choose to make all but the last few bytes of the last block in the stream 
aligned.

> Consistent Overhead Byte Stuffing (COBS) encoded block format for Object 
> Container Files
> ----------------------------------------------------------------------------------------
>
>                 Key: AVRO-27
>                 URL: https://issues.apache.org/jira/browse/AVRO-27
>             Project: Avro
>          Issue Type: New Feature
>          Components: spec
>            Reporter: Matt Massie
>
> Object Container Files could use a 1 byte sync marker (set to zero) using 
> zig-zag and COBS encoding within blocks to efficiently escape zeros from the 
> record data.
> h4. Zig-Zag encoding
> With zig-zag encoding only the value of 0 (zero) gets encoded into a value 
> with a single zero byte.  This property means that we can write any non-zero 
> zig-zag long inside a block without concern for creating an unintentional sync 
> byte. 
> h4. COBS encoding
> We'll use COBS encoding to ensure that all zeros are escaped inside the block 
> payload.  You can read http://www.sigcomm.org/sigcomm97/papers/p062.pdf for 
> the details about COBS encoding.
> h1. Block Format
> All blocks start and end with a sync byte (set to zero) with a 
> type-length-value format internally as follows:
> || name || format || length in bytes || value || meaning ||
> | sync | byte | 1 | always 0 (zero) | The sync byte serves as a clear marker 
> for the start of a block |
> | type | zig-zag long | variable | must be non-zero | The type field 
> expresses whether the block is for _metadata_ or _normal_ data. |
> | length | zig-zag long | variable | must be non-zero | The length field 
> expresses the number of bytes until the next record (including the cobs code 
> and sync byte).  Useful for skipping ahead to the next block. |
> | cobs_code | byte | 1 | see COBS code table below | Used in escaping zeros 
> from the block payload |
> | payload | cobs-encoded | Greater than or equal to zero | all non-zero bytes 
> | The payload of the block |
> | sync | byte | 1 | always 0 (zero) | The sync byte serves as a clear marker 
> for the end of the block |
> h2. COBS code table 
> || Code || Followed by || Meaning ||
> | 0x00 | (not applicable) | (not allowed) |
> | 0x01 | nothing | Empty payload followed by the closing sync byte |
> | 0x02 | one data byte | The single data byte, followed by the closing sync 
> byte | 
> | 0x03 | two data bytes | The pair of data bytes, followed by the closing 
> sync byte |
> | 0x04 | three data bytes | The three data bytes, followed by the closing 
> sync byte |
> | n | (n-1) data bytes | The (n-1) data bytes, followed by the closing sync 
> byte |
> | 0xFD | 252 data bytes | The 252 data bytes, followed by the closing sync 
> byte |
> | 0xFE | 253 data bytes | The 253 data bytes, followed by the closing sync 
> byte |
> | 0xFF | 254 data bytes | The 254 data bytes *not* followed by a zero. |
> (taken from http://www.sigcomm.org/sigcomm97/papers/p062.pdf)
> h1. Encoding
> Only the block writer needs to perform byte-by-byte processing to encode the 
> block.  The overhead for COBS encoding is very small in terms of the 
> in-memory state required.
> h1. Decoding
> Block readers are not required to do as much byte-by-byte processing as a 
> writer.  The reader could (for example) find a _metadata_ block by doing the 
> following:
> # Search for a zero byte in the file which marks the start of a record
> # Read and zig-zag decode the _type_ of the block
> #* If the block is _normal_ data, read the _length_, seek ahead to the next 
> block and goto step #2 again
> #* If the block is a _metadata_ block, cobs decode the data

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
