[
https://issues.apache.org/jira/browse/AVRO-27?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707181#action_12707181
]
Matt Massie commented on AVRO-27:
---------------------------------
{quote}
1. What is the tangible advantage of a single record boundary?
2. Why would this be more corruption resistant?
{quote}
I'm imagining a situation where you have part of an Avro Object container file
minus the header/footer metablock, either because of data loss or because you
subscribed to a data stream in "real-time" midstream. In that situation,
determining the random 16-byte sync marker would require some work (e.g.
finding recurring 16-byte values, searching for the string "schema" and working
back, etc.). Having a constant sync value (with an escaped payload) makes this
recovery easier and the code a little cleaner. To be honest, this point is
weakened by the fact that we're not planning on streaming Object container
files anyway.
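To make the difference concrete, here is a rough Java sketch of both recovery approaches. The class and method names (SyncRecoverySketch, guessSyncMarker, nextZeroSync) are made up for illustration and are not part of Avro. The 16-byte guess is only a heuristic: it assumes the marker is the most frequently recurring 16-byte window in the fragment, which repetitive record data could easily violate.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not Avro code: contrasts guessing a random 16-byte
// sync marker with scanning for a constant one-byte sync of zero.
final class SyncRecoverySketch {
    static final int SYNC_LEN = 16;

    /**
     * Heuristic: returns the 16-byte window that occurs most often in the
     * fragment (at least twice), or null if no window repeats.
     */
    static byte[] guessSyncMarker(byte[] fragment) {
        Map<ByteBuffer, Integer> counts = new HashMap<>();
        ByteBuffer best = null;
        int bestCount = 1;
        for (int i = 0; i + SYNC_LEN <= fragment.length; i++) {
            // ByteBuffer equality and hashing are based on the remaining bytes.
            ByteBuffer window = ByteBuffer.wrap(fragment, i, SYNC_LEN).slice();
            int c = counts.merge(window, 1, Integer::sum);
            if (c > bestCount) {
                bestCount = c;
                best = window;
            }
        }
        if (best == null) {
            return null;
        }
        byte[] marker = new byte[SYNC_LEN];
        best.duplicate().get(marker);
        return marker;
    }

    /** With a constant zero sync byte, resynchronizing is a plain scan. */
    static int nextZeroSync(byte[] fragment, int from) {
        for (int i = from; i < fragment.length; i++) {
            if (fragment[i] == 0) {
                return i;
            }
        }
        return -1;
    }
}
```

The first method needs memory proportional to the fragment and can still guess wrong; the second needs no state at all, which is the "easier recovery" I have in mind.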
{quote}
3. How likely is a collision?
{quote}
Seems like this is a non-issue with a 16-byte sync value as it is now, but it's
always good to be future-proof. :)
I'm curious how the Java experts out there (I'm not one) feel about COBS in
Java. From Scott's comment, it sounds like byte stuffing in Java is a
non-starter.
There is code from Lawrence Berkeley Labs at
https://bosshog.lbl.gov/repos/java-u3/trunk/sea/src/gov/lbl/dsd/sea/nio/util/COBSCodec.java
that does COBS encoding in Java, with the following comment:
{code}
/* Performance Note: The JDK 1.5 server VM runs <code>decode(encode(src))</code>
* at about 125 MB/s throughput on a commodity PC (2 GHz Pentium 4). Encoding is
* the bottleneck, decoding is extremely cheap. Obviously, this is way more
* efficient than Base64 encoding or similar application level byte stuffing
* mechanisms.
*/
{code}
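For a sense of how little code is involved, below is a minimal Java sketch of the classic COBS scheme as described in the paper. It is not the LBL COBSCodec and not the block layout proposed in this issue; the names CobsSketch, encode and decode are my own, and no performance claim is attached to it.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical sketch of classic COBS (no trailing delimiter is written).
final class CobsSketch {

    /** Returns an encoding of src that contains no zero bytes. */
    static byte[] encode(byte[] src) {
        // Worst case: one extra code byte per 254 data bytes, plus one.
        byte[] dst = new byte[src.length + src.length / 254 + 1];
        int codeIdx = 0; // position reserved for the current group's code byte
        int out = 1;     // next free output position
        int code = 1;    // 1 + number of data bytes in the current group
        for (byte b : src) {
            if (b != 0) {
                dst[out++] = b;
                code++;
            }
            if (b == 0 || code == 0xFF) {
                // Close the group: a zero was seen or 254 data bytes were copied.
                dst[codeIdx] = (byte) code;
                codeIdx = out++;
                code = 1;
            }
        }
        dst[codeIdx] = (byte) code;
        return Arrays.copyOf(dst, out);
    }

    /** Reverses encode(); throws on a zero code byte or a truncated group. */
    static byte[] decode(byte[] enc) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(enc.length);
        int i = 0;
        while (i < enc.length) {
            int code = enc[i++] & 0xFF;
            if (code == 0 || i + code - 1 > enc.length) {
                throw new IllegalArgumentException("malformed COBS data");
            }
            out.write(enc, i, code - 1);
            i += code - 1;
            // Every group except a full 0xFF group, and except the last one,
            // stands for its data bytes followed by one zero.
            if (code != 0xFF && i < enc.length) {
                out.write(0);
            }
        }
        return out.toByteArray();
    }
}
```

The encoder only keeps three integers of state, and the decoder copies whole groups at a time, which matches the remark above that decoding is the cheap side.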
> Consistent Overhead Byte Stuffing (COBS) encoded block format for Object
> Container Files
> ----------------------------------------------------------------------------------------
>
> Key: AVRO-27
> URL: https://issues.apache.org/jira/browse/AVRO-27
> Project: Avro
> Issue Type: New Feature
> Components: spec
> Reporter: Matt Massie
>
> Object Container Files could use a 1 byte sync marker (set to zero) using
> zig-zag and COBS encoding within blocks to efficiently escape zeros from the
> record data.
> h4. Zig-Zag encoding
> With zig-zag encoding only the value of 0 (zero) gets encoded into a value
> with a single zero byte. This property means that we can write any non-zero
> zig-zag long inside a block without concern for creating an unintentional sync
> byte.
> h4. COBS encoding
> We'll use COBS encoding to ensure that all zeros are escaped inside the block
> payload. You can read http://www.sigcomm.org/sigcomm97/papers/p062.pdf for
> the details about COBS encoding.
> h1. Block Format
> All blocks start and end with a sync byte (set to zero) with a
> type-length-value format internally as follows:
> || name || format || length in bytes || value || meaning ||
> | sync | byte | 1 | always 0 (zero) | The sync byte serves as a clear marker for the start of a block |
> | type | zig-zag long | variable | must be non-zero | The type field expresses whether the block is for _metadata_ or _normal_ data. |
> | length | zig-zag long | variable | must be non-zero | The length field expresses the number of bytes until the next block (including the cobs code and sync byte). Useful for skipping ahead to the next block. |
> | cobs_code | byte | 1 | see COBS code table below | Used in escaping zeros from the block payload |
> | payload | cobs-encoded | Greater than or equal to zero | all non-zero bytes | The payload of the block |
> | sync | byte | 1 | always 0 (zero) | The sync byte serves as a clear marker for the end of the block |
> h2. COBS code table
> || Code || Followed by || Meaning ||
> | 0x00 | (not applicable) | (not allowed) |
> | 0x01 | nothing | Empty payload followed by the closing sync byte |
> | 0x02 | one data byte | The single data byte, followed by the closing sync byte |
> | 0x03 | two data bytes | The pair of data bytes, followed by the closing sync byte |
> | 0x04 | three data bytes | The three data bytes, followed by the closing sync byte |
> | n | (n-1) data bytes | The (n-1) data bytes, followed by the closing sync byte |
> | 0xFD | 252 data bytes | The 252 data bytes, followed by the closing sync byte |
> | 0xFE | 253 data bytes | The 253 data bytes, followed by the closing sync byte |
> | 0xFF | 254 data bytes | The 254 data bytes *not* followed by a zero. |
> (taken from http://www.sigcomm.org/sigcomm97/papers/p062.pdf)
> h1. Encoding
> Only the block writer needs to perform byte-by-byte processing to encode the
> block. The overhead for COBS encoding is very small in terms of the
> in-memory state required.
> h1. Decoding
> Block readers are not required to do as much byte-by-byte processing as a
> writer. The reader could (for example) find a _metadata_ block by doing the
> following:
> # Search for a zero byte in the file which marks the start of a block
> # Read and zig-zag decode the _type_ of the block
> #* If the block is _normal_ data, read the _length_, seek ahead to the next block and go to step #2 again
> #* If the block is a _metadata_ block, cobs decode the data
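Step 2 above, and the zig-zag property the description relies on, can be sketched in Java as follows. This assumes the _type_ and _length_ fields use Avro's usual long encoding (zig-zag followed by a base-128 varint); the class name ZigZagSketch and its methods are hypothetical, and a real reader would also need to track how many bytes readLong consumed.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch of zig-zag varint longs, not taken from the Avro source.
final class ZigZagSketch {

    /** Zig-zag encodes n, then writes it as a little-endian base-128 varint. */
    static byte[] writeLong(long n) {
        long z = (n << 1) ^ (n >> 63); // 0, -1, 1, -2, ... become 0, 1, 2, 3, ...
        ByteArrayOutputStream out = new ByteArrayOutputStream(10);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80)); // continuation bit set: never 0x00
            z >>>= 7;
        }
        out.write((int) z); // final byte is 0x00 only when the value itself is 0
        return out.toByteArray();
    }

    /** Reads one zig-zag varint long starting at offset. */
    static long readLong(byte[] buf, int offset) {
        long z = 0;
        int shift = 0;
        int i = offset;
        while (true) {
            if (i >= buf.length || shift > 63) {
                throw new IllegalArgumentException("truncated or oversized varint");
            }
            int b = buf[i++] & 0xFF;
            z |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break;
            }
            shift += 7;
        }
        return (z >>> 1) ^ -(z & 1); // undo the zig-zag mapping
    }
}
```

Because every byte before the last carries the continuation bit and the last byte holds the highest non-zero group, only the value 0 ever produces a 0x00 byte, so non-zero _type_ and _length_ fields cannot be mistaken for a sync byte.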