[ 
https://issues.apache.org/jira/browse/AVRO-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790405#action_12790405
 ] 

Scott Carey commented on AVRO-160:
----------------------------------

I agree, a simple format for 80%+ of the use cases is a good thing.  It 
alleviates my prior feeling that this might be trying to do "too many things".  
In the future, another format (or a slight variation on this one) might 
support appends with schema changes or compression codec changes in a much 
simpler way than the earlier design in this ticket (for example, storing the 
schema only when it changes, with an index in a footer?).

The last thing I want to clarify is the sync marker behavior.  Even with a 
16-byte marker there needs to be well-defined behavior for collisions and 
disambiguation.  Is there documentation on this, or is it only in the code?
Eight bytes or fewer might be plenty, depending on how the sync marker 
behavior on writes and reads is defined.

There are several approaches to this:

The first increases the cost and complexity of writing but makes marker 
identification unambiguous.
While writing, make sure no sequence of bytes matches the marker; if there is 
a match, follow it with an "is_literal" marker byte that cannot occur in what 
normally follows the sync marker (if what follows is a count of block entries, 
then encode -1?).
The marker could even be 4 bytes with this approach.  Detecting a collision on 
output and inserting the literal marker byte may not be trivial, however, and 
will surely add overhead.  But it will make seeking to block boundaries clear 
and keep the error handling code on the reading side simple.
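
A rough sketch of this first approach, in Python, to make the idea concrete.  
It is not taken from the Avro code.  The marker value, the 0x01 flag byte and 
the function names are all made up for illustration.  It assumes the byte 
after a real marker starts a non-negative entry count written as a zigzag 
varint (so its first byte is always even, and 0x01, the encoding of -1, can 
never start one), and that the flag byte does not occur inside the marker.  It 
only treats one payload in isolation; a match that straddles the end of a 
payload and the start of the following real marker is not handled.

```python
# Hypothetical 16-byte sync marker; a real one would be chosen per file.
MARKER = bytes.fromhex("a7c3f09e5d2b48e6b1147f6c8d3a9e52")

# Assumed "is_literal" flag: 0x01 is the zigzag varint encoding of -1.
IS_LITERAL = b"\x01"


def escape_payload(payload: bytes, marker: bytes = MARKER) -> bytes:
    """Follow every marker-like byte run in the payload with the flag byte."""
    if IS_LITERAL in marker:
        # The simple replace below is only unambiguous under this assumption.
        raise ValueError("marker must not contain the is_literal byte")
    return payload.replace(marker, marker + IS_LITERAL)


def unescape_payload(escaped: bytes, marker: bytes = MARKER) -> bytes:
    """Drop the flag byte that the writer inserted after each literal match."""
    return escaped.replace(marker + IS_LITERAL, marker)


def find_boundaries(stream: bytes, marker: bytes = MARKER) -> list:
    """Return offsets of markers that are NOT flagged as literal data."""
    boundaries = []
    pos = stream.find(marker)
    while pos != -1:
        after = stream[pos + len(marker): pos + len(marker) + 1]
        if after != IS_LITERAL:
            boundaries.append(pos)
        pos = stream.find(marker, pos + len(marker))
    return boundaries
```

The writer pays for a scan of every payload, which is the overhead mentioned 
above, but the reader can decide what a marker means by looking at one byte.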


Another way is to write blindly and then have well-defined behavior for 
detecting the various types of corruption that are possible when one assumes 
the data after the marker is a valid header, and for what to do when that 
happens.  Although improbable, it is possible for random data to mimic a block 
header, and for errors to be detected only when attempting to deserialize 
entries.  There are several cases to handle when disambiguating a corrupted 
block from a normal block that happens to have the sync marker in its data.  
One doesn't want to accidentally skip a block without reporting corruption, or 
fail to read because of a collision.
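
To illustrate the kind of reader-side rules this second approach needs, here 
is a small Python sketch.  Everything about the layout is an assumption made 
for the example, not the format proposed in this ticket: each block is a 
marker, then a fixed-width header (entry count and payload size as two 
big-endian 32-bit integers), then the payload.  The names write_block and 
scan_blocks and the status strings are hypothetical.  The reader accepts a 
candidate only if the size in its header lands exactly on another marker or on 
end of file; anything else is reported as "suspect" rather than skipped 
silently.

```python
import struct

# Hypothetical 16-byte sync marker and fixed-width header (count, size).
MARKER = bytes.fromhex("a7c3f09e5d2b48e6b1147f6c8d3a9e52")
HEADER = struct.Struct(">II")


def write_block(count: int, payload: bytes) -> bytes:
    """Write blindly: no check for marker bytes inside the payload."""
    return MARKER + HEADER.pack(count, len(payload)) + payload


def scan_blocks(stream: bytes) -> list:
    """Return (offset, status) for every marker candidate that is examined."""
    results = []
    pos = stream.find(MARKER)
    while pos != -1:
        header_start = pos + len(MARKER)
        header_end = header_start + HEADER.size
        if header_end > len(stream):
            results.append((pos, "truncated"))
            break
        _count, size = HEADER.unpack_from(stream, header_start)
        end = header_end + size
        # Plausible only if the block ends at EOF or at the next marker.
        if end == len(stream) or stream[end:end + len(MARKER)] == MARKER:
            results.append((pos, "block"))
            pos = end if end < len(stream) else -1
        else:
            # Either a collision inside data or a corrupted block: report it.
            results.append((pos, "suspect"))
            pos = stream.find(MARKER, pos + 1)
    return results
```

Reading from the start never examines a marker inside a payload, because the 
reader jumps by the declared size.  The ambiguity only shows up after a seek 
or after corruption, which is exactly where the specification needs to say 
what a reader should do with a "suspect" candidate.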

Additionally, if the marker (start of block) were aligned, it would speed up 
marker detection on both the writer and reader sides and lessen the collision 
probability slightly (by a factor of the alignment width).
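
A back-of-the-envelope way to compare marker sizes and alignment, as a Python 
sketch.  It assumes the data bytes are uniformly random and independent, which 
real data is not, so treat the result as an order-of-magnitude estimate only.  
The function name and parameters are made up for this example.

```python
def expected_false_matches(data_bytes: int, marker_bytes: int,
                           alignment: int = 1) -> float:
    """Expected count of accidental marker matches in random data.

    Counts the aligned start offsets where a full marker could fit and
    multiplies by the chance that random bytes equal the marker there.
    """
    if data_bytes < marker_bytes:
        return 0.0
    # Offsets p with p % alignment == 0 and p + marker_bytes <= data_bytes.
    positions = (data_bytes - marker_bytes) // alignment + 1
    return positions / 256 ** marker_bytes


if __name__ == "__main__":
    one_tib = 2 ** 40
    for size in (4, 8, 16):
        for align in (1, 8):
            estimate = expected_false_matches(one_tib, size, align)
            print(f"marker={size:2d}B align={align}: {estimate:.3g}")
```

Under these assumptions a 4-byte marker collides hundreds of times per 
terabyte, so it only makes sense with the escaping approach, while an 8-byte 
marker already makes an accidental match very unlikely.  Alignment divides the 
estimate by roughly the alignment width.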


There are a lot of options for dealing with the sync marker and its collision 
and disambiguation behavior.  I feel that whatever is chosen needs to be well 
defined in a specification.


> file format should be friendly to streaming
> -------------------------------------------
>
>                 Key: AVRO-160
>                 URL: https://issues.apache.org/jira/browse/AVRO-160
>             Project: Avro
>          Issue Type: Improvement
>          Components: spec
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
>
> It should be possible to stream through an Avro data file without seeking to 
> the end.
> Currently the interpretation is that schemas written to the file apply to all 
> entries before them.  If this were changed so that they instead apply to all 
> entries that follow, and the initial schema is written at the start of the 
> file, then streaming could be supported.
> Note that the only change permitted to a schema as a file is written is, if 
> it is a union, to add new branches at the end of that union.  If it is not 
> a union, no changes may be made.  So it is still the case that the final 
> schema in a file can read every entry in the file and thus may be used to 
> randomly access the file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
