[jira] Commented: (AVRO-160) file format should be friendly to streaming

Scott Carey (JIRA) Thu, 22 Oct 2009 16:46:24 -0700

    [ 
https://issues.apache.org/jira/browse/AVRO-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768975#action_12768975
 ]


Scott Carey commented on AVRO-160:
----------------------------------

{quote}For mapreduce, we need to be able to seek to an arbitrary point in the 
file, then scan to the next sync point and start reading the file. That's 
mostly what I mean by random access.{quote}

OK, I misinterpreted.  I'll call that "seek and scan" for the rest of this 
comment, as opposed to random access, which I interpret as "go to tuple # 
655321" or "read the first tuple following location X".  This is also related 
to the limitation that all schemas in the file must be representable in one big 
union schema.  If the only requirement for reading a tuple is that the reader 
knows the schema in the prior metadata block, then what can be stored in one 
file is less restricted.
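
As a rough illustration of "seek and scan", the sketch below seeks to an 
arbitrary offset and reads forward until it finds a sync marker. The marker 
value, the function name and the chunk size are all assumptions made for the 
example; none of them come from the Avro spec or code.

```python
import io

# Hypothetical 16-byte sync marker; a real file would use its own value.
SYNC = b"0123456789abcdef"

def scan_to_next_sync(stream, start, sync=SYNC, chunk_size=4096):
    """Seek to `start`, then scan forward for the next sync marker.

    Returns the offset of the first byte after the marker, or None if no
    marker is found before end of stream.
    """
    stream.seek(start)
    offset = start          # file offset of the chunk about to be read
    tail = b""              # carry-over so a marker split across chunks is found
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return None
        window = tail + chunk
        idx = window.find(sync)
        if idx != -1:
            # window begins at file offset (offset - len(tail))
            return offset - len(tail) + idx + len(sync)
        tail = window[-(len(sync) - 1):] if len(sync) > 1 else b""
        offset += len(chunk)
```

A reader handed a split that begins at byte X would call 
scan_to_next_sync(f, X) and start decoding from the returned offset.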

{quote}
It should also be possible to layer indexes on top of this, to support random 
access by key. Indexes might be stored as side files, or perhaps in the file's 
metadata. To support these, it should be possible to ask, while writing, the 
position of the current block start, so that one may store that in an index and 
subsequently seek to it, then scan the block for the desired entry.
{quote}
I agree.  It is useful to leave open the option for index-type metadata in the 
metadata block.  I'll add that the metadata block might also contain an index 
into the data block, to avoid scanning it (for large blocks).  Unfortunately, 
to do this with streaming writes, the metadata block with the index must come 
_after_ the block it indexes.  So perhaps the metadata block needs two types of 
metadata: one that describes the previous block(s) and one that describes the 
next block?

This is where I start to wonder if serving too many needs in one file type is 
the right choice.

{quote}
Let me elaborate on my last proposal. 
{quote}

I like it, but if we ever want truly optimized random access (perhaps we 
don't), it would have to change or we would need side files.


{quote}
I think it still may make sense to flush metadata at the end of the file. It 
may no longer contain the schema, but it can contain things like counts and 
indexes. Streaming applications would not be able to use this, but other 
applications might find it very useful. Side files in HDFS are expensive.{quote}

It definitely makes sense to flush some metadata at the end, but much of that 
might be optional.

One useful addition, described below, would let MapReduce skip "seek and scan" 
and instead find the start of the metadata block nearest the HDFS block 
boundary. If counts are stored, it would also allow basic random access by 
tuple number.

When a file is closed, the last metadata block can contain the offset of each 
known metadata block.  Perhaps this is optional, but if it exists then the 
input splitter can split on those boundaries and avoid scanning.  When the file 
is appended to, the writer can either copy this crude index forward or keep a 
reference to the prior "finish" metadata block.
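
A minimal sketch of how such an index might be used, assuming the final 
metadata block yields a sorted list of metadata block offsets plus a per-block 
tuple count. Both function names are hypothetical, and "nearest" is taken here 
to mean the first offset at or after the split boundary, which is only one 
possible reading.

```python
from bisect import bisect_left, bisect_right
from itertools import accumulate

def align_split(offsets, position):
    """Return the first metadata block offset at or after `position`.

    `offsets` is the sorted list of metadata block offsets. Returns None
    when no block starts at or after `position`.
    """
    i = bisect_left(offsets, position)
    return offsets[i] if i < len(offsets) else None

def locate_tuple(offsets, counts, n):
    """Map 0-based tuple number `n` to (block offset, index within block).

    `counts[i]` is the number of tuples in the block at `offsets[i]`.
    Returns None if the file holds fewer than n + 1 tuples.
    """
    cumulative = list(accumulate(counts))
    i = bisect_right(cumulative, n)
    if i >= len(offsets):
        return None
    before = cumulative[i - 1] if i else 0
    return offsets[i], n - before
```

With the index in hand, the splitter needs only a binary search per split, 
and random access by tuple number costs one seek plus a scan within a single 
block.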

Maybe a straightforward approach is to consider that each block in this file 
has a header, a data block, and a footer.  The header has the schema of the 
tuples in the block and any other information required to read the block, like 
the compression codec, etc.  The footer contains the tuple count and other 
optional info (like an index) and the length of the block.  The sync marker is 
in every footer, and in the first block's header.
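
To make the header / data / footer idea concrete, here is a small sketch of a 
streaming writer. The byte layout is invented for illustration (JSON header, 
length-prefixed tuples, fixed-size footer) and is not a proposal for the 
actual encoding. SYNC, FOOTER, write_block and read_last_footer are all 
hypothetical names.

```python
import io
import json
import struct

SYNC = b"0123456789abcdef"       # hypothetical 16-byte sync marker
FOOTER = struct.Struct(">IQ")    # assumed footer fields: tuple count, data length

def write_block(out, schema, tuples, first=False):
    """Stream one block: [sync, if first block] header, data, footer."""
    header = json.dumps({"schema": schema, "codec": "null"}).encode("utf-8")
    if first:
        out.write(SYNC)          # sync marker in the first block's header
    out.write(struct.pack(">I", len(header)) + header)
    count = 0
    data_len = 0
    for t in tuples:             # data is written as it arrives
        rec = struct.pack(">I", len(t)) + t
        out.write(rec)
        count += 1
        data_len += len(rec)
    # Count and length are known only now, so they go in the footer.
    out.write(FOOTER.pack(count, data_len) + SYNC)

def read_last_footer(raw):
    """Parse the footer at the end of `raw`; return (tuple_count, data_length)."""
    if not raw.endswith(SYNC):
        raise ValueError("no sync marker at end of block")
    start = len(raw) - len(SYNC) - FOOTER.size
    return FOOTER.unpack(raw[start:start + FOOTER.size])

buf = io.BytesIO()
write_block(buf, '"string"', [b"a", b"bc"], first=True)
print(read_last_footer(buf.getvalue()))  # (2, 11)
```

Because the footer has a fixed size and ends with the sync marker, a reader 
that lands just past a marker can step backwards to the count and length of 
the block that precedes it.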

Ok, I think I'm done with my speculation for now :)


> file format should be friendly to streaming
> -------------------------------------------
>
>                 Key: AVRO-160
>                 URL: https://issues.apache.org/jira/browse/AVRO-160
>             Project: Avro
>          Issue Type: Improvement
>          Components: spec
>            Reporter: Doug Cutting
>
> It should be possible to stream through an Avro data file without seeking to 
> the end.
> Currently the interpretation is that schemas written to the file apply to all 
> entries before them.  If this were changed so that they instead apply to all 
> entries that follow, and the initial schema is written at the start of the 
> file, then streaming could be supported.
> Note that the only change permitted to a schema as a file is written is, 
> if it is a union, to add new branches at the end of that union.  If it is not 
> a union, no changes may be made.  So it is still the case that the final 
> schema in a file can read every entry in the file and thus may be used to 
> randomly access the file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
