[jira] Commented: (AVRO-160) file format should be friendly to streaming

Philip Zeyliger (JIRA) Thu, 22 Oct 2009 11:28:24 -0700

    [ 
https://issues.apache.org/jira/browse/AVRO-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768806#action_12768806
 ]


Philip Zeyliger commented on AVRO-160:
--------------------------------------

Ok, that makes sense.

For some reason, I thought you could write AAAAAXBBBBBBY where records A are 
written with schema X, and then records B are written with schema Y, where X 
and Y are resolvable using schema resolution.  But that doesn't work: although 
X and Y may be resolvable, they may not have the same serialization.
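
To make the point concrete, here is a minimal sketch, not Avro's actual implementation. It assumes a toy encoder that writes fields in schema order, with longs as zigzag varints and strings as a length followed by UTF-8 bytes. The schemas SCHEMA_X and SCHEMA_Y and the helpers encode and decode are hypothetical names made up for this illustration.

```python
import io

# Toy schemas: ordered (field name, type) pairs. Y adds a field to X, the kind
# of change schema resolution can bridge when both schemas are at hand.
SCHEMA_X = [("id", "long")]
SCHEMA_Y = [("id", "long"), ("note", "string")]


def write_long(out, n):
    """Write a 64-bit integer as a zigzag varint."""
    n = (n << 1) ^ (n >> 63)
    while n & ~0x7F:
        out.write(bytes([(n & 0x7F) | 0x80]))
        n >>= 7
    out.write(bytes([n]))


def read_long(inp):
    """Read a zigzag varint written by write_long."""
    result = shift = 0
    while True:
        chunk = inp.read(1)
        if not chunk:
            raise EOFError("truncated varint")
        result |= (chunk[0] & 0x7F) << shift
        if not chunk[0] & 0x80:
            return (result >> 1) ^ -(result & 1)
        shift += 7


def encode(records, schema):
    """Serialize records back to back, with no schema or framing in between."""
    out = io.BytesIO()
    for record in records:
        for name, kind in schema:
            if kind == "long":
                write_long(out, record[name])
            else:  # "string"
                raw = record[name].encode("utf-8")
                write_long(out, len(raw))
                out.write(raw)
    return out.getvalue()


def decode(data, schema):
    """Deserialize using only the given schema (no writer schema available)."""
    inp = io.BytesIO(data)
    records = []
    while inp.tell() < len(data):
        record = {}
        for name, kind in schema:
            if kind == "long":
                record[name] = read_long(inp)
            else:  # "string"
                length = read_long(inp)
                raw = inp.read(length)
                if len(raw) != length:
                    raise EOFError("truncated string")
                record[name] = raw.decode("utf-8")
        records.append(record)
    return records


# Two records written with X ...
data = encode([{"id": 7}, {"id": 0}], SCHEMA_X)
right = decode(data, SCHEMA_X)   # two records, as written
wrong = decode(data, SCHEMA_Y)   # one bogus record: the second id is consumed
                                 # as the length of an empty "note" string
```

Reading the X-encoded bytes with Y alone does not even fail here; it silently merges two records into one, which is why the reader needs to know which schema each run of records was written with.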

So, it turns out there are two types of schema compatibility: writer-reader 
compatibility, which means that we can read when we have both schemas 
available, and writer-writer compatibility, which concerns whether we can read 
(or write) data with only one of the two schemas.  I don't like those names, 
though.

There's something appealing about writing the schema frequently.  You could 
also store an offset pointer to the schema in every block header, instead of 
the entire thing.
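
A rough sketch of the offset-pointer idea, under assumptions of my own: the layout below (a length-prefixed JSON schema, and a block header holding an 8-byte schema offset plus a 4-byte payload length) is hypothetical and is not the Avro file format. The function names are likewise made up.

```python
import io
import json
import struct

# Hypothetical block header: offset of the governing schema, payload length.
HEADER = struct.Struct(">QI")


def write_schema(out, schema):
    """Write a length-prefixed JSON schema; return the offset it starts at."""
    offset = out.tell()
    body = json.dumps(schema).encode("utf-8")
    out.write(struct.pack(">I", len(body)))
    out.write(body)
    return offset


def write_block(out, schema_offset, payload):
    """Write a block whose header points back at its schema; return its offset."""
    offset = out.tell()
    out.write(HEADER.pack(schema_offset, len(payload)))
    out.write(payload)
    return offset


def read_block(inp, block_offset):
    """Read one block and follow its pointer to fetch the schema."""
    inp.seek(block_offset)
    schema_offset, length = HEADER.unpack(inp.read(HEADER.size))
    payload = inp.read(length)
    inp.seek(schema_offset)
    (size,) = struct.unpack(">I", inp.read(4))
    schema = json.loads(inp.read(size).decode("utf-8"))
    return schema, payload


buf = io.BytesIO()
x = write_schema(buf, {"type": "long"})
block_a = write_block(buf, x, b"\x02")
y = write_schema(buf, {"type": "string"})
block_b = write_block(buf, y, b"\x04hi")
```

A reader that lands on block_b needs one backward seek to recover its schema, so this helps random access more than pure streaming; a streaming reader would simply remember the last schema it passed.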

What use cases are you thinking about?
 * Map/reduce outputs tend to be uniform, since it's unlikely that an M/R 
program changes its output in medias res.
 * Map/reduce inputs might be heterogeneous because you're combining logs from 
last year with logs from this year, though it's likely that individual files 
are homogeneous.  (And if you bother to combine files, you may as well do the 
schema resolution as part of the concatenation, and keep the new file 
homogeneous.)
 * HBase cells are not likely to use this format, but rather keep the schema 
per column.
 * An individual program's log files are likely to be homogeneous.  There's no 
harm in starting a new log file when you upgrade, rather than appending to the 
old one.

> file format should be friendly to streaming
> -------------------------------------------
>
>                 Key: AVRO-160
>                 URL: https://issues.apache.org/jira/browse/AVRO-160
>             Project: Avro
>          Issue Type: Improvement
>          Components: spec
>            Reporter: Doug Cutting
>
> It should be possible to stream through an Avro data file without seeking to 
> the end.
> Currently the interpretation is that schemas written to the file apply to all 
> entries before them.  If this were changed so that they instead apply to all 
> entries that follow, and the initial schema is written at the start of the 
> file, then streaming could be supported.
> Note that the only change permitted to a schema as a file is written is, if 
> it is a union, to add new branches at the end of that union.  If it is not 
> a union, no changes may be made.  So it is still the case that the final 
> schema in a file can read every entry in the file and thus may be used to 
> randomly access the file.
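
The rule in the description above (only append branches to a top-level union, otherwise no change) could be checked along these lines. This is a sketch that assumes schemas are held as parsed JSON, where a union is a list; is_permitted_change is a hypothetical helper, not part of any Avro API, and it treats an unchanged schema as trivially acceptable.

```python
def is_permitted_change(old, new):
    """Return True if `new` may replace `old` partway through a file.

    Permitted: no change at all, or a union (a JSON list) that keeps every
    existing branch in place and appends new branches at the end.
    """
    if old == new:
        return True
    if isinstance(old, list) and isinstance(new, list):
        return len(new) > len(old) and new[:len(old)] == old
    return False


appended = is_permitted_change(["null", "string"], ["null", "string", "long"])
reordered = is_permitted_change(["null", "string"], ["string", "null", "long"])
promoted = is_permitted_change("long", ["long", "string"])
```

Because existing branches keep their positions, the branch index written before the change still selects the same branch afterwards, which is what lets the final schema read every earlier entry.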

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
