[ https://issues.apache.org/jira/browse/AVRO-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12769519#action_12769519 ]
Doug Cutting commented on AVRO-160:
-----------------------------------

> At what size does the metadata block overhead represent larger overhead than
> storing extra information per tuple like Thrift/Protobuf?

That's hard to calculate, but I think we're a ways from that, especially if we
write the schema in binary.

> At what size does compression become less effective?

I think we've found that beyond ~64k the compression ratio typically does not
improve significantly.

> Are larger blocks better for streaming read/write performance?

In, e.g., mapreduce, we stream through a series of blocks, so we're still
sequentially accessing ~64MB chunks, regardless of the compression-block size.

> file format should be friendly to streaming
> -------------------------------------------
>
>                 Key: AVRO-160
>                 URL: https://issues.apache.org/jira/browse/AVRO-160
>             Project: Avro
>          Issue Type: Improvement
>          Components: spec
>            Reporter: Doug Cutting
>
> It should be possible to stream through an Avro data file without seeking to
> the end.
> Currently the interpretation is that schemas written to the file apply to all
> entries before them. If this were changed so that they instead applied to all
> entries that follow, and the initial schema were written at the start of the
> file, then streaming could be supported.
> Note that the only change permitted to a schema while a file is being written
> is, if the schema is a union, to add new branches at the end of that union.
> If it is not a union, no changes may be made. So it is still the case that
> the final schema in a file can read every entry in the file and thus may be
> used to randomly access the file.
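[Editor's note] To make the streaming proposal concrete, here is a minimal
sketch using the Avro Java API as it exists in later releases, which adopted
this header-first layout: DataFileStream reads the writer's schema from the
front of the stream and then iterates forward, never seeking. The file path
argument is illustrative; this is not code from the original thread.

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.avro.file.DataFileStream;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    public class StreamRecords {
      public static void main(String[] args) throws Exception {
        // A forward-only stream: no seek to the end of the file is needed,
        // because the writer schema is stored in the file header.
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]));
             DataFileStream<GenericRecord> stream =
                 new DataFileStream<>(in, new GenericDatumReader<GenericRecord>())) {
          System.out.println("writer schema: " + stream.getSchema());
          for (GenericRecord record : stream) {
            System.out.println(record);
          }
        }
      }
    }

Because the schema precedes the data, the same loop works over a pipe or
socket, not just a seekable file.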
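The append-only union rule can be illustrated with plain Avro schema JSON; the
sketch below (assuming the Schema.Parser API from later Avro releases, purely
for illustration) shows the one legal mid-file evolution. Union branches are
encoded by index, so appending a branch leaves all existing indices stable,
which is why the final schema can still read every earlier entry.

    import org.apache.avro.Schema;

    public class UnionEvolution {
      public static void main(String[] args) {
        // Schema in effect at the start of the file: a union of two branches.
        Schema initial = new Schema.Parser().parse(
            "[\"string\", \"int\"]");

        // The only permitted mid-file change: a new branch appended at the
        // end. Branch indices 0 and 1 are unchanged, so entries written
        // under the initial schema still decode under the extended one.
        Schema extended = new Schema.Parser().parse(
            "[\"string\", \"int\", \"double\"]");

        System.out.println(initial);
        System.out.println(extended);
      }
    }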
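The ~64k observation is easy to spot-check outside Avro. The sketch below is a
self-contained microbenchmark, not Avro code; the input generator and the size
range are invented for illustration. It deflates the same buffer at increasing
block sizes; since deflate uses a 32 KB window, the ratio on inputs like this
flattens out well before 1 MB, though the exact crossover depends on the codec
and the data.

    import java.util.zip.Deflater;

    public class BlockSizeRatio {
      public static void main(String[] args) {
        byte[] data = sampleData(1 << 22); // 4 MB of mildly repetitive input
        for (int block = 1 << 12; block <= 1 << 20; block <<= 1) {
          long compressed = 0;
          // Compress the buffer in independent blocks of the given size,
          // mimicking per-block compression in a container file.
          for (int off = 0; off < data.length; off += block) {
            int len = Math.min(block, data.length - off);
            Deflater d = new Deflater();
            d.setInput(data, off, len);
            d.finish();
            byte[] out = new byte[len + 64];
            while (!d.finished()) {
              compressed += d.deflate(out);
            }
            d.end();
          }
          System.out.printf("block=%7d  ratio=%.3f%n",
              block, (double) compressed / data.length);
        }
      }

      // Hypothetical test input: a repeated short phrase, so it compresses.
      private static byte[] sampleData(int n) {
        byte[] b = new byte[n];
        for (int i = 0; i < n; i++) {
          b[i] = (byte) ("avro block size test ".charAt(i % 21));
        }
        return b;
      }
    }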