[ https://issues.apache.org/jira/browse/AVRO-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768975#action_12768975 ]
Scott Carey commented on AVRO-160:
----------------------------------

{quote}For mapreduce, we need to be able to seek to an arbitrary point in the file, then scan to the next sync point and start reading the file. That's mostly what I mean by random access.{quote}

Ok, I misinterpreted. I'll call that "seek and scan" for the rest of this comment, as opposed to random access, which I interpret as "go to tuple # 655321" or "read the first tuple following location X".

This is also related to the limitation that all schemas in the file must be representable in one big union schema. If the only requirement to read a tuple is that the reader knows the schema in the prior metadata block, then what can be stored in one file is less restricted.

{quote}
It should also be possible to layer indexes on top of this, to support random access by key. Indexes might be stored as side files, or perhaps in the file's metadata. To support these, it should be possible to ask, while writing, the position of the current block start, so that one may store that in an index and subsequently seek to it, then scan the block for the desired entry.
{quote}

I agree. It is useful to leave open the option for index-type metadata in the metadata block. I'll add that the metadata block might also contain an index into that block to avoid scanning it (for large blocks). Unfortunately, to do this with streaming writes, the metadata block with the index must be _after_ the block. So, perhaps the metadata block needs two types of metadata: that which describes the previous block(s) and that which describes the next one? This is where I start to wonder if serving too many needs in one file type is the right choice.

{quote}
Let me elaborate on my last proposal.
{quote}

I like it, but if we ever want true optimized random access (perhaps not) it would have to change or we would need side files.

{quote}
I think it still may make sense to flush metadata at the end of the file.
It may no longer contain the schema, but it can contain things like counts and indexes. Streaming applications would not be able to use this, but other applications might find it very useful. Side files in HDFS are expensive.
{quote}

It definitely makes sense to flush some metadata at the end, but much of that might be optional. One useful thing would be the following: when a file is closed, the last metadata block can contain the offset of each known metadata block. This allows MapReduce to not have to "seek and scan" but instead find the start of the metadata block nearest the HDFS block boundary. If counts are stored, it also allows basic random access by tuple number. Perhaps this is optional, but if it exists then the input splitter can split on those boundaries and avoid seeking. When the file is appended to, the writer can either copy forward this crude index or keep a reference to the prior "finish" metadata block.

Maybe a straightforward thing to do is to consider that each block in this file has a header, a data block, and a footer. The header has the schema of the tuples in the block and any other information required to read the block, like the compression codec, etc. The footer contains the tuple count, other optional info (like an index), and the length of the block. The sync marker is in every footer, and in the first block's header.

Ok, I think I'm done with my speculation for now :)

> file format should be friendly to streaming
> -------------------------------------------
>
> Key: AVRO-160
> URL: https://issues.apache.org/jira/browse/AVRO-160
> Project: Avro
> Issue Type: Improvement
> Components: spec
> Reporter: Doug Cutting
>
> It should be possible to stream through an Avro data file without seeking to the end.
> Currently the interpretation is that schemas written to the file apply to all entries before them.
> If this were changed so that they instead apply to all entries that follow, and the initial schema is written at the start of the file, then streaming could be supported.
> Note that the only change permitted to a schema as a file is written is, if it is a union, to add new branches at the end of that union. If it is not a union, no changes may be made. So it is still the case that the final schema in a file can read every entry in the file and thus may be used to randomly access the file.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
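
For illustration, the optional end-of-file index described in the comment above (the offset of each metadata block, plus tuple counts) could serve both the input splitter and lookup by tuple number roughly as follows. This is a minimal sketch in Python, not anything from the Avro codebase or spec: the `index` layout, the sample numbers, and the helper names `first_block_at_or_after` and `locate_tuple` are all hypothetical, and choosing the first block at or after a boundary is just one possible splitting rule.

```python
from bisect import bisect_left, bisect_right
from itertools import accumulate

# Hypothetical index flushed in the final metadata block: one entry per
# block, as (byte offset of the block's header, tuple count of the block).
# The layout and the numbers are assumptions made for this sketch.
index = [(0, 100), (4096, 250), (9000, 50), (15000, 300)]

offsets = [off for off, _ in index]
# cumulative[i] = total number of tuples in blocks 0..i
cumulative = list(accumulate(count for _, count in index))


def first_block_at_or_after(position):
    """Return the offset of the first block starting at or after `position`,
    or None if there is none. An input splitter could use this to find its
    starting block without scanning for a sync marker."""
    i = bisect_left(offsets, position)
    return offsets[i] if i < len(offsets) else None


def locate_tuple(n):
    """Map a 0-based tuple number to (block offset, position within block)."""
    if not 0 <= n < cumulative[-1]:
        raise IndexError(n)
    i = bisect_right(cumulative, n)
    tuples_before = cumulative[i - 1] if i else 0
    return offsets[i], n - tuples_before
```

With an index like this, a reader still has to scan within the chosen block to reach the tuple, which matches the "basic" random access described in the comment; a per-block index in the footer would be needed to avoid that scan.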