On Mar 31, 2010, at 5:23 PM, Scott Banachowski wrote:

> Hi,
>
> I'm looking at the spec for the container file, and have 2 questions:
>
> The map of metadata key/value pairs begins with a long, then a number of
> string-key/bytes-value pairs. To be consistent with avro maps, should this
> be followed by a long of 0? The spec doesn't say explicitly, but if the
> header is described by an avro schema I would suspect yes.
>
The Java code for the file uses the Avro binary encoder for the map, so it could be defined by an Avro schema:

----------
vout.writeMapStart();                  // write metadata
vout.setItemCount(meta.size());
for (Map.Entry<String,byte[]> entry : meta.entrySet()) {
  vout.startItem();
  vout.writeString(entry.getKey());
  vout.writeBytes(entry.getValue());
}
vout.writeMapEnd();
vout.flush();  // vout may be buffered; flush before writing to out
----------

> Are the longs that describe the file block varint longs? Or 64-bit longs?
> I assume avro varints. But if so, if you ever wanted to expand the size of
> block by writing more objects to it, you'd be in trouble because you'd
> potentially be unable to fit the new size in the varint's location.
>

These are Avro-encoded (zigzag varint) longs. A block cannot be lengthened in place; one has to know the number of objects and the size of the block before writing it to the file. However, since HDFS is write-once, resizing a block is not possible for a key use case no matter how the format is designed. Anything other than sequential writes is also dangerous for data integrity without great care.

In the Java code, objects are encoded into a byte-array block buffer before the block's bytes are copied to the file. The format's default block size is 16000 bytes, and it is probably most efficient between 1 KB and 64 KB. A file format optimized for very large blocks would differ, and any format intended for random access or in-place modification would necessarily be designed differently. This one best matches streams of small (< 100 byte) to medium (< 4 KB) records, and is built to serve the Hadoop use case.

> Also, I looked around the repo for some example container files, but didn't
> see any. Are there any examples checked in that we can use to examine their
> layout and test our readers?
>
> thanks,
> Scott
>
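To make both answers concrete, here is a minimal sketch in plain Java (stdlib only, not the actual Avro library classes) of the wire format described above: the metadata map is a zigzag-varint item count, string-key/bytes-value pairs, then a terminating varint 0, per the standard Avro map encoding. The `writeLong` helper also illustrates why a length prefix cannot grow in place: larger values need more bytes. Class and method names here are hypothetical, for illustration only.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;

public class MetaMapSketch {
    // Avro's zigzag varint long encoding: small magnitudes use few bytes,
    // so e.g. 63 fits in 1 byte but 64 needs 2 -- a rewritten, larger
    // block length may not fit in the space the old varint occupied.
    static void writeLong(ByteArrayOutputStream out, long n) {
        long z = (n << 1) ^ (n >> 63);       // zigzag-map to unsigned
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80)); // continuation bit set
            z >>>= 7;
        }
        out.write((int) z);
    }

    static void writeBytes(ByteArrayOutputStream out, byte[] b) {
        writeLong(out, b.length);            // length prefix
        out.write(b, 0, b.length);
    }

    static byte[] encodeMeta(Map<String, byte[]> meta) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeLong(out, meta.size());         // item count of the map block
        for (Map.Entry<String, byte[]> e : meta.entrySet()) {
            writeBytes(out, e.getKey().getBytes(StandardCharsets.UTF_8));
            writeBytes(out, e.getValue());
        }
        writeLong(out, 0);                   // trailing zero ends the map
        return out.toByteArray();
    }
}
```

Encoding a one-entry map `{"a": {0x01}}` this way yields 7 bytes: `02` (count 1, zigzagged), `02 61` (key), `02 01` (value), and the final `00` terminator that the question asks about.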