I was reading Owen's presentation at Hadoop Summit on ORC. Slide #14 describes how codecs are used for generic compression.
I think we can adopt some of their ideas in HFile v3.

Cheers

On Fri, Jul 19, 2013 at 9:48 AM, Andrew Purtell <apurt...@apache.org> wrote:

> On Fri, Jul 19, 2013 at 4:23 AM, Jean-Marc Spaggiari <
> jean-m...@spaggiari.org> wrote:
>
> > If tags are activated but empty, is it going to be the same thing?
> > Or are we going to have all the tags overhead? Like can we have a
> > byte to say "no tags in that file" in addition to "tags are
> > activated for that file"?
>
> This reminds me of an interesting discussion we had. So like with
> memstoreTS, if we determine that no cells in a file have tags (or
> timestamps) then we can flag that in file metadata and turn off any
> related persistence when writing out the data blocks. With millions of
> KVs in a file that can achieve substantial space savings.
>
> Having a new file format on the table also opens up possibilities like
> block headers: an N-byte structure (where N is something like 4 or 8
> bytes maybe) at the start of each block that describes the encoding
> strategy taken for the block: whether tags are present or not, if we
> used FAST_DIFF, or some new packing together of related values (we put
> the keys up front with one or two byte pointers into the block where
> their values are, de-dup values in the latter part of the block), or a
> dictionary scheme (and with which dictionary in what meta block) etc.
> We might borrow ideas from Parquet or ORC. We can stop serializing
> HFile blocks as individual cells into streams and look at them as a
> group of cells to write into a bytebuffer, providing a lot more
> freedom for efficiently structuring the internal details of the block.
>
> Let me make sure this point makes it out into the public discussion,
> to highlight the additional benefit of having an experimental file
> format available in the 0.96 cycle - it's a place where we and users
> can go off on new directions far beyond inline tags. Of course such
> changes in unreleased trunk code could make that possible too, but
> what I have observed is "professional" HBase devs are much more likely
> to look at trunk than a user. Users really want to work on and
> contribute a patch for what they are running in production. Consider
> recent contributions from Yahoo and Taobao as an example of what I
> mean.
>
> The bar for putting something into V2 is extremely high as it should
> be on account of how performance critical that code is. I'm not
> suggesting less rigor for V3, what I am suggesting is V3 can provide
> design freedom by going in different directions than the legacy V2
> code.
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
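
To make the block-header idea quoted above a bit more concrete, here is a minimal sketch, assuming a 4-byte layout of one flags byte, one encoding-id byte, and a two-byte meta block index for the dictionary case. Everything in it (the class name BlockHeaderSketch, the Encoding values, the bit assignment) is hypothetical and for illustration only; none of it is existing HBase or HFile code, and a real V3 design might well choose a different layout.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of a fixed-size per-block header; not HBase code.
class BlockHeaderSketch {
    // Illustrative encoding strategies named after the ones discussed in the thread.
    enum Encoding { NONE, FAST_DIFF, KEYS_UP_FRONT, DICTIONARY }

    static final int HEADER_SIZE = 4;          // assumed N = 4 bytes
    static final int FLAG_TAGS_PRESENT = 0x01; // assumed bit 0 of the flags byte

    final boolean tagsPresent;
    final Encoding encoding;
    final int dictionaryMetaBlock; // 0..65535; only meaningful for DICTIONARY

    BlockHeaderSketch(boolean tagsPresent, Encoding encoding, int dictionaryMetaBlock) {
        if (dictionaryMetaBlock < 0 || dictionaryMetaBlock > 0xFFFF) {
            throw new IllegalArgumentException("meta block index must fit in 2 bytes");
        }
        this.tagsPresent = tagsPresent;
        this.encoding = encoding;
        this.dictionaryMetaBlock = dictionaryMetaBlock;
    }

    // Write the 4 header bytes at the buffer's current position.
    void writeTo(ByteBuffer block) {
        block.put((byte) (tagsPresent ? FLAG_TAGS_PRESENT : 0));
        block.put((byte) encoding.ordinal());
        block.putShort((short) dictionaryMetaBlock);
    }

    // Read the 4 header bytes back; a reader would then pick a decoder
    // for the rest of the block based on the returned fields.
    static BlockHeaderSketch readFrom(ByteBuffer block) {
        int flags = block.get() & 0xFF;
        int encodingId = block.get() & 0xFF;
        int metaBlock = block.getShort() & 0xFFFF;
        return new BlockHeaderSketch(
            (flags & FLAG_TAGS_PRESENT) != 0,
            Encoding.values()[encodingId],
            metaBlock);
    }

    public static void main(String[] args) {
        ByteBuffer block = ByteBuffer.allocate(64);
        new BlockHeaderSketch(false, Encoding.FAST_DIFF, 0).writeTo(block);
        block.flip();
        BlockHeaderSketch header = readFrom(block);
        // A block flagged "no tags" lets the reader skip tag parsing for every cell in it.
        System.out.println(header.tagsPresent + " " + header.encoding
            + " " + header.dictionaryMetaBlock);
    }
}
```

The point of the sketch is only that the reader learns up front, for 4 bytes per block, whether tags were persisted and which layout follows, so the per-cell overhead Jean-Marc asked about can be skipped for blocks that carry no tags.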