[ https://issues.apache.org/jira/browse/HADOOP-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558846#action_12558846 ]
Bryan Duxbury commented on HADOOP-2604:
---------------------------------------

Here's some of the ideas we're tossing around as a starter:

* Exclude column family name from the file: Currently we store HStoreKeys, which are serialized to contain row, qualified cell name, and timestamp. However, seeing as how a given MapFile only ever belongs to one column family it's very wasteful to store the same column family name over and over again. In a custom implementation, we wouldn't have to save that data.

* Separate indices for rows from qualified name and timestamp: Currently, the index in MapFiles is over all records, so the same row can appear in the index more than one time (differentiated by column name/timestamp). If the index just contained row keys, then we could store each row key exactly once, which would point to a record group of qualified names and timestamps (and values of course). Within the record group, there could be another separate small index on qualified name. This would again reduce the size of data stored, size of indices, and make it easier to do things like split regions lexically instead of skewed by cell count.

* Use random rather than streaming reads: There is some indication that the existing MapFile implementation is optimized for streaming access; HBase supports random reads, which are therefore not efficient under MapFile. It would make sense for us to design our new implementation in such a way that it would be very cheap to do random access.

> [hbase] Create an HBase-specific MapFile implementation
> -------------------------------------------------------
>
>                 Key: HADOOP-2604
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2604
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: contrib/hbase
>            Reporter: Bryan Duxbury
>            Priority: Minor
>
> Today, HBase uses the Hadoop MapFile class to store data persistently to
> disk. This is convenient, as it's already done (and maintained by other
> people :). However, it's beginning to look like there might be possible
> performance benefits to be had from doing an HBase-specific implementation of
> MapFile that incorporated some precise features.
> This issue should serve as a place to track discussion about what features
> might be included in such an implementation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
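
To make the first two ideas in the comment concrete, here is a minimal in-memory sketch of a two-level layout: the column family name is kept once per file, each row key appears exactly once in a sorted row index, and each row points at a record group that is itself sorted by qualifier. This is only an illustration under assumptions, not the proposed on-disk format and not anything from the HBase code base: the class name `StoreFileSketch`, its `get` method, and the tuple layout are all hypothetical, and Python is used purely for brevity.

```python
from bisect import bisect_left


class StoreFileSketch:
    """Hypothetical two-level index: row keys once, then per-row record groups."""

    def __init__(self, family, cells):
        # The column family name is stored once for the whole file,
        # rather than being repeated inside every key.
        self.family = family

        # Gather cells by row. Timestamps are negated so that an ascending
        # sort puts the newest version of each qualifier first.
        groups = {}
        for row, qualifier, timestamp, value in cells:
            groups.setdefault(row, []).append((qualifier, -timestamp, value))

        # Top-level index: every row key exactly once, in sorted order.
        self.row_index = sorted(groups)
        # Record groups, parallel to row_index, each sorted by qualifier.
        self.groups = [sorted(groups[row]) for row in self.row_index]

    def get(self, row, qualifier):
        """Return (timestamp, value) of the newest matching cell, or None."""
        # Binary search over row keys only.
        i = bisect_left(self.row_index, row)
        if i == len(self.row_index) or self.row_index[i] != row:
            return None

        # Binary search inside the (small) record group on qualifier.
        # (qualifier, -inf) sorts before every real cell of that qualifier.
        group = self.groups[i]
        j = bisect_left(group, (qualifier, float("-inf")))
        if j == len(group) or group[j][0] != qualifier:
            return None

        _, negated_timestamp, value = group[j]
        return (-negated_timestamp, value)
```

Because the top-level index holds only row keys, its size depends on the number of rows rather than the number of cells, which is also what would make a lexical region split straightforward in a layout like this one.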