[jira] Commented: (HADOOP-2604) [hbase] Create an HBase-specific MapFile implementation

Bryan Duxbury (JIRA) Mon, 14 Jan 2008 15:27:54 -0800

    [ 
https://issues.apache.org/jira/browse/HADOOP-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558846#action_12558846
 ]


Bryan Duxbury commented on HADOOP-2604:
---------------------------------------

Here are some of the ideas we're tossing around as a starter:

 * Exclude column family name from the file: Currently we store HStoreKeys, 
which are serialized to contain row, qualified cell name, and timestamp. 
However, since a given MapFile only ever belongs to one column family, it's 
very wasteful to store the same column family name over and over again. In a 
custom implementation, we wouldn't have to save that data.
 * Separate indices for rows from qualified name and timestamp: Currently, the 
index in MapFiles is over all records, so the same row can appear in the index 
more than one time (differentiated by column name/timestamp). If the index just 
contained row keys, then we could store each row key exactly once, which would 
point to a record group of qualified names and timestamps (and values of 
course). Within the record group, there could be another separate small index 
on qualified name. This would again reduce the size of the data stored and the 
size of the indices, and make it easier to do things like split regions 
lexically by row rather than at a point skewed by cell count.
 * Use random rather than streaming reads: There is some indication that the 
existing MapFile implementation is optimized for streaming access. HBase also 
has to serve random reads, which are therefore not efficient under MapFile. It 
would make sense to design our new implementation so that random access is 
very cheap.
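
To make the first two ideas concrete, here is a minimal in-memory sketch in 
Python. It is an illustration under stated assumptions, not a proposed file 
format or anything taken from the HBase code: the names FamilyStoreFile, 
RecordGroup, put, get and midpoint_row are all hypothetical. The column family 
name is held once per file, each row key appears exactly once in the row 
index, and each row points to a record group with its own small sorted index 
on (qualifier, timestamp).

```python
import bisect


class RecordGroup:
    """All cells of one row within one column family (hypothetical name)."""

    def __init__(self):
        # Keys are (qualifier, -timestamp), kept sorted so that the newest
        # version of a qualifier sorts first within that qualifier.
        self.keys = []
        self.values = []

    def put(self, qualifier, timestamp, value):
        key = (qualifier, -timestamp)
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value  # overwrite the same cell version
        else:
            self.keys.insert(i, key)
            self.values.insert(i, value)

    def get_latest(self, qualifier):
        # Binary search to the first entry for this qualifier.
        i = bisect.bisect_left(self.keys, (qualifier, float("-inf")))
        if i < len(self.keys) and self.keys[i][0] == qualifier:
            return self.values[i]
        return None


class FamilyStoreFile:
    """Sketch of a per-column-family store (hypothetical name)."""

    def __init__(self, family):
        self.family = family  # stored once, never repeated per cell
        self.row_keys = []    # sorted row index, one entry per row
        self.groups = {}      # row key -> RecordGroup

    def put(self, row, qualifier, timestamp, value):
        group = self.groups.get(row)
        if group is None:
            bisect.insort(self.row_keys, row)
            group = self.groups[row] = RecordGroup()
        group.put(qualifier, timestamp, value)

    def get(self, row, qualifier):
        # Random read: one lookup on the row, one small search in the group.
        group = self.groups.get(row)
        return None if group is None else group.get_latest(qualifier)

    def midpoint_row(self):
        # Candidate split point chosen by row count, not by cell count.
        if not self.row_keys:
            return None
        return self.row_keys[len(self.row_keys) // 2]
```

Because the row index holds each row once, midpoint_row picks the middle row 
regardless of how many cells any single row carries. A real on-disk layout 
would also need block offsets, serialization and compression, none of which 
this sketch attempts.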

> [hbase] Create an HBase-specific MapFile implementation
> -------------------------------------------------------
>
>                 Key: HADOOP-2604
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2604
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: contrib/hbase
>            Reporter: Bryan Duxbury
>            Priority: Minor
>
> Today, HBase uses the Hadoop MapFile class to store data persistently to 
> disk. This is convenient, as it's already done (and maintained by other 
> people :). However, it's beginning to look like there might be performance 
> benefits to be had from doing an HBase-specific implementation of MapFile 
> that incorporates some HBase-specific features.
> This issue should serve as a place to track discussion about what features 
> might be included in such an implementation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
