[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238948#comment-13238948 ]

Matt Corgan commented on HBASE-4676:
------------------------------------

{quote}Do we not have this now Matt?{quote}
I don't believe so.  Right now, HFileReaderV2.blockSeek(..) does a brute-force 
loop through the block, doing a full KV comparison at each KV.  Say there are 
1000 KVs in the block and 20 per row: on average you will do 500 full KV 
comparisons just to get to your key, and even if you are selecting all 20 KVs 
in the row you still do ~500 comparisons on average just to reach the start of 
the row.  It's also very important to remember that these are long keys, and 
they most likely share a common prefix since they got sorted into the same 
block, so your comparators are churning through the same prefix bytes over and 
over.
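
To illustrate (toy Java below, not the actual HFileReaderV2 code; the key 
layout and counts are made up), this is roughly what the brute-force scan is 
paying for:

{code:java}
import java.util.ArrayList;
import java.util.List;

/**
 * Toy illustration of the linear blockSeek cost: every probe re-compares the
 * full key, including the long prefix shared by all keys in the block.
 */
public class LinearBlockSeekSketch {

    static long bytesTouched = 0;

    // Lexicographic compare that counts every byte it looks at.
    static int compare(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            bytesTouched++;
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) return diff;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // 50 rows x 20 cells = 1000 keys, all sharing a long common prefix.
        List<byte[]> block = new ArrayList<byte[]>();
        for (int row = 0; row < 50; row++) {
            for (int cell = 0; cell < 20; cell++) {
                block.add(String.format(
                    "/us/ca/sanfrancisco/user%05d/col%02d", row, cell).getBytes());
            }
        }

        // Brute-force seek to the middle row: walk the block from the start,
        // full key comparison at each KV, stopping at the first key >= target.
        byte[] target = "/us/ca/sanfrancisco/user00025/col00".getBytes();
        int kvComparisons = 0;
        for (byte[] key : block) {
            kvComparisons++;
            if (compare(key, target) >= 0) break;
        }
        System.out.println("KV comparisons: " + kvComparisons); // ~500
        System.out.println("bytes touched:  " + bytesTouched);  // prefix re-read each time
    }
}
{code}

Half the block gets compared on average, and every one of those comparisons 
walks the same shared prefix again from byte zero.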

{quote}The slow write and scan speeds would tend to rule it out for anything 
but a random read workload but when random reading only{quote}
Yeah, it's not really targeted at long scans on cold data, but there are some 
cases in between long cold scans and hot point queries.  Some factors: if it's 
warm data, your block cache is now effectively 6x bigger.  Then there are many 
scans that are really sub-block prefix scans to get all the keys in a row (the 
20 of 1000 cells I mentioned above).  If your scan will have a low hit ratio, 
then the ability to jump between rows inside a block will lessen CPU usage.

It performs best when doing lots of random Gets on hot data (in block cache).  
Possibly a persistent memcached alternative, especially with SSDs.  I believe 
the current system is actually limited by CPU when doing high-throughput Gets 
on data with long keys and short values.  The trie's individual request latency 
may not be much different, but a single server could serve many more parallel 
requests before maxing out CPU.

The bigger the value/key ratio of your data, the smaller the difference between 
the trie and anything else.  It seems like many people now have big values, so 
I doubt they would see a difference.  I'm more trying to enable smarter schemas 
with compound primary keys.

{quote}Regards your working set read testing, did it all fit in memory?{quote}
Yes, I'm trying to do the purest comparison of CPU usage possible, leaving it 
up to others to extrapolate the results to what happens with the effectively 
bigger block cache, more rows per block fetched from disk for an equivalent 
block size, etc.  I'm currently just standing up a StoreFile and using it 
directly; see 
https://github.com/hotpads/hbase-prefix-trie/blob/hcell-scanners/test/org/apache/hadoop/hbase/cell/pt/test/performance/seek/SeekBenchmarkMain.java.
I'll try to factor in network and disk latencies in later tests (I did some 
preliminary tests Friday but am on vacation this week).

{quote}How did you measure cycles per seek?{quote}
A simple assumption of 2 billion cycles/second; I was just trying to emphasize 
how many cycles a seek currently takes.
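
Concretely, the conversion is just the following (the seeks/sec number here is 
a placeholder, not a measured result):

{code:java}
/**
 * Back-of-the-envelope conversion: assume roughly 2 billion cycles/second and
 * divide by the measured seek throughput.  The 250,000 seeks/sec below is
 * purely illustrative; plug in the benchmark's actual output.
 */
public class CyclesPerSeekEstimate {
    public static void main(String[] args) {
        double assumedCyclesPerSecond = 2e9;     // ~2 GHz core, rough assumption
        double measuredSeeksPerSecond = 250000;  // placeholder value
        double cyclesPerSeek = assumedCyclesPerSecond / measuredSeeksPerSecond;
        System.out.printf("~%.0f cycles per seek%n", cyclesPerSeek); // ~8000 here
    }
}
{code}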

{quote}What is numBlocks? The total number of blocks the dataset fit in?{quote}
The number of data blocks in the HFile I fed in.
                
> Prefix Compression - Trie data block encoding
> ---------------------------------------------
>
>                 Key: HBASE-4676
>                 URL: https://issues.apache.org/jira/browse/HBASE-4676
>             Project: HBase
>          Issue Type: New Feature
>          Components: io, performance, regionserver
>    Affects Versions: 0.90.6
>            Reporter: Matt Corgan
>         Attachments: PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, 
> SeeksPerSec by blockSize.png
>
>
> The HBase data block format has room for 2 significant improvements for 
> applications that have high block cache hit ratios.  
> First, there is no prefix compression, and the current KeyValue format is 
> somewhat metadata heavy, so there can be tremendous memory bloat for many 
> common data layouts, specifically those with long keys and short values.
> Second, there is no random access to KeyValues inside data blocks.  This 
> means that every time you double the data block size, average seek time (or 
> average CPU consumption) goes up by a factor of 2.  The standard 64KB block 
> size is ~10x slower for random seeks than a 4KB block size, but block sizes 
> as small as 4KB cause problems elsewhere.  Using block sizes of 256KB or 1MB 
> or more may be more efficient from a disk access and block-cache perspective 
> in many big-data applications, but doing so is infeasible from a random seek 
> perspective.
> The PrefixTrie block encoding format attempts to solve both of these 
> problems.  Some features:
> * trie format for row key encoding completely eliminates duplicate row keys 
> and encodes similar row keys into a standard trie structure which also saves 
> a lot of space
> * the column family is currently stored once at the beginning of each block.  
> this could easily be modified to allow multiple family names per block
> * all qualifiers in the block are stored in their own trie format which 
> caters nicely to wide rows.  duplicate qualifiers between rows are eliminated. 
>  the size of this trie determines the width of the block's qualifier 
> fixed-width-int
> * the minimum timestamp is stored at the beginning of the block, and deltas 
> are calculated from that.  the maximum delta determines the width of the 
> block's timestamp fixed-width-int
> The block is structured with metadata at the beginning, then a section for 
> the row trie, then the column trie, then the timestamp deltas, and then 
> all the values.  Most work is done in the row trie, where every leaf node 
> (corresponding to a row) contains a list of offsets/references corresponding 
> to the cells in that row.  Each cell is fixed-width to enable binary 
> searching and is represented by [1 byte operationType, X bytes qualifier 
> offset, X bytes timestamp delta offset].
> If all operation types are the same for a block, there will be zero per-cell 
> overhead.  Same for timestamps.  Same for qualifiers once I get a chance to 
> implement that.  
> So, the compression aspect is very strong, but makes a few small sacrifices 
> on VarInt size to enable faster binary searches in trie fan-out nodes.
> A more compressed but slower version might build on this by also applying 
> further (suffix, etc) compression on the trie nodes at the cost of slower 
> write speed.  Even further compression could be obtained by using all VInts 
> instead of FInts with a sacrifice on random seek speed (though not huge).
> One current drawback is write speed.  While programmed with good constructs 
> like TreeMaps, ByteBuffers, binary searches, etc, it's not programmed with the 
> same level of optimization as the read path.  Work will need to be done to 
> optimize the data structures used for encoding, which could probably show a 
> 10x increase.  It will still be slower than delta encoding, but with a much 
> higher decode speed.  I have not yet created a thorough benchmark for write 
> speed or sequential read speed.
> Though the trie is reaching a point where it is internally very efficient 
> (probably within half or a quarter of its max read speed), the way that HBase 
> currently uses it is far from optimal.  The KeyValueScanner and related 
> classes that iterate through the trie will eventually need to be smarter and 
> have methods to do things like skipping to the next row of results without 
> scanning every cell in between.  When that is accomplished it will also allow 
> much faster compactions because the full row key will not have to be compared 
> as often as it is now.
> Current code is on GitHub.  The trie code is in a separate project from the 
> slightly modified HBase.  There is an HBase project there as well with the 
> DeltaEncoding patch applied, and the trie code builds on top of that.
> https://github.com/hotpads/hbase/tree/prefix-trie-1
> https://github.com/hotpads/hbase-prefix-trie/tree/hcell-scanners
> I'll follow up later with more implementation ideas.
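
To make the fixed-width-int idea above concrete, here is a rough sketch (not 
the actual PrefixTrie on-disk format; the constants, field names, and widths 
are made up for illustration) of how block-wide maxima could determine the 
per-cell field widths:

{code:java}
import java.io.ByteArrayOutputStream;

/**
 * Illustrative sketch only, not the real PrefixTrie encoding: the block-wide
 * maximum of each field decides how many bytes that field needs, so every
 * cell entry has the same width and can be binary searched.
 */
public class FixedWidthCellEntrySketch {

    // Smallest number of bytes (1..8) that can hold maxValue.
    static int bytesNeeded(long maxValue) {
        int bytes = 1;
        while (bytes < 8 && (maxValue >>> (8 * bytes)) != 0) {
            bytes++;
        }
        return bytes;
    }

    // Big-endian write of value using exactly width bytes.
    static void writeFixedWidth(ByteArrayOutputStream out, long value, int width) {
        for (int shift = (width - 1) * 8; shift >= 0; shift -= 8) {
            out.write((int) (value >>> shift) & 0xff);
        }
    }

    public static void main(String[] args) {
        // Block-wide maxima (made-up numbers) decide the widths for every cell.
        long maxQualifierTrieOffset = 40000;  // size of the qualifier trie section
        long maxTimestampDelta = 86400000L;   // ~1 day of millis above the block minimum
        int qualWidth = bytesNeeded(maxQualifierTrieOffset); // 2 bytes
        int tsWidth = bytesNeeded(maxTimestampDelta);        // 4 bytes

        // One fixed-width cell entry:
        // [1 byte operationType][qualWidth bytes qualifier offset][tsWidth bytes ts delta]
        ByteArrayOutputStream entry = new ByteArrayOutputStream();
        entry.write(4);                            // e.g. KeyValue.Type.Put code
        writeFixedWidth(entry, 12345, qualWidth);  // offset into the qualifier trie
        writeFixedWidth(entry, 3600000, tsWidth);  // delta from the block's min timestamp

        System.out.println("cell entry width = " + entry.size() + " bytes"); // 1 + 2 + 4 = 7
    }
}
{code}

If every cell in the block shares the same operation type (or timestamp, or 
qualifier), that field's width can drop to zero, which is the zero per-cell 
overhead case mentioned above.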

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
