[ 
https://issues.apache.org/jira/browse/HBASE-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778529#action_12778529
 ] 

stack commented on HBASE-1938:
------------------------------

Profiling in-memory scanning of MemStore:

+ All time is in cacheNextRow, as you'd expect.  41% of CPU is doing 
SortedSet#first, 24% is making an Iterator, and 22% is doing the isEmpty test 
(which calls the #first method).  Each of these methods ends up in 
KVComparator.compare.  All our scanning time is spent doing compares.  A lot of 
time is spent making up ints and longs out of bytes; e.g. getKeyLength and 
getRowLength.  I can see that some of these constructions -- e.g. getKeyLength 
-- happen multiple times in a single scan for a single KV (imagine the cost 
with multiple concurrent scans).  This would seem to argue that we cache the 
constructed lengths, but there'd be an associated memory cost... maybe do it 
for just a few of these lengths?  Key length?  For example, calculating key 
length once on construction would seem to make scanning nearly 30% faster in a 
simple test.

Without caching of KeyLength:

Loaded
Scan: 2406
Scan: 1685
Scan: 1656
Scan: 1655
Scan: 1646
Scan: 1647
Scan: 1646


With caching of KeyLength:
Loaded
Scan: 1970
Scan: 1282
Scan: 1292
Scan: 1252
Scan: 1273
Scan: 1272
Scan: 1284
Scan: 1220
..

Let me attach patches that have the amended test and the change I made to KV.
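
To make the key-length idea concrete, here is a minimal sketch of decoding the length once at construction instead of on every call. It is not the attached patch: SimpleKV, encode and decodeInt are hypothetical names, and the layout (4-byte key length, 4-byte value length, then key and value bytes) is a simplified assumption for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class CachedKeyLengthSketch {

    /** Hypothetical, simplified stand-in for a KeyValue; not the real HBase class. */
    static final class SimpleKV {
        private final byte[] bytes;
        private final int offset;
        // Decoded once here; the trade-off is 4 extra bytes of heap per instance.
        private final int keyLength;

        SimpleKV(byte[] bytes, int offset) {
            this.bytes = bytes;
            this.offset = offset;
            this.keyLength = decodeInt(bytes, offset);
        }

        /** Cached variant: a plain field read. */
        int getKeyLength() {
            return keyLength;
        }

        /** Uncached variant: rebuilds the int from four bytes on every call. */
        int getKeyLengthUncached() {
            return decodeInt(bytes, offset);
        }

        /** Assembles a big-endian int from four bytes. */
        static int decodeInt(byte[] b, int off) {
            return ((b[off] & 0xff) << 24)
                 | ((b[off + 1] & 0xff) << 16)
                 | ((b[off + 2] & 0xff) << 8)
                 |  (b[off + 3] & 0xff);
        }
    }

    /** Builds the assumed layout: key length, value length, key bytes, value bytes. */
    static byte[] encode(byte[] key, byte[] value) {
        return ByteBuffer.allocate(8 + key.length + value.length)
                .putInt(key.length)   // ByteBuffer is big-endian by default
                .putInt(value.length)
                .put(key)
                .put(value)
                .array();
    }

    public static void main(String[] args) {
        byte[] key = "row1/cf:q".getBytes(StandardCharsets.UTF_8);
        byte[] value = "v".getBytes(StandardCharsets.UTF_8);
        SimpleKV kv = new SimpleKV(encode(key, value), 0);
        System.out.println("cached=" + kv.getKeyLength()
                + " uncached=" + kv.getKeyLengthUncached());
    }
}
```

Both accessors return the same value; the only difference is whether the four-byte decode is repeated on each compare.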

+ The "reputable lads" mentioned above think our getting a tailSet for each 
cache of row content is wasteful, and that we should be able to do better -- 
say, take out an iterator once and keep it for the life of the Scanner.  On 
snapshot, we'd have to poke all outstanding Scanners to readjust themselves.  
Looking at the numbers, though, actually taking a tailSet is surprisingly 
inexpensive; the isEmpty tests and the creation of an Iterator each time are 
the bulk of the CPU.  Let me play with changing the MemStoreScanner 
implementation to be just a set.
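
The two access patterns discussed above can be sketched roughly as follows. This is an illustration only, not the MemStoreScanner code: the class and method names are hypothetical, and a ConcurrentSkipListSet of Strings is assumed as a stand-in for the sorted set of KVs.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NavigableSet;
import java.util.SortedSet;
import java.util.concurrent.ConcurrentSkipListSet;

public class ScannerIteratorSketch {

    /** Per-step pattern: a fresh tailSet view, an isEmpty test, and a first() call each time. */
    static List<String> scanWithTailSetPerStep(NavigableSet<String> set, String start) {
        List<String> out = new ArrayList<>();
        String next = start;
        boolean inclusive = true;
        while (true) {
            SortedSet<String> tail = set.tailSet(next, inclusive);
            if (tail.isEmpty()) {   // costs comparator calls to locate the first element
                break;
            }
            next = tail.first();    // locates the same element again
            out.add(next);
            inclusive = false;      // continue strictly after the element just returned
        }
        return out;
    }

    /** Alternative: take the tailSet and its iterator once, then just advance. */
    static List<String> scanWithOneIterator(NavigableSet<String> set, String start) {
        List<String> out = new ArrayList<>();
        Iterator<String> it = set.tailSet(start, true).iterator();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        NavigableSet<String> set = new ConcurrentSkipListSet<>();
        set.add("a");
        set.add("b");
        set.add("c");
        set.add("d");
        System.out.println(scanWithTailSetPerStep(set, "b"));
        System.out.println(scanWithOneIterator(set, "b"));
    }
}
```

Both methods return the same elements in the same order. The long-lived iterator avoids the repeated lookups, but it stays tied to the set it was created from, which is why outstanding Scanners would need to be repositioned when a snapshot swaps the underlying set.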

> Make in-memory table scanning faster
> ------------------------------------
>
>                 Key: HBASE-1938
>                 URL: https://issues.apache.org/jira/browse/HBASE-1938
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Assignee: stack
>            Priority: Blocker
>             Fix For: 0.21.0
>
>         Attachments: test.patch
>
>
> This issue is about profiling hbase to see if I can make hbase scans run 
> faster when all is up in memory.  Talking to some users, they are seeing 
> about 1/4 million rows a second.  It should be able to go faster than this 
> (Scanning an array of objects, they can do about 4-5x this).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
