[ 
https://issues.apache.org/jira/browse/PHOENIX-76?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13917593#comment-13917593
 ] 

Lars Hofhansl commented on PHOENIX-76:
--------------------------------------

I did some informal performance tests and found that seeking is about 5-10x as 
expensive as calling next() on the scanner. I tested with very small column 
values; with larger values, next() becomes proportionally more expensive.
(Informal tests, because the exact outcome depends on the ratio of value size 
to key size and of KeyValue size to HFile block size.)

In addition, what counts is the number of *gaps*. Consecutive selected columns 
have no extra cost beyond a call to next(), for example when the 3rd and 4th 
columns are selected together.

So seeking is preferable when each consecutive range of selected columns is 
followed by a skip of 5-10 columns and/or versions - for example, selecting a 
single column out of a row where we expect 10 columns, selecting 2 out of 20, 
skipping a single column with 10 versions, or selecting columns 1, 2, 3 out of 
5 columns.
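The rule of thumb above can be sketched as a tiny decision helper. This is a 
hypothetical illustration of the measured 5-10x cost ratio, not HBase's actual 
ColumnTracker/ScanQueryMatcher code; the class and method names are made up:

```java
// Sketch of the seek-vs-next() decision discussed above.
// Assumption: one seek costs roughly as much as 5-10 next() calls,
// per the informal measurements in this comment.
public class SeekHeuristic {
    // Conservative upper bound of the measured cost ratio.
    static final int SEEK_TO_NEXT_RATIO = 10;

    /**
     * Decide whether to seek past a gap or just keep calling next().
     * gapInKvs is the number of KeyValues (columns x versions) between
     * the current position and the next selected column.
     */
    static boolean shouldSeek(int gapInKvs) {
        return gapInKvs >= SEEK_TO_NEXT_RATIO;
    }

    public static void main(String[] args) {
        // Skipping a column with 10 versions: seek pays off.
        System.out.println(shouldSeek(10));
        // A gap of only 2 KVs: cheaper to keep calling next().
        System.out.println(shouldSeek(2));
    }
}
```

With a less conservative ratio (e.g. 5), smaller gaps would also qualify for 
seeking; the right threshold is exactly the open question in this comment.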

Some interesting data:
* selecting the first columns consecutively is cheap (i.e. the 1st, 2nd, 3rd, 
4th columns)... just as fast as the wildcard column tracker
* selecting the 3rd, 4th, and 5th columns is hardly more expensive than just 
selecting the 3rd column alone
* (as said above) if a seek skips 5-10 KVs (i.e. columns or versions), we 
should seek
* when column values are large (approaching the HFile block size of 64K), we 
should definitely seek

So the exact optimal cutoff is a bit hard to determine. I worry that this (and 
PHOENIX-29) might be a bit of a premature optimization based on too few 
performance tests. We might see terrible performance in many scenarios we have 
not tested, as outlined above.

What would really help is if we could make sure that the canonical column 
(right now it's "_", which sorts after capital letters) always sorts first... 
i.e. call it "$" or "!" or something. That should double the performance of 
count(1), for example.
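The byte-order claim is easy to verify: in lexicographic (unsigned byte) 
ordering, '_' (0x5F) sorts after the capital letters (0x41-0x5A), while '$' 
(0x24) and '!' (0x21) sort before them. A quick standalone check:

```java
// Verifies the ASCII ordering behind the canonical-column suggestion:
// '_' sorts after 'A'-'Z', while '$' and '!' sort before 'A'.
public class CanonicalColumnOrder {
    public static void main(String[] args) {
        System.out.printf("'_' = 0x%02X%n", (int) '_'); // 0x5F, after 'Z' (0x5A)
        System.out.printf("'$' = 0x%02X%n", (int) '$'); // 0x24, before 'A' (0x41)
        System.out.printf("'!' = 0x%02X%n", (int) '!'); // 0x21, before 'A' (0x41)
        System.out.println('_' > 'Z'); // true: "_" scans after all capital-letter columns
        System.out.println('$' < 'A'); // true: "$" would scan first
        System.out.println('!' < 'A'); // true: so would "!"
    }
}
```

So a canonical column named "$" or "!" would always be the first KV 
encountered in a row, letting count(1) stop after a single next().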

Numbers (not with Phoenix, but HBase directly):
10m rows, 10 cols each, 8-byte values, 10-byte keys, encoding = FAST_DIFF, 
exactly one version of each column, everything in the blockcache:
||Columns selected||none||1||1,2||1,2,3||2||2,3||2,3,4||2,4,6||1,2,3,4,6,7,8,9,10||1,2,3,4,5,6,7,8,9,10||
|Scan time/s|19.5|13.0|14.5|21.1|18.2|19.8|21.1|31.7|25.8|22.0|

We should do more tests with (1) more versions, (2) longer values, and (3) 
longer keys.


> Fix perf regression due to PHOENIX-29
> -------------------------------------
>
>                 Key: PHOENIX-76
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-76
>             Project: Phoenix
>          Issue Type: Bug
>    Affects Versions: 3.0.0
>            Reporter: James Taylor
>            Assignee: Anoop Sam John
>             Fix For: 3.0.0
>
>         Attachments: PHOENIX-76.patch
>
>
> Many queries got slower as a result of PHOENIX-29. There are a few simple 
> checks we can do to prevent adding the new filter:
> - if the query is an aggregate query: we don't return KVs in this case, so 
> we'd only be doing extra processing that we don't need. For this, you can 
> check statement.isAggregate().
> - if there are multiple column families referenced in the where clause: the 
> seek that gets done is better in this case, because we'd potentially be 
> seeking over an entire store's worth of data into a different store.
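The two guards described in the issue could look roughly like this. This is a 
hedged sketch, not Phoenix's actual code: `FilterGuard`, `shouldAddFilter`, 
and the `whereFamilies` parameter are illustrative stand-ins; only 
`statement.isAggregate()` is named in the issue text.

```java
import java.util.Set;

// Illustrative-only sketch of the guards that would skip adding the
// PHOENIX-29 column-seeking filter.
public class FilterGuard {
    /**
     * Add the filter only when neither guard trips: aggregate queries
     * return no KVs (the filter would be pure overhead), and a WHERE
     * clause touching multiple column families already benefits from
     * the cross-store seek.
     */
    static boolean shouldAddFilter(boolean isAggregate, Set<String> whereFamilies) {
        if (isAggregate) {
            return false; // no KVs returned; extra processing we don't need
        }
        if (whereFamilies.size() > 1) {
            return false; // seeking across stores is already the better plan
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(shouldAddFilter(true, Set.of("f1")));        // aggregate: skip
        System.out.println(shouldAddFilter(false, Set.of("f1", "f2"))); // multi-family: skip
        System.out.println(shouldAddFilter(false, Set.of("f1")));       // add the filter
    }
}
```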



--
This message was sent by Atlassian JIRA
(v6.2#6252)
