[jira] [Updated] (CASSANDRA-3545) Fix very low Secondary Index performance

Sylvain Lebresne (Updated) (JIRA) Thu, 08 Dec 2011 08:00:03 -0800

     [ 
https://issues.apache.org/jira/browse/CASSANDRA-3545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sylvain Lebresne updated CASSANDRA-3545:
----------------------------------------

    Attachment: 0002-cleanup.patch
                0001-3545.patch

I agree with Jonathan that interning this inside the column family feels 
cleaner (and is more efficient). Attaching patches to do that (actually 2 
patches; the second one cleans up the comparator being passed to lots of 
methods that don't care about it or can get it by other means). The patches are 
against trunk since I don't think we should push this into a stable release 
(independently of the actual implementation).

Note that this only applies to the memtable, so it probably has much more 
impact on small benchmarks (where you insert and get immediately) than it will 
have in real life (it's still an improvement, don't get me wrong).

For the rest:
bq. 2) Don't calculate the MD5 hash for startKey every time. It's optimal to 
compute it once (so search will be twice as fast).

Unfortunately I don't see a clean way to do this without badly breaking the 
comparator abstraction.
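
For reference, here is a minimal sketch of what "compute it once" means in 
isolation. It is only an illustration: CachedTokenSketch, md5Token, 
compareRehashing and compareCached are hypothetical names, not Cassandra 
classes, and the token shape (MD5 digest read as a non-negative integer) is an 
assumption. It deliberately bypasses any comparator interface, which is 
exactly the abstraction problem mentioned above.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration; none of these names are Cassandra classes.
final class CachedTokenSketch {
    static int md5Calls = 0; // counts digest computations for the demo

    // MD5 of the key read as a non-negative integer (assumed token shape).
    static BigInteger md5Token(byte[] key) {
        md5Calls++;
        try {
            return new BigInteger(MessageDigest.getInstance("MD5").digest(key)).abs();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 unavailable", e);
        }
    }

    // Hashes both keys on every comparison, including the unchanged startKey.
    static int compareRehashing(byte[] startKey, byte[] other) {
        return md5Token(startKey).compareTo(md5Token(other));
    }

    // Reuses a start token that the caller computed once.
    static int compareCached(BigInteger startToken, byte[] other) {
        return startToken.compareTo(md5Token(other));
    }

    public static void main(String[] args) {
        byte[] start = "k5".getBytes(StandardCharsets.UTF_8);
        String[] keys = {"k1", "k5", "k9"};

        for (String k : keys) {
            compareRehashing(start, k.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("rehashing: " + md5Calls + " digests");

        md5Calls = 0;
        BigInteger startToken = md5Token(start); // hashed once, outside the loop
        for (String k : keys) {
            compareCached(startToken, k.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("cached: " + md5Calls + " digests");
    }
}
```

For n comparisons against the same startKey, the first variant computes 2n 
digests and the second n + 1, which is where the "twice as fast" estimate 
comes from.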

bq. 3) Think about something faster than MD5 for hashing (like 
TigerRandomPartitioner with Tiger/128 hash).

It could be worth checking, though a quick search doesn't turn up much of 
interest. Finding a faster MD5 implementation would be convenient too, but the 
only thing I've found so far is 
http://twmacinta.com/myjava/fast_md5.php, which is unfortunately incompatible 
with our licence.

bq. 4) Don't use Tokens (with MD5 hash for RandomPartitioner) for comparing and 
sorting keys in index rows. In index rows, keys can be stored and compared with 
simple Byte Comparator

Imo, that's the most promising option. I don't think that would be very 
complicated to do (I actually think it would be pretty easy but I may be 
forgetting a difficulty), but the annoying part will likely be how to deal with 
the upgrade/backward compatibility. I may give it a shot at some point though.
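
A minimal sketch of the kind of plain byte comparison option 4 refers to, 
assuming keys are available as raw byte arrays and compared lexicographically 
as unsigned bytes. ByteOrderSketch, BYTE_ORDER and sortedKeys are hypothetical 
names for illustration, not Cassandra APIs, and this is not what the attached 
patches do.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical illustration; none of these names are Cassandra classes.
final class ByteOrderSketch {
    // Lexicographic order over unsigned bytes; no hashing involved.
    static final Comparator<byte[]> BYTE_ORDER = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xff) - (b[i] & 0xff);
            if (cmp != 0) {
                return cmp;
            }
        }
        return a.length - b.length; // a shorter key sorts before its extensions
    };

    // Sorts string keys by their UTF-8 bytes using BYTE_ORDER.
    static String[] sortedKeys(String... keys) {
        byte[][] raw = new byte[keys.length][];
        for (int i = 0; i < keys.length; i++) {
            raw[i] = keys[i].getBytes(StandardCharsets.UTF_8);
        }
        Arrays.sort(raw, BYTE_ORDER);
        String[] out = new String[raw.length];
        for (int i = 0; i < raw.length; i++) {
            out[i] = new String(raw[i], StandardCharsets.UTF_8);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sortedKeys("b", "a", "ab")));
    }
}
```

Keys sorted this way end up in a different order than keys sorted by MD5 
token, so index rows written under the old ordering would not match the new 
comparator. That is the upgrade/backward compatibility concern.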

                
> Fix very low Secondary Index performance
> ----------------------------------------
>
>                 Key: CASSANDRA-3545
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3545
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.7.0
>            Reporter: Evgeny Ryabitskiy
>             Fix For: 1.0.6
>
>         Attachments: 0001-3545.patch, 0002-cleanup.patch, 
> CASSANDRA-3545.patch, CASSANDRA-3545_v2.patch, IndexSearchPerformance.png
>
>
> While performing index search + value filtering over a large index row (~100k 
> keys per index value) in chunks (512-1024 keys each), search time is 
> about 8-12 seconds, which is very slow.
> After profiling I got this picture:
> 60% of search time is spent calculating MD5 hashes with MessageDigest (of 
> course, this is because of RandomPartitioner).
> 33% of search time (half of all MD5 hash calculation time) is spent 
> recalculating MD5 when comparing two row keys while rotating the index row to 
> startKey (when performing the search query for the next chunk).
> I see several possible performance improvements:
> 1) Use a good algorithm to search for startKey in the sorted collection, one 
> that is faster than iterating over all keys. This solution comes first because 
> it is simple, needs only local code changes and should solve the problem 
> (speeding up search several times over).
> 2) Don't calculate the MD5 hash for startKey every time. It's optimal to 
> compute it once (so search will be twice as fast).
> This also needs only local code changes.
> 3) Think about something faster than MD5 for hashing (like 
> TigerRandomPartitioner with Tiger/128 hash).
> This needs research, and maybe this research has already been done.
> 4) Don't use Tokens (with MD5 hash for RandomPartitioner) for comparing and 
> sorting keys in index rows. In index rows, keys can be stored and compared 
> with a simple byte comparator. 
> This solution requires huge code changes.
> I'm going to start with the first solution. Further improvements can be done 
> in follow-up tickets.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
