[jira] [Reopened] (CASSANDRA-4710) High key hashing overhead for index scans when using RandomPartitioner

Yuki Morishita (JIRA) Fri, 05 Oct 2012 00:22:33 -0700

     [ https://issues.apache.org/jira/browse/CASSANDRA-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Yuki Morishita reopened CASSANDRA-4710:
---------------------------------------


While trying to reproduce CASSANDRA-4733, I stumbled upon the following error.

{code}
ERROR [ValidationExecutor:2] 2012-10-04 15:24:43,440 CassandraDaemon.java (line 132) Exception in thread Thread[ValidationExecutor:2,1,main]
java.lang.AssertionError: 113427529603963934725865253558964126270 is not contained in (56713727820156410577229101238628035242,113427455640312821154458202477256070484]
        at org.apache.cassandra.service.AntiEntropyService$Validator.add(AntiEntropyService.java:345)
        at org.apache.cassandra.db.compaction.CompactionManager.doValidationCompaction(CompactionManager.java:727)
        at org.apache.cassandra.db.compaction.CompactionManager.access$600(CompactionManager.java:66)
        at org.apache.cassandra.db.compaction.CompactionManager$8.call(CompactionManager.java:451)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
{code}

The cause turned out to be SSTableReader#getPositionsForRanges returning an 
unrelated section of the file because of a bug in SSTableReader#getPosition: 
getPosition was returning null when it should have returned a position.

getPosition starts searching for the key at the nearest sampled index entry 
and scans forward for at most one index interval's worth of entries.
The following check inside getPosition:

{code}
 while (!input.isEOF() && i < DatabaseDescriptor.getIndexInterval())
{code}

stops the search once it has scanned all index entries between two sampled 
entries, and the method then returns null.
But with the check above, when searching for a key that is greater than the 
last key inside the index interval but less than the next sampled key, the 
method returns null instead of the position.

I think the fix for this is changing < to <=.
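
To illustrate the off-by-one, here is a minimal sketch of the scan under simplifying assumptions. It is not the actual SSTableReader code: the class IndexScanSketch, the method positionAtOrAfter, the in-memory arrays, and the tiny interval of 4 are all hypothetical stand-ins for the on-disk index and DatabaseDescriptor.getIndexInterval().

```java
// Simplified model only; names and data are hypothetical, not Cassandra's code.
class IndexScanSketch {
    // Assumed, deliberately tiny sampling interval for illustration.
    static final int INDEX_INTERVAL = 4;

    /**
     * Scans sorted keys starting at the nearest sampled entry at or before
     * target and returns the position of the first key >= target, or null
     * if the scan limit is reached first.
     * inclusiveLimit=false models "i < interval"; true models "i <= interval".
     */
    static Long positionAtOrAfter(long[] keys, long[] positions,
                                  long target, boolean inclusiveLimit) {
        // Every INDEX_INTERVAL-th entry is "sampled"; pick the last sampled key <= target.
        int start = 0;
        for (int s = 0; s < keys.length; s += INDEX_INTERVAL) {
            if (keys[s] <= target) {
                start = s;
            }
        }
        int i = 0;
        while (start + i < keys.length
                && (inclusiveLimit ? i <= INDEX_INTERVAL : i < INDEX_INTERVAL)) {
            if (keys[start + i] >= target) {
                return positions[start + i];
            }
            i++;
        }
        return null; // scan limit hit before reaching a key >= target
    }

    public static void main(String[] args) {
        long[] keys = {10, 20, 30, 40, 50, 60, 70, 80};   // sampled keys: 10 and 50
        long[] positions = {1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000};
        // 45 is greater than 40 (last key in the interval) and less than 50 (next sampled key).
        System.out.println(positionAtOrAfter(keys, positions, 45, false)); // prints null
        System.out.println(positionAtOrAfter(keys, positions, 45, true));  // prints 5000
    }
}
```

In this model the exclusive limit stops one entry short of the next sampled key, so a key falling in that gap yields null; the inclusive limit reads one more entry and finds the position.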
                
> High key hashing overhead for index scans when using RandomPartitioner
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-4710
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4710
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Daniel Norberg
>            Assignee: Daniel Norberg
>            Priority: Minor
>             Fix For: 1.2.0 beta 2
>
>         Attachments: 
> 0001-SSTableReader-compare-raw-key-when-scanning-index.patch
>
>
> For a workload where the dataset is completely in RAM, the MD5 hashing of the 
> keys during index scans becomes a bottleneck for reads when using 
> RandomPartitioner, according to profiling.
> Performing a raw key equality check in SSTableReader.getPosition() for 
> EQ operations instead improves throughput by some 30% for my workload 
> (moving the bottleneck elsewhere).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
