[ 
https://issues.apache.org/jira/browse/CASSANDRA-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13828301#comment-13828301
 ] 

Jason Brown commented on CASSANDRA-1632:
----------------------------------------

OK, so after two months of trying to get thread affinity to rock the world, I 
have to admit that I can't get better performance than what we currently get 
with the kernel scheduler. Details of the code/testing follow, but my hunches 
as to why I didn't see a boost, and in most cases saw degradation, are:

- we use many, many threads (over 1000 on some of Netflix's production servers, 
currently running c* 1.1), far more than the number of processors. Most 
studies/literature I found about thread affinity used a thread count less than 
or equal to the cpu count, so not much help there.
- a given request will be passed across several threads/pools, and in order to 
get the best cpu cache locality for that data (which, I believe, would be the 
best goal for this work), its path through the code base should stay on the 
same core (for L1/L2 cache hits) or at least on the same processor (hopefully 
hitting the shared L3 cache). I briefly thought about working this in, but the 
heavy use of statics and singletons makes this a tricky and daunting task, as 
I would need to either 'kill the damn singletons' (thx, @driftx :)) or carry 
around some complicated thread state in thread local-like structures.

First off, Peter Lawrey's thread affinity library 
(https://github.com/peter-lawrey/Java-Thread-Affinity) was reasonably easy to 
use, and I was able to easily pin threads to processors with it. I then 
created several cpu affinity strategies (a rough sketch of their shape follows 
the list):

- EqualSpreadCpuAffinityStrategy - Intended to be used with an 
AbstractExecutorService (ThreadPoolExecutor or ForkJoinPool), this 
implementation assigns each successively created thread within a pool to the 
next sequential CPU, thus attempting to spread the threads equally amongst the 
CPUs. As we have several AbstractExecutorService instances inside of 
cassandra, this implementation picks a random CPU to start with, which avoids 
overloading the lower-numbered CPUs.
- FirstAssignmentCpuAffinityStrategy - pins a thread to the first cpu the 
kernel assigned it to.
- NoCpuAffinityStrategy - NOP, used for comparison vs. others, but mainly used 
for comparing vs. trunk.
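
To make the shape of these concrete, here's a rough sketch of what such 
strategies could look like. The class names mirror the list above, but 
everything else (the counter, the placeholder pin call) is illustrative rather 
than the code on my branch; the actual pinning would delegate to Peter 
Lawrey's library (whose exact API differs by version):

    import java.util.Random;
    import java.util.concurrent.atomic.AtomicInteger;

    interface CpuAffinityStrategy
    {
        // called from inside each newly started pool thread to (optionally) pin it
        void applyToCurrentThread();
    }

    class EqualSpreadCpuAffinityStrategy implements CpuAffinityStrategy
    {
        private final int cpuCount = Runtime.getRuntime().availableProcessors();
        // start from a random cpu so several pools don't all pile onto cpu0, cpu1, ...
        private final AtomicInteger next = new AtomicInteger(new Random().nextInt(cpuCount));

        public void applyToCurrentThread()
        {
            int cpu = next.getAndIncrement() % cpuCount;
            // placeholder: the real code would call into the thread affinity library
            // here (which uses JNI/JNA and sched_setaffinity under the hood)
            System.out.println(Thread.currentThread().getName() + " -> cpu " + cpu);
        }
    }

    class NoCpuAffinityStrategy implements CpuAffinityStrategy
    {
        public void applyToCurrentThread() { /* NOP, baseline for comparison */ }
    }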

I applied these strategies (one at a time) to the various ThreadPoolExecutors 
we have (via a modified NamedThreadFactory). After many variations, I ended up 
applying the thread affinity to just a few key places, including OTC, ITC, the 
READ & MUTATE stages, as well as the native protocol's 
RequestThreadPoolExecutor and TDisruptorServer (you can see my hacked-up 
version of the disruptor-thrift lib at 
https://github.com/jasobrown/disruptor_thrift_server/tree/thread_affinity).
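
For illustration, a simplified, hypothetical thread factory that applies a 
strategy to each new thread (not the actual NamedThreadFactory change on my 
branch) could look like:

    import java.util.concurrent.ThreadFactory;
    import java.util.concurrent.atomic.AtomicInteger;

    class AffinityAwareThreadFactory implements ThreadFactory
    {
        private final String baseName;
        private final AtomicInteger seq = new AtomicInteger(1);
        private final CpuAffinityStrategy strategy; // from the sketch above

        AffinityAwareThreadFactory(String baseName, CpuAffinityStrategy strategy)
        {
            this.baseName = baseName;
            this.strategy = strategy;
        }

        public Thread newThread(final Runnable task)
        {
            Runnable wrapped = new Runnable()
            {
                public void run()
                {
                    // pin from inside the new thread itself, before it does any work
                    strategy.applyToCurrentThread();
                    task.run();
                }
            };
            Thread t = new Thread(wrapped, baseName + ':' + seq.getAndIncrement());
            t.setDaemon(true);
            return t;
        }
    }

A stage's executor would then be constructed with this factory instead of the 
plain one, so every thread it spawns pins itself before picking up work.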

One thing I discovered was that by isolating the CPUs that handle IRQs (esp. 
disk and network IO) from cassandra, I did get a modest bump in throughput 
(~5%). As it depends on the kernel and the OS's configuration whether the 
disk/network IRQs are pinned to one (or more) specific CPUs, this is a little 
difficult to abstract out. Anecdotally, the ec2 instances that I used for 
testing always assigned cpu0 for blkio (disk) and cpu1 for network (eth0). 
Spot checking other Netflix instances of different instance types and even 
different kernel versions showed that the IRQ distribution was not consistent 
across our nodes (very similar, but not the same). While we could create a 
spin-off ticket to have cassandra isolate itself from those cpus, I think that 
work should be explored outside this ticket.
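
For anyone who wants to do the same spot check: which CPUs are servicing which 
IRQs is visible in /proc/interrupts. A trivial, Linux-only snippet 
(hypothetical, just for illustration; device names vary by host and 
virtualization) to dump the interesting rows:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;

    public class IrqSpotCheck
    {
        public static void main(String[] args) throws IOException
        {
            List<String> lines = Files.readAllLines(Paths.get("/proc/interrupts"),
                                                    StandardCharsets.UTF_8);
            System.out.println(lines.get(0).trim()); // header row: CPU0 CPU1 ...
            for (String line : lines.subList(1, lines.size()))
            {
                // device names to match vary (eth0, blkif, ahci, virtio, ...)
                if (line.contains("eth0") || line.contains("blkif") || line.contains("ahci"))
                    System.out.println(line.trim());
            }
        }
    }

The columns with the large counts show which CPUs are actually servicing those 
interrupts, i.e. the CPUs to keep cassandra's threads away from.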

The remnants of all my coding on thread affinity can be found here: 
https://github.com/jasobrown/cassandra/tree/1632_threadAffinity

Testing:

env: ec2, three nodes in us-west-1
m2.4xlarge, 8 processors, 68GB RAM
Linux 3.2.0.52 (Ubuntu 12.10 LTS)
cassandra - 8GB heap, 800MB new gen
traffic-generating application: @belliottsmith's improved cassandra-stress 
(https://github.com/belliottsmith/cassandra/tree/iss-6199-stress, 47a96c1f5557f)


Compared with trunk (during early-mid Nov 2013), I found the performance of 
EqualSpreadCpuAffinityStrategy to be about 10-20% worse (throughput and 
latency). The FirstAssignmentCpuAffinityStrategy was more or less on par with 
trunk.

I found that while thread affinity reduced the CPU migrations of threads (as 
measured by 'perf stat -p $pid') by an order of magnitude, there was no 
appreciable efficiency gain to cassandra as a whole. As I don't have access to 
the PMU on ec2 instances (for obtaining metrics like L1/L2/L3 cache hit ratios 
from perf), I could not measure whether the thread affinity code actually made 
the entire process more efficient. Either way, the latency and throughput 
metrics obtained at the client (cassandra-stress) did not bear out an overall 
improvement in cassandra.

If the best we can do is match the kernel's scheduler, I feel comfortable with 
putting the thread affinity work aside for now, unless anybody finds something 
I've missed or misunderstood.


> Thread workflow and cpu affinity
> --------------------------------
>
>                 Key: CASSANDRA-1632
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1632
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Chris Goffinet
>            Assignee: Jason Brown
>              Labels: performance
>         Attachments: threadAff_reads.txt, threadAff_writes.txt
>
>
> Here are some thoughts I wanted to write down; we need to run some serious 
> benchmarks to see the benefits:
> 1) All thread pools for our stages use a shared queue per stage. For some 
> stages we could move to a model where each thread has its own queue. This 
> would reduce lock contention on the shared queue. This workload only suits 
> the stages that have no variance, else you run into thread starvation. Some 
> stages where this might work: ROW-MUTATION.
> 2) Set cpu affinity for each thread in each stage. If we can pin threads to 
> specific cores, and control the workflow of a message from Thrift down to 
> each stage, we should see improvements in reducing L1 cache misses. We would 
> need to build a JNI extension (to set cpu affinity), as I could not find 
> anywhere in the JDK where it was exposed. 
> 3) Batching the delivery of requests across stage boundaries. Peter Schuller 
> hasn't looked deep enough yet into the JDK, but he thinks there may be 
> significant improvements to be had there, especially in high-throughput 
> situations, if on each consumption you were to consume everything in the 
> queue rather than implying a synchronization point in between each request.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
