[ https://issues.apache.org/jira/browse/CASSANDRA-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13828301#comment-13828301 ]
Jason Brown commented on CASSANDRA-1632:
----------------------------------------

OK, so after two months of trying to get thread affinity to rock the world, I have to admit that I can't get better performance than what we currently get with the kernel scheduler. Details of code/testing follow, but my hunch as to why I didn't see a boost, and in most cases saw degradation, comes down to:

- we use many, many threads (over 1000 on some of Netflix's production servers, currently running c* 1.1), more than the count of processors. Most studies/literature I found about thread affinity used a thread count less than or equal to the cpu count, so not much help there.
- a given request will be passed across several threads/pools, and in order to get the best cpu cache coherency for that data (which, I believe, would be the best goal for this work), its path through the code base should be on the same processor (for L1/L2 cache hits) or on the same core (hopefully hitting L3 cache). I briefly thought about working this in, but the heavy use of statics and singletons makes this a tricky and daunting task, as I would need to either 'kill the damn singletons' (thx, @driftx :)) or carry around some complicated thread state in thread local-like structures.

First off, Peter Lawrey's thread affinity library (https://github.com/peter-lawrey/Java-Thread-Affinity) was reasonably easy to use, and I was able to easily pin threads to processors. I then created several cpu affinity strategies:

- EqualSpreadCpuAffinityStrategy - Intended to be used with an AbstractExecutorService (ThreadPoolExecutor or ForkJoinPool), this implementation assigns each successively created thread within a pool to the next sequential CPU, thus attempting to spread the threads equally amongst the CPUs. As we have several AbstractExecutorService instances inside of cassandra, this implementation picks a random CPU to start with, which avoids overloading the lower-numbered CPUs.
- FirstAssignmentCpuAffinityStrategy - pins a thread to the first cpu the kernel assigned it to.
- NoCpuAffinityStrategy - NOP; used for comparison vs. the others, but mainly for comparing vs. trunk.

I applied these strategies (one at a time) to the various ThreadPoolExecutors we have (via overloading the NamedThreadFactory). After many variations, I ended up just applying the thread affinity to several key places, including OTC, ITC, READ & MUTATE stages, as well as the native protocol's RequestThreadPoolExecutor and TDisruptorServer (you can see my hacked-up version of the disruptor-thrift lib at https://github.com/jasobrown/disruptor_thrift_server/tree/thread_affinity).

One thing I discovered was that by isolating cassandra from the CPUs that handle IRQs (esp. disk and network IO), I did get a modest bump in throughput (~5%). As it depends on the kernel and the OS's configuration whether the disk/network IRQs are pinned to one (or more) specific CPUs, this is a little difficult to abstract out. Anecdotally, the ec2 instances that I used for testing always assigned cpu0 for blkio (disk) and cpu1 for network (eth0). Spot checking other Netflix instances, of different instance types and even different kernel versions, showed that the IRQ distribution was not consistent across our nodes (very similar, but not the same). While we could create a spin-off ticket to have cassandra isolate itself from those cpus, I think that work should be explored outside this ticket.
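To make the strategy mechanics concrete, here is a minimal sketch of the EqualSpreadCpuAffinityStrategy idea as a ThreadFactory. The class and method names here are illustrative, not taken from my branch, and the actual pinning call (shown only as a comment) would go through the affinity library:

```java
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only -- names are hypothetical, not from the 1632 branch.
// Each thread the factory creates is assigned the next sequential CPU,
// starting from a random CPU so that multiple pools don't all pile onto cpu0.
public class EqualSpreadAffinityThreadFactory implements ThreadFactory {
    private final int cpuCount;
    private final AtomicInteger next;

    public EqualSpreadAffinityThreadFactory(int cpuCount) {
        this.cpuCount = cpuCount;
        this.next = new AtomicInteger(ThreadLocalRandom.current().nextInt(cpuCount));
    }

    // Which CPU the next created thread would be pinned to (round-robin).
    int nextCpu() {
        return Math.floorMod(next.getAndIncrement(), cpuCount);
    }

    @Override
    public Thread newThread(Runnable task) {
        final int cpu = nextCpu();
        return new Thread(() -> {
            // In the real code the pin would happen here, before running the
            // task, via Peter Lawrey's affinity library (assumption).
            task.run();
        });
    }
}
```

FirstAssignmentCpuAffinityStrategy differs only in that it skips the round-robin bookkeeping and pins to whatever CPU the thread first lands on.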
The remnants of all my coding on thread affinity can be found here: https://github.com/jasobrown/cassandra/tree/1632_threadAffinity

Testing:
env: ec2, three nodes in us-west-1, m2.4xlarge (8 processors, 68G RAM), Linux 3.2.0.52 (Ubuntu 12.10 LTS)
cassandra: 8GB heap, 800 new gen
traffic-generating application: @belliottsmith's improved cassandra-stress (https://github.com/belliottsmith/cassandra/tree/iss-6199-stress, 47a96c1f5557f)

Compared with trunk (during early-mid Nov 2013), I found the performance of EqualSpreadCpuAffinityStrategy to be about 10-20% worse (throughput and latency). The FirstAssignmentCpuAffinityStrategy was more or less on par with trunk. I found that while thread affinity reduced the CPU migrations of threads (as measured by 'perf stat -p $pid') by an order of magnitude, there was no appreciable efficiency gain to cassandra as a whole. As I don't have access to the PMU on ec2 instances (for obtaining metrics like L1/2/3 cache hit ratios from perf), I could not measure whether the thread affinity code actually made the entire process more efficient. Either way, the latency and throughput metrics obtained at the client (cassandra-stress) did not bear out an overall improvement in cassandra. If the best we can do is match the kernel's scheduler, I feel comfortable with setting the thread affinity work aside for now, unless anybody finds something I've missed or misunderstood.

> Thread workflow and cpu affinity
> --------------------------------
>
> Key: CASSANDRA-1632
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1632
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Chris Goffinet
> Assignee: Jason Brown
> Labels: performance
> Attachments: threadAff_reads.txt, threadAff_writes.txt
>
> Here are some thoughts I wanted to write down, we need to run some serious benchmarks to see the benefits:
> 1) All thread pools for our stages use a shared queue per stage.
> For some stages we could move to a model where each thread has its own queue. This would reduce lock contention on the shared queue. This workload only suits the stages that have no variance, else you run into thread starvation. Some stages that this might work: ROW-MUTATION.
> 2) Set cpu affinity for each thread in each stage. If we can pin threads to specific cores, and control the workflow of a message from Thrift down to each stage, we should see improvements on reducing L1 cache misses. We would need to build a JNI extension (to set cpu affinity), as I could not find anywhere in the JDK where it was exposed.
> 3) Batching the delivery of requests across stage boundaries. Peter Schuller hasn't looked deep enough yet into the JDK, but he thinks there may be significant improvements to be had there. Especially in high-throughput situations. If on each consumption you were to consume everything in the queue, rather than implying a synchronization point in between each request.

--
This message was sent by Atlassian JIRA
(v6.1#6144)
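For what it's worth, the batching idea in point 3 of the original description has a direct JDK analogue: BlockingQueue.drainTo lets a consumer block for the first request and then take everything else already queued in one operation, rather than paying a synchronization point per request. A minimal sketch (illustrative only, not code from this ticket; the BatchConsumer name is hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of batched consumption across a stage boundary: block for the first
// item, then drain whatever else has accumulated in a single operation.
public final class BatchConsumer {
    public static <T> List<T> takeBatch(BlockingQueue<T> queue) throws InterruptedException {
        List<T> batch = new ArrayList<>();
        batch.add(queue.take());   // block until at least one request arrives
        queue.drainTo(batch);      // then grab everything else in one shot
        return batch;
    }
}
```

This is orthogonal to affinity, but it is the cheapest way I know of to remove the per-request synchronization point the description mentions.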