[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-4718: Attachment: E600M_summary_key_s.svg E100M_summary_key_s.svg E10M_summary_key_s.svg Attached are some graphs of running reads with exponential key distributions over different key ranges (10M, 100M, 600M) - though over the same dataset (600M keys), on a 2-node c3.8xlarge cluster, with 1 c3.8xlarge load generator. All of the thread counts were run for 1M operations. These are the last of a sequence of test runs as, initially, by far the biggest determining factor was page cache behaviour - with each iteration the page cache's page retention algorithm got progressively better at retaining the best subset of the pages to service the requests. Note also the scales - in particular the 600M test both are within 1% of each other performance-wise, and I would note that the variability was greater than 1%, and that in prior runs the positions were reversed. More-efficient ExecutorService for improved throughput -- Key: CASSANDRA-4718 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718 Project: Cassandra Issue Type: Improvement Reporter: Jonathan Ellis Assignee: Benedict Priority: Minor Labels: performance Fix For: 2.1.0 Attachments: 4718-v1.patch, E100M_summary_key_s.svg, E10M_summary_key_s.svg, E600M_summary_key_s.svg, PerThreadQueue.java, austin_diskbound_read.svg, aws.svg, aws_read.svg, backpressure-stress.out.txt, baq vs trunk.png, belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, jason_run1.svg, jason_run2.svg, jason_run3.svg, jason_write.svg, op costs of various queues.ods, stress op rate with various queues.ods, stress_2014May15.txt, stress_2014May16.txt, v1-stress.out Currently all our execution stages dequeue tasks one at a time. This can result in contention between producers and consumers (although we do our best to minimize this by using LinkedBlockingQueue). One approach to mitigating this would be to make consumer threads do more work in bulk instead of just one task per dequeue. (Producer threads tend to be single-task oriented by nature, so I don't see an equivalent opportunity there.) BlockingQueue has a drainTo(collection, int) method that would be perfect for this. However, no ExecutorService in the jdk supports using drainTo, nor could I google one. What I would like to do here is create just such a beast and wire it into (at least) the write and read stages. (Other possible candidates for such an optimization, such as the CommitLog and OutboundTCPConnection, are not ExecutorService-based and will need to be one-offs.) AbstractExecutorService may be useful. The implementations of ICommitLogExecutorService may also be useful. (Despite the name these are not actual ExecutorServices, although they share the most important properties of one.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-4718: Attachment: jason_run3.svg jason_run2.svg jason_run1.svg More-efficient ExecutorService for improved throughput -- Key: CASSANDRA-4718 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718 Project: Cassandra Issue Type: Improvement Reporter: Jonathan Ellis Assignee: Benedict Priority: Minor Labels: performance Fix For: 2.1.0 Attachments: 4718-v1.patch, PerThreadQueue.java, austin_diskbound_read.svg, aws.svg, aws_read.svg, backpressure-stress.out.txt, baq vs trunk.png, belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, jason_run1.svg, jason_run2.svg, jason_run3.svg, jason_write.svg, op costs of various queues.ods, stress op rate with various queues.ods, stress_2014May15.txt, stress_2014May16.txt, v1-stress.out Currently all our execution stages dequeue tasks one at a time. This can result in contention between producers and consumers (although we do our best to minimize this by using LinkedBlockingQueue). One approach to mitigating this would be to make consumer threads do more work in bulk instead of just one task per dequeue. (Producer threads tend to be single-task oriented by nature, so I don't see an equivalent opportunity there.) BlockingQueue has a drainTo(collection, int) method that would be perfect for this. However, no ExecutorService in the jdk supports using drainTo, nor could I google one. What I would like to do here is create just such a beast and wire it into (at least) the write and read stages. (Other possible candidates for such an optimization, such as the CommitLog and OutboundTCPConnection, are not ExecutorService-based and will need to be one-offs.) AbstractExecutorService may be useful. The implementations of ICommitLogExecutorService may also be useful. (Despite the name these are not actual ExecutorServices, although they share the most important properties of one.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Brown updated CASSANDRA-4718: --- Attachment: stress_2014May15.txt Adding more fuel to the testing fire, i took a first pass at having a large of amount of data on disk (~2x the memory size of each box), and running the read tests - see attached file: stress_2014May15.txt. I cleared the page cache before switching to each branch from for the reads, and then performed 3 rounds of stress. The goal here was to see how the sep branch compared with cassandra-2.1 when doing most of the reads from disk (with a cold page cache, or where the cache is constantly churning due to new blocks being pulled in). The short story is the sep branch performs slightly worse the current cassandra-2.1 (which includes CASSANDRA-5663) on both ops/s and latencies. I'm going to do one more test where I preload a good chunk of the data into the page cache, then run the stress - hopefully to emulate the case where most reads come from the page cache and some go to disk. Will me to use a less naive key distribution alg, to ensure that we hit the hot keys, which is provided by stress. More-efficient ExecutorService for improved throughput -- Key: CASSANDRA-4718 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718 Project: Cassandra Issue Type: Improvement Reporter: Jonathan Ellis Assignee: Benedict Priority: Minor Labels: performance Fix For: 2.1.0 Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, aws_read.svg, backpressure-stress.out.txt, baq vs trunk.png, belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, jason_write.svg, op costs of various queues.ods, stress op rate with various queues.ods, stress_2014May15.txt, v1-stress.out Currently all our execution stages dequeue tasks one at a time. This can result in contention between producers and consumers (although we do our best to minimize this by using LinkedBlockingQueue). One approach to mitigating this would be to make consumer threads do more work in bulk instead of just one task per dequeue. (Producer threads tend to be single-task oriented by nature, so I don't see an equivalent opportunity there.) BlockingQueue has a drainTo(collection, int) method that would be perfect for this. However, no ExecutorService in the jdk supports using drainTo, nor could I google one. What I would like to do here is create just such a beast and wire it into (at least) the write and read stages. (Other possible candidates for such an optimization, such as the CommitLog and OutboundTCPConnection, are not ExecutorService-based and will need to be one-offs.) AbstractExecutorService may be useful. The implementations of ICommitLogExecutorService may also be useful. (Despite the name these are not actual ExecutorServices, although they share the most important properties of one.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Brown updated CASSANDRA-4718: --- Attachment: stress_2014May16.txt Next attachment (stress_2014May16). After I created a new set of data (about 300GB per node), generated with populate=x (rather than uniform=0..x), I dropped the page cache, then I attempted to pre-populate the page cache by running a warm up read round, like this: {code}./tools/bin/cassandra-stress read n=180664790 -key dist=extr\(1..6,2\) -rate threads=75 -mode native prepared cql3 -port native=9043 thrift=9161 -node {code} I used the value 180664790 from stress's print function, which can give you a reasonable dist count to hit a certain percentage of coverage for the different models (note: i may be completely incorrect on this point, so feel free to correct my understanding). Then I ran the same 3 stress tests i ran in yesterday's run (one where I load all 21 cols with the row, load the default count (5), then only iterate over the default key count (1mil)). I then cleared the page cache and performed the same testing for the other branch (using the existing data, but warming up the cache). The results look like the sep branch had lower latencies vs. cassandra 2.1 (same as yesterday), but less ops/sec. tbh, I'm not sure I proved a whole lot since yesterday’s tests, as dstat showed I was loading 200-300Mb of data per second, so I was certainly giving the page cache a good workout in any case. More-efficient ExecutorService for improved throughput -- Key: CASSANDRA-4718 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718 Project: Cassandra Issue Type: Improvement Reporter: Jonathan Ellis Assignee: Benedict Priority: Minor Labels: performance Fix For: 2.1.0 Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, aws_read.svg, backpressure-stress.out.txt, baq vs trunk.png, belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, jason_write.svg, op costs of various queues.ods, stress op rate with various queues.ods, stress_2014May15.txt, stress_2014May16.txt, v1-stress.out Currently all our execution stages dequeue tasks one at a time. This can result in contention between producers and consumers (although we do our best to minimize this by using LinkedBlockingQueue). One approach to mitigating this would be to make consumer threads do more work in bulk instead of just one task per dequeue. (Producer threads tend to be single-task oriented by nature, so I don't see an equivalent opportunity there.) BlockingQueue has a drainTo(collection, int) method that would be perfect for this. However, no ExecutorService in the jdk supports using drainTo, nor could I google one. What I would like to do here is create just such a beast and wire it into (at least) the write and read stages. (Other possible candidates for such an optimization, such as the CommitLog and OutboundTCPConnection, are not ExecutorService-based and will need to be one-offs.) AbstractExecutorService may be useful. The implementations of ICommitLogExecutorService may also be useful. (Despite the name these are not actual ExecutorServices, although they share the most important properties of one.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-4718: Attachment: austin_diskbound_read.svg For reference, a plot of reads from a much larger than memory dataset against the austin cluster. There's a lot of variability, but no appreciable difference between the two, which is what I would expect (but good to confirm) More-efficient ExecutorService for improved throughput -- Key: CASSANDRA-4718 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718 Project: Cassandra Issue Type: Improvement Reporter: Jonathan Ellis Assignee: Benedict Priority: Minor Labels: performance Fix For: 2.1.0 Attachments: 4718-v1.patch, PerThreadQueue.java, austin_diskbound_read.svg, aws.svg, aws_read.svg, backpressure-stress.out.txt, baq vs trunk.png, belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, jason_write.svg, op costs of various queues.ods, stress op rate with various queues.ods, stress_2014May15.txt, stress_2014May16.txt, v1-stress.out Currently all our execution stages dequeue tasks one at a time. This can result in contention between producers and consumers (although we do our best to minimize this by using LinkedBlockingQueue). One approach to mitigating this would be to make consumer threads do more work in bulk instead of just one task per dequeue. (Producer threads tend to be single-task oriented by nature, so I don't see an equivalent opportunity there.) BlockingQueue has a drainTo(collection, int) method that would be perfect for this. However, no ExecutorService in the jdk supports using drainTo, nor could I google one. What I would like to do here is create just such a beast and wire it into (at least) the write and read stages. (Other possible candidates for such an optimization, such as the CommitLog and OutboundTCPConnection, are not ExecutorService-based and will need to be one-offs.) AbstractExecutorService may be useful. The implementations of ICommitLogExecutorService may also be useful. (Despite the name these are not actual ExecutorServices, although they share the most important properties of one.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-4718: Attachment: jason_write.svg jason_read_latency.svg jason_read.svg aws.svg I've attached read throughput graphs for a run on a 2 node AWS c3.8xlarge with 1 load generating node (i've ignored write throughputs for now as the local disks are atrocious, so the numbers are pretty meaningless and all over the place), and graphs for Jason's runs read/write runs, including the read latencies (most representative of networking as they should all in-memory operations) It looks like on modern/patched kernels and/or hyperthreaded modern CPUs the lse-batchnetty patch wins hands down - latency and throughput are all substantially better. I'll now try to see if a little bit of tuning can yield better results for all systems, as the prior version of the LSE branch was hitting the same throughput as 2.1-batchnetty does on our internal no-hyperthreading 2.6 clusters. So if we can bring that back up the case should be pretty compelling. More-efficient ExecutorService for improved throughput -- Key: CASSANDRA-4718 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718 Project: Cassandra Issue Type: Improvement Reporter: Jonathan Ellis Assignee: Jason Brown Priority: Minor Labels: performance Fix For: 2.1.0 Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, backpressure-stress.out.txt, baq vs trunk.png, belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, jason_write.svg, op costs of various queues.ods, stress op rate with various queues.ods, v1-stress.out Currently all our execution stages dequeue tasks one at a time. This can result in contention between producers and consumers (although we do our best to minimize this by using LinkedBlockingQueue). One approach to mitigating this would be to make consumer threads do more work in bulk instead of just one task per dequeue. (Producer threads tend to be single-task oriented by nature, so I don't see an equivalent opportunity there.) BlockingQueue has a drainTo(collection, int) method that would be perfect for this. However, no ExecutorService in the jdk supports using drainTo, nor could I google one. What I would like to do here is create just such a beast and wire it into (at least) the write and read stages. (Other possible candidates for such an optimization, such as the CommitLog and OutboundTCPConnection, are not ExecutorService-based and will need to be one-offs.) AbstractExecutorService may be useful. The implementations of ICommitLogExecutorService may also be useful. (Despite the name these are not actual ExecutorServices, although they share the most important properties of one.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Brown updated CASSANDRA-4718: --- Attachment: belliotsmith_branches-stress.out.txt attached file (belliotsmith_branches.out) are my results from running the latest branches on the same hardware I described above. More-efficient ExecutorService for improved throughput -- Key: CASSANDRA-4718 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718 Project: Cassandra Issue Type: Improvement Reporter: Jonathan Ellis Assignee: Jason Brown Priority: Minor Labels: performance Fix For: 2.1.0 Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, backpressure-stress.out.txt, baq vs trunk.png, belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, jason_write.svg, op costs of various queues.ods, stress op rate with various queues.ods, v1-stress.out Currently all our execution stages dequeue tasks one at a time. This can result in contention between producers and consumers (although we do our best to minimize this by using LinkedBlockingQueue). One approach to mitigating this would be to make consumer threads do more work in bulk instead of just one task per dequeue. (Producer threads tend to be single-task oriented by nature, so I don't see an equivalent opportunity there.) BlockingQueue has a drainTo(collection, int) method that would be perfect for this. However, no ExecutorService in the jdk supports using drainTo, nor could I google one. What I would like to do here is create just such a beast and wire it into (at least) the write and read stages. (Other possible candidates for such an optimization, such as the CommitLog and OutboundTCPConnection, are not ExecutorService-based and will need to be one-offs.) AbstractExecutorService may be useful. The implementations of ICommitLogExecutorService may also be useful. (Despite the name these are not actual ExecutorServices, although they share the most important properties of one.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-4718: Attachment: aws_read.svg For those interested, comparison of the latest version versus 2.1-batchnetty (dropping other branches for comparison now). More-efficient ExecutorService for improved throughput -- Key: CASSANDRA-4718 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718 Project: Cassandra Issue Type: Improvement Reporter: Jonathan Ellis Assignee: Benedict Priority: Minor Labels: performance Fix For: 2.1.0 Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, aws_read.svg, backpressure-stress.out.txt, baq vs trunk.png, belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, jason_write.svg, op costs of various queues.ods, stress op rate with various queues.ods, v1-stress.out Currently all our execution stages dequeue tasks one at a time. This can result in contention between producers and consumers (although we do our best to minimize this by using LinkedBlockingQueue). One approach to mitigating this would be to make consumer threads do more work in bulk instead of just one task per dequeue. (Producer threads tend to be single-task oriented by nature, so I don't see an equivalent opportunity there.) BlockingQueue has a drainTo(collection, int) method that would be perfect for this. However, no ExecutorService in the jdk supports using drainTo, nor could I google one. What I would like to do here is create just such a beast and wire it into (at least) the write and read stages. (Other possible candidates for such an optimization, such as the CommitLog and OutboundTCPConnection, are not ExecutorService-based and will need to be one-offs.) AbstractExecutorService may be useful. The implementations of ICommitLogExecutorService may also be useful. (Despite the name these are not actual ExecutorServices, although they share the most important properties of one.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-4718: Reviewer: Jason Brown (was: Benedict) More-efficient ExecutorService for improved throughput -- Key: CASSANDRA-4718 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718 Project: Cassandra Issue Type: Improvement Reporter: Jonathan Ellis Assignee: Benedict Priority: Minor Labels: performance Fix For: 2.1.0 Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, backpressure-stress.out.txt, baq vs trunk.png, belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, jason_write.svg, op costs of various queues.ods, stress op rate with various queues.ods, v1-stress.out Currently all our execution stages dequeue tasks one at a time. This can result in contention between producers and consumers (although we do our best to minimize this by using LinkedBlockingQueue). One approach to mitigating this would be to make consumer threads do more work in bulk instead of just one task per dequeue. (Producer threads tend to be single-task oriented by nature, so I don't see an equivalent opportunity there.) BlockingQueue has a drainTo(collection, int) method that would be perfect for this. However, no ExecutorService in the jdk supports using drainTo, nor could I google one. What I would like to do here is create just such a beast and wire it into (at least) the write and read stages. (Other possible candidates for such an optimization, such as the CommitLog and OutboundTCPConnection, are not ExecutorService-based and will need to be one-offs.) AbstractExecutorService may be useful. The implementations of ICommitLogExecutorService may also be useful. (Despite the name these are not actual ExecutorServices, although they share the most important properties of one.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Brown updated CASSANDRA-4718: --- Attachment: backpressure-stress.out.txt I've run both of [~benedict]'s patches on my test cluster, and the results are quite hopeful (see backpressure-stress.out). Short story is anything we do for the native protocol will at least double the throughput and halve the latency, but that's the low end of the improvements. Benedict's second patch is about 5-10% faster than his first, and both are slightly slower than my original patch. As I think it's important to have back pressure for any pool, at a minimum I think we should go with [~benedict]'s first patch. I like the ideas in the second patch, but would like another day to digest the implementation. More-efficient ExecutorService for improved throughput -- Key: CASSANDRA-4718 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718 Project: Cassandra Issue Type: Improvement Reporter: Jonathan Ellis Assignee: Jason Brown Priority: Minor Labels: performance Fix For: 2.1.0 Attachments: 4718-v1.patch, PerThreadQueue.java, backpressure-stress.out.txt, baq vs trunk.png, op costs of various queues.ods, stress op rate with various queues.ods, v1-stress.out Currently all our execution stages dequeue tasks one at a time. This can result in contention between producers and consumers (although we do our best to minimize this by using LinkedBlockingQueue). One approach to mitigating this would be to make consumer threads do more work in bulk instead of just one task per dequeue. (Producer threads tend to be single-task oriented by nature, so I don't see an equivalent opportunity there.) BlockingQueue has a drainTo(collection, int) method that would be perfect for this. However, no ExecutorService in the jdk supports using drainTo, nor could I google one. What I would like to do here is create just such a beast and wire it into (at least) the write and read stages. (Other possible candidates for such an optimization, such as the CommitLog and OutboundTCPConnection, are not ExecutorService-based and will need to be one-offs.) AbstractExecutorService may be useful. The implementations of ICommitLogExecutorService may also be useful. (Despite the name these are not actual ExecutorServices, although they share the most important properties of one.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-4718: Reviewer: Benedict (was: Pavel Yaskevich) More-efficient ExecutorService for improved throughput -- Key: CASSANDRA-4718 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718 Project: Cassandra Issue Type: Improvement Reporter: Jonathan Ellis Assignee: Jason Brown Priority: Minor Labels: performance Fix For: 2.1 Attachments: 4718-v1.patch, PerThreadQueue.java, baq vs trunk.png, op costs of various queues.ods, stress op rate with various queues.ods, v1-stress.out Currently all our execution stages dequeue tasks one at a time. This can result in contention between producers and consumers (although we do our best to minimize this by using LinkedBlockingQueue). One approach to mitigating this would be to make consumer threads do more work in bulk instead of just one task per dequeue. (Producer threads tend to be single-task oriented by nature, so I don't see an equivalent opportunity there.) BlockingQueue has a drainTo(collection, int) method that would be perfect for this. However, no ExecutorService in the jdk supports using drainTo, nor could I google one. What I would like to do here is create just such a beast and wire it into (at least) the write and read stages. (Other possible candidates for such an optimization, such as the CommitLog and OutboundTCPConnection, are not ExecutorService-based and will need to be one-offs.) AbstractExecutorService may be useful. The implementations of ICommitLogExecutorService may also be useful. (Despite the name these are not actual ExecutorServices, although they share the most important properties of one.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Brown updated CASSANDRA-4718: --- Attachment: 4718-v1.patch v1-stress.out After several month hiatus, digging into this again. This time around, though, I have real hardware to test on, and thus the results are more consistent across executions (no cloud provider intermittently throttling me). Testing the thrift interface (both sync and hsha), the short story is throughput is up ~20% vs. TPE, and 95% / 99%iles are down 40-60%. The 99.9%ile, however, is a bit tricker. In some of my tests it is down almost 80%, and sometimes it is up 40-50%. I need to dig in further to understand what is going on (not sure if it’s because of a shared env, reading across numa cores, and so on). perf and likwid are my friends in this investigation. As to testing the native protocol interface, I’ve only tested writes (new 2.1 stress seems broken on reads) and I get double the throughput and 40-50% lower latencies across the board. My test cluster consists of three machines, 32 cores each, 2 sockets (2 numa cores), 132G memory, 2.6.39 kernel, plus a similar box that generates the load. A couple of notes about this patch: * RequestThreadPoolExecutor now decorates a FJP. Previously we had a TPE which contains, of course, a (bounded) queue. The bounded queue helped with back pressure from incoming requests. By using a FJP, there is no queue to help with back pressure as the FJP always enqueue a task (without blocking). Not sure if we still want/need that back pressure here. * As ForkJoinPool doesn’t expose much in terms of use metrics (like total completed) compared to ThreadPoolExecutor, the ForkJoinPoolMetrics is similarly barren. Not sure if we want to capture this on our own in DFJP or something like else. * I have made similar FJP changes to the disruptor-thrift library, and once this patch is committed, I’ll work with Pavel to make the changes over there and pull in the updated jar. As a side note, looks like the quasar project (http://docs.paralleluniverse.co/quasar/) indicates the jsr166e jar has some optimizations (http://blog.paralleluniverse.co/2013/05/02/quasar-pulsar/) over the jdk7 implementation (that are included in jdk8). I pulled in those changes and stress tested, but didn’t see much of a difference for our use case. I can, however, pull them in again if any one feels strongly. More-efficient ExecutorService for improved throughput -- Key: CASSANDRA-4718 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718 Project: Cassandra Issue Type: Improvement Reporter: Jonathan Ellis Assignee: Jason Brown Priority: Minor Labels: performance Fix For: 2.1 Attachments: 4718-v1.patch, PerThreadQueue.java, baq vs trunk.png, op costs of various queues.ods, stress op rate with various queues.ods, v1-stress.out Currently all our execution stages dequeue tasks one at a time. This can result in contention between producers and consumers (although we do our best to minimize this by using LinkedBlockingQueue). One approach to mitigating this would be to make consumer threads do more work in bulk instead of just one task per dequeue. (Producer threads tend to be single-task oriented by nature, so I don't see an equivalent opportunity there.) BlockingQueue has a drainTo(collection, int) method that would be perfect for this. However, no ExecutorService in the jdk supports using drainTo, nor could I google one. What I would like to do here is create just such a beast and wire it into (at least) the write and read stages. (Other possible candidates for such an optimization, such as the CommitLog and OutboundTCPConnection, are not ExecutorService-based and will need to be one-offs.) AbstractExecutorService may be useful. The implementations of ICommitLogExecutorService may also be useful. (Despite the name these are not actual ExecutorServices, although they share the most important properties of one.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-4718: Attachment: stress op rate with various queues.ods More-efficient ExecutorService for improved throughput -- Key: CASSANDRA-4718 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718 Project: Cassandra Issue Type: Improvement Reporter: Jonathan Ellis Assignee: Jason Brown Priority: Minor Labels: performance Attachments: baq vs trunk.png, op costs of various queues.ods, PerThreadQueue.java, stress op rate with various queues.ods Currently all our execution stages dequeue tasks one at a time. This can result in contention between producers and consumers (although we do our best to minimize this by using LinkedBlockingQueue). One approach to mitigating this would be to make consumer threads do more work in bulk instead of just one task per dequeue. (Producer threads tend to be single-task oriented by nature, so I don't see an equivalent opportunity there.) BlockingQueue has a drainTo(collection, int) method that would be perfect for this. However, no ExecutorService in the jdk supports using drainTo, nor could I google one. What I would like to do here is create just such a beast and wire it into (at least) the write and read stages. (Other possible candidates for such an optimization, such as the CommitLog and OutboundTCPConnection, are not ExecutorService-based and will need to be one-offs.) AbstractExecutorService may be useful. The implementations of ICommitLogExecutorService may also be useful. (Despite the name these are not actual ExecutorServices, although they share the most important properties of one.) -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-4718: Attachment: op costs of various queues.ods More-efficient ExecutorService for improved throughput -- Key: CASSANDRA-4718 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718 Project: Cassandra Issue Type: Improvement Reporter: Jonathan Ellis Priority: Minor Labels: performance Attachments: baq vs trunk.png, op costs of various queues.ods, PerThreadQueue.java Currently all our execution stages dequeue tasks one at a time. This can result in contention between producers and consumers (although we do our best to minimize this by using LinkedBlockingQueue). One approach to mitigating this would be to make consumer threads do more work in bulk instead of just one task per dequeue. (Producer threads tend to be single-task oriented by nature, so I don't see an equivalent opportunity there.) BlockingQueue has a drainTo(collection, int) method that would be perfect for this. However, no ExecutorService in the jdk supports using drainTo, nor could I google one. What I would like to do here is create just such a beast and wire it into (at least) the write and read stages. (Other possible candidates for such an optimization, such as the CommitLog and OutboundTCPConnection, are not ExecutorService-based and will need to be one-offs.) AbstractExecutorService may be useful. The implementations of ICommitLogExecutorService may also be useful. (Despite the name these are not actual ExecutorServices, although they share the most important properties of one.) -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Ellis updated CASSANDRA-4718: -- Labels: performance (was: ) More-efficient ExecutorService for improved throughput -- Key: CASSANDRA-4718 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718 Project: Cassandra Issue Type: Improvement Reporter: Jonathan Ellis Priority: Minor Labels: performance Attachments: baq vs trunk.png, PerThreadQueue.java Currently all our execution stages dequeue tasks one at a time. This can result in contention between producers and consumers (although we do our best to minimize this by using LinkedBlockingQueue). One approach to mitigating this would be to make consumer threads do more work in bulk instead of just one task per dequeue. (Producer threads tend to be single-task oriented by nature, so I don't see an equivalent opportunity there.) BlockingQueue has a drainTo(collection, int) method that would be perfect for this. However, no ExecutorService in the jdk supports using drainTo, nor could I google one. What I would like to do here is create just such a beast and wire it into (at least) the write and read stages. (Other possible candidates for such an optimization, such as the CommitLog and OutboundTCPConnection, are not ExecutorService-based and will need to be one-offs.) AbstractExecutorService may be useful. The implementations of ICommitLogExecutorService may also be useful. (Despite the name these are not actual ExecutorServices, although they share the most important properties of one.) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan McGuire updated CASSANDRA-4718: Attachment: baq vs trunk.png I ran a performance test comparing jonathan's baq branch with trunk and saw about a substantial increase in read performance. More-efficient ExecutorService for improved throughput -- Key: CASSANDRA-4718 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718 Project: Cassandra Issue Type: Improvement Reporter: Jonathan Ellis Priority: Minor Attachments: baq vs trunk.png, PerThreadQueue.java Currently all our execution stages dequeue tasks one at a time. This can result in contention between producers and consumers (although we do our best to minimize this by using LinkedBlockingQueue). One approach to mitigating this would be to make consumer threads do more work in bulk instead of just one task per dequeue. (Producer threads tend to be single-task oriented by nature, so I don't see an equivalent opportunity there.) BlockingQueue has a drainTo(collection, int) method that would be perfect for this. However, no ExecutorService in the jdk supports using drainTo, nor could I google one. What I would like to do here is create just such a beast and wire it into (at least) the write and read stages. (Other possible candidates for such an optimization, such as the CommitLog and OutboundTCPConnection, are not ExecutorService-based and will need to be one-offs.) AbstractExecutorService may be useful. The implementations of ICommitLogExecutorService may also be useful. (Despite the name these are not actual ExecutorServices, although they share the most important properties of one.) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vijay updated CASSANDRA-4718: - Attachment: PerThreadQueue.java How about Per thread queue? was curious about this ticket and ran a test and actually it performs better than Executors.newFixedThreadPool... I have to also note that the stages content when we have a lot of CPU's (was testing with 64 core machines) More-efficient ExecutorService for improved throughput -- Key: CASSANDRA-4718 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718 Project: Cassandra Issue Type: Improvement Reporter: Jonathan Ellis Priority: Minor Attachments: PerThreadQueue.java Currently all our execution stages dequeue tasks one at a time. This can result in contention between producers and consumers (although we do our best to minimize this by using LinkedBlockingQueue). One approach to mitigating this would be to make consumer threads do more work in bulk instead of just one task per dequeue. (Producer threads tend to be single-task oriented by nature, so I don't see an equivalent opportunity there.) BlockingQueue has a drainTo(collection, int) method that would be perfect for this. However, no ExecutorService in the jdk supports using drainTo, nor could I google one. What I would like to do here is create just such a beast and wire it into (at least) the write and read stages. (Other possible candidates for such an optimization, such as the CommitLog and OutboundTCPConnection, are not ExecutorService-based and will need to be one-offs.) AbstractExecutorService may be useful. The implementations of ICommitLogExecutorService may also be useful. (Despite the name these are not actual ExecutorServices, although they share the most important properties of one.) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vijay updated CASSANDRA-4718: - Attachment: PerThreadQueue.java More-efficient ExecutorService for improved throughput -- Key: CASSANDRA-4718 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718 Project: Cassandra Issue Type: Improvement Reporter: Jonathan Ellis Priority: Minor Attachments: PerThreadQueue.java Currently all our execution stages dequeue tasks one at a time. This can result in contention between producers and consumers (although we do our best to minimize this by using LinkedBlockingQueue). One approach to mitigating this would be to make consumer threads do more work in bulk instead of just one task per dequeue. (Producer threads tend to be single-task oriented by nature, so I don't see an equivalent opportunity there.) BlockingQueue has a drainTo(collection, int) method that would be perfect for this. However, no ExecutorService in the jdk supports using drainTo, nor could I google one. What I would like to do here is create just such a beast and wire it into (at least) the write and read stages. (Other possible candidates for such an optimization, such as the CommitLog and OutboundTCPConnection, are not ExecutorService-based and will need to be one-offs.) AbstractExecutorService may be useful. The implementations of ICommitLogExecutorService may also be useful. (Despite the name these are not actual ExecutorServices, although they share the most important properties of one.) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vijay updated CASSANDRA-4718: - Attachment: (was: PerThreadQueue.java) More-efficient ExecutorService for improved throughput -- Key: CASSANDRA-4718 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718 Project: Cassandra Issue Type: Improvement Reporter: Jonathan Ellis Priority: Minor Attachments: PerThreadQueue.java Currently all our execution stages dequeue tasks one at a time. This can result in contention between producers and consumers (although we do our best to minimize this by using LinkedBlockingQueue). One approach to mitigating this would be to make consumer threads do more work in bulk instead of just one task per dequeue. (Producer threads tend to be single-task oriented by nature, so I don't see an equivalent opportunity there.) BlockingQueue has a drainTo(collection, int) method that would be perfect for this. However, no ExecutorService in the jdk supports using drainTo, nor could I google one. What I would like to do here is create just such a beast and wire it into (at least) the write and read stages. (Other possible candidates for such an optimization, such as the CommitLog and OutboundTCPConnection, are not ExecutorService-based and will need to be one-offs.) AbstractExecutorService may be useful. The implementations of ICommitLogExecutorService may also be useful. (Despite the name these are not actual ExecutorServices, although they share the most important properties of one.) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira