[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062217#comment-14062217 ]

Jonathan Ellis commented on CASSANDRA-4718:
-------------------------------------------

To summarize for those browsing, the primary result here was the introduction of SharedExecutorPool: https://github.com/apache/cassandra/blob/cassandra-2.1.0/src/java/org/apache/cassandra/concurrent/SharedExecutorPool.java

> More-efficient ExecutorService for improved throughput
> ------------------------------------------------------
>
>                 Key: CASSANDRA-4718
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jonathan Ellis
>            Assignee: Benedict
>            Priority: Minor
>              Labels: performance
>             Fix For: 2.1 rc1
>
>         Attachments: 4718-v1.patch, E100M_summary_key_s.svg,
> E10M_summary_key_s.svg, E600M_summary_key_s.svg, PerThreadQueue.java,
> austin_diskbound_read.svg, aws.svg, aws_read.svg,
> backpressure-stress.out.txt, baq vs trunk.png,
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg,
> jason_run1.svg, jason_run2.svg, jason_run3.svg, jason_write.svg, op costs of
> various queues.ods, stress op rate with various queues.ods,
> stress_2014May15.txt, stress_2014May16.txt, v1-stress.out
>
>
> Currently all our execution stages dequeue tasks one at a time. This can
> result in contention between producers and consumers (although we do our best
> to minimize this by using LinkedBlockingQueue).
>
> One approach to mitigating this would be to make consumer threads do more
> work in "bulk" instead of just one task per dequeue. (Producer threads tend
> to be single-task oriented by nature, so I don't see an equivalent
> opportunity there.)
>
> BlockingQueue has a drainTo(collection, int) method that would be perfect for
> this. However, no ExecutorService in the jdk supports using drainTo, nor
> could I google one.
>
> What I would like to do here is create just such a beast and wire it into (at
> least) the write and read stages. (Other possible candidates for such an
> optimization, such as the CommitLog and OutboundTCPConnection, are not
> ExecutorService-based and will need to be one-offs.)
>
> AbstractExecutorService may be useful. The implementations of
> ICommitLogExecutorService may also be useful. (Despite the name these are not
> actual ExecutorServices, although they share the most important properties of
> one.)

--
This message was sent by Atlassian JIRA
(v6.2#6252)
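The bulk-dequeue idea in the issue description can be sketched roughly as below. This is a minimal illustration of a consumer that uses `BlockingQueue.drainTo(collection, int)`, not the SharedExecutorPool that was eventually committed. The class name `BatchDrainingWorker`, the `maxBatch` parameter, and the polling timeout are assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of bulk dequeueing; NOT the SharedExecutorPool design.
final class BatchDrainingWorker implements Runnable {
    private final BlockingQueue<Runnable> queue;
    private final int maxBatch;               // assumed tuning knob, not from the ticket
    private volatile boolean running = true;

    BatchDrainingWorker(BlockingQueue<Runnable> queue, int maxBatch) {
        if (maxBatch < 1) throw new IllegalArgumentException("maxBatch must be >= 1");
        this.queue = queue;
        this.maxBatch = maxBatch;
    }

    /**
     * Waits up to timeoutMillis for one task, then drains up to maxBatch - 1
     * more that are already queued, and runs them all on the calling thread.
     * Returns the number of tasks run.
     */
    int drainAndRun(long timeoutMillis) {
        Runnable first;
        try {
            // Block for the first task so an idle worker does not spin.
            first = queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();   // preserve the interrupt for the caller
            return 0;
        }
        if (first == null) return 0;              // nothing arrived within the timeout

        List<Runnable> batch = new ArrayList<>(maxBatch);
        batch.add(first);
        // One queue operation for the rest of the batch instead of one per task.
        queue.drainTo(batch, maxBatch - 1);
        for (Runnable task : batch) task.run();
        return batch.size();
    }

    @Override
    public void run() {
        while (running && !Thread.currentThread().isInterrupted()) drainAndRun(100);
    }

    void stop() { running = false; }
}
```

With ten queued tasks and `maxBatch` set to 4, successive `drainAndRun` calls would process 4, 4, and 2 tasks, touching the queue three times rather than ten. A real stage would still need the pieces this sketch leaves out, such as thread management, shutdown semantics, and exception handling per task.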
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012945#comment-14012945 ]

Benedict commented on CASSANDRA-4718:
-------------------------------------

Committed with that doc reworded
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012513#comment-14012513 ]

Jason Brown commented on CASSANDRA-4718:
----------------------------------------

nit: please clean up the documentation grammar a little bit above the SEPWorker.prevStopCheck declaration. Other than that, +1.
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009595#comment-14009595 ]

Benedict commented on CASSANDRA-4718:
-------------------------------------

I've uploaded an update to the branch which should permit spinning down the last thread much more simply (and correctly) than the previous patch did. I've also retested it to confirm the performance characteristics remain intact.
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008032#comment-14008032 ]

Jason Brown commented on CASSANDRA-4718:
----------------------------------------

Spoke with [~benedict] about his latest change (to spin down the last thread). We realized it had a bug, and he'll take another couple of days to fix it or find a better solution.
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007043#comment-14007043 ]

Benedict commented on CASSANDRA-4718:
-------------------------------------

I've uploaded an updated branch. I tweaked a couple of minor things (I let the last thread spin down, which couldn't happen before), and re-ran my battery of tests to confirm the performance is still good. I've introduced a new long test and fixed the ClassCastException on the logger thread. Should be good to go.
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005093#comment-14005093 ]

Benedict commented on CASSANDRA-4718:
-------------------------------------

Thanks! I just need to tidy up those two minor bugs you spotted, and I'm adding a long test as well so it has some isolated testing for future work. Should be ready to commit this evening.
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005087#comment-14005087 ]

Jason Brown commented on CASSANDRA-4718:
----------------------------------------

+1 on the current 4718-sep branch.

FTR, [~benedict] and I have worked closely on the code for the last several weeks, and I've provided direct feedback to him about problems/concerns.
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003406#comment-14003406 ]

Benedict commented on CASSANDRA-4718:
-------------------------------------

One thing worth mentioning is that the size of the dataset over which this is effective is not necessarily represented accurately by the test, as it was run over a fully-compacted dataset, so the 10M keys would have been randomly distributed across all pages (we select from a prefix of the key range, but the murmur hash will get evenly distributed across the entire dataset once fully compacted). Were this run on a real dataset, with the most recent data being compacted separately to the older data, and the most recent data being hit primarily, there would be greater locality of data access and so any gains should be effective over a larger quantity of data.
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003213#comment-14003213 ]

Jonathan Ellis commented on CASSANDRA-4718:
-------------------------------------------

This looks exactly like I would have predicted: the more disk-bound the workload is, the less the executorservice matters. But when our hot dataset is cacheable (10M and to a lesser degree 100M) -sep is a clear win. This is the scenario that we ought to be optimizing for.
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001621#comment-14001621 ]

Jonathan Ellis commented on CASSANDRA-4718:
-------------------------------------------

bq. I can patch stress briefly to force it to run all thread counts in the requested range, instead of stopping when it hits a plateau

That sounds like a good option to have.

bq. when we did May 15 (which is completely different test btw, addressing your point from previous comment) there was almost no disk activity after original page cache warm up

That doesn't sound right to me; all the numbers from May 15 are 7k-14k ops/s, which is disk-bound territory.
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000839#comment-14000839 ]

Benedict commented on CASSANDRA-4718:
-------------------------------------

Compression is disabled by stress by default.
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000836#comment-14000836 ] Pavel Yaskevich commented on CASSANDRA-4718: 250K op/s archived when test was running with almost no data, total duration was around a minute or so. We are, on the other hand, trying to make it more realistic in terms of data amount, I'm not sure about the tests from May 16 but when we did May 15 (which is completely different test btw, addressing your point from previous comment) there was almost no disk activity after original page cache warm up. If you can please patch the test to do runs with all of the threads and once we re-run I will also check disk activity, but I'm pretty sure it would be minimal, reading from the page cache cost is not as efficient are reading from anonymous area plus it does a syscall with compression (which is used by default) so I'm not surprised that op/s degraded. > More-efficient ExecutorService for improved throughput > -- > > Key: CASSANDRA-4718 > URL: https://issues.apache.org/jira/browse/CASSANDRA-4718 > Project: Cassandra > Issue Type: Improvement >Reporter: Jonathan Ellis >Assignee: Benedict >Priority: Minor > Labels: performance > Fix For: 2.1.0 > > Attachments: 4718-v1.patch, PerThreadQueue.java, > austin_diskbound_read.svg, aws.svg, aws_read.svg, > backpressure-stress.out.txt, baq vs trunk.png, > belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, > jason_run1.svg, jason_run2.svg, jason_run3.svg, jason_write.svg, op costs of > various queues.ods, stress op rate with various queues.ods, > stress_2014May15.txt, stress_2014May16.txt, v1-stress.out > > > Currently all our execution stages dequeue tasks one at a time. This can > result in contention between producers and consumers (although we do our best > to minimize this by using LinkedBlockingQueue). 
> One approach to mitigating this would be to make consumer threads do more > work in "bulk" instead of just one task per dequeue. (Producer threads tend > to be single-task oriented by nature, so I don't see an equivalent > opportunity there.) > BlockingQueue has a drainTo(collection, int) method that would be perfect for > this. However, no ExecutorService in the jdk supports using drainTo, nor > could I google one. > What I would like to do here is create just such a beast and wire it into (at > least) the write and read stages. (Other possible candidates for such an > optimization, such as the CommitLog and OutboundTCPConnection, are not > ExecutorService-based and will need to be one-offs.) > AbstractExecutorService may be useful. The implementations of > ICommitLogExecutorService may also be useful. (Despite the name these are not > actual ExecutorServices, although they share the most important properties of > one.) -- This message was sent by Atlassian JIRA (v6.2#6252)
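The bulk-dequeue idea in the issue description can be sketched roughly as below. This is only an illustration of a consumer using BlockingQueue.drainTo(collection, int); the class name BatchDrainSketch, the method drainAndRun, the 10 ms poll timeout and the batch size of 32 are all assumptions made for the example, and it is not the SharedExecutorPool code that was eventually committed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; not a class from the Cassandra code base.
final class BatchDrainSketch {

    // Runs queued tasks on the calling thread, taking up to maxBatch tasks per
    // dequeue. Returns the number of dequeue operations that were performed.
    static int drainAndRun(BlockingQueue<Runnable> queue, int maxBatch) throws InterruptedException {
        List<Runnable> batch = new ArrayList<>(maxBatch);
        int dequeues = 0;
        while (true) {
            // Wait briefly for a first task; stop once the queue stays empty.
            Runnable first = queue.poll(10, TimeUnit.MILLISECONDS);
            if (first == null)
                return dequeues;
            batch.add(first);
            // Take whatever else is already queued, without blocking.
            queue.drainTo(batch, maxBatch - 1);
            dequeues++;
            for (Runnable task : batch)
                task.run();
            batch.clear();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
        AtomicInteger executed = new AtomicInteger();
        for (int i = 0; i < 100; i++)
            queue.add(executed::incrementAndGet);
        int dequeues = drainAndRun(queue, 32);
        System.out.println(executed.get() + " tasks in " + dequeues + " dequeues");
    }
}
```

With the queue preloaded as in main, the consumer touches the queue once per batch rather than once per task, which is the contention saving the description is after. A real stage would block indefinitely rather than return when the queue is empty.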
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000830#comment-14000830 ] Benedict commented on CASSANDRA-4718: - I meant 250Kop/s. We're now pushing 6Kop/s. The numbers from 16th May are the latest posted, to my knowledge, and the ones we're discussing? You can make stress do a fixed number of ops per run, but not a fixed set of thread counts currently - its auto mode (that this is from) ramps up thread counts until it detects a plateau; in these tests it seems that sep reached a higher throughput rate earlier, and so when it normalised down again stress considered it to have plateaued earlier. As to #2, run1 when it is truncated at a lower tc is as fast as stock is at its peak. However, you're right that it is possible it would have tanked further - in this case this would be indicative of a bug rather than a fundamental flaw in its design, but it is almost certainly down to the natural tendency to dip slightly below peak throughput after the real plateau. I can patch stress briefly to force it to run all thread counts in the requested range, instead of stopping when it hits a plateau, but the auto-mode isn't really designed to be a canonical test. If we want accurate like-for-like comparisons we want to graph each thread count separately for its whole run, and ensure each run is long enough to spot the general behavioural pattern (i.e. at least a few minutes for IO bound work). I'd also ensure we interleaved the two branches to try to avoid any weird page caching / other utilisation interferences. 
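To make the difference between the two measurement modes concrete, here is a rough sketch of a thread-count ramp that either stops at the first apparent plateau or runs every requested thread count. It is not the actual cassandra-stress logic: the class ThreadCountRampSketch, the minGain threshold and the throughput curve in main are invented purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntToDoubleFunction;

// Hypothetical illustration only; not the cassandra-stress implementation.
final class ThreadCountRampSketch {

    // Returns the thread counts that were actually run. When stopAtPlateau is
    // true the ramp ends as soon as a run fails to beat the best op rate seen
    // so far by at least minGain (for example 0.01 for 1%).
    static List<Integer> ramp(int[] threadCounts, IntToDoubleFunction measureOpRate,
                              boolean stopAtPlateau, double minGain) {
        List<Integer> ran = new ArrayList<>();
        double best = 0;
        for (int tc : threadCounts) {
            double rate = measureOpRate.applyAsDouble(tc);
            ran.add(tc);
            if (stopAtPlateau && rate < best * (1 + minGain))
                break;
            best = Math.max(best, rate);
        }
        return ran;
    }

    public static void main(String[] args) {
        int[] counts = {4, 8, 16, 24, 36, 54, 81};
        // Made-up curve: reaches its peak early, dips once, then recovers.
        IntToDoubleFunction fake = tc -> tc == 24 ? 5500 : Math.min(6000, tc * 400);
        System.out.println("auto mode ran:       " + ramp(counts, fake, true, 0.01));
        System.out.println("exhaustive mode ran: " + ramp(counts, fake, false, 0.01));
    }
}
```

In this toy setup a branch that peaks early and then dips is cut off sooner in the plateau-stopping mode, so its later thread counts are never measured; the exhaustive mode runs the same set for every branch, which is what a like-for-like comparison needs.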
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000816#comment-14000816 ] Pavel Yaskevich commented on CASSANDRA-4718: I think you have plotted the numbers from May 16. I'm not sure what you mean by "often"; the problem with the numbers is that they are apparently cut off for both branches :( I think we have to redo the test. [~jasobrown], is there a way to guarantee that both branches are going to do the same number of runs with stress? I disagree with #2 because sep shows a sudden drop at the end, as do runs 1 and 3, so we don't really know what is going to happen with sep at high stress concurrency in those runs. bq. This work is clearly disk bound as the same hardware was pushing 250k/s with similar record sizes when exclusively in memory - we're seeing only 5% of that now. Unless possibly in-memory index scans are occupying all of the time (but according to Jason CPU utilisation was around 30% from a random non-scientific poll). I'm not sure if we can count 250K/s as a disk bound workload, which is only 3 buffer reads per second. > More-efficient ExecutorService for improved throughput > -- > > Key: CASSANDRA-4718 > URL: https://issues.apache.org/jira/browse/CASSANDRA-4718 > Project: Cassandra > Issue Type: Improvement >Reporter: Jonathan Ellis >Assignee: Benedict >Priority: Minor > Labels: performance > Fix For: 2.1.0 > > Attachments: 4718-v1.patch, PerThreadQueue.java, > austin_diskbound_read.svg, aws.svg, aws_read.svg, > backpressure-stress.out.txt, baq vs trunk.png, > belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, > jason_run1.svg, jason_run2.svg, jason_run3.svg, jason_write.svg, op costs of > various queues.ods, stress op rate with various queues.ods, > stress_2014May15.txt, stress_2014May16.txt, v1-stress.out > > > Currently all our execution stages dequeue tasks one at a time. 
This can > result in contention between producers and consumers (although we do our best > to minimize this by using LinkedBlockingQueue). > One approach to mitigating this would be to make consumer threads do more > work in "bulk" instead of just one task per dequeue. (Producer threads tend > to be single-task oriented by nature, so I don't see an equivalent > opportunity there.) > BlockingQueue has a drainTo(collection, int) method that would be perfect for > this. However, no ExecutorService in the jdk supports using drainTo, nor > could I google one. > What I would like to do here is create just such a beast and wire it into (at > least) the write and read stages. (Other possible candidates for such an > optimization, such as the CommitLog and OutboundTCPConnection, are not > ExecutorService-based and will need to be one-offs.) > AbstractExecutorService may be useful. The implementations of > ICommitLogExecutorService may also be useful. (Despite the name these are not > actual ExecutorServices, although they share the most important properties of > one.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000743#comment-14000743 ] Benedict commented on CASSANDRA-4718: - But like I said, the sep branch was actually faster more often than it was slower? And yes it routes intelligently, but to both replicas...? > More-efficient ExecutorService for improved throughput > -- > > Key: CASSANDRA-4718 > URL: https://issues.apache.org/jira/browse/CASSANDRA-4718 > Project: Cassandra > Issue Type: Improvement >Reporter: Jonathan Ellis >Assignee: Benedict >Priority: Minor > Labels: performance > Fix For: 2.1.0 > > Attachments: 4718-v1.patch, PerThreadQueue.java, > austin_diskbound_read.svg, aws.svg, aws_read.svg, > backpressure-stress.out.txt, baq vs trunk.png, > belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, > jason_write.svg, op costs of various queues.ods, stress op rate with various > queues.ods, stress_2014May15.txt, stress_2014May16.txt, v1-stress.out > > > Currently all our execution stages dequeue tasks one at a time. This can > result in contention between producers and consumers (although we do our best > to minimize this by using LinkedBlockingQueue). > One approach to mitigating this would be to make consumer threads do more > work in "bulk" instead of just one task per dequeue. (Producer threads tend > to be single-task oriented by nature, so I don't see an equivalent > opportunity there.) > BlockingQueue has a drainTo(collection, int) method that would be perfect for > this. However, no ExecutorService in the jdk supports using drainTo, nor > could I google one. > What I would like to do here is create just such a beast and wire it into (at > least) the write and read stages. (Other possible candidates for such an > optimization, such as the CommitLog and OutboundTCPConnection, are not > ExecutorService-based and will need to be one-offs.) > AbstractExecutorService may be useful. 
The implementations of > ICommitLogExecutorService may also be useful. (Despite the name these are not > actual ExecutorServices, although they share the most important properties of > one.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000739#comment-14000739 ] Pavel Yaskevich commented on CASSANDRA-4718: [~benedict] Isn't CQL trying to be smart about request routing like Thrift? Anyhow, read concurrency was default 128 (can you confirm, [~jasobrown]?) and we do remove all of the files, drop the cache, restart, and disable compaction to remove jitter as much as possible, so the only difference between the two runs is that one has the sep patch and the other doesn't. If there are slow reads, they should be happening in both runs, because keys are read uniformly and, although there is a large number of sstables in the system, for every read there is only one hit, which pretty much simulates the behavior of systems where data accumulates over time. Also, I can tell you that the setup we have for I/O is able to handle a mere 300GB without a problem. > More-efficient ExecutorService for improved throughput > -- > > Key: CASSANDRA-4718 > URL: https://issues.apache.org/jira/browse/CASSANDRA-4718 > Project: Cassandra > Issue Type: Improvement >Reporter: Jonathan Ellis >Assignee: Benedict >Priority: Minor > Labels: performance > Fix For: 2.1.0 > > Attachments: 4718-v1.patch, PerThreadQueue.java, > austin_diskbound_read.svg, aws.svg, aws_read.svg, > backpressure-stress.out.txt, baq vs trunk.png, > belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, > jason_write.svg, op costs of various queues.ods, stress op rate with various > queues.ods, stress_2014May15.txt, stress_2014May16.txt, v1-stress.out > > > Currently all our execution stages dequeue tasks one at a time. This can > result in contention between producers and consumers (although we do our best > to minimize this by using LinkedBlockingQueue). > One approach to mitigating this would be to make consumer threads do more > work in "bulk" instead of just one task per dequeue. 
(Producer threads tend > to be single-task oriented by nature, so I don't see an equivalent > opportunity there.) > BlockingQueue has a drainTo(collection, int) method that would be perfect for > this. However, no ExecutorService in the jdk supports using drainTo, nor > could I google one. > What I would like to do here is create just such a beast and wire it into (at > least) the write and read stages. (Other possible candidates for such an > optimization, such as the CommitLog and OutboundTCPConnection, are not > ExecutorService-based and will need to be one-offs.) > AbstractExecutorService may be useful. The implementations of > ICommitLogExecutorService may also be useful. (Despite the name these are not > actual ExecutorServices, although they share the most important properties of > one.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000727#comment-14000727 ] Benedict commented on CASSANDRA-4718: - Whilst on this topic, I've been thinking about disk/memory testing protocols in general, and it seems we really need to think through a good strategy for creating a consistent test bed that is representative. The test I have asked [~enigmacurry] to run is not going to be fair, as major compaction will cause all of the data points to be randomly distributed (by hash) across a single sstable, and given the records are small, selecting from a smallish random subset of this data will pretty much necessarily involve touching every page on disk with equal probability. However disabling compaction entirely is equally unfair, as we leave many sstables to check from bloom filter false positives (there are around 800+ sstables in Jason's test, for 300Gb of data, at a finger in air estimate), so most of the cache will be going to index files, with almost every data item lookup probably going to disk due to the reduced memory causing the same effect as the major compaction to kick in. It seems to me we need to 1) get the exponential distribution to select from last keys in preference to first keys (i.e. most recently written most commonly accessed); and 2) create a compaction strategy for testing purposes, that is designed to create a sort-of "in flight snapshot" of a real STCS workload, by compacting older data into exponentially larger files. These two together should give us much closer to a real live system that is using STCS, and with a consistent reproducible baseline behaviour. 
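Point 1) above, selecting recently written keys in preference to old ones, could look something like the following sketch. It assumes keys are numbered in write order from 0 to keyCount - 1 and uses plain inverse-CDF sampling of an exponential distribution; the class RecentKeySamplerSketch and the meanOffset parameter are hypothetical and are not part of cassandra-stress.

```java
import java.util.Random;

// Hypothetical illustration only; not part of cassandra-stress.
final class RecentKeySamplerSketch {

    // Picks a key index in [0, keyCount), where keyCount - 1 is the most
    // recently written key. The distance back from the newest key follows an
    // exponential distribution with the given mean, so recent keys dominate.
    static long sample(Random rng, long keyCount, double meanOffset) {
        double u = rng.nextDouble();                     // uniform in [0, 1)
        double offset = -meanOffset * Math.log(1.0 - u); // exponential, mean = meanOffset
        // Clamp the (very rare) long tail onto the oldest key.
        long back = (long) Math.min(offset, (double) (keyCount - 1));
        return keyCount - 1 - back;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        long keyCount = 1_000_000;
        int samples = 100_000;
        int inNewestTenth = 0;
        for (int i = 0; i < samples; i++) {
            if (sample(rng, keyCount, 10_000) >= keyCount - keyCount / 10)
                inNewestTenth++;
        }
        System.out.println(inNewestTenth + " of " + samples + " reads hit the newest 10% of keys");
    }
}
```

Paired with a compaction layout in which older data sits in exponentially larger files, an access pattern like this would concentrate reads on the newest, smallest sstables, which is the "in flight snapshot" behaviour the comment is aiming for.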
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000724#comment-14000724 ] Benedict commented on CASSANDRA-4718: - [~xedin] why are you only counting the primary replica data? Requests will hit both replicas by default? If you look at the results there is a reasonable amount of variability for both runs, so it's not clear that one is slower or faster - there are a number of points where 4718-sep is faster than 2.1, and vice versa, and given it is disk bound I am inclined to suggest this is not the patch making it perform worse. In fact, a majority of data points show higher throughput for 4718-sep, not for 2.1. Your first test, every thread count below 271 is faster; 271 seems to be a blip due to a small number of very slow reads affecting the very last measurement (there's a "race" in stress' auto mode where some measurements are still accepted after it's decided enough have been taken, as can be seen by the final stderr being above the acceptability point); 2.1 showed a similar effect at this tc, but smaller, so this seems likely to be random chance. The last test it is faster for all thread counts despite some weird max latencies. It's only the middle test where it appears to be marginally slower, and given this test performs effectively exactly the same amount of work as the first test, I'm not sure this demonstrates a great deal other than the variability. It's also worth asking what your max read concurrency is? As I'm surprised to see thread counts > 180 causing dramatic spikes in latency (both branches) when I'd expect them to be saturating the read stage well before then? 
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000562#comment-14000562 ] Pavel Yaskevich commented on CASSANDRA-4718: bq. Yeah, 200+MB/s sounds pretty disk bound to me. I vote that we move to the actual code review; we can certainly make further improvements later. I think what Jason meant is that when he started doing reads the system was pulling a lot of data into memory at first. The ~300GB he loaded was at RF=2 and we have 128GB of RAM apart from kernel memory on those machines, so essentially it's ~150GB for the primary replica, which is not much bigger than the total memory available for the page cache and pretty much accounts for the 10% you were talking about. As a summary, we ran two benchmarks: the first where the amount of data was bigger than the memory available for the page cache, the second where most of the data fits into memory. In both cases the sep branch performed worse than cassandra-2.1. > More-efficient ExecutorService for improved throughput > -- > > Key: CASSANDRA-4718 > URL: https://issues.apache.org/jira/browse/CASSANDRA-4718 > Project: Cassandra > Issue Type: Improvement >Reporter: Jonathan Ellis >Assignee: Benedict >Priority: Minor > Labels: performance > Fix For: 2.1.0 > > Attachments: 4718-v1.patch, PerThreadQueue.java, > austin_diskbound_read.svg, aws.svg, aws_read.svg, > backpressure-stress.out.txt, baq vs trunk.png, > belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, > jason_write.svg, op costs of various queues.ods, stress op rate with various > queues.ods, stress_2014May15.txt, stress_2014May16.txt, v1-stress.out > > > Currently all our execution stages dequeue tasks one at a time. This can > result in contention between producers and consumers (although we do our best > to minimize this by using LinkedBlockingQueue). > One approach to mitigating this would be to make consumer threads do more > work in "bulk" instead of just one task per dequeue. 
(Producer threads tend > to be single-task oriented by nature, so I don't see an equivalent > opportunity there.) > BlockingQueue has a drainTo(collection, int) method that would be perfect for > this. However, no ExecutorService in the jdk supports using drainTo, nor > could I google one. > What I would like to do here is create just such a beast and wire it into (at > least) the write and read stages. (Other possible candidates for such an > optimization, such as the CommitLog and OutboundTCPConnection, are not > ExecutorService-based and will need to be one-offs.) > AbstractExecutorService may be useful. The implementations of > ICommitLogExecutorService may also be useful. (Despite the name these are not > actual ExecutorServices, although they share the most important properties of > one.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998622#comment-13998622 ] Benedict commented on CASSANDRA-4718: - bq. (latest bdplab tests) Which latest bdplab tests? The longer bdplab test from before (not the latest tests) had some issues (unrelated to this ticket) so we didn't get any read results, but showed increased write throughput. The latest tests have all been short runs. I am actually very pleased we are _at all_ faster for bdplab on any workload, as the first versions of these patches did not seem to benefit older hardware/kernels (we don't have enough hardware configurations to say which was the deciding factor), and actually incurred a slight penalty. The fact that the gap is very narrow for bdplab is not really important, nor are the thrift numbers. In both of those instances I am interested only in that we _do not perform any worse_; performing slightly better even here is just a bonus. > More-efficient ExecutorService for improved throughput > -- > > Key: CASSANDRA-4718 > URL: https://issues.apache.org/jira/browse/CASSANDRA-4718 > Project: Cassandra > Issue Type: Improvement >Reporter: Jonathan Ellis >Assignee: Benedict >Priority: Minor > Labels: performance > Fix For: 2.1.0 > > Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, > aws_read.svg, backpressure-stress.out.txt, baq vs trunk.png, > belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, > jason_write.svg, op costs of various queues.ods, stress op rate with various > queues.ods, v1-stress.out > > > Currently all our execution stages dequeue tasks one at a time. This can > result in contention between producers and consumers (although we do our best > to minimize this by using LinkedBlockingQueue). > One approach to mitigating this would be to make consumer threads do more > work in "bulk" instead of just one task per dequeue. 
(Producer threads tend > to be single-task oriented by nature, so I don't see an equivalent > opportunity there.) > BlockingQueue has a drainTo(collection, int) method that would be perfect for > this. However, no ExecutorService in the jdk supports using drainTo, nor > could I google one. > What I would like to do here is create just such a beast and wire it into (at > least) the write and read stages. (Other possible candidates for such an > optimization, such as the CommitLog and OutboundTCPConnection, are not > ExecutorService-based and will need to be one-offs.) > AbstractExecutorService may be useful. The implementations of > ICommitLogExecutorService may also be useful. (Despite the name these are not > actual ExecutorServices, although they share the most important properties of > one.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993683#comment-13993683 ] Benedict commented on CASSANDRA-4718: - Hmm. Frustrating - tweaks that made lse-batchnetty faster on my box (by 20-30%) make it slower on bdplab. Would be good to get some other numbers from different rigs involved to see if we can pin down the sweet spot, and maybe figure out what the cause of the discrepancy is. > More-efficient ExecutorService for improved throughput > -- > > Key: CASSANDRA-4718 > URL: https://issues.apache.org/jira/browse/CASSANDRA-4718 > Project: Cassandra > Issue Type: Improvement >Reporter: Jonathan Ellis >Assignee: Jason Brown >Priority: Minor > Labels: performance > Fix For: 2.1.0 > > Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, > backpressure-stress.out.txt, baq vs trunk.png, > belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, > jason_write.svg, op costs of various queues.ods, stress op rate with various > queues.ods, v1-stress.out > > > Currently all our execution stages dequeue tasks one at a time. This can > result in contention between producers and consumers (although we do our best > to minimize this by using LinkedBlockingQueue). > One approach to mitigating this would be to make consumer threads do more > work in "bulk" instead of just one task per dequeue. (Producer threads tend > to be single-task oriented by nature, so I don't see an equivalent > opportunity there.) > BlockingQueue has a drainTo(collection, int) method that would be perfect for > this. However, no ExecutorService in the jdk supports using drainTo, nor > could I google one. > What I would like to do here is create just such a beast and wire it into (at > least) the write and read stages. (Other possible candidates for such an > optimization, such as the CommitLog and OutboundTCPConnection, are not > ExecutorService-based and will need to be one-offs.) 
> AbstractExecutorService may be useful. The implementations of > ICommitLogExecutorService may also be useful. (Despite the name these are not > actual ExecutorServices, although they share the most important properties of > one.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998772#comment-13998772 ] Benedict commented on CASSANDRA-4718: - bq. And 4718-sep is essentially 2.1-batchnetty + the patch for this, right? Correct > More-efficient ExecutorService for improved throughput > -- > > Key: CASSANDRA-4718 > URL: https://issues.apache.org/jira/browse/CASSANDRA-4718 > Project: Cassandra > Issue Type: Improvement >Reporter: Jonathan Ellis >Assignee: Benedict >Priority: Minor > Labels: performance > Fix For: 2.1.0 > > Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, > aws_read.svg, backpressure-stress.out.txt, baq vs trunk.png, > belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, > jason_write.svg, op costs of various queues.ods, stress op rate with various > queues.ods, stress_2014May15.txt, v1-stress.out > > > Currently all our execution stages dequeue tasks one at a time. This can > result in contention between producers and consumers (although we do our best > to minimize this by using LinkedBlockingQueue). > One approach to mitigating this would be to make consumer threads do more > work in "bulk" instead of just one task per dequeue. (Producer threads tend > to be single-task oriented by nature, so I don't see an equivalent > opportunity there.) > BlockingQueue has a drainTo(collection, int) method that would be perfect for > this. However, no ExecutorService in the jdk supports using drainTo, nor > could I google one. > What I would like to do here is create just such a beast and wire it into (at > least) the write and read stages. (Other possible candidates for such an > optimization, such as the CommitLog and OutboundTCPConnection, are not > ExecutorService-based and will need to be one-offs.) > AbstractExecutorService may be useful. The implementations of > ICommitLogExecutorService may also be useful. 
(Despite the name these are not > actual ExecutorServices, although they share the most important properties of > one.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000295#comment-14000295 ]

Jonathan Ellis commented on CASSANDRA-4718:

Yeah, 200+MB/s sounds pretty disk bound to me. I vote that we move to the actual code review; we can certainly make further improvements later.

> More-efficient ExecutorService for improved throughput
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Jonathan Ellis
> Assignee: Benedict
> Priority: Minor
> Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg,
> aws_read.svg, backpressure-stress.out.txt, baq vs trunk.png,
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg,
> jason_write.svg, op costs of various queues.ods, stress op rate with various
> queues.ods, stress_2014May15.txt, stress_2014May16.txt, v1-stress.out
>
> Currently all our execution stages dequeue tasks one at a time. This can
> result in contention between producers and consumers (although we do our best
> to minimize this by using LinkedBlockingQueue).
> One approach to mitigating this would be to make consumer threads do more
> work in "bulk" instead of just one task per dequeue. (Producer threads tend
> to be single-task oriented by nature, so I don't see an equivalent
> opportunity there.)
> BlockingQueue has a drainTo(collection, int) method that would be perfect for
> this. However, no ExecutorService in the jdk supports using drainTo, nor
> could I google one.
> What I would like to do here is create just such a beast and wire it into (at
> least) the write and read stages. (Other possible candidates for such an
> optimization, such as the CommitLog and OutboundTCPConnection, are not
> ExecutorService-based and will need to be one-offs.)
> AbstractExecutorService may be useful. The implementations of
> ICommitLogExecutorService may also be useful. (Despite the name these are not
> actual ExecutorServices, although they share the most important properties of
> one.)

--
This message was sent by Atlassian JIRA (v6.2#6252)
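The bulk-dequeue idea in the issue description can be illustrated with a minimal sketch. This is not the executor that was eventually committed for this ticket, and it is not Cassandra code: the class `DrainingWorkerPool`, its `runDemo` helper, the batch size of 16, and the 50 ms poll timeout are all hypothetical choices made for illustration. The only API it relies on from the description is `BlockingQueue.drainTo(Collection, int)`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the "bulk dequeue" idea; not Cassandra code.
final class DrainingWorkerPool {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private final int maxBatch;
    private volatile boolean running = true;

    DrainingWorkerPool(int threads, int maxBatch) {
        if (threads < 1 || maxBatch < 1)
            throw new IllegalArgumentException("threads and maxBatch must be >= 1");
        this.maxBatch = maxBatch;
        for (int i = 0; i < threads; i++) {
            Thread worker = new Thread(this::runWorker, "draining-worker-" + i);
            worker.setDaemon(true);
            worker.start();
        }
    }

    void execute(Runnable task) {
        queue.add(task);
    }

    // Workers finish whatever is still queued, then exit.
    void shutdown() {
        running = false;
    }

    private void runWorker() {
        List<Runnable> batch = new ArrayList<>(maxBatch);
        while (running || !queue.isEmpty()) {
            try {
                // Wait for one task, then take up to maxBatch - 1 more in a single call,
                // so the queue is touched once per batch instead of once per task.
                Runnable first = queue.poll(50, TimeUnit.MILLISECONDS);
                if (first == null)
                    continue;
                batch.add(first);
                queue.drainTo(batch, maxBatch - 1);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            for (Runnable task : batch) {
                try {
                    task.run();
                } catch (RuntimeException e) {
                    // A real executor would report this; the sketch just keeps the worker alive.
                }
            }
            batch.clear();
        }
    }

    // Submits `tasks` trivial tasks and returns how many completed within 10 seconds.
    static int runDemo(int tasks) {
        DrainingWorkerPool pool = new DrainingWorkerPool(4, 16);
        AtomicInteger done = new AtomicInteger();
        CountDownLatch latch = new CountDownLatch(tasks);
        for (int i = 0; i < tasks; i++)
            pool.execute(() -> { done.incrementAndGet(); latch.countDown(); });
        try {
            latch.await(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        pool.shutdown();
        return done.get();
    }
}
```

One trade-off worth noting: a worker that drains a batch holds those tasks privately, so other idle workers cannot pick them up. Whether the reduced queue contention outweighs that depends on the workload, which is what the benchmarks in this thread are trying to establish.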
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998639#comment-13998639 ]

Benedict commented on CASSANDRA-4718:

bq. Since we're talking about benchmarks, it shouldn't be too hard to remove batching from the equation and check what value remains.

The batching has already been committed to tip, so the 2.1-batchnetty is essentially this.

bq. Can we bench with compression maybe?

We probably should bench to get a comparison. It is quite likely the benefit will be lost, given we decompress 64K chunks at a time and currently have no uncompressed page cache.
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998613#comment-13998613 ]

Sylvain Lebresne commented on CASSANDRA-4718:

bq. currently the only valuable change I see here is batching for Netty

I'll admit I'm getting lost in all the benchmark graphs and what they include. We've now committed the batching separately with CASSANDRA-5663. Can someone sum up the graphs for "current tip of 2.1 (with CASSANDRA-5663)" versus "the same + Benedict's last patch"? Since we're talking about benchmarks, it shouldn't be too hard to remove batching from the equation and check what value remains.

bq. Not mentioning that with compression every read results in syscall which forces thread to get parked anyway

Can we bench with compression maybe? Both to see whether any benefit is indeed lost when compression is on, and whether compression generally outperforms no compression.
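Benedict's remark earlier in this thread, that reads decompress 64K chunks at a time with no uncompressed page cache, can be made concrete with a little arithmetic. The sketch below is only an illustration: the class `ChunkReadCost`, its method names, and the 200-byte read used in the note afterwards are assumptions, not figures from the benchmarks discussed here.

```java
// Hypothetical helper for illustration; not part of Cassandra.
final class ChunkReadCost {
    // The "64K chunks" mentioned in the thread, taken here as 64 * 1024 bytes.
    static final long CHUNK_BYTES = 64 * 1024;

    // Number of chunks that must be decompressed to serve `length` bytes
    // starting at uncompressed `offset`.
    static long chunksTouched(long offset, long length) {
        if (offset < 0 || length < 1)
            throw new IllegalArgumentException("offset must be >= 0 and length >= 1");
        long firstChunk = offset / CHUNK_BYTES;
        long lastChunk = (offset + length - 1) / CHUNK_BYTES;
        return lastChunk - firstChunk + 1;
    }

    // Ratio of bytes decompressed to bytes actually requested.
    static double amplification(long offset, long length) {
        return (double) (chunksTouched(offset, length) * CHUNK_BYTES) / length;
    }
}
```

Under these assumptions, a 200-byte read that falls inside one chunk decompresses 65536 bytes, an amplification of roughly 328x, and a read that straddles a chunk boundary doubles that. This is one way to see why per-request CPU cost with compression could swamp any gain from a cheaper executor.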
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998750#comment-13998750 ]

Sylvain Lebresne commented on CASSANDRA-4718:

bq. The batching has already been committed to tip, so the 2.1-batchnetty is essentially this.

And 4718-sep is essentially 2.1-batchnetty + the patch for this, right?
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998146#comment-13998146 ]

Jonathan Ellis commented on CASSANDRA-4718:

Granted that a new ExecutorService won't help i/o bound workloads, but I knew that when I created the ticket, and "must be significantly better for all workloads" is an unrealistically high bar for optimization work.

This gives us a pretty huge benefit on at least some workloads ([1|http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.4718.bdplab.may12.threads-810-cql3_native_prepared.json&metric=op_rate&operation=4_read&smoothing=1&xmin=0&xmax=141.13&ymin=0&ymax=238843], [2|http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.4718.ec2.may12.threads-810-cql3_native_prepared.json&metric=op_rate&operation=4_read&smoothing=1&xmin=0&xmax=134.31&ymin=0&ymax=340354.3]) and a smaller benefit on others, which I'm quite happy with. Unless the longer benchmarks Ryan is running show dramatically different results, I'm +1.

I also note that the work here is almost entirely self-contained, with the major exception being some new code in Message.Dispatcher. So while it's not as simple as dropping in LTQ or BAQ or FJP, the results are absolutely good enough to be worth a new Executor implementation.
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993590#comment-13993590 ]

Benedict commented on CASSANDRA-4718:

FTR, there are new branches:

[4718-fjp-batchnetty|https://github.com/belliottsmith/cassandra/tree/4718-fjp-batchnetty]
[4718-lse-batchnetty|https://github.com/belliottsmith/cassandra/tree/4718-lse-batchnetty]
[cassandra-2.1-batchnetty|https://github.com/belliottsmith/cassandra/tree/cassandra-2.1-batchnetty]

These are the three real contenders, and I've included the netty batching for all of them so we can get a like-for-like comparison going.
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997754#comment-13997754 ]

Pavel Yaskevich commented on CASSANDRA-4718:

bq. What about writes, that's a pretty big scenario this helps improve

Ryan's latest numbers are from a write workload.

bq. Well, except that we expect in general for recent data to be accessed most often, or data to be accessed according to a zipf distribution, and in both of these cases caching helps to keep a significant portion of the data we're accessing in memory. Also, more users are getting incredibly performant SSDs that can respond to queries in time horizons measured in microseconds, and as this becomes the norm the distinction also becomes less important.

I always thought that Zipf's law is for scientific data, is it not? SSDs can be performant, but you can't get their full speed yet; the closest you can get currently is Linux 3.13+ with multiqueue support enabled.

bq. Right, but we've always targetted "total data larger than memory, hot data more or less fits." So I absolutely think this ticket is relevant for a lot of use cases.

Exactly, "hot data more or less fits", so the problem is that once you get into page reclaim and disk reads (even on SSDs), the improvements made here no longer help. I think that will be clearly visible in the benchmarks to come.
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997875#comment-13997875 ]

Benedict commented on CASSANDRA-4718:

bq. I always thought that Zipf's law is for the scientific data, is it not?

As far as I'm aware a zipf-like distribution is considered a good approximation for many data access patterns. A quick google yields an article showing that much web traffic follows a zipf distribution, but some follows a slightly different exponential distribution: http://www.cs.gmu.edu/~sqchen/publications/sigmetrics07-poster.pdf
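To make the caching argument around zipf-like access concrete, here is a small sketch that computes what share of requests would land on the hottest keys if popularity followed a Zipf distribution. Everything in it is an assumption for illustration: the class `ZipfHotSet`, the exponent of 1.0, and the key counts are hypothetical and are not taken from the benchmarks or the linked poster.

```java
// Hypothetical illustration of a Zipf-distributed access pattern; not Cassandra code.
final class ZipfHotSet {
    // Share of requests that hit the `hotKeys` most popular keys out of `totalKeys`,
    // when the key of popularity rank r is requested with weight 1 / r^exponent.
    static double hotShare(int totalKeys, int hotKeys, double exponent) {
        if (totalKeys < 1 || hotKeys < 0 || hotKeys > totalKeys)
            throw new IllegalArgumentException("need 0 <= hotKeys <= totalKeys and totalKeys >= 1");
        double hot = 0.0;
        double all = 0.0;
        for (int rank = 1; rank <= totalKeys; rank++) {
            double weight = 1.0 / Math.pow(rank, exponent);
            all += weight;
            if (rank <= hotKeys)
                hot += weight;
        }
        return hot / all;
    }
}
```

With an assumed exponent of 1.0 and one million keys, `hotShare(1_000_000, 100_000, 1.0)` comes out at about 0.84, so caching the hottest tenth of the keys would serve most requests from memory. With an exponent of 0 the distribution is uniform and the same cache serves only a tenth of requests, which is the case where a disk-bound workload gains little from a faster executor.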
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997596#comment-13997596 ]

Jonathan Ellis commented on CASSANDRA-4718:

bq. most of the use cases is exactly that - data set which exceeds available memory

Right, but we've always targeted "total data larger than memory, hot data more or less fits." So I absolutely think this ticket is relevant for a lot of use cases.
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997511#comment-13997511 ]

T Jake Luciani commented on CASSANDRA-4718:

bq. so I am not really sure if it worth it to commit all this code without any perf improvement for most of the usage scenarios.

What about writes, that's a pretty big scenario this helps improve :)
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993670#comment-13993670 ]

Ryan McGuire commented on CASSANDRA-4718:

Benchmarks from bdplab for the new branches. 3 nodes, separate stress host.

* [810 threads|http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.4718.trial2.3node.threads-810.log]
* [270 threads|http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.4718.trial2.3node.threads-270.log]

(updating here as the tests finish...)
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996589#comment-13996589 ]

Jason Brown commented on CASSANDRA-4718:

Also, it looks like the period of time that you are running the tests for is very short (about 1 or 2 minutes). Can you let it run for *at least* 30 minutes or so (if not an hour or more), so we can see the burn in? Everything can look rosy in a 90 second test, but fall apart spectacularly under (closer to) real world conditions.
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998310#comment-13998310 ]

Jason Brown commented on CASSANDRA-4718:

[~enigmacurry] How many threads are you running thrift with? If you aren't setting it explicitly, (iirc) it gets set to the number of processors, which is far below what anything sane should run with. For our machines, I've been using 512 for writes, and 128 for reads (mirroring what we run with in prod, which is the same hardware as the machines I'm testing on, more or less). I think this may explain why we do not see the vast discrepancy between thrift and native protocol ops/second - native protocol defaults to 128 threads.

Also, are you using sync or hsha for thrift?
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997999#comment-13997999 ]

Benedict commented on CASSANDRA-4718:

Thanks [~enigmacurry]! Those graphs all look pretty good to me. I think it's time to run some of the longer tests to see that performance is still good for other workloads. Let's drop thrift from the equation now. I'd suggest something like:

write n=6 -key populate=1..6
force major compaction
for each thread count/branch: read n=1 -key dist=extr(1..6,2)

and warm up with one (any) read test run before the rest, so that they are all playing from a roughly level page cache point.

This should create a dataset in the region of 110GB, but around 75% of requests will be to ~40GB of it, which should be in the region of the amount of page cache available to the EC2 systems after bloom filters etc. are accounted for.

NB: if you want to play with different distributions, cassandra-stress print lets you see what a spec would yield.

> More-efficient ExecutorService for improved throughput
> ------------------------------------------------------
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Jonathan Ellis
> Assignee: Benedict
> Priority: Minor
> Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg,
> aws_read.svg, backpressure-stress.out.txt, baq vs trunk.png,
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg,
> jason_write.svg, op costs of various queues.ods, stress op rate with various
> queues.ods, v1-stress.out
>
> Currently all our execution stages dequeue tasks one at a time. This can
> result in contention between producers and consumers (although we do our best
> to minimize this by using LinkedBlockingQueue).
> One approach to mitigating this would be to make consumer threads do more
> work in "bulk" instead of just one task per dequeue. (Producer threads tend
> to be single-task oriented by nature, so I don't see an equivalent
> opportunity there.)
> BlockingQueue has a drainTo(collection, int) method that would be perfect for
> this. However, no ExecutorService in the jdk supports using drainTo, nor
> could I google one.
> What I would like to do here is create just such a beast and wire it into (at
> least) the write and read stages. (Other possible candidates for such an
> optimization, such as the CommitLog and OutboundTCPConnection, are not
> ExecutorService-based and will need to be one-offs.)
> AbstractExecutorService may be useful. The implementations of
> ICommitLogExecutorService may also be useful. (Despite the name these are not
> actual ExecutorServices, although they share the most important properties of
> one.)

--
This message was sent by Atlassian JIRA
(v6.2#6252)
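For readers unfamiliar with the drainTo idea in the quoted issue description, here is a minimal sketch of a consumer that dequeues in bulk. It is only an illustration of the original proposal, not code from any attached patch and not the approach the ticket eventually settled on; the class name BatchDrainingConsumer, the method runOneBatch and the maxBatch parameter are hypothetical. The only library calls relied on are BlockingQueue.take() and BlockingQueue.drainTo(Collection, int).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

/** Hypothetical illustration of bulk dequeueing; not taken from any patch on this ticket. */
final class BatchDrainingConsumer {

    /**
     * Blocks until one task is available, then drains up to maxBatch - 1
     * further tasks without blocking and runs the whole batch.
     * Returns the number of tasks run, or 0 if interrupted while waiting.
     */
    static int runOneBatch(BlockingQueue<Runnable> queue, int maxBatch) {
        List<Runnable> batch = new ArrayList<>(maxBatch);
        try {
            batch.add(queue.take());            // one blocking dequeue per batch
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return 0;
        }
        queue.drainTo(batch, maxBatch - 1);     // take whatever else is already queued
        for (Runnable task : batch) {
            task.run();
        }
        return batch.size();
    }
}
```

A worker thread would call runOneBatch in a loop; the point is that the consumer touches the queue once per batch rather than once per task.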
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996074#comment-13996074 ]

Jason Brown commented on CASSANDRA-4718:

[~enigmacurry] Also, is it possible for you to fill up the disks with more sstables than available memory? I think we should check how going to disk plays into the performance mix, rather than just reading from page cache for the entire read test. This should introduce another modality into the way the algorithm behaves, one that is probably more realistic to the real world (a mix of page cache hits and disk seeks).

[~benedict] This rewrite is quite extensive wrt prior branches. As this code is quite complex with many new additions, I will need a good chunk of time tomorrow to review this.
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997787#comment-13997787 ]

Jonathan Ellis commented on CASSANDRA-4718:

bq. Exactly, "hot data more or less fits" so the problem is that once you get into page reclaim and disk reads (even SSDs), improvements made here are no longer doing anything helpful

I don't follow you at all. If 90% of reads are already in-cache, this is going to help even if 10% are going to disk.
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998266#comment-13998266 ]

Pavel Yaskevich commented on CASSANDRA-4718:

What I'm saying is that so far it gives a benefit only for a really small working set, e.g. a stress run that lasts 1 minute. For the longer-running tests there is little to no difference (latest bdplab tests), so we are running longer tests right now in parallel with Ryan. Currently the only valuable change I see here is batching for Netty.
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997406#comment-13997406 ]

Benedict commented on CASSANDRA-4718:

bq. Benedict WRT the 2.1-batchnetty comparison, what did the latencies look like?

Ryan's graphs are a much better way to view latencies; on the whole they seem universally as good or better (generally much better).

bq. The more latency is introduced into the tasks, the less effect spinning has; in other words, the need to spin is eliminated

Yes. Even more important is that the unpark() cost, when amortized over a long-running operation, becomes insignificant whether or not it is incurred; and producers cannot make forward progress anyway because the native-transport queue is full, so avoiding paying the unpark cost on the network thread really doesn't achieve anything for us. I fully expect there to be very little effect on workloads as the dataset exceeds memory and the row size climbs.
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997809#comment-13997809 ]

Ryan McGuire commented on CASSANDRA-4718:

@Benedict more "short" tests. Updated here as they complete:

EC2 c3.8xlarge, cql native:
* [810 threads|http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.4718.ec2.may12.threads-810-cql3_native_prepared.json]

bdplab, cql native:
* [810 threads|http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.4718.bdplab.may12.threads-810-cql3_native_prepared.json]
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997731#comment-13997731 ]

Benedict commented on CASSANDRA-4718:

A brief outline of the approach taken by the executor service I've submitted:

It's premised on the idea that unpark() is a relatively expensive operation, and can block progress on the thread calling it (often it results in transfer of the physical execution to the signalled thread). So we want to avoid performing the operation as much as possible, so long as we do not incur any other penalties as a result of doing so.

The approach I've taken to avoiding calling unpark() essentially amounts to trying to ensure the correct number of threads are running for servicing the current workload, without either delay of service or any waiting on any of the workers. We achieve this by essentially letting workers schedule themselves, except when we cannot guarantee they will do so on producing work for the queue (in which rare instance we spin up a worker directly) or the queue is full, in which case it costs us little to contribute to firing up workers.

This can be roughly described as:
# If all workers are currently either sleeping _indefinitely_ or occupied with work, we wake one (or start a new) worker
# Before starting any given task, a worker checks if any more work is available on the queue it's processing and tries to hand it off to another unoccupied worker (preferring those that are scheduled to wake up of their own accord in the near future, to avoid signalling them, but waking/starting one if necessary)
# Once we finish a task, we either:
#* take another task from the queue we just processed, if any is available, and loop back to (2);
#* reassign ourselves to another executor that has work and go to (2);
#* finally, if that fails, we enter a "yield"-spin loop
# Each loop we spin for, we sleep a random interval scaled by the number of threads in this loop, so that the rate of wakeup on average is constant regardless of the number of spinning threads. When we wake up we:
#* Check if we should deschedule ourselves (based on the total time spent sleeping by all threads recently: if it exceeds the real time elapsed, we put a worker to sleep indefinitely, preferably ourselves)
#* Try to assign ourselves an executor with work outstanding, and go to (2)

The actual assignment and queueing of work is itself a little interesting as well: to minimise signalling we have a ConcurrentLinkedQueue which is, by definition, unbounded. We then have a separate synchronisation state which maintains an atomic count of work permits (threads working the pool) and task permits (items on the queue). When we start a worker as a _producer_ we actually don't touch this queue at all; we just start a worker in a spinning state and let it assign itself some work. We do this to avoid signalling any other producers that may be blocked on the queue being full. When as a worker we take work from the queue to either assign to ourselves _or another worker_ we always atomically take both a worker permit and a task permit (or only the latter if we already own a worker permit). This allows us to ensure we only wake up threads when they definitely have work to do.
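To make the permit accounting in the outline above more concrete, here is a rough sketch of one way an atomic count of worker permits and task permits could be kept in a single word, so that both can be taken in one compare-and-set. This is an assumption-laden illustration, not the submitted implementation: the class name PermitState, every method name, and the choice to pack the two counts into the high and low halves of a long are all hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: worker permits in the high 32 bits, task permits in the low 32 bits. */
final class PermitState {
    private final AtomicLong state;

    PermitState(int workerPermits) {
        state = new AtomicLong(pack(workerPermits, 0));
    }

    private static long pack(int workers, int tasks) {
        return ((long) workers << 32) | (tasks & 0xFFFFFFFFL);
    }

    private static int workers(long s) { return (int) (s >>> 32); }
    private static int tasks(long s)   { return (int) s; }

    int workerPermits() { return workers(state.get()); }
    int taskPermits()   { return tasks(state.get()); }

    /** A producer adds one task permit after enqueueing an item. */
    void addTask() {
        long cur;
        do {
            cur = state.get();
        } while (!state.compareAndSet(cur, pack(workers(cur), tasks(cur) + 1)));
    }

    /** Takes a worker permit and a task permit together, or neither. */
    boolean takeWorkerAndTaskPermit() {
        while (true) {
            long cur = state.get();
            int w = workers(cur), t = tasks(cur);
            if (w == 0 || t == 0) return false;   // never wake a thread without work for it
            if (state.compareAndSet(cur, pack(w - 1, t - 1))) return true;
        }
    }

    /** For a worker that already holds a worker permit and wants its next task. */
    boolean takeTaskPermit() {
        while (true) {
            long cur = state.get();
            int t = tasks(cur);
            if (t == 0) return false;
            if (state.compareAndSet(cur, pack(workers(cur), t - 1))) return true;
        }
    }

    /** A worker that found no more work gives its permit back. */
    void returnWorkerPermit() {
        long cur;
        do {
            cur = state.get();
        } while (!state.compareAndSet(cur, pack(workers(cur) + 1, tasks(cur))));
    }
}
```

Because both counts change in the same compare-and-set, a thread that succeeds in takeWorkerAndTaskPermit is guaranteed there is a queued item for it, which is the property the comment relies on to avoid waking threads that would find nothing to do.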
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997515#comment-13997515 ]

Benedict commented on CASSANDRA-4718:

bq. Also, most use cases are exactly that: a data set which exceeds available memory

Well, except that we expect in general for recent data to be accessed most often, or data to be accessed according to a Zipf distribution, and in both of these cases caching helps to keep a significant portion of the data we're accessing in memory. Also, more users are getting incredibly performant SSDs that can respond to queries in time horizons measured in microseconds, and as this becomes the norm the distinction also becomes less important.
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997497#comment-13997497 ]

Lior Golan commented on CASSANDRA-4718:

But there are use cases where the full working set is memory resident or close to that. Improving performance in these use cases would reduce the need for caching in front of Cassandra.
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997529#comment-13997529 ]

Benedict commented on CASSANDRA-4718:

I've force-pushed an updated branch to the repository which is simpler and has some nicer properties (though it is ~1% slower at lower thread counts). I'm pretty happy with its current state, although I still need to create some thorough executor stress tests.

In the latest version workers self-coordinate descheduling through a simpler scheme than the separate descheduler. The new scheme also permits thread over-provisioning to be corrected incredibly promptly (i.e. almost instantly), eliminating my one concern about this approach (that it could be slightly resource-unfriendly in cases of variable workloads when sharing the underlying platform with another service).

The latest version also delivers more consistency in its throughput rate by using a ConcurrentSkipListMap to order the spinning threads by expected schedule time, and by force-scheduling a new worker if a producer encounters a full task queue when not all workers are yet scheduled.
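As a rough illustration of ordering spinning threads by expected schedule time with a ConcurrentSkipListMap, the sketch below keys each spinning worker by the time it expects to wake, so whoever needs to hand off work can pick the worker that is due soonest. The class SpinningWorkers, its methods, and the use of a String id in place of a real worker object are hypothetical; how the branch actually uses the map is not shown in this thread.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

/** Hypothetical sketch: spinning workers ordered by expected wake-up time. */
final class SpinningWorkers {
    // Key: expected wake-up time in nanoseconds. Value: a stand-in worker id.
    private final ConcurrentSkipListMap<Long, String> byWakeTime = new ConcurrentSkipListMap<>();

    /** Registers a spinning worker; on a key collision the time is nudged forward by 1ns. */
    void register(long wakeAtNanos, String workerId) {
        long key = wakeAtNanos;
        while (byWakeTime.putIfAbsent(key, workerId) != null) {
            key++;
        }
    }

    /** Removes and returns the worker due to wake soonest, or null if none is spinning. */
    String pollSoonest() {
        Map.Entry<Long, String> entry = byWakeTime.pollFirstEntry();
        return entry == null ? null : entry.getValue();
    }
}
```

The sorted, concurrent map gives cheap access to the earliest entry without a global lock, which fits the stated goal of preferring workers that will wake of their own accord shortly.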
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997436#comment-13997436 ]

Pavel Yaskevich commented on CASSANDRA-4718:

bq. Yes. Even more important is that the unpark() cost, when amortized over a long-running operation, becomes insignificant whether or not it is incurred; and producers cannot make forward progress anyway because the native-transport queue is full, so avoiding paying the unpark cost on the network thread really doesn't achieve anything for us. I fully expect there to be very little effect on workloads as the dataset exceeds memory and the row size climbs.

Exactly my point, which starts right after the bq you have taken. Also, most use cases are exactly that: a data set which exceeds available memory, so I am not really sure it is worth committing all this code without any perf improvement for most usage scenarios.
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997316#comment-13997316 ]

Jason Brown commented on CASSANDRA-4718:

[~benedict] WRT the 2.1-batchnetty comparison, what did the latencies look like?
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997128#comment-13997128 ]

Pavel Yaskevich commented on CASSANDRA-4718:

The graphs Ryan posted in his previous comment (especially the bdplab tests) look pretty close to what I would expect without running the tests, just from looking at the 4718-sep code. The more latency is introduced into the tasks, the less effect spinning has. In other words, the need to spin is eliminated: the longer execution takes, the more likely it is that the next task is already waiting in the queue, so thread parking/unparking is no longer a dominant factor in latencies. So I would be very interested to see even longer-running tests (especially reads), because that is much closer to real behavior, which is dominated by network/disk latencies.

> More-efficient ExecutorService for improved throughput
> ------------------------------------------------------
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Jonathan Ellis
> Assignee: Benedict
> Priority: Minor
> Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, aws_read.svg, backpressure-stress.out.txt, baq vs trunk.png, belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, jason_write.svg, op costs of various queues.ods, stress op rate with various queues.ods, v1-stress.out
>
> Currently all our execution stages dequeue tasks one at a time. This can result in contention between producers and consumers (although we do our best to minimize this by using LinkedBlockingQueue).
> One approach to mitigating this would be to make consumer threads do more work in "bulk" instead of just one task per dequeue. (Producer threads tend to be single-task oriented by nature, so I don't see an equivalent opportunity there.)
> BlockingQueue has a drainTo(collection, int) method that would be perfect for this. However, no ExecutorService in the jdk supports using drainTo, nor could I google one.
> What I would like to do here is create just such a beast and wire it into (at least) the write and read stages. (Other possible candidates for such an optimization, such as the CommitLog and OutboundTCPConnection, are not ExecutorService-based and will need to be one-offs.)
> AbstractExecutorService may be useful. The implementations of ICommitLogExecutorService may also be useful. (Despite the name these are not actual ExecutorServices, although they share the most important properties of one.)

--
This message was sent by Atlassian JIRA (v6.2#6252)
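Editor's illustration: the issue description above proposes having consumer threads dequeue in bulk via BlockingQueue.drainTo(collection, int). The sketch below shows roughly what such a consumer loop might look like. It is not code from any patch attached to this ticket; the class name DrainingWorker and the maxBatch parameter are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

// Hypothetical illustration only; not taken from any CASSANDRA-4718 patch.
final class DrainingWorker implements Runnable {
    private final BlockingQueue<Runnable> queue;
    private final int maxBatch; // assumed tuning knob: max tasks handled per wake-up

    DrainingWorker(BlockingQueue<Runnable> queue, int maxBatch) {
        this.queue = queue;
        this.maxBatch = maxBatch;
    }

    /** Non-blocking step: remove up to limit queued tasks in one call, run them, return the count. */
    static int drainAndRun(BlockingQueue<Runnable> queue, int limit) {
        List<Runnable> batch = new ArrayList<>();
        int drained = queue.drainTo(batch, limit); // one queue operation for the whole batch
        for (Runnable task : batch) {
            task.run();
        }
        return drained;
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                queue.take().run();               // block while the queue is empty
                drainAndRun(queue, maxBatch - 1); // then batch whatever is already queued
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();   // preserve the interrupt and exit
        }
    }
}
```

A worker would be started with something like new Thread(new DrainingWorker(queue, 32)).start(). Whether batching of this kind actually helps depends on the workload, and the later comments in this thread suggest that signalling (park/unpark) cost mattered more than the dequeue itself.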
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996772#comment-13996772 ]

Ryan McGuire commented on CASSANDRA-4718:

I scaled the test up by a factor of 10. I'll update here as the tests complete:

5 c3.8xlarge EC2 cluster:
* [810 threads|http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.4718.200M.EC2.threads-810.json] - cassandra-2.1 timed out in this test, I'll investigate it, but it wasn't one of the branches you asked for anyway.

> More-efficient ExecutorService for improved throughput
> ------------------------------------------------------
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Jonathan Ellis
> Assignee: Benedict
> Priority: Minor
> Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, backpressure-stress.out.txt, baq vs trunk.png, belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, jason_write.svg, op costs of various queues.ods, stress op rate with various queues.ods, v1-stress.out
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996817#comment-13996817 ]

Benedict commented on CASSANDRA-4718:

bq. Can you let it run for at least 30 minutes or so

Let's hold off on that until we have some comparison numbers - agreed it's a good idea, but just want to get some idea of behaviour first

> More-efficient ExecutorService for improved throughput
> ------------------------------------------------------
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Jonathan Ellis
> Assignee: Benedict
> Priority: Minor
> Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, backpressure-stress.out.txt, baq vs trunk.png, belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, jason_write.svg, op costs of various queues.ods, stress op rate with various queues.ods, v1-stress.out
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996388#comment-13996388 ]

Benedict commented on CASSANDRA-4718:

[~jasobrown] I've updated the repository with a number of minor tweaks/refactors, and improved comments. Let me know if there's anything still unclear.

> More-efficient ExecutorService for improved throughput
> ------------------------------------------------------
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Jonathan Ellis
> Assignee: Jason Brown
> Priority: Minor
> Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, backpressure-stress.out.txt, baq vs trunk.png, belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, jason_write.svg, op costs of various queues.ods, stress op rate with various queues.ods, v1-stress.out
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996188#comment-13996188 ]

Benedict commented on CASSANDRA-4718:

bq. Also, is it possible for you to fill up the disks with more sstables than available memory?

+1 ("regular out-of-memory workload" was a dreadfully worded attempt to express this)

bq. As this branch is quite complex with many new additions, I will need a good chunk of time tomorrow to review this.

Let me know if there's anything specific that needs explaining. I will be commenting it before breakfast your time.

> More-efficient ExecutorService for improved throughput
> ------------------------------------------------------
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Jonathan Ellis
> Assignee: Jason Brown
> Priority: Minor
> Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, backpressure-stress.out.txt, baq vs trunk.png, belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, jason_write.svg, op costs of various queues.ods, stress op rate with various queues.ods, v1-stress.out
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995763#comment-13995763 ]

Benedict commented on CASSANDRA-4718:

I have uploaded a complete rewrite [here|https://github.com/belliottsmith/cassandra/tree/4718-sep]

In my tests this is another 10%+ faster than lse-batchnetty, making it roughly on par with 2.1-batchnetty on our old hardware, but thoroughly outstripping it on EC2 and my laptop. I still need to introduce some thorough tests for it, and to comment the code thoroughly, but the basic principle is the same, except that all executors share the same pool of worker threads so that scheduling is easier, and work can be passed more easily between them. I will revisit this work again sometime in the next year to see if we can squeeze anything more out of this, especially as we add more optimisations elsewhere - but for now we're reaching diminishing returns.

[~enigmacurry] can you run a comparison of this and just cassandra-2.1-batchnetty on bdplab and EC2, so we can get a final comparison?

[~jasobrown] if you feel like kicking off a run of this latest branch on your hardware, so we have as many final data points to compare against as possible, that would also be really helpful. I'll get the code commented early tomorrow so we can get this reviewed ASAP.

> More-efficient ExecutorService for improved throughput
> ------------------------------------------------------
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Jonathan Ellis
> Assignee: Jason Brown
> Priority: Minor
> Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, backpressure-stress.out.txt, baq vs trunk.png, belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, jason_write.svg, op costs of various queues.ods, stress op rate with various queues.ods, v1-stress.out
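Editor's illustration: the comment above says that in the 4718-sep rewrite all executors share the same pool of worker threads. The following is only a minimal sketch of that general idea, built from standard java.util.concurrent pieces; it is an assumption-laden toy, not the design of the 4718-sep branch or of SharedExecutorPool. The names SharedWorkerPool, newStage and maxConcurrent are invented for the example.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: names and design are illustrative, not taken from the 4718-sep branch.
final class SharedWorkerPool {
    private final ExecutorService workers; // the one pool every stage borrows threads from

    SharedWorkerPool(int threads) {
        this.workers = Executors.newFixedThreadPool(threads);
    }

    /** Creates a logical executor ("stage") that runs at most maxConcurrent tasks at once. */
    Executor newStage(int maxConcurrent) {
        return new Stage(maxConcurrent);
    }

    void shutdown() {
        workers.shutdown();
    }

    private final class Stage implements Executor {
        private final Queue<Runnable> pending = new ConcurrentLinkedQueue<>();
        private final AtomicInteger active = new AtomicInteger(); // workers currently lent to this stage
        private final int maxConcurrent;

        Stage(int maxConcurrent) {
            this.maxConcurrent = maxConcurrent;
        }

        @Override
        public void execute(Runnable task) {
            pending.add(task);
            maybeSchedule();
        }

        // Borrow one shared worker if this stage has queued work and is under its cap.
        private void maybeSchedule() {
            while (true) {
                int current = active.get();
                if (current >= maxConcurrent || pending.isEmpty()) {
                    return;
                }
                if (active.compareAndSet(current, current + 1)) {
                    workers.execute(this::drain);
                    return;
                }
            }
        }

        // Runs on a shared worker: process this stage's queue until it is empty.
        private void drain() {
            try {
                Runnable task;
                while ((task = pending.poll()) != null) {
                    task.run();
                }
            } finally {
                active.decrementAndGet();
                maybeSchedule(); // a task may have arrived after the last poll()
            }
        }
    }
}
```

In this toy, a read stage and a write stage would be created with pool.newStage(n) and fed through execute(). One obvious weakness is that a busy stage can hold a borrowed worker for as long as its queue stays non-empty, so it says nothing about how the real implementation balances stages or avoids signalling costs.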
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995766#comment-13995766 ]

Benedict commented on CASSANDRA-4718:

[~enigmacurry] it would be great to see equivalent runs for a regular out-of-memory workload as well, just to make sure there aren't any weird results. Thanks!

> More-efficient ExecutorService for improved throughput
> ------------------------------------------------------
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Jonathan Ellis
> Assignee: Jason Brown
> Priority: Minor
> Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, backpressure-stress.out.txt, baq vs trunk.png, belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, jason_write.svg, op costs of various queues.ods, stress op rate with various queues.ods, v1-stress.out
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993059#comment-13993059 ]

Benedict commented on CASSANDRA-4718:

I have a few branches to test out, and I want to test them out on a variety of hardware. [~enigmacurry] can you run them on our internal multi-cpu boxes, and an AWS c3.8xlarge 4-node cluster, to the following spec:

For each branch run: 20M inserts over 1M unique keys with 30, 90, 270 and 810 threads, then wipe each cluster and perform a single 1M key insert, and then run 20M reads over 1M unique keys with the same thread counts. All told that should take around 3hrs for -mode cql3 native prepared; I'd then like to repeat the tests for -mode thrift smart.

The branches are:
[https://github.com/belliottsmith/cassandra/tree/4718-lse]
[https://github.com/belliottsmith/cassandra/tree/4718-lse-batchnetty]
[https://github.com/belliottsmith/cassandra/tree/4718-fjp]
[https://github.com/belliottsmith/cassandra/tree/4718-lowsignal]
[https://github.com/belliottsmith/cassandra/tree/cassandra-2.1]

Make sure you use my cassandra-2.1 so we're testing like-to-like (they're all rebased to the same version).

I'll elaborate on the contents of these branches later, but suffice it to say the 4718-lse branch contains a new executor which attempts to reduce signalling costs to near zero by scheduling the correct number of threads to deal with the level of throughput the executor has been dealing with over the previous (short) adjustment window. -batchnetty includes some simple batching of netty messages. 4718-lowsignal is an enhanced version of the patch I uploaded previously to this ticket, and 4718-fjp is largely unchanged.

On my own box, and on our austin test cluster, I see -lse faster than both -fjp and -lowsignal. However, on our austin cluster (which is a not-super-modern 4-cpu no-hyperthreading setup) I see both of them slower than stock 2.1, although -lse is only slightly slower, whereas -fjp is around 30% slower. I'll post polished numbers a little later.

> More-efficient ExecutorService for improved throughput
> ------------------------------------------------------
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Jonathan Ellis
> Assignee: Jason Brown
> Priority: Minor
> Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, backpressure-stress.out.txt, baq vs trunk.png, op costs of various queues.ods, stress op rate with various queues.ods, v1-stress.out
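Editor's illustration: the comment above describes the 4718-lse executor as scheduling "the correct number of threads" for the throughput seen over the previous short adjustment window. How the branch actually computes that number is not stated in this thread. As a rough, assumed model, one could estimate the thread count as total busy time in the window divided by the window length. The class name WindowedThreadEstimate and all parameters below are hypothetical.

```java
// Hypothetical model only; the real 4718-lse sizing logic is not described in this thread.
final class WindowedThreadEstimate {
    /**
     * Estimates how many threads were needed to keep up during the last window:
     * (tasks completed * average task time) / window length, rounded up and
     * clamped to the range [1, maxThreads].
     */
    static int targetThreads(long tasksInWindow, long avgTaskNanos, long windowNanos, int maxThreads) {
        double busyNanos = (double) tasksInWindow * avgTaskNanos;
        int needed = (int) Math.ceil(busyNanos / windowNanos);
        return Math.max(1, Math.min(maxThreads, needed));
    }
}
```

For instance, 600 tasks of 60us each in a 10ms window amount to 36ms of work, so targetThreads(600, 60_000, 10_000_000, 32) asks for 4 threads. Keeping that many threads awake, rather than signalling one per task, is the kind of saving the comment appears to be aiming at.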
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994679#comment-13994679 ]

Ryan McGuire commented on CASSANDRA-4718:

@benedict, fwiw here's EC2 benchmarks:
* [810 threads|http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.4718.ec2-4node.threads-810.log]
* [270 threads|http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.4718.ec2-4node.threads-270.log]
* [90 threads|http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.4718.ec2-4node.threads-90.log]

> More-efficient ExecutorService for improved throughput
> ------------------------------------------------------
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Jonathan Ellis
> Assignee: Jason Brown
> Priority: Minor
> Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, backpressure-stress.out.txt, baq vs trunk.png, belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, jason_write.svg, op costs of various queues.ods, stress op rate with various queues.ods, v1-stress.out
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993374#comment-13993374 ]

Ryan McGuire commented on CASSANDRA-4718:

Above is with 4 nodes, one of which was the one hosting stress. Here's a 3 node variety, with stress on a separate host:
* [30 threads|http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.4718.3node.threads-30.log]
* [90 threads|http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.4718.3node.threads-90.log]
* [270 threads|http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.4718.3node.threads-270.log]
* [810 threads|http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.4718.3node.threads-810.log]

> More-efficient ExecutorService for improved throughput
> ------------------------------------------------------
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Jonathan Ellis
> Assignee: Jason Brown
> Priority: Minor
> Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, backpressure-stress.out.txt, baq vs trunk.png, belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, jason_write.svg, op costs of various queues.ods, stress op rate with various queues.ods, v1-stress.out
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986626#comment-13986626 ]

Benedict commented on CASSANDRA-4718:

For comparison, a graph of Jason's results:
https://docs.google.com/spreadsheets/d/1mLxyY9syaAlDb1ALGQ-oF7Qo0tQffbcNgFMVPktde88/edit?usp=sharing

I'd like to do a couple of things here:
# Tweak the Low Signal patch to potentially signal more intelligently rather than just always aggregating the last 5us of requests
# Try increasing the queue length
# Try these tests for a standardized load - the stress functionality we're using is great for giving a good ballpark idea of performance, but it varies the number of ops with each run, so running with a fixed 10M ops per run might be useful (stress could maybe do with an "ops per thread" option, as for the low thread counts this is a lot of work, but for high counts not very much)

The lowsignal patch looks to outperform at certain thresholds, but underperform at others, and I'm hoping 1 and 2 might help us make it better overall. At high thread counts the difference is almost 20% for writes, which is non-trivial.

> More-efficient ExecutorService for improved throughput
> ------------------------------------------------------
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Jonathan Ellis
> Assignee: Jason Brown
> Priority: Minor
> Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, backpressure-stress.out.txt, baq vs trunk.png, op costs of various queues.ods, stress op rate with various queues.ods, v1-stress.out
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986065#comment-13986065 ]

Benedict commented on CASSANDRA-4718:

Yes, typo - corrected.

Yeah, just straight up read or write requests (I can push the same for both, but they tank for writes when we start hitting compaction/flush etc). I'm being as conservative as possible and assuming every core is spending every moment working on one of the ops (in reality it's more like half of that time). I don't have CASSANDRA-7061 yet to have any really accurate numbers to play with. As regards costs for unpark(), I've timed them in the past and that's in the ball park of what you'd expect given the literature and general OS behaviour (10us is probably a bit heavier than they often clock in, but a good figure to work with)

> More-efficient ExecutorService for improved throughput
> ------------------------------------------------------
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Jonathan Ellis
> Assignee: Jason Brown
> Priority: Minor
> Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, baq vs trunk.png, op costs of various queues.ods, stress op rate with various queues.ods, v1-stress.out
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986060#comment-13986060 ]

Jason Brown commented on CASSANDRA-4718:

bq. translates to around 60us/up

Did you mean 60us/*op*? Also, are these just requests from (new) cassandra stress? You've described the time to process each 'tiny message' but not what a tiny message is :). How are you measuring the time for each request (more for my own curiosity)?

> More-efficient ExecutorService for improved throughput
> ------------------------------------------------------
>
>                 Key: CASSANDRA-4718
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jonathan Ellis
>            Assignee: Jason Brown
>            Priority: Minor
>              Labels: performance
>             Fix For: 2.1.0
>
>         Attachments: 4718-v1.patch, PerThreadQueue.java, baq vs trunk.png, op
> costs of various queues.ods, stress op rate with various queues.ods,
> v1-stress.out
>
>
> Currently all our execution stages dequeue tasks one at a time. This can
> result in contention between producers and consumers (although we do our best
> to minimize this by using LinkedBlockingQueue).
> One approach to mitigating this would be to make consumer threads do more
> work in "bulk" instead of just one task per dequeue. (Producer threads tend
> to be single-task oriented by nature, so I don't see an equivalent
> opportunity there.)
> BlockingQueue has a drainTo(collection, int) method that would be perfect for
> this. However, no ExecutorService in the jdk supports using drainTo, nor
> could I google one.
> What I would like to do here is create just such a beast and wire it into (at
> least) the write and read stages. (Other possible candidates for such an
> optimization, such as the CommitLog and OutboundTCPConnection, are not
> ExecutorService-based and will need to be one-offs.)
> AbstractExecutorService may be useful. The implementations of
> ICommitLogExecutorService may also be useful. (Despite the name these are not
> actual ExecutorServices, although they share the most important properties of
> one.)

--
This message was sent by Atlassian JIRA
(v6.2#6252)
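For readers who have not used it, the drainTo idea in the issue description can be sketched roughly as follows. This is only an illustration of the "bulk dequeue" pattern, not code from any patch attached to the ticket; the class name DrainToConsumer, the batch size of 32 and the demo method are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration; not taken from any CASSANDRA-4718 patch.
final class DrainToConsumer implements Runnable {
    private final BlockingQueue<Runnable> queue;
    private final int maxBatch;

    DrainToConsumer(BlockingQueue<Runnable> queue, int maxBatch) {
        this.queue = queue;
        this.maxBatch = maxBatch;
    }

    @Override
    public void run() {
        List<Runnable> batch = new ArrayList<>(maxBatch);
        try {
            while (!Thread.currentThread().isInterrupted()) {
                // Block for the first task so an idle consumer does not spin.
                batch.add(queue.take());
                // Then take whatever else is already queued, up to the batch limit,
                // in a single interaction with the queue.
                queue.drainTo(batch, maxBatch - 1);
                for (Runnable task : batch) {
                    task.run();
                }
                batch.clear();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the flag and exit
        }
    }

    /** Pushes {@code tasks} no-op-style tasks through one consumer; returns how many ran. */
    static int demo(int tasks) {
        BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
        AtomicInteger executed = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(tasks);
        for (int i = 0; i < tasks; i++) {
            queue.add(() -> { executed.incrementAndGet(); done.countDown(); });
        }
        Thread consumer = new Thread(new DrainToConsumer(queue, 32));
        consumer.setDaemon(true);
        consumer.start();
        try {
            done.await(10, TimeUnit.SECONDS);
            consumer.interrupt();
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return executed.get();
    }
}
```

The consumer touches the queue once per batch rather than once per task, which is the contention saving the description is after. A real stage would also need exception handling around task.run() and an orderly shutdown path.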
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986048#comment-13986048 ]

Benedict commented on CASSANDRA-4718:

Sure. My box can push around 60Kop/s - this translates to around 60us/up (core time), when unpark() clocks in around 10us, you want to avoid it.
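The arithmetic behind Benedict's figure can be spelled out as below. The comment does not say how many cores the box has, so the value of 4 used here is purely an assumption, chosen because it lands near the quoted "around 60us"; OpBudget and its method names are hypothetical.

```java
// Back-of-the-envelope helper; the core count is an assumption, not from the thread.
final class OpBudget {
    /** Core time spent per operation, in microseconds: busyCores * 1e6 / opsPerSecond. */
    static double coreMicrosPerOp(double opsPerSecond, int busyCores) {
        return busyCores * 1_000_000.0 / opsPerSecond;
    }

    /** Share of the per-op core budget consumed by one call costing costMicros. */
    static double fractionOfBudget(double costMicros, double budgetMicros) {
        return costMicros / budgetMicros;
    }

    public static void main(String[] args) {
        double budget = coreMicrosPerOp(60_000, 4); // 60Kop/s; 4 cores is assumed
        System.out.printf("core time per op: %.1f us%n", budget);
        System.out.printf("one 10us unpark(): %.0f%% of that budget%n",
                          100 * fractionOfBudget(10, budget));
    }
}
```

Under that assumption a single unpark() per operation would cost roughly a sixth to a seventh of the whole per-op budget, which is why avoiding it is worthwhile.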
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986040#comment-13986040 ]

Jason Brown commented on CASSANDRA-4718:

[~benedict] Can you clarify what you mean by "when dealing with a flood of tiny messages"?
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986039#comment-13986039 ]

Benedict commented on CASSANDRA-4718:

bq. Perhaps, but the overall improvement in performance (should we attribute that to the work stealing?) seems compelling enough.

I'm not suggesting we forego the patch because of this concern, I'm raising it as something to bear in mind for the future. As I said, though, to some extent I have addressed this concern with the _lowsignal_ patch I uploaded, although it's debatable how elegant that approach is.

bq. I didn't consider this a 'fork' as we're not mucking about with the internals of the FJP itself.

Perhaps we're getting crossed wires and mixing up the patches I have uploaded (no fork), with the suggestion that we _may want to investigate forking in future_ in order to address these issues in a more elegant manner.
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986032#comment-13986032 ]

Jason Brown commented on CASSANDRA-4718:

bq. The Semaphore is blocking

After reading the correct jdk source class this time, I was mistaken about the blocking (i.e. park) aspect of Semaphore (got caught up in the rest of AbstractQueuedSynchronizer, which Semaphore subclasses and uses internally). Thus, I'll test out your branch as is this afternoon.

bq. It isn't forked - this is all in the same extension class that you introduced

I didn't consider this a 'fork' as we're not mucking about with the internals of the FJP itself.

bq. FJP uses an exclusive lock for enqueueing work onto the pool, but does more whilst owning the lock, so is likely to take longer within the critical section

Perhaps, but the overall improvement in performance (should we attribute that to the work stealing?) seems compelling enough.
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986003#comment-13986003 ]

Jeremiah Jordan commented on CASSANDRA-4718:

bq. I can argue from running cassandra in production for almost four years that these metrics are not very helpful. At best they indicate that 'something' is amiss ("hey, pending tasks are getting higher"), but cannot give you a real clue as to what is wrong (gc, I/O contention, cpu throttling). As we got these data points largely for free from TPE, I guess it made sense to expose them, but if we have to go out of our way to fabricate a subset of them for FJP, I propose we drop them going forward (for FJP, at least).

I would argue they are very useful because they give you that high level "something is wrong", so if it's easy to keep them, I am very +1 on that.
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985978#comment-13985978 ]

Benedict commented on CASSANDRA-4718:

bq. The Semaphore is blocking (by design)

It's non-blocking until you run out of permits, at which point it must block. We have _many more_ shared counters than this semaphore, so I highly doubt it will be an issue (if doing nothing but spinning on updating it we could push probably several thousand times our current op-rate, and in reality we will be doing a lot in between, so contention is highly unlikely to be an issue, although it will incur a slight QPI penalty - nothing we don't incur all over the place though).

bq. but any solution is better than forking FJP

It isn't forked - this is all in the same extension class that you introduced...?

bq. I literally have no idea what this means.

FJP uses an exclusive lock for enqueueing work onto the pool, but does more whilst owning the lock, so is likely to take longer within the critical section. The second patch I uploaded attempts to mitigate this for native transport threads as those micros are actually a pretty big deal when dealing with a flood of tiny messages.

bq. As we got these data points largely for free from TPE, I guess it made sense to expose them, but if we have to go out of our way to fabricate a subset of them for FJP, I propose we drop them going forward (for FJP, at least).

I don't really mind, but I think you're overestimating the penalty for maintaining these counters.
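The "non-blocking until you run out of permits" behaviour Benedict describes can be seen with a few lines against java.util.concurrent.Semaphore. This is a standalone demonstration rather than code from the branch; SemaphorePermits is a hypothetical name, and tryAcquire() is used so the point at which acquire() would start to block shows up as a false return instead of a parked thread.

```java
import java.util.concurrent.Semaphore;

// Standalone demonstration; not code from the 4718 branches.
final class SemaphorePermits {
    /** Returns how many of {@code attempts} acquisitions succeed without blocking. */
    static int acquireWithoutBlocking(int permits, int attempts) {
        Semaphore semaphore = new Semaphore(permits);
        int acquired = 0;
        for (int i = 0; i < attempts; i++) {
            // tryAcquire() never parks the caller: it either takes a permit or
            // returns false. A plain acquire() would block at exactly the point
            // where this starts returning false.
            if (semaphore.tryAcquire()) {
                acquired++;
            }
        }
        return acquired;
    }

    public static void main(String[] args) {
        System.out.println(acquireWithoutBlocking(3, 5));
    }
}
```

While permits remain, the cost is an update to one shared counter, which is the basis for the claim that it is no worse than the other shared counters already on the request path.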
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985962#comment-13985962 ]

Jason Brown commented on CASSANDRA-4718:

[~benedict] I agree that FJP does not have a native enqueueing mechanism, but I'm not sure adding a Semaphore right in the middle of the class is the right solution. The Semaphore is blocking (by design) and will be contended for across (numa) cores. As an alternative, at least for the native protocol, does it make sense to move the back pressure point earlier in the processing chain? I'm unfamiliar with EventLoopGroup, but any solution is better than forking FJP. Also, I have an idea I'd like to try out .. give me a day or two.

bq. ... is no more efficient (probably slightly less) than a standard executor. That's a future problem, however

I literally have no idea what this means.

bq. support the metrics that users may have gotten used to.

I can argue from running cassandra in production for almost four years that these metrics are not very helpful. At best they indicate that 'something' is amiss ("hey, pending tasks are getting higher"), but cannot give you a real clue as to what is wrong (gc, I/O contention, cpu throttling). As we got these data points largely for free from TPE, I guess it made sense to expose them, but if we have to go out of our way to fabricate a subset of them for FJP, I propose we drop them going forward (for FJP, at least).
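As a rough illustration of what "largely for free from TPE" means, the JDK's ThreadPoolExecutor exposes active, pending and completed counts directly, whereas ForkJoinPool offers active-thread and queue-size getters but no completed-task counter, so that one would have to be maintained by hand. The sketch below uses only standard JDK getters; the PoolMetrics class and the mapping of getters to metric names are assumptions and are not how Cassandra wires up its metrics.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical helper; the metric names are illustrative only.
final class PoolMetrics {
    // ThreadPoolExecutor: all three values come straight from the JDK class.
    static long pending(ThreadPoolExecutor tpe)   { return tpe.getQueue().size(); }
    static long active(ThreadPoolExecutor tpe)    { return tpe.getActiveCount(); }
    static long completed(ThreadPoolExecutor tpe) { return tpe.getCompletedTaskCount(); }

    // ForkJoinPool: queue sizes and active threads are available as estimates,
    // but there is no built-in completed-task count.
    static long pending(ForkJoinPool fjp) {
        return fjp.getQueuedSubmissionCount() + fjp.getQueuedTaskCount();
    }
    static long active(ForkJoinPool fjp) { return fjp.getActiveThreadCount(); }

    /** Runs {@code tasks} trivial tasks on a small TPE and reports its completed count. */
    static long completedAfter(int tasks) {
        ThreadPoolExecutor tpe = new ThreadPoolExecutor(
                2, 2, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        for (int i = 0; i < tasks; i++) {
            tpe.execute(() -> { });
        }
        tpe.shutdown();
        try {
            tpe.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return completed(tpe);
    }
}
```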
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985658#comment-13985658 ]

Benedict commented on CASSANDRA-4718:

I've uploaded a slight variant of the patch [here|https://github.com/belliottsmith/cassandra/tree/4718-lowsignal] - this introduces a special FJP for processing native transport work, that avoids blocking on enqueue to the pool unless the configured limit has been reached. Instead we schedule a ForkJoinTask that sleeps for 5us, forking any work that has been queued in the interval (and going to sleep only if no work has been seen in the past 5ms). This permits the connection worker threads to return to servicing their connections more promptly.

It has only a modest effect on my box, but it does give a 5-10% bump in native transport performance.
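The general shape of the low-signal approach described in this comment might look something like the sketch below. It is not the code on the 4718-lowsignal branch: it swaps the ForkJoinTask for a plain dispatcher thread, leaves out the configured queue limit entirely, and every name in it (LowSignalDispatcher, submit, demo and so on) is hypothetical. Only the 5us poll interval and the 5ms idle threshold come from the comment.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

// Hypothetical sketch; names and structure are not from the 4718-lowsignal branch.
final class LowSignalDispatcher implements Runnable {
    private static final long POLL_NANOS = TimeUnit.MICROSECONDS.toNanos(5);
    private static final long IDLE_NANOS = TimeUnit.MILLISECONDS.toNanos(5);

    private final Queue<Runnable> pending = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean awake = new AtomicBoolean(true);
    private final ForkJoinPool pool;
    private final Thread thread = new Thread(this, "low-signal-dispatcher");
    private volatile boolean running = true;

    LowSignalDispatcher(ForkJoinPool pool) {
        this.pool = pool;
        thread.setDaemon(true);
        thread.start();
    }

    /** Called by connection threads: enqueue and return without touching the pool. */
    void submit(Runnable task) {
        pending.add(task);
        // Pay for an unpark() only if the dispatcher has gone dormant.
        if (!awake.get() && awake.compareAndSet(false, true)) {
            LockSupport.unpark(thread);
        }
    }

    @Override
    public void run() {
        long lastWork = System.nanoTime();
        while (running) {
            boolean sawWork = false;
            Runnable task;
            while ((task = pending.poll()) != null) {
                pool.execute(task); // hand everything queued in the interval to the pool
                sawWork = true;
            }
            long now = System.nanoTime();
            if (sawWork) {
                lastWork = now;
            } else if (now - lastWork >= IDLE_NANOS) {
                awake.set(false);
                // Re-check after publishing the flag so a concurrent submit() is not missed.
                if (pending.isEmpty()) {
                    while (running && !awake.get()) {
                        LockSupport.park(this);
                    }
                } else {
                    awake.set(true);
                }
                lastWork = System.nanoTime();
                continue;
            }
            LockSupport.parkNanos(this, POLL_NANOS); // short sleep between polls
        }
    }

    void stop() {
        running = false;
        LockSupport.unpark(thread);
    }

    /** Submits {@code tasks} tasks, pausing midway so the dormant path is exercised. */
    static boolean demo(int tasks) {
        ForkJoinPool pool = new ForkJoinPool(2);
        LowSignalDispatcher dispatcher = new LowSignalDispatcher(pool);
        CountDownLatch done = new CountDownLatch(tasks);
        try {
            for (int i = 0; i < tasks; i++) {
                if (i == tasks / 2) {
                    Thread.sleep(20); // long enough for the dispatcher to go dormant
                }
                dispatcher.submit(done::countDown);
            }
            return done.await(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            dispatcher.stop();
            pool.shutdown();
        }
    }
}
```

The trade-off is a little added latency (up to one poll interval) in exchange for producers almost never paying for a lock or an unpark() on the enqueue path.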
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984933#comment-13984933 ]

Benedict commented on CASSANDRA-4718:

I've uploaded a new version of the patch [here|https://github.com/belliottsmith/cassandra/tree/4718-fjp]

I've refactored the DebuggableForkJoinPool a little to support a limited queue (so that our native transport queue doesn't get too long), and to support the metrics that users may have gotten used to.

I've tested the branch out very minimally and do see a very modest performance benefit on my box for reads, but that's far from conclusive - however it's quite likely any benefit is more visible on machines with more cores going spare, as the single queue lock for a standard executor could easily become a point of contention.

One slight concern I have with this approach is that in order to make _enqueueing_ tasks less contentious we will need to either fork ForkJoinPool, or see if it is possible to implement an EventLoopGroup backed by a FJP, and use the same FJP to manage the connections as we do the execution of our tasks (as enqueuing tasks from a FJ-worker is contention-free). Given how FJP is intended to be used it is not optimised for enqueueing tasks, and is no more efficient (probably slightly less) than a standard executor. That's a future problem, however.
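One way to picture a ForkJoinPool with a limited queue plus hand-maintained counters is the wrapper below. It is a minimal sketch assuming a Semaphore is used for the limit (as the later discussion in this thread suggests); it is not the DebuggableForkJoinPool from the 4718-fjp branch, and BoundedForkJoinExecutor, inFlight and completedTasks are hypothetical names.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical wrapper; not the DebuggableForkJoinPool from the 4718-fjp branch.
final class BoundedForkJoinExecutor {
    private final ForkJoinPool pool;
    private final Semaphore permits;
    private final int limit;
    private final LongAdder completed = new LongAdder();

    BoundedForkJoinExecutor(int parallelism, int limit) {
        this.pool = new ForkJoinPool(parallelism);
        this.permits = new Semaphore(limit);
        this.limit = limit;
    }

    /** Submits a task, blocking the caller only once {@code limit} tasks are in flight. */
    void execute(Runnable task) {
        permits.acquireUninterruptibly();
        try {
            pool.execute(() -> {
                try {
                    task.run();
                } finally {
                    completed.increment(); // counter the pool does not keep for us
                    permits.release();
                }
            });
        } catch (RuntimeException e) {
            permits.release(); // e.g. rejected after shutdown: give the permit back
            throw e;
        }
    }

    /** Tasks submitted but not yet finished (queued plus running). */
    int inFlight() { return limit - permits.availablePermits(); }

    long completedTasks() { return completed.sum(); }

    boolean shutdownAndAwait(long timeout, TimeUnit unit) {
        pool.shutdown();
        try {
            return pool.awaitTermination(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

The semaphore gives back pressure without changing ForkJoinPool itself, and the two accessors stand in for the "pending" and "completed" figures operators are used to seeing.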
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977020#comment-13977020 ]

Jason Brown commented on CASSANDRA-4718:

Guys, guys, there's plenty of my patch to criticize - you'll each have your fun, I'm sure :)
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13976994#comment-13976994 ]

Aleksey Yeschenko commented on CASSANDRA-4718:

We were just doing some house cleaning today on IRC

[16:49:06] jbellis: belliottsmith: is 4718 really patch available?
[16:49:14] jbellis: #4718
[16:49:14] CassBotJr: https://issues.apache.org/jira/browse/CASSANDRA-4718 (Unresolved; 2.1): "More-efficient ExecutorService for improved throughput"
[16:50:15] belliottsmith: so jasobrown says - he's marked pavel as reviewer so i've kept out of it beyond asking for a little more stress info :)
[16:51:28] belliottsmith: honestly though, from last time i looked at it (a while back), it was a pretty simple change
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13976988#comment-13976988 ]

Pavel Yaskevich commented on CASSANDRA-4718:

I still want to review this, why are you re-assigning [~benedict]?
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969513#comment-13969513 ] Jason Brown commented on CASSANDRA-4718: OK, will give it a shot today. Also, just noticed I did not tune native_transport_max_threads at all (so I have the default of 128). Might play with that a bit, as well. > More-efficient ExecutorService for improved throughput > -- > > Key: CASSANDRA-4718 > URL: https://issues.apache.org/jira/browse/CASSANDRA-4718 > Project: Cassandra > Issue Type: Improvement >Reporter: Jonathan Ellis >Assignee: Jason Brown >Priority: Minor > Labels: performance > Fix For: 2.1 > > Attachments: 4718-v1.patch, PerThreadQueue.java, baq vs trunk.png, op > costs of various queues.ods, stress op rate with various queues.ods, > v1-stress.out > > > Currently all our execution stages dequeue tasks one at a time. This can > result in contention between producers and consumers (although we do our best > to minimize this by using LinkedBlockingQueue). > One approach to mitigating this would be to make consumer threads do more > work in "bulk" instead of just one task per dequeue. (Producer threads tend > to be single-task oriented by nature, so I don't see an equivalent > opportunity there.) > BlockingQueue has a drainTo(collection, int) method that would be perfect for > this. However, no ExecutorService in the jdk supports using drainTo, nor > could I google one. > What I would like to do here is create just such a beast and wire it into (at > least) the write and read stages. (Other possible candidates for such an > optimization, such as the CommitLog and OutboundTCPConnection, are not > ExecutorService-based and will need to be one-offs.) > AbstractExecutorService may be useful. The implementations of > ICommitLogExecutorService may also be useful. (Despite the name these are not > actual ExecutorServices, although they share the most important properties of > one.) 
-- This message was sent by Atlassian JIRA (v6.2#6252)
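For readers skimming the thread, the bulk-dequeue idea in the issue description can be sketched roughly as follows. This is only an illustration of `BlockingQueue.drainTo(collection, int)` inside a consumer loop; the class name `BatchDrainSketch`, the `MAX_BATCH` value, and the stop-marker convention are all assumptions made for the example and are not taken from any patch attached to the ticket.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch only; this is not the executor that was committed for this ticket.
final class BatchDrainSketch {
    // Assumed upper bound on tasks taken per dequeue round; the value is arbitrary.
    static final int MAX_BATCH = 32;

    /**
     * Runs queued tasks in batches until the stop marker is reached.
     * Returns how many dequeue rounds (take + drainTo) were needed.
     * Tasks queued behind the stop marker in the same batch are not run.
     */
    static int consume(BlockingQueue<Runnable> queue, Runnable stop) {
        List<Runnable> batch = new ArrayList<>(MAX_BATCH);
        int rounds = 0;
        while (true) {
            try {
                batch.add(queue.take());          // block until at least one task exists
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return rounds;
            }
            queue.drainTo(batch, MAX_BATCH - 1);  // then grab whatever else is already queued
            rounds++;
            for (Runnable task : batch) {
                if (task == stop) return rounds;
                task.run();
            }
            batch.clear();
        }
    }

    public static void main(String[] args) {
        BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
        AtomicInteger ran = new AtomicInteger();
        Runnable stop = () -> { };
        for (int i = 0; i < 100; i++) queue.add(ran::incrementAndGet);
        queue.add(stop);
        int rounds = consume(queue, stop);
        System.out.println(ran.get() + " tasks in " + rounds + " dequeue rounds");
    }
}
```

With 100 tasks already queued and a batch limit of 32, the consumer touches the queue in 4 rounds instead of 101 single dequeues, which is the contention reduction the description is after. Whether that translates into a measurable gain is exactly what the rest of the thread debates.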
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969476#comment-13969476 ]
Benedict commented on CASSANDRA-4718:
[~jasobrown]: Could you upload the full stress outputs for these runs? And also try running a separate stress run with a fixed high threadcount and op count? In particular for CQL, the results in the file are a little bit weird. That said, given their consistency for thrift I don't doubt the result is meaningful, but it would be good to understand what we're incorporating a bit better before committing.
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968674#comment-13968674 ]
Benedict commented on CASSANDRA-4718:
Maybe you're one of the few people I haven't told of my dream of internally hashing C* so that each sub-hash is run by its own core. Guaranteed NUMA behaviour, and we can stop performing *any* CASes and make all kinds of single threaded optimisations in various places (e.g. no need to do CoW updates to memtables, so massive garbage reduction). Bit of a pipedream for the moment though :)
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968669#comment-13968669 ]
Jason Brown commented on CASSANDRA-4718:
Yeah, I think the extra cost across the QPI bus and such is easily masked by any disk I/O the actual op may have to do :)
bq. the data is completely randomly hodgepodgedly allocated across the CPUs
Correct, given the current structure of the app. I can imagine something more CPU cache friendly, but it's a huge change and I suspect that I/O still dwarfs those latency reductions. Nice hack day project.
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968651#comment-13968651 ]
Benedict commented on CASSANDRA-4718:
Note that given the way C* works, the work stealing crossing CPU boundaries is really not important - the data is completely randomly hodgepodgedly allocated across the CPUs. Nothing we can do in FJP can fix that :)
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968648#comment-13968648 ]
Benedict commented on CASSANDRA-4718:
I think it is unlikely that the CAS has any meaningful impact. The QPI is very quick. I think this kind of speedup is more likely down to reduced signalling costs (unpark costs ~ 10micros, and work stealing means you have to go to the main queue far less frequently); possibly also the signalling of threads has been directly optimised in FJP. I knocked up a "low-switch" executor but found fairly little benefit on my box, as I can saturate the CPUs very easily (at which point the unpark cost is never incurred). On a many-CPU box, saturating all the cores is difficult, and so it is much more likely you'll be introducing bottlenecks on producers adding to the queue.
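Benedict's point about signalling cost can be made concrete with a small sketch. It shows where an unpark is paid on the producer side and why a consumer that always finds work never causes one. Everything here (the class `UnparkSketch`, its fields, the single-consumer design) is a hypothetical illustration built on `java.util.concurrent.locks.LockSupport`; it is not the "low-switch" executor mentioned in the comment.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;

// Hypothetical single-consumer sketch of where unpark costs arise.
final class UnparkSketch {
    private final ConcurrentLinkedQueue<Runnable> queue = new ConcurrentLinkedQueue<>();
    private volatile Thread waiter;                    // non-null only while the consumer is idle
    final AtomicInteger unparks = new AtomicInteger(); // how many wake-ups producers paid for

    void submit(Runnable task) {
        queue.add(task);
        Thread idle = waiter;
        if (idle != null) {                            // consumer is idle: pay for a wake-up
            unparks.incrementAndGet();
            LockSupport.unpark(idle);
        }                                              // consumer is busy: no signalling at all
    }

    void consume(int expected) {
        int done = 0;
        while (done < expected) {
            Runnable task = queue.poll();
            if (task == null) {
                waiter = Thread.currentThread();
                // Re-check after publishing ourselves so a concurrent submit is not missed.
                if (queue.isEmpty()) LockSupport.park(this);
                waiter = null;
                continue;
            }
            task.run();
            done++;
        }
    }

    /** Runs `tasks` trivial tasks through one consumer thread and returns how many ran. */
    static int demo(int tasks) {
        UnparkSketch sketch = new UnparkSketch();
        AtomicInteger ran = new AtomicInteger();
        Thread consumer = new Thread(() -> sketch.consume(tasks));
        consumer.start();
        for (int i = 0; i < tasks; i++) sketch.submit(ran::incrementAndGet);
        try {
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return ran.get();
    }

    public static void main(String[] args) {
        System.out.println(demo(100_000) + " tasks ran");
    }
}
```

The `unparks` counter varies from run to run: when the producer keeps the queue non-empty, the consumer rarely parks and few wake-ups are paid, which matches the observation that the cost disappears once the CPUs are saturated.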
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968643#comment-13968643 ]
Jason Brown commented on CASSANDRA-4718:
bq. but it ran for 4 times as long, indicating there was a high variance in throughput
Huh, yeah, you are right, it did run longer. Admittedly my eyes have been ignoring that column (shame on me). Let me run the native protocol test again (and try to figure out the read situation, as well).
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968636#comment-13968636 ]
Jason Brown commented on CASSANDRA-4718:
As to multi-cpu machines, I spent a lot of time thinking about the effects of NUMA systems on CAS operations/algs (esp. wrt FJP, obviously). As I mentioned, I'm using systems with two sockets (two NUMA cores). As you get more sockets (and thus more NUMA cores), a thread on one core will be reaching across to more cores to do work stealing, thus adding contention to that memory address. Imagine four threads on four sockets all contending for work on a fifth thread. The memory values for that portion of the queue for that fifth thread are now pulled into all four sockets, thus becoming more of a contention point, as well as impacting latency (due to the CAS operation). However, this could be (and hopefully is) less of a cost than bothering with queues, blocking, posix threads, OS interrupts, and everything else that makes standard thread pool executors work.

Thinking even crazier to optimize the FJP sharing across NUMA cores, this is when I start thinking about digging up the thread affinity work again, and binding threads of similar types (probably by Stage) to sockets, not just an individual CPU (I think that was my problem before). But then I wonder how much is to be gained on non-NUMA systems or systems where you can't determine if it's got NUMA or not (hello, cloud!) - and at that point I'm happy to realize the gains we have and move forward.

bq. what problem are you seeing?
Will ping you offline - too unexciting for this space :)
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968637#comment-13968637 ]
Jason Brown commented on CASSANDRA-4718:
bq. you were tearing down and trashing the data directories between write runs
Yes, was also clearing the page cache, as well.
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968612#comment-13968612 ]
Benedict commented on CASSANDRA-4718:
Just to check: you were tearing down and trashing the data directories between write runs? Because the last result is a bit weird: double the throughput, but it ran for 4 times as long, indicating there was a high variance in throughput (usually means compaction / heavy flushing is taking effect) - but the same workload under thrift had no such spikes...
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968609#comment-13968609 ]
Benedict commented on CASSANDRA-4718:
Nice! I wonder if this is a much bigger impact on multi-cpu machines, as I did not see anything like this dramatic improvement. But this is great. Do you have some stress dumps we can look at?
bq. new 2.1 stress seems broken on reads
Shouldn't be - what problem are you seeing?
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13841943#comment-13841943 ]
Jason Brown commented on CASSANDRA-4718:
OK, looks like my initial stab at switching over to FJP netted about a 10-15% throughput increase, and mixed results on the latency scores (sometimes better, sometimes on par with trunk). I'm going to run some more perf tests this weekend, and will decide how to proceed early next week - but the initial results do look promising. I've only tested the thrift endpoints so far, but when I retest this weekend, I'll throw in the cql3/native protocol, as well.

Here's my current working branch: https://github.com/jasobrown/cassandra/tree/4718_fjp . Note, it's very hacked up/WIP as I wanted to confirm the performance benefits before making everything happy (read: metrics pools). Also, I modified [~xedin]'s thrift-disruptor lib for this: https://github.com/jasobrown/disruptor_thrift_server/tree/4718_fjp.
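As a rough picture of what "switching over to FJP" can look like at the API level, the sketch below uses a `java.util.concurrent.ForkJoinPool` where a queue-backed thread pool would otherwise sit; `ForkJoinPool` implements `ExecutorService`, so it can be handed to code that expects one. The class name `FjpStageSketch`, the choice of `asyncMode = true`, and the trivial stand-in operations are assumptions for illustration and do not reproduce the linked branch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;

// Hypothetical sketch; it does not reproduce the 4718_fjp branch.
final class FjpStageSketch {
    /** Builds a ForkJoinPool (which is also an ExecutorService) for event-style tasks. */
    static ForkJoinPool newStage(int parallelism) {
        // asyncMode = true: local FIFO scheduling for forked tasks that are never joined.
        return new ForkJoinPool(parallelism,
                                ForkJoinPool.defaultForkJoinWorkerThreadFactory,
                                null,   // no custom uncaught-exception handler
                                true);
    }

    /** Submits `ops` trivial operations and returns the sum of their results. */
    static long runOps(int ops, int parallelism) {
        ForkJoinPool stage = newStage(parallelism);
        try {
            List<ForkJoinTask<Integer>> pending = new ArrayList<>(ops);
            for (int i = 0; i < ops; i++) {
                final int n = i;
                Callable<Integer> op = () -> n;  // stand-in for a read or write
                pending.add(stage.submit(op));
            }
            long sum = 0;
            // The caller joins here only to collect results for the demo.
            for (ForkJoinTask<Integer> task : pending) sum += task.join();
            return sum;
        } finally {
            stage.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runOps(1000, Runtime.getRuntime().availableProcessors()));
    }
}
```

Idle workers in such a pool steal from each other's queues rather than all waiting on one shared queue, which is the property the later comments in this thread credit (or question) for the throughput change.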
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13822199#comment-13822199 ]
Jason Brown commented on CASSANDRA-4718:
Ha, just this week I was untangling my branch for CASSANDRA-1632, which included the FJP work. Should be able to get to this one next week after more performance testing.
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13822163#comment-13822163 ]
Jonathan Ellis commented on CASSANDRA-4718:
What's the latest on this? I think Jason had some work with FJP he was testing out...
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788501#comment-13788501 ] Benedict commented on CASSANDRA-4718: - Not necessarily. I still think that was most likely variance: - I have BAQ at same speed as LBQ in application - a 2x slow down of LBQ -> 0.01x slow down of application - a 10x slow down of LBQ -> 0.05x slow down of application => the queue speed is currently only ~1% of application cost. It's possible the faster queue is causing greater contention at a sync point, but this wouldn't work in the opposite direction if the contention at the sync point is low. Either way, if this were true we'd see the artificially slow queues also improve stress performance. Ryan also ran some of my tests and found no difference. I wouldn't absolutely rule out the possibility his test was valid, though, as I did not swap out the queues in OutboundTcpConnection for these tests as, at the time, I was concerned about the calls to size() which are expensive for my test queues, and I wanted the queue swap to be on equal terms across the board. I realise now these are only called via JMX, so shouldn't stop me swapping them in. I've just tried a quick test of directly (in process) stressing through the MessagingService and found no measureable difference to putting BAQ in the OutboundTcpConnection, though if I swap out across the board it is about 25% slower, which itself is interesting as this is close to a full stress, minus thrift. 
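The back-of-envelope estimate in the comment above can be written out explicitly. The sketch below assumes a simple additive model that the comment does not state: if the queue accounts for a fraction f of per-operation cost, making the queue k times slower slows the application by f * (k - 1). It also reads "0.01x slow down" as a 1% slowdown. The class and method names are hypothetical.

```java
// Hypothetical helper: estimates what fraction of application cost the queue
// represents, given how much an artificially slowed queue slowed the application.
final class QueueCostEstimate {
    /**
     * @param queueSlowdownFactor k, e.g. 2.0 for a queue made twice as slow
     * @param observedAppSlowdown relative application slowdown, e.g. 0.01 for 1%
     * @return estimated fraction f of application cost spent in the queue,
     *         from f * (k - 1) = observedAppSlowdown
     */
    static double queueFraction(double queueSlowdownFactor, double observedAppSlowdown) {
        return observedAppSlowdown / (queueSlowdownFactor - 1.0);
    }

    public static void main(String[] args) {
        // Figures quoted in the comment: 2x -> 1% slower, 10x -> 5% slower.
        System.out.printf("2x queue:  %.2f%% of cost%n", 100 * queueFraction(2.0, 0.01));
        System.out.printf("10x queue: %.2f%% of cost%n", 100 * queueFraction(10.0, 0.05));
    }
}
```

Under this model the two data points give roughly 1% and 0.6%, which is consistent with the "~1% of application cost" conclusion.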
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788426#comment-13788426 ]
Jonathan Ellis commented on CASSANDRA-4718:

bq. The faster queue actually slows down the process, by about 9% - more than the queue supposedly much slower than it

So this actually confirms Ryan's original measurement of C*/BAQ [slow queue] faster than C*/LBQ [fast queue]?
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788425#comment-13788425 ]
Benedict commented on CASSANDRA-4718:

Disruptors are very difficult to use as a drop-in replacement for the executor service, so I tried to knock up some queues that could provide similar performance without ripping apart the whole application. I benchmarked the resulting queues under high load, in isolation, against LinkedBlockingQueue, BlockingArrayQueue and the Disruptor, and plotted the average op costs in the "op costs of various queues" attachment*. As can be seen, these queues and the Disruptor are substantially faster under high load than LinkedBlockingQueue; however, it can also be seen that:
- The average op cost for LinkedBlockingQueue is still very low, in fact only around 300ns at worst
- BlockingArrayQueue is considerably worse than LinkedBlockingQueue under all conditions

These suggest both that the overhead attributed to LinkedBlockingQueue for a 1Mop workload (as run above) should be at most a few seconds of the overall cost (probably much less); and that BlockingArrayQueue is unlikely to make any cost incurred by LinkedBlockingQueue substantially better. This made me suspect the previous result might be attributable to random variance, but to be sure I ran a number of ccm -stress tests with the different queues, and plotted the results in "stress op rate with various queues.ods", which show the following:
1) No meaningful difference between BAQ, LBQ and SlowQueue (though the latter has a clear ~1% slow down)
2) UltraSlow (~10x slow down, or 2000ns spinning each op) is approximately 5% slower
3) The faster queue actually slows down the process, by about 9% - more than the queue supposedly much slower than it!

Anyway, I've been concurrently looking at where I might be able to improve performance independent of this, and have found the following:
A) Raw performance of local reads is ~6-7x faster than through Stress
B) Raw performance of local reads run asynchronously is ~4x faster
C) Raw performance of local reads run asynchronously using the fast queue is ~4.7x faster
D) Performance of local reads from the Thrift server-side methods is ~3x faster
E) Performance of remote (i.e. local non-optimised) reads is ~1.5x faster

In particular (C) is interesting, as it demonstrates the queue really is faster in use, but I've yet to absolutely determine why that translates into an overall decline in throughput. It looks as though it's possible it causes greater congestion in LockSupport.unpark(), but this is a new piece of information, derived from YourKit. As these sorts of methods are difficult to meter accurately I don't necessarily trust it, and haven't had a chance to figure out what I can do with the information. If it is accurate, and I can figure out how to reduce the overhead, we might get a modest speed boost, which will accumulate as we find other places to improve.

As to the overall problem of improving throughput, it seems to me that there are two big avenues to explore:
1) the networking (software) overhead is large;
2) possibly the cost of managing thread liveness (e.g. park/unpark/scheduler costs); though the evidence for this is as yet inconclusive... given the op rate and other evidence it doesn't seem to be synchronization overhead. I'm still trying to pin this down.

Once the costs here are nailed down as tight as they can go, I'm pretty confident we can get some noticeable improvements to the actual work being done, but since that currently accounts for only a fraction of the time spent (probably less than 20%), I'd rather wait until it was a higher percentage so any improvement is multiplied.

* These can be replicated by running org.apache.cassandra.concurrent.test.bench.Benchmark on any of the linked branches on github.
https://github.com/belliottsmith/cassandra/tree/4718-lbq [using LinkedBlockingQueue]
https://github.com/belliottsmith/cassandra/tree/4718-baq [using BlockingArrayQueue]
https://github.com/belliottsmith/cassandra/tree/4718-lpbq [using a new high performance queue]
https://github.com/belliottsmith/cassandra/tree/4718-slow [using a LinkedBlockingQueue with 200ns spinning each op]
https://github.com/belliottsmith/cassandra/tree/4718-ultraslow [using a LinkedBlockingQueue with 2000ns spinning each op]
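An artificially slowed queue of the kind described for the 4718-slow and 4718-ultraslow branches could look roughly like the sketch below. This is an assumption about the approach, not the code on those branches: the class name `SpinningQueue` and its constructor parameter are hypothetical, and only `offer` and `take` are slowed here.

```java
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: a LinkedBlockingQueue that busy-spins for a fixed time
// on each offer/take, to simulate a more expensive queue implementation.
final class SpinningQueue<E> extends LinkedBlockingQueue<E> {
    private static final long serialVersionUID = 1L;
    private final long spinNanos;

    SpinningQueue(long spinNanos) {
        this.spinNanos = spinNanos;
    }

    // Burn CPU until spinNanos have elapsed; no parking, so the cost is pure spin.
    private void spin() {
        long deadline = System.nanoTime() + spinNanos;
        while (System.nanoTime() - deadline < 0) {
            // busy-wait
        }
    }

    @Override
    public boolean offer(E e) {
        spin();                 // add() is inherited and delegates to offer()
        return super.offer(e);
    }

    @Override
    public E take() throws InterruptedException {
        E e = super.take();
        spin();                 // charge the cost after the element is available
        return e;
    }
    // put(), poll() and drainTo() are left unslowed in this sketch.
}
```

Passing `new SpinningQueue<Runnable>(200)` or `new SpinningQueue<Runnable>(2000)` as the work queue of a `ThreadPoolExecutor` would approximate the 200ns and 2000ns variants; note that an executor with a keep-alive timeout also calls `poll(long, TimeUnit)`, which this sketch does not slow.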
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788243#comment-13788243 ]
darion yaphets commented on CASSANDRA-4718:

LMAX Disruptor's RingBuffer may be a good choice for a lock-free component. But the ring buffer may need a larger size so that entries it holds are not overwritten by new ones, and that means using more memory ...
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682785#comment-13682785 ]
Jonathan Ellis commented on CASSANDRA-4718:

Sounds like a reasonable place to start.

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682777#comment-13682777 ]
Pavel Yaskevich commented on CASSANDRA-4718:

I think even though we aren't in the most cache friendly behavior with variable size RMs we can still utilize better dispatch behavior with low cost synchronization. We can't do anything about blocking I/O operations requiring separate thread but I think it's time to re-evaluate NIO async sockets vs. having thread per in/out connection.
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13680705#comment-13680705 ]
Jonathan Ellis commented on CASSANDRA-4718:

[~xedin] Thoughts on my comment above? https://issues.apache.org/jira/browse/CASSANDRA-4718?focusedCommentId=13629447&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13629447
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13635568#comment-13635568 ]
Marcus Eriksson commented on CASSANDRA-4718:

unless the java api has improved _alot_ the last year or so, the code will be horrible
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13635560#comment-13635560 ]
Piotr Kołaczkowski commented on CASSANDRA-4718:

I'm not suggesting using scala in C* nor anywhere. It was just quicker for me to write a throw-away benchmark.
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13635543#comment-13635543 ]
Marcus Eriksson commented on CASSANDRA-4718:

ftr, im very much -1 on using scala in cassandra (dont know if you suggest that even) i know it is supposed to interface nicely with java code, but it generally becomes a huge hairy part of the code base that noone wants to touch
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13635532#comment-13635532 ]
Piotr Kołaczkowski commented on CASSANDRA-4718:

I made another version of the benchmark, according to Sergio's suggestions. Now it uses the following message processing graph:
{noformat}
   /-- stage 0 processor 0 \   /         \              /---         ---\
   +-- stage 0 processor 1 +   +         +              +---         ---+
>--+-- stage 0 processor 2 +>+ STAGE 1 +-->- ... >--->-+--- STAGE m ---+->
   +-- ...                 +   +         +              +---         ---+
   \-- stage 0 processor n /   \         /              \---         ---/
{noformat}
128 threads concurrently try to get messages through all the stages and measure the average latency, including the time required for the message to enter stage 0. Thread-pool stages are built from fixed-size thread pools with n=8, because there are 8 cores. Actor-based stages are built from 128 actors each, with a RoundRobinRouter in front of every stage.

Average latencies:
{noformat}
3 stages:
Sync:    364687 ns
Async:   210766 ns
Akka:    201842 ns

4 stages:
Sync:    492581 ns
Async:   221118 ns
Akka:    239407 ns

5 stages:
Sync:    671733 ns
Async:   245370 ns
Akka:    283798 ns

6 stages:
Sync:    781759 ns
Async:   262742 ns
Akka:    309384 ns
{noformat}
So Akka comes slightly slower than async thread pools. If someone wants to play with my code, here is the up-to-date version:
{noformat}
import java.util.concurrent.{CountDownLatch, Executors}
import akka.actor.{Props, ActorSystem, Actor, ActorRef}
import akka.routing.{SmallestMailboxRouter, RoundRobinRouter}

class Message {
  var counter = 0
  val latch = new CountDownLatch(1)
}

abstract class MultistageThreadPoolProcessor(stageCount: Int) {
  val stages = for (i <- 1 to stageCount) yield Executors.newFixedThreadPool(8)

  def shutdown() {
    stages.foreach(_.shutdown())
  }
}

/** Synchronously processes a message through the stages.
  * The message is passed stage-to-stage by the coordinator thread.
  */
class SyncThreadPoolProcessor(stageCount: Int) extends MultistageThreadPoolProcessor(stageCount) {
  def process() {
    val message = new Message
    val task = new Runnable() {
      def run() { message.counter += 1 }
    }
    for (executor <- stages)
      executor.submit(task).get()
  }
}

/** Asynchronously processes a message through the stages.
  * Every stage after finishing its processing of the message
  * passes the message directly to the next stage, without bothering the coordinator thread.
  */
class AsyncThreadPoolProcessor(stageCount: Int) extends MultistageThreadPoolProcessor(stageCount) {
  def process() {
    val message = new Message
    val task = new Runnable() {
      def run() {
        message.counter += 1
        if (message.counter >= stages.size)
          message.latch.countDown()
        else
          stages(message.counter).submit(this)
      }
    }
    stages(0).submit(task)
    message.latch.await()
  }
}

/** Similar to AsyncThreadPoolProcessor but it uses Akka actor system instead of thread pools and queues.
  * Every stage after finishing its processing of the message
  * passes the message directly to the next stage, without bothering the coordinator thread.
  */
class AkkaProcessor(stageCount: Int) {
  val system = ActorSystem()

  val stages: IndexedSeq[ActorRef] = {
    for (i <- 1 to stageCount) yield
      system.actorOf(Props(createActor()).withRouter(RoundRobinRouter(nrOfInstances = 128)))
  }

  def createActor(): Actor = {
    new Actor {
      def receive = {
        case m: Message =>
          m.counter += 1
          if (m.counter >= stages.size)
            m.latch.countDown()
          else
            stages(m.counter) ! m
      }
    }
  }

  def process() {
    val message = new Message
    stages(0) ! message
    message.latch.await()
  }

  def shutdown() {
    system.shutdown()
  }
}

object MessagingBenchmark extends App {

  def measureLatency(count: Int, f: () => Any): Double = {
    val start = System.nanoTime()
    for (i <- 1 to count)
      f()
    val end = System.nanoTime()
    (end - start).toDouble / count
  }

  def measureLatency(threadCount: Int, messageCount: Int, f: () => Any): Double = {
    class RequestThread extends Thread {
      var latency: Double = 0.0
      override def run() { latency = measureLatency(messageCount, f) }
    }
    val threads = for (i <- 1 to threadCount) yield new RequestThread()
    threads.foreach(_.start())
    threads.foreach(_.join())
    threads.map(_.latency).sum / threads.size
  }

  val messageCount = 5
  for (s
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13635141#comment-13635141 ]
Piotr Kołaczkowski commented on CASSANDRA-4718:

Interestingly, after boosting the number of threads that invoke the process() method from 1 to 16, Akka gets slower, while the thread-pool-per-stage approach gets faster.

16 user threads invoking process(), 4-core i7 with HT (8 virtual cores):
{noformat}
2 stages:
Sync:     28195 ns
Async:    26852 ns
Akka:     51651 ns

4 stages:
Sync:     75295 ns
Async:    60381 ns
Akka:     85954 ns

8 stages:
Sync:    176879 ns
Async:   124712 ns
Akka:    103073 ns

16 stages:
Sync:    367728 ns
Async:   259715 ns
Akka:    146875 ns
{noformat}
top reports total ~780% CPU utilisation:
thread-pools: ~60% system, ~40% user
Akka: ~15% system, ~85% user

I will try to add the Disruptor to the benchmark suite.
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13635120#comment-13635120 ]
Piotr Kołaczkowski commented on CASSANDRA-4718:

Another thing to consider might be using a high-performance Actor library e.g. Akka. I did a quick microbenchmark to see what is the latency of just passing a single message through several stages, in 3 variants:
1. Sync: one threadpool per stage, where some coordinator thread just moves message from one ExecutorService to another, after the stage finished processing
2. Async: one threadpool per stage, where every stage directly asynchronously pushes its result into the next stage
3. Akka: one Akka actor per stage, where every stage directly asynchronously pushes its result into the next stage

The clear winner is Akka:
{noformat}
2 stages:
Sync:     38717 ns
Async:    36159 ns
Akka:     12969 ns

4 stages:
Sync:     65793 ns
Async:    49964 ns
Akka:     18516 ns

8 stages:
Sync:    162256 ns
Async:       19 ns
Akka:      9237 ns

16 stages:
Sync:    296951 ns
Async:   183588 ns
Akka:     13574 ns

32 stages:
Sync:    572605 ns
Async:   361959 ns
Akka:     23344 ns
{noformat}
Code of the benchmark:
{noformat}
package pl.pk.messaging

import java.util.concurrent.{CountDownLatch, Executors}
import akka.actor.{Props, ActorSystem, Actor, ActorRef}

class Message {
  var counter = 0
  val latch = new CountDownLatch(1)
}

abstract class MultistageThreadPoolProcessor(stageCount: Int) {
  val stages = for (i <- 1 to stageCount) yield Executors.newCachedThreadPool()

  def shutdown() {
    stages.foreach(_.shutdown())
  }
}

/** Synchronously processes a message through the stages.
  * The message is passed stage-to-stage by the coordinator thread.
  */
class SyncThreadPoolProcessor(stageCount: Int) extends MultistageThreadPoolProcessor(stageCount) {
  def process() {
    val message = new Message
    val task = new Runnable() {
      def run() { message.counter += 1 }
    }
    for (executor <- stages)
      executor.submit(task).get()
  }
}

/** Asynchronously processes a message through the stages.
  * Every stage after finishing its processing of the message
  * passes the message directly to the next stage, without bothering the coordinator thread.
  */
class AsyncThreadPoolProcessor(stageCount: Int) extends MultistageThreadPoolProcessor(stageCount) {
  def process() {
    val message = new Message
    val task = new Runnable() {
      def run() {
        message.counter += 1
        if (message.counter >= stages.size)
          message.latch.countDown()
        else
          stages(message.counter).submit(this)
      }
    }
    stages(0).submit(task)
    message.latch.await()
  }
}

/** Similar to AsyncThreadPoolProcessor but it uses Akka actor system instead of thread pools and queues.
  * Every stage after finishing its processing of the message
  * passes the message directly to the next stage, without bothering the coordinator thread.
  */
class AkkaProcessor(stageCount: Int) {
  val system = ActorSystem()

  val stages: IndexedSeq[ActorRef] = {
    for (i <- 1 to stageCount) yield system.actorOf(Props(new Actor {
      def receive = {
        case m: Message =>
          m.counter += 1
          if (m.counter >= stages.size)
            m.latch.countDown()
          else
            stages(m.counter) ! m
      }
    }))
  }

  def process() {
    val message = new Message
    stages(0) ! message
    message.latch.await()
  }

  def shutdown() {
    system.shutdown()
  }
}

object MessagingBenchmark extends App {

  def measureLatency(count: Int, f: () => Any): Double = {
    val start = System.nanoTime()
    for (i <- 1 to count)
      f()
    val end = System.nanoTime()
    (end - start).toDouble / count
  }

  val messageCount = 20
  for (stageCount <- List(2,4,8,16,32)) {
    printf("\n%d stages: \n", stageCount)
    val syncProcessor = new SyncThreadPoolProcessor(stageCount)
    val asyncProcessor = new AsyncThreadPoolProcessor(stageCount)
    val akkaProcessor = new AkkaProcessor(stageCount)

    printf("Sync: %8.0f ns\n", measureLatency(messageCount, syncProcessor.process))
    printf("Async: %8.0f ns\n", measureLatency(messageCount, asyncProcessor.process))
    printf("Akka: %8.0f ns\n", measureLatency(messageCount, akkaProcessor.process))

    syncProcessor.shutdown()
    asyncProcessor.shutdown()
    akkaProcessor.shutdown()
  }
}
{noformat}