[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-07-15 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062217#comment-14062217
 ] 

Jonathan Ellis commented on CASSANDRA-4718:
---

To summarize for those browsing, the primary result here was the introduction 
of SharedExecutorPool: 
https://github.com/apache/cassandra/blob/cassandra-2.1.0/src/java/org/apache/cassandra/concurrent/SharedExecutorPool.java

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Jonathan Ellis
> Assignee: Benedict
> Priority: Minor
> Labels: performance
> Fix For: 2.1 rc1
>
> Attachments: 4718-v1.patch, E100M_summary_key_s.svg, 
> E10M_summary_key_s.svg, E600M_summary_key_s.svg, PerThreadQueue.java, 
> austin_diskbound_read.svg, aws.svg, aws_read.svg, 
> backpressure-stress.out.txt, baq vs trunk.png, 
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
> jason_run1.svg, jason_run2.svg, jason_run3.svg, jason_write.svg, op costs of 
> various queues.ods, stress op rate with various queues.ods, 
> stress_2014May15.txt, stress_2014May16.txt, v1-stress.out
>
>
> Currently all our execution stages dequeue tasks one at a time.  This can 
> result in contention between producers and consumers (although we do our best 
> to minimize this by using LinkedBlockingQueue).
> One approach to mitigating this would be to make consumer threads do more 
> work in "bulk" instead of just one task per dequeue.  (Producer threads tend 
> to be single-task oriented by nature, so I don't see an equivalent 
> opportunity there.)
> BlockingQueue has a drainTo(collection, int) method that would be perfect for 
> this.  However, no ExecutorService in the jdk supports using drainTo, nor 
> could I google one.
> What I would like to do here is create just such a beast and wire it into (at 
> least) the write and read stages.  (Other possible candidates for such an 
> optimization, such as the CommitLog and OutboundTCPConnection, are not 
> ExecutorService-based and will need to be one-offs.)
> AbstractExecutorService may be useful.  The implementations of 
> ICommitLogExecutorService may also be useful. (Despite the name these are not 
> actual ExecutorServices, although they share the most important properties of 
> one.)
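
For readers skimming the description above, the bulk-dequeue idea can be sketched roughly as follows. This is only an illustration of the drainTo(collection, int) approach, not the SharedExecutorPool that was eventually committed; the class name DrainToSketch, the helper runBatch, and the batch sizes are hypothetical and chosen just for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class DrainToSketch {
    // Hypothetical helper: given one task the consumer already dequeued,
    // grab up to maxBatch - 1 more in a single non-blocking drainTo call,
    // then run the whole batch. Returns the number of tasks executed.
    static int runBatch(Runnable first, BlockingQueue<Runnable> queue, int maxBatch) {
        List<Runnable> batch = new ArrayList<>(maxBatch);
        batch.add(first);
        queue.drainTo(batch, maxBatch - 1); // bulk dequeue of whatever is already queued
        for (Runnable task : batch) {
            task.run();
        }
        return batch.size();
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
        AtomicInteger counter = new AtomicInteger();
        for (int i = 0; i < 10; i++) {
            queue.add(counter::incrementAndGet);
        }
        // A consumer thread would loop: block in take() for one task, then drain the rest.
        int first = runBatch(queue.take(), queue, 4);
        int second = runBatch(queue.take(), queue, 8);
        System.out.println(first + " " + second + " " + counter.get()); // prints "4 6 10"
    }
}
```

The consumer blocks only for the first task of each batch, so producers and consumers touch the queue less often per task than with one take() per task.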



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-29 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012945#comment-14012945
 ] 

Benedict commented on CASSANDRA-4718:
-

Committed with that doc reworded



[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-29 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012513#comment-14012513
 ] 

Jason Brown commented on CASSANDRA-4718:


nit: please clean up the documentation grammar a little bit above the 
SEPWorker.prevStopCheck declaration. Other than that, +1.



[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-27 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009595#comment-14009595
 ] 

Benedict commented on CASSANDRA-4718:
-

I've uploaded an update to the branch which should permit spinning down the 
last thread much more simply (and correctly) than the previous patch did. I've 
also retested it to confirm the performance characteristics remain intact.



[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-23 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008032#comment-14008032
 ] 

Jason Brown commented on CASSANDRA-4718:


Spoke with [~benedict] about his latest change (to spin down the last thread). 
We realized it had a bug, and he'll take another couple of days to fix it or 
find a better solution.



[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-23 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007043#comment-14007043
 ] 

Benedict commented on CASSANDRA-4718:
-

I've uploaded an updated branch. I tweaked a couple of minor things (I let the 
last thread spin down, which couldn't happen before), and re-ran my battery of 
tests to confirm the performance is still good. I've introduced a new long test 
and fixed the ClassCastException on the logger thread. Should be good to go.



[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-21 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005093#comment-14005093
 ] 

Benedict commented on CASSANDRA-4718:
-

Thanks! I just need to tidy up those two minor bugs you spotted, and I'm adding 
a long test as well so it has some isolated testing for future work. Should be 
ready to commit this evening.



[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-21 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005087#comment-14005087
 ] 

Jason Brown commented on CASSANDRA-4718:


+1 on the current 4718-sep branch

FTR, [~benedict] and I have worked closely on the code for the last several 
weeks, and I've provided direct feedback to him about problems/concerns. 



[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-20 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003406#comment-14003406
 ] 

Benedict commented on CASSANDRA-4718:
-

One thing worth mentioning is that the size of the dataset over which this is 
effective is not necessarily represented accurately by the test, as it was run 
over a fully-compacted dataset, so the 10M keys would have been randomly 
distributed across all pages (we select from a prefix of the key range, but the 
murmur hash will get evenly distributed across the entire dataset once fully 
compacted). Were this run on a real dataset, with the most recent data being 
compacted separately to the older data, and the most recent data being hit 
primarily, there would be greater locality of data access and so any gains 
should be effective over a larger quantity of data.



[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-20 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003213#comment-14003213
 ] 

Jonathan Ellis commented on CASSANDRA-4718:
---

This looks exactly like I would have predicted: the more disk-bound the 
workload is, the less the ExecutorService matters.  But when our hot dataset is 
cacheable (10M and, to a lesser degree, 100M), -sep is a clear win.  This is the 
scenario that we ought to be optimizing for.



[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-19 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001621#comment-14001621
 ] 

Jonathan Ellis commented on CASSANDRA-4718:
---

bq. I can patch stress briefly to force it to run all thread counts in the 
requested range, instead of stopping when it hits a plateau

That sounds like a good option to have.

bq. when we did May 15 (which is a completely different test btw, addressing your 
point from previous comment) there was almost no disk activity after original 
page cache warm up

That doesn't sound right to me; all the numbers from May 15 are 7k-14k ops/s, 
which is disk-bound territory.



[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-17 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000839#comment-14000839
 ] 

Benedict commented on CASSANDRA-4718:
-

Compression is disabled by default in stress.



[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-17 Thread Pavel Yaskevich (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000836#comment-14000836
 ] 

Pavel Yaskevich commented on CASSANDRA-4718:


250K op/s was achieved when the test was running with almost no data; total 
duration was around a minute or so. We are, on the other hand, trying to make 
it more realistic in terms of data amount. I'm not sure about the tests from 
May 16, but when we did May 15 (which is a completely different test btw, 
addressing your point from previous comment) there was almost no disk activity 
after original page cache warm up. If you can, please patch the test to do runs 
with all of the threads, and once we re-run I will also check disk activity, 
but I'm pretty sure it would be minimal. Reading from the page cache is not as 
efficient as reading from an anonymous area, plus it does a syscall with 
compression (which is used by default), so I'm not surprised that op/s degraded.



[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-17 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000830#comment-14000830
 ] 

Benedict commented on CASSANDRA-4718:
-

I meant 250Kop/s. We're now pushing 6Kop/s. The numbers from 16th May are the 
latest posted, to my knowledge, and the ones we're discussing?

You can make stress do a fixed number of ops per run, but currently not a fixed
set of thread counts: its auto mode (which these results come from) ramps up
thread counts until it detects a plateau. In these tests it seems that sep
reached a higher throughput rate earlier, so when it normalised down again
stress considered it to have plateaued earlier. As to #2, run1, when truncated
at a lower tc, is as fast as stock is at its peak. However, you're right that
it is possible it would have tanked further. That would be indicative of a bug
rather than a fundamental flaw in its design, but the dip is almost certainly
down to the natural tendency to fall slightly below peak throughput after the
real plateau.

I can patch stress briefly to force it to run all thread counts in the 
requested range, instead of stopping when it hits a plateau, but the auto-mode 
isn't really designed to be a canonical test. If we want accurate like-for-like 
comparisons we want to graph each thread count separately for its whole run, 
and ensure each run is long enough to spot the general behavioural pattern 
(i.e. at least a few minutes for IO bound work). I'd also ensure we interleaved 
the two branches to try to avoid any weird page caching / other utilisation 
interferences.
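
To make the truncation effect concrete, here is a rough sketch of a
ramp-until-plateau loop next to a forced full sweep. It is an assumption-laden
toy, not the actual stress auto-mode logic: the class ThreadRampSketch, the
minGain threshold, the thread counts and both throughput curves are invented
purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntToDoubleFunction;

// Hypothetical model of "ramp thread counts until a plateau is detected".
final class ThreadRampSketch {

    /** Returns the thread counts that were actually measured. */
    static List<Integer> ramp(int[] threadCounts, IntToDoubleFunction measureOpRate,
                              double minGain, boolean stopAtPlateau) {
        List<Integer> measured = new ArrayList<>();
        double best = 0.0;
        for (int tc : threadCounts) {
            double rate = measureOpRate.applyAsDouble(tc);
            measured.add(tc);
            // Plateau: this run failed to beat the best rate so far by minGain.
            if (stopAtPlateau && best > 0.0 && rate < best * (1.0 + minGain)) {
                break;
            }
            best = Math.max(best, rate);
        }
        return measured;
    }

    public static void main(String[] args) {
        int[] counts = {4, 8, 16, 24, 36, 54, 81, 121, 181, 271};
        // Synthetic curves: both cap at 6000 op/s, but "sep" gets there sooner.
        IntToDoubleFunction stock = tc -> Math.min(tc * 100.0, 6000.0);
        IntToDoubleFunction sep = tc -> Math.min(tc * 150.0, 6000.0);

        System.out.println("stock, auto:  " + ramp(counts, stock, 0.05, true));
        System.out.println("sep, auto:    " + ramp(counts, sep, 0.05, true));
        System.out.println("sep, forced:  " + ramp(counts, sep, 0.05, false));
    }
}
```

In this toy model the branch that peaks earlier is measured at fewer thread
counts, so the two result sets do not line up; forcing the full sweep gives
like-for-like points at every thread count.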



[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-17 Thread Pavel Yaskevich (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000816#comment-14000816
 ] 

Pavel Yaskevich commented on CASSANDRA-4718:


I think you have plotted the numbers from May 16. I'm not sure what you mean by
"often"; the problem with the numbers is that they are apparently cut off for
both branches :( I think we have to redo the test. [~jasobrown], is there a way
to guarantee that both branches do the same number of runs with stress? I
disagree with #2 because sep shows a sudden drop at the end, as do runs 1 and
3, so we don't really know what is going to happen with sep at high stress
concurrency in those runs.

bq. This work is clearly disk bound as the same hardware was pushing 250k/s 
with similar record sizes when exclusively in memory - we're seeing only 5% of 
that now. Unless possibly in-memory index scans are occupying all of the time 
(but according to Jason CPU utilisation was around 30% from a random 
non-scientific poll).

I'm not sure if we can count 250K/s as a disk bound workload, which is only 3 
buffer reads per second. 



[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-17 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000743#comment-14000743
 ] 

Benedict commented on CASSANDRA-4718:
-

But like I said, the sep branch was actually faster more often than it was 
slower? And yes it routes intelligently, but to both replicas...?



[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-17 Thread Pavel Yaskevich (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000739#comment-14000739
 ] 

Pavel Yaskevich commented on CASSANDRA-4718:


[~benedict] Isn't CQL trying to be smart about request routing, like Thrift?
Anyhow, read concurrency was the default 128 (can you confirm, [~jasobrown]?),
and we remove all of the files, drop the cache, restart, and disable compaction
to remove jitter as much as possible, so the only difference between the two
runs is that one has the sep patch and the other doesn't. If there are slow
reads, they should be happening in both runs, because keys are read uniformly.
Although there is a large number of sstables in the system, for every read
there is only one hit, which pretty much simulates the behavior of systems
where data accumulates over time. Also, I can tell you that the I/O setup we
have is able to handle a mere 300GB without a problem.



[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-17 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000727#comment-14000727
 ] 

Benedict commented on CASSANDRA-4718:
-

Whilst on this topic, I've been thinking about disk/memory testing protocols in 
general, and it seems we really need to think through a good strategy for 
creating a consistent test bed that is representative. The test I have asked 
[~enigmacurry] to run is not going to be fair, as major compaction will cause 
all of the data points to be randomly distributed (by hash) across a single 
sstable, and given the records are small, selecting from a smallish random 
subset of this data will pretty much necessarily involve touching every page on 
disk with equal probability. However, disabling compaction entirely is equally
unfair, as bloom filter false positives leave many sstables to check (at a
finger-in-the-air estimate there are 800+ sstables in Jason's test, for 300GB
of data). Most of the cache will therefore go to index files, and almost every
data item lookup will probably go to disk, because the reduced memory available
for data produces the same effect as the major compaction.

It seems to me we need to 1) get the exponential distribution to select from 
last keys in preference to first keys (i.e. most recently written most commonly 
accessed); and 2) create a compaction strategy for testing purposes, that is 
designed to create a sort-of "in flight snapshot" of a real STCS workload, by 
compacting older data into exponentially larger files. These two together 
should give us much closer to a real live system that is using STCS, and with a 
consistent reproducible baseline behaviour.
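
As a rough illustration of point 1 only, the sketch below draws key indices
from an exponential distribution anchored at the newest key, so recently
written keys are read most often. The class RecentKeySampler, its parameters
and the chosen mean are hypothetical; this is not how stress currently
generates keys, and it does not attempt the test compaction strategy in point 2.

```java
import java.util.Random;

// Hypothetical sampler: key index keyCount - 1 is the most recently written key.
final class RecentKeySampler {
    private final long keyCount;
    private final double meanDistance;
    private final Random rng;

    RecentKeySampler(long keyCount, double meanDistance, Random rng) {
        this.keyCount = keyCount;
        this.meanDistance = meanDistance;
        this.rng = rng;
    }

    /** Returns a key index in [0, keyCount), biased towards the newest keys. */
    long next() {
        // Inverse-transform sample of an exponential with the given mean.
        double distance = -meanDistance * Math.log(1.0 - rng.nextDouble());
        // Clamp so that very large samples fall on the oldest key.
        long fromNewest = Math.min((long) distance, keyCount - 1);
        return keyCount - 1 - fromNewest;
    }

    public static void main(String[] args) {
        long keyCount = 1_000_000L;
        RecentKeySampler sampler = new RecentKeySampler(keyCount, 50_000.0, new Random(42));
        int samples = 10_000;
        int inNewestTenth = 0;
        for (int i = 0; i < samples; i++) {
            if (sampler.next() >= keyCount - keyCount / 10) {
                inNewestTenth++;
            }
        }
        System.out.println(inNewestTenth + " of " + samples + " reads hit the newest 10% of keys");
    }
}
```

With a mean distance of 5% of the key space, roughly 1 - e^-2 (about 86%) of
reads should land in the newest 10% of keys, which concentrates reads on the
most recently written data.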



[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-17 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000724#comment-14000724
 ] 

Benedict commented on CASSANDRA-4718:
-

[~xedin] why are you only counting the primary replica data? Requests will hit
both replicas by default? If you look at the results there is a reasonable
amount of variability in both runs, so it's not clear that one is slower or
faster: there are a number of points where 4718-sep is faster than 2.1, and
vice versa, and given it is disk bound I am inclined to suggest it is not the
patch making it perform worse. In fact, a majority of data points show higher
throughput for 4718-sep, not for 2.1. In your first test, every thread count
below 271 is faster; 271 seems to be a blip due to a small number of very slow
reads affecting the very last measurement (there's a "race" in stress' auto
mode where some measurements are still accepted after it has decided enough
have been taken, as can be seen from the final stderr being above the
acceptability point). 2.1 showed a similar but smaller effect at this tc, so
this seems likely to be random chance. In the last test it is faster for all
thread counts, despite some weird max latencies. It's only in the middle test
that it appears to be marginally slower, and given this test performs
effectively the same amount of work as the first test, I'm not sure this
demonstrates a great deal other than the variability.

It's also worth asking what your max read concurrency is, as I'm surprised to
see thread counts > 180 causing dramatic spikes in latency (on both branches)
when I'd expect them to be saturating the read stage well before then.





[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-16 Thread Pavel Yaskevich (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000562#comment-14000562
 ] 

Pavel Yaskevich commented on CASSANDRA-4718:


bq. Yeah, 200+MB/s sounds pretty disk bound to me. I vote that we move to the 
actual code review; we can certainly make further improvements later.

I think what Jason meant is that when he started doing reads, the system was
pulling a lot of data into memory at first. The ~300GB he loaded was at RF=2,
and we have 128GB of RAM apart from kernel memory on those machines, so
essentially it's ~150GB for the primary replica, which is not much bigger than
the total memory available for the page cache; that pretty much accounts for
the 10% you were talking about. As a summary, we ran two benchmarks: the first
where the amount of data was bigger than the memory available for the page
cache, the second where most of the data fits into memory. In both cases the
sep branch performed worse than cassandra-2.1.



[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-16 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998622#comment-13998622
 ] 

Benedict commented on CASSANDRA-4718:
-

bq. (latest bdplab tests)

Which latest bdplab tests? The longer bdplab test from before (not the latest 
tests) had some issues (unrelated to this ticket) so we didn't get any read 
results, but showed increased write throughput.

The latest tests have all been short runs. I am actually very pleased we are 
_at all_ faster for bdplab on any workload, as the first versions of these 
patches did not seem to benefit older hardware/kernels (we don't have enough 
hardware configurations to say which was the deciding factor), and actually 
incurred a slight penalty. The fact that the gap is very narrow for bdplab is 
not really important, nor are the thrift numbers. In both of those instances I 
am interested only in that we _do not perform any worse_; performing slightly 
better even here is just a bonus.






[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-16 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993683#comment-13993683
 ] 

Benedict commented on CASSANDRA-4718:
-

Hmm. Frustrating - tweaks that made lse-batchnetty faster on my box (by 20-30%) 
make it slower on bdplab. Would be good to get some other numbers from 
different rigs involved to see if we can pin down the sweet spot, and maybe 
figure out what the cause of the discrepancy is.



[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-16 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998772#comment-13998772
 ] 

Benedict commented on CASSANDRA-4718:
-

bq. And 4718-sep is essentially 2.1-batchnetty + the patch for this, right?

Correct



[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-16 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000295#comment-14000295
 ] 

Jonathan Ellis commented on CASSANDRA-4718:
---

Yeah, 200+MB/s sounds pretty disk bound to me.  I vote that we move to the 
actual code review; we can certainly make further improvements later.

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Benedict
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
> aws_read.svg, backpressure-stress.out.txt, baq vs trunk.png, 
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
> jason_write.svg, op costs of various queues.ods, stress op rate with various 
> queues.ods, stress_2014May15.txt, stress_2014May16.txt, v1-stress.out


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-16 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998639#comment-13998639
 ] 

Benedict commented on CASSANDRA-4718:
-

bq. Since we're talking about benchmarks, it shouldn't be too hard to remove 
batching from the equation and check what value remains.

The batching has already been committed to tip, so the 2.1-batchnetty is 
essentially this.

bq. Can we bench with compression maybe?

We probably should bench to get a comparison. It is quite likely the benefit 
will be lost, given we decompress 64K chunks at a time and currently have no 
uncompressed page cache.

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Benedict
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
> aws_read.svg, backpressure-stress.out.txt, baq vs trunk.png, 
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
> jason_write.svg, op costs of various queues.ods, stress op rate with various 
> queues.ods, v1-stress.out


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-16 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998613#comment-13998613
 ] 

Sylvain Lebresne commented on CASSANDRA-4718:
-

bq. currently the only valuable change I see here is batching for Netty

I'll admit I'm getting lost in all the benchmark graphs and what they include. 
We've now committed the batching separately with CASSANDRA-5663. Can someone 
sum up the graphs for "current tip of 2.1 (with CASSANDRA-5663)" versus "the 
same + Benedict's last patch"? Since we're talking about benchmarks, it 
shouldn't be too hard to remove batching from the equation and check what 
value remains.

bq. Not mentioning that with compression every read results in syscall which 
forces thread to get parked anyway

Can we bench with compression maybe? Both to see whether any benefit is indeed 
lost when compression is on, and whether compression generally outperforms 
no compression.



> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Benedict
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
> aws_read.svg, backpressure-stress.out.txt, baq vs trunk.png, 
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
> jason_write.svg, op costs of various queues.ods, stress op rate with various 
> queues.ods, v1-stress.out


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-16 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998750#comment-13998750
 ] 

Sylvain Lebresne commented on CASSANDRA-4718:
-

bq. The batching has already been committed to tip, so the 2.1-batchnetty is 
essentially this.

And 4718-sep is essentially 2.1-batchnetty + the patch for this, right?

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Benedict
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
> aws_read.svg, backpressure-stress.out.txt, baq vs trunk.png, 
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
> jason_write.svg, op costs of various queues.ods, stress op rate with various 
> queues.ods, v1-stress.out


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-16 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998146#comment-13998146
 ] 

Jonathan Ellis commented on CASSANDRA-4718:
---

Granted, a new ExecutorService won't help I/O-bound workloads, but I knew 
that when I created the ticket, and "must be significantly better for all 
workloads" is an unrealistically high bar for optimization work.  This gives us 
a pretty huge benefit on at least some workloads 
([1|http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.4718.bdplab.may12.threads-810-cql3_native_prepared.json&metric=op_rate&operation=4_read&smoothing=1&xmin=0&xmax=141.13&ymin=0&ymax=238843],
 
[2|http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.4718.ec2.may12.threads-810-cql3_native_prepared.json&metric=op_rate&operation=4_read&smoothing=1&xmin=0&xmax=134.31&ymin=0&ymax=340354.3])
 and a smaller benefit on others, which I'm quite happy with.  Unless the 
longer benchmarks Ryan is running show dramatically different results, I'm +1.

I also note that the work here is almost entirely self-contained, with the 
major exception being some new code in Message.Dispatcher.  So while it's not 
as simple as dropping in LTQ or BAQ or FJP, the results are absolutely good 
enough to be worth a new Executor implementation.

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Benedict
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
> aws_read.svg, backpressure-stress.out.txt, baq vs trunk.png, 
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
> jason_write.svg, op costs of various queues.ods, stress op rate with various 
> queues.ods, v1-stress.out


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-15 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993590#comment-13993590
 ] 

Benedict commented on CASSANDRA-4718:
-

FTR, there are new branches: 
[4718-fjp-batchnetty|https://github.com/belliottsmith/cassandra/tree/4718-fjp-batchnetty]
 
[4718-lse-batchnetty|https://github.com/belliottsmith/cassandra/tree/4718-lse-batchnetty]
 
[cassandra-2.1-batchnetty|https://github.com/belliottsmith/cassandra/tree/cassandra-2.1-batchnetty]

These are the three real contenders, and I've included the netty batching for 
all of them so we can get a like-for-like comparison going.

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Jason Brown
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
> backpressure-stress.out.txt, baq vs trunk.png, 
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
> jason_write.svg, op costs of various queues.ods, stress op rate with various 
> queues.ods, v1-stress.out


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-15 Thread Pavel Yaskevich (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997754#comment-13997754
 ] 

Pavel Yaskevich commented on CASSANDRA-4718:


bq. What about writes, that's a pretty big scenario this helps improve 

Ryan's latest numbers are from a write workload.

bq. Well, except that we expect in general for recent data to be accessed most 
often, or data to be accessed according to a zipf distribution, and in both of 
these cases caching helps to keep a significant portion of the data we're 
accessing in memory. Also, more users are getting incredibly performant SSDs 
that can respond to queries in time horizons measured in microseconds, and as 
this becomes the norm the distinction also becomes less important.

I always thought that Zipf's law applies to scientific data, does it not? SSDs 
can be performant, but you can't get their full speed yet; the closest you can 
currently get is a 3.13+ kernel with multiqueue support enabled.

bq. Right, but we've always targetted "total data larger than memory, hot data 
more or less fits." So I absolutely think this ticket is relevant for a lot of 
use cases.

Exactly, "hot data more or less fits", so the problem is that once you get into 
page reclaim and disk reads (even on SSDs), the improvements made here no 
longer do anything helpful. I think that will be clearly visible in the 
benchmarks to come.

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Benedict
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
> aws_read.svg, backpressure-stress.out.txt, baq vs trunk.png, 
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
> jason_write.svg, op costs of various queues.ods, stress op rate with various 
> queues.ods, v1-stress.out


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-15 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997875#comment-13997875
 ] 

Benedict commented on CASSANDRA-4718:
-

bq. I always thought that Zipf's law is for the scientific data, is it not? 

As far as I'm aware, a Zipf-like distribution is considered a good approximation 
for many data access patterns. A quick Google search yields an article showing 
that much web traffic follows a Zipf distribution, but some follows a slightly 
different exponential distribution: 
http://www.cs.gmu.edu/~sqchen/publications/sigmetrics07-poster.pdf
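
To make the caching argument concrete, here is a minimal sketch that computes what share of accesses lands on the most popular keys. It assumes a pure Zipf distribution over key ranks, which real workloads only approximate. The class name, method name, and parameter values are hypothetical and chosen purely for illustration.

```java
// Hypothetical illustration: how skewed is access under a Zipf distribution?
class ZipfHotSetSketch {
    /**
     * Fraction of accesses that hit the `hot` most popular of `n` keys when
     * the key of rank r is accessed with probability proportional to 1 / r^s.
     */
    static double hotFraction(int n, int hot, double s) {
        double hotMass = 0.0;
        double totalMass = 0.0;
        for (int rank = 1; rank <= n; rank++) {
            double weight = 1.0 / Math.pow(rank, s);
            totalMass += weight;
            if (rank <= hot) {
                hotMass += weight;
            }
        }
        return hotMass / totalMass;
    }

    public static void main(String[] args) {
        // Assumed example: one million keys, exponent 1, hottest 10% cached.
        double share = hotFraction(1_000_000, 100_000, 1.0);
        System.out.printf("top 10%% of keys receive %.1f%% of accesses%n", 100 * share);
    }
}
```

Under these assumptions, a cache holding a small fraction of the keys serves the large majority of reads, which is the scenario where executor overhead rather than disk I/O would dominate.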



> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Benedict
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
> aws_read.svg, backpressure-stress.out.txt, baq vs trunk.png, 
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
> jason_write.svg, op costs of various queues.ods, stress op rate with various 
> queues.ods, v1-stress.out


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-15 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997596#comment-13997596
 ] 

Jonathan Ellis commented on CASSANDRA-4718:
---

bq. most of the use cases is exactly that - data set which exceeds available 
memory

Right, but we've always targeted "total data larger than memory, hot data more 
or less fits."  So I absolutely think this ticket is relevant for a lot of use 
cases.

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Benedict
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
> aws_read.svg, backpressure-stress.out.txt, baq vs trunk.png, 
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
> jason_write.svg, op costs of various queues.ods, stress op rate with various 
> queues.ods, v1-stress.out


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-15 Thread T Jake Luciani (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997511#comment-13997511
 ] 

T Jake Luciani commented on CASSANDRA-4718:
---

bq. so I am not really sure if it worth it to commit all this code without any 
perf improvement for most of the usage scenarios.

What about writes? That's a pretty big scenario this helps improve :)

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Benedict
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
> aws_read.svg, backpressure-stress.out.txt, baq vs trunk.png, 
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
> jason_write.svg, op costs of various queues.ods, stress op rate with various 
> queues.ods, v1-stress.out


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-15 Thread Ryan McGuire (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993670#comment-13993670
 ] 

Ryan McGuire commented on CASSANDRA-4718:
-

Benchmarks from bdplab for the new branches. 3 nodes, separate stress host.

 * [810 
threads|http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.4718.trial2.3node.threads-810.log]
 * [270 
threads|http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.4718.trial2.3node.threads-270.log]

(updating here as the tests finish...)

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Jason Brown
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
> backpressure-stress.out.txt, baq vs trunk.png, 
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
> jason_write.svg, op costs of various queues.ods, stress op rate with various 
> queues.ods, v1-stress.out


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-15 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996589#comment-13996589
 ] 

Jason Brown commented on CASSANDRA-4718:


Also, it looks like the period of time that you are running the tests for is 
very short (about 1 or 2 minutes). Can you let it run for *at least* 30 minutes 
or so (if not an hour or more), so we can see the burn in? Everything can look 
rosy in a 90 second test, but fall apart spectacularly under (closer to) real 
world conditions.

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Benedict
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
> backpressure-stress.out.txt, baq vs trunk.png, 
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
> jason_write.svg, op costs of various queues.ods, stress op rate with various 
> queues.ods, v1-stress.out


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-15 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998310#comment-13998310
 ] 

Jason Brown commented on CASSANDRA-4718:


[~enigmacurry] How many threads are you running thrift with? If you aren't 
setting it explicitly, (iirc) it gets set to the number of processors, which is 
far below what anything sane should run with. For our machines, I've been using 
512 for writes, and 128 for reads (mirroring what we run with in prod, which is 
the same hardware as the machines I'm testing on, more or less). I think this 
may explain why we do not see the vast discrepancy between thrift and native 
protocol ops/second - native protocol defaults to 128 threads.

Also, are you using sync or hsha for thrift? 



> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Benedict
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
> aws_read.svg, backpressure-stress.out.txt, baq vs trunk.png, 
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
> jason_write.svg, op costs of various queues.ods, stress op rate with various 
> queues.ods, v1-stress.out


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-15 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997999#comment-13997999
 ] 

Benedict commented on CASSANDRA-4718:
-

Thanks [~enigmacurry]!

Those graphs all look pretty good to me. Think it's time to run some of the 
longer tests to see that performance is still good for other workloads. Let's 
drop thrift from the equation now.

I'd suggest something like 

write n=6 -key populate=1..6 
force major compaction
for each thread count/branch:
 read n=1 -key dist=extr(1..6,2)
and warm up with one (any) read test run before the rest, so that they all are 
playing from a roughly level page cache point

This should create a dataset in the region of 110Gb, but around 75% of requests 
will be to ~40Gb of it, which should be in the region of the amount of page 
cache available to the EC2 systems after bloom filters etc. are accounted for

NB: if you want to play with different distributions, cassandra-stress print 
lets you see what a spec would yield

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Benedict
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
> aws_read.svg, backpressure-stress.out.txt, baq vs trunk.png, 
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
> jason_write.svg, op costs of various queues.ods, stress op rate with various 
> queues.ods, v1-stress.out


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-15 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996074#comment-13996074
 ] 

Jason Brown commented on CASSANDRA-4718:


[~enigmacurry] Also, is it possible for you to fill up the disks with more 
sstables than available memory? I think we should check how going to disk 
plays into the performance mix, rather than just reading from page cache for 
the entire read test. This should introduce another modality into the way the 
algorithm behaves, one that is probably more realistic to the real world (a mix 
of page cache hits and disk seeks).

[~benedict] This rewrite is quite extensive wrt prior branches. As this code 
is quite complex with many new additions, I will need a good chunk of time 
tomorrow to review this. 

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Jason Brown
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
> backpressure-stress.out.txt, baq vs trunk.png, 
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
> jason_write.svg, op costs of various queues.ods, stress op rate with various 
> queues.ods, v1-stress.out


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-15 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997787#comment-13997787
 ] 

Jonathan Ellis commented on CASSANDRA-4718:
---

bq. Exactly, "hot data more or less fits" so the problem is that once you get 
into page reclaim and disk reads (even SSDs), improvements made here are 
no longer doing anything helpful

I don't follow you at all.  If 90% of reads are already in-cache, this is going 
to help even if 10% are going to disk.

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Benedict
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
> aws_read.svg, backpressure-stress.out.txt, baq vs trunk.png, 
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
> jason_write.svg, op costs of various queues.ods, stress op rate with various 
> queues.ods, v1-stress.out


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-14 Thread Pavel Yaskevich (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998266#comment-13998266
 ] 

Pavel Yaskevich commented on CASSANDRA-4718:


What I'm saying is that so far it gives a benefit only for a really small 
working set, e.g. a stress run that lasts 1 minute. For the longer-running 
tests there is very little to no difference (latest bdplab tests), so we are 
running longer tests right now in parallel with Ryan. Currently the only 
valuable change I see here is batching for Netty.

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Benedict
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
> aws_read.svg, backpressure-stress.out.txt, baq vs trunk.png, 
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
> jason_write.svg, op costs of various queues.ods, stress op rate with various 
> queues.ods, v1-stress.out


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-14 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997406#comment-13997406
 ] 

Benedict commented on CASSANDRA-4718:
-

 bq. Benedict WRT to the 2.1-batchnetty comparison, what did the latencies look 
like?

Ryan's graphs are a much better way to view latencies; on the whole they seem 
universally as good or better (generally much better)

bq. The more latency is introduced to the tasks, the less effect spinning would 
have; in other words, the need to spin is eliminated

Yes. Even more important, the unpark() cost, when amortized over a long-running 
operation, becomes insignificant regardless of whether it is incurred. And 
because the native-transport queue is full, producers cannot make forward 
progress anyway, so avoiding the unpark cost on the network thread doesn't 
gain us anything. I fully expect there to be very little effect on workloads as 
the dataset exceeds memory and the row size climbs.

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Benedict
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
> aws_read.svg, backpressure-stress.out.txt, baq vs trunk.png, 
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
> jason_write.svg, op costs of various queues.ods, stress op rate with various 
> queues.ods, v1-stress.out


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-14 Thread Ryan McGuire (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997809#comment-13997809
 ] 

Ryan McGuire commented on CASSANDRA-4718:
-

@Benedict more "short" tests. Updated here as they complete:

EC2 c3.8xlarge, cql native:

 * [810 threads|http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.4718.ec2.may12.threads-810-cql3_native_prepared.json]

bdplab, cql native:

 * [810 threads|http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.4718.bdplab.may12.threads-810-cql3_native_prepared.json]


> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Benedict
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
> aws_read.svg, backpressure-stress.out.txt, baq vs trunk.png, 
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
> jason_write.svg, op costs of various queues.ods, stress op rate with various 
> queues.ods, v1-stress.out


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-14 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997731#comment-13997731
 ] 

Benedict commented on CASSANDRA-4718:
-

A brief outline of the approach taken by the executor service I've submitted:

It's premised on the idea that unpark() is a relatively expensive operation, 
and can block progress on the thread calling it (often it results in transfer 
of the physical execution to the signalled thread). So we want to avoid 
performing the operation as much as possible, so long as we do not incur any 
other penalties as a result of doing so.

The approach I've taken to avoiding calling unpark() essentially amounts to 
trying to ensure the correct number of threads are running for servicing the 
current workload, without either delay of service or any waiting on any of the 
workers. We achieve this by essentially letting workers schedule themselves, 
except when we cannot guarantee they will do so on producing work for the queue 
(in which rare instance we spin up a worker directly) or the queue is full, in 
which case it costs us little to contribute to firing up workers. This can be 
roughly described as:

# If all workers are currently either sleeping _indefinitely_ or occupied with 
work, we wake one (or start a new) worker
# Before starting any given task, a worker checks if any more work is available 
on the queue it's processing and tries to hand it off to another unoccupied 
worker (preferring those that are scheduled to wake up of their own accord in 
the near future, to avoid signalling it, but waking/starting one if necessary)
# Once we finish a task, we either:
#* take another task from the queue we just processed, if any available, and 
loop back to (2); 
#* reassign ourselves to another executor that has work and go to (2); 
#* finally, if that fails, we enter a "yield"-spin loop
# Each loop we spin for, we sleep a random interval scaled by the number of 
threads in this loop, so that the rate of wakeup on average is constant 
regardless of the number of spinning threads. When we wake up we:
#* Check if we should deschedule ourselves (based on the total time spent 
sleeping by all threads recently - if it exceeds the real time elapsed, we put 
a worker to sleep indefinitely, preferably ourselves)
#* Try to assign ourselves an executor with work outstanding, and go to (2)
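
The randomised sleep in step 4 can be sketched roughly as follows. This is only an illustration of the "constant average wakeup rate" property, not the submitted code: the class name, the TARGET_WAKE_INTERVAL_NANOS constant and the hasWork callback are all hypothetical, and the descheduling check is omitted.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;
import java.util.function.BooleanSupplier;

// Hypothetical sketch of the spin/sleep loop; names and constants are assumptions.
final class SpinBackoffSketch {
    // Desired mean gap between wakeups across ALL spinning threads (made-up value).
    static final long TARGET_WAKE_INTERVAL_NANOS = 10_000;
    static final AtomicInteger spinning = new AtomicInteger();

    /**
     * Uniform random sleep whose mean grows linearly with the number of
     * spinners, so the pool-wide wakeup rate stays roughly constant.
     */
    static long nextSleepNanos(int spinners) {
        long mean = TARGET_WAKE_INTERVAL_NANOS * Math.max(1, spinners);
        return ThreadLocalRandom.current().nextLong(1, 2 * mean);
    }

    /** Spins for at most maxLoops sleeps; returns true as soon as work is seen. */
    static boolean spinUntilWork(BooleanSupplier hasWork, int maxLoops) {
        spinning.incrementAndGet();
        try {
            for (int i = 0; i < maxLoops; i++) {
                if (hasWork.getAsBoolean()) {
                    return true;
                }
                // Timed park: the thread wakes itself, so nobody pays for unpark().
                LockSupport.parkNanos(nextSleepNanos(spinning.get()));
            }
            return false;
        } finally {
            spinning.decrementAndGet();
        }
    }
}
```

With N spinners each sleeping about N times the target interval on average, roughly one thread wakes per target interval regardless of N.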

The actual assignment and queueing of work is itself a little interesting as 
well: to minimise signalling we have a ConcurrentLinkedQueue which is, by 
definition, unbounded. We then have a separate synchronisation state which 
maintains an atomic count of work permits (threads working the pool) and task 
permits (items on the queue). When we start a worker as a _producer_ we 
actually don't touch this queue at all, we just start a worker in a spinning 
state and let it assign itself some work. We do this to avoid signalling any 
other producers that may be blocked on the queue being full. When, as a worker, 
we take work from the queue to assign either to ourselves _or another worker_, 
we always atomically take both a work permit and a task permit (or only the 
latter if we already own a work permit). This allows us to ensure we only wake 
up threads when they definitely have work to do.
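
One way to keep both counts in a single atomic state is to pack them into one long and update it with compare-and-set. The sketch below assumes that layout (task permits in the high 32 bits, work permits in the low 32 bits); the class and method names are hypothetical and the branch may organise this differently.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: two permit counters packed into one AtomicLong.
final class PermitsSketch {
    private final AtomicLong state;

    PermitsSketch(int taskPermits, int workPermits) {
        state = new AtomicLong(pack(taskPermits, workPermits));
    }

    private static long pack(int tasks, int workers) {
        return ((long) tasks << 32) | (workers & 0xFFFFFFFFL);
    }

    int taskPermits() { return (int) (state.get() >>> 32); }
    int workPermits() { return (int) state.get(); }

    /** A producer has queued one more item. */
    void addTask() { state.addAndGet(1L << 32); }

    /** A worker has stopped working the pool. */
    void returnWorkPermit() { state.incrementAndGet(); }

    /** Takes one task permit AND one work permit in a single CAS, or neither. */
    boolean takeTaskAndWorkPermit() {
        while (true) {
            long cur = state.get();
            int tasks = (int) (cur >>> 32);
            int workers = (int) cur;
            if (tasks == 0 || workers == 0) {
                return false; // never wake a thread without guaranteed work
            }
            if (state.compareAndSet(cur, pack(tasks - 1, workers - 1))) {
                return true;
            }
        }
    }

    /** For a worker that already holds a work permit: take a task permit only. */
    boolean takeTaskPermit() {
        while (true) {
            long cur = state.get();
            int tasks = (int) (cur >>> 32);
            if (tasks == 0) {
                return false;
            }
            if (state.compareAndSet(cur, pack(tasks - 1, (int) cur))) {
                return true;
            }
        }
    }
}
```

Because both counters change in one CAS, a thread is only ever woken when a task permit has already been reserved for it.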



> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Benedict
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
> aws_read.svg, backpressure-stress.out.txt, baq vs trunk.png, 
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
> jason_write.svg, op costs of various queues.ods, stress op rate with various 
> queues.ods, v1-stress.out

[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-14 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997515#comment-13997515
 ] 

Benedict commented on CASSANDRA-4718:
-

bq. Also most of the use cases are exactly that - data sets which exceed 
available memory

Well, except that we expect in general for recent data to be accessed most 
often, or data to be accessed according to a zipf distribution, and in both of 
these cases caching helps to keep a significant portion of the data we're 
accessing in memory. Also, more users are getting incredibly performant SSDs 
that can respond to queries in time horizons measured in microseconds, and as 
this becomes the norm the distinction also becomes less important.

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Benedict
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
> aws_read.svg, backpressure-stress.out.txt, baq vs trunk.png, 
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
> jason_write.svg, op costs of various queues.ods, stress op rate with various 
> queues.ods, v1-stress.out


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-14 Thread Lior Golan (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997497#comment-13997497
 ] 

Lior Golan commented on CASSANDRA-4718:
---

But there are use cases where the full working set is memory resident or close 
to that. Improving performance in these use cases would reduce the need for 
caching in front of Cassandra

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Benedict
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
> aws_read.svg, backpressure-stress.out.txt, baq vs trunk.png, 
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
> jason_write.svg, op costs of various queues.ods, stress op rate with various 
> queues.ods, v1-stress.out


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-14 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997529#comment-13997529
 ] 

Benedict commented on CASSANDRA-4718:
-

I've force-pushed an updated branch to the repository which is simpler and has 
some nicer properties (though is ~1% slower at lower thread counts). I'm pretty 
happy with its current state, although I still need to create some thorough 
executor stress tests.

In the latest version workers self-coordinate descheduling through a simpler 
scheme than the separate descheduler. The new scheme also permits thread 
over-provisioning to be corrected incredibly promptly (i.e. almost instantly), 
eliminating my one concern about this approach (that it could be slightly 
resource unfriendly in cases of variable workloads when sharing the underlying 
platform with another service).

The latest version also delivers more consistency in its throughput rate by 
using a ConcurrentSkipListMap to order the spinning threads in order of 
expected schedule time, and by force-scheduling a new worker if a producer 
encounters a full task queue when not all workers are yet scheduled.
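
For readers unfamiliar with the data structure, the ordering of spinning threads by expected schedule time could look roughly like this. It is a sketch under assumptions: the registry class, its method names, and the use of a String as a stand-in for a worker are all hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical sketch: spinning workers keyed by expected wake time (nanos).
final class SpinningRegistrySketch {
    private final ConcurrentSkipListMap<Long, String> spinning =
            new ConcurrentSkipListMap<>();

    /** Registers a worker under its wake time, nudging the key on collision. */
    void register(long wakeAtNanos, String worker) {
        while (spinning.putIfAbsent(wakeAtNanos, worker) != null) {
            wakeAtNanos++; // keys must be unique; a 1ns shift is harmless here
        }
    }

    /** Removes and returns the worker due to wake soonest, or null if none. */
    String pollSoonest() {
        Map.Entry<Long, String> first = spinning.pollFirstEntry();
        return first == null ? null : first.getValue();
    }
}
```

Handing work to the entry with the smallest key favours the thread that was about to wake anyway, which fits the goal of avoiding explicit signalling.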

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Benedict
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
> aws_read.svg, backpressure-stress.out.txt, baq vs trunk.png, 
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
> jason_write.svg, op costs of various queues.ods, stress op rate with various 
> queues.ods, v1-stress.out


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-14 Thread Pavel Yaskevich (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997436#comment-13997436
 ] 

Pavel Yaskevich commented on CASSANDRA-4718:


bq. Yes, also even more important is that the unpark() cost, when amortized 
over a long running operation, becomes insignificant regardless of if it is 
incurred; and producers cannot make forward progress anyway because the 
native-transport queue is full so avoiding paying the unpark cost on the 
network thread really doesn't achieve us anything. I fully expect there to be 
very little effect on workloads as the dataset exceeds memory and the row size 
climbs.

That is exactly my point, which starts right after the bq you have taken. Also, 
most use cases are exactly that - a data set which exceeds available memory - so 
I am not really sure it is worth committing all this code without any perf 
improvement for most usage scenarios.

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Benedict
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
> aws_read.svg, backpressure-stress.out.txt, baq vs trunk.png, 
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
> jason_write.svg, op costs of various queues.ods, stress op rate with various 
> queues.ods, v1-stress.out


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-13 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997316#comment-13997316
 ] 

Jason Brown commented on CASSANDRA-4718:


[~benedict] WRT the 2.1-batchnetty comparison, what did the latencies look 
like?

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Benedict
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
> aws_read.svg, backpressure-stress.out.txt, baq vs trunk.png, 
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
> jason_write.svg, op costs of various queues.ods, stress op rate with various 
> queues.ods, v1-stress.out


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-13 Thread Pavel Yaskevich (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997128#comment-13997128
 ] 

Pavel Yaskevich commented on CASSANDRA-4718:


The graphs Ryan posted in his previous comment (especially the bdplab tests) 
look pretty close to what I would expect without running the tests, just from 
looking at the 4718-sep code. The more latency is introduced into the tasks, the 
less effect spinning has; in other words, the need to spin is eliminated, 
because the longer execution takes, the more likely it is that the next task is 
already waiting in the queue, which makes thread parking/unparking no longer a 
dominant factor in latencies. So I would be very interested to see even 
longer-running tests (especially reads), because that is much closer to real 
behavior, which is dominated by network/disk latencies.

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Benedict
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
> aws_read.svg, backpressure-stress.out.txt, baq vs trunk.png, 
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
> jason_write.svg, op costs of various queues.ods, stress op rate with various 
> queues.ods, v1-stress.out


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-13 Thread Ryan McGuire (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996772#comment-13996772
 ] 

Ryan McGuire commented on CASSANDRA-4718:
-

I scaled the test up by a factor of 10. I'll update here as the tests complete:

5 c3.8xlarge EC2 cluster:

 * [810 threads|http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.4718.200M.EC2.threads-810.json]
 - cassandra-2.1 timed out in this test. I'll investigate it, but it wasn't 
one of the branches you asked for anyway.

 

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Benedict
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
> backpressure-stress.out.txt, baq vs trunk.png, 
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
> jason_write.svg, op costs of various queues.ods, stress op rate with various 
> queues.ods, v1-stress.out


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-13 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996817#comment-13996817
 ] 

Benedict commented on CASSANDRA-4718:
-

bq. Can you let it run for at least 30 minutes or so

Let's hold off on that until we have some comparison numbers. Agreed it's a 
good idea, but I just want to get some idea of behaviour first.

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Benedict
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
> backpressure-stress.out.txt, baq vs trunk.png, 
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
> jason_write.svg, op costs of various queues.ods, stress op rate with various 
> queues.ods, v1-stress.out


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-13 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996388#comment-13996388
 ] 

Benedict commented on CASSANDRA-4718:
-

[~jasobrown] I've updated the repository with a number of minor 
tweaks/refactors, and improved comments. Let me know if there's anything still 
unclear.

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Jason Brown
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
> backpressure-stress.out.txt, baq vs trunk.png, 
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
> jason_write.svg, op costs of various queues.ods, stress op rate with various 
> queues.ods, v1-stress.out


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-13 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996188#comment-13996188
 ] 

Benedict commented on CASSANDRA-4718:
-

bq. Also, is it possible for you to fill up the disks with more sstables than 
available memory?

+1 ("regular out-of-memory workload" was a dreadfully worded attempt to express 
this)

bq. As this branch is quite complex with many new additions, I will need a good 
chunk of time tomorrow to review this.

Let me know if there's anything specific that needs explaining. I will be 
commenting it before breakfast your time.

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Jason Brown
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
> backpressure-stress.out.txt, baq vs trunk.png, 
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
> jason_write.svg, op costs of various queues.ods, stress op rate with various 
> queues.ods, v1-stress.out


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-12 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995763#comment-13995763
 ] 

Benedict commented on CASSANDRA-4718:
-

I have uploaded a complete rewrite 
[here|https://github.com/belliottsmith/cassandra/tree/4718-sep]

In my tests this is another 10%+ faster than lse-batchnetty, making it roughly 
on par with 2.1-batchnetty on our old hardware, and thoroughly outstripping it 
on EC2 and my laptop. I still need to introduce some thorough tests for it and 
comment the code thoroughly, but the basic principle is the same, except that 
all executors share the same pool of worker threads, so that scheduling is 
easier and work can be passed more easily between them.
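
To make the shared-pool idea concrete, here is a small sketch in which several logical executors ("stages") each keep their own task queue and concurrency cap but borrow threads from one common pool. `SharedStage` and its fields are hypothetical names for illustration; the 4718-sep branch is considerably more involved than this, so treat it as a sketch of the principle only.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: one logical stage that borrows threads from a shared pool.
final class SharedStage implements Executor {
    private final Executor sharedWorkers;  // thread pool shared by every stage
    private final int maxConcurrency;      // most threads this stage may occupy at once
    private final Queue<Runnable> tasks = new ConcurrentLinkedQueue<>();
    private final AtomicInteger active = new AtomicInteger();

    SharedStage(Executor sharedWorkers, int maxConcurrency) {
        this.sharedWorkers = sharedWorkers;
        this.maxConcurrency = maxConcurrency;
    }

    @Override
    public void execute(Runnable task) {
        tasks.add(task);
        maybeStartWorker();
    }

    // Claims a slot and borrows a shared thread if there is work and the cap allows it.
    private void maybeStartWorker() {
        while (true) {
            int n = active.get();
            if (n >= maxConcurrency || tasks.isEmpty()) {
                return;
            }
            if (active.compareAndSet(n, n + 1)) {
                sharedWorkers.execute(this::drain);
                return;
            }
        }
    }

    // Runs this stage's tasks until its queue is empty, then releases the slot.
    private void drain() {
        try {
            Runnable task;
            while ((task = tasks.poll()) != null) {
                task.run();
            }
        } finally {
            active.decrementAndGet();
            maybeStartWorker(); // pick up a task that was queued while we were exiting
        }
    }
}
```

For example, a read stage and a write stage could both be built over the same `Executors.newFixedThreadPool(n)`; a thread that finishes draining one stage returns to the pool and is immediately available to the other.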

I will revisit this work again sometime in the next year to see if we can 
squeeze anything more out of this, especially as we add more optimisations 
elsewhere - but for now we're reaching diminishing returns.

[~enigmacurry] can you run a comparison of this and just 
cassandra-2.1-batchnetty on bdplab and EC2, so we can get a final comparison? 
[~jasobrown] if you feel like kicking off a run of this latest branch on your 
hardware, so we have as many final data points as possible to compare against, 
that would also be really helpful. I'll get the code commented early tomorrow 
so we can get this reviewed ASAP.

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Jason Brown
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
> backpressure-stress.out.txt, baq vs trunk.png, 
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
> jason_write.svg, op costs of various queues.ods, stress op rate with various 
> queues.ods, v1-stress.out


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-12 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995766#comment-13995766
 ] 

Benedict commented on CASSANDRA-4718:
-

[~enigmacurry] it would be great to see equivalent runs for a regular 
out-of-memory workload as well, just to make sure there aren't any weird 
results. Thanks!

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Jason Brown
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
> backpressure-stress.out.txt, baq vs trunk.png, 
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
> jason_write.svg, op costs of various queues.ods, stress op rate with various 
> queues.ods, v1-stress.out


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-11 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993059#comment-13993059
 ] 

Benedict commented on CASSANDRA-4718:
-

I have a few branches to test out, and I want to test them out on a variety of 
hardware. [~enigmacurry] can you run them on our internal multi-cpu boxes, and 
an AWS c3.8xlarge 4-node cluster, to the following spec:

For each branch run: 20M inserts over 1M unique keys with 30, 90, 270 and 810 
threads, then wipe each cluster and perform a single 1M key insert, and then 
run 20M reads over 1M unique keys with the same thread counts. All told that 
should take around 3hrs for -mode cql3 native prepared; I'd then like to repeat 
the tests for -mode thrift smart.

The branches are: 
[https://github.com/belliottsmith/cassandra/tree/4718-lse]
[https://github.com/belliottsmith/cassandra/tree/4718-lse-batchnetty]
[https://github.com/belliottsmith/cassandra/tree/4718-fjp]
[https://github.com/belliottsmith/cassandra/tree/4718-lowsignal]
[https://github.com/belliottsmith/cassandra/tree/cassandra-2.1]

Make sure you use my cassandra-2.1 so we're testing like-to-like (they're all 
rebased to the same version).

I'll elaborate on the contents of these branches later, but suffice it to say 
the 4718-lse branch contains a new executor which attempts to reduce signalling 
costs to near zero by scheduling the correct number of threads to deal with the 
level of throughput the executor has been dealing with over the previous 
(short) adjustment window. -batchnetty includes some simple batching of netty 
messages. 4718-lowsignal is an enhanced version of the patch I uploaded 
previously to this ticket, and 4718-fjp is largely unchanged.

On my own box, and on our Austin test cluster, I see -lse faster than both -fjp 
and -lowsignal. On the Austin cluster (which is a not-super-modern 4-cpu 
setup without hyperthreading), however, I see both of them slower than stock 
2.1, though -lse is only slightly slower, whereas -fjp is around 30% slower. 
I'll post polished numbers a little later.

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Jason Brown
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, 
> backpressure-stress.out.txt, baq vs trunk.png, op costs of various 
> queues.ods, stress op rate with various queues.ods, v1-stress.out


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-11 Thread Ryan McGuire (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994679#comment-13994679
 ] 

Ryan McGuire commented on CASSANDRA-4718:
-

@benedict, fwiw here are the EC2 benchmarks:

 * [810 threads|http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.4718.ec2-4node.threads-810.log]
 * [270 threads|http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.4718.ec2-4node.threads-270.log]
 * [90 threads|http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.4718.ec2-4node.threads-90.log]


> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Jason Brown
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
> backpressure-stress.out.txt, baq vs trunk.png, 
> belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
> jason_write.svg, op costs of various queues.ods, stress op rate with various 
> queues.ods, v1-stress.out


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-11 Thread Ryan McGuire (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993374#comment-13993374
 ] 

Ryan McGuire commented on CASSANDRA-4718:
-

The results above are from a 4-node cluster, one node of which also hosted 
stress. Here's a 3-node variant, with stress on a separate host:

 * [30 
threads|http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.4718.3node.threads-30.log]
 * [90 
threads|http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.4718.3node.threads-90.log]
 * [270 
threads|http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.4718.3node.threads-270.log]
 * [810 
threads|http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.4718.3node.threads-810.log]
  



[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-01 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986626#comment-13986626
 ] 

Benedict commented on CASSANDRA-4718:
-

For comparison, a graph of Jason's results: 
https://docs.google.com/spreadsheets/d/1mLxyY9syaAlDb1ALGQ-oF7Qo0tQffbcNgFMVPktde88/edit?usp=sharing

I'd like to do a few things here: 
# Tweak the lowsignal patch to signal more intelligently, rather than always 
aggregating the last 5us of requests
# Try increasing the queue length
# Run these tests against a standardized load. The stress functionality we're 
using is great for giving a ballpark idea of performance, but it varies the 
number of ops with each run, so running with a fixed 10M ops per run might be 
useful (stress could maybe do with an "ops per thread" option, as 10M ops is a 
lot of work at low thread counts but not very much at high counts)

The lowsignal patch looks to outperform at certain thresholds but underperform 
at others, and I'm hoping items 1 and 2 might help us make it better overall. 
At high thread counts the difference is almost 20% for writes, which is 
non-trivial.



[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-04-30 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986065#comment-13986065
 ] 

Benedict commented on CASSANDRA-4718:
-

Yes, typo - corrected. Yeah, just straight-up read or write requests (I can 
push the same for both, but they tank for writes when we start hitting 
compaction/flush etc.). I'm being as conservative as possible and assuming 
every core is spending every moment working on one of the ops (in reality it's 
more like half of that time). I don't have CASSANDRA-7061 yet, so I have no 
really accurate numbers to play with.

As regards costs for unpark(), I've timed them in the past and that's in the 
ballpark of what you'd expect given the literature and general OS behaviour 
(10us is probably a bit heavier than they often clock in, but a good figure to 
work with).



[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-04-30 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986060#comment-13986060
 ] 

Jason Brown commented on CASSANDRA-4718:


bq. translates to around 60us/up

Did you mean 60us/*op* ? Also, are these just requests from (new) cassandra 
stress? You've described the time to process each 'tiny message' but not what a 
tiny message is :). How are you measuring the time for each request (more for 
my own curiosity)? 




[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-04-30 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986048#comment-13986048
 ] 

Benedict commented on CASSANDRA-4718:
-

Sure. My box can push around 60Kop/s - this translates to around 60us/op (core 
time). When unpark() clocks in at around 10us, you want to avoid it.
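
As a rough check on that arithmetic, here is a minimal sketch. The comment does 
not state the core count; 4 is assumed here only because it makes 60K op/s land 
near the quoted per-op figure, and the class and method names are illustrative.

```java
// Back-of-the-envelope check of the figures quoted above.
// ASSUMPTION: 4 cores, all fully busy; the comment does not state the core count.
final class CoreTimePerOp {
    /** Core time consumed per operation, in microseconds, if every core is always busy. */
    static double coreMicrosPerOp(double opsPerSecond, int cores) {
        return cores * 1_000_000.0 / opsPerSecond;
    }

    /** Fraction of the per-op core budget that a single unpark() call would consume. */
    static double unparkShare(double unparkMicros, double coreMicrosPerOp) {
        return unparkMicros / coreMicrosPerOp;
    }

    public static void main(String[] args) {
        double perOp = coreMicrosPerOp(60_000, 4);   // about 67us with the assumed 4 cores
        double share = unparkShare(10.0, perOp);     // about 0.15
        System.out.printf("core time per op: %.1fus, unpark share: %.0f%%%n", perOp, share * 100);
    }
}
```

Under those assumptions a single unpark() per op would eat roughly a seventh of 
the per-op core budget, which is why avoiding it matters for tiny requests.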



[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-04-30 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986040#comment-13986040
 ] 

Jason Brown commented on CASSANDRA-4718:


[~benedict] Can you clarify what you mean by "when dealing with a flood of tiny 
messages"? 



[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-04-30 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986039#comment-13986039
 ] 

Benedict commented on CASSANDRA-4718:
-

bq. Perhaps, but the overall improvement in performance (should we attribute 
that to the work stealing?) seems compelling enough.

I'm not suggesting we forgo the patch because of this concern; I'm raising it 
as something to bear in mind for the future. As I said, though, to some extent 
I have addressed this concern with the _lowsignal_ patch I uploaded, although 
it's debatable how elegant that approach is.

bq. I didn't consider this a 'fork' as we're not mucking about with the 
internals of the FJP itself.

Perhaps we're getting crossed wires and mixing up the patches I have uploaded 
(no fork), with the suggestion that we _may want to investigate forking in 
future_ in order to address these issues in a more elegant manner.



[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-04-30 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986032#comment-13986032
 ] 

Jason Brown commented on CASSANDRA-4718:


bq. The Semaphore is blocking

After reading the correct JDK source class this time, I see I was mistaken 
about the blocking (i.e. park) aspect of Semaphore (I got caught up in the rest 
of AbstractQueuedSynchronizer, which Semaphore uses internally through its Sync 
subclass). Thus, I'll test out your branch as is this afternoon.

bq. It isn't forked - this is all in the same extension class that you 
introduced

I didn't consider this a 'fork' as we're not mucking about with the internals 
of the FJP itself.

bq. FJP uses an exclusive lock for enqueueing work onto the pool, but does more 
whilst owning the lock, so is likely to take longer within the critical section

Perhaps, but the overall improvement in performance (should we attribute that 
to the work stealing?) seems compelling enough. 



[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-04-30 Thread Jeremiah Jordan (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986003#comment-13986003
 ] 

Jeremiah Jordan commented on CASSANDRA-4718:


bq. I can argue from running cassandra in production for almost four years that 
these metrics are not very helpful. At best they indicate that 'something' is 
amiss ("hey, pending tasks are getting higher"), but cannot give you a real 
clue as to what is wrong (gc, I/O contention, cpu throttling). As we got these 
data points largely for free from TPE, I guess it made sense to expose them, 
but if we have to go out of our way to fabricate a subset of them for FJP, I 
propose we drop them going forward (for FJP, at least).

I would argue they are very useful because they give you that high-level 
"something is wrong", so if it's easy to keep them, I am very +1 on that.



[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-04-30 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985978#comment-13985978
 ] 

Benedict commented on CASSANDRA-4718:
-

bq. . The Semaphore is blocking (by design)

It's non-blocking until you run out of permits, at which point it must block. 
We have _many more_ shared counters than this semaphore, so I highly doubt it 
will be an issue. If we did nothing but spin on updating it we could probably 
push several thousand times our current op-rate, and in reality we will be 
doing a lot in between, so contention is highly unlikely to be a problem, 
although it will incur a slight QPI penalty (nothing we don't already incur all 
over the place, though).
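
To illustrate the point about permits, here is a minimal sketch of bounding 
in-flight work on a ForkJoinPool with a Semaphore. BoundedSubmitter and its 
methods are hypothetical names, not the class from the patch; the only claim 
is the general pattern, where acquire returns immediately while permits remain 
and blocks only once the limit is reached.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;
import java.util.concurrent.Semaphore;

// Hypothetical wrapper for illustration; not the class from the patch.
final class BoundedSubmitter {
    private final ForkJoinPool pool;
    private final Semaphore permits;

    BoundedSubmitter(ForkJoinPool pool, int maxInFlight) {
        this.pool = pool;
        this.permits = new Semaphore(maxInFlight);
    }

    /** Returns immediately while permits remain; blocks only once maxInFlight tasks are outstanding. */
    ForkJoinTask<?> submit(Runnable task) {
        permits.acquireUninterruptibly();
        Runnable wrapped = () -> {
            try {
                task.run();
            } finally {
                permits.release(); // hand the permit back even if the task throws
            }
        };
        try {
            return pool.submit(wrapped);
        } catch (RuntimeException e) {
            permits.release(); // the pool rejected the task, so the wrapper will never release
            throw e;
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }

    public static void main(String[] args) {
        ForkJoinPool pool = new ForkJoinPool(2);
        BoundedSubmitter submitter = new BoundedSubmitter(pool, 2);
        for (int i = 0; i < 4; i++) {
            int id = i;
            submitter.submit(() -> System.out.println("ran task " + id)).join();
        }
        pool.shutdown();
    }
}
```

In the uncontended case the acquire is a single atomic update on the permit 
count, which is the cost being weighed in the discussion above.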

bq.  but any solution is better than forking FJP

It isn't forked - this is all in the same extension class that you 
introduced...?

bq. I literally have no idea what this means.

FJP uses an exclusive lock for enqueueing work onto the pool, but does more 
whilst owning the lock, so is likely to take longer within the critical 
section. The second patch I uploaded attempts to mitigate this for native 
transport threads, as those microseconds are actually a pretty big deal when 
dealing with a flood of tiny messages.

bq.  As we got these data points largely for free from TPE, I guess it made 
sense to expose them, but if we have to go out of our way to fabricate a subset 
of them for FJP, I propose we drop them going forward (for FJP, at least).

I don't really mind, but I think you're overestimating the penalty for 
maintaining these counters.



[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-04-30 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985962#comment-13985962
 ] 

Jason Brown commented on CASSANDRA-4718:


[~benedict] I agree that FJP does not have a native enqueueing mechanism, but 
I'm not sure adding a Semaphore right in the middle of the class is the right 
solution. The Semaphore is blocking (by design) and will be contended for 
across (numa) cores. As an alternative, at least for the native protocol, does 
it make sense to move the back pressure point earlier in the processing chain? 
I'm unfamiliar with EventLoopGroup, but any solution is better than forking 
FJP. Also, I have an idea I'd like to try out .. give me a day or two.
 
bq. ...  is no more efficient (probably slightly less) than a standard 
executor. That's a future problem, however

I literally have no idea what this means. 

bq. support the metrics that users may have gotten used to.

I can argue from running cassandra in production for almost four years that 
these metrics are not very helpful. At best they indicate that 'something' is 
amiss ("hey, pending tasks are getting higher"), but cannot give you a real 
clue as to what is wrong (gc, I/O contention, cpu throttling). As we got these 
data points largely for free from TPE, I guess it made sense to expose them, 
but if we have to go out of our way to fabricate a subset of them for FJP, I 
propose we drop them going forward (for FJP, at least). 





[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-04-30 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985658#comment-13985658
 ] 

Benedict commented on CASSANDRA-4718:
-

I've uploaded a slight variant of the patch 
[here|https://github.com/belliottsmith/cassandra/tree/4718-lowsignal] - this 
introduces a special FJP for processing native transport work that avoids 
blocking on enqueue to the pool unless the configured limit has been reached. 
Instead we schedule a ForkJoinTask that sleeps for 5us, forking any work that 
has been queued in the interval (and going to sleep only if no work has been 
seen in the past 5ms). This permits the connection worker threads to return to 
servicing their connections more promptly.

It has only a modest effect on my box, but it does give a 5-10% bump in native 
transport performance.
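
The general shape of that idea can be sketched as follows. This is not code 
from the branch: BatchingDispatcher, enqueue, drainOnce and runLoop are 
hypothetical names, and for simplicity the polling loop runs on a plain thread 
rather than as a ForkJoinTask. Producers append to a lock-free queue without 
signalling anyone, and a single dispatcher periodically hands the batch to the 
pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;

// Hypothetical sketch of deferred signalling; names and structure are not taken from the branch.
final class BatchingDispatcher {
    private final Queue<Runnable> pending = new ConcurrentLinkedQueue<>();
    private final ForkJoinPool pool;

    BatchingDispatcher(ForkJoinPool pool) {
        this.pool = pool;
    }

    /** Called by connection threads: a lock-free append that wakes nobody. */
    void enqueue(Runnable work) {
        pending.add(work);
    }

    /** Hands everything queued so far to the pool and returns the resulting tasks. */
    List<ForkJoinTask<?>> drainOnce() {
        List<ForkJoinTask<?>> forked = new ArrayList<>();
        Runnable work;
        while ((work = pending.poll()) != null) {
            forked.add(pool.submit(work));
        }
        return forked;
    }

    /** Drains every intervalNanos; returns once no work has arrived for idleNanos. */
    void runLoop(long intervalNanos, long idleNanos) {
        long lastWork = System.nanoTime();
        while (System.nanoTime() - lastWork < idleNanos) {
            LockSupport.parkNanos(intervalNanos);
            if (!drainOnce().isEmpty()) {
                lastWork = System.nanoTime();
            }
        }
        // A real implementation would need enqueue() to restart the loop once it has gone idle; omitted here.
    }

    public static void main(String[] args) {
        ForkJoinPool pool = new ForkJoinPool(2);
        BatchingDispatcher dispatcher = new BatchingDispatcher(pool);
        for (int i = 0; i < 3; i++) {
            int id = i;
            dispatcher.enqueue(() -> System.out.println("processed message " + id));
        }
        dispatcher.runLoop(5_000L, 5_000_000L); // 5us poll interval, 5ms idle cutoff
        pool.awaitQuiescence(1, TimeUnit.SECONDS);
        pool.shutdown();
    }
}
```

The trade-off is a small added latency per message (up to one poll interval, 
and in practice more, since parkNanos is rarely that precise) in exchange for 
producers never paying the cost of waking a worker.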



[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-04-29 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984933#comment-13984933
 ] 

Benedict commented on CASSANDRA-4718:
-

I've uploaded a new version of the patch 
[here|https://github.com/belliottsmith/cassandra/tree/4718-fjp]

I've refactored the DebuggableForkJoinPool a little to support a limited queue 
(so that our native transport queue doesn't get too long), and to support the 
metrics that users may have gotten used to.

I've tested the branch out very minimally and do see a very modest performance 
benefit on my box for reads, but that's far from conclusive. It's quite likely 
that any benefit is more visible on machines with more spare cores, as the 
single queue lock for a standard executor could easily become a point of 
contention.

One slight concern I have with this approach is that in order to make 
_enqueueing_ tasks less contentious we will need to either fork ForkJoinPool, 
or see if it is possible to implement an EventLoopGroup backed by a FJP, and 
use the same FJP to manage the connections as we do the execution of our tasks 
(as enqueuing tasks from a FJ-worker is contention-free). Given how FJP is 
intended to be used it is not optimised for enqueueing tasks, and is no more 
efficient (probably slightly less) than a standard executor. That's a future 
problem, however.
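
For readers unfamiliar with the "limited queue" idea, one common way to bound the work submitted to a ForkJoinPool is a counting semaphore around submission. The sketch below assumes that approach; `BoundedForkJoinSubmitter`, `maxInFlight` and `pendingTasks` are hypothetical names, and the DebuggableForkJoinPool in the branch may well be implemented differently.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.Semaphore;

// Hypothetical wrapper; not the DebuggableForkJoinPool from the 4718-fjp branch.
final class BoundedForkJoinSubmitter {
    private final ForkJoinPool pool;
    private final Semaphore permits;
    private final int maxInFlight;

    BoundedForkJoinSubmitter(ForkJoinPool pool, int maxInFlight) {
        this.pool = pool;
        this.maxInFlight = maxInFlight;
        this.permits = new Semaphore(maxInFlight);
    }

    /** Blocks the caller while maxInFlight tasks are already queued or running. */
    void submit(Runnable task) throws InterruptedException {
        permits.acquire();
        try {
            pool.execute(() -> {
                try {
                    task.run();
                } finally {
                    permits.release(); // free the slot once the task finishes
                }
            });
        } catch (RuntimeException e) {
            permits.release(); // execute() rejected the task, e.g. after shutdown
            throw e;
        }
    }

    /** Simple metric: number of tasks currently queued or running. */
    int pendingTasks() {
        return maxInFlight - permits.availablePermits();
    }
}
```

Note that the semaphore is itself a shared point of contention for producers, which is the kind of enqueue cost the comment above flags as a future problem.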



> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Jason Brown
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, baq vs trunk.png, op 
> costs of various queues.ods, stress op rate with various queues.ods, 
> v1-stress.out
>
>


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-04-22 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977020#comment-13977020
 ] 

Jason Brown commented on CASSANDRA-4718:


Guys, guys, there's plenty of my patch to criticize - you'll each have your 
fun, I'm sure :)

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Jason Brown
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, baq vs trunk.png, op 
> costs of various queues.ods, stress op rate with various queues.ods, 
> v1-stress.out
>
>


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-04-22 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13976994#comment-13976994
 ] 

Aleksey Yeschenko commented on CASSANDRA-4718:
--

We were just doing some house cleaning today on IRC

[16:49:06] jbellis: belliottsmith: is 4718 really patch available?
[16:49:14] jbellis: #4718
[16:49:14] CassBotJr: https://issues.apache.org/jira/browse/CASSANDRA-4718 
(Unresolved; 2.1): "More-efficient ExecutorService for improved throughput"
[16:50:15] belliottsmith: so jasobrown says - he's marked pavel as 
reviewer so i've kept out of it beyond asking for a little more stress info :)
[16:51:28] belliottsmith: honestly though, from last time i looked at it 
(a while back), it was a pretty simple change

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Jason Brown
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, baq vs trunk.png, op 
> costs of various queues.ods, stress op rate with various queues.ods, 
> v1-stress.out
>
>


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-04-22 Thread Pavel Yaskevich (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13976988#comment-13976988
 ] 

Pavel Yaskevich commented on CASSANDRA-4718:


I still want to review this, why are you re-assigning [~benedict]?

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Jason Brown
>Priority: Minor
>  Labels: performance
> Fix For: 2.1.0
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, baq vs trunk.png, op 
> costs of various queues.ods, stress op rate with various queues.ods, 
> v1-stress.out
>
>


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-04-15 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969513#comment-13969513
 ] 

Jason Brown commented on CASSANDRA-4718:


OK, will give it a shot today. Also, just noticed I did not tune 
native_transport_max_threads at all (so I have the default of 128). Might play 
with that a bit, as well.

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Jason Brown
>Priority: Minor
>  Labels: performance
> Fix For: 2.1
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, baq vs trunk.png, op 
> costs of various queues.ods, stress op rate with various queues.ods, 
> v1-stress.out
>
>


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-04-15 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969476#comment-13969476
 ] 

Benedict commented on CASSANDRA-4718:
-

[~jasobrown]: Could you upload the full stress outputs for these runs? And also 
try running a separate stress run with a fixed high threadcount and op count?

In particular for CQL, the results in the file are a little bit weird. That 
said, given their consistency for thrift I don't doubt the result is 
meaningful, but it would be good to understand what we're incorporating a bit 
better before committing.

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Jason Brown
>Priority: Minor
>  Labels: performance
> Fix For: 2.1
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, baq vs trunk.png, op 
> costs of various queues.ods, stress op rate with various queues.ods, 
> v1-stress.out
>
>


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-04-14 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968674#comment-13968674
 ] 

Benedict commented on CASSANDRA-4718:
-

Maybe you're one of the few people I haven't told of my dream of internally 
hashing C* so that each sub-hash is run by its own core. Guaranteed NUMA 
behaviour, and we can stop performing *any* CASes and make all kinds of single 
threaded optimisations in various places (e.g. no need to do CoW updates to 
memtables, so massive garbage reduction).

Bit of a pipedream for the moment though :)
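
As a rough picture of what "each sub-hash is run by its own core" could look like, the sketch below routes every key to one of N single-threaded executors, so all work for a given key runs on the same thread and per-shard state needs no CAS or locking. Everything here is an assumption for illustration: `ShardedExecutor` and its methods are hypothetical names, and the threads are not actually pinned to cores (the JDK has no standard API for that), so this shows only the routing, not the NUMA guarantee.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of hash-sharded, single-threaded execution.
final class ShardedExecutor {
    private final ExecutorService[] shards;

    ShardedExecutor(int shardCount) {
        shards = new ExecutorService[shardCount];
        for (int i = 0; i < shardCount; i++) {
            shards[i] = Executors.newSingleThreadExecutor();
        }
    }

    /** Maps a key to its shard; floorMod keeps the index non-negative. */
    int shardFor(Object key) {
        return Math.floorMod(key.hashCode(), shards.length);
    }

    /** Runs the task on the single thread that owns the key's shard. */
    Future<?> submit(Object key, Runnable task) {
        return shards[shardFor(key)].submit(task);
    }

    /** Stops accepting work and waits for every shard to drain. */
    boolean shutdownAndAwait(long timeout, TimeUnit unit) throws InterruptedException {
        for (ExecutorService s : shards) {
            s.shutdown();
        }
        boolean done = true;
        for (ExecutorService s : shards) {
            done &= s.awaitTermination(timeout, unit);
        }
        return done;
    }
}
```

Because each shard has exactly one thread, state owned by a shard can be updated with plain, non-atomic writes, which is the kind of single-threaded optimisation the comment has in mind.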

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Jason Brown
>Priority: Minor
>  Labels: performance
> Fix For: 2.1
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, baq vs trunk.png, op 
> costs of various queues.ods, stress op rate with various queues.ods, 
> v1-stress.out
>
>


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-04-14 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968669#comment-13968669
 ] 

Jason Brown commented on CASSANDRA-4718:


Yeah, I think the extra cost across the QPI bus and such is easily masked by 
any disk I/O the actual op may have to do :)

bq. the data is completely randomly hodgepodgedly allocated across the CPUs

Correct, given the current structure of the app. I can imagine something more 
CPU cache friendly, but it's a huge change and I suspect that I/O still dwarfs 
those latency reductions. Nice hack day project.

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Jason Brown
>Priority: Minor
>  Labels: performance
> Fix For: 2.1
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, baq vs trunk.png, op 
> costs of various queues.ods, stress op rate with various queues.ods, 
> v1-stress.out
>
>


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-04-14 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968651#comment-13968651
 ] 

Benedict commented on CASSANDRA-4718:
-

Note that given the way C* works, the work stealing crossing CPU boundaries is 
really not important - the data is completely randomly hodgepodgedly allocated 
across the CPUs. Nothing we can do in FJP can fix that :)

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Jason Brown
>Priority: Minor
>  Labels: performance
> Fix For: 2.1
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, baq vs trunk.png, op 
> costs of various queues.ods, stress op rate with various queues.ods, 
> v1-stress.out
>
>


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-04-14 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968648#comment-13968648
 ] 

Benedict commented on CASSANDRA-4718:
-

I think it is unlikely that the CAS has any meaningful impact. The QPI is very 
quick. I think this kind of speedup is more likely down to reduced signalling 
costs (unpark costs ~ 10micros, and work stealing means you have to go to the 
main queue far less frequently); possibly also the signalling of threads has 
been directly optimised in FJP. I knocked up a "low-switch" executor but found 
fairly little benefit on my box, as I can saturate the CPUs very easily (at 
which point the unpark cost is never incurred). On a many-CPU box, saturating 
all the cores is difficult, and so it is much more likely you'll be introducing 
bottlenecks on producers adding to the queue.

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Jason Brown
>Priority: Minor
>  Labels: performance
> Fix For: 2.1
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, baq vs trunk.png, op 
> costs of various queues.ods, stress op rate with various queues.ods, 
> v1-stress.out
>
>


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-04-14 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968643#comment-13968643
 ] 

Jason Brown commented on CASSANDRA-4718:


bq. but it ran for 4 times as long, indicating there was a high variance in 
throughput

Huh, yeah, you are right, it did run longer. Admittedly my eyes have been 
ignoring that column (shame on me). Let me run the native protocol test again 
(and try to figure out the read situation, as well).

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Jason Brown
>Priority: Minor
>  Labels: performance
> Fix For: 2.1
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, baq vs trunk.png, op 
> costs of various queues.ods, stress op rate with various queues.ods, 
> v1-stress.out
>
>


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-04-14 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968636#comment-13968636
 ] 

Jason Brown commented on CASSANDRA-4718:


As to multi-cpu machines, I spent a lot of time thinking about the effects of 
NUMA systems on CAS operations/algs (esp. wrt FJP, obviously). As I mentioned, 
I'm using systems with two sockets (two NUMA cores). As you get more sockets 
(and thus more numa cores) a thread on one core will be reaching across to more 
cores to do work stealing, thus adding contention to that memory address. 
Imagine four threads on four sockets all contending for work on a fifth thread. 
The memory values for that portion of the queue for that fifth thread are now 
pulled into all four sockets, thus becoming more of a contention point, as well 
as impacting latency (due to the CAS operation). However, this could be (and 
hopefully is) less of a cost than bothering with queues, blocking, posix 
threads, OS interrupts, and everything else that makes standard thread pool 
executors work.

Thinking even crazier to optimize the FJP sharing across numa cores, this is 
when I start thinking about digging up the thread affinity work again, and 
binding threads of similar types (probably by Stage) to sockets, not just an 
individual CPU (I think that was my problem before). But then I wonder how much 
is to be gained on non-NUMA systems or systems where you can't determine if 
it's got NUMA or not (hello, cloud!) - and at that point I'm happy to realize 
the gains we have and move forward.

bq. what problem are you seeing?

Will ping you offline - too unexciting for this space :)

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Jason Brown
>Priority: Minor
>  Labels: performance
> Fix For: 2.1
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, baq vs trunk.png, op 
> costs of various queues.ods, stress op rate with various queues.ods, 
> v1-stress.out
>
>


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-04-14 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968637#comment-13968637
 ] 

Jason Brown commented on CASSANDRA-4718:


bq. you were tearing down and trashing the data directories between write runs

Yes, and I was clearing the page cache as well.

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Jason Brown
>Priority: Minor
>  Labels: performance
> Fix For: 2.1
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, baq vs trunk.png, op 
> costs of various queues.ods, stress op rate with various queues.ods, 
> v1-stress.out
>
>


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-04-14 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968612#comment-13968612
 ] 

Benedict commented on CASSANDRA-4718:
-

Just to check: were you tearing down and trashing the data directories between 
write runs? I ask because the last result is a bit weird: double the 
throughput, but it ran for 4 times as long, which indicates high variance in 
throughput (this usually means compaction / heavy flushing is taking effect) - 
but the same workload under thrift had no such spikes...

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Jason Brown
>Priority: Minor
>  Labels: performance
> Fix For: 2.1
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, baq vs trunk.png, op 
> costs of various queues.ods, stress op rate with various queues.ods, 
> v1-stress.out
>
>


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-04-14 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968609#comment-13968609
 ] 

Benedict commented on CASSANDRA-4718:
-

Nice! I wonder if this has a much bigger impact on multi-CPU machines, as I did 
not see anything like such a dramatic improvement. But this is great. Do you 
have some stress dumps we can look at?

bq. new 2.1 stress seems broken on reads

Shouldn't be - what problem are you seeing?

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Jason Brown
>Priority: Minor
>  Labels: performance
> Fix For: 2.1
>
> Attachments: 4718-v1.patch, PerThreadQueue.java, baq vs trunk.png, op 
> costs of various queues.ods, stress op rate with various queues.ods, 
> v1-stress.out
>
>


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2013-12-06 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13841943#comment-13841943
 ] 

Jason Brown commented on CASSANDRA-4718:


OK, looks like my initial stab at switching over to FJP netted about 10-15% 
throughput increase, and mixed results on the latency scores (sometimes better, 
sometimes on par with trunk). I'm going to run some more perf tests this weekend, 
and will decide how to proceed early next week - but the initial results do 
look promising. I've only tested the thrift endpoints so far, but when I retest 
this weekend, I'll throw in the cql3/native protocol, as well.

Here's my current working branch: 
https://github.com/jasobrown/cassandra/tree/4718_fjp . Note, it's very hacked 
up/WIP as I wanted to confirm the performance benefits before making everything 
happy (read: metrics pools). Also, I modified [~xedin]'s thrift-disruptor lib, 
for this: https://github.com/jasobrown/disruptor_thrift_server/tree/4718_fjp.
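
For readers unfamiliar with FJP (java.util.concurrent.ForkJoinPool), here is a 
minimal sketch of using one as a plain ExecutorService for independent tasks. 
It is an assumption-laden illustration, not code from the 4718_fjp branch: the 
names FjpStageSketch, newStage and sumOfSquares are hypothetical, and the 
choice of asyncMode = true is mine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.Future;

// Hypothetical sketch; names are not taken from the 4718_fjp branch.
final class FjpStageSketch
{
    // asyncMode = true selects local FIFO scheduling for forked tasks that
    // are never joined, the mode intended for event-style asynchronous work.
    static ExecutorService newStage(int threads)
    {
        return new ForkJoinPool(threads,
                                ForkJoinPool.defaultForkJoinWorkerThreadFactory,
                                null,   // no custom UncaughtExceptionHandler
                                true);
    }

    // Submits `tasks` independent jobs to the stage and sums their results.
    static long sumOfSquares(ExecutorService stage, int tasks)
    {
        List<Future<Long>> futures = new ArrayList<>();
        for (int i = 1; i <= tasks; i++)
        {
            final long n = i;
            Callable<Long> job = () -> n * n;
            futures.add(stage.submit(job));
        }
        long total = 0;
        try
        {
            for (Future<Long> f : futures)
                total += f.get();
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        catch (ExecutionException e)
        {
            throw new IllegalStateException(e);
        }
        return total;
    }

    public static void main(String[] args)
    {
        ExecutorService stage = newStage(4);
        try
        {
            System.out.println(sumOfSquares(stage, 100));
        }
        finally
        {
            stage.shutdown();
        }
    }
}
```

Because ForkJoinPool implements ExecutorService, callers that only submit 
tasks need no changes; each worker has its own deque and steals from others 
instead of contending on one shared queue.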

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Jason Brown
>Priority: Minor
>  Labels: performance
> Attachments: PerThreadQueue.java, baq vs trunk.png, op costs of 
> various queues.ods, stress op rate with various queues.ods
>
>



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2013-11-13 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13822199#comment-13822199
 ] 

Jason Brown commented on CASSANDRA-4718:


Ha, just this week I was untangling my branch for CASSANDRA-1632, which 
included the FJP work. Should be able to get to this one next week after more 
performance testing.

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Jason Brown
>Priority: Minor
>  Labels: performance
> Attachments: PerThreadQueue.java, baq vs trunk.png, op costs of 
> various queues.ods, stress op rate with various queues.ods
>
>


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2013-11-13 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13822163#comment-13822163
 ] 

Jonathan Ellis commented on CASSANDRA-4718:
---

What's the latest on this?  I think Jason had some work with FJP he was testing 
out...

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Jason Brown
>Priority: Minor
>  Labels: performance
> Attachments: PerThreadQueue.java, baq vs trunk.png, op costs of 
> various queues.ods, stress op rate with various queues.ods
>
>


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2013-10-07 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788501#comment-13788501
 ] 

Benedict commented on CASSANDRA-4718:
-

Not necessarily. I still think that was most likely variance:

- I have BAQ at same speed as LBQ in application
- a 2x slow down of LBQ -> 0.01x slow down of application
- a 10x slow down of LBQ -> 0.05x slow down of application

=> the queue speed is currently only ~1% of application cost. It's possible the 
faster queue is causing greater contention at a sync point, but this wouldn't 
work in the opposite direction if the contention at the sync point is low. 
Either way, if this were true we'd see the artificially slow queues also 
improve stress performance.
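
The ~1% figure can be checked with a little arithmetic. The sketch below 
assumes a simple additive model (my assumption, not something stated in the 
measurements): if queue operations account for a share f of the per-op cost, 
then making the queue s times slower adds f * (s - 1) to the total. It also 
reads "0.01x" and "0.05x" above as 1% and 5% slowdowns. The class and method 
names are hypothetical.

```java
// Hypothetical helper; assumes total cost grows by share * (factor - 1)
// when only the queue is slowed down by `factor`.
final class QueueCostShare
{
    // Returns the share of application cost implied by an observed slowdown.
    static double impliedShare(double queueSlowdownFactor, double appSlowdownFraction)
    {
        return appSlowdownFraction / (queueSlowdownFactor - 1.0);
    }

    public static void main(String[] args)
    {
        // 2x slower queue, 1% slower application
        System.out.println(impliedShare(2.0, 0.01));
        // 10x slower queue, 5% slower application
        System.out.println(impliedShare(10.0, 0.05));
    }
}
```

Under this model the 2x case implies a share of 1% and the 10x case a share of 
5/9 of a percent, so both point to the queue being roughly 1% or less of the 
cost.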

Ryan also ran some of my tests and found no difference. I wouldn't absolutely 
rule out the possibility his test was valid, though, as I did not swap out the 
queues in OutboundTcpConnection for these tests as, at the time, I was 
concerned about the calls to size() which are expensive for my test queues, and 
I wanted the queue swap to be on equal terms across the board. I realise now 
these are only called via JMX, so shouldn't stop me swapping them in.

I've just tried a quick test of directly (in process) stressing through the 
MessagingService and found no measurable difference from putting BAQ in the 
OutboundTcpConnection, though if I swap out across the board it is about 25% 
slower, which is itself interesting, as this is close to a full stress, minus 
thrift.

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Jason Brown
>Priority: Minor
>  Labels: performance
> Attachments: baq vs trunk.png, op costs of various queues.ods, 
> PerThreadQueue.java, stress op rate with various queues.ods
>
>


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2013-10-07 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788426#comment-13788426
 ] 

Jonathan Ellis commented on CASSANDRA-4718:
---

bq. The faster queue actually slows down the process, by about 9% - more than 
the queue supposedly much slower than it

So this actually confirms Ryan's original measurement of C*/BAQ [slow queue] 
faster than C*/LBQ [fast queue]?

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Jason Brown
>Priority: Minor
>  Labels: performance
> Attachments: baq vs trunk.png, op costs of various queues.ods, 
> PerThreadQueue.java, stress op rate with various queues.ods
>
>


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2013-10-07 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788425#comment-13788425
 ] 

Benedict commented on CASSANDRA-4718:
-

Disruptors are very difficult to use as a drop-in replacement for the executor 
service, so I tried to knock up some queues that could provide similar 
performance without ripping apart the whole application. I benchmarked the 
resulting queues under high load, in isolation, against LinkedBlockingQueue, 
BlockingArrayQueue and the Disruptor, and plotted the average op costs in the 
"op costs of various queues" attachment*. As can be seen, these queues and the 
Disruptor are substantially faster under high load than LinkedBlockingQueue; 
however, it can also be seen that:

- The average op cost for LinkedBlockingQueue is still very low, in fact only 
around 300ns at worst
- BlockingArrayQueue is considerably worse than LinkedBlockingQueue under all 
conditions

These suggest both that the overhead attributed to LinkedBlockingQueue for a 
1Mop workload (as run above) should be at most a few seconds of the overall 
cost (probably much less); and that BlockingArrayQueue is unlikely to make any 
cost incurred by LinkedBlockingQueue substantially better. This made me suspect 
the previous result might be attributable to random variance, but to be sure I 
ran a number of ccm -stress tests with the different queues, and plotted the 
results in "stress op rate with various queues.ods", which show the following:

1) No meaningful difference between BAQ, LBQ and SlowQueue (though the latter 
has a clear ~1% slow down)
2) UltraSlow (~10x slow down, or 2000ns spinning each op) is approximately 5% 
slower
3) The faster queue actually slows down the process, by about 9% - more than 
the queue supposedly much slower than it!
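
As a rough check of the "at most a few seconds" bound given earlier for a 1Mop 
workload at around 300ns per queue op, the arithmetic can be sketched as 
below. The number of queue operations per request is not stated anywhere in 
this thread, so the values 1 and 10 are purely illustrative assumptions, and 
the class and method names are hypothetical.

```java
// Hypothetical helper for the back-of-the-envelope bound; the number of
// queue operations per request is an assumed input, not a measured one.
final class QueueOverheadBound
{
    // Total queue time in seconds for `ops` requests.
    static double worstCaseSeconds(long ops, long queueOpsPerRequest, double nanosPerQueueOp)
    {
        return ops * queueOpsPerRequest * nanosPerQueueOp / 1e9;
    }

    public static void main(String[] args)
    {
        System.out.println(worstCaseSeconds(1_000_000L, 1, 300.0));
        System.out.println(worstCaseSeconds(1_000_000L, 10, 300.0));
    }
}
```

With one queue op per request this comes to about 0.3 seconds, and even ten 
queue ops per request would only reach about 3 seconds.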

Anyway, I've been concurrently looking at where I might be able to improve 
performance independent of this, and have found the following:

A) Raw performance of local reads is ~6-7x faster than through Stress
B) Raw performance of local reads run asynchronously is ~4x faster
C) Raw performance of local reads run asynchronously using the fast queue is 
~4.7x faster
D) Performance of local reads from the Thrift server-side methods is ~3x faster
E) Performance of remote (i.e. local non-optimised) reads is ~1.5x faster

In particular (C) is interesting, as it demonstrates the queue really is faster 
in use, but I've yet to absolutely determine why that translates into an 
overall decline in throughput. It looks as though it's possible it causes 
greater congestion in LockSupport.unpark(), but this is a new piece of 
information, derived from YourKit. As these sorts of methods are difficult to 
meter accurately I don't necessarily trust it, and haven't had a chance to 
figure out what I can do with the information. If it is accurate, and I can 
figure out how to reduce the overhead, we might get a modest speed boost, which 
will accumulate as we find other places to improve.

As to the overall problem of improving throughput, it seems to me that there 
are two big avenues to explore: 

  1) the networking (software) overhead is large;
  2) possibly the cost of managing thread liveness (e.g. park/unpark/scheduler 
costs); though the evidence for this is as yet inconclusive... given the op 
rate and other evidence it doesn't seem to be synchronization overhead. I'm 
still trying to pin this down.

Once the costs here are nailed down as tight as they can go, I'm pretty 
confident we can get some noticeable improvements to the actual work being 
done, but since that currently accounts for only a fraction of the time spent 
(probably less than 20%), I'd rather wait until it was a higher percentage so 
any improvement is multiplied.


* These can be replicated by running 
org.apache.cassandra.concurrent.test.bench.Benchmark on any of the linked 
branches on github. 

https://github.com/belliottsmith/cassandra/tree/4718-lbq [using 
LinkedBlockingQueue]
https://github.com/belliottsmith/cassandra/tree/4718-baq [using 
BlockingArrayQueue]
https://github.com/belliottsmith/cassandra/tree/4718-lpbq [using a new high 
performance queue]
https://github.com/belliottsmith/cassandra/tree/4718-slow [using a 
LinkedBlockingQueue with 200ns spinning each op]
https://github.com/belliottsmith/cassandra/tree/4718-ultraslow [using a 
LinkedBlockingQueue with 2000ns spinning each op]
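
To make the "slow" and "ultraslow" variants concrete, here is a minimal sketch 
of a LinkedBlockingQueue that busy-spins for a fixed time on each operation. 
It is a guess at the general shape only; the class name SpinningQueue and the 
choice of which methods spin are my assumptions, and the code in the branches 
above may differ.

```java
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical stand-in for an artificially slowed queue; not the code from
// the 4718-slow / 4718-ultraslow branches.
final class SpinningQueue<E> extends LinkedBlockingQueue<E>
{
    private static final long serialVersionUID = 1L;

    private final long spinNanos;

    SpinningQueue(long spinNanos)
    {
        this.spinNanos = spinNanos;
    }

    // Burns CPU until at least spinNanos have elapsed.
    private void spin()
    {
        long deadline = System.nanoTime() + spinNanos;
        while (System.nanoTime() - deadline < 0)
        {
            // busy-wait
        }
    }

    // add(e) is implemented in terms of offer(e), so it spins as well;
    // put(e) and the timed variants are left untouched in this sketch.
    @Override
    public boolean offer(E e)
    {
        spin();
        return super.offer(e);
    }

    @Override
    public E take() throws InterruptedException
    {
        spin();
        return super.take();
    }

    @Override
    public E poll()
    {
        spin();
        return super.poll();
    }

    public static void main(String[] args)
    {
        SpinningQueue<String> queue = new SpinningQueue<>(2000L);
        long start = System.nanoTime();
        queue.offer("task");
        System.out.println("offer took " + (System.nanoTime() - start) + "ns");
        System.out.println(queue.poll());
    }
}
```

Passing 200 or 2000 to the constructor would correspond to the per-op spin 
times quoted for the two branches.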


> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Jason Brown
>Priority: Minor
>  Labels: performance
> Attachments: baq vs trunk.png, op costs of various queues.

[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2013-10-07 Thread darion yaphets (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788243#comment-13788243
 ] 

darion yaphets commented on CASSANDRA-4718:
---

LMAX Disruptor's RingBuffer may be a good idea for a lock-free component. 
But the ring buffer may need a larger size so that the entries it holds are not 
overwritten by new ones, 
which means using more memory ...

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Jason Brown
>Priority: Minor
>  Labels: performance
> Attachments: baq vs trunk.png, op costs of various queues.ods, 
> PerThreadQueue.java
>
>


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2013-06-13 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682785#comment-13682785
 ] 

Jonathan Ellis commented on CASSANDRA-4718:
---

Sounds like a reasonable place to start.

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Priority: Minor
> Attachments: baq vs trunk.png, PerThreadQueue.java
>
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2013-06-13 Thread Pavel Yaskevich (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682777#comment-13682777
 ] 

Pavel Yaskevich commented on CASSANDRA-4718:


Even though variable-size RMs don't give us the most cache-friendly behavior, I 
think we can still get better dispatch behavior with low-cost 
synchronization. We can't do anything about blocking I/O operations requiring a 
separate thread, but I think it's time to re-evaluate NIO async sockets vs. 
having a thread per in/out connection.

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Priority: Minor
> Attachments: baq vs trunk.png, PerThreadQueue.java
>
>


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2013-06-11 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13680705#comment-13680705
 ] 

Jonathan Ellis commented on CASSANDRA-4718:
---

[~xedin] Thoughts on my comment above? 
https://issues.apache.org/jira/browse/CASSANDRA-4718?focusedCommentId=13629447&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13629447

> More-efficient ExecutorService for improved throughput
> --
>
> Key: CASSANDRA-4718
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Priority: Minor
> Attachments: baq vs trunk.png, PerThreadQueue.java
>
>


[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2013-04-18 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13635568#comment-13635568
 ] 

Marcus Eriksson commented on CASSANDRA-4718:


Unless the Java API has improved _a lot_ in the last year or so, the code will be 
horrible.



[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2013-04-18 Thread JIRA

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13635560#comment-13635560
 ] 

Piotr Kołaczkowski commented on CASSANDRA-4718:
---

I'm not suggesting using Scala in C* or anywhere else. It was just quicker for me 
to write a throw-away benchmark in it.



[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2013-04-18 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13635543#comment-13635543
 ] 

Marcus Eriksson commented on CASSANDRA-4718:


FTR, I'm very much -1 on using Scala in Cassandra (I don't know if you are even 
suggesting that).

I know it is supposed to interface nicely with Java code, but it generally 
becomes a huge, hairy part of the code base that no one wants to touch.



[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2013-04-18 Thread JIRA

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13635532#comment-13635532
 ] 

Piotr Kołaczkowski commented on CASSANDRA-4718:
---

I made another version of the benchmark, following Sergio's suggestions. Now it 
uses the following message processing graph:


{noformat}
   /-- stage 0 processor 0 --\     /---         ---\             /---         ---\
   +-- stage 0 processor 1 --+     +---         ---+             +---         ---+
>--+-- stage 0 processor 2 --+-->--+--- STAGE 1 ---+-->- ... -->-+--- STAGE m ---+->
   +-- ...                 --+     +---         ---+             +---         ---+
   \-- stage 0 processor n --/     \---         ---/             \---         ---/
{noformat}

128 threads concurrently push messages through all the stages and measure 
the average latency, including the time required for a message to enter 
stage 0.
Thread-pool stages are built from fixed-size thread pools with n=8, because 
there are 8 cores.
Actor-based stages are built from 128 actors each, with a RoundRobinRouter in 
front of every stage.
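
For readers unfamiliar with the term, round-robin routing simply hands each incoming message to the next worker in a fixed cycle. The sketch below illustrates only that idea using the JDK; `RoundRobin` and `RoundRobinDemo` are hypothetical names, and this is not Akka's implementation nor part of the benchmark.

```scala
import java.util.concurrent.atomic.AtomicLong

// Hypothetical helper for illustration only; not Akka's RoundRobinRouter.
final class RoundRobin[T](targets: IndexedSeq[T]) {
  require(targets.nonEmpty, "need at least one target")

  // Shared counter, so concurrent callers still spread work evenly.
  private val next = new AtomicLong(0L)

  // Returns the targets in a repeating cycle: 0, 1, ..., n-1, 0, 1, ...
  def pick(): T = {
    val i = Math.floorMod(next.getAndIncrement(), targets.size.toLong).toInt
    targets(i)
  }
}

object RoundRobinDemo {
  def main(args: Array[String]): Unit = {
    val router = new RoundRobin(Vector("a", "b", "c"))
    // prints "a b c a b c a"
    println((1 to 7).map(_ => router.pick()).mkString(" "))
  }
}
```

With 128 routees per stage, as in the benchmark, consecutive messages entering a stage would land on different actors under such a scheme.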

Average latencies:
{noformat}
3 stages: 
Sync:    364687 ns
Async:   210766 ns
Akka:    201842 ns

4 stages: 
Sync:    492581 ns
Async:   221118 ns
Akka:    239407 ns

5 stages: 
Sync:    671733 ns
Async:   245370 ns
Akka:    283798 ns

6 stages: 
Sync:    781759 ns
Async:   262742 ns
Akka:    309384 ns
{noformat}

So Akka comes out slightly slower than the async thread pools at 4 or more 
stages, and marginally faster at 3.
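
To put a number on that gap, the ratio of Akka to async latency can be computed directly from the averages reported above. This is only a small arithmetic sketch; `LatencyRatios` and `akkaOverAsync` are hypothetical names, and the figures are copied from the table, not re-measured.

```scala
object LatencyRatios {
  // Average latencies in ns, copied from the results above:
  // (stage count, async thread pools, Akka).
  val results: Seq[(Int, Double, Double)] = Seq(
    (3, 210766.0, 201842.0),
    (4, 221118.0, 239407.0),
    (5, 245370.0, 283798.0),
    (6, 262742.0, 309384.0)
  )

  // A ratio above 1 means Akka was slower than the async thread pools.
  def akkaOverAsync(async: Double, akka: Double): Double = akka / async

  def main(args: Array[String]): Unit =
    for ((stages, async, akka) <- results)
      println(f"$stages%d stages: Akka/Async = ${akkaOverAsync(async, akka)}%.3f")
}
```

A ratio below 1 for a row means Akka was the faster of the two at that stage count.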

If someone wants to play with my code, here is the up-to-date version:
{noformat}
import java.util.concurrent.{CountDownLatch, Executors}
import akka.actor.{Props, ActorSystem, Actor, ActorRef}
import akka.routing.{SmallestMailboxRouter, RoundRobinRouter}


class Message {
  var counter = 0
  val latch = new CountDownLatch(1)
}

abstract class MultistageThreadPoolProcessor(stageCount: Int) {

  val stages =
    for (i <- 1 to stageCount) yield Executors.newFixedThreadPool(8)

  def shutdown() {
    stages.foreach(_.shutdown())
  }

}

/** Synchronously processes a message through the stages.
  * The message is passed stage-to-stage by the coordinator thread. */
class SyncThreadPoolProcessor(stageCount: Int)
  extends MultistageThreadPoolProcessor(stageCount) {

  def process() {

    val message = new Message

    val task = new Runnable() {
      def run() { message.counter += 1 }
    }

    for (executor <- stages)
      executor.submit(task).get()
  }
}

/** Asynchronously processes a message through the stages.
  * Every stage after finishing its processing of the message
  * passes the message directly to the next stage, without bothering the
  * coordinator thread. */
class AsyncThreadPoolProcessor(stageCount: Int)
  extends MultistageThreadPoolProcessor(stageCount) {

  def process() {

    val message = new Message

    val task = new Runnable() {
      def run() {
        message.counter += 1
        if (message.counter >= stages.size)
          message.latch.countDown()
        else
          stages(message.counter).submit(this)
      }
    }

    stages(0).submit(task)
    message.latch.await()
  }
}

/** Similar to AsyncThreadPoolProcessor but it uses Akka actor system instead
  * of thread pools and queues.
  * Every stage after finishing its processing of the message
  * passes the message directly to the next stage, without bothering the
  * coordinator thread. */
class AkkaProcessor(stageCount: Int) {

  val system = ActorSystem()

  val stages: IndexedSeq[ActorRef] = {
    for (i <- 1 to stageCount) yield
      system.actorOf(Props(createActor()).withRouter(RoundRobinRouter(nrOfInstances = 128)))
  }

  def createActor(): Actor = {
    new Actor {

      def receive = {
        case m: Message =>
          m.counter += 1
          if (m.counter >= stages.size)
            m.latch.countDown()
          else
            stages(m.counter) ! m
      }
    }
  }

  def process() {
    val message = new Message
    stages(0) ! message
    message.latch.await()
  }

  def shutdown() {
    system.shutdown()
  }

}



object MessagingBenchmark extends App {

  def measureLatency(count: Int, f: () => Any): Double = {
    val start = System.nanoTime()
    for (i <- 1 to count)
      f()
    val end = System.nanoTime()
    (end - start).toDouble / count
  }

  def measureLatency(threadCount: Int, messageCount: Int, f: () => Any): Double = {

    class RequestThread extends Thread {
      var latency: Double = 0.0
      override def run() { latency = measureLatency(messageCount, f) }
    }

    val threads =
      for (i <- 1 to threadCount) yield new RequestThread()

    threads.foreach(_.start())
    threads.foreach(_.join())

    threads.map(_.latency).sum / threads.size
  }


  val messageCount = 5
  for (s

[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2013-04-18 Thread JIRA

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13635141#comment-13635141
 ] 

Piotr Kołaczkowski commented on CASSANDRA-4718:
---

Interestingly, after boosting the number of threads that invoke the 
process() method from 1 to 16, Akka gets slower, while the thread-pool-per-stage 
approach gets faster.

16 user threads invoking process(), 4-core i7 with HT (8 virtual cores):

{noformat}
2 stages: 
Sync:     28195 ns
Async:    26852 ns
Akka:     51651 ns

4 stages: 
Sync:     75295 ns
Async:    60381 ns
Akka:     85954 ns

8 stages: 
Sync:    176879 ns
Async:   124712 ns
Akka:    103073 ns

16 stages: 
Sync:    367728 ns
Async:   259715 ns
Akka:    146875 ns
{noformat}

top reports ~780% total CPU utilisation:
  thread pools:  ~60% system, ~40% user
  Akka:          ~15% system, ~85% user

I will try to add Disruptor to the benchmark suite.



[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2013-04-18 Thread JIRA

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13635120#comment-13635120
 ] 

Piotr Kołaczkowski commented on CASSANDRA-4718:
---

Another thing to consider might be using a high-performance actor library, e.g. 
Akka.

I did a quick microbenchmark to measure the latency of passing a single 
message through several stages, in 3 variants:

1. Sync: one thread pool per stage, where a coordinator thread moves the 
message from one ExecutorService to the next after each stage finishes processing
2. Async: one thread pool per stage, where every stage directly and asynchronously 
pushes its result into the next stage
3. Akka: one Akka actor per stage, where every stage directly and asynchronously 
pushes its result into the next stage

The clear winner is Akka:
{noformat}
2 stages: 
Sync:     38717 ns
Async:    36159 ns
Akka:     12969 ns

4 stages: 
Sync:     65793 ns
Async:    49964 ns
Akka:     18516 ns

8 stages: 
Sync:    162256 ns
Async:   19 ns
Akka:      9237 ns

16 stages: 
Sync:    296951 ns
Async:   183588 ns
Akka:     13574 ns

32 stages: 
Sync:    572605 ns
Async:   361959 ns
Akka:     23344 ns
{noformat}

Code of the benchmark:
{noformat}
package pl.pk.messaging

import java.util.concurrent.{CountDownLatch, Executors}
import akka.actor.{Props, ActorSystem, Actor, ActorRef}


class Message {
  var counter = 0
  val latch = new CountDownLatch(1)
}

abstract class MultistageThreadPoolProcessor(stageCount: Int) {

  val stages =
    for (i <- 1 to stageCount) yield Executors.newCachedThreadPool()

  def shutdown() {
    stages.foreach(_.shutdown())
  }

}

/** Synchronously processes a message through the stages.
  * The message is passed stage-to-stage by the coordinator thread. */
class SyncThreadPoolProcessor(stageCount: Int)
  extends MultistageThreadPoolProcessor(stageCount) {

  def process() {

    val message = new Message

    val task = new Runnable() {
      def run() { message.counter += 1 }
    }

    for (executor <- stages)
      executor.submit(task).get()
  }
}

/** Asynchronously processes a message through the stages.
  * Every stage after finishing its processing of the message
  * passes the message directly to the next stage, without bothering the
  * coordinator thread. */
class AsyncThreadPoolProcessor(stageCount: Int)
  extends MultistageThreadPoolProcessor(stageCount) {

  def process() {

    val message = new Message

    val task = new Runnable() {
      def run() {
        message.counter += 1
        if (message.counter >= stages.size)
          message.latch.countDown()
        else
          stages(message.counter).submit(this)
      }
    }

    stages(0).submit(task)
    message.latch.await()
  }
}

/** Similar to AsyncThreadPoolProcessor but it uses Akka actor system instead
  * of thread pools and queues.
  * Every stage after finishing its processing of the message
  * passes the message directly to the next stage, without bothering the
  * coordinator thread. */
class AkkaProcessor(stageCount: Int) {

  val system = ActorSystem()

  val stages: IndexedSeq[ActorRef] = {
    for (i <- 1 to stageCount) yield system.actorOf(Props(new Actor {
      def receive = {
        case m: Message =>
          m.counter += 1
          if (m.counter >= stages.size)
            m.latch.countDown()
          else
            stages(m.counter) ! m
      }
    }))
  }

  def process() {
    val message = new Message
    stages(0) ! message
    message.latch.await()
  }

  def shutdown() {
    system.shutdown()
  }

}



object MessagingBenchmark extends App {

  def measureLatency(count: Int, f: () => Any): Double = {
    val start = System.nanoTime()
    for (i <- 1 to count)
      f()
    val end = System.nanoTime()
    (end - start).toDouble / count
  }

  val messageCount = 20
  for (stageCount <- List(2,4,8,16,32))
  {
    printf("\n%d stages: \n", stageCount)
    val syncProcessor = new SyncThreadPoolProcessor(stageCount)
    val asyncProcessor = new AsyncThreadPoolProcessor(stageCount)
    val akkaProcessor = new AkkaProcessor(stageCount)

    printf("Sync:  %8.0f ns\n", measureLatency(messageCount, syncProcessor.process))
    printf("Async: %8.0f ns\n", measureLatency(messageCount, asyncProcessor.process))
    printf("Akka:  %8.0f ns\n", measureLatency(messageCount, akkaProcessor.process))

    syncProcessor.shutdown()
    asyncProcessor.shutdown()
    akkaProcessor.shutdown()
  }
}
{noformat}
