[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-19 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-4718:


Attachment: E600M_summary_key_s.svg
E100M_summary_key_s.svg
E10M_summary_key_s.svg

Attached are some graphs of running reads with exponential key distributions 
over different key ranges (10M, 100M, 600M) - though over the same dataset 
(600M keys), on a 2-node c3.8xlarge cluster, with 1 c3.8xlarge load generator. 
All of the thread counts were run for 1M operations. These are the last of a 
sequence of test runs as, initially, by far the biggest determining factor was 
page cache behaviour - with each iteration the page cache's page retention 
algorithm got progressively better at retaining the best subset of the pages to 
service the requests. Note also the scales - in particular the 600M test both 
are within 1% of each other performance-wise, and I would note that the 
variability was greater than 1%, and that in prior runs the positions were 
reversed.

 More-efficient ExecutorService for improved throughput
 --

 Key: CASSANDRA-4718
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: Benedict
Priority: Minor
  Labels: performance
 Fix For: 2.1.0

 Attachments: 4718-v1.patch, E100M_summary_key_s.svg, 
 E10M_summary_key_s.svg, E600M_summary_key_s.svg, PerThreadQueue.java, 
 austin_diskbound_read.svg, aws.svg, aws_read.svg, 
 backpressure-stress.out.txt, baq vs trunk.png, 
 belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
 jason_run1.svg, jason_run2.svg, jason_run3.svg, jason_write.svg, op costs of 
 various queues.ods, stress op rate with various queues.ods, 
 stress_2014May15.txt, stress_2014May16.txt, v1-stress.out


 Currently all our execution stages dequeue tasks one at a time.  This can 
 result in contention between producers and consumers (although we do our best 
 to minimize this by using LinkedBlockingQueue).
 One approach to mitigating this would be to make consumer threads do more 
 work in bulk instead of just one task per dequeue.  (Producer threads tend 
 to be single-task oriented by nature, so I don't see an equivalent 
 opportunity there.)
 BlockingQueue has a drainTo(collection, int) method that would be perfect for 
 this.  However, no ExecutorService in the jdk supports using drainTo, nor 
 could I google one.
 What I would like to do here is create just such a beast and wire it into (at 
 least) the write and read stages.  (Other possible candidates for such an 
 optimization, such as the CommitLog and OutboundTCPConnection, are not 
 ExecutorService-based and will need to be one-offs.)
 AbstractExecutorService may be useful.  The implementations of 
 ICommitLogExecutorService may also be useful. (Despite the name these are not 
 actual ExecutorServices, although they share the most important properties of 
 one.)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-17 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-4718:


Attachment: jason_run3.svg
jason_run2.svg
jason_run1.svg

 More-efficient ExecutorService for improved throughput
 --

 Key: CASSANDRA-4718
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: Benedict
Priority: Minor
  Labels: performance
 Fix For: 2.1.0

 Attachments: 4718-v1.patch, PerThreadQueue.java, 
 austin_diskbound_read.svg, aws.svg, aws_read.svg, 
 backpressure-stress.out.txt, baq vs trunk.png, 
 belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
 jason_run1.svg, jason_run2.svg, jason_run3.svg, jason_write.svg, op costs of 
 various queues.ods, stress op rate with various queues.ods, 
 stress_2014May15.txt, stress_2014May16.txt, v1-stress.out


 Currently all our execution stages dequeue tasks one at a time.  This can 
 result in contention between producers and consumers (although we do our best 
 to minimize this by using LinkedBlockingQueue).
 One approach to mitigating this would be to make consumer threads do more 
 work in bulk instead of just one task per dequeue.  (Producer threads tend 
 to be single-task oriented by nature, so I don't see an equivalent 
 opportunity there.)
 BlockingQueue has a drainTo(collection, int) method that would be perfect for 
 this.  However, no ExecutorService in the jdk supports using drainTo, nor 
 could I google one.
 What I would like to do here is create just such a beast and wire it into (at 
 least) the write and read stages.  (Other possible candidates for such an 
 optimization, such as the CommitLog and OutboundTCPConnection, are not 
 ExecutorService-based and will need to be one-offs.)
 AbstractExecutorService may be useful.  The implementations of 
 ICommitLogExecutorService may also be useful. (Despite the name these are not 
 actual ExecutorServices, although they share the most important properties of 
 one.)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-16 Thread Jason Brown (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Brown updated CASSANDRA-4718:
---

Attachment: stress_2014May15.txt

Adding more fuel to the testing fire, i took a first pass at having a large of 
amount of data on disk (~2x the memory size of each box), and running the read 
tests - see attached file: stress_2014May15.txt. I cleared the page cache 
before switching to each branch from for the reads, and then performed 3 rounds 
of stress. The goal here was to see how the sep branch compared with 
cassandra-2.1 when doing most of the reads from disk (with a cold page cache, 
or where the cache is constantly churning due to new blocks being pulled in).

The short story is the sep branch performs slightly worse the current 
cassandra-2.1 (which includes CASSANDRA-5663) on both ops/s and latencies.

I'm going to do one more test where I preload a good chunk of the data into the 
page cache, then run the stress - hopefully to emulate the case where most 
reads come from the page cache and some go to disk. Will me to use a less naive 
key distribution alg, to ensure that we hit the hot keys, which is provided by 
stress.


 More-efficient ExecutorService for improved throughput
 --

 Key: CASSANDRA-4718
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: Benedict
Priority: Minor
  Labels: performance
 Fix For: 2.1.0

 Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
 aws_read.svg, backpressure-stress.out.txt, baq vs trunk.png, 
 belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
 jason_write.svg, op costs of various queues.ods, stress op rate with various 
 queues.ods, stress_2014May15.txt, v1-stress.out


 Currently all our execution stages dequeue tasks one at a time.  This can 
 result in contention between producers and consumers (although we do our best 
 to minimize this by using LinkedBlockingQueue).
 One approach to mitigating this would be to make consumer threads do more 
 work in bulk instead of just one task per dequeue.  (Producer threads tend 
 to be single-task oriented by nature, so I don't see an equivalent 
 opportunity there.)
 BlockingQueue has a drainTo(collection, int) method that would be perfect for 
 this.  However, no ExecutorService in the jdk supports using drainTo, nor 
 could I google one.
 What I would like to do here is create just such a beast and wire it into (at 
 least) the write and read stages.  (Other possible candidates for such an 
 optimization, such as the CommitLog and OutboundTCPConnection, are not 
 ExecutorService-based and will need to be one-offs.)
 AbstractExecutorService may be useful.  The implementations of 
 ICommitLogExecutorService may also be useful. (Despite the name these are not 
 actual ExecutorServices, although they share the most important properties of 
 one.)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-16 Thread Jason Brown (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Brown updated CASSANDRA-4718:
---

Attachment: stress_2014May16.txt

Next attachment (stress_2014May16). After I created a new set of data (about 
300GB per node), generated with populate=x (rather than uniform=0..x), I 
dropped the page cache, then I attempted to pre-populate the page cache by 
running a warm up read round, like this:

{code}./tools/bin/cassandra-stress read n=180664790 -key 
dist=extr\(1..6,2\) -rate threads=75 -mode native prepared cql3 -port 
native=9043 thrift=9161  -node  {code}

I used the value 180664790 from stress's print function, which can give you a 
reasonable dist count to hit a certain percentage of coverage for the different 
models (note: i may be completely incorrect on this point, so feel free to 
correct my understanding).

Then I ran the same 3 stress tests i ran in yesterday's run (one where I load 
all 21 cols with the row, load the default count (5), then only iterate over 
the default key count (1mil)). I then cleared the page cache and performed the 
same testing for the other branch (using the existing data, but warming up the 
cache).

The results look like the sep branch had lower latencies vs. cassandra 2.1 
(same as yesterday), but less ops/sec.

tbh, I'm not sure I proved a whole lot since yesterday’s tests, as dstat showed 
I was loading 200-300Mb of data per second, so I was certainly giving the page 
cache a good workout in any case.

 More-efficient ExecutorService for improved throughput
 --

 Key: CASSANDRA-4718
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: Benedict
Priority: Minor
  Labels: performance
 Fix For: 2.1.0

 Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
 aws_read.svg, backpressure-stress.out.txt, baq vs trunk.png, 
 belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
 jason_write.svg, op costs of various queues.ods, stress op rate with various 
 queues.ods, stress_2014May15.txt, stress_2014May16.txt, v1-stress.out


 Currently all our execution stages dequeue tasks one at a time.  This can 
 result in contention between producers and consumers (although we do our best 
 to minimize this by using LinkedBlockingQueue).
 One approach to mitigating this would be to make consumer threads do more 
 work in bulk instead of just one task per dequeue.  (Producer threads tend 
 to be single-task oriented by nature, so I don't see an equivalent 
 opportunity there.)
 BlockingQueue has a drainTo(collection, int) method that would be perfect for 
 this.  However, no ExecutorService in the jdk supports using drainTo, nor 
 could I google one.
 What I would like to do here is create just such a beast and wire it into (at 
 least) the write and read stages.  (Other possible candidates for such an 
 optimization, such as the CommitLog and OutboundTCPConnection, are not 
 ExecutorService-based and will need to be one-offs.)
 AbstractExecutorService may be useful.  The implementations of 
 ICommitLogExecutorService may also be useful. (Despite the name these are not 
 actual ExecutorServices, although they share the most important properties of 
 one.)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-16 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-4718:


Attachment: austin_diskbound_read.svg

For reference, a plot of reads from a much larger than memory dataset against 
the austin cluster. There's a lot of variability, but no appreciable difference 
between the two, which is what I would expect (but good to confirm)

 More-efficient ExecutorService for improved throughput
 --

 Key: CASSANDRA-4718
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: Benedict
Priority: Minor
  Labels: performance
 Fix For: 2.1.0

 Attachments: 4718-v1.patch, PerThreadQueue.java, 
 austin_diskbound_read.svg, aws.svg, aws_read.svg, 
 backpressure-stress.out.txt, baq vs trunk.png, 
 belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
 jason_write.svg, op costs of various queues.ods, stress op rate with various 
 queues.ods, stress_2014May15.txt, stress_2014May16.txt, v1-stress.out


 Currently all our execution stages dequeue tasks one at a time.  This can 
 result in contention between producers and consumers (although we do our best 
 to minimize this by using LinkedBlockingQueue).
 One approach to mitigating this would be to make consumer threads do more 
 work in bulk instead of just one task per dequeue.  (Producer threads tend 
 to be single-task oriented by nature, so I don't see an equivalent 
 opportunity there.)
 BlockingQueue has a drainTo(collection, int) method that would be perfect for 
 this.  However, no ExecutorService in the jdk supports using drainTo, nor 
 could I google one.
 What I would like to do here is create just such a beast and wire it into (at 
 least) the write and read stages.  (Other possible candidates for such an 
 optimization, such as the CommitLog and OutboundTCPConnection, are not 
 ExecutorService-based and will need to be one-offs.)
 AbstractExecutorService may be useful.  The implementations of 
 ICommitLogExecutorService may also be useful. (Despite the name these are not 
 actual ExecutorServices, although they share the most important properties of 
 one.)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-15 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-4718:


Attachment: jason_write.svg
jason_read_latency.svg
jason_read.svg
aws.svg

I've attached read throughput graphs for a run on a 2 node AWS c3.8xlarge with 
1 load generating node (i've ignored write throughputs for now as the local 
disks are atrocious, so the numbers are pretty meaningless and all over the 
place), and graphs for Jason's runs read/write runs, including the read 
latencies (most representative of networking as they should all in-memory 
operations)

It looks like on modern/patched kernels and/or hyperthreaded modern CPUs the 
lse-batchnetty patch wins hands down - latency and throughput are all 
substantially better. I'll now try to see if a little bit of tuning can yield 
better results for all systems, as the prior version of the LSE branch was 
hitting the same throughput as 2.1-batchnetty does on our internal 
no-hyperthreading 2.6 clusters. So if we can bring that back up the case should 
be pretty compelling.

 More-efficient ExecutorService for improved throughput
 --

 Key: CASSANDRA-4718
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: Jason Brown
Priority: Minor
  Labels: performance
 Fix For: 2.1.0

 Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
 backpressure-stress.out.txt, baq vs trunk.png, 
 belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
 jason_write.svg, op costs of various queues.ods, stress op rate with various 
 queues.ods, v1-stress.out


 Currently all our execution stages dequeue tasks one at a time.  This can 
 result in contention between producers and consumers (although we do our best 
 to minimize this by using LinkedBlockingQueue).
 One approach to mitigating this would be to make consumer threads do more 
 work in bulk instead of just one task per dequeue.  (Producer threads tend 
 to be single-task oriented by nature, so I don't see an equivalent 
 opportunity there.)
 BlockingQueue has a drainTo(collection, int) method that would be perfect for 
 this.  However, no ExecutorService in the jdk supports using drainTo, nor 
 could I google one.
 What I would like to do here is create just such a beast and wire it into (at 
 least) the write and read stages.  (Other possible candidates for such an 
 optimization, such as the CommitLog and OutboundTCPConnection, are not 
 ExecutorService-based and will need to be one-offs.)
 AbstractExecutorService may be useful.  The implementations of 
 ICommitLogExecutorService may also be useful. (Despite the name these are not 
 actual ExecutorServices, although they share the most important properties of 
 one.)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-15 Thread Jason Brown (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Brown updated CASSANDRA-4718:
---

Attachment: belliotsmith_branches-stress.out.txt

attached file (belliotsmith_branches.out) are my results from running the 
latest branches on the same hardware I described above.

 More-efficient ExecutorService for improved throughput
 --

 Key: CASSANDRA-4718
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: Jason Brown
Priority: Minor
  Labels: performance
 Fix For: 2.1.0

 Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
 backpressure-stress.out.txt, baq vs trunk.png, 
 belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
 jason_write.svg, op costs of various queues.ods, stress op rate with various 
 queues.ods, v1-stress.out


 Currently all our execution stages dequeue tasks one at a time.  This can 
 result in contention between producers and consumers (although we do our best 
 to minimize this by using LinkedBlockingQueue).
 One approach to mitigating this would be to make consumer threads do more 
 work in bulk instead of just one task per dequeue.  (Producer threads tend 
 to be single-task oriented by nature, so I don't see an equivalent 
 opportunity there.)
 BlockingQueue has a drainTo(collection, int) method that would be perfect for 
 this.  However, no ExecutorService in the jdk supports using drainTo, nor 
 could I google one.
 What I would like to do here is create just such a beast and wire it into (at 
 least) the write and read stages.  (Other possible candidates for such an 
 optimization, such as the CommitLog and OutboundTCPConnection, are not 
 ExecutorService-based and will need to be one-offs.)
 AbstractExecutorService may be useful.  The implementations of 
 ICommitLogExecutorService may also be useful. (Despite the name these are not 
 actual ExecutorServices, although they share the most important properties of 
 one.)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-14 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-4718:


Attachment: aws_read.svg

For those interested, comparison of the latest version versus 2.1-batchnetty 
(dropping other branches for comparison now).


 More-efficient ExecutorService for improved throughput
 --

 Key: CASSANDRA-4718
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: Benedict
Priority: Minor
  Labels: performance
 Fix For: 2.1.0

 Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
 aws_read.svg, backpressure-stress.out.txt, baq vs trunk.png, 
 belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
 jason_write.svg, op costs of various queues.ods, stress op rate with various 
 queues.ods, v1-stress.out


 Currently all our execution stages dequeue tasks one at a time.  This can 
 result in contention between producers and consumers (although we do our best 
 to minimize this by using LinkedBlockingQueue).
 One approach to mitigating this would be to make consumer threads do more 
 work in bulk instead of just one task per dequeue.  (Producer threads tend 
 to be single-task oriented by nature, so I don't see an equivalent 
 opportunity there.)
 BlockingQueue has a drainTo(collection, int) method that would be perfect for 
 this.  However, no ExecutorService in the jdk supports using drainTo, nor 
 could I google one.
 What I would like to do here is create just such a beast and wire it into (at 
 least) the write and read stages.  (Other possible candidates for such an 
 optimization, such as the CommitLog and OutboundTCPConnection, are not 
 ExecutorService-based and will need to be one-offs.)
 AbstractExecutorService may be useful.  The implementations of 
 ICommitLogExecutorService may also be useful. (Despite the name these are not 
 actual ExecutorServices, although they share the most important properties of 
 one.)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-13 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-4718:


Reviewer: Jason Brown  (was: Benedict)

 More-efficient ExecutorService for improved throughput
 --

 Key: CASSANDRA-4718
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: Benedict
Priority: Minor
  Labels: performance
 Fix For: 2.1.0

 Attachments: 4718-v1.patch, PerThreadQueue.java, aws.svg, 
 backpressure-stress.out.txt, baq vs trunk.png, 
 belliotsmith_branches-stress.out.txt, jason_read.svg, jason_read_latency.svg, 
 jason_write.svg, op costs of various queues.ods, stress op rate with various 
 queues.ods, v1-stress.out


 Currently all our execution stages dequeue tasks one at a time.  This can 
 result in contention between producers and consumers (although we do our best 
 to minimize this by using LinkedBlockingQueue).
 One approach to mitigating this would be to make consumer threads do more 
 work in bulk instead of just one task per dequeue.  (Producer threads tend 
 to be single-task oriented by nature, so I don't see an equivalent 
 opportunity there.)
 BlockingQueue has a drainTo(collection, int) method that would be perfect for 
 this.  However, no ExecutorService in the jdk supports using drainTo, nor 
 could I google one.
 What I would like to do here is create just such a beast and wire it into (at 
 least) the write and read stages.  (Other possible candidates for such an 
 optimization, such as the CommitLog and OutboundTCPConnection, are not 
 ExecutorService-based and will need to be one-offs.)
 AbstractExecutorService may be useful.  The implementations of 
 ICommitLogExecutorService may also be useful. (Despite the name these are not 
 actual ExecutorServices, although they share the most important properties of 
 one.)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-05-01 Thread Jason Brown (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Brown updated CASSANDRA-4718:
---

Attachment: backpressure-stress.out.txt

I've run both of [~benedict]'s patches on my test cluster, and the results are 
quite hopeful (see backpressure-stress.out). Short story is anything we do for 
the native protocol will at least double the throughput and halve the latency, 
but that's the low end of the improvements.

Benedict's second patch is about 5-10% faster than his first, and both are 
slightly slower than my original patch. As I think it's important to have back 
pressure for any pool, at a minimum I think we should go with [~benedict]'s 
first patch. I like the ideas in the second patch, but would like another day 
to digest the implementation.

 More-efficient ExecutorService for improved throughput
 --

 Key: CASSANDRA-4718
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: Jason Brown
Priority: Minor
  Labels: performance
 Fix For: 2.1.0

 Attachments: 4718-v1.patch, PerThreadQueue.java, 
 backpressure-stress.out.txt, baq vs trunk.png, op costs of various 
 queues.ods, stress op rate with various queues.ods, v1-stress.out


 Currently all our execution stages dequeue tasks one at a time.  This can 
 result in contention between producers and consumers (although we do our best 
 to minimize this by using LinkedBlockingQueue).
 One approach to mitigating this would be to make consumer threads do more 
 work in bulk instead of just one task per dequeue.  (Producer threads tend 
 to be single-task oriented by nature, so I don't see an equivalent 
 opportunity there.)
 BlockingQueue has a drainTo(collection, int) method that would be perfect for 
 this.  However, no ExecutorService in the jdk supports using drainTo, nor 
 could I google one.
 What I would like to do here is create just such a beast and wire it into (at 
 least) the write and read stages.  (Other possible candidates for such an 
 optimization, such as the CommitLog and OutboundTCPConnection, are not 
 ExecutorService-based and will need to be one-offs.)
 AbstractExecutorService may be useful.  The implementations of 
 ICommitLogExecutorService may also be useful. (Despite the name these are not 
 actual ExecutorServices, although they share the most important properties of 
 one.)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-04-22 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-4718:


Reviewer: Benedict  (was: Pavel Yaskevich)

 More-efficient ExecutorService for improved throughput
 --

 Key: CASSANDRA-4718
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: Jason Brown
Priority: Minor
  Labels: performance
 Fix For: 2.1

 Attachments: 4718-v1.patch, PerThreadQueue.java, baq vs trunk.png, op 
 costs of various queues.ods, stress op rate with various queues.ods, 
 v1-stress.out


 Currently all our execution stages dequeue tasks one at a time.  This can 
 result in contention between producers and consumers (although we do our best 
 to minimize this by using LinkedBlockingQueue).
 One approach to mitigating this would be to make consumer threads do more 
 work in bulk instead of just one task per dequeue.  (Producer threads tend 
 to be single-task oriented by nature, so I don't see an equivalent 
 opportunity there.)
 BlockingQueue has a drainTo(collection, int) method that would be perfect for 
 this.  However, no ExecutorService in the jdk supports using drainTo, nor 
 could I google one.
 What I would like to do here is create just such a beast and wire it into (at 
 least) the write and read stages.  (Other possible candidates for such an 
 optimization, such as the CommitLog and OutboundTCPConnection, are not 
 ExecutorService-based and will need to be one-offs.)
 AbstractExecutorService may be useful.  The implementations of 
 ICommitLogExecutorService may also be useful. (Despite the name these are not 
 actual ExecutorServices, although they share the most important properties of 
 one.)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2014-04-14 Thread Jason Brown (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Brown updated CASSANDRA-4718:
---

Attachment: 4718-v1.patch
v1-stress.out

After several month hiatus, digging into this again. This time around, though, 
I have real hardware to test on, and thus the results are more consistent 
across executions (no cloud provider intermittently throttling me). 

Testing the thrift interface (both sync and hsha), the short story is 
throughput is up ~20% vs. TPE, and 95% / 99%iles are down 40-60%. The 99.9%ile, 
however, is a bit tricker. In some of my tests it is down almost 80%, and 
sometimes it is up 40-50%. I need to dig in further to understand what is going 
on (not sure if it’s because of a shared env, reading across numa cores, and so 
on). perf and likwid are my friends in this investigation.

As to testing the native protocol interface, I’ve only tested writes (new 2.1 
stress seems broken on reads) and I get double the throughput and 40-50% lower 
latencies across the board.  

My test cluster consists of three machines, 32 cores each, 2 sockets (2 numa 
cores), 132G memory, 2.6.39 kernel, plus a similar box that generates the load.

A couple of notes about this patch:
* RequestThreadPoolExecutor now decorates a FJP. Previously we had a TPE which 
contains, of course, a (bounded) queue. The bounded queue helped with back 
pressure from incoming requests. By using a FJP, there is no queue to help with 
back pressure as the FJP always enqueue a task (without blocking). Not sure if 
we still want/need that back pressure here.
* As ForkJoinPool doesn’t expose much in terms of use metrics (like total 
completed) compared to ThreadPoolExecutor, the ForkJoinPoolMetrics is similarly 
barren. Not sure if we want to capture this on our own in DFJP or something 
like else. 
* I have made similar FJP changes to the disruptor-thrift library, and once 
this patch is committed, I’ll work with Pavel to make the changes over there 
and pull in the updated jar.

As a side note, looks like the quasar project 
(http://docs.paralleluniverse.co/quasar/) indicates the jsr166e jar has some 
optimizations (http://blog.paralleluniverse.co/2013/05/02/quasar-pulsar/) over 
the jdk7 implementation (that are included in jdk8). I pulled in those changes 
and stress tested, but didn’t see much of a difference for our use case. I can, 
however, pull them in again if any one feels strongly.


 More-efficient ExecutorService for improved throughput
 --

 Key: CASSANDRA-4718
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: Jason Brown
Priority: Minor
  Labels: performance
 Fix For: 2.1

 Attachments: 4718-v1.patch, PerThreadQueue.java, baq vs trunk.png, op 
 costs of various queues.ods, stress op rate with various queues.ods, 
 v1-stress.out


 Currently all our execution stages dequeue tasks one at a time.  This can 
 result in contention between producers and consumers (although we do our best 
 to minimize this by using LinkedBlockingQueue).
 One approach to mitigating this would be to make consumer threads do more 
 work in bulk instead of just one task per dequeue.  (Producer threads tend 
 to be single-task oriented by nature, so I don't see an equivalent 
 opportunity there.)
 BlockingQueue has a drainTo(collection, int) method that would be perfect for 
 this.  However, no ExecutorService in the jdk supports using drainTo, nor 
 could I google one.
 What I would like to do here is create just such a beast and wire it into (at 
 least) the write and read stages.  (Other possible candidates for such an 
 optimization, such as the CommitLog and OutboundTCPConnection, are not 
 ExecutorService-based and will need to be one-offs.)
 AbstractExecutorService may be useful.  The implementations of 
 ICommitLogExecutorService may also be useful. (Despite the name these are not 
 actual ExecutorServices, although they share the most important properties of 
 one.)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2013-10-07 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-4718:


Attachment: stress op rate with various queues.ods

 More-efficient ExecutorService for improved throughput
 --

 Key: CASSANDRA-4718
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: Jason Brown
Priority: Minor
  Labels: performance
 Attachments: baq vs trunk.png, op costs of various queues.ods, 
 PerThreadQueue.java, stress op rate with various queues.ods


 Currently all our execution stages dequeue tasks one at a time.  This can 
 result in contention between producers and consumers (although we do our best 
 to minimize this by using LinkedBlockingQueue).
 One approach to mitigating this would be to make consumer threads do more 
 work in bulk instead of just one task per dequeue.  (Producer threads tend 
 to be single-task oriented by nature, so I don't see an equivalent 
 opportunity there.)
 BlockingQueue has a drainTo(collection, int) method that would be perfect for 
 this.  However, no ExecutorService in the jdk supports using drainTo, nor 
 could I google one.
 What I would like to do here is create just such a beast and wire it into (at 
 least) the write and read stages.  (Other possible candidates for such an 
 optimization, such as the CommitLog and OutboundTCPConnection, are not 
 ExecutorService-based and will need to be one-offs.)
 AbstractExecutorService may be useful.  The implementations of 
 ICommitLogExecutorService may also be useful. (Despite the name these are not 
 actual ExecutorServices, although they share the most important properties of 
 one.)



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2013-10-02 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-4718:


Attachment: op costs of various queues.ods

 More-efficient ExecutorService for improved throughput
 --

 Key: CASSANDRA-4718
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Priority: Minor
  Labels: performance
 Attachments: baq vs trunk.png, op costs of various queues.ods, 
 PerThreadQueue.java


 Currently all our execution stages dequeue tasks one at a time.  This can 
 result in contention between producers and consumers (although we do our best 
 to minimize this by using LinkedBlockingQueue).
 One approach to mitigating this would be to make consumer threads do more 
 work in bulk instead of just one task per dequeue.  (Producer threads tend 
 to be single-task oriented by nature, so I don't see an equivalent 
 opportunity there.)
 BlockingQueue has a drainTo(collection, int) method that would be perfect for 
 this.  However, no ExecutorService in the jdk supports using drainTo, nor 
 could I google one.
 What I would like to do here is create just such a beast and wire it into (at 
 least) the write and read stages.  (Other possible candidates for such an 
 optimization, such as the CommitLog and OutboundTCPConnection, are not 
 ExecutorService-based and will need to be one-offs.)
 AbstractExecutorService may be useful.  The implementations of 
 ICommitLogExecutorService may also be useful. (Despite the name these are not 
 actual ExecutorServices, although they share the most important properties of 
 one.)



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2013-09-23 Thread Jonathan Ellis (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Ellis updated CASSANDRA-4718:
--

Labels: performance  (was: )

 More-efficient ExecutorService for improved throughput
 --

 Key: CASSANDRA-4718
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Priority: Minor
  Labels: performance
 Attachments: baq vs trunk.png, PerThreadQueue.java


 Currently all our execution stages dequeue tasks one at a time.  This can 
 result in contention between producers and consumers (although we do our best 
 to minimize this by using LinkedBlockingQueue).
 One approach to mitigating this would be to make consumer threads do more 
 work in bulk instead of just one task per dequeue.  (Producer threads tend 
 to be single-task oriented by nature, so I don't see an equivalent 
 opportunity there.)
 BlockingQueue has a drainTo(collection, int) method that would be perfect for 
 this.  However, no ExecutorService in the jdk supports using drainTo, nor 
 could I google one.
 What I would like to do here is create just such a beast and wire it into (at 
 least) the write and read stages.  (Other possible candidates for such an 
 optimization, such as the CommitLog and OutboundTCPConnection, are not 
 ExecutorService-based and will need to be one-offs.)
 AbstractExecutorService may be useful.  The implementations of 
 ICommitLogExecutorService may also be useful. (Despite the name these are not 
 actual ExecutorServices, although they share the most important properties of 
 one.)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2013-02-12 Thread Ryan McGuire (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan McGuire updated CASSANDRA-4718:


Attachment: baq vs trunk.png

I ran a performance test comparing jonathan's baq branch with trunk and saw 
about a substantial increase in read performance.

 More-efficient ExecutorService for improved throughput
 --

 Key: CASSANDRA-4718
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Priority: Minor
 Attachments: baq vs trunk.png, PerThreadQueue.java


 Currently all our execution stages dequeue tasks one at a time.  This can 
 result in contention between producers and consumers (although we do our best 
 to minimize this by using LinkedBlockingQueue).
 One approach to mitigating this would be to make consumer threads do more 
 work in bulk instead of just one task per dequeue.  (Producer threads tend 
 to be single-task oriented by nature, so I don't see an equivalent 
 opportunity there.)
 BlockingQueue has a drainTo(collection, int) method that would be perfect for 
 this.  However, no ExecutorService in the jdk supports using drainTo, nor 
 could I google one.
 What I would like to do here is create just such a beast and wire it into (at 
 least) the write and read stages.  (Other possible candidates for such an 
 optimization, such as the CommitLog and OutboundTCPConnection, are not 
 ExecutorService-based and will need to be one-offs.)
 AbstractExecutorService may be useful.  The implementations of 
 ICommitLogExecutorService may also be useful. (Despite the name these are not 
 actual ExecutorServices, although they share the most important properties of 
 one.)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2012-12-06 Thread Vijay (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vijay updated CASSANDRA-4718:
-

Attachment: PerThreadQueue.java

How about Per thread queue? was curious about this ticket and ran a test and 
actually it performs better than Executors.newFixedThreadPool...

I have to also note that the stages content when we have a lot of CPU's (was 
testing with 64 core machines)

 More-efficient ExecutorService for improved throughput
 --

 Key: CASSANDRA-4718
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Priority: Minor
 Attachments: PerThreadQueue.java


 Currently all our execution stages dequeue tasks one at a time.  This can 
 result in contention between producers and consumers (although we do our best 
 to minimize this by using LinkedBlockingQueue).
 One approach to mitigating this would be to make consumer threads do more 
 work in bulk instead of just one task per dequeue.  (Producer threads tend 
 to be single-task oriented by nature, so I don't see an equivalent 
 opportunity there.)
 BlockingQueue has a drainTo(collection, int) method that would be perfect for 
 this.  However, no ExecutorService in the jdk supports using drainTo, nor 
 could I google one.
 What I would like to do here is create just such a beast and wire it into (at 
 least) the write and read stages.  (Other possible candidates for such an 
 optimization, such as the CommitLog and OutboundTCPConnection, are not 
 ExecutorService-based and will need to be one-offs.)
 AbstractExecutorService may be useful.  The implementations of 
 ICommitLogExecutorService may also be useful. (Despite the name these are not 
 actual ExecutorServices, although they share the most important properties of 
 one.)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2012-12-06 Thread Vijay (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vijay updated CASSANDRA-4718:
-

Attachment: PerThreadQueue.java

 More-efficient ExecutorService for improved throughput
 --

 Key: CASSANDRA-4718
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Priority: Minor
 Attachments: PerThreadQueue.java


 Currently all our execution stages dequeue tasks one at a time.  This can 
 result in contention between producers and consumers (although we do our best 
 to minimize this by using LinkedBlockingQueue).
 One approach to mitigating this would be to make consumer threads do more 
 work in bulk instead of just one task per dequeue.  (Producer threads tend 
 to be single-task oriented by nature, so I don't see an equivalent 
 opportunity there.)
 BlockingQueue has a drainTo(collection, int) method that would be perfect for 
 this.  However, no ExecutorService in the jdk supports using drainTo, nor 
 could I google one.
 What I would like to do here is create just such a beast and wire it into (at 
 least) the write and read stages.  (Other possible candidates for such an 
 optimization, such as the CommitLog and OutboundTCPConnection, are not 
 ExecutorService-based and will need to be one-offs.)
 AbstractExecutorService may be useful.  The implementations of 
 ICommitLogExecutorService may also be useful. (Despite the name these are not 
 actual ExecutorServices, although they share the most important properties of 
 one.)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput

2012-12-06 Thread Vijay (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vijay updated CASSANDRA-4718:
-

Attachment: (was: PerThreadQueue.java)

 More-efficient ExecutorService for improved throughput
 --

 Key: CASSANDRA-4718
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Priority: Minor
 Attachments: PerThreadQueue.java


 Currently all our execution stages dequeue tasks one at a time.  This can 
 result in contention between producers and consumers (although we do our best 
 to minimize this by using LinkedBlockingQueue).
 One approach to mitigating this would be to make consumer threads do more 
 work in bulk instead of just one task per dequeue.  (Producer threads tend 
 to be single-task oriented by nature, so I don't see an equivalent 
 opportunity there.)
 BlockingQueue has a drainTo(collection, int) method that would be perfect for 
 this.  However, no ExecutorService in the jdk supports using drainTo, nor 
 could I google one.
 What I would like to do here is create just such a beast and wire it into (at 
 least) the write and read stages.  (Other possible candidates for such an 
 optimization, such as the CommitLog and OutboundTCPConnection, are not 
 ExecutorService-based and will need to be one-offs.)
 AbstractExecutorService may be useful.  The implementations of 
 ICommitLogExecutorService may also be useful. (Despite the name these are not 
 actual ExecutorServices, although they share the most important properties of 
 one.)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira