[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-06-28 Thread Vijay (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046976#comment-14046976
 ] 

Vijay edited comment on CASSANDRA-7438 at 6/28/14 9:50 PM:
---

Pushed a new project to GitHub, https://github.com/Vijay2win/lruc, including
benchmark utils. I can either move the code into the Cassandra repo or use it
as a library in Cassandra (working on it).


was (Author: vijay2...@yahoo.com):
Pushed a new project in github https://github.com/Vijay2win/lruc, including 
benchmark utils. I can move the code to Cassandra repo or use it as a library 
in Cassandra (Working on it).

> Serializing Row cache alternative (Fully off heap)
> --
>
> Key: CASSANDRA-7438
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
> Environment: Linux
>Reporter: Vijay
>Assignee: Vijay
>  Labels: performance
> Fix For: 3.0
>
>
> Currently SerializingCache is only partially off heap; keys are still stored
> in the JVM heap as ByteBuffers.
> * There are higher GC costs for a reasonably big cache.
> * Some users have used the row cache efficiently in production for better
> results, but this requires careful tuning.
> * The memory overhead for the cache entries is relatively high.
> So the proposal for this ticket is to move the LRU cache logic completely off
> heap and use JNI to interact with the cache. We might want to ensure that the
> new implementation matches the existing API (ICache), and the implementation
> needs to have safe memory access, low memory overhead and as few memcpys as
> possible.
> We might also want to make this cache configurable.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-07-21 Thread Vijay (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069862#comment-14069862
 ] 

Vijay edited comment on CASSANDRA-7438 at 7/22/14 5:49 AM:
---

Attached patch makes the off-heap/SerializationCache choice configurable (the
default is still SerializationCache).

Regarding performance: the new cache is clearly better when the JNI overhead is
less than the GC overhead, but for smaller caches that fit in memory its
performance is a little lower, which is understandable (both of them outperform
page-cache performance by a large margin). Here are the numbers.

*OffheapCacheProvider*
{panel}
Running READ with 1200 threads  for 1000 iterations
ops, op/s, key/s, mean, med, .95, .99, .999, max, time, stderr
2030355, 2029531, 2029531, 3.1, 3.1, 5.4, 5.7, 61.8, 3014.5, 1.0, 0.0
2395480, 202845, 202845, 5.8, 5.4, 5.8, 20.2, 522.4, 545.9, 2.8, 0.0
2638600, 221368, 221368, 5.4, 5.3, 5.8, 16.3, 78.8, 131.5, 3.9, 0.57860
2891705, 221976, 221976, 5.4, 5.3, 5.6, 6.2, 15.2, 19.2, 5.0, 0.60478
3147747, 222527, 222527, 5.4, 5.3, 5.6, 6.1, 15.4, 18.2, 6.2, 0.58659
3394999, 221527, 221527, 5.4, 5.3, 5.6, 6.6, 15.9, 19.4, 7.3, 0.55884
3663559, 226114, 226114, 5.3, 5.2, 5.6, 15.0, 84.4, 110.7, 8.5, 0.52924
3911154, 223831, 223831, 5.4, 5.3, 5.6, 6.1, 15.6, 20.0, 9.6, 0.50018
4152946, 223246, 223246, 5.4, 5.3, 5.6, 6.1, 15.7, 18.8, 10.7, 0.47323
4403162, 228532, 228532, 5.2, 5.2, 5.6, 23.2, 107.4, 121.4, 11.8, 0.44856
4641021, 225196, 225196, 5.3, 5.2, 5.6, 5.9, 15.3, 18.4, 12.8, 0.42557
4889523, 222826, 222826, 5.4, 5.3, 5.6, 6.3, 16.2, 22.0, 13.9, 0.40476
5124891, 223203, 223203, 5.4, 5.3, 5.6, 5.8, 6.2, 14.8, 15.0, 0.38602
5375262, 221222, 221222, 5.4, 5.2, 5.6, 18.4, 94.2, 115.1, 16.1, 0.36899
5616470, 224022, 224022, 5.4, 5.3, 5.6, 5.9, 14.3, 17.8, 17.2, 0.35349
5866825, 223000, 223000, 5.4, 5.3, 5.6, 6.1, 15.5, 18.2, 18.3, 0.33882
6125601, 225757, 225757, 5.2, 5.3, 5.6, 9.6, 49.4, 72.0, 19.5, 0.32535
6348030, 192703, 192703, 6.3, 5.3, 9.3, 14.4, 77.1, 91.5, 20.6, 0.31282
6483574, 128520, 128520, 9.3, 8.4, 10.9, 19.5, 88.7, 99.0, 21.7, 0.30329
6626176, 137199, 137199, 8.7, 8.4, 10.6, 14.0, 32.7, 40.9, 22.7, 0.29771
6768401, 136860, 136860, 8.8, 8.4, 10.3, 14.1, 35.1, 40.8, 23.8, 0.29213
6911785, 138204, 138204, 8.7, 8.3, 10.2, 13.7, 34.1, 37.8, 24.8, 0.28669
7055951, 138633, 138633, 8.7, 8.3, 10.5, 32.0, 40.5, 46.9, 25.8, 0.28130
7199084, 137731, 137731, 8.7, 8.4, 10.2, 14.0, 33.4, 40.9, 26.9, 0.27623
7338032, 133201, 133201, 9.0, 8.4, 10.9, 34.0, 39.4, 43.8, 27.9, 0.27116
7480439, 137059, 137059, 8.8, 8.4, 10.2, 13.9, 35.9, 39.5, 29.0, 0.26663
7647810, 161209, 161209, 7.5, 7.8, 9.6, 13.4, 33.9, 77.9, 30.0, 0.26185
7898882, 226498, 226498, 5.3, 5.2, 5.6, 19.7, 108.5, 119.3, 31.1, 0.25629
8136305, 223840, 223840, 5.4, 5.3, 5.6, 5.9, 17.3, 23.2, 32.2, 0.24838
8372076, 223790, 223790, 5.4, 5.3, 5.6, 6.0, 15.2, 20.0, 33.2, 0.24095
8633758, 232914, 232914, 5.1, 5.2, 5.6, 17.5, 138.4, 182.0, 34.4, 0.23397
8869214, 43, 43, 5.4, 5.3, 5.6, 6.0, 15.2, 17.9, 35.4, 0.22717
9121652, 223037, 223037, 5.4, 5.3, 5.6, 5.9, 15.4, 18.8, 36.5, 0.22105
9360286, 225070, 225070, 5.3, 5.3, 5.6, 14.8, 82.7, 92.1, 37.6, 0.21524
9609676, 224089, 224089, 5.4, 5.3, 5.6, 5.8, 6.2, 14.3, 38.7, 0.20967
9848551, 222123, 222123, 5.4, 5.3, 5.6, 5.9, 24.2, 27.2, 39.8, 0.20440
1000, 229511, 229511, 5.0, 5.2, 5.8, 60.0, 74.3, 132.0, 40.5, 0.19935


Results:
real op rate  : 247211
adjusted op rate stderr   : 0
key rate  : 247211
latency mean  : 5.4
latency median: 3.5
latency 95th percentile   : 5.5
latency 99th percentile   : 6.1
latency 99.9th percentile : 83.4
latency max 

[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-07-21 Thread Vijay (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069862#comment-14069862
 ] 

Vijay edited comment on CASSANDRA-7438 at 7/22/14 5:54 AM:
---

Attached patch makes the off-heap/SerializationCache choice configurable (the
default is still SerializationCache).

Regarding performance: the new cache is clearly better when the JNI overhead is
less than the GC overhead. For a smaller cache that can fit in the JVM heap, the
performance is a little lower, which is understandable (both of them outperform
page-cache performance by a large margin). Here are the numbers.

*OffheapCacheProvider*
{panel}
Running READ with 1200 threads  for 1000 iterations
ops, op/s, key/s, mean, med, .95, .99, .999, max, time, stderr
2030355, 2029531, 2029531, 3.1, 3.1, 5.4, 5.7, 61.8, 3014.5, 1.0, 0.0
2395480, 202845, 202845, 5.8, 5.4, 5.8, 20.2, 522.4, 545.9, 2.8, 0.0
2638600, 221368, 221368, 5.4, 5.3, 5.8, 16.3, 78.8, 131.5, 3.9, 0.57860
2891705, 221976, 221976, 5.4, 5.3, 5.6, 6.2, 15.2, 19.2, 5.0, 0.60478
3147747, 222527, 222527, 5.4, 5.3, 5.6, 6.1, 15.4, 18.2, 6.2, 0.58659
3394999, 221527, 221527, 5.4, 5.3, 5.6, 6.6, 15.9, 19.4, 7.3, 0.55884
3663559, 226114, 226114, 5.3, 5.2, 5.6, 15.0, 84.4, 110.7, 8.5, 0.52924
3911154, 223831, 223831, 5.4, 5.3, 5.6, 6.1, 15.6, 20.0, 9.6, 0.50018
4152946, 223246, 223246, 5.4, 5.3, 5.6, 6.1, 15.7, 18.8, 10.7, 0.47323
4403162, 228532, 228532, 5.2, 5.2, 5.6, 23.2, 107.4, 121.4, 11.8, 0.44856
4641021, 225196, 225196, 5.3, 5.2, 5.6, 5.9, 15.3, 18.4, 12.8, 0.42557
4889523, 222826, 222826, 5.4, 5.3, 5.6, 6.3, 16.2, 22.0, 13.9, 0.40476
5124891, 223203, 223203, 5.4, 5.3, 5.6, 5.8, 6.2, 14.8, 15.0, 0.38602
5375262, 221222, 221222, 5.4, 5.2, 5.6, 18.4, 94.2, 115.1, 16.1, 0.36899
5616470, 224022, 224022, 5.4, 5.3, 5.6, 5.9, 14.3, 17.8, 17.2, 0.35349
5866825, 223000, 223000, 5.4, 5.3, 5.6, 6.1, 15.5, 18.2, 18.3, 0.33882
6125601, 225757, 225757, 5.2, 5.3, 5.6, 9.6, 49.4, 72.0, 19.5, 0.32535
6348030, 192703, 192703, 6.3, 5.3, 9.3, 14.4, 77.1, 91.5, 20.6, 0.31282
6483574, 128520, 128520, 9.3, 8.4, 10.9, 19.5, 88.7, 99.0, 21.7, 0.30329
6626176, 137199, 137199, 8.7, 8.4, 10.6, 14.0, 32.7, 40.9, 22.7, 0.29771
6768401, 136860, 136860, 8.8, 8.4, 10.3, 14.1, 35.1, 40.8, 23.8, 0.29213
6911785, 138204, 138204, 8.7, 8.3, 10.2, 13.7, 34.1, 37.8, 24.8, 0.28669
7055951, 138633, 138633, 8.7, 8.3, 10.5, 32.0, 40.5, 46.9, 25.8, 0.28130
7199084, 137731, 137731, 8.7, 8.4, 10.2, 14.0, 33.4, 40.9, 26.9, 0.27623
7338032, 133201, 133201, 9.0, 8.4, 10.9, 34.0, 39.4, 43.8, 27.9, 0.27116
7480439, 137059, 137059, 8.8, 8.4, 10.2, 13.9, 35.9, 39.5, 29.0, 0.26663
7647810, 161209, 161209, 7.5, 7.8, 9.6, 13.4, 33.9, 77.9, 30.0, 0.26185
7898882, 226498, 226498, 5.3, 5.2, 5.6, 19.7, 108.5, 119.3, 31.1, 0.25629
8136305, 223840, 223840, 5.4, 5.3, 5.6, 5.9, 17.3, 23.2, 32.2, 0.24838
8372076, 223790, 223790, 5.4, 5.3, 5.6, 6.0, 15.2, 20.0, 33.2, 0.24095
8633758, 232914, 232914, 5.1, 5.2, 5.6, 17.5, 138.4, 182.0, 34.4, 0.23397
8869214, 43, 43, 5.4, 5.3, 5.6, 6.0, 15.2, 17.9, 35.4, 0.22717
9121652, 223037, 223037, 5.4, 5.3, 5.6, 5.9, 15.4, 18.8, 36.5, 0.22105
9360286, 225070, 225070, 5.3, 5.3, 5.6, 14.8, 82.7, 92.1, 37.6, 0.21524
9609676, 224089, 224089, 5.4, 5.3, 5.6, 5.8, 6.2, 14.3, 38.7, 0.20967
9848551, 222123, 222123, 5.4, 5.3, 5.6, 5.9, 24.2, 27.2, 39.8, 0.20440
1000, 229511, 229511, 5.0, 5.2, 5.8, 60.0, 74.3, 132.0, 40.5, 0.19935


Results:
real op rate  : 247211
adjusted op rate stderr   : 0
key rate  : 247211
latency mean  : 5.4
latency median: 3.5
latency 95th percentile   : 5.5
latency 99th percentile   : 6.1
latency 99.9th percentile : 83.4
latency max  

[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-07-24 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072935#comment-14072935
 ] 

Robert Stupp edited comment on CASSANDRA-7438 at 7/24/14 7:36 AM:
--

My username on GitHub is snazy.

Do you know {{org.codehaus.mojo:native-maven-plugin}}? It allows JNI
compilation on almost all platforms directly from Maven and does not interfere
with SWIG - I have used it on OS X, Linux, Windows and Solaris.
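
For context, the Java side of such a JNI build is just a class that loads the
compiled library and declares the native entry points; a minimal sketch with
hypothetical names (not the actual lruc API):

{code:java}
public final class LrucNative
{
    static
    {
        // Loads the platform-specific library (e.g. liblruc.so / liblruc.dylib)
        // that a native build like native-maven-plugin compiles and packages.
        System.loadLibrary("lruc");
    }

    // Hypothetical native entry points; the real lruc signatures may differ.
    public static native long create(long capacityBytes);
    public static native boolean put(long cachePtr, byte[] key, byte[] value);
    public static native byte[] get(long cachePtr, byte[] key);
    public static native void free(long cachePtr);

    private LrucNative() {}
}
{code}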


was (Author: snazy):
my username on github is snazy

Do you know {{org.codehaus.mojo:native-maven-plugin}}? It allows JNI 
compilation on almost all platforms directly from Maven and does not interfere 
with SWIG - have used it on OSX, Linux and Win.

> Serializing Row cache alternative (Fully off heap)
> --
>
> Key: CASSANDRA-7438
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
> Environment: Linux
>Reporter: Vijay
>Assignee: Vijay
>  Labels: performance
> Fix For: 3.0
>
> Attachments: 0001-CASSANDRA-7438.patch
>
>
> Currently SerializingCache is only partially off heap; keys are still stored
> in the JVM heap as ByteBuffers.
> * There are higher GC costs for a reasonably big cache.
> * Some users have used the row cache efficiently in production for better
> results, but this requires careful tuning.
> * The memory overhead for the cache entries is relatively high.
> So the proposal for this ticket is to move the LRU cache logic completely off
> heap and use JNI to interact with the cache. We might want to ensure that the
> new implementation matches the existing API (ICache), and the implementation
> needs to have safe memory access, low memory overhead and as few memcpys as
> possible.
> We might also want to make this cache configurable.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-07-27 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075611#comment-14075611
 ] 

Robert Stupp edited comment on CASSANDRA-7438 at 7/27/14 12:18 PM:
---

[~vijay2...@gmail.com] do you have a C* branch with lruc integrated? Or: what
should I do to bring lruc and C* together? Is the patch up to date?

I've pushed a new branch 'native-plugin' with the changes for
native-maven-plugin. It's separate from the other code and works for Linux and
OS X (depending on which machine it is built on). The Windows build is a bit
more complicated - it doesn't compile yet, so I have to dig a bit deeper. Maybe
delay the Windows port...


was (Author: snazy):
[~vijay2...@gmail.com] do you have a C* branch with lruc integrated? Or: what 
should I do to bring lruc+C* together? Is the patch up-to-date?

I've pushed a new branch 'native-plugin' with the changes for 
native-maven-plugin - separate from the other code. Windows stuff is bit more 
complicated - it doesn't compile. Have to dig a bit deeper. Maybe delay Win 
port...

> Serializing Row cache alternative (Fully off heap)
> --
>
> Key: CASSANDRA-7438
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
> Environment: Linux
>Reporter: Vijay
>Assignee: Vijay
>  Labels: performance
> Fix For: 3.0
>
> Attachments: 0001-CASSANDRA-7438.patch
>
>
> Currently SerializingCache is only partially off heap; keys are still stored
> in the JVM heap as ByteBuffers.
> * There are higher GC costs for a reasonably big cache.
> * Some users have used the row cache efficiently in production for better
> results, but this requires careful tuning.
> * The memory overhead for the cache entries is relatively high.
> So the proposal for this ticket is to move the LRU cache logic completely off
> heap and use JNI to interact with the cache. We might want to ensure that the
> new implementation matches the existing API (ICache), and the implementation
> needs to have safe memory access, low memory overhead and as few memcpys as
> possible.
> We might also want to make this cache configurable.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-11-22 Thread Vijay (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1495#comment-1495
 ] 

Vijay edited comment on CASSANDRA-7438 at 11/23/14 3:23 AM:


Alright, the first version of the pure-Java LRUCache has been pushed:
* Basically a port from the C version. (Most of the test cases pass, and they
are the same for both versions.)
* As Ariel mentioned before, we could use the disruptor for the ring buffer;
the current version doesn't use it yet.
* Proactive expiry in the queue thread is not implemented yet.
* The algorithm that triggers the rehash needs to be more configurable and
based on the capacity; will be pushing that soon.
* The overhead in the JVM heap is just the segments array, hence the cache
should be able to grow as much as the system can support (see the sketch
below).

https://github.com/Vijay2win/lruc/tree/master/src/main/java/com/lruc/unsafe
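
A minimal sketch of that layout, assuming Unsafe-based allocation - the only
on-heap state is the segments array, while each segment just remembers the
address of its off-heap bucket table (class names are illustrative, not the
actual lruc code):

{code:java}
import java.lang.reflect.Field;

import sun.misc.Unsafe;

final class Segment
{
    final long tableAddress;  // off-heap bucket table: one 8-byte entry pointer per bucket
    final int bucketCount;

    Segment(Unsafe unsafe, int bucketCount)
    {
        this.bucketCount = bucketCount;
        this.tableAddress = unsafe.allocateMemory(8L * bucketCount);
        unsafe.setMemory(tableAddress, 8L * bucketCount, (byte) 0);
    }
}

final class SegmentedOffHeapMap
{
    private static final Unsafe UNSAFE = loadUnsafe();

    // The only per-map on-heap overhead: the segments array itself.
    private final Segment[] segments;

    SegmentedOffHeapMap(int segmentCount, int bucketsPerSegment)
    {
        // segmentCount should be a power of two so the mask below works
        segments = new Segment[segmentCount];
        for (int i = 0; i < segmentCount; i++)
            segments[i] = new Segment(UNSAFE, bucketsPerSegment);
    }

    Segment segmentFor(long keyHash)
    {
        // pick a segment from the key's hash; all entry data lives off heap
        return segments[(int) (keyHash & (segments.length - 1))];
    }

    private static Unsafe loadUnsafe()
    {
        try
        {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        }
        catch (Exception e)
        {
            throw new AssertionError(e);
        }
    }
}
{code}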


was (Author: vijay2...@yahoo.com):
Alright the first version of pure Java version of LRUCache pushed, 
* Basically a port from the C version. (Most of the test cases pass and they 
are the same for both versions)
* As ariel mentioned before we can use disruptor for the ring buffer but this 
doesn't use it yet.
* Expiry in the queue thread is not implemented yet.
* Algorithm to start the rehash needs to be more configurable and based on the 
capacity will be pushing that soon.
* Overhead in JVM heap is just the segments array.

https://github.com/Vijay2win/lruc/tree/master/src/main/java/com/lruc/unsafe 

> Serializing Row cache alternative (Fully off heap)
> --
>
> Key: CASSANDRA-7438
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
> Environment: Linux
>Reporter: Vijay
>Assignee: Vijay
>  Labels: performance
> Fix For: 3.0
>
> Attachments: 0001-CASSANDRA-7438.patch
>
>
> Currently SerializingCache is only partially off heap; keys are still stored
> in the JVM heap as ByteBuffers.
> * There are higher GC costs for a reasonably big cache.
> * Some users have used the row cache efficiently in production for better
> results, but this requires careful tuning.
> * The memory overhead for the cache entries is relatively high.
> So the proposal for this ticket is to move the LRU cache logic completely off
> heap and use JNI to interact with the cache. We might want to ensure that the
> new implementation matches the existing API (ICache), and the implementation
> needs to have safe memory access, low memory overhead and as few memcpys as
> possible.
> We might also want to make this cache configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-11-24 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222799#comment-14222799
 ] 

Robert Stupp edited comment on CASSANDRA-7438 at 11/24/14 8:58 AM:
---

rehashing: growing (x2) is already implemented; shrinking (/2) shouldn't be a
big issue either. The implementation only locks the currently processed
partitions during rehash.
"put" operation: fixed (was definitely a bug); cleanup runs concurrently and is
triggered on an "out of memory" condition.
block sizes: will give it a try (fixed vs. different sizes vs. variable sized
(no blocks)).
per-partition locks: already thought about it - not sure whether it's worth the
additional RW-lock overhead, since partition lock time is very low during
normal operation.
metrics: some (very basic) metrics are already in it - will add some more timer
metrics (configurable).

[~vijay2...@yahoo.com] can you catch {{OutOfMemoryError}} for Unsafe.allocate()?
It should not go up the whole call stack as it does now, to prevent C* from
handling it as "Java heap full".


was (Author: snazy):
rehashing: growing (x2) is already implemented, shrinking (/2) shouldn't be a 
big issue, too. The implementation only locks the currently processed 
partitions during rehash.
"put" operation: fixed (was definitely a bug), cleanup is running concurrently 
and trigger on "out of memory" condition
block sizes: will give it a try (fixed vs. different sizes vs. variable sized 
(no blocks))
per-partition locks: already thought about it - not sure whether it's worth the 
additional RW-lock overhead since partition lock time is very low during normal 
operation
metrics: some (very basic) metrics are already in it - will add some more timer 
metrics (configurable)

[~vijay2...@yahoo.com] can you catch {{OutOfMemoryError}} for Unsafe.allocate() 
? It should not go up the whole call stack.

> Serializing Row cache alternative (Fully off heap)
> --
>
> Key: CASSANDRA-7438
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
> Environment: Linux
>Reporter: Vijay
>Assignee: Vijay
>  Labels: performance
> Fix For: 3.0
>
> Attachments: 0001-CASSANDRA-7438.patch
>
>
> Currently SerializingCache is only partially off heap; keys are still stored
> in the JVM heap as ByteBuffers.
> * There are higher GC costs for a reasonably big cache.
> * Some users have used the row cache efficiently in production for better
> results, but this requires careful tuning.
> * The memory overhead for the cache entries is relatively high.
> So the proposal for this ticket is to move the LRU cache logic completely off
> heap and use JNI to interact with the cache. We might want to ensure that the
> new implementation matches the existing API (ICache), and the implementation
> needs to have safe memory access, low memory overhead and as few memcpys as
> possible.
> We might also want to make this cache configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-11-25 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225478#comment-14225478
 ] 

Ariel Weisberg edited comment on CASSANDRA-7438 at 11/26/14 12:29 AM:
--

bq. if we don't like the constant overhead of the cache in heap and If you are 
talking about CAS which we already do for ref counting, as mentioned before we 
need an alternative strategy for global locks for rebalance if we go with lock 
less strategy.
Just take what you have and do it off heap. You don't need to change anything 
about how locking is done, just put the segments off heap so each segment would 
be a 4-byte lock field and an 8 byte pointer to the first entry. I am not clear 
on the alignment requirements for 4 or 8 byte CAS.

bq. Until you complete a rehash you don't know if you need to hash again or 
not... Am i missing something?
https://github.com/Vijay2win/lruc/blob/master/src/main/java/com/lruc/unsafe/UnsafeConcurrentMap.java#L38

The check on line 38 races with the assignment on line 39. N threads could do 
the check and think a rehash is necessary. Each would submit a rehash task and 
the table size would be doubled N times instead of 1 time.
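
One common fix for that race, sketched here as an assumption about how it
could be done (not the actual lruc code): guard rehash submission with a
single compare-and-set so that only one of the N racing threads ever schedules
the resize.

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.atomic.AtomicBoolean;

final class RehashGuard
{
    private final AtomicBoolean rehashPending = new AtomicBoolean(false);

    // Called by any thread that notices the load factor was exceeded.
    // Only the thread that wins the CAS submits the rehash task, so the
    // table is doubled once rather than N times.
    void maybeScheduleRehash(ExecutorService executor, Runnable rehashTask)
    {
        if (rehashPending.compareAndSet(false, true))
        {
            executor.execute(() -> {
                try
                {
                    rehashTask.run();
                }
                finally
                {
                    // allow the next grow once this one has completed
                    rehashPending.set(false);
                }
            });
        }
    }
}
{code}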


was (Author: aweisberg):
bq. if we don't like the constant overhead of the cache in heap and If you are 
talking about CAS which we already do for ref counting, as mentioned before we 
need an alternative strategy for global locks for rebalance if we go with lock 
less strategy.
Just take what you have and do it off heap. You don't need to change anything 
about how locking is done, just put the segments off heap so each segment would 
be a 4-byte lock field and an 8 byte pointer to the first entry.

bq. Until you complete a rehash you don't know if you need to hash again or 
not... Am i missing something?
https://github.com/Vijay2win/lruc/blob/master/src/main/java/com/lruc/unsafe/UnsafeConcurrentMap.java#L38

The check on line 38 races with the assignment on line 39. N threads could do 
the check and think a rehash is necessary. Each would submit a rehash task and 
the table size would be doubled N times instead of 1 time.

> Serializing Row cache alternative (Fully off heap)
> --
>
> Key: CASSANDRA-7438
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
> Environment: Linux
>Reporter: Vijay
>Assignee: Vijay
>  Labels: performance
> Fix For: 3.0
>
> Attachments: 0001-CASSANDRA-7438.patch
>
>
> Currently SerializingCache is only partially off heap; keys are still stored
> in the JVM heap as ByteBuffers.
> * There are higher GC costs for a reasonably big cache.
> * Some users have used the row cache efficiently in production for better
> results, but this requires careful tuning.
> * The memory overhead for the cache entries is relatively high.
> So the proposal for this ticket is to move the LRU cache logic completely off
> heap and use JNI to interact with the cache. We might want to ensure that the
> new implementation matches the existing API (ICache), and the implementation
> needs to have safe memory access, low memory overhead and as few memcpys as
> possible.
> We might also want to make this cache configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-11-26 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227248#comment-14227248
 ] 

Jonathan Ellis edited comment on CASSANDRA-7438 at 11/27/14 4:26 AM:
-

bq. The row cache can contain very large rows [partitions] AFAIK

Well, it *can*, but it's almost always a bad idea.  Not something we should 
optimize for.  (http://www.datastax.com/dev/blog/row-caching-in-cassandra-2-1)

bq. Does the storage engine always materialize entire rows [partitions] into 
memory for every query?

Only when it's pulling them from the off-heap cache.  (It deserializes onto the 
heap to filter out the requested results.)


was (Author: jbellis):
bq. The row cache can contain very large rows [partitions] AFAIK

Well, it *can*, but it's almost always a bad idea.  Not something we should 
optimize for.  (http://www.datastax.com/dev/blog/row-caching-in-cassandra-2-1)

bq. Does the storage engine always materialize entire rows [partitions] into 
memory for every query?

Only when it's pulling them from the off-heap cache.

> Serializing Row cache alternative (Fully off heap)
> --
>
> Key: CASSANDRA-7438
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
> Environment: Linux
>Reporter: Vijay
>Assignee: Vijay
>  Labels: performance
> Fix For: 3.0
>
> Attachments: 0001-CASSANDRA-7438.patch
>
>
> Currently SerializingCache is only partially off heap; keys are still stored
> in the JVM heap as ByteBuffers.
> * There are higher GC costs for a reasonably big cache.
> * Some users have used the row cache efficiently in production for better
> results, but this requires careful tuning.
> * The memory overhead for the cache entries is relatively high.
> So the proposal for this ticket is to move the LRU cache logic completely off
> heap and use JNI to interact with the cache. We might want to ensure that the
> new implementation matches the existing API (ICache), and the implementation
> needs to have safe memory access, low memory overhead and as few memcpys as
> possible.
> We might also want to make this cache configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-11-28 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228397#comment-14228397
 ] 

Benedict edited comment on CASSANDRA-7438 at 11/28/14 5:06 PM:
---

I suspect segmenting the table at a coarser granularity, so that each segment 
is maintained with mutual exclusivity, would achieve better percentiles in both 
cases due to keeping the maximum resize cost down. We could settle for a 
separate LRU-q per segment, even, to keep the complexity of this code down 
significantly - it is unlikely having a global LRU-q is significantly more 
accurate at predicting reuse than ~128 of them. It would also make it much 
easier to improve the replacement strategy beyond LRU, which would likely yield 
a bigger win for performance than any potential loss from reduced concurrency. 
The critical section for reads could be kept sufficiently small that 
competition would be very unlikely with the current state of C*, by performing 
the deserialization outside of it. There's a good chance this would yield a net 
positive performance impact, by reducing the cost per access without increasing 
the cost due to contention measurably (because contention would be infrequent).

edit: coarser, not finer. i.e., a la j.u.c.CHM
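
A sketch of the read path that keeps the critical section that small - copy
the serialized value under the segment's lock and deserialize only after
releasing it (the segment/serializer types here are assumptions for
illustration, not existing classes):

{code:java}
import java.nio.ByteBuffer;
import java.util.concurrent.locks.ReentrantLock;

final class CacheSegment<K, V>
{
    // Hypothetical hooks into the off-heap structure and the row serializer.
    interface Store<K> { ByteBuffer copySerialized(K key); /* also bumps LRU */ }
    interface Deserializer<V> { V deserialize(ByteBuffer bytes); }

    private final ReentrantLock lock = new ReentrantLock();

    V get(K key, Store<K> store, Deserializer<V> deserializer)
    {
        ByteBuffer serialized;
        lock.lock();
        try
        {
            // inside the lock: only the lookup, the LRU bump and a memcpy
            // of the serialized bytes onto the heap
            serialized = store.copySerialized(key);
        }
        finally
        {
            lock.unlock();
        }
        if (serialized == null)
            return null;
        // the expensive part - deserialization - happens outside the lock,
        // so contention on the segment stays infrequent
        return deserializer.deserialize(serialized);
    }
}
{code}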


was (Author: benedict):
I suspect segmenting the table at a finer granularity, so that each segment is 
maintained with mutual exclusivity, would achieve better percentiles in both 
cases due to keeping the maximum resize cost down. We could settle for a 
separate LRU-q per segment, even, to keep the complexity of this code down 
significantly - it is unlikely having a global LRU-q is significantly more 
accurate at predicting reuse than ~128 of them. It would also make it much 
easier to improve the replacement strategy beyond LRU, which would likely yield 
a bigger win for performance than any potential loss from reduced concurrency. 
The critical section for reads could be kept sufficiently small that 
competition would be very unlikely with the current state of C*, by performing 
the deserialization outside of it. There's a good chance this would yield a net 
positive performance impact, by reducing the cost per access without increasing 
the cost due to contention measurably (because contention would be infrequent).

> Serializing Row cache alternative (Fully off heap)
> --
>
> Key: CASSANDRA-7438
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
> Environment: Linux
>Reporter: Vijay
>Assignee: Vijay
>  Labels: performance
> Fix For: 3.0
>
> Attachments: 0001-CASSANDRA-7438.patch, tests.zip
>
>
> Currently SerializingCache is only partially off heap; keys are still stored
> in the JVM heap as ByteBuffers.
> * There are higher GC costs for a reasonably big cache.
> * Some users have used the row cache efficiently in production for better
> results, but this requires careful tuning.
> * The memory overhead for the cache entries is relatively high.
> So the proposal for this ticket is to move the LRU cache logic completely off
> heap and use JNI to interact with the cache. We might want to ensure that the
> new implementation matches the existing API (ICache), and the implementation
> needs to have safe memory access, low memory overhead and as few memcpys as
> possible.
> We might also want to make this cache configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-11-28 Thread Vijay (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228523#comment-14228523
 ] 

Vijay edited comment on CASSANDRA-7438 at 11/28/14 9:47 PM:


{quote}I would break out the performance comparison with and without warming up 
the cache so we know how it performs when you aren't measuring the resize 
pauses.{quote} 
Yep, and in steady state it is similar to a get; I have verified that the
latency is due to rehash. Better benchmarks on big machines will be done on
Monday.

Unfortunately -1 on partitions; it will be a lot more complex and will be hard
for users to understand. If we have to expand the partitions, we have to figure
out a better consistent hashing algorithm ("Cassandra within Cassandra",
maybe). Moreover, we will end up keeping the current code as is to move the
maps and queues off heap. Sorry, I don't understand the argument about code
complexity.

If we are talking about code complexity, the unsafe code is 1000 lines
including the license headers :)

The current contention topic is whether to use CAS for locks, which is showing
higher CPU cost, and I agree with Pavel that locks also show up in the
numbers.

was (Author: vijay2...@yahoo.com):
{quote}I would break out the performance comparison with and without warming up 
the cache so we know how it performs when you aren't measuring the resize 
pauses.{quote} 
Yep and in stedy state it is similar to get and I have verified that the 
latency is due to rehash. Better benchmarks on bug machines will be done on 
Monday.

Unfortunately -1 on partitions, it will be a lot more complex and will be hard 
to understand for users. If we have to expand the partitions, we have to figure 
out a better consistent hashing algo. "Cassandra within Cassandra may be". More 
over we will end up having the current code as is to move maps and queues 
offheap. Sorry I don't understand the argument of code complexity.

If we are talking about code complexity. The unsafe code is 1000 lines 
including the license headers :)

The current contention topic is weather to use cas for locks. Which is showing 
higher cpu cost and I agree with Pavel on latencies as shown in the numbers.

> Serializing Row cache alternative (Fully off heap)
> --
>
> Key: CASSANDRA-7438
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
> Environment: Linux
>Reporter: Vijay
>Assignee: Vijay
>  Labels: performance
> Fix For: 3.0
>
> Attachments: 0001-CASSANDRA-7438.patch, tests.zip
>
>
> Currently SerializingCache is only partially off heap; keys are still stored
> in the JVM heap as ByteBuffers.
> * There are higher GC costs for a reasonably big cache.
> * Some users have used the row cache efficiently in production for better
> results, but this requires careful tuning.
> * The memory overhead for the cache entries is relatively high.
> So the proposal for this ticket is to move the LRU cache logic completely off
> heap and use JNI to interact with the cache. We might want to ensure that the
> new implementation matches the existing API (ICache), and the implementation
> needs to have safe memory access, low memory overhead and as few memcpys as
> possible.
> We might also want to make this cache configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-11-28 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228563#comment-14228563
 ] 

Benedict edited comment on CASSANDRA-7438 at 11/29/14 12:23 AM:


[~aweisberg]: In my experience segments tend to be imperfectly distributed, so 
whilst there is bunching of resizes simply because they take so long, with real 
work going on at the same time they should be a _little_ spread out. Though 
with murmur3 the distribution may be significantly more uniform than my prior 
experiments. Either way, they're performed in parallel (without coordination) 
if they coincide, and are each a fraction of the size, so it's still an 
improvement.

[~vijay2...@yahoo.com]: When I talk about complexity, I mean the difficulties 
of concurrent programming magnified without the normal tools. For instance, 
there are the following concerns:

* We have a spin-lock - admittedly one that should _generally_ be uncontended, 
but on a grow or a small map this is certainly not the case, which could result 
in really problematic behaviour. Pure spin locks should not be used outside of 
the kernel. 
* The queue is maintained by a separate thread that requires signalling if it 
isn't currently performing work - which, in a real C* instance where the cost 
of linking the queue item is a fraction of the other work done to service a 
request means we are likely to incur a costly unpark() for a majority of 
operations
* Reads can interleave with put/replace/remove and abort the removal of an item 
from the queue, resulting in a memory leak. 
* We perform the grow on a separate thread, but prevent all reader _or_ writer 
threads from making progress by taking the locks for all buckets immediately.
* Freeing of oldSegments is still dangerous, it's just probabilistically less 
likely to happen.
* During a grow, we can lose puts because we unlock the old segments, so with 
the right (again, unlikely) interleaving of events a writer can think the old 
table is still valid
* When growing, we only double the size of the backing table, however since 
grows happen in the background the updater can get ahead, meaning we remain 
behind and multiply the constant factor overheads, collisions and contention 
until total size tails off.

These are only the obvious problems that spring to mind from 15 minutes
perusing the code; I'm sure there are others. This kind of stuff is really
hard, and the approach I'm suggesting is comparatively a doddle to get right,
and is likely faster to boot.

I'm not sure I understand your concern with segmentation creating complexity
with the hashing... I'm proposing the exact method used by CHM. We have an
excellent hash algorithm to distribute the data over the segments: murmurhash3
- although we need to be careful not to use the bits that don't have the
correct entropy for selecting a segment.

Think of it as simply implementing an off-heap LinkedHashMap, wrapping it in a 
lock, and having an array of them. The user doesn't need to know anything about 
this.
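
As a purely on-heap illustration of that mental model (not a proposal for the
off-heap code itself): an array of access-ordered LinkedHashMaps, each guarded
by its own lock and selected from the key's hash.

{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

final class SegmentedLruCache<K, V>
{
    private final LinkedHashMap<K, V>[] segments;

    @SuppressWarnings("unchecked")
    SegmentedLruCache(int segmentCount, final int entriesPerSegment)
    {
        // segmentCount should be a power of two so the mask below works
        segments = new LinkedHashMap[segmentCount];
        for (int i = 0; i < segmentCount; i++)
        {
            // accessOrder=true makes each map maintain its own LRU ordering
            segments[i] = new LinkedHashMap<K, V>(16, 0.75f, true)
            {
                protected boolean removeEldestEntry(Map.Entry<K, V> eldest)
                {
                    return size() > entriesPerSegment;
                }
            };
        }
    }

    public V get(K key)
    {
        LinkedHashMap<K, V> segment = segmentFor(key);
        synchronized (segment)
        {
            return segment.get(key);
        }
    }

    public void put(K key, V value)
    {
        LinkedHashMap<K, V> segment = segmentFor(key);
        synchronized (segment)
        {
            segment.put(key, value);
        }
    }

    private LinkedHashMap<K, V> segmentFor(K key)
    {
        // spread the hash so the bits used for segment selection have
        // reasonable entropy (the caveat mentioned above)
        int h = key.hashCode();
        h ^= (h >>> 16);
        return segments[h & (segments.length - 1)];
    }
}
{code}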


was (Author: benedict):
[~aweisberg]: In my experience segments tend to be imperfectly distributed, so 
whilst there is bunching of resizes simply because they take so long, with real 
work going on at the same time they should be a _little_ spread out. Though 
with murmur3 the distribution may be significantly more uniform than my prior 
experiments. Either way, they're performed in parallel (without coordination) 
if they coincide, so it's still an improvement.

[~vijay2...@yahoo.com]: When I talk about complexity, I mean the difficulties 
of concurrent programming magnified without the normal tools. For instance, 
there are the following concerns:

* We have a spin-lock - admittedly one that should _generally_ be uncontended, 
but on a grow or a small map this is certainly not the case, which could result 
in really problematic behaviour. Pure spin locks should not be used outside of 
the kernel. 
* The queue is maintained by a separate thread that requires signalling if it 
isn't currently performing work - which, in a real C* instance where the cost 
of linking the queue item is a fraction of the other work done to service a 
request means we are likely to incur a costly unpark() for a majority of 
operations
* Reads can interleave with put/replace/remove and abort the removal of an item 
from the queue, resulting in a memory leak. 
* We perform the grow on a separate thread, but prevent all reader _or_ writer 
threads from making progress by taking the locks for all buckets immediately.
* Freeing of oldSegments is still dangerous, it's just probabilistically less 
likely to happen.
* During a grow, we can lose puts because we unlock the old segments, so with 
the right (again, unlikely) interleaving of events a writer can think the old 
table is still valid
* When growing, we only double the size of the backing t

[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-11-29 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228693#comment-14228693
 ] 

Benedict edited comment on CASSANDRA-7438 at 11/29/14 9:40 AM:
---

Good point! But invert those two statements and the behaviour is still broken.

B: 154 :map.get()
A: 187: map.remove()
A: 191: queue.deleteFromQueue()
B: 158: queue.addToQueue()


was (Author: benedict):
Invert those two statements and the behaviour is still broken.

B: 154 :map.get()
A: 187: map.remove()
A: 191: queue.deleteFromQueue()
B: 158: queue.addToQueue()

> Serializing Row cache alternative (Fully off heap)
> --
>
> Key: CASSANDRA-7438
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
> Environment: Linux
>Reporter: Vijay
>Assignee: Vijay
>  Labels: performance
> Fix For: 3.0
>
> Attachments: 0001-CASSANDRA-7438.patch, tests.zip
>
>
> Currently SerializingCache is only partially off heap; keys are still stored
> in the JVM heap as ByteBuffers.
> * There are higher GC costs for a reasonably big cache.
> * Some users have used the row cache efficiently in production for better
> results, but this requires careful tuning.
> * The memory overhead for the cache entries is relatively high.
> So the proposal for this ticket is to move the LRU cache logic completely off
> heap and use JNI to interact with the cache. We might want to ensure that the
> new implementation matches the existing API (ICache), and the implementation
> needs to have safe memory access, low memory overhead and as few memcpys as
> possible.
> We might also want to make this cache configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-12-01 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229284#comment-14229284
 ] 

Robert Stupp edited comment on CASSANDRA-7438 at 12/1/14 9:46 AM:
--

Have pushed the latest changes of OHC to https://github.com/snazy/ohc. It has
been nearly completely rewritten.

Architecture (in brief):
* OHC consists of multiple segments (default: 2 x #CPUs). Fewer segments lead
to more contention; more segments give no measurable improvement.
* Each segment consists of an off-heap hash map (defaults: table-size=8192,
load-factor=.75). (The hash table requires 8 bytes per bucket.)
* Hash entries in a bucket are organized in a double-linked-list
* The LRU replacement policy is built in via its own double-linked-list
* Critical sections that mutually lock a segment are pretty short (code + CPU)
- just a 'synchronized' keyword, no StampedLock/ReentrantLock
* Capacity for the cache is configured globally and managed "locally" in each
segment
* Eviction (or "replacement" or "cleanup") is triggered when free capacity goes
below a trigger value and cleans up to a target free capacity
* Uses a murmur hash on the serialized key. The most significant bits are used
to find the segment, the least significant bits for the segment's hash map.

Non-production relevant stuff:
* Allows starting off-heap access in "debug" mode, which checks for accesses
outside of the allocated region and produces exceptions instead of SIGSEGV or
jemalloc errors
* ohc-benchmark updated to reflect the changes

About the replacement policy: currently LRU is built in - but I'm not really
sold on LRU as is. Alternatives could be
* timestamp (not sold on this either - basically the same as LRU)
* LIRS (https://en.wikipedia.org/wiki/LIRS_caching_algorithm), big overhead
(space)
* 2Q (counts accesses, divides the counters regularly)
* LRU+random (50/50) (may give the same result as LIRS, but without LIRS'
overhead)

But replacing LRU with something else is out of scope for this ticket and
should be done with real workloads in C* - although the last one is "just" an
additional config parameter.

IMO we should add a per-table option that configures whether the row cache
receives data on reads+writes or just on reads. That might prevent garbage in
the cache caused by write-heavy tables.

{{Unsafe.allocateMemory()}} gives about a 5-10% performance improvement
compared to jemalloc. The reason for it might be the JNA library (which has
some synchronized blocks in it).

IMO OHC is ready to be merged into the C* code base.

Edit: the fact that there are two double-linked lists is a left-over of several
experiments; they will be merged into one double-linked-list. It needs to be
and will be fixed.


was (Author: snazy):
Have pushed the latest changes of OHC to https://github.com/snazy/ohc. It has 
been nearly completely rewritten.

Architecture (in brief):
* OHC consists of multiple segments (default: 2 x #CPUs). Less segments leads 
to more contention, more segments gives no measurable improvement.
* Each segment consists of an off-heap-hash-map (defaults: table-size=8192, 
load-factor=.75). (The hash table requires 8 bytes per bucket)
* Hash entries in a bucket are organized in a double-linked-list
* LRU replacement policy is built-in via its own double-linked-list
* Critical sections that mutually lock a segment are pretty short (code + CPU) 
- just a 'synchronized' keyword, no StampedLock/ReentrantLock
* Capacity for the cache is configured globally and managed "locally" in each 
segment
* Eviction (or "replacement" or "cleanup") is triggered when free capacity goes 
below a trigger value and cleans up to a target free capacity
* Uses murmur hash on serialized key. Most significant bits are used to find 
the segment, least significant bits for the segment's hash map. 

Non-production relevant stuff:
* Allows to start off-heap access in "debug" mode, that checks for accesses 
outside of allocated region and produces exceptions instead of SIGSEGV or 
jemalloc errors
* ohc-benchmark updated to reflect changes

About replacement policy: Currently LRU is built in - but I'm not really sold 
on LRU as is. Alternatives could be
* timestamp (not sold on this either - basically the same as LRU)
* LIRS (https://en.wikipedia.org/wiki/LIRS_caching_algorithm), big overhead 
(space)
* 2Q (counts accesses, divides counter regularly)
* LRU+random (50/50) (may give the same result than LIRS, but without LIRS' 
overhead)
But replacement of LRU with something else is out of scope of this ticket and 
should be done with real workloads in C* - although the last one is "just" a 
additional config parameter.

IMO we should add a per-table option that configures whether the row cache 
receives data on reads+writes or just on reads. Might prevent garbage in the 
cache caused by write heavy tables.

{{Unsafe.allocateMemory()}} gives about 5-10% performance 

[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-12-01 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229284#comment-14229284
 ] 

Robert Stupp edited comment on CASSANDRA-7438 at 12/1/14 10:14 AM:
---

Have pushed the latest changes of OHC to https://github.com/snazy/ohc. It has
been nearly completely rewritten.

Architecture (in brief):
* OHC consists of multiple segments (default: 2 x #CPUs). Fewer segments lead
to more contention; more segments give no measurable improvement.
* Each segment consists of an off-heap hash map (defaults: table-size=8192,
load-factor=.75). (The hash table requires 8 bytes per bucket.)
* Hash entries in a bucket are organized in a double-linked-list
* The LRU replacement policy is built in via its own double-linked-list
* Critical sections that mutually lock a segment are pretty short (code + CPU)
- just a 'synchronized' keyword, no StampedLock/ReentrantLock
* Capacity for the cache is configured globally and managed "locally" in each
segment
* Eviction (or "replacement" or "cleanup") is triggered when free capacity goes
below a trigger value and cleans up to a target free capacity
* Uses a murmur hash on the serialized key. The most significant bits are used
to find the segment, the least significant bits for the segment's hash map.

Non-production relevant stuff:
* Allows starting off-heap access in "debug" mode, which checks for accesses
outside of the allocated region and produces exceptions instead of SIGSEGV or
jemalloc errors
* ohc-benchmark updated to reflect the changes

About the replacement policy: currently LRU is built in - but I'm not really
sold on LRU as is. Alternatives could be
* timestamp (not sold on this either - basically the same as LRU)
* LIRS (https://en.wikipedia.org/wiki/LIRS_caching_algorithm), big overhead
(space)
* 2Q (counts accesses, divides the counters regularly)
* LRU+random (50/50) (may give the same result as LIRS, but without LIRS'
overhead)

But replacing LRU with something else is out of scope for this ticket and
should be done with real workloads in C* - although the last one is "just" an
additional config parameter.

IMO we should add a per-table option that configures whether the row cache
receives data on reads+writes or just on reads. That might prevent garbage in
the cache caused by write-heavy tables.

{{Unsafe.allocateMemory()}} gives about a 5-10% performance improvement
compared to jemalloc. The reason for it might be the JNA library (which has
some synchronized blocks in it).

IMO OHC is ready to be merged into the C* code base.

Edit2: (remove edit1)


was (Author: snazy):
Have pushed the latest changes of OHC to https://github.com/snazy/ohc. It has 
been nearly completely rewritten.

Architecture (in brief):
* OHC consists of multiple segments (default: 2 x #CPUs). Less segments leads 
to more contention, more segments gives no measurable improvement.
* Each segment consists of an off-heap-hash-map (defaults: table-size=8192, 
load-factor=.75). (The hash table requires 8 bytes per bucket)
* Hash entries in a bucket are organized in a double-linked-list
* LRU replacement policy is built-in via its own double-linked-list
* Critical sections that mutually lock a segment are pretty short (code + CPU) 
- just a 'synchronized' keyword, no StampedLock/ReentrantLock
* Capacity for the cache is configured globally and managed "locally" in each 
segment
* Eviction (or "replacement" or "cleanup") is triggered when free capacity goes 
below a trigger value and cleans up to a target free capacity
* Uses murmur hash on serialized key. Most significant bits are used to find 
the segment, least significant bits for the segment's hash map. 

Non-production relevant stuff:
* Allows to start off-heap access in "debug" mode, that checks for accesses 
outside of allocated region and produces exceptions instead of SIGSEGV or 
jemalloc errors
* ohc-benchmark updated to reflect changes

About replacement policy: Currently LRU is built in - but I'm not really sold 
on LRU as is. Alternatives could be
* timestamp (not sold on this either - basically the same as LRU)
* LIRS (https://en.wikipedia.org/wiki/LIRS_caching_algorithm), big overhead 
(space)
* 2Q (counts accesses, divides counter regularly)
* LRU+random (50/50) (may give the same result than LIRS, but without LIRS' 
overhead)

But replacement of LRU with something else is out of scope of this ticket and 
should be done with real workloads in C* - although the last one is "just" a 
additional config parameter.

IMO we should add a per-table option that configures whether the row cache 
receives data on reads+writes or just on reads. Might prevent garbage in the 
cache caused by write heavy tables.

{{Unsafe.allocateMemory()}} gives about 5-10% performance improvement compared 
to jemalloc. Reason fot it might be that JNA library (which has some 
synchronized blocks in it).

IMO OHC is ready to be merged int

[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-12-01 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229284#comment-14229284
 ] 

Robert Stupp edited comment on CASSANDRA-7438 at 12/1/14 11:00 AM:
---

Have pushed the latest changes of OHC to https://github.com/snazy/ohc. It has 
been nearly completely rewritten.

Architecture (in brief):
* OHC consists of multiple segments (default: 2 x #CPUs). Less segments leads 
to more contention, more segments gives no measurable improvement.
* Each segment consists of an off-heap-hash-map (defaults: table-size=8192, 
load-factor=.75). (The hash table requires 8 bytes per bucket)
* Hash entries in a bucket are organized in a single-linked-list
* LRU replacement policy is built-in via its own double-linked-list
* Critical sections that mutually lock a segment are pretty short (code + CPU) 
- just a 'synchronized' keyword, no StampedLock/ReentrantLock
* Capacity for the cache is configured globally and managed "locally" in each 
segment
* Eviction (or "replacement" or "cleanup") is triggered when free capacity goes 
below a trigger value and cleans up to a target free capacity
* Uses murmur hash on serialized key. Most significant bits are used to find 
the segment, least significant bits for the segment's hash map. 

Non-production relevant stuff:
* Allows to start off-heap access in "debug" mode, that checks for accesses 
outside of allocated region and produces exceptions instead of SIGSEGV or 
jemalloc errors
* ohc-benchmark updated to reflect changes

About replacement policy: Currently LRU is built in - but I'm not really sold 
on LRU as is. Alternatives could be
* timestamp (not sold on this either - basically the same as LRU)
* LIRS (https://en.wikipedia.org/wiki/LIRS_caching_algorithm), big overhead 
(space)
* 2Q (counts accesses, divides counter regularly)
* LRU+random (50/50) (may give the same result than LIRS, but without LIRS' 
overhead)

But replacement of LRU with something else is out of scope of this ticket and 
should be done with real workloads in C* - although the last one is "just" a 
additional config parameter.

IMO we should add a per-table option that configures whether the row cache 
receives data on reads+writes or just on reads. Might prevent garbage in the 
cache caused by write heavy tables.

{{Unsafe.allocateMemory()}} gives about 5-10% performance improvement compared 
to jemalloc. The reason for it might be the JNA library (which has some 
synchronized blocks in it).

IMO OHC is ready to be merged into C* code base.

Edit3: (sorry for the JIRA noise) - bucket linked list is only a 
single-linked-list - LRU linked list needs to be doubly linked


was (Author: snazy):
Have pushed the latest changes of OHC to https://github.com/snazy/ohc. It has 
been nearly completely rewritten.

Architecture (in brief):
* OHC consists of multiple segments (default: 2 x #CPUs). Fewer segments lead 
to more contention; more segments give no measurable improvement.
* Each segment consists of an off-heap-hash-map (defaults: table-size=8192, 
load-factor=.75). (The hash table requires 8 bytes per bucket)
* Hash entries in a bucket are organized in a double-linked-list
* LRU replacement policy is built-in via its own double-linked-list
* Critical sections that mutually lock a segment are pretty short (code + CPU) 
- just a 'synchronized' keyword, no StampedLock/ReentrantLock
* Capacity for the cache is configured globally and managed "locally" in each 
segment
* Eviction (or "replacement" or "cleanup") is triggered when free capacity goes 
below a trigger value and cleans up to a target free capacity
* Uses murmur hash on serialized key. Most significant bits are used to find 
the segment, least significant bits for the segment's hash map. 

Non-production relevant stuff:
* Allows to start off-heap access in "debug" mode, that checks for accesses 
outside of allocated region and produces exceptions instead of SIGSEGV or 
jemalloc errors
* ohc-benchmark updated to reflect changes

About replacement policy: Currently LRU is built in - but I'm not really sold 
on LRU as is. Alternatives could be
* timestamp (not sold on this either - basically the same as LRU)
* LIRS (https://en.wikipedia.org/wiki/LIRS_caching_algorithm), big overhead 
(space)
* 2Q (counts accesses, divides counter regularly)
* LRU+random (50/50) (may give the same result as LIRS, but without LIRS' 
overhead)

But replacement of LRU with something else is out of scope of this ticket and 
should be done with real workloads in C* - although the last one is "just" an 
additional config parameter.

IMO we should add a per-table option that configures whether the row cache 
receives data on reads+writes or just on reads. Might prevent garbage in the 
cache caused by write heavy tables.

{{Unsafe.allocateMemory()}} gives about 5-10% performance improvement compared 
to jemalloc. The reason for it might be the JNA library (which has some 
synchronized blocks in it).

[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-12-01 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230675#comment-14230675
 ] 

Ariel Weisberg edited comment on CASSANDRA-7438 at 12/1/14 11:28 PM:
-

Look pretty nice.

Suggestions:
* Push the stats into the segments and gather them the way you do free capacity 
and cleanup count. You can drop the volatile (technically you will have to 
synchronize on read). Inside each OffHeapMap put the stats members (and 
anything mutable) as the first declared fields. In practice this can put them 
on the same cache line as the lock field in the object header. It will also be 
just one flush at the end of the critical section. Stats collection should be 
free so no reason not to leave it on all the time.
* I am not sure batch cleanup makes sense. When inserting an item into the 
cache would blow the size requirement I would just evict elements until 
inserting it wouldn't. Is there a specific efficiency you think you are going 
to get from doing it in batches?
* Cache is the wrong API to use since it doesn't allow lazy deserialization and 
zero copy. Since entries are refcounted there is no need to make a copy. Might 
be something to save for later since everything upstream expects a POJO of some 
sort.
* Key buffer might be worth a thread local sized to a high watermark

Do we have a decent way to do line level code review? I can't  leave comments 
on github unless there is a pull request. Line level stuff
* Don't catch exceptions and handle inside the map. Let them all propagate to 
the caller and use try/finally to do cleanup. I know you have to wrap and 
rethrow some things, but avoid where possible.
* The key compare compares 8 bytes at a time - how does it handle trailing 
bytes and alignment? (One way to handle the tail is sketched after this list.)
* Agrona has an Unsafe ByteBuffer implementation that looks like it makes a 
little better use of various intrinsics than AbstractDataOutput. Does some 
other nifty stuff as well. 
https://github.com/real-logic/Agrona/blob/master/src/main/java/uk/co/real_logic/agrona/concurrent/UnsafeBuffer.java
* In OffHeapMap.touch lines 439 and 453 are not covered by tests. Coverage 
looks a little weird in that a lot of the cases are always hit but some don't 
touch both branches. If lruTail == hashEntryAddr maybe assert next is null.
* Rename mutating OffHeapMap lruNext and lruPrev to reflect that they mutate. 
In general rename mutating methods to reflect they do that such as the two 
versions of first
* I don't see why the cache can't use CPU endianness since the key/value are 
just copied.
* Did you get the UTF encoded string stuff from somewhere? I see something 
similar in the jdk, can you get that via inheritance?
* HashEntryInput, AbstractDataOutput  are low on the coverage scale and have no 
tests for some pretty gnarly UTF8 stuff.
* Continuing on that theme there is a lot of unused code to satisfy the 
interfaces being implemented, would be nice to avoid that.
* By hashing the key yourself you prevent caching the hash code in the POJO. 
Maybe hashes should be 32-bits and provided by the POJO?
* If an allocation fails maybe throw OutOfMemoryError with a message
* If an entry is too large maybe return an error of some sort? Seems like 
caller should decide if not caching is OK.
* In put, why catch VirtualMachineError and not error?  Seems like it wants a 
finally, and it shouldn't throw checked exceptions.
* If a key serializer is necessary throw in the constructor and remove other 
checks
* Hot N could use a more thorough test?
* In practice how is hot N used in C*? When people save the cache to disk do 
they save the entire cache?
* In the value loading case, I think there is some subtlety to the concurrency 
of invocations to the loader in that it doesn't call it on all of them in a 
race. It might be a minor change in behavior compared to Guava.
* Maybe do the value loading timing in nanoseconds? Performance is the same but 
precision is better.
* OffHeapMap.Table.removeLink(long,long) has no test coverage of the second 
branch that walks a bucket to find the previous entry
* I don't think storage for 16 million keys is enough? For 128 bytes per entry 
that is only 2 gigabytes. You would have to run a lot of segments which is 
probably fine, but that presents a configuration issue. Maybe allow more than 
24 bits of buckets in each segment?
* SegmentedCacheImpl contains duplicate code for dereferencing and still has to 
delegate part of the work to the OffHeapMap. Maybe keep it all in OffHeapMap?
* Unit test wise there are some things not tested. The value loader interface, 
various things like putAll or invalidateAll.
* Release is not synchronized. Release should null pointers out so you get a 
good clean segfault. Close should maybe lock and close one segment at a time 
and invalidate as part of that.
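
Regarding the 8-bytes-at-a-time key compare above: below is a minimal sketch of 
one way to handle the trailing bytes. This illustrates the question, it is not 
OHC's actual code; it assumes unaligned 8-byte reads are acceptable (true on 
x86, not on every platform), and for a pure equality check byte order does not 
matter.
{code}
// Illustrative sketch (not the actual OHC code): compare two off-heap byte
// sequences 8 bytes at a time, then byte-by-byte for the trailing remainder.
import sun.misc.Unsafe;
import java.lang.reflect.Field;

final class KeyCompare
{
    private static final Unsafe unsafe = loadUnsafe();

    static boolean sameKey(long adr1, long adr2, long length)
    {
        long i = 0;
        for (; i + 8 <= length; i += 8)
            if (unsafe.getLong(adr1 + i) != unsafe.getLong(adr2 + i))
                return false;
        for (; i < length; i++)           // trailing 1..7 bytes
            if (unsafe.getByte(adr1 + i) != unsafe.getByte(adr2 + i))
                return false;
        return true;
    }

    private static Unsafe loadUnsafe()
    {
        try
        {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        }
        catch (Exception e)
        {
            throw new AssertionError(e);
        }
    }
}
{code}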








was (Author: aweisberg):
Look pretty nice.

Suggestions:

[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-12-01 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230675#comment-14230675
 ] 

Ariel Weisberg edited comment on CASSANDRA-7438 at 12/1/14 11:46 PM:
-

Look pretty nice.

Suggestions:
* Push the stats into the segments and gather them the way you do free capacity 
and cleanup count. You can drop the volatile (technically you will have to 
synchronize on read). Inside each OffHeapMap put the stats members (and 
anything mutable) as the first declared fields. In practice this can put them 
on the same cache line as the lock field in the object header. It will also be 
just one flush at the end of the critical section. Stats collection should be 
free so no reason not to leave it on all the time.
* I am not sure batch cleanup makes sense. When inserting an item into the 
cache would blow the size requirement I would just evict elements until 
inserting it wouldn't. Is there a specific efficiency you think you are going 
to get from doing it in batches?
* Cache is the wrong API to use since it doesn't allow lazy deserialization and 
zero copy. Since entries are refcounted there is no need to make a copy. Might 
be something to save for later since everything upstream expects a POJO of some 
sort.
* Key buffer might be worth a thread local sized to a high watermark

Do we have a decent way to do line level code review? I can't  leave comments 
on github unless there is a pull request. Line level stuff
* Don't catch exceptions and handle inside the map. Let them all propagate to 
the caller and use try/finally to do cleanup. I know you have to wrap and 
rethrow some things due to checked exceptions, but avoid where possible.
* Compare key compares 8 bytes at a time, how does it handle trailing bytes and 
alignment?
* Agrona has an Unsafe ByteBuffer implementation that looks like it makes a 
little better use of various intrinsics than AbstractDataOutput. Does some 
other nifty stuff as well. 
https://github.com/real-logic/Agrona/blob/master/src/main/java/uk/co/real_logic/agrona/concurrent/UnsafeBuffer.java
* In OffHeapMap.touch lines 439 and 453 are not covered by tests. Coverage 
looks a little weird in that a lot of the cases are always hit but some don't 
touch both branches. If lruTail == hashEntryAddr maybe assert next is null.
* Rename mutating OffHeapMap lruNext and lruPrev to reflect that they mutate. 
In general rename mutating methods to reflect they do that such as the two 
versions of first
* I don't see why the cache can't use CPU endianness since the key/value are 
just copied.
* Did you get the UTF encoded string stuff from somewhere? I see something 
similar in the jdk, can you get that via inheritance?
* HashEntryInput, AbstractDataOutput  are low on the coverage scale and have no 
tests for some pretty gnarly UTF8 stuff.
* Continuing on that theme there is a lot of unused code to satisfy the 
interfaces being implemented, would be nice to avoid that.
* By hashing the key yourself you prevent caching the hash code in the POJO. 
Maybe hashes should be 32-bits and provided by the POJO?
* If an allocation fails maybe throw OutOfMemoryError with a message
* If an entry is too large maybe return an error of some sort? Seems like 
caller should decide if not caching is OK.
* In put, why catch VirtualMachineError and not error?  Seems like it wants a 
finally, and it shouldn't throw checked exceptions.
* If a key serializer is necessary throw in the constructor and remove other 
checks
* Hot N could use a more thorough test?
* In practice how is hot N used in C*? When people save the cache to disk do 
they save the entire cache?
* In the value loading case, I think there is some subtlety to the concurrency 
of invocations to the loader in that it doesn't call it on all of them in a 
race. It might be a minor change in behavior compared to Guava.
* Maybe do the value loading timing in nanoseconds? Performance is the same but 
precision is better.
* OffHeapMap.Table.removeLink(long,long) has no test coverage of the second 
branch that walks a bucket to find the previous entry
* I don't think storage for 16 million keys is enough? For 128 bytes per entry 
that is only 2 gigabytes. You would have to run a lot of segments which is 
probably fine, but that presents a configuration issue. Maybe allow more than 
24 bits of buckets in each segment?
* SegmentedCacheImpl contains duplicate code for dereferencing and still has to 
delegate part of the work to the OffHeapMap. Maybe keep it all in OffHeapMap?
* Unit test wise there are some things not tested. The value loader interface, 
various things like putAll or invalidateAll.
* Release is not synchronized. Release should null pointers out so you get a 
good clean segfault. Close should maybe lock and close one segment at a time 
and invalidate as part of that.








was (Author: aweisberg):
Look pretty nice.

[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-12-01 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230675#comment-14230675
 ] 

Ariel Weisberg edited comment on CASSANDRA-7438 at 12/1/14 11:48 PM:
-

Look pretty nice.

Suggestions:
* Push the stats into the segments and gather them the way you do free capacity 
and cleanup count. You can drop the volatile (technically you will have to 
synchronize on read). Inside each OffHeapMap put the stats members (and 
anything mutable) as the first declared fields. In practice this can put them 
on the same cache line as the lock field in the object header. It will also be 
just one flush at the end of the critical section. Stats collection should be 
free so no reason not to leave it on all the time.
* I am not sure batch cleanup makes sense. When inserting an item into the 
cache would blow the size requirement I would just evict elements until 
inserting it wouldn't. Is there a specific efficiency you think you are going 
to get from doing it in batches?
* Cache is the wrong API to use since it doesn't allow lazy deserialization and 
zero copy. Since entries are refcounted there is no need to make a copy. Might 
be something to save for later since everything upstream expects a POJO of some 
sort.
* Key buffer might be worth a thread local sized to a high watermark

Do we have a decent way to do line level code review? I can't  leave comments 
on github unless there is a pull request. Line level stuff
* Don't catch exceptions and handle inside the map. Let them all propagate to 
the caller and use try/finally to do cleanup. I know you have to wrap and 
rethrow some things due to checked exceptions, but avoid where possible.
* Compare key compares 8 bytes at a time, how does it handle trailing bytes and 
alignment?
* Agrona has an Unsafe ByteBuffer implementation that looks like it makes a 
little better use of various intrinsics than AbstractDataOutput. Does some 
other nifty stuff as well. 
https://github.com/real-logic/Agrona/blob/master/src/main/java/uk/co/real_logic/agrona/concurrent/UnsafeBuffer.java
* In OffHeapMap.touch lines 439 and 453 are not covered by tests. Coverage 
looks a little weird in that a lot of the cases are always hit but some don't 
touch both branches. If lruTail == hashEntryAddr maybe assert next is null.
* Rename mutating OffHeapMap lruNext and lruPrev to reflect that they mutate. 
In general rename mutating methods to reflect they do that such as the two 
versions of first
* I don't see why the cache can't use CPU endianness since the key/value are 
just copied.
* Did you get the UTF encoded string stuff from somewhere? I see something 
similar in the jdk, can you get that via inheritance?
* HashEntryInput, AbstractDataOutput  are low on the coverage scale and have no 
tests for some pretty gnarly UTF8 stuff.
* Continuing on that theme there is a lot of unused code to satisfy the 
interfaces being implemented, would be nice to avoid that.
* By hashing the key yourself you prevent caching the hash code in the POJO. 
Maybe hashes should be 32-bits and provided by the POJO?
* If an allocation fails maybe throw OutOfMemoryError with a message
* If an entry is too large maybe return an error of some sort? Seems like 
caller should decide if not caching is OK.
* In put, why catch VirtualMachineError and not error?  Seems like it wants a 
finally, and it shouldn't throw checked exceptions.
* If a key serializer is necessary throw in the constructor and remove other 
checks
* Hot N could use a more thorough test?
* In practice how is hot N used in C*? When people save the cache to disk do 
they save the entire cache? I am a little concerned about materializing the 
full list on heap. It's a lot of contiguous memory if you aren't careful.
* In the value loading case, I think there is some subtlety to the concurrency 
of invocations to the loader in that it doesn't call it on all of them in a 
race. It might be a minor change in behavior compared to Guava.
* Maybe do the value loading timing in nanoseconds? Performance is the same but 
precision is better.
* OffHeapMap.Table.removeLink(long,long) has no test coverage of the second 
branch that walks a bucket to find the previous entry
* I don't think storage for 16 million keys is enough? For 128 bytes per entry 
that is only 2 gigabytes. You would have to run a lot of segments which is 
probably fine, but that presents a configuration issue. Maybe allow more than 
24 bits of buckets in each segment?
* SegmentedCacheImpl contains duplicate code for dereferencing and still has to 
delegate part of the work to the OffHeapMap. Maybe keep it all in OffHeapMap?
* Unit test wise there are some things not tested. The value loader interface, 
various things like putAll or invalidateAll.
* Release is not synchronized. Release should null pointers out so you get a 
good clean segfault. Close should maybe lock and close one segment at a time 
and invalidate as part of that.

[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-12-02 Thread Vijay (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231327#comment-14231327
 ] 

Vijay edited comment on CASSANDRA-7438 at 12/2/14 11:33 AM:


[~snazy] I was trying to compare the OHC and found few major bugs.

There is a correctness issue in the hashing algorithm, I think. Get returns a 
lot of errors, and it looks like there are some memory leaks too.


was (Author: vijay2...@yahoo.com):
[~snazy] I was trying to compare the OHC and found few major bugs.

1) You have individual method synchronization on the Map, which doesn't ensure 
that your get is locked against a concurrent put (same with clean, hot(N), 
remove etc.). Look at the SynchronizedMap source code to do it right, or else it 
will crash soon.
2) Even after I fix that, there is a correctness issue in the hashing algorithm, 
I think. Get returns a lot of errors, and it looks like there are some memory 
leaks too.

> Serializing Row cache alternative (Fully off heap)
> --
>
> Key: CASSANDRA-7438
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
> Environment: Linux
>Reporter: Vijay
>Assignee: Vijay
>  Labels: performance
> Fix For: 3.0
>
> Attachments: 0001-CASSANDRA-7438.patch, tests.zip
>
>
> Currently SerializingCache is partially off heap, keys are still stored in 
> JVM heap as BB, 
> * There is a higher GC costs for a reasonably big cache.
> * Some users have used the row cache efficiently in production for better 
> results, but this requires careful tunning.
> * Overhead in Memory for the cache entries are relatively high.
> So the proposal for this ticket is to move the LRU cache logic completely off 
> heap and use JNI to interact with cache. We might want to ensure that the new 
> implementation match the existing API's (ICache), and the implementation 
> needs to have safe memory access, low overhead in memory and less memcpy's 
> (As much as possible).
> We might also want to make this cache configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-12-02 Thread Vijay (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231878#comment-14231878
 ] 

Vijay edited comment on CASSANDRA-7438 at 12/2/14 8:31 PM:
---

EDIT:

Here is the explanation: run the benchmark with the following options (lruc 
benchmark). 
{code}java -Djava.library.path=/usr/local/lib/ -jar ~/lrucTest.jar -t 30 -s 
6147483648 -c ohc{code}

And you will see something like this (errors == not found in the cache, even 
though all the items you need are in the cache).

{code}
Memory consumed: 3 GB / 5 GB or 427170 / 6147483648, size 4980, queued 
(LRU q size) 0

VM total:2 GB
VM free:2 GB

Get Operation (micros)
time_taken, count, mean, median, 99thPercentile, 999thPercentile, error
4734724, 166, 2.42, 1.93, 8.58, 24.74, 166
4804375, 166, 2.40, 1.92, 4.56, 106.23, 166
4805858, 166, 2.45, 1.95, 3.94, 11.76, 166
4842886, 166, 2.40, 1.92, 7.46, 26.73, 166
{code}

You really need test cases :)

Anyway, I am going to stop working on this ticket now; let me know if someone 
wants any other info.


was (Author: vijay2...@yahoo.com):
Never mind, my bad - it was related to the below (which needs to be more 
configurable instead). The items were going missing earlier than I thought they 
should, and it looks like you just evict items per segment (if a segment is 
used more, more items will disappear from that segment and the least used 
segment's items will remain).
{code}
// 12.5% if capacity less than 8GB
// 10% if capacity less than 16 GB
// 5% if capacity is higher than 16GB
{code}

Also noticed you don't have replace(), which Cassandra uses. 
Anyway, I am going to stop working on this for now; let me know if someone 
wants any other info.

> Serializing Row cache alternative (Fully off heap)
> --
>
> Key: CASSANDRA-7438
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
> Environment: Linux
>Reporter: Vijay
>Assignee: Vijay
>  Labels: performance
> Fix For: 3.0
>
> Attachments: 0001-CASSANDRA-7438.patch, tests.zip
>
>
> Currently SerializingCache is partially off heap, keys are still stored in 
> JVM heap as BB, 
> * There is a higher GC costs for a reasonably big cache.
> * Some users have used the row cache efficiently in production for better 
> results, but this requires careful tunning.
> * Overhead in Memory for the cache entries are relatively high.
> So the proposal for this ticket is to move the LRU cache logic completely off 
> heap and use JNI to interact with cache. We might want to ensure that the new 
> implementation match the existing API's (ICache), and the implementation 
> needs to have safe memory access, low overhead in memory and less memcpy's 
> (As much as possible).
> We might also want to make this cache configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-12-04 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234696#comment-14234696
 ] 

Robert Stupp edited comment on CASSANDRA-7438 at 12/4/14 10:20 PM:
---

Just pushed some OHC additions to github:
* key-iterator (used by CacheService class to invalidate column families)
* (de)serialization of cache content to disk using direct I/O from off-heap. 
This means that the row cache content does not need to go through the heap for 
serialization and deserialization. Compression should also be possible off-heap 
using the static methods in the Snappy class, since these expect direct 
buffers, so there's nearly no pressure for that on the heap. Background: the 
implementation basically "lies" the address and length of the cache entry into 
the DirectByteBuffer class so FileChannel is able to read into it/write from it 
(see the sketch below).
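
A rough sketch of that trick (assumed code, not necessarily what OHC does): 
wrap the existing off-heap address/length in a ByteBuffer via the private 
DirectByteBuffer(long, int) constructor present in OpenJDK (used for JNI's 
NewDirectByteBuffer), then hand it to a FileChannel.
{code}
// Rough sketch (assumption for illustration, not necessarily OHC's exact code):
// wrap an existing off-heap region (address + length) in a ByteBuffer without
// copying, so a FileChannel can write it directly / read into it.
import java.lang.reflect.Constructor;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

final class DirectIo
{
    static void writeEntry(FileChannel channel, long address, int length) throws Exception
    {
        // private DirectByteBuffer(long addr, int cap) constructor in OpenJDK
        Constructor<?> ctor = Class.forName("java.nio.DirectByteBuffer")
                                   .getDeclaredConstructor(long.class, int.class);
        ctor.setAccessible(true);
        ByteBuffer view = (ByteBuffer) ctor.newInstance(address, length);
        while (view.hasRemaining())
            channel.write(view);
    }
}
{code}
Reading works the same way with channel.read(view) into a pre-allocated 
off-heap region.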

edit: s/hash/cache/


was (Author: snazy):
Just pushed some OHC additions to github:
* key-iterator (used by CacheService class to invalidate column families)
* (de)serialization of cache content to disk using direct I/O from off-heap.  
Means that the row cache content does not need to go through the heap for 
serialization and deserialization. Compression should also be possible in 
off-heap using the static methods in Snappy class since these expect direct 
buffers so there's nearly no pressure for that on the heap. Background: the 
implementation basically "lies" the address and length of the hash entry into 
DirectByteBuffer class so FileChannel is able to read into it/write from it.


> Serializing Row cache alternative (Fully off heap)
> --
>
> Key: CASSANDRA-7438
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
> Environment: Linux
>Reporter: Vijay
>Assignee: Vijay
>  Labels: performance
> Fix For: 3.0
>
> Attachments: 0001-CASSANDRA-7438.patch, tests.zip
>
>
> Currently SerializingCache is partially off heap, keys are still stored in 
> JVM heap as BB, 
> * There is a higher GC costs for a reasonably big cache.
> * Some users have used the row cache efficiently in production for better 
> results, but this requires careful tunning.
> * Overhead in Memory for the cache entries are relatively high.
> So the proposal for this ticket is to move the LRU cache logic completely off 
> heap and use JNI to interact with cache. We might want to ensure that the new 
> implementation match the existing API's (ICache), and the implementation 
> needs to have safe memory access, low overhead in memory and less memcpy's 
> (As much as possible).
> We might also want to make this cache configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-12-24 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257721#comment-14257721
 ] 

Robert Stupp edited comment on CASSANDRA-7438 at 12/24/14 2:21 PM:
---

I had the opportunity to test OHC on a big machine.
First: it works - very happy about that :)

Some things I want to notice:
* a high number of segments does not have any really measurable influence (the 
default of 2 * # of cores is fine)
* throughput heavily depends on serialization (hash entry size) - Java8 gave 
about 10% to 15% improvement in some tests (either on {{Unsafe.copyMemory}} or 
something related like JNI barrier)
* the number of entries per bucket stays pretty low with the default load 
factor of .75 - vast majority has 0 or 1 entries, some 2 or 3 and few up to 8

Issue (not solvable yet):
It works great for hash entries up to approx. 64kB with good to great 
throughput. Above that barrier it first works well, but after some time the 
system spends a huge amount of CPU time (~95%) in {{malloc()}} / {{free()}} 
(with jemalloc; Unsafe.allocate is not worth discussing at all on Linux).
I tried to add some „memory buffer cache“ that caches free’d hash entries for 
reuse. But it turned out that in the end it would be too complex if done right. 
The current implementation is still in the code, but must be explicitly enabled 
with a system property. Workloads with small entries and high number of threads 
easily trigger Linux OOM protection (that kills the process). Please note that 
it works with large hash entries - but throughput drops dramatically to just a 
few thousand writes per second.

Some numbers (value sizes have gaussian distribution). Had to do these tests in 
a hurry because I had to give back the machine. Code used during these tests is 
tagged as {{0.1-SNAP-Bench}} in git. Throughput is limited by {{malloc()}} / 
{{free()}} and most tests did only use 50% of available CPU capacity (on 
_c3.8xlarge_ - 32 cores, Intel Xeon E5-2680v2 @2.8GHz, 64GB).
* 1k..200k value size, 32 threads, 1M keys, 90% read ratio, 32GB: 22k 
writes/sec, 200k reads/sec, ~8k evictions/sec, write: 8ms (99perc), read: 
3ms(99perc)
* 1k..64k value size, 500 threads, 1M keys, 90% read ratio, 32GB: 55k 
writes/sec, 499k reads/sec, ~2k evictions/sec, write: .1ms (99perc), read: 
.03ms(99perc)
* 1k..64k value size, 500 threads, 1M keys, 50% read ratio, 32GB: 195k 
writes/sec, 195k reads/sec, ~9k evictions/sec, write: .2ms (99perc), read: 
.1ms(99perc)
* 1k..64k value size, 500 threads, 1M keys, 10% read ratio, 32GB: 185k 
writes/sec, 20k reads/sec, ~7k evictions/sec, write: 4ms (99perc), read: 
.07ms(99perc)
* 1k..16k value size, 500 threads, 5M keys, 90% read ratio, 32GB: 110k 
writes/sec, 1M reads/sec, 30k evictions/sec, write: .04ms (99perc), read: 
.01ms(99perc)
* 1k..16k value size, 500 threads, 5M keys, 50% read ratio, 32GB: 420k 
writes/sec, 420k reads/sec, 125k evictions/sec, write: .06ms (99perc), read: 
.01ms(99perc)
* 1k..16k value size, 500 threads, 5M keys, 10% read ratio, 32GB: 435k 
writes/sec, 48k reads/sec, 130k evictions/sec, write: .06ms (99perc), read: 
.01ms(99perc)
* 1k..4k value size, 500 threads, 20M keys, 90% read ratio, 32GB: 140k 
writes/sec, 1.25M reads/sec, 50k evictions/sec, write: .02ms (99perc), read: 
.005ms(99perc)
* 1k..4k value size, 500 threads, 20M keys, 50% read ratio, 32GB: 530k 
writes/sec, 530k reads/sec, 220k evictions/sec, write: .04ms (99perc), read: 
.005ms(99perc)
* 1k..4k value size, 500 threads, 20M keys, 10% read ratio, 32GB: 665k 
writes/sec, 74k reads/sec, 250k evictions/sec, write: .04ms (99perc), read: 
.005ms(99perc)

Command line to execute the benchmark:
{code}
java -jar ohc-benchmark/target/ohc-benchmark-0.1-SNAPSHOT.jar -rkd 
'uniform(1..2000)' -wkd 'uniform(1..2000)' -vs 'gaussian(1024..4096,2)' 
-r .1 -cap 320 -d 86400 -t 500 -dr 8

-r = read rate
-d = duration
-t = # of threads
-dr = # of driver threads that feed the worker threads
-rkd = read key distribution
-wkd = write key distribution
-vs = value size
-cap = capacity
{code}

Sample bucket histogram from 20M test:
{code}
[0..0]: 8118604
[1..1]: 5892298
[2..2]: 2138308
[3..3]: 518089
[4..4]: 94441
[5..5]: 13672
[6..6]: 1599
[7..7]: 189
[8..9]: 16
{code}

After running into that memory management issue with varying allocation sizes 
of a few kB to several MB, I think that it’s still worth working on our own 
off-heap memory management. Maybe some block-based approach (fixed or 
variable). But that’s out of the scope of this ticket.

EDIT: The problem with high system-CPU usage only persists on systems with 
multiple CPUs. Cross check with the second CPU socket disabled - calling the 
benchmark with {{taskset 0x java -jar ...}}  does not show 95% system CPU 
usage.


was (Author: snazy):
I had the opportunity to test OHC on a big machine.
First: it works - very happy about that :)

[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-12-24 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257721#comment-14257721
 ] 

Robert Stupp edited comment on CASSANDRA-7438 at 12/24/14 2:29 PM:
---

I had the opportunity to test OHC on a big machine.
First: it works - very happy about that :)

Some things I want to notice:
* high number of segments do not have any really measurable influence (default 
of 2* # of cores is fine)
* throughput heavily depends on serialization (hash entry size) - Java8 gave 
about 10% to 15% improvement in some tests (either on {{Unsafe.copyMemory}} or 
something related like JNI barrier)
* the number of entries per bucket stays pretty low with the default load 
factor of .75 - vast majority has 0 or 1 entries, some 2 or 3 and few up to 8

Issue (not solvable yet):
It works great for hash entries to approx. 64kB with good to great throughput. 
Above that barrier it first works good but after some time the system spends a 
huge amount of CPU time (~95%) in {{malloc()}} / {{free()}} (with jemalloc, 
Unsafe.allocate is not worth discussing at all on Linux).
I tried to add some „memory buffer cache“ that caches free’d hash entries for 
reuse. But it turned out that in the end it would be too complex if done right. 
The current implementation is still in the code, but must be explicitly enabled 
with a system property. Workloads with small entries and high number of threads 
easily trigger Linux OOM protection (that kills the process). Please note that 
it works with large hash entries - but throughput drops dramatically to just a 
few thousand writes per second.

Some numbers (value sizes have gaussian distribution). Had to do these tests in 
a hurry because I had to give back the machine. Code used during these tests is 
tagged as {{0.1-SNAP-Bench}} in git. Throughput is limited by {{malloc()}} / 
{{free()}} and most tests did only use 50% of available CPU capacity (on 
_c3.8xlarge_ - 32 cores, Intel Xeon E5-2680v2 @2.8GHz, 64GB).
* 1k..200k value size, 32 threads, 1M keys, 90% read ratio, 32GB: 22k 
writes/sec, 200k reads/sec, ~8k evictions/sec, write: 8ms (99perc), read: 
3ms(99perc)
* 1k..64k value size, 500 threads, 1M keys, 90% read ratio, 32GB: 55k 
writes/sec, 499k reads/sec, ~2k evictions/sec, write: .1ms (99perc), read: 
.03ms(99perc)
* 1k..64k value size, 500 threads, 1M keys, 50% read ratio, 32GB: 195k 
writes/sec, 195k reads/sec, ~9k evictions/sec, write: .2ms (99perc), read: 
.1ms(99perc)
* 1k..64k value size, 500 threads, 1M keys, 10% read ratio, 32GB: 185k 
writes/sec, 20k reads/sec, ~7k evictions/sec, write: 4ms (99perc), read: 
.07ms(99perc)
* 1k..16k value size, 500 threads, 5M keys, 90% read ratio, 32GB: 110k 
writes/sec, 1M reads/sec, 30k evictions/sec, write: .04ms (99perc), read: 
.01ms(99perc)
* 1k..16k value size, 500 threads, 5M keys, 50% read ratio, 32GB: 420k 
writes/sec, 420k reads/sec, 125k evictions/sec, write: .06ms (99perc), read: 
.01ms(99perc)
* 1k..16k value size, 500 threads, 5M keys, 10% read ratio, 32GB: 435k 
writes/sec, 48k reads/sec, 130k evictions/sec, write: .06ms (99perc), read: 
.01ms(99perc)
* 1k..4k value size, 500 threads, 20M keys, 90% read ratio, 32GB: 140k 
writes/sec, 1.25M reads/sec, 50k evictions/sec, write: .02ms (99perc), read: 
.005ms(99perc)
* 1k..4k value size, 500 threads, 20M keys, 50% read ratio, 32GB: 530k 
writes/sec, 530k reads/sec, 220k evictions/sec, write: .04ms (99perc), read: 
.005ms(99perc)
* 1k..4k value size, 500 threads, 20M keys, 10% read ratio, 32GB: 665k 
writes/sec, 74k reads/sec, 250k evictions/sec, write: .04ms (99perc), read: 
.005ms(99perc)

Command line to execute the benchmark:
{code}
java -jar ohc-benchmark/target/ohc-benchmark-0.1-SNAPSHOT.jar -rkd 
'uniform(1..2000)' -wkd 'uniform(1..2000)' -vs 'gaussian(1024..4096,2)' 
-r .1 -cap 320 -d 86400 -t 500 -dr 8

-r = read rate
-d = duration
-t = # of threads
-dr = # of driver threads that feed the worker threads
-rkd = read key distribution
-wkd = write key distribution
-vs = value size
-cap = capacity
{code}

Sample bucket histogram from 20M test:
{code}
[0..0]: 8118604
[1..1]: 5892298
[2..2]: 2138308
[3..3]: 518089
[4..4]: 94441
[5..5]: 13672
[6..6]: 1599
[7..7]: 189
[8..9]: 16
{code}

After running into that memory management issue with varying allocation sizes 
of a few kB to several MB, I think that it’s still worth working on our own 
off-heap memory management. Maybe some block-based approach (fixed or 
variable). But that’s out of the scope of this ticket.

EDIT: The problem with high system-CPU usage only persists on systems with 
multiple CPUs. Cross check with the second CPU socket disabled - calling the 
benchmark with {{taskset 0x3ff java -jar ...}}  does not show 95% system CPU 
usage.


was (Author: snazy):
I had the opportunity to test OHC on a big machine.
First: it works - very happy about that :)

[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2015-01-06 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257721#comment-14257721
 ] 

Robert Stupp edited comment on CASSANDRA-7438 at 1/6/15 10:08 AM:
--

I had the opportunity to test OHC on a big machine.
First: it works - very happy about that :)

Some things I want to notice:
* high number of segments do not have any really measurable influence (default 
of 2* # of cores is fine)
* throughput heavily depends on serialization (hash entry size) - Java8 gave 
about 10% to 15% improvement in some tests (either on {{Unsafe.copyMemory}} or 
something related like JNI barrier)
* the number of entries per bucket stays pretty low with the default load 
factor of .75 - vast majority has 0 or 1 entries, some 2 or 3 and few up to 8

Issue (not solvable yet):
It works great for hash entries to approx. 64kB with good to great throughput. 
Above that barrier it first works good but after some time the system spends a 
huge amount of CPU time (~95%) in {{malloc()}} / {{free()}} (with jemalloc, 
Unsafe.allocate is not worth discussing at all on Linux).
I tried to add some „memory buffer cache“ that caches free’d hash entries for 
reuse. But it turned out that in the end it would be too complex if done right. 
The current implementation is still in the code, but must be explicitly enabled 
with a system property. Workloads with small entries and high number of threads 
easily trigger Linux OOM protection (that kills the process). Please note that 
it works with large hash entries - but throughput drops dramatically to just a 
few thousand writes per second.

Some numbers (value sizes have gaussian distribution). Had to do these tests in 
a hurry because I had to give back the machine. Code used during these tests is 
tagged as {{0.1-SNAP-Bench}} in git. Throughput is limited by {{malloc()}} / 
{{free()}} and most tests did only use 50% of available CPU capacity (on 
_c3.8xlarge_ - 32 cores, Intel Xeon E5-2680v2 @2.8GHz, 64GB).
* -1k..200k value size, 32 threads, 1M keys, 90% read ratio, 32GB: 22k 
writes/sec, 200k reads/sec, ~8k evictions/sec, write: 8ms (99perc), read: 
3ms(99perc)-
* -1k..64k value size, 500 threads, 1M keys, 90% read ratio, 32GB: 55k 
writes/sec, 499k reads/sec, ~2k evictions/sec, write: .1ms (99perc), read: 
.03ms(99perc)-
* -1k..64k value size, 500 threads, 1M keys, 50% read ratio, 32GB: 195k 
writes/sec, 195k reads/sec, ~9k evictions/sec, write: .2ms (99perc), read: 
.1ms(99perc)-
* -1k..64k value size, 500 threads, 1M keys, 10% read ratio, 32GB: 185k 
writes/sec, 20k reads/sec, ~7k evictions/sec, write: 4ms (99perc), read: 
.07ms(99perc)-
* -1k..16k value size, 500 threads, 5M keys, 90% read ratio, 32GB: 110k 
writes/sec, 1M reads/sec, 30k evictions/sec, write: .04ms (99perc), read: 
.01ms(99perc)-
* -1k..16k value size, 500 threads, 5M keys, 50% read ratio, 32GB: 420k 
writes/sec, 420k reads/sec, 125k evictions/sec, write: .06ms (99perc), read: 
.01ms(99perc)-
* -1k..16k value size, 500 threads, 5M keys, 10% read ratio, 32GB: 435k 
writes/sec, 48k reads/sec, 130k evictions/sec, write: .06ms (99perc), read: 
.01ms(99perc)-
* -1k..4k value size, 500 threads, 20M keys, 90% read ratio, 32GB: 140k 
writes/sec, 1.25M reads/sec, 50k evictions/sec, write: .02ms (99perc), read: 
.005ms(99perc)-
* -1k..4k value size, 500 threads, 20M keys, 50% read ratio, 32GB: 530k 
writes/sec, 530k reads/sec, 220k evictions/sec, write: .04ms (99perc), read: 
.005ms(99perc)-
* -1k..4k value size, 500 threads, 20M keys, 10% read ratio, 32GB: 665k 
writes/sec, 74k reads/sec, 250k evcictions/sec, write: .04ms (99perc), read: 
.005ms(99perc)-

Command line to execute the benchmark:
{code}
java -jar ohc-benchmark/target/ohc-benchmark-0.1-SNAPSHOT.jar -rkd 
'uniform(1..2000)' -wkd 'uniform(1..2000)' -vs 'gaussian(1024..4096,2)' 
-r .1 -cap 320 -d 86400 -t 500 -dr 8

-r = read rate
-d = duration
-t = # of threads
-dr = # of driver threads that feed the worker threads
-rkd = read key distribution
-wkd = write key distribution
-vs = value size
-cap = capacity
{code}

Sample bucket histogram from 20M test:
{code}
[0..0]: 8118604
[1..1]: 5892298
[2..2]: 2138308
[3..3]: 518089
[4..4]: 94441
[5..5]: 13672
[6..6]: 1599
[7..7]: 189
[8..9]: 16
{code}

After running into that memory management issue with varying allocation sizes 
of a few kB to several MB, I think that it’s still worth working on our own 
off-heap memory management. Maybe some block-based approach (fixed or 
variable). But that’s out of the scope of this ticket.

EDIT: The problem with high system-CPU usage only persists on systems with 
multiple CPUs. Cross check with the second CPU socket disabled - calling the 
benchmark with {{taskset 0x3ff java -jar ...}}  does not show 95% system CPU 
usage.

EDIT2: Marked benchmark values as invalid (see my comment on 01/

[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2015-01-09 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14271745#comment-14271745
 ] 

Robert Stupp edited comment on CASSANDRA-7438 at 1/9/15 7:21 PM:
-

Note: OHC now has cache-loader support (https://github.com/snazy/ohc/issues/3). 
Could be an alternative for RowCacheSentinel.
EDIT: in a C* follow-up ticket
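
For context, the point of a cache loader as a sentinel replacement is that the 
cache itself can guarantee a single load per missing key while concurrent 
readers wait. A generic sketch of that behavior (purely illustrative - not the 
OHC API, and without eviction) could look like:
{code}
// Generic illustration of the "loader" idea (not OHC's actual API): at most one
// thread runs the loader per key; concurrent readers wait on the same future.
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class LoadingCacheSketch<K, V>
{
    private final ConcurrentHashMap<K, CompletableFuture<V>> map = new ConcurrentHashMap<>();

    V getWithLoader(K key, Function<K, V> loader) throws Exception
    {
        CompletableFuture<V> f = map.computeIfAbsent(
                key, k -> CompletableFuture.supplyAsync(() -> loader.apply(k)));
        return f.get();
    }
}
{code}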


was (Author: snazy):
Note: OHC now has cache-loader support (https://github.com/snazy/ohc/issues/3). 
Could be an alternative for RowCacheSentinel.

> Serializing Row cache alternative (Fully off heap)
> --
>
> Key: CASSANDRA-7438
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
> Environment: Linux
>Reporter: Vijay
>Assignee: Robert Stupp
>  Labels: performance
> Fix For: 3.0
>
> Attachments: 0001-CASSANDRA-7438.patch, tests.zip
>
>
> Currently SerializingCache is partially off heap, keys are still stored in 
> JVM heap as BB, 
> * There is a higher GC costs for a reasonably big cache.
> * Some users have used the row cache efficiently in production for better 
> results, but this requires careful tunning.
> * Overhead in Memory for the cache entries are relatively high.
> So the proposal for this ticket is to move the LRU cache logic completely off 
> heap and use JNI to interact with cache. We might want to ensure that the new 
> implementation match the existing API's (ICache), and the implementation 
> needs to have safe memory access, low overhead in memory and less memcpy's 
> (As much as possible).
> We might also want to make this cache configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2015-01-12 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274005#comment-14274005
 ] 

Ariel Weisberg edited comment on CASSANDRA-7438 at 1/12/15 7:25 PM:


If you go all the way down the JMH rabbit hole you don't need to do any of your 
own timing and JMH will actually do some smart things to give you accurate 
timing and ameliorate the impact of non-scalable/expensive timing measurement. 
Metrics uses System.nanoTime() internally so it isn't really any better as far 
as I can tell. System.nanoTime() on Linux is pretty scalable 
http://shipilev.net/blog/2014/nanotrusting-nanotime/. When I tested it in JMH 
it actually seemed to be linearly scalable, but JMH will solve that for you 
even on platforms where nanoTime is finicky.
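
For reference, a minimal JMH benchmark does no timing of its own; the sketch 
below uses a plain HashMap as a stand-in for the cache (an assumption for 
illustration - the real ohc-benchmark would hit the off-heap cache instead):
{code}
// Minimal JMH sketch - JMH does all timing/warmup itself, the benchmark method
// just performs the operation under test.
import org.openjdk.jmh.annotations.*;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

@State(Scope.Benchmark)
public class CacheGetBenchmark
{
    private final Map<Long, byte[]> cache = new HashMap<>();

    @Setup
    public void populate()
    {
        for (long k = 0; k < 100_000; k++)
            cache.put(k, new byte[128]);
    }

    @Benchmark
    public byte[] get()
    {
        return cache.get(ThreadLocalRandom.current().nextLong(100_000));
    }
}
{code}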

The C* integration looks good. I'm glad it was easy. When it comes to exposing 
configuration parameters less is more. I would prefer not to expose anything 
new because once people start using them they don't like to have the options 
taken away (or disabled). We should make an effort to set them automatically 
(or good enough) and if that fails we can add user visible configuration. My 
preference is to make the options accessible via properties as an escape hatch 
in production, and then add them to config if we really can't set them 
automatically.

The stress tool when used without workload profiles does some validation. It 
checks that values are there and that the contents are correct.

Did not know about the JNA synchronized block. That is surprising, but I am 
glad to hear it is getting fixed. For access to jemalloc I recommend using 
unsafe and LD_PRELOAD jemalloc. I think that would be the recommended approach 
and the one you should benchmark against and JNA would be there as a fallback. 
That gives you a JNI call for allocation/deallocation.

I am trying out the JMH benchmark and looking at the new linked implementation 
right now. How are you starting the JMH benchmark?



was (Author: aweisberg):
If you go all the way down the JMH rabbit hole you don't need to do any of your 
own timing and JMH will actually do some smart things to give you accurate 
timing and ameliorate the impact of non-scalable/expensive timing measurement. 
Metrics uses System.nanoTime() internally so it isn't really any better as far 
as I can tell. System.nanoTime() on Linux is pretty scalable 
http://shipilev.net/blog/2014/nanotrusting-nanotime/. When I tested it in JMH 
it actually seemed to be linearly scalable, but JMH will solve that for you 
even on platforms where nanoTime is finicky.

The C* integration looks good. I'm glad it was easy. When it comes to exposing 
configuration parameters less is more

The stress tool when used without workload profiles does some validation. It 
checks that values are there and that the contents are correct.

Did not know about the JNA synchronized block. That is surprising, but I am 
glad to hear it is getting fixed. For access to jemalloc I recommend using 
unsafe and LD_PRELOAD jemalloc. I think that would be the recommended approach 
and the one you should benchmark against and JNA would be there as a fallback. 
That gives you a JNI call for allocation/deallocation.

I am trying out the JMH benchmark and looking at the new linked implementation 
right now. How are you starting the JMH benchmark?


> Serializing Row cache alternative (Fully off heap)
> --
>
> Key: CASSANDRA-7438
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
> Environment: Linux
>Reporter: Vijay
>Assignee: Robert Stupp
>  Labels: performance
> Fix For: 3.0
>
> Attachments: 0001-CASSANDRA-7438.patch, tests.zip
>
>
> Currently SerializingCache is partially off heap, keys are still stored in 
> JVM heap as BB, 
> * There is a higher GC costs for a reasonably big cache.
> * Some users have used the row cache efficiently in production for better 
> results, but this requires careful tunning.
> * Overhead in Memory for the cache entries are relatively high.
> So the proposal for this ticket is to move the LRU cache logic completely off 
> heap and use JNI to interact with cache. We might want to ensure that the new 
> implementation match the existing API's (ICache), and the implementation 
> needs to have safe memory access, low overhead in memory and less memcpy's 
> (As much as possible).
> We might also want to make this cache configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2015-01-12 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274005#comment-14274005
 ] 

Ariel Weisberg edited comment on CASSANDRA-7438 at 1/12/15 7:28 PM:


If you go all the way down the JMH rabbit hole you don't need to do any of your 
own timing and JMH will actually do some smart things to give you accurate 
timing and ameliorate the impact of non-scalable/expensive timing measurement. 
Metrics uses System.nanoTime() internally so it isn't really any better as far 
as I can tell. System.nanoTime() on Linux is pretty scalable 
http://shipilev.net/blog/2014/nanotrusting-nanotime/. When I tested it in JMH 
it actually seemed to be linearly scalable, but JMH will solve that for you 
even on platforms where nanoTime is finicky.

The C* integration looks good. I'm glad it was easy. When it comes to exposing 
configuration parameters less is more. I would prefer not to expose anything 
new because once people start using them they don't like to have the options 
taken away (or disabled). We should make an effort to set them automatically 
(or good enough) and if that fails we can add user visible configuration. My 
preference is to make the options accessible via properties as an escape hatch 
in production, and then add them to config if we really can't set them 
automatically.

Can you prefix any System properties you have with a classname/package or 
something that makes it clear they are part of OHC?

The stress tool when used without workload profiles does some validation. It 
checks that values are there and that the contents are correct.

Did not know about the JNA synchronized block. That is surprising, but I am 
glad to hear it is getting fixed. For access to jemalloc I recommend using 
unsafe and LD_PRELOAD jemalloc. I think that would be the recommended approach 
and the one you should benchmark against and JNA would be there as a fallback. 
That gives you a JNI call for allocation/deallocation.

I am trying out the JMH benchmark and looking at the new linked implementation 
right now. How are you starting the JMH benchmark?



was (Author: aweisberg):
If you go all the way down the JMH rabbit hole you don't need to do any of your 
own timing and JMH will actually do some smart things to give you accurate 
timing and ameliorate the impact of non-scalable/expensive timing measurement. 
Metrics uses System.nanoTime() internally so it isn't really any better as far 
as I can tell. System.nanoTime() on Linux is pretty scalable 
http://shipilev.net/blog/2014/nanotrusting-nanotime/. When I tested it in JMH 
it actually seemed to be linearly scalable, but JMH will solve that for you 
even on platforms where nanoTime is finicky.

The C* integration looks good. I'm glad it was easy. When it comes to exposing 
configuration parameters less is more. I would prefer not to expose anything 
new because once people start using them they don't like to have the options 
taken away (or disabled). We should make an effort to set them automatically 
(or good enough) and if that fails we can add user visible configuration. My 
preference is to make the options accessible via properties as an escape hatch 
in production, and then add them to config if we really can't set them 
automatically.

The stress tool when used without workload profiles does some validation. It 
checks that values are there and that the contents are correct.

Did not know about the JNA synchronized block. That is surprising, but I am 
glad to hear it is getting fixed. For access to jemalloc I recommend using 
unsafe and LD_PRELOAD jemalloc. I think that would be the recommended approach 
and the one you should benchmark against and JNA would be there as a fallback. 
That gives you a JNI call for allocation/deallocation.

I am trying out the JMH benchmark and looking at the new linked implementation 
right now. How are you starting the JMH benchmark?


> Serializing Row cache alternative (Fully off heap)
> --
>
> Key: CASSANDRA-7438
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
> Environment: Linux
>Reporter: Vijay
>Assignee: Robert Stupp
>  Labels: performance
> Fix For: 3.0
>
> Attachments: 0001-CASSANDRA-7438.patch, tests.zip
>
>
> Currently SerializingCache is partially off heap, keys are still stored in 
> JVM heap as BB, 
> * There is a higher GC costs for a reasonably big cache.
> * Some users have used the row cache efficiently in production for better 
> results, but this requires careful tunning.
> * Overhead in Memory for the cache entries are relatively high.
> So the proposal for this ticket is to move the LRU cache logic comple

[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2015-01-19 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14282467#comment-14282467
 ] 

Robert Stupp edited comment on CASSANDRA-7438 at 1/19/15 12:41 PM:
---

I think possibly the best alternative to access malloc/free is {{Unsafe}} with 
jemalloc in LD_PRELOAD. The native code of {{Unsafe.allocateMemory}} is 
basically just a wrapper around {{malloc()}}/{{free()}}.
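
A tiny sketch of that from the Java side (illustrative only): allocation goes 
through {{Unsafe.allocateMemory}}/{{freeMemory}}, which bottom out in the 
process allocator, so an LD_PRELOAD'ed jemalloc takes over without any Java 
code change.
{code}
// Illustrative allocator sketch: Unsafe.allocateMemory/freeMemory map to the
// process allocator, so starting the JVM with LD_PRELOAD=libjemalloc.so swaps
// the implementation without touching the Java code.
import sun.misc.Unsafe;
import java.lang.reflect.Field;

final class UnsafeAllocator
{
    private static final Unsafe unsafe = loadUnsafe();

    static long allocate(long bytes)
    {
        return unsafe.allocateMemory(bytes);   // malloc()
    }

    static void free(long address)
    {
        unsafe.freeMemory(address);            // free()
    }

    private static Unsafe loadUnsafe()
    {
        try
        {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        }
        catch (Exception e)
        {
            throw new AssertionError(e);
        }
    }
}
{code}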

Updated the git branch with the following changes:
* update to OHC 0.3
* benchmark: add new command line option to specify key length (-kl)
* free capacity handling moved to segments
* allow to specify preferred memory allocation via system property 
"org.caffinitas.ohc.allocator"
* allow to specify defaults of OHCacheBuilder via system properties prefixed 
with "org.caffinitas.org."
* benchmark: make metrics local to the driver threads
* benchmark: disable bucket histogram in stats by default

I did not change the default number of segments = 2 * CPUs - but I thought 
about that (since you experienced that 256 segments on c3.8xlarge gives some 
improvement). A naive approach to say e.g. 8 * CPUs feels too heavy for small 
systems (with one socket) and might be too much outside of benchmarking. If 
someone wants to get most out of it in production and really hits the number of 
segments, he can always configure it better. WDYT?

Using jemalloc on Linux via LD_PRELOAD is probably the way to go in C* (since 
off-heap is also used elsewhere).
I think we should leave the OS allocator on OSX.
Don't know much about allocator performance on Windows.

For now I do not plan any new features in OHC for C* - so maybe we shall start 
a final review round?


was (Author: snazy):
I think the possibly best alternative to access malloc/free is {{Unsafe}} with 
jemalloc in LD_PRELOAD. Native code of {{Unsafe.allocateMemory}} is basically 
just a wrapper around {{malloc()}}/{{free()}}.

Updated the git branch with the following changes:
* update to OHC 0.3
* benchmark: add new command line option to specify key length (-kl)
* free capacity handling moved to segments
* allow to specify preferred memory allocation via system property 
"org.caffinitas.ohc.allocator"
* allow to specify defaults of OHCacheBuilder via system properties prefixed 
with "org.caffinitas.org."
* benchmark: make metrics local to the driver threads
* benchmark: disable bucket histogram in stats by default

I did not change the default number of segments = 2 * CPUs - but I thought 
about that (since you experienced that 256 segments on c3.8xlarge gives some 
improvement). A naive approach to say e.g. 8 * CPUs feels too heavy for small 
systems (with one socket) and might be too much outside of benchmarking. If 
someone wants to get most out of it in production and really hits the number of 
segments, he can always configure it better. WDYT?

Using jemalloc on Linux via LD_PRELOAD is probably the way to go in C* (since 
off-heap is also used elsewhere).
I think we should leave the OS allocator on OSX.
Don't know much about allocator performance on Windows.

For now I do not plan any new features for C* - so maybe we shall start a final 
review round?

> Serializing Row cache alternative (Fully off heap)
> --
>
> Key: CASSANDRA-7438
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
> Environment: Linux
>Reporter: Vijay
>Assignee: Robert Stupp
>  Labels: performance
> Fix For: 3.0
>
> Attachments: 0001-CASSANDRA-7438.patch, tests.zip
>
>
> Currently SerializingCache is partially off heap, keys are still stored in 
> JVM heap as BB, 
> * There is a higher GC costs for a reasonably big cache.
> * Some users have used the row cache efficiently in production for better 
> results, but this requires careful tunning.
> * Overhead in Memory for the cache entries are relatively high.
> So the proposal for this ticket is to move the LRU cache logic completely off 
> heap and use JNI to interact with cache. We might want to ensure that the new 
> implementation match the existing API's (ICache), and the implementation 
> needs to have safe memory access, low overhead in memory and less memcpy's 
> (As much as possible).
> We might also want to make this cache configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-09-23 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145338#comment-14145338
 ] 

Robert Stupp edited comment on CASSANDRA-7438 at 9/23/14 8:01 PM:
--

(note: [~vijay2...@yahoo.com], please use the other nick)

Some quick notes:
* Can you add the assertion for {{capacity <= 0}} to 
{{OffheapCacheProvider.create}}? The current error message when 
{{row_cache_size_in_mb}} is not set (or invalid), "capacity should be set", 
could be more descriptive (see the sketch after this list)
* Additionally, the {{capacity}} check should also cover negative values (it 
starts out with a negative value - I don't know what happens if it stays 
negative...)
* {{org.apache.cassandra.db.RowCacheTest#testRowCacheCleanup}} fails at the 
last assertion - all other unit tests seem to work
* Documentation in cassandra.yaml for row_cache_provider could be a bit more 
verbose - just a short summary of the characteristics and limitations (e.g. the 
off-heap provider only works on Linux + OSX) of both implementations
* IMO it would be fine to have a general unit test for 
{{com.lruc.api.LRUCache}} in C* code, too
* Please add an adapted copy of {{RowCacheTest}} for OffheapCacheProvider
* Unit tests using OffheapCacheProvider must not run on Windows builds - 
please add a check in OffHeapCacheProvider asserting that it runs on Linux 
or OSX
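
Roughly the kind of capacity check meant in the first point above - illustrative 
only; the method name and message wording are not from the patch:

{code:java}
// Hedged sketch (not the actual patch): reject a zero or negative capacity with an error
// message that points at the yaml option instead of a bare "capacity should be set".
final class CapacityCheck
{
    static long validateCapacity(long capacityInBytes)
    {
        if (capacityInBytes <= 0)
            throw new IllegalArgumentException(
                    "Invalid off-heap row cache capacity: " + capacityInBytes
                    + " bytes - is row_cache_size_in_mb set to a positive value?");
        return capacityInBytes;
    }

    public static void main(String[] args)
    {
        System.out.println(validateCapacity(64L * 1024 * 1024) + " bytes accepted");
    }
}
{code}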

Sorry for the late reply


was (Author: snazy):
(note: [~vijay2...@gmail.com], please use the other nick)

Some quick notes:
* Can you add the assertion for {{capacity <= 0}} to 
{{OffheapCacheProvider.create}} - the current error message if 
{{row_cache_size_in_mb}} is not set (or invalid) "capacity should be set" could 
be more fleshy
* Additionally the {{capacity}} check should also check for negative values (it 
starts with a negative value - don't know what happens if it is negative...)
* {{org.apache.cassandra.db.RowCacheTest#testRowCacheCleanup}} fails at the 
last assertion - all other unit tests seem to work
* Documentation in cassandra.yaml for row_cache_provider could be a bit more 
verbose - just some abstract about the characteristics and limitation (e.g. 
Offheap does only work on Linux + OSX) of both implementations
* IMO it would be fine to have a general unit test for 
{{com.lruc.api.LRUCache}} in C* code, too
* Please add an adopted copy of {{RowCacheTest}} for OffheapCacheProvider
* unit tests using OffheapCacheProvider must not start on Windows builds - 
please add an assertion in OffHeapCacheProvider to assert that it runs on Linux 
or OSX

Sorry for the late reply

> Serializing Row cache alternative (Fully off heap)
> --
>
> Key: CASSANDRA-7438
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
> Environment: Linux
>Reporter: Vijay
>Assignee: Vijay
>  Labels: performance
> Fix For: 3.0
>
> Attachments: 0001-CASSANDRA-7438.patch
>
>
> Currently SerializingCache is partially off heap, keys are still stored in 
> JVM heap as BB, 
> * There is a higher GC costs for a reasonably big cache.
> * Some users have used the row cache efficiently in production for better 
> results, but this requires careful tunning.
> * Overhead in Memory for the cache entries are relatively high.
> So the proposal for this ticket is to move the LRU cache logic completely off 
> heap and use JNI to interact with cache. We might want to ensure that the new 
> implementation match the existing API's (ICache), and the implementation 
> needs to have safe memory access, low overhead in memory and less memcpy's 
> (As much as possible).
> We might also want to make this cache configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-10-03 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157989#comment-14157989
 ] 

Jonathan Ellis edited comment on CASSANDRA-7438 at 10/3/14 1:55 PM:


Are you still working on this, [~vijay2...@yahoo.com]?


was (Author: jbellis):
Are you still working on this, [~vijay2...@gmail.com]?

> Serializing Row cache alternative (Fully off heap)
> --
>
> Key: CASSANDRA-7438
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
> Environment: Linux
>Reporter: Vijay
>Assignee: Vijay
>  Labels: performance
> Fix For: 3.0
>
> Attachments: 0001-CASSANDRA-7438.patch
>
>
> Currently SerializingCache is partially off heap, keys are still stored in 
> JVM heap as BB, 
> * There is a higher GC costs for a reasonably big cache.
> * Some users have used the row cache efficiently in production for better 
> results, but this requires careful tunning.
> * Overhead in Memory for the cache entries are relatively high.
> So the proposal for this ticket is to move the LRU cache logic completely off 
> heap and use JNI to interact with cache. We might want to ensure that the new 
> implementation match the existing API's (ICache), and the implementation 
> needs to have safe memory access, low overhead in memory and less memcpy's 
> (As much as possible).
> We might also want to make this cache configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-11-03 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195418#comment-14195418
 ] 

Ariel Weisberg edited comment on CASSANDRA-7438 at 11/4/14 12:06 AM:
-

RE refcount:

I think hazard pointers (I have never used them personally) are the no-GC, 
no-refcount way of handling this. The refcount also won't be fetched twice if 
it is uncontended - which in many cases it will be, since it should be decref'd 
as soon as the data is copied.

I think that with the right QA work this solves the problem of running 
arbitrarily large caches. That means running a validating workload in 
continuous integration that demonstrates the cache doesn't lock up, leak, or 
return the wrong answer. I would probably test directly against the cache to 
get more iterations in.
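
As a rough illustration of such a validating workload - the cache interface 
below is a hypothetical stand-in, not the real API, and values are derived 
deterministically from the key so results can be checked without keeping an 
on-heap copy of everything:

{code:java}
import java.util.Arrays;

// Sketch of a "never return the wrong answer" check against a hypothetical cache interface.
final class CacheValidator
{
    interface ByteCache { void put(long key, byte[] value); byte[] get(long key); }

    // value bytes are a pure function of the key, so any returned entry can be re-derived
    static byte[] valueFor(long key, int len)
    {
        byte[] v = new byte[len];
        for (int i = 0; i < len; i++)
            v[i] = (byte) (key + i);
        return v;
    }

    static void validate(ByteCache cache, long keys, int valueLen)
    {
        for (long k = 0; k < keys; k++)
            cache.put(k, valueFor(k, valueLen));
        for (long k = 0; k < keys; k++)
        {
            byte[] v = cache.get(k);
            // an entry may legitimately have been evicted, but a returned entry
            // must never contain the wrong bytes
            if (v != null && !Arrays.equals(v, valueFor(k, valueLen)))
                throw new AssertionError("corrupt entry for key " + k);
        }
    }
}
{code}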

RE Implementation as a library via JNI:

We give up something by using JNI so it only makes sense if we get something 
else in return. The QA and release work created by JNI is pretty large. You 
really need a plan for running something like Valgrind or similar against a 
comprehensive suite of tests. Valgrind doesn't run well with Java AFAIK so you 
end up doing things like running the native code in a separate process, and 
have to write an interface amenable to that. Valgrind is also slow enough that 
if you try to run all your tests against a configuration using it heavily, you 
end up with timeouts and many hours to run all the tests, plus time spent 
interpreting results.

Unsafe is worse in some respects because there is no Valgrind, and I can attest 
that debugging an off-heap red-black tree is not fun.

I am not clear on why the JNI is justified. It really seems like this could be 
written against Unsafe and then it would work on any platform. There are no 
libraries or system calls in use that are only accessible via JNI. I think JNI 
would make more sense if we were pulling in existing code like memcached that 
already handles memory pooling, fragmentation, and concurrency. If it were in 
Java you could use Disruptor for the queue and would only need to implement a 
thread safe off heap hash table.

RE Performance and implementation:

What kind of hardware was the benchmark run on? Server class NUMA? I am just 
wondering if there are enough cores to bring out any scalability issues in the 
cache implementation.

It would be nice to see a benchmark that showed the on heap cache falling over 
while the off heap cache provides good performance.

Subsequent comments aren't particularly useful if performance is satisfactory 
under relevant configurations.

Given the use of a heap allocator and locking it might not make sense to have a 
background thread do expiration. I think that splitting the cache into several 
instances with one lock around each instance might result in less contention 
overall and it would scale up in a more straightforward way.

It appears that some common operations will hit a global lock in may_expire() 
quite frequently? It seems like there are other globally shared, frequently 
mutated cache lines in the write path, like stats.

Is there something subtle in the locking that makes the use of the custom queue 
and maps necessary or could you use stuff from Intel TBB and still make it 
work? It is hypothetically less code to have to QA and maintain.

I still need to dig more, but I am also not clear on why locks are necessary 
for individual items. It looks like there is a table for all of them? Random 
intuition is that it could be done without a lock, or at least without a 
discrete lock per item. Striping against a padded pool of locks might make 
sense if that isn't going to cause deadlocks. Apparently every pthread_mutex_t 
is 40 bytes, according to a random Stack Overflow post. It might make sense to 
store a lock field in the same cache line as the refcount, or in the bucket of 
the hash table?
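
For illustration, the generic lock-striping pattern referred to above (plain 
Java, not the lruc code):

{code:java}
import java.util.concurrent.locks.ReentrantLock;

// Generic lock-striping sketch: a fixed, power-of-two pool of locks; each key hashes to
// one stripe, so unrelated keys rarely contend on the same lock.
final class StripedLocks
{
    private final ReentrantLock[] stripes;

    StripedLocks(int count)
    {
        // count must be a power of two so (hash & mask) selects a stripe
        stripes = new ReentrantLock[count];
        for (int i = 0; i < count; i++)
            stripes[i] = new ReentrantLock();
    }

    ReentrantLock lockFor(long keyHash)
    {
        return stripes[(int) (keyHash & (stripes.length - 1))];
    }
}
{code}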

Another implementation question: do we want to use C++11? It would remove a 
lot of platform- and compiler-specific code.


was (Author: aweisberg):
RE refcount:

I think hazard pointers (never used them personally) are the no-gc no-refcount 
way of handling this. It also won't be fetched twice if it is uncontended which 
in many cases it will be since it should be decrefd as soon as the data is 
copied.

I think that with the right QA work this solves the problem of running 
arbitrarily large caches. That means running a validating workload in 
continuous integration that demonstrates the cache doesn't lock up, leak, or 
return the wrong answer. I would probably test directly against the cache to 
get more iterations in.

RE Implementation as a library via JNI:

We give up something by using JNI so it only makes sense if we get something 
else in return. The QA and release work created by JNI is pretty large. You 
really need a plan for running something like Valgrind or similar against a 
comprehensive suite of tests.

[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-11-03 Thread Vijay (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195679#comment-14195679
 ] 

Vijay edited comment on CASSANDRA-7438 at 11/4/14 3:54 AM:
---

Thanks for reviewing!
{quote}
 I am also not clear on why locks are necessary for individual items.
{quote}
No, we don't have per-item locks. We have locks per segment; this is very 
similar to lock striping / Java's ConcurrentHashMap.
{quote}
 global lock in may_expire() quite frequently?
{quote}
Not really: we only lock globally when we reach 100% of the space, then free 
entries until we are back down to 80%, and we spread that overhead across other 
threads based on whoever holds the item partition lock. It won't be hard to 
make this part of the queue thread; I will try that for the next release of 
lruc.
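
A rough sketch of the high/low watermark behaviour described above - 
illustrative Java only; the real lruc code is native and works on off-heap 
memory, and it spreads this work across the threads holding the partition locks:

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative only: once usage reaches capacity, evict from the LRU end until usage
// falls back to a ~80% low watermark.
final class WatermarkEviction
{
    private final long capacity;
    private long usedBytes;
    // each element stands for an entry: [sizeInBytes, offHeapAddress]
    private final Deque<long[]> lru = new ArrayDeque<>();

    WatermarkEviction(long capacity) { this.capacity = capacity; }

    void account(long sizeInBytes, long address)
    {
        lru.addFirst(new long[]{ sizeInBytes, address });
        usedBytes += sizeInBytes;
        maybeExpire();
    }

    void maybeExpire()
    {
        if (usedBytes < capacity)
            return;                              // below the high watermark
        long lowWatermark = capacity * 8 / 10;
        while (usedBytes > lowWatermark && !lru.isEmpty())
        {
            long[] victim = lru.pollLast();      // least recently used entry
            usedBytes -= victim[0];
            // here the off-heap memory at victim[1] would be freed
        }
    }
}
{code}
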
{quote}
What kind of hardware was the benchmark run on?
{quote}
32 cores, 100 GB RAM, NUMA, Intel Xeon. There is a benchmark util, also checked 
in as part of the lruc code, which does exactly this kind of test.
{quote}
You really need a plan for running something like Valgrind
{quote}
Good point. I was partway down that road and still have the code; I can 
resurrect it for the next lruc version.
{quote}
I am not clear on why the JNI is justified
{quote}
There are some comments above with the reasoning for it (please see the earlier 
comments). PS: I believe there were some tickets complaining about the overhead 
of the current RowCache.
{quote}
I think JNI would make more sense if we were pulling in existing code like 
memcached
{quote}
If you look at the code, it is close to memcached. I actually started by 
stripping down the memcached code so we could run it in-process instead of as a 
separate process, and by removing the global locks in queue reallocation etc., 
and it eventually diverged too much from it. The other reason it doesn't use 
slab allocators is that we wanted the memory allocator to do the right thing - 
we have already tested Cassandra with jemalloc.

To offer some comfort: lruc is already running in our production :)


was (Author: vijay2...@yahoo.com):
Thanks for reviewing!
{quote}
 I am also not clear on why locks are necessary for individual items.
{quote}
No we don't. We have locks per Segment, this is very similar to lock stripping 
or the smiler to Java's concurrent hash map.
{quote}
 global lock in may_expire() quite frequently?
{quote}
Not really we lock globally when we reach 100% of the space and we freeup to 
80% of the space and we spread the overhead to other threads based on who ever 
has the item partition lock. It won't be hard to make this part of the queue 
thread and will try it for the next release of lruc.
{quote}
What kind of hardware was the benchmark run on?
{quote}
32 core 100GB RAM with numa and intel xeon. There is a benchmark util which is 
also checked in as a part of the lruc code which does exactly the same kind of 
test.
{quote}
You really need a plan for running something like Valgrind
{quote}
Good point, I was part way down that road and still have the code i can 
resuruct it for the next lruc version.
{quote}
I am not clear on why the JNI is justified
{quote}
There is some comments above which has the reasoning for it (please see the 
above comments). PS: I believe there was some tickets on Current RowCache 
complaining about the overhead.
{quote}
I think JNI would make more sense if we were pulling in existing code like 
memcached
{quote}
If you look at the code closer to memcached. Actually I started of stripping 
memcached code so we can run it in process instead of running as a separate 
process and removing the global locks in queue reallocation etc and eventually 
diverged too much from it. The other reason it doesn't use slab allocators is 
because we wanted the memory allocators to do the right thing we already have 
tested Cassandra with Jemalloc.

To confort a bit lruc is running in our production already :)

> Serializing Row cache alternative (Fully off heap)
> --
>
> Key: CASSANDRA-7438
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
> Environment: Linux
>Reporter: Vijay
>Assignee: Vijay
>  Labels: performance
> Fix For: 3.0
>
> Attachments: 0001-CASSANDRA-7438.patch
>
>
> Currently SerializingCache is partially off heap, keys are still stored in 
> JVM heap as BB, 
> * There is a higher GC costs for a reasonably big cache.
> * Some users have used the row cache efficiently in production for better 
> results, but this requires careful tunning.
> * Overhead in Memory for the cache entries are relatively high.
> So the proposal for this ticket is to move the LRU cache logic completely off 
> heap and use JNI to interact with cache. We might want to ensure that the new 
> implementation match the existing API's (ICache), and the implementation 
> needs to have safe memory access, low overhead in memory and less memcpy's 
> (As much as possible).
> We might also want to make this cache configurable.

[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-11-04 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196296#comment-14196296
 ] 

Ariel Weisberg edited comment on CASSANDRA-7438 at 11/4/14 4:13 PM:


bq. No we don't. We have locks per Segment, this is very similar to lock 
stripping/Java's concurrent hash map.
Thanks for clearing that up

bq. Not really we lock globally when we reach 100% of the space and we freeup 
to 80% of the space and we spread the overhead to other threads based on who 
ever has the item partition lock. It won't be hard to make this part of the 
queue thread and will try it for the next release of lruc.

OK, that makes sense. 20% of the cache could be many milliseconds of work if 
you are using many gigabytes of cache. That's not a great thing to foist on a 
random victim thread. If you handed that to the queue thread, well, I think you 
run into another issue, which is that the ring buffer doesn't appear to check 
for queue full? The queue thread could go out to lunch for a while. Not a big 
deal, but finer-grained scheduling will probably be necessary.
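
For illustration, a bounded hand-off where the full-queue case is explicit 
(generic Java, not the lruc ring buffer):

{code:java}
import java.util.concurrent.ArrayBlockingQueue;

// Generic sketch: a bounded queue whose "full" case is handled explicitly instead of
// silently overwriting entries or blocking the producer indefinitely.
final class AccessQueue
{
    private final ArrayBlockingQueue<Long> queue = new ArrayBlockingQueue<>(64 * 1024);

    /** Record a cache access; returns false if the queue is full and the event was dropped. */
    boolean offer(long entryAddress)
    {
        // dropping an LRU "touch" is harmless; losing an eviction request would not be,
        // so a real implementation would need a fallback path for the full case
        return queue.offer(entryAddress);
    }

    Long poll()
    {
        return queue.poll();
    }
}
{code}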

bq. If you look at the code closer to memcached. Actually I started of 
stripping memcached code so we can run it in process instead of running as a 
separate process and removing the global locks in queue reallocation etc and 
eventually diverged too much from it. The other reason it doesn't use slab 
allocators is because we wanted the memory allocators to do the right thing we 
already have tested Cassandra with Jemalloc.

Ah very cool.

jemalloc is not a moving allocator, whereas it looks like memcached slabs 
implement rebalancing to accommodate changes in size distribution. That would 
actually be one of the really nice things to keep, IMO. On large-memory systems 
with a cache that scales and performs, you would end up dedicating as much RAM 
as possible to the row cache/key cache and not the page cache, since the page 
cache is not as granular (correct me if the story for C* is different). If you 
dedicate 80% of RAM to the cache, that doesn't leave a lot of space for 
fragmentation. By using a heap allocator you also lose the ability to enforce 
hard, predictable limits on memory used by the cache, since you didn't map it 
yourself. I could be totally off base and jemalloc might be good enough.

bq. There is some comments above which has the reasoning for it (please see the 
above comments). PS: I believe there was some tickets on Current RowCache 
complaining about the overhead.
I don't have a performance beef with JNI, especially the way you have done it, 
which I think is pretty efficient. I think the overhead of JNI (one or two 
slightly more expensive function calls) would be eclipsed by things like the 
cache misses, coherence traffic, and pipeline stalls that are part of accessing 
and maintaining a concurrent cache (Java or C++). It's all just intuition 
without comparative microbenchmarks of the two caches. Java might look a little 
faster just due to allocator performance, but we know you pay for that in other 
ways.

I think what you have made scratches the itch for a large cache quite well, and 
beats the status quo. I don't agree that Unsafe couldn't do the exact same 
thing with no on-heap references.

The hash table, ring buffer, and individual item entries are all being 
malloc'ed, and you can do that from Java using Unsafe. You don't need to 
implement a ring buffer because you can use the Disruptor. I also wonder if 
splitting the cache into several instances, each with a coarse lock per 
instance, wouldn't result in simpler - and, I know performance is not an issue, 
still fast enough - code. I don't want to advocate doing something different 
for performance, but rather point out that there is the possibility of a 
relatively simple implementation via Unsafe.

You could coalesce all the contended fields for each instance (stats, lock 
field, LRU head) into a single cache line, and then rely on a single barrier 
when releasing a coarse-grained lock. The fine-grained locking and CASing 
results in several pipeline stalls, because the memory barriers implicit in 
each one require the store buffers to drain. There may even be a suitable 
off-heap map implementation out there already.
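
To illustrate the {{Unsafe}} point: off-heap allocation and raw access straight 
from Java, with no JNI involved (sketch only):

{code:java}
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Sketch of off-heap allocation from Java: malloc/free and raw reads/writes via
// sun.misc.Unsafe, i.e. no JNI glue code of our own.
final class OffHeapBlock
{
    private static final Unsafe UNSAFE = loadUnsafe();

    private static Unsafe loadUnsafe()
    {
        try
        {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        }
        catch (Exception e)
        {
            throw new AssertionError(e);
        }
    }

    public static void main(String[] args)
    {
        long address = UNSAFE.allocateMemory(1024);   // effectively malloc(1024)
        try
        {
            UNSAFE.putLong(address, 42L);             // raw write at the given address
            assert UNSAFE.getLong(address) == 42L;
        }
        finally
        {
            UNSAFE.freeMemory(address);               // effectively free()
        }
    }
}
{code}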


was (Author: aweisberg):
.bq No we don't. We have locks per Segment, this is very similar to lock 
stripping/Java's concurrent hash map.
Thanks for clearing that up

.bq Not really we lock globally when we reach 100% of the space and we freeup 
to 80% of the space and we spread the overhead to other threads based on who 
ever has the item partition lock. It won't be hard to make this part of the 
queue thread and will try it for the next release of lruc.

OK, that make sense. 20% of the cache could be many milliseconds of work if you 
are using many gigabytes of cache. That's not a great thing to foist on a 
random victim thread. If you handed tha

[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-11-04 Thread Vijay (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196492#comment-14196492
 ] 

Vijay edited comment on CASSANDRA-7438 at 11/4/14 6:15 PM:
---

{quote}
 well I think you run into another issue which is that the ring buffer doesn't 
appear to check for queue full? 
{quote}
Yeah, I thought about it - we would need to handle those cases, and that's why 
it wasn't there in the first place. It should not be too bad, though.
{quote}
I don't agree that Unsafe couldn't do the exact same thing with no on heap 
references
{quote}
Probably - since we have figured out most of the implementation details, sure 
we can, but there are always many different ways to solve the problem (it may 
not be very efficient to copy multiple bytes just to get to the next item in 
the map, etc.; the GC and CPU overhead would be higher, IMHO). For example, 
memcached uses expiration times set by the clients to remove items, which made 
the slab allocator easier for them, but that is something we removed in lruc in 
favour of just a queue.
{quote}
I also wonder if splitting the cache into several instances each with a coarse 
lock per instance wouldn't result in simpler
{quote}
The problem there is how you would invalidate the least-recently-used items: 
since they are in different partitions, you really don't know which ones to 
invalidate... there is also a load-balancing problem (when to expand the 
buckets, etc.), which brings us back to the current lock-striping solution, 
IMHO.

I can do some benchmarks if that's what we need at this point. Thanks!



was (Author: vijay2...@yahoo.com):
{quote}
 well I think you run into another issue which is that the ring buffer doesn't 
appear to check for queue full? 
{quote}
Yeah i thought about it, we need to handle those and thats why didn't have it 
in the first place. Should not be really bad though.
{quote}
I don't agree that Unsafe couldn't do the exact same thing with no on heap 
references
{quote}
Probably, since we figured most of the implementation detail sure we can but 
still there is always many different ways to solve the problem (Even though it 
will be in efficient to copy multiple bytes to get to the next items in map 
etc... GC and CPU overhead would be more IMHO). For example Memcached used 
expiration time set by the clients to remove the items which made it easier for 
them to do the slab allocator but this is something we removed it in lruc and 
just a queue.
{quote}
I also wonder if splitting the cache into several instances each with a coarse 
lock per instance wouldn't result in simpler
{quote}
The problem there is how will you invalidate the last used items, since they 
are different partitions you really don't know which ones to invalidate... 
there is also a problem of load balancing when to expand the buckets etc which 
will bring us back to the current lock stripping solutions IMHO.

I can do some benchmarks if thats exactly what we need at this point Thanks!


> Serializing Row cache alternative (Fully off heap)
> --
>
> Key: CASSANDRA-7438
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
> Environment: Linux
>Reporter: Vijay
>Assignee: Vijay
>  Labels: performance
> Fix For: 3.0
>
> Attachments: 0001-CASSANDRA-7438.patch
>
>
> Currently SerializingCache is partially off heap, keys are still stored in 
> JVM heap as BB, 
> * There is a higher GC costs for a reasonably big cache.
> * Some users have used the row cache efficiently in production for better 
> results, but this requires careful tunning.
> * Overhead in Memory for the cache entries are relatively high.
> So the proposal for this ticket is to move the LRU cache logic completely off 
> heap and use JNI to interact with cache. We might want to ensure that the new 
> implementation match the existing API's (ICache), and the implementation 
> needs to have safe memory access, low overhead in memory and less memcpy's 
> (As much as possible).
> We might also want to make this cache configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-11-04 Thread Vijay (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196492#comment-14196492
 ] 

Vijay edited comment on CASSANDRA-7438 at 11/4/14 6:16 PM:
---

{quote}
 well I think you run into another issue which is that the ring buffer doesn't 
appear to check for queue full? 
{quote}
Yeah, I thought about it - we would need to handle those cases, and that's why 
it wasn't there in the first place. It should not be too bad, though.
{quote}
I don't agree that Unsafe couldn't do the exact same thing with no on heap 
references
{quote}
Probably - since we have figured out most of the implementation details, sure 
we can, but there are always many different ways to solve the problem (it may 
not be very efficient to copy multiple bytes just to get to the next item in 
the map, etc.; the GC and CPU overhead would be higher, IMHO). For example, 
memcached uses expiration times set by the clients to remove items, which made 
the slab allocator easier for them, but that is something we removed in lruc in 
favour of just a queue.
{quote}
I also wonder if splitting the cache into several instances each with a coarse 
lock per instance wouldn't result in simpler
{quote}
The problem there is how you would invalidate the least-recently-used items: 
since they are in different partitions, you really don't know which ones to 
invalidate... there is also a load-balancing problem (when to expand the 
buckets, etc.), which brings us back to the current lock-striping solution, 
IMHO.

I can do some benchmarks if that's what we need at this point. Thanks!



was (Author: vijay2...@yahoo.com):
{quote}
 well I think you run into another issue which is that the ring buffer doesn't 
appear to check for queue full? 
{quote}
Yeah i thought about it, we need to handle those and thats why didn't have it 
in the first place. Should not be really bad though.
{quote}
I don't agree that Unsafe couldn't do the exact same thing with no on heap 
references
{quote}
Probably, since we figured most of the implementation detail sure we can but 
still there is always many different ways to solve the problem (May not be very 
efficient to copy multiple bytes to get to the next item in map etc... GC and 
CPU overhead would be more IMHO). For example Memcached used expiration time 
set by the clients to remove the items which made it easier for them to do the 
slab allocator but this is something we removed it in lruc and just a queue.
{quote}
I also wonder if splitting the cache into several instances each with a coarse 
lock per instance wouldn't result in simpler
{quote}
The problem there is how will you invalidate the last used items, since they 
are different partitions you really don't know which ones to invalidate... 
there is also a problem of load balancing when to expand the buckets etc which 
will bring us back to the current lock stripping solutions IMHO.

I can do some benchmarks if thats exactly what we need at this point Thanks!


> Serializing Row cache alternative (Fully off heap)
> --
>
> Key: CASSANDRA-7438
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
> Environment: Linux
>Reporter: Vijay
>Assignee: Vijay
>  Labels: performance
> Fix For: 3.0
>
> Attachments: 0001-CASSANDRA-7438.patch
>
>
> Currently SerializingCache is partially off heap, keys are still stored in 
> JVM heap as BB, 
> * There is a higher GC costs for a reasonably big cache.
> * Some users have used the row cache efficiently in production for better 
> results, but this requires careful tunning.
> * Overhead in Memory for the cache entries are relatively high.
> So the proposal for this ticket is to move the LRU cache logic completely off 
> heap and use JNI to interact with cache. We might want to ensure that the new 
> implementation match the existing API's (ICache), and the implementation 
> needs to have safe memory access, low overhead in memory and less memcpy's 
> (As much as possible).
> We might also want to make this cache configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-11-04 Thread Vijay (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196492#comment-14196492
 ] 

Vijay edited comment on CASSANDRA-7438 at 11/4/14 6:17 PM:
---

{quote}
 well I think you run into another issue which is that the ring buffer doesn't 
appear to check for queue full? 
{quote}
Yeah, I thought about it - we would need to handle those cases, and that's why 
it wasn't there in the first place. It should not be too bad, though.
{quote}
I don't agree that Unsafe couldn't do the exact same thing with no on heap 
references
{quote}
Probably - since we have figured out most of the implementation details, sure 
we can, but there are always many different ways to solve the problem (it may 
not be very efficient to copy multiple bytes just to get to the next item in 
the map, etc.; the GC and CPU overhead would be higher, IMHO). For example, 
memcached uses expiration times set by the clients to remove items, which made 
the slab allocator easier for them, but that is something we removed in lruc in 
favour of just a queue.
{quote}
I also wonder if splitting the cache into several instances each with a coarse 
lock per instance wouldn't result in simpler
{quote}
The problem there is how you would invalidate the least-recently-used items: 
since they are in different partitions, you really don't know which ones to 
invalidate... there is also a load-balancing problem (when to expand the 
buckets, etc.), which brings us back to the current lock-striping solution, 
IMHO.

I can do some benchmarks if that's what we need at this point. Thanks!



was (Author: vijay2...@yahoo.com):
{quote}
 well I think you run into another issue which is that the ring buffer doesn't 
appear to check for queue full? 
{quote}
Yeah i thought about it, we need to handle those and thats why didn't have it 
in the first place. Should not be really bad though.
{quote}
I don't agree that Unsafe couldn't do the exact same thing with no on heap 
references
{quote}
Probably, since we figured most of the implementation detail sure we can but 
still there is always many different ways to solve the problem (May not be very 
efficient to copy multiple bytes to get to the next item in map etc... GC and 
CPU overhead would be more IMHO). For example Memcached used expiration time 
set by the clients to remove the items which made it easier for them to do the 
slab allocator but this is something we removed it in lruc and just a queue.
{quote}
I also wonder if splitting the cache into several instances each with a coarse 
lock per instance wouldn't result in simpler
{quote}
The problem there is how will you invalidate the least used items, since they 
are different partitions you really don't know which ones to invalidate... 
there is also a problem of load balancing when to expand the buckets etc which 
will bring us back to the current lock stripping solutions IMHO.

I can do some benchmarks if thats exactly what we need at this point Thanks!


> Serializing Row cache alternative (Fully off heap)
> --
>
> Key: CASSANDRA-7438
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
> Environment: Linux
>Reporter: Vijay
>Assignee: Vijay
>  Labels: performance
> Fix For: 3.0
>
> Attachments: 0001-CASSANDRA-7438.patch
>
>
> Currently SerializingCache is partially off heap, keys are still stored in 
> JVM heap as BB, 
> * There is a higher GC costs for a reasonably big cache.
> * Some users have used the row cache efficiently in production for better 
> results, but this requires careful tunning.
> * Overhead in Memory for the cache entries are relatively high.
> So the proposal for this ticket is to move the LRU cache logic completely off 
> heap and use JNI to interact with cache. We might want to ensure that the new 
> implementation match the existing API's (ICache), and the implementation 
> needs to have safe memory access, low overhead in memory and less memcpy's 
> (As much as possible).
> We might also want to make this cache configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)