[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046976#comment-14046976 ] Vijay edited comment on CASSANDRA-7438 at 6/28/14 9:50 PM:
---
Pushed a new project to GitHub, https://github.com/Vijay2win/lruc, including benchmark utils. I can move the code into the Cassandra repo or use it as a library in Cassandra (working on it).

> Serializing Row cache alternative (Fully off heap)
> --
>
> Key: CASSANDRA-7438
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Environment: Linux
> Reporter: Vijay
> Assignee: Vijay
> Labels: performance
> Fix For: 3.0
>
> Currently SerializingCache is only partially off heap; keys are still stored in
> the JVM heap as ByteBuffers.
> * GC costs are higher for a reasonably big cache.
> * Some users have used the row cache efficiently in production for better
> results, but this requires careful tuning.
> * Memory overhead per cache entry is relatively high.
> So the proposal for this ticket is to move the LRU cache logic completely off
> heap and use JNI to interact with the cache. We might want to ensure that the new
> implementation matches the existing APIs (ICache), and the implementation
> needs to have safe memory access, low memory overhead, and as few memcpys as
> possible.
> We might also want to make this cache configurable.

-- This message was sent by Atlassian JIRA (v6.2#6252)
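The ticket proposes keeping the existing ICache-shaped API while moving storage off heap. As a rough illustration of that split (not the patch's actual code), here is a minimal sketch in plain Java that parks serialized values in direct ByteBuffers so the GC never traverses them; the ticket goes further and moves the keys and LRU bookkeeping off heap via JNI, and all names here are hypothetical:

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: values live in off-heap (direct) buffers, so the GC
// never scans them; only the small index map stays on heap. The real proposal
// also moves the *keys* and the LRU links off heap, which this omits.
public class OffHeapValueCache {
    private final ConcurrentHashMap<String, ByteBuffer> index = new ConcurrentHashMap<>();

    public void put(String key, byte[] value) {
        ByteBuffer buf = ByteBuffer.allocateDirect(value.length); // off-heap allocation
        buf.put(value);
        buf.flip();
        index.put(key, buf);
    }

    public byte[] get(String key) {
        ByteBuffer buf = index.get(key);
        if (buf == null) return null;
        byte[] out = new byte[buf.remaining()];
        buf.duplicate().get(out); // one memcpy back on heap for the caller
        return out;
    }
}
```

A real implementation would additionally bound the cache size and free or evict buffers; this only shows where the on-heap/off-heap boundary sits.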
[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069862#comment-14069862 ] Vijay edited comment on CASSANDRA-7438 at 7/22/14 5:49 AM:
---
Attached patch makes the off-heap/serializing cache configurable (the default is still SerializingCache).

Regarding performance: the new cache is obviously better when the JNI overhead is less than the GC overhead, but for smaller caches that can fit in the JVM heap the performance is a little lower, which is understandable (both of them outperform the page cache by a large margin). Here are the numbers.

*OffheapCacheProvider*
{panel}
Running READ with 1200 threads for 1000 iterations
ops, op/s, key/s, mean, med, .95, .99, .999, max, time, stderr
2030355, 2029531, 2029531, 3.1, 3.1, 5.4, 5.7, 61.8, 3014.5, 1.0, 0.0
2395480, 202845, 202845, 5.8, 5.4, 5.8, 20.2, 522.4, 545.9, 2.8, 0.0
2638600, 221368, 221368, 5.4, 5.3, 5.8, 16.3, 78.8, 131.5, 3.9, 0.57860
2891705, 221976, 221976, 5.4, 5.3, 5.6, 6.2, 15.2, 19.2, 5.0, 0.60478
3147747, 222527, 222527, 5.4, 5.3, 5.6, 6.1, 15.4, 18.2, 6.2, 0.58659
3394999, 221527, 221527, 5.4, 5.3, 5.6, 6.6, 15.9, 19.4, 7.3, 0.55884
3663559, 226114, 226114, 5.3, 5.2, 5.6, 15.0, 84.4, 110.7, 8.5, 0.52924
3911154, 223831, 223831, 5.4, 5.3, 5.6, 6.1, 15.6, 20.0, 9.6, 0.50018
4152946, 223246, 223246, 5.4, 5.3, 5.6, 6.1, 15.7, 18.8, 10.7, 0.47323
4403162, 228532, 228532, 5.2, 5.2, 5.6, 23.2, 107.4, 121.4, 11.8, 0.44856
4641021, 225196, 225196, 5.3, 5.2, 5.6, 5.9, 15.3, 18.4, 12.8, 0.42557
4889523, 222826, 222826, 5.4, 5.3, 5.6, 6.3, 16.2, 22.0, 13.9, 0.40476
5124891, 223203, 223203, 5.4, 5.3, 5.6, 5.8, 6.2, 14.8, 15.0, 0.38602
5375262, 221222, 221222, 5.4, 5.2, 5.6, 18.4, 94.2, 115.1, 16.1, 0.36899
5616470, 224022, 224022, 5.4, 5.3, 5.6, 5.9, 14.3, 17.8, 17.2, 0.35349
5866825, 223000, 223000, 5.4, 5.3, 5.6, 6.1, 15.5, 18.2, 18.3, 0.33882
6125601, 225757, 225757, 5.2, 5.3, 5.6, 9.6, 49.4, 72.0, 19.5, 0.32535
6348030, 192703, 192703, 6.3, 5.3, 9.3, 14.4, 77.1, 91.5, 20.6, 0.31282
6483574, 128520, 128520, 9.3, 8.4, 10.9, 19.5, 88.7, 99.0, 21.7, 0.30329
6626176, 137199, 137199, 8.7, 8.4, 10.6, 14.0, 32.7, 40.9, 22.7, 0.29771
6768401, 136860, 136860, 8.8, 8.4, 10.3, 14.1, 35.1, 40.8, 23.8, 0.29213
6911785, 138204, 138204, 8.7, 8.3, 10.2, 13.7, 34.1, 37.8, 24.8, 0.28669
7055951, 138633, 138633, 8.7, 8.3, 10.5, 32.0, 40.5, 46.9, 25.8, 0.28130
7199084, 137731, 137731, 8.7, 8.4, 10.2, 14.0, 33.4, 40.9, 26.9, 0.27623
7338032, 133201, 133201, 9.0, 8.4, 10.9, 34.0, 39.4, 43.8, 27.9, 0.27116
7480439, 137059, 137059, 8.8, 8.4, 10.2, 13.9, 35.9, 39.5, 29.0, 0.26663
7647810, 161209, 161209, 7.5, 7.8, 9.6, 13.4, 33.9, 77.9, 30.0, 0.26185
7898882, 226498, 226498, 5.3, 5.2, 5.6, 19.7, 108.5, 119.3, 31.1, 0.25629
8136305, 223840, 223840, 5.4, 5.3, 5.6, 5.9, 17.3, 23.2, 32.2, 0.24838
8372076, 223790, 223790, 5.4, 5.3, 5.6, 6.0, 15.2, 20.0, 33.2, 0.24095
8633758, 232914, 232914, 5.1, 5.2, 5.6, 17.5, 138.4, 182.0, 34.4, 0.23397
8869214, 43, 43, 5.4, 5.3, 5.6, 6.0, 15.2, 17.9, 35.4, 0.22717
9121652, 223037, 223037, 5.4, 5.3, 5.6, 5.9, 15.4, 18.8, 36.5, 0.22105
9360286, 225070, 225070, 5.3, 5.3, 5.6, 14.8, 82.7, 92.1, 37.6, 0.21524
9609676, 224089, 224089, 5.4, 5.3, 5.6, 5.8, 6.2, 14.3, 38.7, 0.20967
9848551, 222123, 222123, 5.4, 5.3, 5.6, 5.9, 24.2, 27.2, 39.8, 0.20440
1000, 229511, 229511, 5.0, 5.2, 5.8, 60.0, 74.3, 132.0, 40.5, 0.19935

Results:
real op rate : 247211
adjusted op rate stderr : 0
key rate : 247211
latency mean : 5.4
latency median : 3.5
latency 95th percentile : 5.5
latency 99th percentile : 6.1
latency 99.9th percentile : 83.4
latency max
[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072935#comment-14072935 ] Robert Stupp edited comment on CASSANDRA-7438 at 7/24/14 7:36 AM:
--
My username on GitHub is snazy.

Do you know {{org.codehaus.mojo:native-maven-plugin}}? It allows JNI compilation on almost all platforms directly from Maven and does not interfere with SWIG - I have used it on OSX, Linux, Windows, and Solaris.
[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075611#comment-14075611 ] Robert Stupp edited comment on CASSANDRA-7438 at 7/27/14 12:18 PM:
---
[~vijay2...@gmail.com] do you have a C* branch with lruc integrated? Or: what should I do to bring lruc and C* together? Is the patch up to date?

I've pushed a new branch 'native-plugin' with the changes for native-maven-plugin. It's separate from the other code and works for Linux and OSX (depending on which machine the build runs on). The Windows side is a bit more complicated - it doesn't compile, so I have to dig a bit deeper. Maybe delay the Windows port...
[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1495#comment-1495 ] Vijay edited comment on CASSANDRA-7438 at 11/23/14 3:23 AM:

Alright, the first version of the pure-Java LRUCache has been pushed:
* Basically a port from the C version. (Most of the test cases pass, and they are the same for both versions.)
* As Ariel mentioned before, we can use the disruptor for the ring buffer; the current version doesn't use it yet.
* Proactive expiry in the queue thread is not implemented yet.
* The algorithm that starts the rehash needs to be more configurable and based on capacity; will be pushing that soon.
* Overhead in the JVM heap is just the segments array, so the cache should be able to grow as much as the system can support.

https://github.com/Vijay2win/lruc/tree/master/src/main/java/com/lruc/unsafe
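The last bullet above - only the segments array remaining in the JVM heap - can be pictured with a small hypothetical sketch using {{sun.misc.Unsafe}}: on-heap slots hold raw addresses, and everything behind an address lives in native memory. This is illustrative only and far simpler than the lruc code (no hash chains, keys, or LRU links; each entry is just an 8-byte payload):

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Hypothetical illustration of "only the segments array lives on heap":
// each slot holds a raw off-heap address (or 0 if empty). The GC only ever
// sees the long[]; entry data is invisible to it.
public class SegmentTable {
    private static final Unsafe U;
    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            U = (Unsafe) f.get(null);
        } catch (Exception e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private final long[] segments; // the only on-heap overhead

    public SegmentTable(int segmentCount) {
        segments = new long[segmentCount];
    }

    public void put(int hash, long payload) {
        int slot = Math.floorMod(hash, segments.length);
        long addr = U.allocateMemory(8); // off-heap entry
        U.putLong(addr, payload);
        long old = segments[slot];
        segments[slot] = addr;
        if (old != 0L) U.freeMemory(old); // replacing: release the old entry
    }

    /** Returns the payload for the slot, or -1 if the slot is empty. */
    public long get(int hash) {
        long addr = segments[Math.floorMod(hash, segments.length)];
        return addr == 0L ? -1L : U.getLong(addr);
    }
}
```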
[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222799#comment-14222799 ] Robert Stupp edited comment on CASSANDRA-7438 at 11/24/14 8:58 AM:
---
* rehashing: growing (x2) is already implemented; shrinking (/2) shouldn't be a big issue either. The implementation only locks the currently processed partitions during rehash.
* "put" operation: fixed (it was definitely a bug); cleanup runs concurrently and triggers on an "out of memory" condition.
* block sizes: will give it a try (fixed vs. different sizes vs. variable sized (no blocks)).
* per-partition locks: already thought about it - not sure whether it's worth the additional RW-lock overhead, since partition lock time is very low during normal operation.
* metrics: some (very basic) metrics are already in it - will add some more timer metrics (configurable).

[~vijay2...@yahoo.com] can you catch {{OutOfMemoryError}} for Unsafe.allocate()? It should not go up the whole call stack as is, to prevent C* from handling it as "Java heap full".
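Robert's last point - catching {{OutOfMemoryError}} from Unsafe.allocate() at the allocation site so a native allocation failure is not mistaken for Java-heap exhaustion - could look roughly like the following sketch (class and method names are made up, not from the lruc code):

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Hypothetical wrapper: Unsafe.allocateMemory throws OutOfMemoryError when the
// *native* allocation fails, and letting that Error escape up the stack looks
// to callers like "Java heap full". Catching it at the call site turns the
// failure into a plain "no memory" result the cache can handle, e.g. by
// skipping the entry or evicting.
public class NativeAlloc {
    private static final Unsafe U;
    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            U = (Unsafe) f.get(null);
        } catch (Exception e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    /** Returns the native address, or 0 if the allocation failed. */
    public static long tryAllocate(long bytes) {
        try {
            return U.allocateMemory(bytes);
        } catch (OutOfMemoryError oom) {
            return 0L; // native memory exhausted; caller degrades gracefully
        }
    }

    public static void free(long address) {
        if (address != 0L) U.freeMemory(address);
    }
}
```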
[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225478#comment-14225478 ] Ariel Weisberg edited comment on CASSANDRA-7438 at 11/26/14 12:29 AM:
--
bq. if we don't like the constant overhead of the cache in heap and If you are talking about CAS which we already do for ref counting, as mentioned before we need an alternative strategy for global locks for rebalance if we go with lock less strategy.

Just take what you have and do it off heap. You don't need to change anything about how locking is done; just put the segments off heap, so each segment would be a 4-byte lock field and an 8-byte pointer to the first entry. I am not clear on the alignment requirements for 4- or 8-byte CAS.

bq. Until you complete a rehash you don't know if you need to hash again or not... Am i missing something?

https://github.com/Vijay2win/lruc/blob/master/src/main/java/com/lruc/unsafe/UnsafeConcurrentMap.java#L38

The check on line 38 races with the assignment on line 39. N threads could do the check and think a rehash is necessary. Each would submit a rehash task, and the table size would be doubled N times instead of once.
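The race Ariel describes has a standard remedy: make the check-and-schedule step atomic, so that of N threads concurrently observing size > threshold, exactly one submits the rehash. A hypothetical sketch (names not from the lruc code):

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the implied fix: guard the "schedule a rehash" decision with a
// CAS so that when many threads see size > threshold at once, exactly one
// wins and the table is doubled once, not N times.
public class RehashTrigger {
    private final AtomicBoolean rehashScheduled = new AtomicBoolean(false);
    public final AtomicInteger rehashesSubmitted = new AtomicInteger(0);

    public void maybeRehash(int size, int threshold) {
        // compareAndSet guarantees a single winner between the size check
        // and the submission of the rehash task.
        if (size > threshold && rehashScheduled.compareAndSet(false, true)) {
            rehashesSubmitted.incrementAndGet(); // only the CAS winner gets here
            // ... submit the task that doubles the table and migrates entries;
            // the flag would be cleared only after the new table (and its new
            // threshold) are published, so late checkers see size <= threshold.
        }
    }
}
```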
[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227248#comment-14227248 ] Jonathan Ellis edited comment on CASSANDRA-7438 at 11/27/14 4:26 AM:
-
bq. The row cache can contain very large rows [partitions] AFAIK

Well, it *can*, but it's almost always a bad idea - not something we should optimize for. (http://www.datastax.com/dev/blog/row-caching-in-cassandra-2-1)

bq. Does the storage engine always materialize entire rows [partitions] into memory for every query?

Only when it's pulling them from the off-heap cache. (It deserializes onto the heap to filter out the requested results.)
[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228397#comment-14228397 ] Benedict edited comment on CASSANDRA-7438 at 11/28/14 5:06 PM:
---
I suspect segmenting the table at a coarser granularity (a la j.u.c.ConcurrentHashMap), so that each segment is maintained with mutual exclusivity, would achieve better percentiles in both cases by keeping the maximum resize cost down. We could even settle for a separate LRU queue per segment, to keep the complexity of this code down significantly - it is unlikely that one global LRU queue is significantly more accurate at predicting reuse than ~128 of them. It would also make it much easier to improve the replacement strategy beyond LRU, which would likely yield a bigger win for performance than any potential loss from reduced concurrency.

The critical section for reads could be kept small enough, by performing the deserialization outside of it, that contention would be very unlikely with the current state of C*. There's a good chance this would yield a net positive performance impact, reducing the cost per access without measurably increasing the cost due to contention (because contention would be infrequent).
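The coarse segmentation described above can be sketched on heap in a few lines: hash each key to one of N independently locked segments, each with its own small LRU (here an access-ordered LinkedHashMap). This illustrates only the locking/LRU split, not an off-heap implementation, and capacity handling is simplified to a fixed per-segment entry count:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of coarse segmentation: a resize or eviction only ever stalls one
// segment, and each segment keeps its own LRU order instead of one global
// queue. Names and sizing policy are illustrative.
public class SegmentedLruCache<K, V> {
    private final LinkedHashMap<K, V>[] segments;

    @SuppressWarnings("unchecked")
    public SegmentedLruCache(int segmentCount, int entriesPerSegment) {
        segments = new LinkedHashMap[segmentCount];
        for (int i = 0; i < segmentCount; i++)
            segments[i] = new LinkedHashMap<K, V>(16, 0.75f, true) { // true = access order (LRU)
                @Override
                protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                    return size() > entriesPerSegment; // evict LRU entry of this segment only
                }
            };
    }

    private LinkedHashMap<K, V> segmentFor(K key) {
        return segments[Math.floorMod(key.hashCode(), segments.length)];
    }

    public void put(K key, V value) {
        LinkedHashMap<K, V> s = segmentFor(key);
        synchronized (s) { s.put(key, value); } // per-segment lock only
    }

    public V get(K key) {
        LinkedHashMap<K, V> s = segmentFor(key);
        synchronized (s) { return s.get(key); }
    }
}
```

With ~128 segments, two reads contend only when their keys land in the same segment, which keeps the critical sections rarely competed for, as the comment argues.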
[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228523#comment-14228523 ] Vijay edited comment on CASSANDRA-7438 at 11/28/14 9:47 PM:
{quote}I would break out the performance comparison with and without warming up the cache so we know how it performs when you aren't measuring the resize pauses.{quote}
Yep - in steady state it is similar for get, and I have verified that the latency is due to rehash. Better benchmarks on big machines will be done on Monday.

Unfortunately -1 on partitions; it will be a lot more complex and will be hard for users to understand. If we have to expand the partitions, we have to figure out a better consistent hashing algorithm - "Cassandra within Cassandra", maybe. Moreover, we will end up keeping the current code as-is to move the maps and queues off heap.

Sorry, I don't understand the argument about code complexity. If we are talking about code complexity, the unsafe code is 1000 lines including the license headers :) The current contention topic is whether to use CAS for locks, which is showing higher CPU cost - and I agree with Pavel that the lock cost also shows up in the numbers.
[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228563#comment-14228563 ] Benedict edited comment on CASSANDRA-7438 at 11/29/14 12:23 AM:
[~aweisberg]: In my experience segments tend to be imperfectly distributed, so whilst there is bunching of resizes simply because they take so long, with real work going on at the same time they should be a _little_ spread out. Though with murmur3 the distribution may be significantly more uniform than in my prior experiments. Either way, they're performed in parallel (without coordination) if they coincide, and are each a fraction of the size, so it's still an improvement.

[~vijay2...@yahoo.com]: When I talk about complexity, I mean the difficulties of concurrent programming, magnified without the normal tools. For instance, there are the following concerns:
* We have a spin-lock - admittedly one that should _generally_ be uncontended, but on a grow or a small map this is certainly not the case, which could result in really problematic behaviour. Pure spin locks should not be used outside of the kernel.
* The queue is maintained by a separate thread that requires signalling if it isn't currently performing work - in a real C* instance, where the cost of linking the queue item is a fraction of the other work done to service a request, this means we are likely to incur a costly unpark() for a majority of operations.
* Reads can interleave with put/replace/remove and abort the removal of an item from the queue, resulting in a memory leak.
* We perform the grow on a separate thread, but prevent all reader _or_ writer threads from making progress by taking the locks for all buckets immediately.
* Freeing of oldSegments is still dangerous; it's just probabilistically less likely to happen.
* During a grow, we can lose puts because we unlock the old segments, so with the right (again, unlikely) interleaving of events a writer can think the old table is still valid.
* When growing, we only double the size of the backing table; since grows happen in the background, the updater can get ahead, meaning we remain behind and multiply the constant-factor overheads, collisions and contention until total size tails off.

These are only the obvious problems that spring to mind from 15m perusing the code; I'm sure there are others. This kind of stuff is really hard, and the approach I'm suggesting is comparatively a doddle to get right, and is likely faster to boot.

I'm not sure I understand your concern that segmentation creates complexity with the hashing... I'm proposing the exact method used by CHM. We have an excellent hash algorithm to distribute the data over the segments: murmurhash3 - although we need to be careful not to use the bits that don't have the correct entropy for selecting a segment. Think of it as simply implementing an off-heap LinkedHashMap, wrapping it in a lock, and having an array of them. The user doesn't need to know anything about this.
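The spin-lock in the first concern above refers to a pattern like the following - a minimal CAS-based spin lock, shown here purely to illustrate why it worries Benedict (this is an assumed sketch, not code from the patch): while the lock is held across a long operation such as a grow, every waiter burns a CPU core instead of parking.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative pure spin lock: acquisition is a CAS loop with no fallback
// to parking, so contended waiters busy-wait at 100% CPU.
final class SpinLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    void lock() {
        // busy-wait until the CAS from false -> true succeeds
        while (!locked.compareAndSet(false, true)) {
            Thread.onSpinWait(); // CPU hint (Java 9+); waiters still never sleep
        }
    }

    void unlock() {
        locked.set(false);
    }

    boolean isLocked() {
        return locked.get();
    }
}
```

A `java.util.concurrent` lock avoids this by spinning briefly and then parking the thread, at the cost of the unpark() overhead mentioned in the second bullet.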
[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228693#comment-14228693 ] Benedict edited comment on CASSANDRA-7438 at 11/29/14 9:40 AM:
---
Good point! But invert those two statements and the behaviour is still broken:

B: 154: map.get()
A: 187: map.remove()
A: 191: queue.deleteFromQueue()
B: 158: queue.addToQueue()
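The four-step interleaving above can be replayed deterministically against stand-in structures to see the leak it produces (a plain map and deque as illustrative substitutes for the off-heap map and LRU queue; the line numbers refer to the patch under review):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Replay of the race: reader B's queue touch re-links an entry that
// writer A has already removed, leaving an orphaned queue node.
class QueueLeakReplay {
    static boolean leaks() {
        Map<String, String> map = new HashMap<>();
        Deque<String> lruQueue = new ArrayDeque<>();
        map.put("k", "v");
        lruQueue.add("k");

        map.get("k");            // B: 154 - read hit; B will touch the LRU queue
        map.remove("k");         // A: 187 - concurrent remove wins the race
        lruQueue.remove("k");    // A: 191 - A unlinks the entry from the queue
        lruQueue.add("k");       // B: 158 - B re-links it, unaware of the removal

        // the queue now holds an entry the map no longer owns: an orphaned
        // node whose backing memory is never reclaimed - the leak
        return !map.containsKey("k") && lruQueue.contains("k");
    }
}
```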
[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229284#comment-14229284 ] Robert Stupp edited comment on CASSANDRA-7438 at 12/1/14 11:00 AM:
---
Have pushed the latest changes of OHC to https://github.com/snazy/ohc. It has been nearly completely rewritten.

Architecture (in brief):
* OHC consists of multiple segments (default: 2 x #CPUs). Fewer segments lead to more contention; more segments give no measurable improvement.
* Each segment consists of an off-heap hash map (defaults: table-size=8192, load-factor=.75). The hash table requires 8 bytes per bucket.
* Hash entries in a bucket are organized in a single-linked list.
* The LRU replacement policy is built in via its own double-linked list.
* Critical sections that mutually lock a segment are pretty short (code + CPU) - just a 'synchronized' keyword, no StampedLock/ReentrantLock.
* Capacity for the cache is configured globally and managed "locally" in each segment.
* Eviction (or "replacement" or "cleanup") is triggered when free capacity goes below a trigger value, and cleans up to a target free capacity.
* Uses murmur hash on the serialized key. The most significant bits are used to find the segment, the least significant bits for the segment's hash map.

Non-production-relevant stuff:
* Allows starting off-heap access in "debug" mode, which checks for accesses outside the allocated region and produces exceptions instead of SIGSEGV or jemalloc errors.
* ohc-benchmark updated to reflect the changes.

About the replacement policy: currently LRU is built in - but I'm not really sold on LRU as is. Alternatives could be:
* timestamp (not sold on this either - basically the same as LRU)
* LIRS (https://en.wikipedia.org/wiki/LIRS_caching_algorithm) - big overhead (space)
* 2Q (counts accesses, divides the counter regularly)
* LRU+random (50/50) (may give the same result as LIRS, but without LIRS' overhead)

But replacing LRU with something else is out of scope for this ticket and should be done with real workloads in C* - although the last one is "just" an additional config parameter.

IMO we should add a per-table option that configures whether the row cache receives data on reads+writes or just on reads. That might prevent garbage in the cache caused by write-heavy tables.

{{Unsafe.allocateMemory()}} gives about a 5-10% performance improvement compared to jemalloc. The reason might be the JNA library, which has some synchronized blocks in it.

IMO OHC is ready to be merged into the C* code base.

Edit3: (sorry for the JIRA noise) - the bucket linked list is only a single-linked list; the LRU linked list needs to be doubly linked.
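The hash-bit split described above - most significant bits of the murmur hash select the segment, least significant bits select the bucket - can be sketched as follows. The constants are illustrative (OHC's actual defaults and masking may differ):

```java
// Illustrative dispatch from a 64-bit hash to (segment, bucket),
// assuming power-of-two segment and table counts.
final class HashDispatch {
    static final int SEGMENT_COUNT = 32;    // e.g. 2 x #CPUs
    static final int TABLE_SIZE   = 8192;  // per-segment bucket count

    static int segmentIndex(long hash) {
        // top log2(SEGMENT_COUNT) bits select the segment
        return (int) (hash >>> (64 - Integer.numberOfTrailingZeros(SEGMENT_COUNT)));
    }

    static int bucketIndex(long hash) {
        // low bits select the bucket within the segment's table
        return (int) (hash & (TABLE_SIZE - 1));
    }
}
```

Using disjoint bit ranges for the two decisions matters: if the same low bits chose both segment and bucket, every segment would only ever populate a fraction of its table.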
[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230675#comment-14230675 ] Ariel Weisberg edited comment on CASSANDRA-7438 at 12/1/14 11:46 PM:
Looks pretty nice. Suggestions:
* Push the stats into the segments and gather them the way you do free capacity and cleanup count. You can drop the volatile (technically you will have to synchronize on read). Inside each OffHeapMap, put the stats members (and anything mutable) as the first declared fields. In practice this can put them on the same cache line as the lock field in the object header. It will also be just one flush at the end of the critical section. Stats collection should be free, so there is no reason not to leave it on all the time.
* I am not sure batch cleanup makes sense. When inserting an item into the cache would blow the size requirement, I would just evict elements until inserting it wouldn't. Is there a specific efficiency you think you are going to get from doing it in batches?
* Cache is the wrong API to use since it doesn't allow lazy deserialization and zero copy. Since entries are refcounted there is no need to make a copy. Might be something to save for later, since everything upstream expects a POJO of some sort.
* The key buffer might be worth a thread-local sized to a high watermark.

Do we have a decent way to do line-level code review? I can't leave comments on github unless there is a pull request.

Line-level stuff:
* Don't catch exceptions and handle them inside the map. Let them all propagate to the caller and use try/finally to do cleanup. I know you have to wrap and rethrow some things due to checked exceptions, but avoid it where possible.
* Compare-key compares 8 bytes at a time; how does it handle trailing bytes and alignment?
* Agrona has an Unsafe ByteBuffer implementation that looks like it makes a little better use of various intrinsics than AbstractDataOutput. Does some other nifty stuff as well: https://github.com/real-logic/Agrona/blob/master/src/main/java/uk/co/real_logic/agrona/concurrent/UnsafeBuffer.java
* In OffHeapMap.touch, lines 439 and 453 are not covered by tests. Coverage looks a little weird in that a lot of the cases are always hit but some don't touch both branches. If lruTail == hashEntryAddr, maybe assert next is null.
* Rename the mutating OffHeapMap lruNext and lruPrev to reflect that they mutate. In general, rename mutating methods to reflect that they do, such as the two versions of first.
* I don't see why the cache can't use CPU endianness, since the key/value are just copied.
* Did you get the UTF-encoded string stuff from somewhere? I see something similar in the JDK; can you get that via inheritance?
* HashEntryInput and AbstractDataOutput are low on the coverage scale and have no tests for some pretty gnarly UTF-8 stuff.
* Continuing on that theme, there is a lot of unused code to satisfy the interfaces being implemented; it would be nice to avoid that.
* By hashing the key yourself you prevent caching the hash code in the POJO. Maybe hashes should be 32 bits and provided by the POJO?
* If an allocation fails, maybe throw OutOfMemoryError with a message.
* If an entry is too large, maybe return an error of some sort? Seems like the caller should decide whether not caching is OK.
* In put, why catch VirtualMachineError and not Error? Seems like it wants a finally, and it shouldn't throw checked exceptions.
* If a key serializer is necessary, throw in the constructor and remove the other checks.
* Hot-N could use a more thorough test?
* In practice, how is hot-N used in C*? When people save the cache to disk, do they save the entire cache?
* In the value-loading case, I think there is some subtlety to the concurrency of invocations of the loader, in that it doesn't call it on all of them in a race. It might be a minor change in behavior compared to Guava.
* Maybe do the value-loading timing in nanoseconds? Performance is the same but precision is better.
* OffHeapMap.Table.removeLink(long,long) has no test coverage of the second branch that walks a bucket to find the previous entry.
* I don't think storage for 16 million keys is enough. At 128 bytes per entry that is only 2 gigabytes. You would have to run a lot of segments, which is probably fine, but that presents a configuration issue. Maybe allow more than 24 bits of buckets in each segment?
* SegmentedCacheImpl contains duplicate code for dereferencing and still has to delegate part of the work to the OffHeapMap. Maybe keep it all in OffHeapMap?
* Unit-test-wise there are some things not tested: the value-loader interface, various things like putAll or invalidateAll.
* Release is not synchronized. Release should null pointers out so you get a good clean segfault. Close should maybe lock and close one segment at a time and invalidate as part of that.
[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230675#comment-14230675 ] Ariel Weisberg edited comment on CASSANDRA-7438 at 12/1/14 11:48 PM:

Looks pretty nice. Suggestions:
* Push the stats into the segments and gather them the way you do free capacity and cleanup count. You can drop the volatile (technically you will have to synchronize on read). Inside each OffHeapMap, put the stats members (and anything mutable) as the first declared fields. In practice this can put them on the same cache line as the lock field in the object header, and it will also be just one flush at the end of the critical section. Stats collection should be free, so there is no reason not to leave it on all the time.
* I am not sure batch cleanup makes sense. When inserting an item into the cache would blow the size requirement, I would just evict elements until inserting it wouldn't. Is there a specific efficiency you think you are going to get from doing it in batches?
* Cache is the wrong API to use since it doesn't allow lazy deserialization and zero copy. Since entries are refcounted there is no need to make a copy. Might be something to save for later since everything upstream expects a POJO of some sort.
* The key buffer might be worth a thread local sized to a high watermark.

Do we have a decent way to do line-level code review? I can't leave comments on github unless there is a pull request.

Line-level stuff:
* Don't catch exceptions and handle them inside the map. Let them all propagate to the caller and use try/finally to do cleanup. I know you have to wrap and rethrow some things due to checked exceptions, but avoid it where possible.
* Compare key compares 8 bytes at a time; how does it handle trailing bytes and alignment?
* Agrona has an Unsafe ByteBuffer implementation that looks like it makes a little better use of various intrinsics than AbstractDataOutput. Does some other nifty stuff as well. https://github.com/real-logic/Agrona/blob/master/src/main/java/uk/co/real_logic/agrona/concurrent/UnsafeBuffer.java
* In OffHeapMap.touch, lines 439 and 453 are not covered by tests. Coverage looks a little weird in that a lot of the cases are always hit but some don't touch both branches. If lruTail == hashEntryAddr, maybe assert next is null.
* Rename the mutating OffHeapMap lruNext and lruPrev to reflect that they mutate. In general, rename mutating methods to reflect that they mutate, such as the two versions of first.
* I don't see why the cache can't use CPU endianness since the key/value are just copied.
* Did you get the UTF-encoded string stuff from somewhere? I see something similar in the JDK; can you get that via inheritance?
* HashEntryInput and AbstractDataOutput are low on the coverage scale and have no tests for some pretty gnarly UTF-8 stuff.
* Continuing on that theme, there is a lot of unused code to satisfy the interfaces being implemented; it would be nice to avoid that.
* By hashing the key yourself you prevent caching the hash code in the POJO. Maybe hashes should be 32 bits and provided by the POJO?
* If an allocation fails, maybe throw OutOfMemoryError with a message.
* If an entry is too large, maybe return an error of some sort? Seems like the caller should decide whether not caching is OK.
* In put, why catch VirtualMachineError and not Error? Seems like it wants a finally, and it shouldn't throw checked exceptions.
* If a key serializer is necessary, throw in the constructor and remove the other checks.
* Hot N could use a more thorough test?
* In practice, how is hot N used in C*? When people save the cache to disk, do they save the entire cache? I am a little concerned about materializing the full list on heap. It's a lot of contiguous memory if you aren't careful.
* In the value loading case, I think there is some subtlety to the concurrency of invocations of the loader, in that it doesn't call the loader on all of them in a race. It might be a minor change in behavior compared to Guava.
* Maybe do the value loading timing in nanoseconds? Performance is the same but precision is better.
* OffHeapMap.Table.removeLink(long,long) has no test coverage of the second branch that walks a bucket to find the previous entry.
* I don't think storage for 16 million keys is enough? At 128 bytes per entry that is only 2 gigabytes. You would have to run a lot of segments, which is probably fine, but that presents a configuration issue. Maybe allow more than 24 bits of buckets in each segment?
* SegmentedCacheImpl contains duplicate code for dereferencing and still has to delegate part of the work to the OffHeapMap. Maybe keep it all in OffHeapMap?
* Unit-test-wise there are some things not tested: the value loader interface, and various things like putAll or invalidateAll.
* Release is not synchronized. Release should null pointers out so you get a good clean segfault. Close should maybe lock and close one segment at a time and invalidate as part of that.
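The "evict until it fits" suggestion above can be sketched on-heap with a LinkedHashMap in access order as a stand-in for the off-heap LRU (names and sizes here are illustrative, not the patch's actual code):

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of "evict until it fits" instead of a batch cleanup pass:
// an insert that would exceed capacity evicts LRU entries one at a
// time until the new entry fits.
class EvictOnInsert {
    private final LinkedHashMap<String, byte[]> lru = new LinkedHashMap<>(16, 0.75f, true);
    private final long capacity;
    private long used;

    EvictOnInsert(long capacity) { this.capacity = capacity; }

    void put(String key, byte[] value) {
        byte[] old = lru.remove(key);
        if (old != null) used -= old.length;
        // Evict least-recently-used entries until the new entry fits.
        // (A real cache would reject entries larger than the whole capacity.)
        Iterator<Map.Entry<String, byte[]>> it = lru.entrySet().iterator();
        while (used + value.length > capacity && it.hasNext()) {
            used -= it.next().getValue().length;
            it.remove();
        }
        lru.put(key, value);
        used += value.length;
    }

    boolean contains(String key) { return lru.containsKey(key); }
    long used() { return used; }
}
```

The third constructor argument (`accessOrder = true`) makes iteration start at the least recently accessed entry, which is what the eviction loop relies on.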
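The 8-bytes-at-a-time comparison question above can be illustrated with an on-heap sketch (hypothetical code, not the patch's actual compare): the bulk of the key is compared long-by-long and the trailing 1..7 bytes byte-wise. An off-heap version reading via Unsafe.getLong(addr + i) would additionally have to consider platforms that fault or slow down on unaligned 8-byte loads.

```java
import java.nio.ByteBuffer;

// Compare two keys 8 bytes at a time, then byte-wise for the remainder.
class KeyCompare {
    static boolean equalKeys(byte[] a, byte[] b) {
        if (a.length != b.length) return false;
        ByteBuffer ba = ByteBuffer.wrap(a);
        ByteBuffer bb = ByteBuffer.wrap(b);
        int i = 0;
        for (; i + 8 <= a.length; i += 8)      // 8 bytes at a time
            if (ba.getLong(i) != bb.getLong(i)) return false;
        for (; i < a.length; i++)              // trailing 1..7 bytes
            if (a[i] != b[i]) return false;
        return true;
    }
}
```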
[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231327#comment-14231327 ] Vijay edited comment on CASSANDRA-7438 at 12/2/14 11:33 AM:

[~snazy] I was trying to compare the OHC and found a few major bugs. There is a correctness issue in the hashing algorithm, I think. Get returns a lot of errors, and it looks like there are some memory leaks too.

> Serializing Row cache alternative (Fully off heap)
> --------------------------------------------------
> Key: CASSANDRA-7438
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Environment: Linux
> Reporter: Vijay
> Assignee: Vijay
> Labels: performance
> Fix For: 3.0
> Attachments: 0001-CASSANDRA-7438.patch, tests.zip
>
> Currently SerializingCache is partially off heap; keys are still stored in the JVM heap as BB.
> * There is a higher GC cost for a reasonably big cache.
> * Some users have used the row cache efficiently in production for better results, but this requires careful tuning.
> * Overhead in memory for the cache entries is relatively high.
> So the proposal for this ticket is to move the LRU cache logic completely off heap and use JNI to interact with the cache. We might want to ensure that the new implementation matches the existing APIs (ICache), and the implementation needs to have safe memory access, low overhead in memory and fewer memcpys (as much as possible).
> We might also want to make this cache configurable.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231878#comment-14231878 ] Vijay edited comment on CASSANDRA-7438 at 12/2/14 8:31 PM:

EDIT: Here is the explanation. Run the benchmark with the following options (lruc benchmark):
{code}java -Djava.library.path=/usr/local/lib/ -jar ~/lrucTest.jar -t 30 -s 6147483648 -c ohc{code}
And you will see something like this (errors == not found in the cache, even though all the items you need are in the cache):
{code}
Memory consumed: 3 GB / 5 GB or 427170 / 6147483648, size 4980, queued (LRU q size) 0 VM total:2 GB VM free:2 GB
Get Operation (micros) time_taken, count, mean, median, 99thPercentile, 999thPercentile, error
4734724, 166, 2.42, 1.93, 8.58, 24.74, 166
4804375, 166, 2.40, 1.92, 4.56, 106.23, 166
4805858, 166, 2.45, 1.95, 3.94, 11.76, 166
4842886, 166, 2.40, 1.92, 7.46, 26.73, 166
{code}
You really need test cases :) Anyway, I am going to stop working on this ticket now; let me know if someone wants any other info.
[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234696#comment-14234696 ] Robert Stupp edited comment on CASSANDRA-7438 at 12/4/14 10:20 PM:

Just pushed some OHC additions to github:
* key-iterator (used by the CacheService class to invalidate column families)
* (de)serialization of cache content to disk using direct I/O from off-heap. This means the row cache content does not need to go through the heap for serialization and deserialization. Compression should also be possible off-heap using the static methods in the Snappy class, since these expect direct buffers, so there's nearly no heap pressure for that either.

Background: the implementation basically "lies" the address and length of the cache entry into the DirectByteBuffer class so that FileChannel is able to read into it/write from it.
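The direct-I/O idea can be sketched as follows: FileChannel reads and writes direct ByteBuffers without copying their content onto the Java heap. OHC itself wraps the raw off-heap entry address/length in a DirectByteBuffer; for a runnable demo this sketch uses allocateDirect, and the helper names are assumptions, not OHC's API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Serialize/deserialize a direct buffer straight to/from disk via FileChannel.
class DirectDump {
    static void dump(ByteBuffer direct, Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            while (direct.hasRemaining())
                ch.write(direct);                 // content never surfaces on the heap
        }
    }

    static ByteBuffer load(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocateDirect((int) ch.size());
            while (buf.hasRemaining() && ch.read(buf) >= 0)
                ;                                 // fill the direct buffer
            buf.flip();
            return buf;
        }
    }

    // Convenience round trip: write one int through a direct buffer, read it back.
    static int roundTrip(int value) {
        try {
            Path file = Files.createTempFile("ohc-dump", ".bin");
            ByteBuffer out = ByteBuffer.allocateDirect(Integer.BYTES);
            out.putInt(value);
            out.flip();
            dump(out, file);
            int read = load(file).getInt();
            Files.deleteIfExists(file);
            return read;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```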
[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257721#comment-14257721 ] Robert Stupp edited comment on CASSANDRA-7438 at 1/6/15 10:08 AM:

I had the opportunity to test OHC on a big machine. First: it works - very happy about that :)

Some things I want to note:
* a high number of segments does not have any really measurable influence (the default of 2x the number of cores is fine)
* throughput heavily depends on serialization (hash entry size) - Java 8 gave about 10% to 15% improvement in some tests (either on {{Unsafe.copyMemory}} or something related like the JNI barrier)
* the number of entries per bucket stays pretty low with the default load factor of .75 - the vast majority have 0 or 1 entries, some 2 or 3, and a few up to 8

Issue (not solvable yet): It works great for hash entries up to approx. 64kB with good to great throughput. Above that barrier it first works fine, but after some time the system spends a huge amount of CPU time (~95%) in {{malloc()}} / {{free()}} (with jemalloc; Unsafe.allocate is not worth discussing at all on Linux). I tried to add a "memory buffer cache" that caches freed hash entries for reuse, but it turned out that it would be too complex in the end if done right. The current implementation is still in the code, but must be explicitly enabled with a system property. Workloads with small entries and a high number of threads easily trigger the Linux OOM protection (which kills the process). Please note that it does work with large hash entries - but throughput drops dramatically to just a few thousand writes per second.

Some numbers (value sizes have a gaussian distribution). I had to do these tests in a hurry because I had to give back the machine. Code used during these tests is tagged as {{0.1-SNAP-Bench}} in git. Throughput is limited by {{malloc()}} / {{free()}} and most tests only used 50% of available CPU capacity (on _c3.8xlarge_ - 32 cores, Intel Xeon E5-2680v2 @2.8GHz, 64GB).
* -1k..200k value size, 32 threads, 1M keys, 90% read ratio, 32GB: 22k writes/sec, 200k reads/sec, ~8k evictions/sec, write: 8ms (99perc), read: 3ms (99perc)-
* -1k..64k value size, 500 threads, 1M keys, 90% read ratio, 32GB: 55k writes/sec, 499k reads/sec, ~2k evictions/sec, write: .1ms (99perc), read: .03ms (99perc)-
* -1k..64k value size, 500 threads, 1M keys, 50% read ratio, 32GB: 195k writes/sec, 195k reads/sec, ~9k evictions/sec, write: .2ms (99perc), read: .1ms (99perc)-
* -1k..64k value size, 500 threads, 1M keys, 10% read ratio, 32GB: 185k writes/sec, 20k reads/sec, ~7k evictions/sec, write: 4ms (99perc), read: .07ms (99perc)-
* -1k..16k value size, 500 threads, 5M keys, 90% read ratio, 32GB: 110k writes/sec, 1M reads/sec, 30k evictions/sec, write: .04ms (99perc), read: .01ms (99perc)-
* -1k..16k value size, 500 threads, 5M keys, 50% read ratio, 32GB: 420k writes/sec, 420k reads/sec, 125k evictions/sec, write: .06ms (99perc), read: .01ms (99perc)-
* -1k..16k value size, 500 threads, 5M keys, 10% read ratio, 32GB: 435k writes/sec, 48k reads/sec, 130k evictions/sec, write: .06ms (99perc), read: .01ms (99perc)-
* -1k..4k value size, 500 threads, 20M keys, 90% read ratio, 32GB: 140k writes/sec, 1.25M reads/sec, 50k evictions/sec, write: .02ms (99perc), read: .005ms (99perc)-
* -1k..4k value size, 500 threads, 20M keys, 50% read ratio, 32GB: 530k writes/sec, 530k reads/sec, 220k evictions/sec, write: .04ms (99perc), read: .005ms (99perc)-
* -1k..4k value size, 500 threads, 20M keys, 10% read ratio, 32GB: 665k writes/sec, 74k reads/sec, 250k evictions/sec, write: .04ms (99perc), read: .005ms (99perc)-

Command line to execute the benchmark:
{code}
java -jar ohc-benchmark/target/ohc-benchmark-0.1-SNAPSHOT.jar -rkd 'uniform(1..2000)' -wkd 'uniform(1..2000)' -vs 'gaussian(1024..4096,2)' -r .1 -cap 320 -d 86400 -t 500 -dr 8

-r = read rate
-d = duration
-t = # of threads
-dr = # of driver threads that feed the worker threads
-rkd = read key distribution
-wkd = write key distribution
-vs = value size
-cap = capacity
{code}

Sample bucket histogram from the 20M test:
{code}
[0..0]: 8118604
[1..1]: 5892298
[2..2]: 2138308
[3..3]: 518089
[4..4]: 94441
[5..5]: 13672
[6..6]: 1599
[7..7]: 189
[8..9]: 16
{code}

After running into that memory management issue with varying allocation sizes of a few kB to several MB, I think it's still worth working on our own off-heap memory management - maybe some block-based approach (fixed or variable). But that's out of the scope of this ticket.

EDIT: The problem with high system-CPU usage only occurs on systems with multiple CPU sockets. As a cross check with the second CPU socket disabled, calling the benchmark with {{taskset 0x3ff java -jar ...}} does not show 95% system CPU usage.

EDIT2: Marked the benchmark values as invalid.
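The bucket histogram is consistent with what a uniform hash predicts: with n entries in m buckets, per-bucket occupancy is approximately Poisson(lambda = n/m). The histogram counts sum to exactly 16,777,216 buckets (2^24), and the empty-bucket fraction implies lambda around 0.73, matching the observed "mostly 0 or 1 entries" shape. Illustrative math only, not part of OHC:

```java
// Check bucket occupancy against a Poisson model.
class BucketPoisson {
    // P(bucket holds exactly k entries) = e^-lambda * lambda^k / k!
    static double poisson(double lambda, int k) {
        double p = Math.exp(-lambda);
        for (int i = 1; i <= k; i++)
            p *= lambda / i;
        return p;
    }

    // Estimate lambda from the observed empty-bucket fraction: p0 = e^-lambda.
    static double lambdaFromEmptyFraction(double p0) {
        return -Math.log(p0);
    }
}
```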
[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14271745#comment-14271745 ] Robert Stupp edited comment on CASSANDRA-7438 at 1/9/15 7:21 PM:

Note: OHC now has cache-loader support (https://github.com/snazy/ohc/issues/3). Could be an alternative for RowCacheSentinel. EDIT: in a C* follow-up ticket.
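Cache-loader semantics of the kind discussed in this thread (when several threads race on an absent key, only one of them invokes the loader) can be sketched on-heap with computeIfAbsent, whose per-bin locking makes the other callers wait for the loaded value - roughly Guava's LoadingCache behavior. Illustrative only, not OHC's actual API:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Minimal loading-cache sketch: the loader runs at most once per absent key.
class LoadingCacheSketch<K, V> {
    private final ConcurrentHashMap<K, V> map = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    LoadingCacheSketch(Function<K, V> loader) { this.loader = loader; }

    V get(K key) {
        return map.computeIfAbsent(key, loader);   // racing callers block, not re-load
    }
}
```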
[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274005#comment-14274005 ] Ariel Weisberg edited comment on CASSANDRA-7438 at 1/12/15 7:25 PM: If you go all the way down the JMH rabbit hole you don't need to do any of your own timing; JMH will actually do some smart things to give you accurate timing and ameliorate the impact of non-scalable/expensive timing measurement. Metrics uses System.nanoTime() internally, so it isn't really any better as far as I can tell. System.nanoTime() on Linux is pretty scalable (http://shipilev.net/blog/2014/nanotrusting-nanotime/). When I tested it in JMH it actually seemed to be linearly scalable, but JMH will solve that for you even on platforms where nanoTime is finicky. The C* integration looks good. I'm glad it was easy. When it comes to exposing configuration parameters, less is more. I would prefer not to expose anything new, because once people start using options they don't like to have them taken away (or disabled). We should make an effort to set them automatically (or well enough), and if that fails we can add user-visible configuration. My preference is to make the options accessible via properties as an escape hatch in production, and then add them to the config if we really can't set them automatically. The stress tool, when used without workload profiles, does some validation: it checks that values are there and that the contents are correct. I did not know about the JNA synchronized block. That is surprising, but I am glad to hear it is getting fixed. For access to jemalloc I recommend using Unsafe and LD_PRELOADing jemalloc. I think that would be the recommended approach and the one you should benchmark against, with JNA there as a fallback. That gives you a JNI call for allocation/deallocation. I am trying out the JMH benchmark and looking at the new linked implementation right now. How are you starting the JMH benchmark?
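The nanoTime scalability point can be probed with a crude stdlib-only sketch like the one below. This is deliberately not a JMH benchmark (no warmup, no forked JVMs, no Blackhole), just an illustration of the kind of measurement being discussed:

```java
public class NanoTimeCost {
    static volatile long blackhole; // crude substitute for JMH's Blackhole

    // Rough average cost of one System.nanoTime() call, measured from
    // 'threads' threads running in parallel. Comparing the 1-thread and
    // N-thread numbers gives a feel for how well nanoTime scales.
    static double avgCostNanos(int threads, int callsPerThread) {
        long[] elapsed = new long[threads];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int idx = t;
            workers[t] = new Thread(() -> {
                long sink = 0;
                long start = System.nanoTime();
                for (int i = 0; i < callsPerThread; i++) sink += System.nanoTime();
                elapsed[idx] = System.nanoTime() - start;
                blackhole = sink; // keep the loop from being optimized away
            });
            workers[t].start();
        }
        long total = 0;
        for (int t = 0; t < threads; t++) {
            try { workers[t].join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
            total += elapsed[t];
        }
        return total / (double) ((long) threads * callsPerThread);
    }

    public static void main(String[] args) {
        System.out.printf("1 thread:  %.1f ns/call%n", avgCostNanos(1, 1_000_000));
        System.out.printf("8 threads: %.1f ns/call%n", avgCostNanos(8, 1_000_000));
    }
}
```

For real numbers, JMH handles warmup, dead-code elimination and per-thread state properly; this sketch only shows what "scalable timing measurement" means.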
[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274005#comment-14274005 ] Ariel Weisberg edited comment on CASSANDRA-7438 at 1/12/15 7:28 PM: If you go all the way down the JMH rabbit hole you don't need to do any of your own timing; JMH will actually do some smart things to give you accurate timing and ameliorate the impact of non-scalable/expensive timing measurement. Metrics uses System.nanoTime() internally, so it isn't really any better as far as I can tell. System.nanoTime() on Linux is pretty scalable (http://shipilev.net/blog/2014/nanotrusting-nanotime/). When I tested it in JMH it actually seemed to be linearly scalable, but JMH will solve that for you even on platforms where nanoTime is finicky. The C* integration looks good. I'm glad it was easy. When it comes to exposing configuration parameters, less is more. I would prefer not to expose anything new, because once people start using options they don't like to have them taken away (or disabled). We should make an effort to set them automatically (or well enough), and if that fails we can add user-visible configuration. My preference is to make the options accessible via properties as an escape hatch in production, and then add them to the config if we really can't set them automatically. Can you prefix any System properties you have with a classname/package or something that makes it clear they are part of OHC? The stress tool, when used without workload profiles, does some validation: it checks that values are there and that the contents are correct. I did not know about the JNA synchronized block. That is surprising, but I am glad to hear it is getting fixed. For access to jemalloc I recommend using Unsafe and LD_PRELOADing jemalloc. I think that would be the recommended approach and the one you should benchmark against, with JNA there as a fallback. That gives you a JNI call for allocation/deallocation.
I am trying out the JMH benchmark and looking at the new linked implementation right now. How are you starting the JMH benchmark?
[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14282467#comment-14282467 ] Robert Stupp edited comment on CASSANDRA-7438 at 1/19/15 12:41 PM: --- I think probably the best alternative for accessing malloc/free is {{Unsafe}} with jemalloc in LD_PRELOAD. The native code of {{Unsafe.allocateMemory}} is basically just a wrapper around {{malloc()}}/{{free()}}. Updated the git branch with the following changes:
* update to OHC 0.3
* benchmark: add a new command line option to specify the key length (-kl)
* free-capacity handling moved to the segments
* allow specifying the preferred memory allocator via the system property "org.caffinitas.ohc.allocator"
* allow specifying defaults for OHCacheBuilder via system properties prefixed with "org.caffinitas.org."
* benchmark: make metrics local to the driver threads
* benchmark: disable the bucket histogram in stats by default
I did not change the default number of segments (2 * CPUs) - but I thought about it (since you experienced that 256 segments on c3.8xlarge gives some improvement). A naive approach of, say, 8 * CPUs feels too heavy for small systems (with one socket) and might be too much outside of benchmarking. If someone wants to get the most out of it in production and is really limited by the number of segments, they can always configure it explicitly. WDYT? Using jemalloc on Linux via LD_PRELOAD is probably the way to go in C* (since off-heap memory is also used elsewhere). I think we should leave the OS allocator on OSX. I don't know much about allocator performance on Windows. For now I do not plan any new features in OHC for C* - so maybe we should start a final review round?
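The segment-count default discussed above (2 * CPUs, overridable via a system property) can be sketched as follows. The property name and rounding helper here are illustrative, not OHC's actual code:

```java
public class SegmentCount {
    // Default number of cache segments: 2 * available CPUs, overridable
    // via a system property (the property name below is hypothetical).
    static int segmentCount(String override, int cpus) {
        if (override != null) {
            int n = Integer.parseInt(override);
            if (n > 0) return ceilPow2(n);
        }
        return ceilPow2(2 * cpus);
    }

    // Round up to the next power of two so a bitmask
    // (hash & (segments - 1)) can pick the segment cheaply.
    static int ceilPow2(int n) {
        int v = Integer.highestOneBit(n);
        return v == n ? n : v << 1;
    }

    public static void main(String[] args) {
        String override = System.getProperty("org.caffinitas.ohc.segmentCount");
        int cpus = Runtime.getRuntime().availableProcessors();
        System.out.println("segments = " + segmentCount(override, cpus));
    }
}
```

The trade-off described in the comment is visible here: more segments means less lock contention but more fixed overhead, which is why a small multiplier is a safer default for one-socket machines.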
[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145338#comment-14145338 ] Robert Stupp edited comment on CASSANDRA-7438 at 9/23/14 8:01 PM: -- (note: [~vijay2...@yahoo.com], please use the other nick) Some quick notes:
* Can you add the assertion for {{capacity <= 0}} to {{OffheapCacheProvider.create}} - the current error message if {{row_cache_size_in_mb}} is not set (or invalid), "capacity should be set", could be more descriptive
* Additionally, the {{capacity}} check should also reject negative values (it starts with a negative value - I don't know what happens if it stays negative...)
* {{org.apache.cassandra.db.RowCacheTest#testRowCacheCleanup}} fails at the last assertion - all other unit tests seem to work
* The documentation in cassandra.yaml for row_cache_provider could be a bit more verbose - just a short abstract about the characteristics and limitations of both implementations (e.g. Offheap only works on Linux + OSX)
* IMO it would be fine to have a general unit test for {{com.lruc.api.LRUCache}} in C* code, too
* Please add an adapted copy of {{RowCacheTest}} for OffheapCacheProvider
* Unit tests using OffheapCacheProvider must not run on Windows builds - please add an assertion in OffHeapCacheProvider that it runs on Linux or OSX
Sorry for the late reply
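The requested capacity validation might look like this minimal sketch; the class name and error message are hypothetical, not the actual patch:

```java
public class CapacityCheck {
    // Validate the configured row cache capacity (in bytes) before creating
    // the off-heap cache; reject both unset (0) and negative values with a
    // message that names the yaml option, as requested in the review.
    static long validateCapacity(long capacity) {
        if (capacity <= 0)
            throw new IllegalArgumentException(
                "Invalid row cache capacity " + capacity
                + " - set row_cache_size_in_mb to a positive value");
        return capacity;
    }

    public static void main(String[] args) {
        System.out.println(validateCapacity(64L * 1024 * 1024));
        try {
            validateCapacity(-1);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```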
[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157989#comment-14157989 ] Jonathan Ellis edited comment on CASSANDRA-7438 at 10/3/14 1:55 PM: Are you still working on this, [~vijay2...@yahoo.com]?
[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195418#comment-14195418 ] Ariel Weisberg edited comment on CASSANDRA-7438 at 11/4/14 12:06 AM: - RE refcount: I think hazard pointers (never used them personally) are the no-gc, no-refcount way of handling this. It also won't be fetched twice if it is uncontended, which in many cases it will be, since it should be decref'd as soon as the data is copied. I think that with the right QA work this solves the problem of running arbitrarily large caches. That means running a validating workload in continuous integration that demonstrates the cache doesn't lock up, leak, or return the wrong answer. I would probably test directly against the cache to get more iterations in. RE implementation as a library via JNI: We give up something by using JNI, so it only makes sense if we get something else in return. The QA and release work created by JNI is pretty large. You really need a plan for running something like Valgrind or similar against a comprehensive suite of tests. Valgrind doesn't run well with Java AFAIK, so you end up doing things like running the native code in a separate process, and you have to write an interface amenable to that. Valgrind is also slow enough that if you try to run all your tests against a configuration using it, you end up with timeouts and many hours to run all the tests, plus time spent interpreting results. Unsafe is worse in some respects because there is no Valgrind, and I can attest that debugging an off-heap red-black tree is not fun. I am not clear on why the JNI is justified. It really seems like this could be written against Unsafe, and then it would work on any platform. There are no libraries or system calls in use that are only accessible via JNI. I think JNI would make more sense if we were pulling in existing code like memcached that already handles memory pooling, fragmentation, and concurrency.
If it were in Java you could use Disruptor for the queue and would only need to implement a thread safe off heap hash table. RE Performance and implementation: What kind of hardware was the benchmark run on? Server class NUMA? I am just wondering if there are enough cores to bring out any scalability issues in the cache implementation. It would be nice to see a benchmark that showed the on heap cache falling over while the off heap cache provides good performance. Subsequent comments aren't particularly useful if performance is satisfactory under relevant configurations. Given the use of a heap allocator and locking it might not make sense to have a background thread do expiration. I think that splitting the cache into several instances with one lock around each instance might result in less contention overall and it would scale up in a more straightforward way. It appears that some common operations will hit a global lock in may_expire() quite frequently? It seems like there are other globally shared frequently mutated cache lines in the write path like stats. Is there something subtle in the locking that makes the use of the custom queue and maps necessary or could you use stuff from Intel TBB and still make it work? It is hypothetically less code to have to QA and maintain. I still need to dig more, but I am also not clear on why locks are necessary for individual items. It looks like there is a table for all of them? Random intuition is that it could be done without a lock or at least a discrete lock. Striping against a padded pool of locks might make sense if that isn't going to cause deadlocks. Apparently every pthread_mutex_t is 40 bytes according to a random stack overflow post. It might make sense to use the same cache line as the refcount to store a lock field, or the bucket in the hash table? Another implementation question is do we want to use C++11? It would remove a lot of platform and compiler specific code. 
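The "several instances with one coarse lock each" alternative mentioned above reads roughly like this hypothetical on-heap sketch (the actual proposal keeps entries off heap; this only illustrates the locking scheme):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

public class SegmentedLruCache<K, V> {
    // One LRU map plus one coarse lock per segment; keys are spread over
    // segments by hash, so threads mostly contend only within a segment.
    private final ReentrantLock[] locks;
    private final Map<K, V>[] segments;

    @SuppressWarnings("unchecked")
    public SegmentedLruCache(int segmentCount, int perSegmentCapacity) {
        locks = new ReentrantLock[segmentCount];
        segments = new Map[segmentCount];
        for (int i = 0; i < segmentCount; i++) {
            locks[i] = new ReentrantLock();
            // An access-ordered LinkedHashMap that evicts its eldest entry
            // is the simplest possible per-segment LRU.
            segments[i] = new LinkedHashMap<K, V>(16, 0.75f, true) {
                protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                    return size() > perSegmentCapacity;
                }
            };
        }
    }

    private int segmentFor(K key) {
        return Math.floorMod(key.hashCode(), segments.length);
    }

    public void put(K key, V value) {
        int s = segmentFor(key);
        locks[s].lock();
        try { segments[s].put(key, value); } finally { locks[s].unlock(); }
    }

    public V get(K key) {
        int s = segmentFor(key);
        locks[s].lock();
        try { return segments[s].get(key); } finally { locks[s].unlock(); }
    }

    public static void main(String[] args) {
        SegmentedLruCache<String, Integer> cache = new SegmentedLruCache<>(8, 1000);
        cache.put("key", 42);
        System.out.println(cache.get("key"));
    }
}
```

Note the limitation Vijay raises later in the thread: each segment evicts independently, so a globally least-recently-used item is not necessarily the one evicted.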
[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195679#comment-14195679 ] Vijay edited comment on CASSANDRA-7438 at 11/4/14 3:54 AM: --- Thanks for reviewing! {quote} I am also not clear on why locks are necessary for individual items. {quote} We don't lock individual items. We have locks per segment; this is very similar to lock striping/Java's ConcurrentHashMap. {quote} global lock in may_expire() quite frequently? {quote} Not really - we lock globally only when we reach 100% of the space, then free up down to 80% of the space, and we spread that overhead across other threads based on whoever holds the item partition lock. It won't be hard to make this part of the queue thread, and I will try that for the next release of lruc. {quote} What kind of hardware was the benchmark run on? {quote} 32-core Intel Xeon with 100 GB RAM and NUMA. There is a benchmark util checked in as part of the lruc code which does exactly this kind of test. {quote} You really need a plan for running something like Valgrind {quote} Good point. I was part way down that road and still have the code; I can resurrect it for the next lruc version. {quote} I am not clear on why the JNI is justified {quote} There are some comments above with the reasoning for it (please see the above comments). PS: I believe there were some tickets complaining about the overhead of the current RowCache. {quote} I think JNI would make more sense if we were pulling in existing code like memcached {quote} The code is actually close to memcached's. I started off stripping down the memcached code so we could run it in-process instead of as a separate process, removing the global locks in queue reallocation etc., and eventually diverged too much from it. The other reason it doesn't use slab allocators is that we wanted the memory allocator to do the right thing - we have already tested Cassandra with jemalloc.
To comfort you a bit: lruc is already running in our production :)
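The 100%-full / free-down-to-80% behavior described above can be sketched as high/low watermark eviction. This is a hypothetical illustration, not lruc's actual code:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class WatermarkEviction {
    // Evict from the cold end of an LRU queue only once used bytes hit
    // capacity (the high watermark), then free down to 80% of capacity
    // (the low watermark).
    static final double LOW_WATERMARK = 0.8;

    final long capacityBytes;
    long usedBytes;
    final Deque<long[]> lru = new ArrayDeque<>(); // {id, sizeBytes}, hottest last

    WatermarkEviction(long capacityBytes) { this.capacityBytes = capacityBytes; }

    void add(long id, long sizeBytes) {
        if (usedBytes + sizeBytes >= capacityBytes) {
            long target = (long) (capacityBytes * LOW_WATERMARK);
            // Free in one batch down to the low watermark, so eviction work
            // is amortized instead of being paid on every single insert.
            while (usedBytes > target && !lru.isEmpty())
                usedBytes -= lru.removeFirst()[1];
        }
        lru.addLast(new long[] {id, sizeBytes});
        usedBytes += sizeBytes;
    }

    public static void main(String[] args) {
        WatermarkEviction cache = new WatermarkEviction(100);
        for (long i = 0; i < 20; i++) cache.add(i, 10);
        System.out.println("used = " + cache.usedBytes + " of 100");
    }
}
```

Ariel's objection is visible in this shape: the whole 20% batch is paid by whichever thread happens to trigger the high watermark, which is why moving it to the queue thread (with a queue-full check) came up next.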
[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196296#comment-14196296 ] Ariel Weisberg edited comment on CASSANDRA-7438 at 11/4/14 4:13 PM:
bq. No we don't. We have locks per Segment, this is very similar to lock striping/Java's concurrent hash map.
Thanks for clearing that up.
bq. Not really we lock globally when we reach 100% of the space and we freeup to 80% of the space and we spread the overhead to other threads based on who ever has the item partition lock. It won't be hard to make this part of the queue thread and will try it for the next release of lruc.
OK, that makes sense. 20% of the cache could be many milliseconds of work if you are using many gigabytes of cache. That's not a great thing to foist on a random victim thread. If you handed that to the queue thread, I think you run into another issue, which is that the ring buffer doesn't appear to check for queue full? The queue thread could go out to lunch for a while. Not a big deal, but finer-grained scheduling will probably be necessary.
bq. If you look at the code closer to memcached. Actually I started of stripping memcached code so we can run it in process instead of running as a separate process and removing the global locks in queue reallocation etc and eventually diverged too much from it. The other reason it doesn't use slab allocators is because we wanted the memory allocators to do the right thing we already have tested Cassandra with Jemalloc.
Ah, very cool. jemalloc is not a moving allocator, whereas it looks like memcached slabs implement rebalancing to accommodate changes in size distribution. That would actually be one of the really nice things to keep IMO.
On large memory systems with a cache that scales and performs, you would end up dedicating as much RAM as possible to the row cache/key cache and not the page cache, since the page cache is not as granular (correct me if the story for C* is different). If you dedicate 80% of RAM to the cache, that doesn't leave a lot of space for fragmentation. By using a heap allocator you also lose the ability to implement hard, predictable limits on memory used by the cache, since you didn't map it yourself. I could be totally off base and jemalloc might be good enough.
bq. There is some comments above which has the reasoning for it (please see the above comments). PS: I believe there was some tickets on Current RowCache complaining about the overhead.
I don't have a performance beef with JNI, especially the way you have done it, which I think is pretty efficient. I think the overhead of JNI (one or two slightly more expensive function calls) would be eclipsed by things like the cache misses, coherence traffic, and pipeline stalls that are part of accessing and maintaining a concurrent cache (Java or C++). It's all just intuition without comparative microbenchmarks of the two caches. Java might look a little faster just due to allocator performance, but we know you pay for that in other ways. I think what you have made scratches the itch for a large cache quite well, and beats the status quo. I don't agree that Unsafe couldn't do the exact same thing with no on-heap references. The hash table, ring buffer, and individual item entries are all being malloced, and you can do that from Java using Unsafe. You don't need to implement a ring buffer because you can use Disruptor. I also wonder if splitting the cache into several instances, each with a coarse lock per instance, wouldn't result in simpler (and, since performance is not an issue here, fast enough) code.
I don't want to advocate doing something different for performance, but rather to point out that there is the possibility of a relatively simple implementation via Unsafe. You could coalesce all the contended fields for each instance (stats, lock field, LRU head) into a single cache line, and then rely on a single barrier when releasing a coarse-grained lock. The fine-grained locking and CASing result in several pipeline stalls, because the memory barriers implicit in each one require the store buffers to drain. There may even be a suitable off-heap map implementation out there already. 
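The "no on-heap references" claim is easy to demonstrate in miniature: entries can live entirely off heap as length-prefixed byte runs, with Java code only computing offsets. This sketch uses `java.nio` direct buffers as a stand-in for `Unsafe.allocateMemory`; the layout `[keyLen:int][valLen:int][key][value]` is an assumption for illustration, not lruc's actual format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// A cache entry stored fully off heap: no per-entry Java object graph,
// just a direct buffer holding [keyLen][valLen][keyBytes][valueBytes].
final class OffHeapEntry {
    static ByteBuffer write(String key, String value) {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        byte[] v = value.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocateDirect(8 + k.length + v.length);
        buf.putInt(k.length).putInt(v.length).put(k).put(v);
        buf.flip();
        return buf;
    }

    static String readValue(ByteBuffer entry) {
        ByteBuffer b = entry.duplicate();   // don't disturb the entry's cursor
        int keyLen = b.getInt();
        int valLen = b.getInt();
        b.position(b.position() + keyLen);  // skip over the key bytes
        byte[] v = new byte[valLen];
        b.get(v);
        return new String(v, StandardCharsets.UTF_8);
    }
}
```

An off-heap hash table would then store only the long addresses of such entries in its bucket array, which is the "all malloced, reachable from Java" structure the comment describes.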
[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196492#comment-14196492 ] Vijay edited comment on CASSANDRA-7438 at 11/4/14 6:15 PM: --- {quote} well I think you run into another issue which is that the ring buffer doesn't appear to check for queue full? {quote} Yeah, I thought about that; we need to handle it, and that's why it isn't there in the first cut. It should not be too bad, though. {quote} I don't agree that Unsafe couldn't do the exact same thing with no on heap references {quote} Probably; since we have figured out most of the implementation details, sure we can, but there are always many different ways to solve the problem (it may not be very efficient to copy multiple bytes to get to the next item in the map, etc.; the GC and CPU overhead would be higher IMHO). For example, Memcached used the expiration time set by clients to remove items, which made the slab allocator easier for them, but that is something we removed in lruc in favor of just a queue. {quote} I also wonder if splitting the cache into several instances each with a coarse lock per instance wouldn't result in simpler {quote} The problem there is how you will invalidate the least-used items: since they are in different partitions, you really don't know which ones to invalidate. There is also a load-balancing problem (when to expand the buckets, etc.) which will bring us back to the current lock-striping solution, IMHO. I can do some benchmarks if that's exactly what we need at this point. Thanks! > Serializing Row cache alternative (Fully off heap) > -- > > Key: CASSANDRA-7438 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7438 > Project: Cassandra > Issue Type: Improvement > Components: Core > Environment: Linux > Reporter: Vijay > Assignee: Vijay > Labels: performance > Fix For: 3.0 > > Attachments: 0001-CASSANDRA-7438.patch > > > Currently SerializingCache is partially off heap; keys are still stored in > the JVM heap as ByteBuffers. > * There is a higher GC cost for a reasonably big cache. > * Some users have used the row cache efficiently in production for better > results, but this requires careful tuning. > * Memory overhead for the cache entries is relatively high. > So the proposal for this ticket is to move the LRU cache logic completely off > heap and use JNI to interact with the cache. We might want to ensure that the new > implementation matches the existing APIs (ICache), and the implementation > needs to have safe memory access, low memory overhead, and as few memcpys as > possible. > We might also want to make this cache configurable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
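The trade-off Vijay describes (split instances can only evict their own least-used entry, never the globally least-used one) can be made concrete with a short sketch. Keys are hash-partitioned across instances, each guarded by one coarse lock; all names are illustrative.

```java
import java.util.LinkedHashMap;

// Hash-partitioned cache instances, one coarse lock each. Eviction is only
// approximately LRU globally: an entry in another partition may be older,
// but each partition can only see and evict its own LRU tail.
final class PartitionedLru {
    private final LinkedHashMap<String, String>[] parts;
    private final int perPartCapacity;

    @SuppressWarnings("unchecked")
    PartitionedLru(int partitions, int perPartCapacity) {
        this.perPartCapacity = perPartCapacity;
        parts = new LinkedHashMap[partitions];
        for (int i = 0; i < partitions; i++)
            parts[i] = new LinkedHashMap<>(16, 0.75f, true); // access order
    }

    private LinkedHashMap<String, String> part(String key) {
        return parts[Math.floorMod(key.hashCode(), parts.length)];
    }

    String get(String key) {
        LinkedHashMap<String, String> p = part(key);
        synchronized (p) { return p.get(key); }   // coarse per-instance lock
    }

    void put(String key, String val) {
        LinkedHashMap<String, String> p = part(key);
        synchronized (p) {
            p.put(key, val);
            if (p.size() > perPartCapacity) {
                // Evict this partition's local LRU entry only; a globally
                // older entry in another partition is never considered.
                String eldest = p.keySet().iterator().next();
                p.remove(eldest);
            }
        }
    }
}
```

If access patterns skew toward a few partitions, hot partitions evict recently-used entries while cold partitions retain stale ones, which is the load-balancing problem raised in the comment.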