[jira] [Commented] (CASSANDRA-7247) Provide top ten most frequent keys per column family
[ https://issues.apache.org/jira/browse/CASSANDRA-7247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142509#comment-14142509 ]

Brandon Williams commented on CASSANDRA-7247:
---------------------------------------------

Totally missed this when I made CASSANDRA-7974, but I'm not surprised Chris and I think alike. :)

> Provide top ten most frequent keys per column family
> ----------------------------------------------------
>
>                 Key: CASSANDRA-7247
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7247
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Chris Lohfink
>            Assignee: Chris Lohfink
>            Priority: Minor
>         Attachments: cassandra-2.1-7247.txt, jconsole.png, patch.txt
>
>
> Since we already have the nice addthis stream library, we can use it to keep track
> of the most frequent DecoratedKeys that come through the system using
> StreamSummaries ([nice
> explanation|http://boundary.com/blog/2013/05/14/approximate-heavy-hitters-the-spacesaving-algorithm/]).
> Then provide a new metric to access them via JMX.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
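For readers unfamiliar with the Space-Saving approach that the issue description links to, here is a minimal sketch of the idea in plain Java. It is not the stream library's StreamSummary and not the code in the attached patches; the class name SpaceSavingSketch and its methods are hypothetical, and the linear scan for the minimum is a simplification that a real implementation would presumably avoid.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a stream summary: tracks at most `capacity` keys. */
final class SpaceSavingSketch<T> {
    private final int capacity;
    private final Map<T, Long> counts = new HashMap<>();

    SpaceSavingSketch(int capacity) {
        if (capacity <= 0) throw new IllegalArgumentException("capacity must be positive");
        this.capacity = capacity;
    }

    /** Records one occurrence of item. Not thread-safe. */
    void offer(T item) {
        Long current = counts.get(item);
        if (current != null) {
            counts.put(item, current + 1);
        } else if (counts.size() < capacity) {
            counts.put(item, 1L);
        } else {
            // Full: evict the key with the smallest count. The newcomer inherits
            // that count plus one, so estimates can be too high but never too low.
            Map.Entry<T, Long> min = null;
            for (Map.Entry<T, Long> e : counts.entrySet()) {
                if (min == null || e.getValue() < min.getValue()) min = e;
            }
            long inherited = min.getValue();
            counts.remove(min.getKey());
            counts.put(item, inherited + 1);
        }
    }

    /** Returns up to k (key, estimated count) pairs, highest estimate first. */
    List<Map.Entry<T, Long>> topK(int k) {
        List<Map.Entry<T, Long>> entries = new ArrayList<>();
        for (Map.Entry<T, Long> e : counts.entrySet()) {
            entries.add(new AbstractMap.SimpleImmutableEntry<T, Long>(e));
        }
        entries.sort(Map.Entry.<T, Long>comparingByValue().reversed());
        return new ArrayList<>(entries.subList(0, Math.min(k, entries.size())));
    }
}
```

Memory stays bounded by the capacity regardless of how many distinct keys pass through, which is what makes this kind of structure attractive for a per-column-family metric.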
[jira] [Commented] (CASSANDRA-7247) Provide top ten most frequent keys per column family
[ https://issues.apache.org/jira/browse/CASSANDRA-7247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142375#comment-14142375 ]

Benedict commented on CASSANDRA-7247:
-------------------------------------

It's probably better to construct a lightweight wrapper around the data you're using for equality (key bytes / token), with knowledge of _how_ to turn it into a string, and to do the conversion only when we're asked for the TopK.

It could well be worth enabling this on a per-CF / per-KS basis, though, or configuring the size of the sample in the yaml. If you have large keys (64K), the structure as it stands will take up > 128MB per key space, or > 64MB with the adjustment I've just suggested. Either way that's non-trivial, especially since we have two of them. Admittedly such large keys are not likely to be common.
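The wrapper idea above could look roughly like the following. This is only a sketch under assumptions: SampledKey and its formatter parameter are hypothetical names, and in Cassandra the formatter role would presumably be filled by the partition key type's string conversion rather than a plain Function.

```java
import java.nio.ByteBuffer;
import java.util.function.Function;

/** Hypothetical wrapper: equality on the raw key bytes, string built on demand. */
final class SampledKey {
    private final ByteBuffer keyBytes;
    private final Function<ByteBuffer, String> formatter;

    SampledKey(ByteBuffer keyBytes, Function<ByteBuffer, String> formatter) {
        // duplicate() shares the underlying bytes but has its own position and
        // limit; this assumes the caller does not modify the bytes afterwards.
        this.keyBytes = keyBytes.duplicate();
        this.formatter = formatter;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof SampledKey && keyBytes.equals(((SampledKey) o).keyBytes);
    }

    @Override
    public int hashCode() {
        return keyBytes.hashCode();
    }

    /** The formatting cost is paid here, when the top keys are read, not per write. */
    @Override
    public String toString() {
        // Hand the formatter its own view so decoding cannot move our position.
        return formatter.apply(keyBytes.duplicate());
    }
}
```

Because only the bytes are retained per tracked key, the per-entry footprint is the key itself rather than a pre-rendered string of it.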
[jira] [Commented] (CASSANDRA-7247) Provide top ten most frequent keys per column family
[ https://issues.apache.org/jira/browse/CASSANDRA-7247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142343#comment-14142343 ]

Chris Lohfink commented on CASSANDRA-7247:
------------------------------------------

Updated to always do it, though I think options 2 or 3 are equally viable. It still uses an executor to single-thread the updates, which keeps the StreamSummary performant, and provides a 1k backlog cap, especially since I'm not sure about the performance impact of now using the AbstractType.

Instead of using DecoratedKey.toString, I changed it to use the human-readable format from the partition's type, which makes it more useful for debugging. If we keep this as an always-on option, I can add a nodetool command to list them out in a nice format.
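A single-threaded executor with a capped backlog, as described above, might be set up along these lines. This is a sketch using only java.util.concurrent, not the code from the patch: the class name BoundedSampler is hypothetical, and silently dropping samples once the backlog is full is an assumption about the intended behaviour.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sampler: one worker thread, bounded queue, overflow is dropped. */
final class BoundedSampler {
    private final AtomicLong dropped = new AtomicLong();
    private final ThreadPoolExecutor executor;

    BoundedSampler(int backlogCap) {
        executor = new ThreadPoolExecutor(
            1, 1,                                    // exactly one worker thread
            0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<Runnable>(backlogCap),
            // Called when the queue is full: count the sample and discard it
            // instead of blocking or failing the caller on the write path.
            (task, pool) -> dropped.incrementAndGet());
    }

    /** Hands an update to the worker thread. Never blocks the caller. */
    void submit(Runnable update) {
        executor.execute(update);
    }

    long droppedCount() {
        return dropped.get();
    }

    /** Stops accepting work and waits up to ten seconds for the backlog to drain. */
    boolean shutdownAndWait() {
        executor.shutdown();
        try {
            return executor.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With one worker thread, the summary being updated needs no locking of its own, and a cap such as new BoundedSampler(1000) bounds how much memory a slow consumer can hold.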
[jira] [Commented] (CASSANDRA-7247) Provide top ten most frequent keys per column family
[ https://issues.apache.org/jira/browse/CASSANDRA-7247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14036113#comment-14036113 ]

Jonathan Ellis commented on CASSANDRA-7247:
-------------------------------------------

Using tracingProbability as a coarse on/off feels wrong to me. I'd prefer one of these options:

# If it's cheap, just do it always
# If it makes sense to do it along with other tracing ops, trace if we've enabled a trace state already
# Otherwise, introduce a new per-table setting
[jira] [Commented] (CASSANDRA-7247) Provide top ten most frequent keys per column family
[ https://issues.apache.org/jira/browse/CASSANDRA-7247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004005#comment-14004005 ]

Chris Lohfink commented on CASSANDRA-7247:
------------------------------------------

Added a patch that uses the trace executor to track the partitions that are updated the most, the partitions with the most columns inserted (useful for finding rows that are too wide), and the partitions with the slowest insertion times. It will only track if trace probability > 0.
[jira] [Commented] (CASSANDRA-7247) Provide top ten most frequent keys per column family
[ https://issues.apache.org/jira/browse/CASSANDRA-7247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001741#comment-14001741 ]

Chris Lohfink commented on CASSANDRA-7247:
------------------------------------------

I think I'd like to rework this to use the trace executor.
[jira] [Commented] (CASSANDRA-7247) Provide top ten most frequent keys per column family
[ https://issues.apache.org/jira/browse/CASSANDRA-7247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999621#comment-13999621 ]

Chris Lohfink commented on CASSANDRA-7247:
------------------------------------------

The problem is that StreamSummary is not thread safe. There is a ConcurrentStreamSummary, which I found in this implementation to be ~5x slower than a synchronized block around the offer of the non-thread-safe one. The concurrent one did perform similarly when also wrapped in a synchronized block, which I will show below, but because it would lose any benefit of being a concurrent implementation when access is serialized, I think the faster impl is best.

Done on a 2013 Retina MBP with a 500GB SSD:

{code:title=No Changes}
id,                 ops,  op/s, key/s, mean, med, .95, .99, .999,    max, time,  stderr
4 threadCount,   634450, 21692, 21692,  0.2, 0.2, 0.2, 0.2,  0.4,  740.1, 29.2, 0.01188
8 threadCount,   886600, 29762, 29762,  0.3, 0.2, 0.3, 0.4,  1.3, 1007.3, 29.8, 0.01220
16 threadCount,  912050, 29035, 29035,  0.5, 0.3, 0.9, 2.5, 11.2, 1393.8, 31.4, 0.01162
24 threadCount, 1022250, 32681, 32681,  0.7, 0.5, 1.0, 2.9, 13.5, 1126.5, 31.3, 0.00923
36 threadCount,  946550, 30900, 30900,  1.2, 0.8, 1.4, 3.0, 22.5, 1369.2, 30.6, 0.01089
{code}

{code:title=With Patch}
id,                 ops,  op/s, key/s, mean, med, .95, .99, .999,    max, time,  stderr
4 threadCount,   643900, 21700, 21700,  0.2, 0.2, 0.2, 0.2,  0.9,  941.1, 29.7, 0.01079
8 threadCount,   942100, 32300, 32300,  0.2, 0.2, 0.3, 0.3,  1.2,  849.5, 29.2, 0.01519
16 threadCount,  907400, 30650, 30650,  0.5, 0.3, 0.8, 1.9, 10.7, 1124.0, 29.6, 0.01112
24 threadCount, 1026150, 31753, 31753,  0.7, 0.5, 0.9, 3.3, 20.6, 1299.0, 32.3, 0.01295
36 threadCount,  980600, 30077, 30077,  1.2, 0.8, 1.3, 2.7, 24.9, 1394.3, 32.6, 0.01747
{code}
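The "synchronized block around the offer" approach can be illustrated with a small stand-in. This is a sketch only: SynchronizedCounts is a hypothetical class, and a plain HashMap plays the part of the non-thread-safe summary so that the example has no external dependencies. The hammer helper exists purely to demonstrate that no updates are lost under concurrent writers.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical wrapper that serializes access to a non-thread-safe counter. */
final class SynchronizedCounts<T> {
    // Stand-in for the non-thread-safe summary; HashMap is also unsafe unguarded.
    private final Map<T, Long> counts = new HashMap<>();
    private final Object lock = new Object();

    void offer(T key) {
        synchronized (lock) {
            counts.merge(key, 1L, Long::sum);
        }
    }

    long count(T key) {
        synchronized (lock) {
            return counts.getOrDefault(key, 0L);
        }
    }

    /** Demo helper: several threads offer the same key, then the total is read. */
    static long hammer(int threads, int offersPerThread) {
        SynchronizedCounts<String> counts = new SynchronizedCounts<>();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < offersPerThread; j++) counts.offer("hot-key");
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return counts.count("hot-key");
    }
}
```

Readers take the same lock as writers here, which is the small read-side contention on the write path that a dedicated single-threaded stage would avoid.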
[jira] [Commented] (CASSANDRA-7247) Provide top ten most frequent keys per column family
[ https://issues.apache.org/jira/browse/CASSANDRA-7247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999849#comment-13999849 ]

Chris Lohfink commented on CASSANDRA-7247:
------------------------------------------

Another option might be to spin this and other metrics off into the MiscStage. It only has a single thread, so no synchronization is required, and it won't be as bad to put additional metrics in there as well for additional visibility, like topK size in bytes, worst latencies and such.

I wouldn't expect much difference performance-wise with just the one stream summary above, since enqueuing onto the LinkedBlockingQueue should have similar locking performance (synchronization on the putLock), but then reading the metric would never cause contention (albeit very small) on the write path. If there's any interest I can give it a shot, though, and maybe throw in some additional metrics.