Your extract of cfhistograms shows that there are no particularly wide rows. The widest partition has 61214 cells, which is big but not large enough to be a real concern.
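For the record, you can put a number on that directly from the histogram you posted. A throwaway Python sketch (the pairs are transcribed from the output quoted below; zero-count buckets omitted):

```python
# (cell count, number of partitions) pairs from the cfhistograms
# "Cell Count per Partition" output quoted below.
hist = [
    (1, 50449950), (2, 14281828), (3, 8093366), (4, 5029200),
    (5, 3103023), (6, 3059903), (7, 1903018), (8, 1509297),
    (10, 2420359), (12, 1624895), (14, 1171678), (17, 1289391),
    (20, 909777), (24, 852081), (29, 722925), (35, 587067),
    (42, 459473), (50, 358744), (60, 304146), (72, 244682),
    (86, 191045), (103, 155337), (124, 127061), (149, 98913),
    (179, 77454), (215, 59849), (258, 46117), (310, 35321),
    (372, 26319), (446, 19379), (535, 13783), (642, 9993),
    (770, 6973), (924, 4713), (1109, 3229), (1331, 2062),
    (1597, 1338), (1916, 773), (2299, 495), (2759, 268),
    (3311, 150), (3973, 100), (4768, 42), (5722, 24),
    (6866, 12), (8239, 9), (9887, 3), (14237, 5),
    (17084, 1), (24601, 2), (61214, 2),
]

total = sum(n for _, n in hist)
wide = sum(n for cells, n in hist if cells >= 5000)
print(f"total partitions: {total}")
print(f"partitions with >= 5000 cells: {wide} ({wide / total:.6%})")
```

Only 58 partitions out of roughly 99 million sit above 5000 cells, which is why the distribution itself does not look pathological.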
Turning on trace probability only gives you some hints about what kind of queries are being run; it does not give you the exact partition key nor other statement values, especially when you are using prepared statements ...

"I am considering reducing read_request_timeout_in_ms: 5000 in cassandra.yaml so that it reduces the impact when this occurs." --> Don't do that, you'll only sweep the dust under the carpet. Find the real issue and fix it instead of changing a parameter to hide it.

One solution would be, on the client side, to activate some logging to show the CQL3 statements the application is issuing that may be overloading the server. I know that's easier said than done, but I don't have any other idea for the moment.

-------- Shameless self-promotion --------

To support this kind of live prod debugging & investigation, I added a new dynamic query logging feature in Achilles: https://github.com/doanduyhai/Achilles/wiki/Statements-Logging-and-Tracing#dynamic-statements-logging

Once you hit the issue, this kind of feature may save your day...

On Thu, Jul 24, 2014 at 9:22 PM, Keith Wright <kwri...@nanigans.com> wrote:

> I can see from cfhistograms that I do have some wide rows (see below). I
> set trace probability as you suggested but the output doesn't appear to
> tell me what row was actually read unless I missed something. I just see
> "executing prepared statement". Any ideas how I can find the row in
> question?
>
> I am considering reducing read_request_timeout_in_ms: 5000 in
> cassandra.yaml so that it reduces the impact when this occurs.
> Any help in identifying my issue would be GREATLY appreciated.
>
> Cell Count per Partition
>
> 1 cells: 50449950
> 2 cells: 14281828
> 3 cells: 8093366
> 4 cells: 5029200
> 5 cells: 3103023
> 6 cells: 3059903
> 7 cells: 1903018
> 8 cells: 1509297
> 10 cells: 2420359
> 12 cells: 1624895
> 14 cells: 1171678
> 17 cells: 1289391
> 20 cells: 909777
> 24 cells: 852081
> 29 cells: 722925
> 35 cells: 587067
> 42 cells: 459473
> 50 cells: 358744
> 60 cells: 304146
> 72 cells: 244682
> 86 cells: 191045
> 103 cells: 155337
> 124 cells: 127061
> 149 cells: 98913
> 179 cells: 77454
> 215 cells: 59849
> 258 cells: 46117
> 310 cells: 35321
> 372 cells: 26319
> 446 cells: 19379
> 535 cells: 13783
> 642 cells: 9993
> 770 cells: 6973
> 924 cells: 4713
> 1109 cells: 3229
> 1331 cells: 2062
> 1597 cells: 1338
> 1916 cells: 773
> 2299 cells: 495
> 2759 cells: 268
> 3311 cells: 150
> 3973 cells: 100
> 4768 cells: 42
> 5722 cells: 24
> 6866 cells: 12
> 8239 cells: 9
> 9887 cells: 3
> 11864 cells: 0
> 14237 cells: 5
> 17084 cells: 1
> 20501 cells: 0
> 24601 cells: 2
> 29521 cells: 0
> 35425 cells: 0
> 42510 cells: 0
> 51012 cells: 0
> 61214 cells: 2
>
> From: DuyHai Doan <doanduy...@gmail.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Thursday, July 24, 2014 at 3:01 PM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Hot, large row
>
> "How can I detect wide rows?" -->
>
> nodetool cfhistograms <keyspace> <suspected column family>
>
> Look at the "Column count" column (the last one) and identify a line with
> a very high "Offset" value. In a well designed application you should have
> a Gaussian distribution where 80% of your rows have a similar number of
> columns.
>
> "Anyone know what debug level I can set so that I can see what reads the
> hot node is handling?" -->
>
> "nodetool settraceprobability <value>", where value is a small number
> (0.001), on the node where you encounter the issue. Activate the tracing
> for a while (5 mins) and then deactivate it (value = 0). Then look into the
> system_traces tables "events" & "sessions". It may or may not help, since
> at that setting only one request in 1000 is traced.
>
> "Any way to get the server to blacklist these wide rows automatically?"
> --> No
>
> On Thu, Jul 24, 2014 at 8:48 PM, Keith Wright <kwri...@nanigans.com>
> wrote:
>
>> Hi all,
>>
>> We are seeing an issue where basically daily one of our nodes spikes in
>> load and churns in CMS heap pressure. It appears that reads are backing
>> up, and my guess is that our application is reading a large row
>> repeatedly. Our write pattern can lend itself to wide rows very
>> infrequently (<0.001%) and we do our best to detect and delete them, but
>> obviously we're missing a case. Hoping for assistance on the following
>> questions:
>>
>> - How can I detect wide rows?
>> - Anyone know what debug level I can set so that I can see what reads
>> the hot node is handling? I'm hoping to see the "bad" row
>> - Any way to get the server to blacklist these wide rows automatically?
>>
>> We're using C* 2.0.6 with vnodes.
>>
>> Thanks