"If I run nodetool tpstats, I see a high number of items in the Pending
phase for ReadStage.  Other items mostly appear near empty.  In addition, I
see dropped reads" --> Have a look at system I/O & CPU stats to check for
possible bottlenecks.

This symptom is not necessarily caused by wide rows. Sometimes your disk I/O
simply can't keep up with an intense read burst.
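
For example (a sketch; adjust the intervals and device list to your
environment):

    # per-device utilization and request latency (await, %util)
    iostat -x 2 10

    # CPU, run queue and swap activity
    vmstat 2 10

    # pending and dropped counts per thread pool on the suspect node
    nodetool tpstats

If %util stays pinned near 100% or await climbs during the read burst, the
disks are the bottleneck rather than one hot partition.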




On Thu, Jul 24, 2014 at 10:00 PM, Keith Wright <kwri...@nanigans.com> wrote:

> I appreciate the feedback, doing it on the client side is interesting and
> I will start looking into that.  To be clear, here are the symptoms I am
> seeing:
>
>
>    - A node will start showing high load and the CMS collection time
>    jumps to 100+ ms/sec (ParNew collection time is also up)
>    - If I run nodetool tpstats, I see a high number of items in the
>    Pending phase for ReadStage.  Other items mostly appear near empty.  In
>    addition, I see dropped reads.
>    - If I shut down the affected node, two other nodes show increased
>    load.  I am at RF=3, so this again suggests slow reads against a single
>    row.
>
> Does it sound correct that my best course of action is to investigate a
> large row?  If it were a small row being called repeatedly, I assume OS/key
> cache would make that a VERY fast operation.
>
> Thanks
>
> From: DuyHai Doan <doanduy...@gmail.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Thursday, July 24, 2014 at 3:53 PM
>
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Hot, large row
>
> Your cfhistograms extract shows that there are no particularly "wide
> rows". The widest has 61214 cells, which is big but not huge enough to be a
> real concern.
>
> Turning on the trace probability only gives you some "hints" about what
> kind of queries are being run; it does not give the exact partition key nor
> other statement values, especially when you are using prepared statements
> ...
>
>
> "I am considering reducing read_request_timeout_in_ms: 5000 in
> cassandra.yaml so that it reduces the impact when this occurs." --> Don't
> do that; you'll only sweep the dust under the carpet. Find the real issue
> and fix it instead of changing a parameter to hide it.
>
> One solution would be, on the client side, to activate some logging to show
> the CQL3 statements the application is issuing that may be overloading the
> server.  I know that's easier said than done, but I don't have any other
> idea for the moment.
>
> -------- Shameless self-promotion ------
>
> To support this kind of live prod debugging & investigation, I added a
> new dynamic query logging feature in Achilles:
> https://github.com/doanduyhai/Achilles/wiki/Statements-Logging-and-Tracing#dynamic-statements-logging
>
> Once you hit the issue, this kind of feature may save your day...
>
>
>
>
>
>
> On Thu, Jul 24, 2014 at 9:22 PM, Keith Wright <kwri...@nanigans.com>
> wrote:
>
>> I can see from cfhistograms that I do have some wide rows (see below).  I
>> set the trace probability as you suggested, but the output doesn't appear
>> to tell me what row was actually read unless I missed something; I just
>> see "executing prepared statement".  Any ideas how I can find the row in
>> question?
>>
>> I am considering reducing read_request_timeout_in_ms: 5000 in
>> cassandra.yaml so that it reduces the impact when this occurs.
>>
>> Any help in identifying my issue would be GREATLY appreciated
>>
>> Cell Count per Partition
>>
>>     1 cells: 50449950
>>     2 cells: 14281828
>>     3 cells: 8093366
>>     4 cells: 5029200
>>     5 cells: 3103023
>>     6 cells: 3059903
>>     7 cells: 1903018
>>     8 cells: 1509297
>>    10 cells: 2420359
>>    12 cells: 1624895
>>    14 cells: 1171678
>>    17 cells: 1289391
>>    20 cells: 909777
>>    24 cells: 852081
>>    29 cells: 722925
>>    35 cells: 587067
>>    42 cells: 459473
>>    50 cells: 358744
>>    60 cells: 304146
>>    72 cells: 244682
>>    86 cells: 191045
>>   103 cells: 155337
>>   124 cells: 127061
>>   149 cells: 98913
>>   179 cells: 77454
>>   215 cells: 59849
>>   258 cells: 46117
>>   310 cells: 35321
>>   372 cells: 26319
>>   446 cells: 19379
>>   535 cells: 13783
>>   642 cells: 9993
>>   770 cells: 6973
>>   924 cells: 4713
>>  1109 cells: 3229
>>  1331 cells: 2062
>>  1597 cells: 1338
>>  1916 cells: 773
>>  2299 cells: 495
>>  2759 cells: 268
>>  3311 cells: 150
>>  3973 cells: 100
>>  4768 cells: 42
>>  5722 cells: 24
>>  6866 cells: 12
>>  8239 cells: 9
>>  9887 cells: 3
>> 11864 cells: 0
>> 14237 cells: 5
>> 17084 cells: 1
>> 20501 cells: 0
>> 24601 cells: 2
>> 29521 cells: 0
>> 35425 cells: 0
>> 42510 cells: 0
>> 51012 cells: 0
>> 61214 cells: 2
>>
>> From: DuyHai Doan <doanduy...@gmail.com>
>> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>> Date: Thursday, July 24, 2014 at 3:01 PM
>> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>> Subject: Re: Hot, large row
>>
>> "How can I detect wide rows?" -->
>>
>> nodetool cfhistograms <keyspace> <suspected column family>
>>
>> Look at the "Column count" column (the last one) and identify lines with a
>> very high "Offset" value. In a well-designed application you should have a
>> Gaussian distribution where 80% of your rows have a similar number of
>> columns.
>>
>> "Anyone know what debug level I can set so that I can see what reads the
>> hot node is handling?  " -->
>>
>> "nodetool settraceprobability <value>",  where value is a small number
>> (e.g. 0.001), on the node where you encounter the issue. Activate the
>> tracing for a while (5 mins), then deactivate it (value = 0). Then look
>> into the system_traces tables "events" & "sessions". It may or may not
>> help, since at 0.001 only about one request in a thousand is traced.
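>>
>> For example, from cqlsh (a sketch; the columns assume the stock
>> system_traces schema):
>>
>>     SELECT session_id, started_at, request, parameters
>>     FROM system_traces.sessions LIMIT 20;
>>
>>     SELECT activity, source, source_elapsed
>>     FROM system_traces.events WHERE session_id = <session_id from above>;
>>
>> The "request" and "parameters" columns give a hint of which statement was
>> traced, and the per-event activities show where the time went.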
>>
>> "Any way to get the server to blacklist these wide rows automatically?"
>> --> No
>>
>>
>> On Thu, Jul 24, 2014 at 8:48 PM, Keith Wright <kwri...@nanigans.com>
>> wrote:
>>
>>> Hi all,
>>>
>>>    We are seeing an issue where, roughly daily, one of our nodes spikes
>>> in load and churns under CMS heap pressure.  It appears that reads are
>>> backing up, and my guess is that our application is reading a large row
>>> repeatedly.  Our write structure can lend itself to wide rows very
>>> infrequently (<0.001%), and we do our best to detect and delete them, but
>>> obviously we're missing a case.  Hoping for assistance with the following
>>> questions:
>>>
>>>    - How can I detect wide rows?
>>>    - Anyone know what debug level I can set so that I can see what
>>>    reads the hot node is handling?  I’m hoping to see the “bad” row
>>>    - Any way to get the server to blacklist these wide rows
>>>    automatically?
>>>
>>> We’re using C* 2.0.6 with Vnodes.
>>>
>>> Thanks
>>>
>>
>>
>
