Add some more flags, so the VM output (including the safepoint statistics) goes to a file instead of stdout: -XX:+UnlockDiagnosticVMOptions -XX:LogFile=${path} -XX:+LogVMOutput
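
In conf/cassandra-env.sh that might look something like this (a sketch; the
log path is illustrative, and -XX:+UnlockDiagnosticVMOptions has to come
before the two diagnostic flags it unlocks):

    JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
    JVM_OPTS="$JVM_OPTS -XX:+LogVMOutput -XX:LogFile=/var/log/cassandra/vm.log"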

I never figured out what closes stdout for C*. It's one of the libraries we
depend on, but I didn't try too hard to figure out which one.
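
(An alternative that might work, sketched rather than tested: run cassandra
in the foreground and do the redirection yourself, e.g.

    bin/cassandra -f > /var/log/cassandra/stdout.log 2>&1 &

the -f flag keeps the process in the foreground; the log path is
illustrative.)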


On 29 January 2014 21:07, Frank Ng <fnt...@gmail.com> wrote:

> Benedict,
> Thanks for the advice.  I've tried turning on PrintSafepointStatistics.
> However, that info is only sent to the STDOUT console.  The cassandra
> startup script closes the STDOUT when it finishes, so nothing is shown for
> safepoint statistics once it's done starting up.  Do you know how to
> startup cassandra and send all stdout to a log file and tell cassandra not
> to close stdout?
>
> Also, we have swap turned off as recommended.
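>
> (For example, a quick way to confirm that:
>
>     free -m          # the Swap: line should show 0 total
>     swapoff -a       # disables any swap that is still active
>
> both standard Linux commands.)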
>
> thanks
>
>
> On Wed, Jan 29, 2014 at 3:39 PM, Benedict Elliott Smith <
> belliottsm...@datastax.com> wrote:
>
>> Frank,
>>
>>
>> The same advice for investigating holds: add the VM flags
>> -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1
>> (you could put something above 1 there to reduce the amount of logging,
>> since a pause of 52s will be pretty obvious even when aggregated with lots
>> of other safepoints; the count is the number of safepoints to aggregate
>> into one log message)
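>>
>> e.g., purely illustratively:
>>
>>     -XX:PrintSafepointStatisticsCount=1     # log every safepoint
>>     -XX:PrintSafepointStatisticsCount=100   # aggregate 100 per log message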
>>
>>
>> 52s is a very extreme pause, and I would be surprised if biased-lock
>> revocation could cause it. I wonder if the VM is swapping out.
>>
>>
>>
>> On 29 January 2014 19:02, Frank Ng <fnt...@gmail.com> wrote:
>>
>>> Thanks for the update.  Our logs indicated that there were 0 pending for
>>> CompactionManager at that time, and no nodetool repairs were running.
>>> The log statements above say that the application had to stop to reach a
>>> safepoint, yet they don't say what requested the safepoint.
>>>
>>>
>>> On Wed, Jan 29, 2014 at 1:20 PM, Shao-Chuan Wang <
>>> shaochuan.w...@bloomreach.com> wrote:
>>>
>>>> We had similar latency spikes when pending compactions couldn't keep up
>>>> or repair/streaming was taking too many cycles.
>>>>
>>>>
>>>> On Wed, Jan 29, 2014 at 10:07 AM, Frank Ng <fnt...@gmail.com> wrote:
>>>>
>>>>> All,
>>>>>
>>>>> We've been having intermittent long application pauses (version 1.2.8)
>>>>> and are not sure if it's a cassandra bug.  During these pauses, there
>>>>> are dropped messages in the cassandra log file, and the node sees other
>>>>> nodes as down.  We've turned on gc logging, and the following is an
>>>>> example of a long "stopped" or pause event in the gc.log file.
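>>>>>
>>>>> (The "stopped" lines below are what -XX:+PrintGCApplicationStoppedTime
>>>>> emits; a typical gc-logging flag pair, sketched with an illustrative
>>>>> path:
>>>>>
>>>>>     -Xloggc:/var/log/cassandra/gc.log
>>>>>     -XX:+PrintGCApplicationStoppedTime
>>>>> )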
>>>>>
>>>>> 2014-01-28T23:11:12.183-0500: 1337654.424: Total time for which
>>>>> application threads were stopped: 0.091450 seconds
>>>>> 2014-01-28T23:14:11.161-0500: 1337833.401: Total time for which
>>>>> application threads were stopped: 51.8190260 seconds
>>>>> 2014-01-28T23:14:19.870-0500: 1337842.111: Total time for which
>>>>> application threads were stopped: 0.005470 seconds
>>>>>
>>>>> As seen above, there was a 0.091450 sec pause, then a 51.8190260 sec
>>>>> pause.  There were no GC log events between those 2 log statements.
>>>>> Since there are no GC logs in between, something else must be causing
>>>>> the long stop time to reach a safepoint.
>>>>>
>>>>> Could there be a Cassandra thread that is taking a long time to reach
>>>>> a safepoint, and if so, what is it trying to do? Along with the node
>>>>> seeing other nodes as down in the cassandra log file, the StatusLogger
>>>>> shows 1599 Pending in ReadStage and 9 Pending in MutationStage.
>>>>>
>>>>> There is mention of cassandra's batch revocation of biased locks as a
>>>>> possible cause (not GC) via:
>>>>> http://www.mail-archive.com/user@cassandra.apache.org/msg34401.html
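>>>>>
>>>>> (If biased-lock revocation is the suspect, one experiment, suggested
>>>>> here only as a test, is to disable biased locking outright and see
>>>>> whether the pauses disappear:
>>>>>
>>>>>     -XX:-UseBiasedLocking
>>>>> )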
>>>>>
>>>>> We have JNA, no swap, and the cluster runs fine except for these
>>>>> intermittent long pauses that can cause a node to appear down to other
>>>>> nodes.  Any ideas as to the cause of the long pause above? It seems
>>>>> unrelated to GC.
>>>>>
>>>>> thanks.
>>>>>
>>>>>
>>>>
>>>
>>
>
