If you can, do a few short runs of the Cassandra stress test against your
production cluster (maybe 10m records each, replication=3, force quorum to 3;
delete the default schema between executions). Look for max latency in the
tens of SECONDS. If your devops team is running a monitoring tool that looks
at the network, look for timeouts, retries, errors, lost packets, etc. during
the run. Worst case, you need to run netstat against the relevant NIC, e.g.
every 10 seconds on the CassStress node, and look for jumps in those counts.
If monitoring is enabled, look at the monitor's results for ALL of your nodes.
At least one is having some issues.
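
If no monitoring tool is available, the "poll every 10 seconds and look for jumps" idea can be sketched roughly as below. This is only an illustration, not something from the thread: it assumes a Linux node and reads the kernel's /proc/net/dev counters instead of parsing netstat output, and the names parse_proc_net_dev, counter_jumps and watch are made up for this sketch.

```python
import time

# Field order of the per-interface counters in Linux /proc/net/dev.
FIELDS = [
    "rx_bytes", "rx_packets", "rx_errs", "rx_drop",
    "rx_fifo", "rx_frame", "rx_compressed", "rx_multicast",
    "tx_bytes", "tx_packets", "tx_errs", "tx_drop",
    "tx_fifo", "tx_colls", "tx_carrier", "tx_compressed",
]

# Counters that should normally stay flat on a healthy NIC.
WATCHED = ("rx_errs", "rx_drop", "tx_errs", "tx_drop")


def parse_proc_net_dev(text, iface):
    """Return the counters for one interface as a dict of ints."""
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name.strip() == iface:
            return dict(zip(FIELDS, (int(v) for v in rest.split())))
    raise ValueError(f"interface {iface!r} not found")


def counter_jumps(prev, curr, watched=WATCHED):
    """Return only the watched counters that increased, with their deltas."""
    return {k: curr[k] - prev[k] for k in watched if curr[k] > prev[k]}


def watch(iface, interval=10):
    """Poll forever, printing a timestamped line whenever a counter jumps."""
    with open("/proc/net/dev") as f:
        prev = parse_proc_net_dev(f.read(), iface)
    while True:
        time.sleep(interval)
        with open("/proc/net/dev") as f:
            curr = parse_proc_net_dev(f.read(), iface)
        jumps = counter_jumps(prev, curr)
        if jumps:
            print(time.strftime("%Y-%m-%d %H:%M:%S"), iface, jumps)
        prev = curr
```

Calling watch("eth0") (the interface name is a placeholder) during a stress run prints a line only when an error or drop counter moves, which can then be lined up against the DOWN/UP timestamps in the Cassandra log.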





*Daemeon C.M. Reiydelle, USA (+1) 415.501.0198, London (+44) (0) 20 8144 9872*

On Tue, Feb 23, 2016 at 8:43 AM, Jack Krupansky <jack.krupan...@gmail.com>
wrote:

> The reality of modern distributed systems is that connectivity between
> nodes is never guaranteed and distributed software must be able to cope
> with occasional absence of connectivity. GC and network connectivity are
> the two issues that a lot of us are most familiar with. There may be others
> - but most technical problems on a node would be clearly logged on that
> node. If you see a lapse of connectivity no more than once or twice a day,
> consider yourselves lucky.
>
> Is it only one node at a time that goes down, and at widely dispersed
> times?
>
> How many nodes?
>
> -- Jack Krupansky
>
> On Tue, Feb 23, 2016 at 11:01 AM, Joel Samuelsson <
> samuelsson.j...@gmail.com> wrote:
>
>> Hi,
>>
>> Version is 2.0.17.
>> Yes, these are VMs in the cloud though I'm fairly certain they are on a
>> LAN rather than WAN. They are both in the same data centre physically. The
>> phi_convict_threshold is set to default. I'd rather find the root cause of
>> the problem than just hiding it by not convicting a node if it isn't
>> responding though. If pings are <2 ms without a single ping missed in
>> several days, I highly doubt that network is the reason for the downtime.
>>
>> Best regards,
>> Joel
>>
>> 2016-02-23 16:39 GMT+01:00 <sean_r_dur...@homedepot.com>:
>>
>>> You didn’t mention version, but I saw this kind of thing very often in
>>> the 1.1 line. Often this is connected to network flakiness. Are these VMs?
>>> In the cloud? Connected over a WAN? You mention that ping seems fine. Take
>>> a look at the phi_convict_threshold in cassandra.yaml. You may need to
>>> increase it to reduce the UP/DOWN flapping behavior.
>>>
>>>
>>>
>>>
>>>
>>> Sean Durity
>>>
>>>
>>>
>>> *From:* Joel Samuelsson [mailto:samuelsson.j...@gmail.com]
>>> *Sent:* Tuesday, February 23, 2016 9:41 AM
>>> *To:* user@cassandra.apache.org
>>> *Subject:* Re: Nodes go down periodically
>>>
>>>
>>>
>>> Hi,
>>>
>>>
>>>
>>> Thanks for your reply.
>>>
>>>
>>>
>>> I have debug logging on and see no GC pauses that are that long. GC
>>> pauses are all well below 1s and 99 times out of 100 below 100ms.
>>>
>>> Do I need to enable GC log options to see the pauses?
>>>
>>> I see plenty of these lines:
>>> DEBUG [ScheduledTasks:1] 2016-02-22 10:43:02,891 GCInspector.java (line
>>> 118) GC for ParNew: 24 ms for 1 collections
>>>
>>> as well as a few CMS GC log lines.
>>>
>>>
>>>
>>> Best regards,
>>>
>>> Joel
>>>
>>>
>>>
>>> 2016-02-23 15:14 GMT+01:00 Hannu Kröger <hkro...@gmail.com>:
>>>
>>> Hi,
>>>
>>>
>>>
>>> Those are probably GC pauses. Memory tuning is probably needed. Check
>>> whether the parameters that you have already customised make sense.
>>>
>>>
>>>
>>> http://blog.mikiobraun.de/2010/08/cassandra-gc-tuning.html
>>>
>>>
>>>
>>> Hannu
>>>
>>>
>>>
>>>
>>>
>>> On 23 Feb 2016, at 16:08, Joel Samuelsson <samuelsson.j...@gmail.com>
>>> wrote:
>>>
>>>
>>>
>>> Our nodes go down periodically, around 1-2 times each day. Downtime is
>>> from <1 second to 30 or so seconds.
>>>
>>>
>>>
>>> INFO [GossipTasks:1] 2016-02-22 10:05:14,896 Gossiper.java (line 992)
>>> InetAddress /109.74.13.67 is now DOWN
>>>
>>>  INFO [RequestResponseStage:8844] 2016-02-22 10:05:38,331 Gossiper.java
>>> (line 978) InetAddress /109.74.13.67 is now UP
>>>
>>>
>>>
>>> I find nothing odd in the logs around the same time. I logged a ping
>>> with timestamp and checked during the same time and saw nothing weird (ping
>>> is less than 2ms at all times).
>>>
>>>
>>>
>>> Does anyone have any suggestions as to why this might happen?
>>>
>>>
>>>
>>> Best regards,
>>> Joel
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>
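
To see how long each outage in the quoted logs actually lasted, the Gossiper DOWN/UP lines can be paired up per address. The following is a rough sketch only: it assumes the log format matches the 2.0.17 lines quoted in the thread, that each log entry sits on a single line in the real file (the wrapping above comes from the email), and the name downtimes is invented for this example.

```python
import re
from datetime import datetime

# Matches lines such as:
# INFO [GossipTasks:1] 2016-02-22 10:05:14,896 Gossiper.java (line 992) InetAddress /109.74.13.67 is now DOWN
LINE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) Gossiper\.java "
    r"\(line \d+\) InetAddress (\S+) is now (UP|DOWN)"
)


def downtimes(log_lines):
    """Return (address, down_timestamp, seconds_down) for each DOWN/UP pair."""
    down_since = {}
    result = []
    for line in log_lines:
        m = LINE.search(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
        addr, state = m.group(2), m.group(3)
        if state == "DOWN":
            # Keep the first DOWN if the same node is reported down twice.
            down_since.setdefault(addr, ts)
        elif addr in down_since:
            start = down_since.pop(addr)
            result.append((addr, start, (ts - start).total_seconds()))
    return result
```

Feeding it the lines of system.log, for example downtimes(open("system.log")) with the path adjusted to your install, gives a list of outages whose start times can be compared against GC log entries and network counters on the affected node.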
