Hi,

We have a 12-node 2.2.6 ring running in AWS, single DC with RF=3, that is
sitting at <25% CPU, doing mostly writes, and not showing any particular
long GC times/pauses. By all observed metrics the ring is healthy and
performing well.

However, we are seeing a fairly consistent stream of connection timeouts
from the messaging service between various pairs of nodes in the ring. The
"Connection.TotalTimeouts" meter shows hundreds of thousands of timeouts
per minute, usually between two pairs of nodes. This typically lasts for
several hours at a time, then either stops or moves to other pairs of
nodes in the ring. The metric "Connection.SmallMessageDroppedTasks.<ip>"
also grows for one of the node pairs that show up in the TotalTimeouts
metric.
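
For anyone wanting to compare numbers, here is a rough sketch of how
per-peer rates can be derived from two snapshots of cumulative counters.
It assumes the snapshots have already been collected into plain dicts
keyed by peer IP; the function name and the sample values are
hypothetical and not part of Cassandra.

```python
def per_minute_rates(prev, curr, elapsed_seconds):
    """Turn two snapshots of cumulative per-peer counters into rates.

    prev and curr map a peer IP to a cumulative count (assumed format).
    Returns (peer, events_per_minute) pairs, highest rate first.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    rates = {}
    for peer, count in curr.items():
        delta = count - prev.get(peer, 0)
        if delta < 0:
            # Counter went backwards, e.g. after a node restart:
            # treat the current value as the whole delta.
            delta = count
        rates[peer] = delta * 60.0 / elapsed_seconds
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical snapshots taken 120 seconds apart.
before = {"172.26.33.177": 100}
after = {"172.26.33.177": 700, "10.0.0.2": 30}
ranked = per_minute_rates(before, after, 120)
```

Sorting by rate makes it easy to see which peers dominate at any given
time and whether the affected pairs change between samples.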

The debug log on one of the nodes typically shows a large number of
messages like the following:

StorageProxy.java:1033 - Skipped writing hint for /172.26.33.177 (ttl 0)
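
A minimal sketch of how these messages can be tallied per target node,
assuming the message text matches the line above. The function name and
the second sample address are hypothetical, and any log prefix
(timestamp, level, thread) is simply ignored.

```python
import re
from collections import Counter

# Matches only the message text shown above; whatever precedes it on the
# log line is not inspected.
HINT_RE = re.compile(
    r"Skipped writing hint for /(\d{1,3}(?:\.\d{1,3}){3}) \(ttl (\d+)\)"
)


def count_skipped_hints(lines):
    """Count 'Skipped writing hint' messages per target IP address."""
    counts = Counter()
    for line in lines:
        match = HINT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample input: the line from the log plus hypothetical variations.
sample = [
    "StorageProxy.java:1033 - Skipped writing hint for /172.26.33.177 (ttl 0)",
    "StorageProxy.java:1033 - Skipped writing hint for /172.26.33.177 (ttl 0)",
    "StorageProxy.java:1033 - Skipped writing hint for /10.0.0.2 (ttl 0)",
    "some unrelated debug line",
]
hint_counts = count_skipped_hints(sample)
```

On a real node the same function can be fed an open log file object
instead of the sample list, since it only iterates over lines.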

We have cross-node timeouts enabled, but NTP is running on all nodes and no
node appears to have time drift.

The network appears to be fine between nodes, with iperf tests showing that
we have a lot of headroom.

Any thoughts on what to look for? Can we increase thread count/pool sizes
for the messaging service?

Thanks,

Mike

-- 

  Mike Heffner <m...@librato.com>
  Librato, Inc.
