Glad you tracked that down!
On Wed, Jun 23, 2010 at 6:14 PM, AJ Slater a...@zuno.com wrote:
This issue is caused by my network.
Cassandra maintains multiple gossip connections per node pair. One of
these connections is used for heartbeat and load broadcasting traffic.
It's quite talky.
I shall do just that. I did a bunch of tests this morning and the
situation appears to be this:
I have three nodes A, B and C, with RF=2. I understand now why this
issue wasn't apparent with RF=3.
If there are regular intranode column requests going on (e.g. I set up
a pinger to get remote
TRACE 14:42:06,248 unable to connect to /10.33.3.20
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
So that's interesting since it is a clear failure that comes from the
operating system and indicates something which can be observed
The only indication I have that Cassandra realized something was wrong
during this period was this INFO message:
10.33.2.70:/var/log/cassandra/output.log
DEBUG 20:00:35,841 get_slice
DEBUG 20:00:35,841 weakreadremote reading SliceFromReadCommand(table='jolitics.com',
To summarize:
If a request for a column comes in *after a period of several hours
with no requests*, then the node servicing the request hangs while
looking for its peer rather than servicing the request like it should.
It then throws either a TimedOutException or a (wrong)
NotFoundException.
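One way to check the "idle period, then slow read" pattern summarized above is a small pinger that times each read and records the slow or failed ones. The following is only a sketch: `fetch` stands in for whatever client call reads the column (it is a hypothetical placeholder, not a Cassandra API), and the 1-second threshold is an arbitrary assumption.

```python
import time


def timed_call(fetch, clock=time.monotonic):
    """Run fetch() once and return (elapsed_seconds, result, error)."""
    start = clock()
    try:
        result, error = fetch(), None
    except Exception as exc:  # keep going so one failed read does not stop the pinger
        result, error = None, exc
    return clock() - start, result, error


def ping_loop(fetch, interval_s, rounds, slow_threshold_s=1.0, sleep=time.sleep):
    """Call fetch() `rounds` times, pausing `interval_s` seconds between calls.

    Returns a list of (round_index, elapsed_seconds, error) for every call
    that raised or took at least `slow_threshold_s` seconds.
    """
    problems = []
    for i in range(rounds):
        elapsed, _, error = timed_call(fetch)
        if error is not None or elapsed >= slow_threshold_s:
            problems.append((i, elapsed, error))
        if i + 1 < rounds:
            sleep(interval_s)
    return problems
```

Running it with a long `interval_s` (hours) and again with a short one (seconds) would show whether the failures only follow idle periods.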
And
I'm seeing 10s timeouts on reads a few times a day. It's hard to reproduce
consistently but seems to happen most often after it's been a long time
between reads. After presenting itself for a couple minutes the
problem then goes away.
I've got a three node cluster with replication factor 2, reading at
Cassandra 0.6.2 from the Apache Debian source.
Ubuntu Jaunty. Sun Java 6 JVM.
All nodes in separate racks at 365 Main.
On Thu, Jun 17, 2010 at 10:12 AM, AJ Slater a...@zuno.com wrote:
I'm seeing 10s timeouts on reads a few times a day. It's hard to reproduce
consistently but seems to happen most
Total data size in the entire cluster is about twenty 12k images. With
no other load on the system. I just ask for one column and I get these
timeouts. Performing multiple gets on the columns leads to multiple
timeouts for a period of a few seconds or minutes and then the
situation magically
Do you have row caching enabled? You can check in the JMX console to see if
you're hitting the cache.
Try turning on DEBUG level logging and look at the log on the machine you
connect to for the read.
Aaron
On 18 Jun 2010, at 05:31, AJ Slater wrote:
Total data size in the entire cluster
The explanation that best fits the symptoms you describe is that you
are swapping.
On Thu, Jun 17, 2010 at 10:12 AM, AJ Slater a...@zuno.com wrote:
I'm seeing 10s timeouts on reads a few times a day. It's hard to reproduce
consistently but seems to happen most often after it's been a long time
Are these physical machines or virtuals? Did you post your
cassandra.in.sh and storage-conf.xml someplace?
On Thu, Jun 17, 2010 at 10:31 AM, AJ Slater a...@zuno.com wrote:
Total data size in the entire cluster is about twenty 12k images. With
no other load on the system. I just ask for one
The behavior was seen with row caching off.
I now have row caching on.
key cache hit rate is 0.75-0.8
row cache hit rate is 0 (row cache capacity=1, RowsCached=100%)
Looks like I should try another format for RowsCached, like 0.8 or
90% or something.
On Thu, Jun 17, 2010 at 1:47 PM, aaron
The machines in question have 8GB of RAM each and generally never touch swap.
I shall try to monitor memory/swap overnight and see if something
strange happens.
Would swapping really take 10s?
AJ
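For the overnight memory/swap monitoring mentioned above, a minimal sketch on Linux is to sample /proc/meminfo periodically and record how much swap is in use. The function names and the sampling interval here are illustrative assumptions; only the `SwapTotal` and `SwapFree` fields of /proc/meminfo are relied on.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field_name: integer_value} dict."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "name: value" shape
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])  # most fields are reported in kB
    return values


def swap_used_kb(meminfo):
    """Swap in use, in kB, computed as SwapTotal - SwapFree."""
    return meminfo.get("SwapTotal", 0) - meminfo.get("SwapFree", 0)


def read_swap_used_kb(path="/proc/meminfo"):
    """Read the current swap usage from the given meminfo file."""
    with open(path) as f:
        return swap_used_kb(parse_meminfo(f.read()))


def monitor(samples, interval_s, read=read_swap_used_kb, sleep=time.sleep):
    """Take `samples` readings `interval_s` seconds apart.

    Returns a list of (unix_timestamp, swap_used_kb) tuples.
    """
    readings = []
    for i in range(samples):
        readings.append((time.time(), read()))
        if i + 1 < samples:
            sleep(interval_s)
    return readings
```

Lining up the recorded timestamps with the times of the 10s read timeouts would show whether swap use rises around them.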
On Thu, Jun 17, 2010 at 1:54 PM, Jonathan Ellis jbel...@gmail.com wrote:
The explanation that best
These are physical machines.
storage-conf.xml.fs03 is here:
http://pastebin.com/weL41NB1
Diffs from that for the other two storage-confs are inline here:
a...@worm:../Z3/cassandra/conf/dev$ diff storage-conf.xml.lpc03
storage-conf.xml.fs01
185c185