I'm seeing 10-second timeouts on reads a few times a day. It's hard to
reproduce consistently, but it seems to happen most often after a long gap
between reads. Once it shows up, the problem goes away after a couple of
minutes.

I've got a three-node cluster with replication factor 2, reading at
consistency level ONE. The columns being read are around 12 KB each. The
nodes are 8 GB multicore boxes with JVM heap limits between 4 GB and 6 GB.
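
For reference, the reads are plain get_slice calls at ConsistencyLevel.ONE.
Here's roughly what the client side does (a minimal sketch assuming the
0.6-era Thrift-generated Python bindings; the helper name and connection
details are illustrative, not our exact code):

    # Sketch of the read path, assuming the Thrift-generated Cassandra 0.6
    # Python bindings (module paths and the helper name are illustrative).
    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from cassandra import Cassandra
    from cassandra.ttypes import (ColumnParent, ConsistencyLevel,
                                  SlicePredicate, SliceRange)

    def get_image_columns(image_id, host='10.33.2.70', port=9160):
        socket = TSocket.TSocket(host, port)
        transport = TTransport.TBufferedTransport(socket)
        client = Cassandra.Client(TBinaryProtocol.TBinaryProtocol(transport))
        transport.open()
        try:
            parent = ColumnParent(column_family='Images')
            # Empty start/finish and count=100 match the SliceFromReadCommand
            # entries in the server logs below.
            predicate = SlicePredicate(slice_range=SliceRange(
                start='', finish='', reversed=False, count=100))
            return client.get_slice('jolitics.com', image_id, parent,
                                    predicate, ConsistencyLevel.ONE)
        finally:
            transport.close()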

Here's an application log from early this morning when a developer in
Belgrade accessed the system:

Jun 17 03:54:17 lpc03 pinhole[5736]: MainThread:pinhole.py:61 | Requested image_id: 5827067133c3d670071c17d9144f0b49
Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:pinhole.py:76 | TimedOutException for Image 5827067133c3d670071c17d9144f0b49
Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:zlog.py:105 | Image Get took 10005.388975 ms
Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:pinhole.py:61 | Requested image_id: af8caf3b76ce97d13812ddf795104a5c
Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:zlog.py:105 | Image Get took 3.658056 ms
Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:zlog.py:105 | Image Transform took 0.978947 ms

That's a timeout, followed immediately by a successful get of another column.
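
For context on the zlog timings, the application wraps the get roughly like
this (an illustrative sketch, not the actual pinhole.py code;
get_image_columns is the hypothetical helper sketched above):

    import logging
    import time

    from cassandra.ttypes import TimedOutException  # raised when the node gives up on the read

    log = logging.getLogger('pinhole')

    def fetch_image(image_id):
        # Illustrative wrapper: log the request, time the get, and record the
        # TimedOutException, mirroring the application log lines above.
        log.debug('Requested image_id: %s', image_id)
        start = time.time()
        try:
            return get_image_columns(image_id)  # hypothetical helper from the sketch above
        except TimedOutException:
            log.debug('TimedOutException for Image %s', image_id)
            raise
        finally:
            log.debug('Image Get took %f ms', (time.time() - start) * 1000.0)

So the 10005 ms is just the elapsed wall-clock time around the get; the
TimedOutException itself comes back from the Cassandra node.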

Here's the Cassandra log for 10.33.2.70:

DEBUG 03:54:17,070 get_slice
DEBUG 03:54:17,071 weakreadremote reading SliceFromReadCommand(table='jolitics.com', key='5827067133c3d670071c17d9144f0b49', column_parent='QueryPath(columnFamilyName='Images', superColumnName='null', columnName='null')', start='', finish='', reversed=false, count=100)
DEBUG 03:54:17,071 weakreadremote reading SliceFromReadCommand(table='jolitics.com', key='5827067133c3d670071c17d9144f0b49', column_parent='QueryPath(columnFamilyName='Images', superColumnName='null', columnName='null')', start='', finish='', reversed=false, count=100) from 45138@/10.33.3.10
DEBUG 03:54:27,077 get_slice
DEBUG 03:54:27,078 weakreadlocal reading SliceFromReadCommand(table='jolitics.com', key='af8caf3b76ce97d13812ddf795104a5c', column_parent='QueryPath(columnFamilyName='Images', superColumnName='null', columnName='null')', start='', finish='', reversed=false, count=100)
DEBUG 03:54:27,079 collecting body:false:1...@1275951327610885
DEBUG 03:54:27,080 collecting body:false:1...@1275951327610885
DEBUG 03:54:27,080 Reading consistency digest for af8caf3b76ce97d13812ddf795104a5c from 45...@[/10.33.2.70, /10.33.3.10]
DEBUG 03:54:50,779 Disseminating load info ...

It looks like, for the first key ('5827067133c3d670071c17d9144f0b49'), it
asks the local host for the data and also queries 10.33.3.10, whereas for
the second key ('af8caf3b76ce97d13812ddf795104a5c') it only queries the
local host and returns as expected.

Here's the log for 10.33.3.10 around that time:

DEBUG 03:54:19,645 Disseminating load info ...
DEBUG 03:55:19,645 Disseminating load info ...
DEBUG 03:56:19,646 Disseminating load info ...
DEBUG 03:57:19,645 Disseminating load info ...
DEBUG 03:58:19,645 Disseminating load info ...
DEBUG 03:59:19,646 Disseminating load info ...
DEBUG 04:00:18,635 GC for ParNew: 4 ms, 21443128 reclaimed leaving 55875144 used; max is 6580535296

No record of communication from 10.33.2.70.

Does this ring any bells for anyone? I can of course attach the
storage-conf files for all nodes if that would be useful, and I'll be on
#cassandra as ajslater.

Many thanks for taking a look, and for any suggestions. We fear we'll have
to abandon Cassandra if this bug can't be resolved.

AJ
