Do you have row caching enabled? You can check in the JMX console to see 
whether you're hitting the cache. 
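
For example, with nodetool pointed at the JMX port (8080 is the 0.6 
default; the host and port here are just your node's values):

    nodetool -host 10.33.2.70 -port 8080 cfstats

When the row cache is enabled, the per-column-family output should show 
its size and hit rate; a hit rate stuck at zero means the reads aren't 
being served from cache.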

Try turning on DEBUG-level logging and look at the log on the machine you 
connect to for the read. 
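
In 0.6 that's a one-line edit to log4j.properties (the Debian package 
keeps it under /etc/cassandra/), switching the root logger from INFO to 
DEBUG and restarting the node; the appender names below are the stock 
ones, so keep whatever yours has:

    log4j.rootLogger=DEBUG,stdout,R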


Aaron

On 18 Jun 2010, at 05:31, AJ Slater wrote:

> Total data size in the entire cluster is about twenty 12 KB images,
> with no other load on the system. I ask for a single column and get
> these timeouts. Performing multiple gets on the columns leads to
> multiple timeouts for a period of a few seconds or minutes, and then
> the situation magically resolves itself: response times drop back to
> single-digit milliseconds per column get.
> 
> On Thu, Jun 17, 2010 at 10:24 AM, AJ Slater <a...@zuno.com> wrote:
>> Cassandra 0.6.2 from the Apache Debian source.
>> Ubuntu Jaunty. Sun Java 6 JVM.
>> 
>> All nodes are in separate racks at 365 Main.
>> 
>> On Thu, Jun 17, 2010 at 10:12 AM, AJ Slater <a...@zuno.com> wrote:
>>> I'm seeing 10-second timeouts on reads a few times a day. It's hard
>>> to reproduce consistently, but it seems to happen most often after a
>>> long gap between reads. After presenting itself for a couple of
>>> minutes, the problem goes away.
>>> 
>>> I've got a three-node cluster with replication factor 2, reading at
>>> consistency level ONE. The columns being read are around 12 KB each.
>>> The nodes are 8 GB multicore boxes with JVM heap limits between 4 GB
>>> and 6 GB.
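>>> 
>>> For reference, a read is just a timed get_slice at consistency level
>>> ONE against whichever node we connect to. Roughly this, sketched
>>> against the 0.6 Thrift API (the host and key are illustrative, and
>>> this is simplified from what pinhole actually does):
>>> 
>>> import time
>>> from thrift.transport import TSocket, TTransport
>>> from thrift.protocol import TBinaryProtocol
>>> from cassandra import Cassandra
>>> from cassandra.ttypes import (ColumnParent, ConsistencyLevel,
>>>                               SlicePredicate, SliceRange)
>>> 
>>> # Connect to one node's Thrift port (9160 by default).
>>> transport = TTransport.TBufferedTransport(
>>>     TSocket.TSocket('10.33.2.70', 9160))
>>> client = Cassandra.Client(TBinaryProtocol.TBinaryProtocol(transport))
>>> transport.open()
>>> 
>>> # Slice the whole row (start='', finish='', count=100) -- the same
>>> # shape that shows up in the server's DEBUG log below.
>>> predicate = SlicePredicate(slice_range=SliceRange(
>>>     start='', finish='', reversed=False, count=100))
>>> start = time.time()
>>> client.get_slice('jolitics.com', '5827067133c3d670071c17d9144f0b49',
>>>                  ColumnParent(column_family='Images'), predicate,
>>>                  ConsistencyLevel.ONE)
>>> print 'Image Get took %f ms' % ((time.time() - start) * 1000)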
>>> 
>>> Here's an application log from early this morning when a developer in
>>> Belgrade accessed the system:
>>> 
>>> Jun 17 03:54:17 lpc03 pinhole[5736]: MainThread:pinhole.py:61 |
>>> Requested image_id: 5827067133c3d670071c17d9144f0b49
>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:pinhole.py:76 |
>>> TimedOutException for Image 5827067133c3d670071c17d9144f0b49
>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:zlog.py:105 | Image
>>> Get took 10005.388975 ms
>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:pinhole.py:61 |
>>> Requested image_id: af8caf3b76ce97d13812ddf795104a5c
>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:zlog.py:105 | Image
>>> Get took 3.658056 ms
>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:zlog.py:105 | Image
>>> Transform took 0.978947 ms
>>> 
>>> That's a timeout, then a successful get of another column.
>>> 
>>> Here's the cassandra log for 10.33.2.70:
>>> 
>>> DEBUG 03:54:17,070 get_slice
>>> DEBUG 03:54:17,071 weakreadremote reading
>>> SliceFromReadCommand(table='jolitics.com',
>>> key='5827067133c3d670071c17d9144f0b49',
>>> column_parent='QueryPath(columnFamilyName='Images',
>>> superColumnName='null', columnName='null')',
>>> start='', finish='', reversed=false, count=100)
>>> DEBUG 03:54:17,071 weakreadremote reading
>>> SliceFromReadCommand(table='jolitics.com',
>>> key='5827067133c3d670071c17d9144f0b49',
>>> column_parent='QueryPath(columnFamilyName='Images',
>>> superColumnName='null', columnName='null')',
>>> start='', finish='', reversed=false, count=100) from 45138@/10.33.3.10
>>> DEBUG 03:54:27,077 get_slice
>>> DEBUG 03:54:27,078 weakreadlocal reading
>>> SliceFromReadCommand(table='jolitics.com',
>>> key='af8caf3b76ce97d13812ddf795104a5c',
>>> column_parent='QueryPath(columnFamilyName='Images',
>>> superColumnName='null', columnName='null')',
>>> start='', finish='', reversed=false, count=100)
>>> DEBUG 03:54:27,079 collecting body:false:1...@1275951327610885
>>> DEBUG 03:54:27,080 collecting body:false:1...@1275951327610885
>>> DEBUG 03:54:27,080 Reading consistency digest for
>>> af8caf3b76ce97d13812ddf795104a5c from 45...@[/10.33.2.70, /10.33.3.10]
>>> DEBUG 03:54:50,779 Disseminating load info ...
>>> 
>>> It looks like the first read, for key
>>> '5827067133c3d670071c17d9144f0b49', goes to 10.33.3.10 as well as the
>>> local host, while the second, for 'af8caf3b76ce97d13812ddf795104a5c',
>>> only queries the local host and returns appropriately.
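>>> 
>>> To double-check which nodes are supposed to hold replicas of those
>>> keys, I can dump the token ranges over the same Thrift connection as
>>> the sketch above (assuming I'm reading the 0.6 describe_ring API
>>> correctly; with RandomPartitioner a key's token is its MD5):
>>> 
>>> # Each TokenRange lists the endpoints replicating that span of the
>>> # ring; a key's token falls into exactly one range.
>>> for r in client.describe_ring('jolitics.com'):
>>>     print r.start_token, r.end_token, r.endpoints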
>>> 
>>> Here's the log for 10.33.3.10 around that time:
>>> 
>>> DEBUG 03:54:19,645 Disseminating load info ...
>>> DEBUG 03:55:19,645 Disseminating load info ...
>>> DEBUG 03:56:19,646 Disseminating load info ...
>>> DEBUG 03:57:19,645 Disseminating load info ...
>>> DEBUG 03:58:19,645 Disseminating load info ...
>>> DEBUG 03:59:19,646 Disseminating load info ...
>>> DEBUG 04:00:18,635 GC for ParNew: 4 ms, 21443128 reclaimed leaving
>>> 55875144 used; max is 6580535296
>>> 
>>> No record of communication from 10.33.2.70.
>>> 
>>> Does this ring any bells for anyone? I can of course attach
>>> storage-conf.xml files for all the nodes if that sounds useful, and
>>> I'll be on #cassandra as ajslater.
>>> 
>>> Many thanks for taking a look, and for any suggestions. We fear we'll
>>> have to abandon Cassandra if this bug cannot be resolved.
>>> 
>>> AJ
>>> 
>> 
