Jonathan:

I tried out the patch you attached to CASSANDRA-440, applied it to 0.4,
and it works for me.  Now, when I take the node down, there may be one
or two seconds of the Thrift-internal error (timeout), but as soon as
the host doing the querying can see that the node is down, the errors
stop and the get_key_range query returns valid output again.  And there
isn't any disruption when the node comes back up.
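
For reference, my test loop is roughly like the sketch below.  This is
illustrative code, not my exact client; it assumes the 0.4 Thrift
signature get_key_range(table, columnFamily, startWith, stopAt,
maxResults), the default Thrift port 9160, and the 0.4 package name for
the generated classes:

    import java.util.List;
    import org.apache.cassandra.service.Cassandra;   // 0.4 generated Thrift classes (assumed)
    import org.apache.thrift.TException;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TSocket;

    public class RangeQueryRetry
    {
        public static void main(String[] args) throws Exception
        {
            // Connect to one node over Thrift.
            TSocket socket = new TSocket("174.143.182.178", 9160);
            socket.open();
            Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(socket));

            // Retry through the one-or-two-second window where the
            // coordinator still thinks the downed node is alive and the
            // call times out.
            List<String> keys = null;
            for (int attempt = 0; attempt < 5 && keys == null; attempt++)
            {
                try
                {
                    keys = client.get_key_range("users", "pwhash", "", "", 100);
                }
                catch (TException timedOut)
                {
                    Thread.sleep(1000); // give the failure detector time to mark the node down
                }
            }
            System.out.println(keys);
            socket.close();
        }
    }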

Thanks!  (I put this same note in the bug report).

Simon Smith




On Fri, Sep 11, 2009 at 9:38 AM, Simon Smith <simongsm...@gmail.com> wrote:
> https://issues.apache.org/jira/browse/CASSANDRA-440
>
> Thanks again.  Of course I'm happy to provide any additional information
> and will gladly test the fix.
>
> Simon
>
>
> On Thu, Sep 10, 2009 at 7:32 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>> That confirms what I suspected, thanks.
>>
>> Can you file a ticket on Jira and I'll work on a fix for you to test?
>>
>> thanks,
>>
>> -Jonathan
>>
>> On Thu, Sep 10, 2009 at 4:42 PM, Simon Smith<simongsm...@gmail.com> wrote:
>>> I sent get_key_range to node #1 (174.143.182.178), and here are the
>>> resulting log lines from 174.143.182.178's log (Do you want the other
>>> nodes' log lines? Let me know if so.)
>>>
>>> DEBUG - get_key_range
>>> DEBUG - reading RangeCommand(table='users', columnFamily=pwhash,
>>> startWith='', stopAt='', maxResults=100) from 6...@174.143.182.178:7000
>>> DEBUG - collecting :false:3...@1252535119
>>>  [ ... chop the repeated & identical collecting messages ... ]
>>> DEBUG - collecting :false:3...@1252535119
>>> DEBUG - Sending RangeReply(keys=[java, java1, java2, java3, java4,
>>> java5, match, match1, match2, match3, match4, match5, newegg, newegg1,
>>> newegg2, newegg3, newegg4, newegg5, now, now1, now2, now3, now4, now5,
>>> sgs, sgs1, sgs2, sgs3, sgs4, sgs5, test, test1, test2, test3, test4,
>>> test5, xmind, xmind1, xmind2, xmind3, xmind4, xmind5],
>>> completed=false) to 6...@174.143.182.178:7000
>>> DEBUG - Processing response on an async result from 
>>> 6...@174.143.182.178:7000
>>> DEBUG - reading RangeCommand(table='users', columnFamily=pwhash,
>>> startWith='', stopAt='', maxResults=58) from 6...@174.143.182.182:7000
>>> DEBUG - Processing response on an async result from 
>>> 6...@174.143.182.182:7000
>>> DEBUG - reading RangeCommand(table='users', columnFamily=pwhash,
>>> startWith='', stopAt='', maxResults=58) from 6...@174.143.182.179:7000
>>> DEBUG - Processing response on an async result from 
>>> 6...@174.143.182.179:7000
>>> DEBUG - reading RangeCommand(table='users', columnFamily=pwhash,
>>> startWith='', stopAt='', maxResults=22) from 6...@174.143.182.185:7000
>>> DEBUG - Processing response on an async result from 
>>> 6...@174.143.182.185:7000
>>> DEBUG - Disseminating load info ...
>>>
>>>
>>> Thanks,
>>>
>>> Simon
>>>
>>> On Thu, Sep 10, 2009 at 5:25 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>>> I think I see the problem.
>>>>
>>>> Can you check whether your range query is spanning multiple nodes in
>>>> the cluster?  You can tell by setting the log level to DEBUG and
>>>> checking whether, after it logs get_key_range, it logs "reading
>>>> RangeCommand(...) from ...@machine" more than once.
>>>>
>>>> The bug is that when picking the node to start the range query on, it
>>>> consults the failure detector to avoid dead nodes, but when the query
>>>> spans multiple nodes it does not consult it again for the subsequent
>>>> nodes.
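>>>>
>>>> Roughly, the change I have in mind is to run every endpoint in the
>>>> scan through the failure detector, e.g. (a sketch with made-up names,
>>>> not the actual StorageProxy code):
>>>>
>>>>     import java.util.ArrayList;
>>>>     import java.util.List;
>>>>
>>>>     public class RangeEndpointSketch
>>>>     {
>>>>         // Stand-in for the real failure detector interface.
>>>>         interface FailureDetector { boolean isAlive(String endpoint); }
>>>>
>>>>         // Keep only live endpoints for *every* hop of the range scan,
>>>>         // not just the node the query starts on.
>>>>         static List<String> liveEndpoints(List<String> scanOrder, FailureDetector fd)
>>>>         {
>>>>             List<String> live = new ArrayList<String>();
>>>>             for (String endpoint : scanOrder)
>>>>             {
>>>>                 if (fd.isAlive(endpoint))
>>>>                     live.add(endpoint);
>>>>             }
>>>>             return live;
>>>>         }
>>>>     }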
>>>>
>>>> But if you are only generating one RangeCommand per get_key_range then
>>>> we have two bugs. :)
>>>>
>>>> -Jonathan
>>>>
>>>
>>
>
