If you are getting a timeout on one table, then a mismatch of RF and node
count doesn't seem as likely.
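
For reference, here is the arithmetic behind the "2 responses were required" part of the error, as a minimal sketch. It assumes the usual quorum rule of floor(RF / 2) + 1 applied to the local DC's replication factor, and uses the RF of 3 reported earlier in the thread. The function name is only for illustration.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for a quorum: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


# RF of the local DC, as reported in the thread.
rf_local = 3
required = quorum(rf_local)
print(f"LOCAL_QUORUM with RF={rf_local} needs {required} replica responses")
```

With 9 nodes per DC and RF 3, every partition still has 3 local replicas, so a timeout means one of the two required replicas was too slow to answer, not that the quorum was unreachable by design.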

Time to look at your query. You said it was a 'select * from table where
key=?' type query. I would next use the trace facility in cqlsh to
investigate further. That's a good way to track down hard-to-find issues.
You should be looking for a clear ledge where you go from single-digit ms
to 4- or 5-digit ms times.
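
In cqlsh, running TRACING ON before the query prints a trace with a source_elapsed column in microseconds. As a rough sketch of what "finding the ledge" means, the snippet below scans a list of (activity, elapsed) pairs for the largest jump between consecutive steps. It assumes all events come from a single node and are already in order. The helper name, the 1-second threshold, and the sample values are made up for illustration and are not real trace output.

```python
from typing import List, Optional, Tuple

# (activity, elapsed microseconds on one node); hypothetical shape for this sketch
TraceEvent = Tuple[str, int]


def find_ledge(
    events: List[TraceEvent], min_gap_us: int = 1_000_000
) -> Optional[Tuple[str, str, int]]:
    """Return (previous activity, activity, gap in us) for the largest gap
    between consecutive events, or None if no gap reaches min_gap_us."""
    best = None
    for (prev_act, prev_us), (act, us) in zip(events, events[1:]):
        gap = us - prev_us
        if gap >= min_gap_us and (best is None or gap > best[2]):
            best = (prev_act, act, gap)
    return best


# Illustrative values only, not taken from a real trace.
sample = [
    ("Parsing select", 120),
    ("Preparing statement", 310),
    ("Executing single-partition query", 950),
    ("Merging data from memtables and sstables", 4_200),
    ("Read live and tombstone cells", 3_804_200),
]

ledge = find_ledge(sample)
if ledge is not None:
    before, after, gap_us = ledge
    print(f"Ledge of {gap_us / 1000:.0f} ms between '{before}' and '{after}'")
```

The step right after the ledge is usually the one worth investigating, since everything before it completed in single-digit milliseconds.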

The other place to look is your data model for that table, if you want to
post the output from a desc table.

Patrick



On Tue, Aug 30, 2016 at 11:07 AM, Joseph Tech <jaalex.t...@gmail.com> wrote:

> On further analysis, this issue happens only on 1 table in the KS which
> has the max reads.
>
> @Atul, I will look at system health, but didn't see anything standing out
> in the GC logs (using JDK 1.8_92 with G1GC).
>
> @Patrick, could you please elaborate on the "mismatch on node count + RF"
> part?
>
> On Tue, Aug 30, 2016 at 5:35 PM, Atul Saroha <atul.sar...@snapdeal.com>
> wrote:
>
>> There could be many reasons for this if it is intermittent: CPU usage and
>> I/O wait, for example. Reads are I/O intensive, so your IOPS capacity must
>> cover the load at that time. It could be a heap issue if the CPU is busy
>> only with GC. Network health could also be the reason. So it is better to
>> look at system health at the time the timeouts occur.
>>
>> ------------------------------------------------------------
>> ---------------------------------------------------------
>> Atul Saroha
>> *Lead Software Engineer*
>> *M*: +91 8447784271 *T*: +91 124-415-6069 *EXT*: 12369
>> Plot # 362, ASF Centre - Tower A, Udyog Vihar,
>>  Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA
>>
>> On Tue, Aug 30, 2016 at 5:10 PM, Joseph Tech <jaalex.t...@gmail.com>
>> wrote:
>>
>>> Hi Patrick,
>>>
>>> The nodetool status shows all nodes up and normal now. In the OpsCenter
>>> "Event Log", some nodes are reported as going down/up during the
>>> timeframe of the timeouts, but these are Search workload nodes in the
>>> remote (non-local) DC. The RF is 3 and there are 9 nodes per DC.
>>>
>>> Thanks,
>>> Joseph
>>>
>>> On Mon, Aug 29, 2016 at 11:07 PM, Patrick McFadin <pmcfa...@gmail.com>
>>> wrote:
>>>
>>>> You aren't achieving quorum on your reads, as the error explains.
>>>> That means you either have some nodes down or your topology is not matching
>>>> up. The fact you are using LOCAL_QUORUM might point to a datacenter
>>>> mismatch on node count + RF.
>>>>
>>>> What does your nodetool status look like?
>>>>
>>>> Patrick
>>>>
>>>> On Mon, Aug 29, 2016 at 10:14 AM, Joseph Tech <jaalex.t...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We recently started getting intermittent timeouts on primary key
>>>>> queries (select * from table where key=<key>)
>>>>>
>>>>> The error is: com.datastax.driver.core.exceptions.ReadTimeoutException:
>>>>> Cassandra timeout during read query at consistency LOCAL_QUORUM (2
>>>>> responses were required but only 1 replica responded)
>>>>>
>>>>> The same query would work fine when tried directly from cqlsh. There
>>>>> are no indications in system.log for the table in question, though there
>>>>> were compactions in progress for tables in another keyspace which is more
>>>>> frequently accessed.
>>>>>
>>>>> My understanding is that the chances of primary key queries timing out
>>>>> are minimal. Please share the possible reasons / ways to debug this
>>>>> issue.
>>>>>
>>>>> We are using Cassandra 2.1 (DSE 4.8.7).
>>>>>
>>>>> Thanks,
>>>>> Joseph
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
