Thanks, Romain, for the detailed explanation. We use log4j 2 and I have added the driver logging for slow/error queries; we'll see if it helps to surface any pattern once it's in Prod.
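For reference, the registration looks roughly like this (a sketch based on the driver 2.1 QueryLogger API; the contact point and threshold are placeholders):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.QueryLogger;

    Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
    // Queries slower than 1s go to com.datastax.driver.core.QueryLogger.SLOW;
    // failed queries go to com.datastax.driver.core.QueryLogger.ERROR.
    QueryLogger queryLogger = QueryLogger.builder(cluster)
            .withConstantThreshold(1000)
            .build();
    cluster.register(queryLogger);

with the two loggers enabled at DEBUG in the <Loggers> section of log4j2.xml:

    <Logger name="com.datastax.driver.core.QueryLogger.SLOW" level="DEBUG"/>
    <Logger name="com.datastax.driver.core.QueryLogger.ERROR" level="DEBUG"/>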
I tried getendpoints and getsstables for some of the timed-out keys and most of them listed only 1 SSTable; there were a few which showed 2 SSTables. There is no specific trend on the keys, it's completely based on the user access, and the same keys return results instantly from cqlsh.

On Tue, Sep 6, 2016 at 1:57 PM, Romain Hardouin <romainh...@yahoo.fr> wrote:

> There is nothing special in the two sstablemetadata outputs, but if the timeouts are due to a network split or an overwhelmed node or something like that, you won't see anything here. That said, if you have the keys which produced the timeouts then, yes, you can look for a regular pattern (i.e. always the same keys?).
>
> You can find the sstables for a given key with nodetool:
>
> nodetool getendpoints <keyspace> <cf> <key>
>
> Then you can run the following command on one/each node of the endpoints:
>
> nodetool getsstables <keyspace> <cf> <key>
>
> If many sstables are shown in the previous command it means that your data is fragmented, but thanks to LCS this number should be low.
>
> I think the most useful actions now would be:
>
> 1) Enable DEBUG for o.a.c.db.ConsistencyLevel. It won't spam your log file; you will see the following when errors occur:
>
> - Local replicas [<endpoint1>, ...] are insufficient to satisfy LOCAL_QUORUM requirement of X live nodes in '<dc>'
>
> You are using C* 2.1 but you can have a look at the C* 2.2 logback.xml: https://github.com/apache/cassandra/blob/cassandra-2.2/conf/logback.xml
> I'm using it in production; it's better because it creates a separate debug.log file with an asynchronous appender.
>
> Watch out when enabling:
>
> <appender-ref ref="ASYNCDEBUGLOG" />
>
> because the default logback configuration sets all of o.a.c to DEBUG:
>
> <logger name="org.apache.cassandra" level="DEBUG"/>
>
> Instead you can set:
>
> <logger name="org.apache.cassandra" level="INFO"/>
> <logger name="org.apache.cassandra.db.ConsistencyLevel" level="DEBUG"/>
>
> Also, if you want to restrict debug.log to the DEBUG level only (instead of DEBUG+INFO+...) you can add a LevelFilter to ASYNCDEBUGLOG in logback.xml:
>
> <filter class="ch.qos.logback.classic.filter.LevelFilter">
>   <level>DEBUG</level>
>   <onMatch>ACCEPT</onMatch>
>   <onMismatch>DENY</onMismatch>
> </filter>
>
> Thus, the debug.log file will be empty unless consistency issues happen.
>
> 2) Enable slow query logging at the driver level with a QueryLogger:
>
> Cluster cluster = ...
> // log queries longer than 1 second, see also withDynamicThreshold
> QueryLogger queryLogger = QueryLogger.builder(cluster)
>         .withConstantThreshold(1000)
>         .build();
> cluster.register(queryLogger);
>
> Then in your driver logback file:
>
> <logger name="com.datastax.driver.core.QueryLogger.SLOW" level="DEBUG" />
>
> 3) And/or: you mentioned that you use DSE, so you can enable slow query logging in dse.yaml (cql_slow_log_options).
>
> Best,
>
> Romain
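On point 3, the relevant dse.yaml section would look roughly like the sketch below. The option names are my reading of the DSE 4.8 docs, so treat them as assumptions and verify against your dse.yaml; if I read correctly, the entries land in the dse_perf.node_slow_log table.

    # dse.yaml (DSE performance service) - illustrative values, verify locally
    cql_slow_log_options:
        enabled: true
        threshold_ms: 2000    # log queries slower than 2 seconds
        ttl_seconds: 259200   # keep log entries for 3 days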
> On Monday, 5 September 2016 at 20:05, Joseph Tech <jaalex.t...@gmail.com> wrote:
>
> Attached are the sstablemeta outputs from 2 SSTables of size 28 MB and 52 MB (out2). The records are inserted with different TTLs based on their nature; test records with 1 day, typeA records with 6 months, typeB records with 1 year, etc. There are also explicit DELETEs from this table, though at a much lower rate than the inserts.
>
> I am not sure how to interpret this output, or if it's the right SSTables that were picked. Please advise. Is there a way to get the sstables corresponding to the keys that timed out, given that they are accessible later?
>
> On Mon, Sep 5, 2016 at 10:58 PM, Anshu Vajpayee <anshu.vajpa...@gmail.com> wrote:
>
> We have seen read timeout issues in Cassandra due to a high droppable tombstone ratio for a repository. Please check for a high droppable tombstone ratio for your repo.
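One way to run that check: point sstablemetadata at the table's data files (for instance the ones returned by nodetool getsstables above) and grep for the droppable-tombstone estimate. A sketch, with illustrative paths and output:

    $ sstablemetadata /var/lib/cassandra/data/db/tbl-*/db-tbl-ka-1234-Data.db | grep -i droppable
    Estimated droppable tombstones: 0.31

As a rough yardstick, compaction's tombstone_threshold defaults to 0.2, so an estimate well above that across many sstables is worth a closer look.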
> On Mon, Sep 5, 2016 at 8:11 PM, Romain Hardouin <romainh...@yahoo.fr> wrote:
>
> Yes, dclocal_read_repair_chance will reduce the cross-DC traffic and latency, so you can swap the values (https://issues.apache.org/jira/browse/CASSANDRA-7320). I guess the sstable_size_in_mb was set to 50 because back in the day (C* 1.0) the default size was way too small: 5 MB. So maybe someone in your company tried "10 * the default", i.e. 50 MB. Now the default is 160 MB. I'm not saying you should change the value, but keep in mind that you're using a small value here; it could help you someday.
>
> Regarding the cells, the histograms show an *estimation* of the min, p50, ..., p99, max of cells based on SSTable metadata. On your screenshot, the max is 4768, so you have a partition key with ~4768 cells. The p99 is 1109, so 99% of your partition keys have less than (or equal to) 1109 cells. You can see these data for a given sstable with the sstablemetadata tool.
>
> Best,
>
> Romain
>
> On Monday, 5 September 2016 at 15:17, Joseph Tech <jaalex.t...@gmail.com> wrote:
>
> Thanks, Romain. We will try to enable the DEBUG logging (assuming it won't clog the logs much). Regarding the table configs, read_repair_chance must be carried over from older versions - mostly defaults. I think sstable_size_in_mb was set to limit the max SSTable size, though I am not sure of the reason for the 50 MB value.
>
> Does setting dclocal_read_repair_chance help in reducing cross-DC traffic? (Haven't looked into this parameter, just going by the name.)
>
> By the cell count definition: is it incremented based on the number of writes for a given name (key?) and value? This table is heavy on reads and writes. If so, shouldn't the value be much higher?
>
> On Mon, Sep 5, 2016 at 7:35 AM, Romain Hardouin <romainh...@yahoo.fr> wrote:
>
> Hi,
>
> Try to put org.apache.cassandra.db.ConsistencyLevel at DEBUG level; it could help to find a regular pattern. By the way, I see that you have set a global read repair chance:
>
> read_repair_chance = 0.1
>
> and not the local read repair:
>
> dclocal_read_repair_chance = 0.0
>
> Is there any reason to do that, or is it just the old (pre 2.0.9) default configuration?
>
> The cell count is the number of triplets: (name, value, timestamp)
>
> Also, I see that you have set sstable_size_in_mb to 50 MB. What is the rationale behind this? (Yes, I'm curious :-) ). Anyway, your "SSTables per read" are good.
>
> Best,
>
> Romain
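To apply the swap Romain suggests above (making read repair DC-local, which matches the post-2.0.9 defaults), the CQL would be along these lines; a sketch using the db.tbl names from the desc output quoted below:

    ALTER TABLE db.tbl
      WITH read_repair_chance = 0.0
       AND dclocal_read_repair_chance = 0.1;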
> On Monday, 5 September 2016 at 13:32, Joseph Tech <jaalex.t...@gmail.com> wrote:
>
> Hi Ryan,
>
> Attached are the cfhistograms run within a few minutes of each other. On the surface, I don't see anything which indicates too much skewing (assuming skewing == keys spread across many SSTables). Please confirm. Related to this, what does the "cell count" metric indicate? I didn't find a clear explanation in the documents.
>
> Thanks,
> Joseph
>
> On Thu, Sep 1, 2016 at 6:30 PM, Ryan Svihla <r...@foundev.pro> wrote:
>
> Have you looked at cfhistograms/tablehistograms? Your data may just be skewed (the most likely explanation is probably the correct one here).
>
> Regards,
>
> Ryan Svihla
>
> _____________________________
> From: Joseph Tech <jaalex.t...@gmail.com>
> Sent: Wednesday, August 31, 2016 11:16 PM
> Subject: Re: Read timeouts on primary key queries
> To: <user@cassandra.apache.org>
>
> Patrick,
>
> The desc table is below (only col names changed):
>
> CREATE TABLE db.tbl (
>     id1 text,
>     id2 text,
>     id3 text,
>     id4 text,
>     f1 text,
>     f2 map<text, text>,
>     f3 map<text, text>,
>     created timestamp,
>     updated timestamp,
>     PRIMARY KEY (id1, id2, id3, id4)
> ) WITH CLUSTERING ORDER BY (id2 ASC, id3 ASC, id4 ASC)
>     AND bloom_filter_fp_chance = 0.01
>     AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
>     AND comment = ''
>     AND compaction = {'sstable_size_in_mb': '50', 'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
>     AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
>     AND dclocal_read_repair_chance = 0.0
>     AND default_time_to_live = 0
>     AND gc_grace_seconds = 864000
>     AND max_index_interval = 2048
>     AND memtable_flush_period_in_ms = 0
>     AND min_index_interval = 128
>     AND read_repair_chance = 0.1
>     AND speculative_retry = '99.0PERCENTILE';
>
> and the query is: select * from tbl where id1=? and id2=? and id3=? and id4=?
>
> The timeouts happen within ~2s to ~5s, while the successful calls have an avg of 8ms and a p99 of 15ms. These times are seen from the app side; the actual query times would be slightly lower.
>
> Is there a way to capture traces only when queries take longer than a specified duration? We can't enable tracing in production given the volume of traffic. We see that the same query which timed out works fine later, so not sure if the trace of a successful run would help.
>
> Thanks,
> Joseph
>
> On Wed, Aug 31, 2016 at 8:05 PM, Patrick McFadin <pmcfa...@gmail.com> wrote:
>
> If you are getting a timeout on one table, then a mismatch of RF and node count doesn't seem as likely.
>
> Time to look at your query. You said it was a 'select * from table where key=?' type query. I would next use the trace facility in cqlsh to investigate further. That's a good way to find hard-to-find issues. You should be looking for a clear ledge where you go from single-digit ms to 4- or 5-digit ms times.
>
> The other place to look is your data model for that table, if you want to post the output from a desc table.
>
> Patrick
>
> On Tue, Aug 30, 2016 at 11:07 AM, Joseph Tech <jaalex.t...@gmail.com> wrote:
>
> On further analysis, this issue happens only on 1 table in the KS, which has the max reads.
>
> @Atul, I will look at system health, but didn't see anything standing out from the GC logs (using JDK 1.8_92 with G1GC).
>
> @Patrick, could you please elaborate on the "mismatch on node count + RF" part?
>
> On Tue, Aug 30, 2016 at 5:35 PM, Atul Saroha <atul.sar...@snapdeal.com> wrote:
>
> There could be many reasons for this if it is intermittent: CPU usage + I/O wait status. As reads are I/O intensive, your IOPS requirement should be met under that load. A heap issue if the CPU is busy with GC only. Network health could also be the reason. So better to look at system health during the time when it happens.
>
> ---------------------------------------------------------------------
> Atul Saroha
> Lead Software Engineer
> M: +91 8447784271  T: +91 124-415-6069  EXT: 12369
> Plot # 362, ASF Centre - Tower A, Udyog Vihar,
> Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA
>
> On Tue, Aug 30, 2016 at 5:10 PM, Joseph Tech <jaalex.t...@gmail.com> wrote:
>
> Hi Patrick,
>
> The nodetool status shows all nodes up and normal now. From the OpsCenter "Event Log", there are some nodes reported as being down/up etc. during the timeframe of the timeout, but these are Search workload nodes from the remote (non-local) DC. The RF is 3 and there are 9 nodes per DC.
>
> Thanks,
> Joseph
>
> On Mon, Aug 29, 2016 at 11:07 PM, Patrick McFadin <pmcfa...@gmail.com> wrote:
>
> You aren't achieving quorum on your reads, as the error explains. That means you either have some nodes down or your topology is not matching up. The fact that you are using LOCAL_QUORUM might point to a datacenter mismatch on node count + RF.
>
> What does your nodetool status look like?
>
> Patrick
>
> On Mon, Aug 29, 2016 at 10:14 AM, Joseph Tech <jaalex.t...@gmail.com> wrote:
>
> Hi,
>
> We recently started getting intermittent timeouts on primary key queries (select * from table where key=<key>).
>
> The error is: com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency LOCAL_QUORUM (2 responses were required but only 1 replica responded)
>
> The same query would work fine when tried directly from cqlsh. There are no indications in system.log for the table in question, though there were compactions in progress for tables in another keyspace which is more frequently accessed.
>
> My understanding is that the chances of primary key queries timing out is very minimal. Please share the possible reasons / ways to debug this issue.
>
> We are using Cassandra 2.1 (DSE 4.8.7).
>
> Thanks,
> Joseph
>
> --
> Regards,
> Anshu
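As a footnote on the arithmetic in that error message, LOCAL_QUORUM is computed from the local replication factor with the standard quorum formula:

    LOCAL_QUORUM = floor(RF_local / 2) + 1
                 = floor(3 / 2) + 1
                 = 2

So with RF=3 per DC, the coordinator needed 2 of the 3 local replicas to answer within the read timeout, and only 1 did.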