Heya! I’ve been observing some strange and worrying behaviour all this week with row cache hits taking hundreds of milliseconds.
Cassandra 1.2.15, Datastax CQL driver 1.0.4. EC2 m1.xlarge instances RF=3, N=4 vnodes in use key cache: 200M row cache: 200M row_cache_provider: SerializingCacheProvider Query: PreparedStatement SELECT * from uniques3 WHERE hash=? AND item=? AND event=?. All values are < 20 bytes. All data is written with a TTL of days. Row is not particularly wide (see cfhistograms in the pastebin). Row cache hit can take hundreds of milliseconds, pretty much screwing performance. My initial thought was garbage collection, but I collected traces and GC logs to the pastebin below, so while there *is* plenty of GC going on, I don’t think it’s the reason. We also have other column families accessed through Thrift which do not exhibit this behaviour at all. There are no abnormal query times for cache misses. http://pastebin.com/ac6PVHhm Notice also the weird “triple hump” on the cfhistograms - I’m kinda used to seeing two humps, one for cache hits and one for disk access, but this one has clearly three humps, one at the 200ms area. Also odd is the very large fp false ratio, but that might be just our data. Armed with the traces I formed a hypothesis that perhaps row cache is a bad idea, turned it off for this CF, and hey! The average read latencies dropped to about 2 milliseconds. So I’m kinda fine here now, but I would really appreciate it if someone could explain to me what is going on, and why would a row cache hit ever take up to 450 milliseconds? In our usecase, this CF does contain some hot data, and the row cache hit ratio is around 80%, so keeping it would be kinda useful. (The pastebin contains a couple of traces, GC logs from all servers noted in the trace, cfstats, cfhistograms and schema.) /Janne