Hi,

We have a web crawler project currently based on Cassandra (
https://github.com/iParadigms/walker, written in Go and using the gocql
driver), with the following relevant usage pattern:

- Big range reads over a CF to grab potentially millions of rows and
dispatch new links to crawl
- Fast insert of new links (effectively using Cassandra to deduplicate)
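For context, the dedup trick in the second bullet relies on Cassandra INSERTs being upserts: writing the same primary key twice leaves a single row. A minimal sketch of that pattern with gocql, using an illustrative table and schema (not the project's actual one):

```go
package main

import "github.com/gocql/gocql"

// insertLink writes a link unconditionally. Because Cassandra INSERT is an
// upsert, re-inserting an already-seen (domain, path) primary key simply
// overwrites the existing row, which is what deduplicates links for free.
// The "links" table and its columns are assumptions for illustration.
func insertLink(s *gocql.Session, domain, path string) error {
	return s.Query(
		"INSERT INTO links (domain, path) VALUES (?, ?)",
		domain, path,
	).Exec()
}
```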

We ultimately planned on doing the batch processing step (the dispatching)
in a system like Spark, but for the time being it is also in Go. We believe
this should work fine given that Cassandra now properly allows chunked
iteration of columns in a CF.
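The chunked iteration mentioned above maps to gocql's paging support. A sketch of how the dispatcher's big range read might look, assuming a hypothetical "links" table in a "walker" keyspace (names are illustrative, not the project's real schema):

```go
package main

import (
	"fmt"
	"log"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("127.0.0.1")
	cluster.Keyspace = "walker" // assumed keyspace name
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// PageSize makes gocql fetch the result set lazily in chunks, so a
	// range read over millions of rows is many small round trips rather
	// than one giant response held by a single node.
	iter := session.Query("SELECT domain, path FROM links").
		PageSize(5000).
		Iter()

	var domain, path string
	for iter.Scan(&domain, &path) {
		fmt.Println(domain, path) // dispatch the link for crawling here
	}
	if err := iter.Close(); err != nil {
		log.Fatal(err)
	}
}
```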

The issue is that periodically, while a particularly large range read is
running, other operations time out because that node is "busy". In an
experimental cluster with only two nodes (and a replication factor of 2),
I'll get an error like "Operation timed out - received only 1 responses.",
indicating that the second node took too long to reply. At the moment I
have the long range reads set to consistency level ANY, while the rest of
the operations are at QUORUM, so on this cluster they require responses
from both nodes. The relevant CF is also using LeveledCompactionStrategy.
This happens on both Cassandra 2.0 and 2.1.
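Setting the consistency level per query in gocql looks roughly like this (table and column names are again illustrative). One caveat worth noting: ANY is a write-only consistency level in Cassandra, so ONE is the weakest level a read can actually use:

```go
package main

import "github.com/gocql/gocql"

// scanLinks runs the big range read at the weakest read consistency (ONE)
// so it does not wait on every replica, while writes elsewhere can stay at
// QUORUM via cluster.Consistency or their own per-query setting.
func scanLinks(session *gocql.Session) *gocql.Iter {
	return session.Query("SELECT domain, path FROM links").
		Consistency(gocql.One). // weak CL for the scan only
		PageSize(5000).
		Iter()
}
```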

Despite this error I don't see any significant I/O, memory consumption, or
CPU usage.

Here are some of the configuration values I've played with:

Increasing timeouts:
read_request_timeout_in_ms: 15000
range_request_timeout_in_ms: 30000
write_request_timeout_in_ms: 10000
request_timeout_in_ms: 10000

Getting rid of caches we don't need:
key_cache_size_in_mb: 0
row_cache_size_in_mb: 0

Each of the 2 nodes has one HDD for the commit log and a single HDD I'm
using for data. Hence the following thread config (though since I/O doesn't
seem to be the bottleneck, maybe I should increase these?):
concurrent_reads: 16
concurrent_writes: 32
concurrent_counter_writes: 32

Because I have a large number of columns and am not doing random I/O, I've
increased this:
column_index_size_in_kb: 2048

It's something of a mystery why this error comes up. Of course, with a 3rd
node it will get masked if I'm doing QUORUM operations, but it still seems
like it should not happen, and that there is some kind of head-of-line
blocking or other issue in Cassandra. I would like to increase the amount
of dispatching I'm doing, but because of this issue it bogs the cluster
down if I do.

Any suggestions for other things we can try here would be appreciated.

-dan
