Re: large range read in Cassandra

2015-02-02 Thread Dan Kinder
For the benefit of others: I ended up finding out that the CQL library I
was using (https://github.com/gocql/gocql) at the time defaulted to no
paging (the page size was left unset), so Cassandra was trying to pull all
rows of the partition into memory at once. Setting the page size to a
reasonable number seems to have done the trick.
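As a sketch of what that fix looks like (assumption: a gocql version that
exposes Query.PageSize; clampPageSize and the query shape are hypothetical,
for illustration only — the driver call itself needs a live cluster, so it
is shown in comments):

```go
package main

import "fmt"

// clampPageSize returns a sane page size. gocql at the time defaulted to
// no paging when the page size was unset, which made Cassandra materialize
// the whole partition server-side. Hypothetical helper for illustration.
func clampPageSize(requested int) int {
	if requested <= 0 {
		return 5000 // a reasonable default; tune for row width
	}
	return requested
}

func main() {
	// The actual driver usage (assumption: Query.PageSize exists in the
	// gocql version in use; needs a running cluster, so shown as comments):
	//
	//   iter := session.Query(`SELECT url FROM links`).
	//       PageSize(clampPageSize(0)).
	//       Iter()
	//   var url string
	//   for iter.Scan(&url) {
	//       dispatch(url)
	//   }
	//   if err := iter.Close(); err != nil {
	//       log.Fatal(err)
	//   }
	fmt.Println(clampPageSize(0))
}
```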

On Tue, Nov 25, 2014 at 2:54 PM, Dan Kinder dkin...@turnitin.com wrote:

 Thanks, very helpful Rob, I'll watch for that.

 On Tue, Nov 25, 2014 at 11:45 AM, Robert Coli rc...@eventbrite.com
 wrote:

 On Tue, Nov 25, 2014 at 10:45 AM, Dan Kinder dkin...@turnitin.com
 wrote:

 To be clear, I expect this range query to take a long time and perform
 relatively heavy I/O. What I expected Cassandra to do was use auto-paging (
 https://issues.apache.org/jira/browse/CASSANDRA-4415,
 http://stackoverflow.com/questions/17664438/iterating-through-cassandra-wide-row-with-cql3)
 so that we aren't literally pulling the entire thing in. Am I
 misunderstanding this use case? Could you clarify why exactly it would slow
 way down? It seems like with each read it should be doing a simple range
 read from one or two sstables.


 If you're paging through a single partition, that's likely to be fine.
 When you said "range reads ... over rows" my impression was you were
 talking about attempting to page through millions of partitions.

 With that confusion cleared up, the likely explanation for lack of
 availability in your case is heap pressure/GC time. Look for GCs around
 that time. Also, if you're using authentication, make sure that your
 authentication keyspace has a replication factor greater than 1.

 =Rob





 --
 Dan Kinder
 Senior Software Engineer
 Turnitin – www.turnitin.com
 dkin...@turnitin.com




-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Re: large range read in Cassandra

2014-11-25 Thread Dan Kinder
Thanks Rob.

To be clear, I expect this range query to take a long time and perform
relatively heavy I/O. What I expected Cassandra to do was use auto-paging (
https://issues.apache.org/jira/browse/CASSANDRA-4415,
http://stackoverflow.com/questions/17664438/iterating-through-cassandra-wide-row-with-cql3)
so that we aren't literally pulling the entire thing in. Am I
misunderstanding this use case? Could you clarify why exactly it would slow
way down? It seems like with each read it should be doing a simple range
read from one or two sstables.

If this won't work then it may mean we need to start using Hive/Spark/Pig
etc. sooner, or page it manually using LIMIT and a WHERE clause keyed on
the last returned result.
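That manual fallback can be sketched in Go without a live cluster. fetchPage
below is a stand-in for the real CQL query (the usual pattern is
SELECT ... WHERE token(pk) > token(?) LIMIT ?, seeding the next page with the
last row returned); plain ints stand in for partition-key tokens, and the
loop shows the bookkeeping:

```go
package main

import "fmt"

// fetchPage simulates one manually-paged read: return up to 'limit' rows
// whose token is strictly greater than 'after'. In real CQL this would be
// a query of the form: SELECT ... WHERE token(pk) > token(?) LIMIT ?
func fetchPage(rows []int, after int, limit int) []int {
	page := []int{}
	for _, tok := range rows {
		if tok > after && len(page) < limit {
			page = append(page, tok)
		}
	}
	return page
}

func main() {
	// 25 fake rows, paged 10 at a time.
	rows := []int{}
	for i := 1; i <= 25; i++ {
		rows = append(rows, i)
	}
	last, pages, seen := 0, 0, 0
	for {
		page := fetchPage(rows, last, 10)
		if len(page) == 0 {
			break // an empty page means we've walked the whole range
		}
		pages++
		seen += len(page)
		last = page[len(page)-1] // lower bound for the next page
	}
	fmt.Println(pages, seen) // prints: 3 25
}
```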

On Mon, Nov 24, 2014 at 5:49 PM, Robert Coli rc...@eventbrite.com wrote:

 On Mon, Nov 24, 2014 at 4:26 PM, Dan Kinder dkin...@turnitin.com wrote:

 We have a web crawler project currently based on Cassandra (
 https://github.com/iParadigms/walker, written in Go and using the gocql
 driver), with the following relevant usage pattern:

 - Big range reads over a CF to grab potentially millions of rows and
 dispatch new links to crawl


 If you really mean millions of storage rows, this is just about the worst
 case for Cassandra. The problem you're having is probably that you
 shouldn't try to do this in Cassandra.

 Your timeouts are either from the read actually taking longer than the
 timeout or from the reads provoking heap pressure and resulting GC.

 =Rob




Re: large range read in Cassandra

2014-11-25 Thread Robert Coli
On Tue, Nov 25, 2014 at 10:45 AM, Dan Kinder dkin...@turnitin.com wrote:

 To be clear, I expect this range query to take a long time and perform
 relatively heavy I/O. What I expected Cassandra to do was use auto-paging (
 https://issues.apache.org/jira/browse/CASSANDRA-4415,
 http://stackoverflow.com/questions/17664438/iterating-through-cassandra-wide-row-with-cql3)
 so that we aren't literally pulling the entire thing in. Am I
 misunderstanding this use case? Could you clarify why exactly it would slow
 way down? It seems like with each read it should be doing a simple range
 read from one or two sstables.


If you're paging through a single partition, that's likely to be fine. When
you said "range reads ... over rows" my impression was you were talking
about attempting to page through millions of partitions.

With that confusion cleared up, the likely explanation for lack of
availability in your case is heap pressure/GC time. Look for GCs around
that time. Also, if you're using authentication, make sure that your
authentication keyspace has a replication factor greater than 1.
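For reference, a sketch of that keyspace change (assuming Cassandra 2.x,
where credentials live in the system_auth keyspace; pick a replication
class and factor matching your topology):

```cql
ALTER KEYSPACE system_auth
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
```

Then run "nodetool repair system_auth" on each node so the existing
credential rows actually get replicated.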

=Rob


Re: large range read in Cassandra

2014-11-25 Thread Dan Kinder
Thanks, very helpful Rob, I'll watch for that.

On Tue, Nov 25, 2014 at 11:45 AM, Robert Coli rc...@eventbrite.com wrote:

 On Tue, Nov 25, 2014 at 10:45 AM, Dan Kinder dkin...@turnitin.com wrote:

 To be clear, I expect this range query to take a long time and perform
 relatively heavy I/O. What I expected Cassandra to do was use auto-paging (
 https://issues.apache.org/jira/browse/CASSANDRA-4415,
 http://stackoverflow.com/questions/17664438/iterating-through-cassandra-wide-row-with-cql3)
 so that we aren't literally pulling the entire thing in. Am I
 misunderstanding this use case? Could you clarify why exactly it would slow
 way down? It seems like with each read it should be doing a simple range
 read from one or two sstables.


 If you're paging through a single partition, that's likely to be fine.
 When you said "range reads ... over rows" my impression was you were
 talking about attempting to page through millions of partitions.

 With that confusion cleared up, the likely explanation for lack of
 availability in your case is heap pressure/GC time. Look for GCs around
 that time. Also, if you're using authentication, make sure that your
 authentication keyspace has a replication factor greater than 1.

 =Rob





-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


large range read in Cassandra

2014-11-24 Thread Dan Kinder
Hi,

We have a web crawler project currently based on Cassandra (
https://github.com/iParadigms/walker, written in Go and using the gocql
driver), with the following relevant usage pattern:

- Big range reads over a CF to grab potentially millions of rows and
dispatch new links to crawl
- Fast insert of new links (effectively using Cassandra to deduplicate)

We ultimately planned on doing the batch processing step (the dispatching)
in a system like Spark, but for the time being it is also in Go. We believe
this should work fine given that Cassandra now properly allows chunked
iteration of columns in a CF.

The issue is, periodically while doing a particularly large range read,
other operations time out because that node is busy. In an experimental
cluster with only two nodes (and a replication factor of 2), I'll get an
error like "Operation timed out - received only 1 responses", indicating
that the second node took too long to reply. At the moment I have the long
range reads set to consistency level ANY but the rest of the operations are
on QUORUM, so on this cluster they require responses from both nodes. The
relevant CF is also using LeveledCompactionStrategy. This happens in both
Cassandra 2.0 and 2.1.

Despite this error I don't see any significant I/O, memory consumption, or
CPU usage.

Here are some of the configuration values I've played with:

Increasing timeouts:
read_request_timeout_in_ms: 15000
range_request_timeout_in_ms: 3
write_request_timeout_in_ms: 1
request_timeout_in_ms: 1

Getting rid of caches we don't need:
key_cache_size_in_mb: 0
row_cache_size_in_mb: 0

Each of the 2 nodes has an HDD for the commit log and a single HDD for
data. Hence the following thread config (maybe since I/O is not an issue I
should increase these?):
concurrent_reads: 16
concurrent_writes: 32
concurrent_counter_writes: 32

Because I have a large number of columns and am not doing random I/O, I've
increased this:
column_index_size_in_kb: 2048

It's something of a mystery why this error comes up. Of course with a 3rd
node it will get masked if I am doing QUORUM operations, but it still seems
like it should not happen, and that there is some kind of head-of-line
blocking or other issue in Cassandra. I would like to increase the amount
of dispatching I'm doing, but because of this issue it bogs down when I do.

Any suggestions for other things we can try here would be appreciated.

-dan


Re: large range read in Cassandra

2014-11-24 Thread Robert Coli
On Mon, Nov 24, 2014 at 4:26 PM, Dan Kinder dkin...@turnitin.com wrote:

 We have a web crawler project currently based on Cassandra (
 https://github.com/iParadigms/walker, written in Go and using the gocql
 driver), with the following relevant usage pattern:

 - Big range reads over a CF to grab potentially millions of rows and
 dispatch new links to crawl


If you really mean millions of storage rows, this is just about the worst
case for Cassandra. The problem you're having is probably that you
shouldn't try to do this in Cassandra.

Your timeouts are either from the read actually taking longer than the
timeout or from the reads provoking heap pressure and resulting GC.

=Rob