I'm using Astyanax with a query like this:

clusterContext
  .getClient()
  .getKeyspace("instruments")
  .prepareQuery(INSTRUMENTS_CF)
  .setConsistencyLevel(ConsistencyLevel.CL_LOCAL_QUORUM)
  .getKeySlice(new String[] {
    "ROW1",
    "ROW2",
    // 20,000 keys here...
    "ROW20000"
  })
  .execute();

The first time this query executes (which leaves the cluster unresponsive),
there are zero rows in the column family. The schema is below and is pretty
basic:

CREATE KEYSPACE instruments WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'aws-us-east-1': '2'
};

CREATE TABLE instruments (
  key bigint PRIMARY KEY,
  definition blob,
  id bigint,
  name text,
  symbol text,
  updated bigint
) WITH COMPACT STORAGE AND
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'SnappyCompressor'};
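
For comparison, here is a minimal sketch of splitting the same key list into
smaller batches, so that each request names a few hundred keys rather than
20,000. The KeyBatcher class, its partition helper, and the batch size of 500
are hypothetical names and values chosen for illustration; they are not part
of Astyanax. Whether batching avoids the out-of-memory condition described in
this thread is untested here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Astyanax: splits a large key list
// into smaller consecutive batches.
final class KeyBatcher {
    // Assumed batch size for illustration only; tune against your own cluster.
    static final int BATCH_SIZE = 500;

    /** Returns consecutive sublists of at most batchSize elements each. */
    static <T> List<List<T>> partition(List<T> keys, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < keys.size(); start += batchSize) {
            int end = Math.min(start + batchSize, keys.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(keys.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 1; i <= 20000; i++) {
            keys.add("ROW" + i);
        }

        int total = 0;
        List<List<String>> batches = partition(keys, BATCH_SIZE);
        for (List<String> batch : batches) {
            // Each batch would be passed to its own query, e.g. the same
            // prepareQuery(...).getKeySlice(batch.toArray(new String[0])).execute()
            // chain shown above, instead of one call with every key.
            total += batch.size();
        }
        System.out.println(batches.size() + " batches, " + total + " keys");
        // prints "40 batches, 20000 keys"
    }
}
```

Running the batches one after another keeps the number of keys in flight per
request bounded by the batch size, at the cost of more round trips.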




On Tue, Jun 10, 2014 at 6:35 PM, Laing, Michael <michael.la...@nytimes.com>
wrote:

> Perhaps if you described both the schema and the query in more detail, we
> could help... e.g. did the query have an IN clause with 20000 keys? Or is
> the key compound? More detail will help.
>
>
> On Tue, Jun 10, 2014 at 7:15 PM, Jeremy Jongsma <jer...@barchart.com>
> wrote:
>
>> I didn't explain clearly - I'm not requesting 20000 unknown keys
>> (resulting in a full scan), I'm requesting 20000 specific rows by key.
>> On Jun 10, 2014 6:02 PM, "DuyHai Doan" <doanduy...@gmail.com> wrote:
>>
>>> Hello Jeremy
>>>
>>> Basically, you are asking Cassandra to do a distributed full scan of
>>> all the partitions across the cluster, so it's normal that the nodes
>>> are somewhat... stressed.
>>>
>>> How did you make the query? Are you using Thrift or CQL3 API?
>>>
>>> Please note that there is another way to get all partition keys : SELECT
>>> DISTINCT <partition_key> FROM..., more details here :
>>> www.datastax.com/dev/blog/cassandra-2-0-1-2-0-2-and-a-quick-peek-at-2-0-3
>>>
>>> I ran an application today that attempted to fetch 20,000+ unique row
>>> keys in one query against a set of completely empty column families. On a
>>> 4-node cluster (EC2 m1.large instances) with the recommended memory
>>> settings (2 GB heap), every single node immediately ran out of memory and
>>> became unresponsive, to the point where I had to kill -9 the cassandra
>>> processes.
>>>
>>> Now clearly this query is not the best idea in the world, but the
>>> effects of it are a bit disturbing. What could be going on here? Are there
>>> any other query pitfalls I should be aware of that have the potential to
>>> explode the entire cluster?
>>>
>>> -j
>>>
>>
>
