I'm using Astyanax with a query like this:

clusterContext
    .getClient()
    .getKeyspace("instruments")
    .prepareQuery(INSTRUMENTS_CF)
    .setConsistencyLevel(ConsistencyLevel.CL_LOCAL_QUORUM)
    .getKeySlice(new String[] {
        "ROW1",
        "ROW2",
        // 20,000 keys here...
        "ROW20000"
    })
    .execute();
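For now I'm planning to work around it by splitting the key list into smaller slices and issuing one query per slice - roughly like this (untested sketch; the chunk size of 500 is a guess, and I'm assuming INSTRUMENTS_CF is a ColumnFamily<String, String>):

// Untested sketch: fetch the same rows in smaller slices instead of one
// huge getKeySlice(). Chunk size of 500 is a guess, not a tuned value.
// (imports: java.util.*, com.netflix.astyanax.model.Row, Rows)
List<String> keys = Arrays.asList("ROW1", "ROW2", /* ... */ "ROW20000");
int chunkSize = 500;
List<Row<String, String>> allRows = new ArrayList<Row<String, String>>();

for (int i = 0; i < keys.size(); i += chunkSize) {
    // subList is a view; copy it into an array for the varargs getKeySlice()
    List<String> chunk = keys.subList(i, Math.min(i + chunkSize, keys.size()));
    Rows<String, String> rows = clusterContext
        .getClient()
        .getKeyspace("instruments")
        .prepareQuery(INSTRUMENTS_CF)
        .setConsistencyLevel(ConsistencyLevel.CL_LOCAL_QUORUM)
        .getKeySlice(chunk.toArray(new String[chunk.size()]))
        .execute()
        .getResult();
    for (Row<String, String> row : rows) {
        allRows.add(row);
    }
}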
The first time this query executes (resulting in an unresponsive cluster), there are zero rows in the column family. The schema is below, pretty basic:

CREATE KEYSPACE instruments WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'aws-us-east-1': '2'
};

CREATE TABLE instruments (
  key bigint PRIMARY KEY,
  definition blob,
  id bigint,
  name text,
  symbol text,
  updated bigint
) WITH COMPACT STORAGE AND
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'SnappyCompressor'};

On Tue, Jun 10, 2014 at 6:35 PM, Laing, Michael <michael.la...@nytimes.com> wrote:

> Perhaps if you described both the schema and the query in more detail, we
> could help... e.g. did the query have an IN clause with 20000 keys? Or is
> the key compound? More detail will help.
>
>
> On Tue, Jun 10, 2014 at 7:15 PM, Jeremy Jongsma <jer...@barchart.com> wrote:
>
>> I didn't explain clearly - I'm not requesting 20000 unknown keys
>> (resulting in a full scan), I'm requesting 20000 specific rows by key.
>>
>> On Jun 10, 2014 6:02 PM, "DuyHai Doan" <doanduy...@gmail.com> wrote:
>>
>>> Hello Jeremy
>>>
>>> Basically what you are doing is asking Cassandra to do a distributed
>>> full scan on all the partitions across the cluster; it's normal that the
>>> nodes are somewhat... stressed.
>>>
>>> How did you make the query? Are you using the Thrift or CQL3 API?
>>>
>>> Please note that there is another way to get all partition keys: SELECT
>>> DISTINCT <partition_key> FROM ..., more details here:
>>> www.datastax.com/dev/blog/cassandra-2-0-1-2-0-2-and-a-quick-peek-at-2-0-3
>>>
>>> I ran an application today that attempted to fetch 20,000+ unique row
>>> keys in one query against a set of completely empty column families. On a
>>> 4-node cluster (EC2 m1.large instances) with the recommended memory
>>> settings (2 GB heap), every single node immediately ran out of memory and
>>> became unresponsive, to the point where I had to kill -9 the cassandra
>>> processes.
>>>
>>> Now clearly this query is not the best idea in the world, but the
>>> effects of it are a bit disturbing. What could be going on here? Are there
>>> any other query pitfalls I should be aware of that have the potential to
>>> explode the entire cluster?
>>>
>>> -j