On Thu, Mar 25, 2010 at 8:33 AM, Henrik Schröder <skro...@gmail.com> wrote:
> Hi everyone,
>
> We're trying to implement a virtual datastore for our users where they can
> set up "tables" and "indexes" to store objects and have them indexed on
> arbitrary properties. And we did a test implementation for Cassandra in the
> following way:
>
> Objects are stored in one columnfamily, each key is made up of tableid +
> "object key", and each row has one column where the value is the serialized
> object. This part is super-simple: we're just using Cassandra as a
> key-value store, and it performs really well.
>
> The indexes are a bit trickier, but basically for each index and each object
> that is stored, we compute a fixed-length bytearray based on the object that
> makes up the indexvalue. We then store these bytearray indexvalues in another
> columnfamily, with the indexid as row key, the indexvalue as the column
> name, and the object key as the column value.

So all the values for an entire index will be in one row?  That
doesn't sound good.
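
Spelled out, the layout you describe amounts to roughly this (a sketch with
made-up names; simple byte concatenation stands in for whatever client API
you're actually using, and indexId is assumed to be numeric purely for
illustration):

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    // Sketch of the described index layout: the index CF row key is just the
    // index id, so every (indexvalue -> object key) entry for an index ends
    // up as a column in one single, ever-growing row.
    public class DescribedIndexLayout {
        // Row key for the index CF: the index id alone.
        static byte[] indexRowKey(long indexId) {
            return ByteBuffer.allocate(8).putLong(indexId).array();
        }

        // Column name: the fixed-length bytearray computed from the object.
        static byte[] indexColumnName(byte[] fixedLengthIndexValue) {
            return fixedLengthIndexValue;
        }

        // Column value: the object key the entry points back to.
        static byte[] indexColumnValue(String objectKey) {
            return objectKey.getBytes(StandardCharsets.UTF_8);
        }
    }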

You really want to put each index [and each table] in its own CF, but
until we can do that dynamically (0.7) you could at least make the
index row keys a tuple of (indexid, indexvalue) and the column names
in each row the object keys (empty column values).
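
A minimal sketch of that key layout, assuming the indexvalue is the same
fixed-length bytearray you already compute (names here are made up, and the
actual Thrift calls are omitted):

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    // Sketch of the suggested layout: one row per (indexid, indexvalue) pair,
    // with the matching object keys as column names and empty column values.
    public class SuggestedIndexLayout {
        // Row key: index id concatenated with the fixed-length index value.
        static byte[] indexRowKey(long indexId, byte[] fixedLengthIndexValue) {
            return ByteBuffer.allocate(8 + fixedLengthIndexValue.length)
                             .putLong(indexId)
                             .put(fixedLengthIndexValue)
                             .array();
        }

        // Column name: the object key; the column value stays empty.
        static byte[] indexColumnName(String objectKey) {
            return objectKey.getBytes(StandardCharsets.UTF_8);
        }

        static final byte[] EMPTY_COLUMN_VALUE = new byte[0];
    }

A range query over index values then becomes a scan over a range of row keys
rather than a slice within one giant row (which in 0.6 presumably means an
order-preserving partitioner).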

This works pretty well for a lot of users, including Digg.

> We tested just the "index" part of our design, and these are the
> numbers we got:
> inserts (15 threads, batches of 10): 4000/second
> get_slices (10 threads, random range sizes, count 1000): 50/second at start,
> dies at about 6 million columns inserted. (OutOfMemoryException)
> get_slices (10 threads, random range sizes, count 10): 200/s at start, slows
> down the more columns there are.

Those are really low read numbers, but I'd make the schema change
above before digging deeper there.

Also, if you are OOMing, you're probably getting really crappy
performance for some time before that, as the JVM tries desperately to
collect enough space to keep going.  The easiest solution is to just
let it use more memory, assuming you can do so.
http://wiki.apache.org/cassandra/RunningCassandra
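
For example, on the 0.6 line the heap settings live in bin/cassandra.in.sh
(the exact file and defaults vary by version, so check the page above);
raising the maximum heap looks something like:

    # bin/cassandra.in.sh -- raise the JVM max heap to what the box can spare
    JVM_OPTS="$JVM_OPTS -Xms2G -Xmx4G"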

-Jonathan
