Re: Range scan performance in 0.6.0 beta2

2010-03-29 Thread Henrik Schröder
On Fri, Mar 26, 2010 at 14:47, Jonathan Ellis jbel...@gmail.com wrote:

 On Fri, Mar 26, 2010 at 7:40 AM, Henrik Schröder skro...@gmail.com
 wrote:
  For each indexvalue we insert a row where the key is indexid + ":" +
  indexvalue encoded as a hex string, and the row contains only one column,
  where the name is the object key encoded as a bytearray, and the value is
  empty.

 It's a unique index then?  And you're trying to read things ordered by
 the index, not just "give me keys that have a column with this
 value"?


Yes, because if we have more than one column per row, there's no way of
(easily) limiting the result. As it is now we rarely want all object keys
associated with a range of indexvalues. However, this means we will have a
lot of rows if we do it in Cassandra.


/Henrik


Re: Range scan performance in 0.6.0 beta2

2010-03-29 Thread Jonathan Ellis
On Mon, Mar 29, 2010 at 4:06 AM, Henrik Schröder skro...@gmail.com wrote:
 On Fri, Mar 26, 2010 at 14:47, Jonathan Ellis jbel...@gmail.com wrote:
 It's a unique index then?  And you're trying to read things ordered by
 the index, not just "give me keys that have a column with this
 value"?

 Yes, because if we have more than one column per row, there's no way of
 (easily) limiting the result.

That's exactly what the count parameter of SliceRange is for... ?
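
For concreteness, roughly what that looks like against the 0.6-era Thrift
interface (keyspace passed per call, string row keys); this is a minimal
sketch, and the keyspace name, row key, and column family name are
placeholders:

    import java.util.List;
    import org.apache.cassandra.thrift.*;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TSocket;

    public class SliceCountExample {
        public static void main(String[] args) throws Exception {
            TSocket socket = new TSocket("localhost", 9160);
            socket.open();
            Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(socket));

            // Ask for at most 1000 columns of one index row; empty start/finish
            // means "from the beginning of the row".
            SliceRange range = new SliceRange(new byte[0], new byte[0], false, 1000);
            SlicePredicate predicate = new SlicePredicate();
            predicate.setSlice_range(range);

            List<ColumnOrSuperColumn> page = client.get_slice(
                    "Keyspace1", "someIndexId", new ColumnParent("Indexes"),
                    predicate, ConsistencyLevel.ONE);
            System.out.println("got " + page.size() + " columns");
            socket.close();
        }
    }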

-Jonathan


Re: Range scan performance in 0.6.0 beta2

2010-03-26 Thread Henrik Schröder

 So all the values for an entire index will be in one row?  That
 doesn't sound good.

 You really want to put each index [and each table] in its own CF, but
 until we can do that dynamically (0.7) you could at least make the
 index row keys a tuple of (indexid, indexvalue) and the column names
 in each row the object keys (empty column values).

 This works pretty well for a lot of users, including Digg.


We tested your suggestions like this:
We're using the OrderPreservingPartitioner.
We set the keycache and rowcache to 40%.
We're using the same machine as before, but we switched to a 64-bit JVM and
gave it 5GB of memory.
For each indexvalue we insert a row where the key is indexid + ":" +
indexvalue encoded as a hex string, and the row contains only one column,
where the name is the object key encoded as a bytearray, and the value is
empty.
When reading, we do a get_range_slice with an empty slice_range (start and
finish are 0-length byte-arrays), and randomly generated start_key and
finish_key where we know they both have been inserted, and finally a
row_count of 1000.
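
(As a rough sketch of that read path against the 0.6 Thrift interface,
assuming a connected Cassandra.Client as in the earlier sketch, and with
made-up keyspace and key values, the call would look something like this:)

    // We mainly want the row keys; each row holds a single empty-valued column
    // whose name is the object key, so a slice count of 1 is enough.
    SlicePredicate predicate = new SlicePredicate();
    predicate.setSlice_range(new SliceRange(new byte[0], new byte[0], false, 1));

    // Under the OrderPreservingPartitioner, keys of the form "indexid:hex(indexvalue)"
    // come back in lexical order, so a key range is a range scan over index values.
    KeyRange keys = new KeyRange();
    keys.setStart_key("42:0000a1b2");   // made-up start indexvalue
    keys.setEnd_key("42:0000ff00");     // made-up end indexvalue
    keys.setCount(1000);                // the row_count mentioned above

    List<KeySlice> rows = client.get_range_slices(
            "Keyspace1", new ColumnParent("Indexes"), predicate, keys,
            ConsistencyLevel.ONE);
    for (KeySlice row : rows) {
        for (ColumnOrSuperColumn c : row.getColumns()) {
            byte[] objectKey = c.getColumn().getName();   // the object key
        }
    }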

These are the numbers we got this time:
inserts (15 threads, batches of 10): 4000/second
get_range_slices (10 threads, row_count 1000): 50/second at start, down to
10/second at 250k inserts.

These numbers are slightly better than our previous OPP tries, but nothing
significant. For what it's worth, if we're only doing writes, the machine
bottlenecks on disk I/O as expected, but whenever we do reads, it
bottlenecks on CPU usage instead. Is this expected?


Also, how would dynamic column families help us? In our tests, we only
tested a single index, so even if we had one column family per index, we
would still only write to one of them and then get the exact same results as
above, right?

We're really grateful for any help with both how to tune Cassandra and how
to design our data model. The designs we've tested so far are the best we
could come up with ourselves; all we really need is a way to store groups of
indexvalue-to-objectkey mappings, and to be able to get a range of objectkeys
back given a group and a start and stop indexvalue.


/Henrik


Re: Range scan performance in 0.6.0 beta2

2010-03-26 Thread Jonathan Ellis
On Fri, Mar 26, 2010 at 7:40 AM, Henrik Schröder skro...@gmail.com wrote:
 For each indexvalue we insert a row where the key is indexid + ":" +
 indexvalue encoded as a hex string, and the row contains only one column,
 where the name is the object key encoded as a bytearray, and the value is
 empty.

It's a unique index then?  And you're trying to read things ordered by
the index, not just "give me keys that have a column with this
value"?

 These numbers are slightly better than our previous OPP tries, but nothing
 significant. For what it's worth, if we're only doing writes, the machine
 bottlenecks on disk I/O as expected, but whenever we do reads, it
 bottlenecks on CPU usage instead. Is this expected?

Yes.

 Also, how would dynamic column families help us?

You don't have to mess with key prefixes, since each CF contains only
one type of index.

-Jonathan


Re: Range scan performance in 0.6.0 beta2

2010-03-25 Thread Sylvain Lebresne
I don't know if this plays any role, but if you have disabled the
assertions when running Cassandra (that is, you removed the -ea line in
cassandra.in.sh), there was a bug in 0.6beta2 that makes reads in rows
with lots of columns quite slow.

Another problem you may have is if the commitLog directory is on the same
hard drive as the data directory. If that's the case and you read and write
at the same time, that may be a reason for poor read (and write) performance.

As for the row with 30 million columns, you have to be aware that right now,
Cassandra will deserialize whole rows during compaction
(http://wiki.apache.org/cassandra/CassandraLimitations).
So depending on the size of what you store in your columns, you could very
well hit that limitation (that could be why you OOM). In which case, I see
two choices: 1) add more RAM to the machine, or 2) change your data structure
to avoid it (maybe you can split rows with too many columns somehow?).
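
(As a purely illustrative sketch of option 2, one way to split an oversized
index row is to shard it into fixed buckets, for example by the first byte
of the fixed-length indexvalue, so that no single row grows without bound.
The helper below is invented for illustration; it is not something proposed
in this thread:)

    // Hypothetical bucketing helper: spread one logical index over 256 physical
    // rows keyed by "indexid:bucket". Because the bucket is the first byte of the
    // indexvalue, a range query only has to touch the buckets that overlap the
    // requested indexvalue range, in order.
    static String bucketedRowKey(String indexId, byte[] indexValue) {
        int bucket = indexValue[0] & 0xFF;   // 0..255
        return indexId + ":" + bucket;
    }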

--
Sylvain

On Thu, Mar 25, 2010 at 2:33 PM, Henrik Schröder skro...@gmail.com wrote:
 Hi everyone,

 We're trying to implement a virtual datastore for our users where they can
 set up tables and indexes to store objects and have them indexed on
 arbitrary properties. And we did a test implementation for Cassandra in the
 following way:

 Objects are stored in one columnfamily, each key is made up of tableid +
 object key, and each row has one column where the value is the serialized
 object. This part is super-simple, we're just using Cassandra as a
 key-value-store, and this part performs really well.

 The indexes are a bit trickier, but basically for each index and each object
 that is stored, we compute a fixed-length bytearray based on the object that
 makes up the indexvalue. We then store these bytearray indexvalues in another
 columnfamily, with the indexid as row key, the indexvalue as the column
 name, and the object key as the column value.

 The idea is then that to perform a range query on an index in this virtual
 datastore, we do a get_slice to get the range of indexvalues and their
 corresponding object keys, and we can then multi_get the actual objects from
 the other column family.

 Since these virtual tables and indexes will be created by our users, the
 whole system has to be very dynamic, and we can't make any assumptions about
 the actual objects they will store or the distribution of these. We do know
 that it must be able to scale well, and this is what attracted us to
 Cassandra in the first place. We do, however, have some performance targets
 we want to hit: we have one use-case where there will be about 30 million
 records in a table, and knowing that it can go up to 100 million records
 would be nice. As for speed, we would like to get thousands of writes and
 range reads per second.

 Given these requirements and our design, we will then have rows in Cassandra
 with millions of columns, from which we want to fetch large column slices.
 We set it all up on a single developer machine (MacPro, QuadCore 2.66GHz)
 running Windows, and we used the Thrift compiler to generate a C# client
 library. We tested just the index part of our design, and these are the
 numbers we got:
 inserts (15 threads, batches of 10): 4000/second
 get_slices (10 threads, random range sizes, count 1000): 50/second at start,
 dies at about 6 million columns inserted. (OutOfMemoryException)
 get_slices (10 threads, random range sizes, count 10): 200/s at start, slows
 down the more columns there are.


 When we saw that the above results were bad, we tried a different approach
 storing the indexvalues in the key instead, using the
 OrderPreservingPartitioner and using get_range_slice to get ranges of rows,
 but we got even worse results:
 inserts (15 threads, in batches of 10): 4000/second
 get_range_slice (10 threads, random key ranges, count 1000): 20/second at
 start, 5/second with 30 million rows


 Finally, we did a similar test using MySQL instead and then we got these
 numbers:
 inserts (15 threads, in batches of 10): 4000/second
 select (10 threads, limit 1000): 500/second

 So for us, the MySQL version delivers the speed that we want, but none of
 the scaling that Cassandra gives us. We set up our columnfamilies like this:

 <ColumnFamily CompareWith="BytesType" Name="Objects" RowsCached="0"
 KeysCached="0"/>
 <ColumnFamily CompareWith="BytesType" Name="Indexes" RowsCached="0"
 KeysCached="0"/>

 And we now have these questions:
 a) Is there a better way of structuring our data and building the virtual
 indexes?
 b) Are our Cassandra numbers too low? Or is this the expected performance?
 c) Did we miss changing some important setting (in the conf XML or Java
 config), given that our rows are this large?
 d) Can we avoid hitting the OutOfMemory exception?


 /Henrik
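
(The index read path described in the message above, a get_slice on the index
row followed by a multiget of the objects, would look roughly like this
against the 0.6 Thrift interface. This is a sketch with placeholder keyspace,
row key, and range values, assuming a connected client and an enclosing
method that declares throws Exception; imports are as in the earlier sketch
plus java.util.*:)

    // 1. Slice the index row: column names are indexvalues, values are object keys.
    byte[] startIndexValue = new byte[0];   // placeholders for the requested range
    byte[] endIndexValue = new byte[0];
    SlicePredicate pred = new SlicePredicate();
    pred.setSlice_range(new SliceRange(startIndexValue, endIndexValue, false, 1000));
    List<ColumnOrSuperColumn> indexCols = client.get_slice(
            "Keyspace1", "someIndexId", new ColumnParent("Indexes"),
            pred, ConsistencyLevel.ONE);

    // 2. Multiget the objects by the keys found in the index (row keys are strings
    //    in 0.6; the tableid prefix described above is omitted here).
    List<String> objectKeys = new ArrayList<String>();
    for (ColumnOrSuperColumn c : indexCols) {
        objectKeys.add(new String(c.getColumn().getValue(), "UTF-8"));
    }
    SlicePredicate one = new SlicePredicate();
    one.setSlice_range(new SliceRange(new byte[0], new byte[0], false, 1));
    Map<String, List<ColumnOrSuperColumn>> objects = client.multiget_slice(
            "Keyspace1", objectKeys, new ColumnParent("Objects"), one,
            ConsistencyLevel.ONE);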



Re: Range scan performance in 0.6.0 beta2

2010-03-25 Thread Henrik Schröder
On Thu, Mar 25, 2010 at 15:17, Sylvain Lebresne sylv...@yakaz.com wrote:

 I don't know if this plays any role, but if you have disabled the
 assertions when running Cassandra (that is, you removed the -ea line in
 cassandra.in.sh), there was a bug in 0.6beta2 that makes reads in rows
 with lots of columns quite slow.


We tried it with beta3 and got the same results, so that didn't do anything.


 Another problem you may have is if the commitLog directory is on the same
 hard drive as the data directory. If that's the case and you read and
 write at the same time, that may be a reason for poor read (and write)
 performance.


We also tested doing only reads, and got about the same read speeds.


 As for the row with 30 million columns, you have to be aware that right
 now, Cassandra will deserialize whole rows during compaction
 (http://wiki.apache.org/cassandra/CassandraLimitations).
 So depending on the size of what you store in your columns, you could very
 well hit that limitation (that could be why you OOM). In which case, I see
 two choices: 1) add more RAM to the machine, or 2) change your data
 structure to avoid it (maybe you can split rows with too many columns
 somehow?).


Splitting the rows would be an option if we got anything near decent speed
for small rows, but even if we only have a few hundred thousand columns in
one row, the read speed is still slow.

What kind of numbers are common for this type of operation? Say that you
have a row with 500 000 columns whose names range from 0x0 to 0x7A120, and
you do get_slice operations on that with ranges of random numbers in the
interval but with a fixed count of 1000, and that you multithread it with
~10 threads, can't you get more than 50 reads/s?
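
(For what it's worth, 0x7A120 is 500 000, so a setup like the one described
can be sketched as follows. This is illustrative only: it assumes the column
names are encoded as fixed-length big-endian byte arrays, as a stand-in for
whatever encoding the original test used, and it needs the generated Thrift
classes on the classpath:)

    import java.util.Random;
    import org.apache.cassandra.thrift.SlicePredicate;
    import org.apache.cassandra.thrift.SliceRange;

    class RandomSliceRanges {
        // Fixed-length big-endian encoding of the column name; 0 .. 500000 fits
        // comfortably in 4 bytes.
        static byte[] name(int i) {
            return new byte[] { (byte) (i >>> 24), (byte) (i >>> 16),
                                (byte) (i >>> 8), (byte) i };
        }

        // A random range inside [0, 500000] with a fixed slice count of 1000,
        // matching the benchmark described above.
        static SlicePredicate randomRange(Random rnd) {
            int start = rnd.nextInt(500000);
            int end = Math.min(500000, start + 1 + rnd.nextInt(10000));
            SlicePredicate p = new SlicePredicate();
            p.setSlice_range(new SliceRange(name(start), name(end), false, 1000));
            return p;
        }
    }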

When we've been reading up on Cassandra we've seen posts saying that billions
of columns in a row shouldn't be a problem, and sure enough, writing all that
data goes pretty fast, but as soon as you want to retrieve it, it is really
slow. We also tried doing counts on the number of columns in a row, and that
was really, really slow: it took half a minute to count the columns in a row
with 500 000 columns, and when doing the same on a row with millions, it just
crashed with an OOM exception after a few minutes.
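
(If the counting was done with the Thrift get_count call, the slowness is at
least consistent with how it works: it is not an O(1) operation, since the
columns have to be read in order to be counted. A minimal sketch, with
placeholder names and assuming a connected 0.6 client:)

    // get_count returns the number of columns in a row; the server still has to
    // read the columns to count them, so it is expensive on very wide rows.
    int columns = client.get_count(
            "Keyspace1", "someIndexId", new ColumnParent("Indexes"),
            ConsistencyLevel.ONE);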


/Henrik


Re: Range scan performance in 0.6.0 beta2

2010-03-25 Thread Nathan McCall
I noticed you turned key caching off in your ColumnFamily declaration;
have you tried turning it on and playing with the key cache
configuration? Also, have you looked at the JMX output to see which
commands are pending execution? That is always helpful to me in
hunting down bottlenecks.
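
(In the storage-conf.xml declaration quoted later in this thread, that would
just mean raising the KeysCached attribute, which in 0.6 can be given as an
absolute number or, as Henrik does elsewhere in the thread, as a percentage.
For example, with the 40% figure:)

    <ColumnFamily CompareWith="BytesType" Name="Indexes" RowsCached="0"
                  KeysCached="40%"/>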

-Nate

On Thu, Mar 25, 2010 at 9:31 AM, Henrik Schröder skro...@gmail.com wrote:
 On Thu, Mar 25, 2010 at 15:17, Sylvain Lebresne sylv...@yakaz.com wrote:

 I don't know if this plays any role, but if you have disabled the
 assertions when running Cassandra (that is, you removed the -ea line in
 cassandra.in.sh), there was a bug in 0.6beta2 that makes reads in rows
 with lots of columns quite slow.

 We tried it with beta3 and got the same results, so that didn't do anything.


 Another problem you may have is if the commitLog directory is on the same
 hard drive as the data directory. If that's the case and you read and
 write at the same time, that may be a reason for poor read (and write)
 performance.

 We also tested doing only reads, and got about the same read speeds.


 As for the row with 30 million columns, you have to be aware that right
 now, Cassandra will deserialize whole rows during compaction
 (http://wiki.apache.org/cassandra/CassandraLimitations).
 So depending on the size of what you store in your columns, you could very
 well hit that limitation (that could be why you OOM). In which case, I see
 two choices: 1) add more RAM to the machine, or 2) change your data
 structure to avoid it (maybe you can split rows with too many columns
 somehow?).

 Splitting the rows would be an option if we got anything near decent speed
 for small rows, but even if we only have a few hundred thousand columns in
 one row, the read speed is still slow.

 What kind of numbers are common for this type of operation? Say that you
 have a row with 500 000 columns whose names range from 0x0 to 0x7A120, and
 you do get_slice operations on that with ranges of random numbers in the
 interval but with a fixed count of 1000, and that you multithread it with
 ~10 threads, can't you get more than 50 reads/s?

 When we've been reading up on Cassandra we've seen posts saying that
 billions of columns in a row shouldn't be a problem, and sure enough,
 writing all that data goes pretty fast, but as soon as you want to retrieve
 it, it is really slow. We also tried doing counts on the number of columns
 in a row, and that was really, really slow: it took half a minute to count
 the columns in a row with 500 000 columns, and when doing the same on a row
 with millions, it just crashed with an OOM exception after a few minutes.


 /Henrik



Re: Range scan performance in 0.6.0 beta2

2010-03-25 Thread Sylvain Lebresne
On Thu, Mar 25, 2010 at 5:31 PM, Henrik Schröder skro...@gmail.com wrote:
 On Thu, Mar 25, 2010 at 15:17, Sylvain Lebresne sylv...@yakaz.com wrote:

 I don't know if this plays any role, but if you have disabled the
 assertions when running Cassandra (that is, you removed the -ea line in
 cassandra.in.sh), there was a bug in 0.6beta2 that makes reads in rows
 with lots of columns quite slow.

 We tried it with beta3 and got the same results, so that didn't do anything.

I'm not sure the patch made it into beta3. If you haven't removed the
assertions, then this is not your problem. If you have, I can only suggest
trying the svn branch for 0.6 (svn checkout
https://svn.apache.org/repos/asf/cassandra/branches/cassandra-0.6).
Just saying.




 Another problem you may have is if the commitLog directory is on the same
 hard drive as the data directory. If that's the case and you read and
 write at the same time, that may be a reason for poor read (and write)
 performance.

 We also tested doing only reads, and got about the same read speeds.


 As for the row with 30 million columns, you have to be aware that right
 now, Cassandra will deserialize whole rows during compaction
 (http://wiki.apache.org/cassandra/CassandraLimitations).
 So depending on the size of what you store in your columns, you could very
 well hit that limitation (that could be why you OOM). In which case, I see
 two choices: 1) add more RAM to the machine, or 2) change your data
 structure to avoid it (maybe you can split rows with too many columns
 somehow?).

 Splitting the rows would be an option if we got anything near decent speed
 for small rows, but even if we only have a few hundred thousand columns in
 one row, the read speed is still slow.

 What kind of numbers are common for this type of operation? Say that you
 have a row with 500 000 columns whose names range from 0x0 to 0x7A120, and
 you do get_slice operations on that with ranges of random numbers in the
 interval but with a fixed count of 1000, and that you multithread it with
 ~10 threads, can't you get more than 50 reads/s?

 When we've been reading up on Cassandra we've seen posts saying that
 billions of columns in a row shouldn't be a problem, and sure enough,
 writing all that data goes pretty fast, but as soon as you want to retrieve
 it, it is really slow. We also tried doing counts on the number of columns
 in a row, and that was really, really slow: it took half a minute to count
 the columns in a row with 500 000 columns, and when doing the same on a row
 with millions, it just crashed with an OOM exception after a few minutes.


 /Henrik



Re: Range scan performance in 0.6.0 beta2

2010-03-25 Thread Jonathan Ellis
On Thu, Mar 25, 2010 at 8:33 AM, Henrik Schröder skro...@gmail.com wrote:
 Hi everyone,

 We're trying to implement a virtual datastore for our users where they can
 set up tables and indexes to store objects and have them indexed on
 arbitrary properties. And we did a test implementation for Cassandra in the
 following way:

 Objects are stored in one columnfamily, each key is made up of tableid +
 object key, and each row has one column where the value is the serialized
 object. This part is super-simple, we're just using Cassandra as a
 key-value-store, and this part performs really well.

 The indexes are a bit trickier, but basically for each index and each object
 that is stored, we compute a fixed-length bytearray based on the object that
 makes up the indexvalue. We then store these bytearray indexvalues in another
 columnfamily, with the indexid as row key, the indexvalue as the column
 name, and the object key as the column value.

So all the values for an entire index will be in one row?  That
doesn't sound good.

You really want to put each index [and each table] in its own CF, but
until we can do that dynamically (0.7) you could at least make the
index row keys a tuple of (indexid, indexvalue) and the column names
in each row the object keys (empty column values).

This works pretty well for a lot of users, including Digg.
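
(A write under that layout, sketched against the 0.6 Thrift interface with
placeholder keyspace and key values and a connected client, would look
roughly like this:)

    // One index entry = one row keyed by (indexid, indexvalue); the column name
    // is the object key and the column value is empty.
    byte[] objectKey = "someObjectKey".getBytes();   // placeholder object key
    String rowKey = "42:0000a1b2";   // indexid ":" hex(indexvalue), made-up values
    ColumnPath path = new ColumnPath("Indexes");
    path.setColumn(objectKey);
    client.insert("Keyspace1", rowKey, path, new byte[0],
                  System.currentTimeMillis(), ConsistencyLevel.ONE);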

 We tested just the index part of our design, and these are the
 numbers we got:
 inserts (15 threads, batches of 10): 4000/second
 get_slices (10 threads, random range sizes, count 1000): 50/second at start,
 dies at about 6 million columns inserted. (OutOfMemoryException)
 get_slices (10 threads, random range sizes, count 10): 200/s at start, slows
 down the more columns there are.

Those are really low read numbers, but I'd make the schema change
above before digging deeper there.

Also, if you are OOMing, you're probably getting really crappy
performance for some time before that, as the JVM tries desperately to
collect enough space to keep going.  The easiest solution is to just
let it use more memory, assuming you can do so.
http://wiki.apache.org/cassandra/RunningCassandra
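
(In the 0.6 packaging the heap size lives next to the -ea flag Sylvain
mentioned, in bin/cassandra.in.sh; assuming that layout, raising it is just a
matter of editing the JVM options there, for example, with illustrative
values only:)

    # bin/cassandra.in.sh (0.6 layout); illustrative, not a recommendation
    JVM_OPTS="-ea -Xms1G -Xmx4G ..."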

-Jonathan