Re: Range scan performance in 0.6.0 beta2
On Fri, Mar 26, 2010 at 14:47, Jonathan Ellis jbel...@gmail.com wrote:
> On Fri, Mar 26, 2010 at 7:40 AM, Henrik Schröder skro...@gmail.com wrote:
>> For each indexvalue we insert a row where the key is indexid + ":" + indexvalue encoded as a hex string, and the row contains only one column, whose name is the object key encoded as a byte array and whose value is empty.
>
> It's a unique index then? And you're trying to read things ordered by the index, not just "give me keys that have a column with this value"?

Yes, because if we have more than one column per row, there's no way of (easily) limiting the result. As it is now, we rarely want all object keys associated with a range of indexvalues. However, this means we will have a lot of rows if we do it in Cassandra.

/Henrik
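For concreteness, the row-key scheme described above (one row per index entry, keyed by indexid + ":" + hex-encoded indexvalue) can be modeled in Python. This is an illustrative sketch, not the poster's actual C# code; `make_index_key` and the fixed-width integer encoding are assumptions:

```python
def make_index_key(index_id: str, index_value: bytes) -> str:
    """Build the row key 'indexid:hex(indexvalue)' described in the thread.

    With fixed-width values, hex encoding preserves byte order, so the
    OrderPreservingPartitioner sorts rows by index value."""
    return index_id + ":" + index_value.hex()

# Fixed-width big-endian integers keep numeric order and lexical key order
# aligned, which is what makes range scans over these keys meaningful.
values = [258, 3, 1_000_000]
keys = sorted(make_index_key("idx1", v.to_bytes(4, "big")) for v in values)
# keys[0] is the key for 3, the smallest value
```

Note the width must be fixed: variable-length encodings would break the order-preservation that the range scans rely on.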
Re: Range scan performance in 0.6.0 beta2
On Mon, Mar 29, 2010 at 4:06 AM, Henrik Schröder skro...@gmail.com wrote:
> On Fri, Mar 26, 2010 at 14:47, Jonathan Ellis jbel...@gmail.com wrote:
>> It's a unique index then? And you're trying to read things ordered by the index, not just "give me keys that have a column with this value"?
>
> Yes, because if we have more than one column per row, there's no way of (easily) limiting the result.

That's exactly what the count parameter of SliceRange is for... ?

-Jonathan
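A rough Python model of what the count parameter does (the semantics of the call, not the server implementation): a get_slice with a SliceRange returns at most `count` columns starting at `start`, so a multi-column row can be read in limited chunks after all:

```python
from bisect import bisect_left

def slice_range(column_names, start, count):
    """Model of SliceRange(start=start, finish=b'', count=count):
    up to `count` column names at or after `start`, in comparator order."""
    names = sorted(column_names)
    return names[bisect_left(names, start):][:count]

row = [b"a", b"c", b"e", b"g"]
slice_range(row, b"b", 2)   # -> [b"c", b"e"]
slice_range(row, b"", 3)    # -> [b"a", b"c", b"e"]  (empty start = row beginning)
```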
Re: Range scan performance in 0.6.0 beta2
> So all the values for an entire index will be in one row? That doesn't sound good. You really want to put each index [and each table] in its own CF, but until we can do that dynamically (0.7) you could at least make the index row keys a tuple of (indexid, indexvalue) and the column names in each row the object keys (empty column values). This works pretty well for a lot of users, including Digg.

We tested your suggestions like this:

- We're using the OrderPreservingPartitioner.
- We set the key cache and row cache to 40%.
- We're using the same machine as before, but we switched to a 64-bit JVM and gave it 5GB of memory.
- For each indexvalue we insert a row where the key is indexid + ":" + indexvalue encoded as a hex string, and the row contains only one column, whose name is the object key encoded as a byte array and whose value is empty.
- When reading, we do a get_range_slice with an empty slice_range (start and finish are 0-length byte arrays), randomly generated start_key and finish_key that we know have both been inserted, and a row_count of 1000.

These are the numbers we got this time:

inserts (15 threads, batches of 10): 4000/second
get_range_slices (10 threads, row_count 1000): 50/second at start, down to 10/second at 250k inserts

These numbers are slightly better than our previous OPP tries, but nothing significant. For what it's worth, if we're only doing writes, the machine bottlenecks on disk I/O as expected, but whenever we do reads, it bottlenecks on CPU usage instead. Is this expected?

Also, how would dynamic column families help us? In our tests we only tested a single index, so even if we had one column family per index, we would still only write to one of them and then get the exact same results as above, right?

We're really grateful for any help with both how to tune Cassandra and how to design our data model. The designs we've tested so far are the best we could come up with ourselves; all we really need is a way to store groups of indexvalue-to-objectkey mappings, and to get a range of objectkeys back given a group and a start and stop indexvalue.

/Henrik
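The read path tested above (get_range_slice over OPP-ordered row keys, capped at row_count) can be sketched as a plain-Python model of the call's semantics; the function name and sample keys are illustrative, not Cassandra internals:

```python
from bisect import bisect_left, bisect_right

def get_range_slice(row_keys, start_key, finish_key, row_count):
    """Rows whose keys fall in [start_key, finish_key] under the
    OrderPreservingPartitioner, capped at row_count."""
    keys = sorted(row_keys)
    lo = bisect_left(keys, start_key)
    hi = bisect_right(keys, finish_key)
    return keys[lo:hi][:row_count]

keys = ["idx1:0001", "idx1:0003", "idx1:0007", "idx1:000a"]
get_range_slice(keys, "idx1:0002", "idx1:0008", 1000)  # -> ["idx1:0003", "idx1:0007"]
```

The row_count cap is what makes "limiting the result" easy in the row-per-entry design; the trade-off, as noted in the thread, is very many small rows.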
Re: Range scan performance in 0.6.0 beta2
On Fri, Mar 26, 2010 at 7:40 AM, Henrik Schröder skro...@gmail.com wrote:
> For each indexvalue we insert a row where the key is indexid + ":" + indexvalue encoded as a hex string, and the row contains only one column, whose name is the object key encoded as a byte array and whose value is empty.

It's a unique index then? And you're trying to read things ordered by the index, not just "give me keys that have a column with this value"?

> These numbers are slightly better than our previous OPP tries, but nothing significant. For what it's worth, if we're only doing writes, the machine bottlenecks on disk I/O as expected, but whenever we do reads, it bottlenecks on CPU usage instead. Is this expected?

Yes.

> Also, how would dynamic column families help us?

You don't have to mess with key prefixes, since each CF contains only one type of index.

-Jonathan
Re: Range scan performance in 0.6.0 beta2
I don't know if it could play any role, but if you have disabled assertions when running Cassandra (that is, you removed the -ea line in cassandra.in.sh), there was a bug in 0.6beta2 that makes reads in rows with lots of columns quite slow.

Another problem you may have is if the commitlog directory is on the same hard drive as the data directory. If that's the case and you read and write at the same time, that may be a reason for poor read performance (and write performance too).

As for the row with 30 million columns, you have to be aware that right now Cassandra deserializes whole rows during compaction (http://wiki.apache.org/cassandra/CassandraLimitations). So depending on the size of what you store in your columns, you could very well hit that limitation (that could be why you OOM). In which case, I see two choices: 1) add more RAM to the machine, or 2) change your data structure to avoid it (maybe you can split rows with too many columns somehow?).

--
Sylvain

On Thu, Mar 25, 2010 at 2:33 PM, Henrik Schröder skro...@gmail.com wrote:
> Hi everyone,
>
> We're trying to implement a virtual datastore for our users, where they can set up tables and indexes to store objects and have them indexed on arbitrary properties. We did a test implementation for Cassandra in the following way:
>
> Objects are stored in one columnfamily; each key is made up of tableid + object key, and each row has one column whose value is the serialized object. This part is super simple, we're just using Cassandra as a key-value store, and it performs really well.
>
> The indexes are a bit trickier, but basically for each index and each object that is stored, we compute a fixed-length byte array, based on the object, that makes up the indexvalue. We then store these byte-array indexvalues in another columnfamily, with the indexid as row key, the indexvalue as the column name, and the object key as the column value.
>
> The idea is then that to perform a range query on an index in this virtual datastore, we do a get_slice to get the range of indexvalues and their corresponding object keys, and we can then multi_get the actual objects from the other column family.
>
> Since these virtual tables and indexes will be created by our users, the whole system has to be very dynamic, and we can't make any assumptions about the actual objects they will store or the distribution of these. We do know that it must be able to scale well, and this is what attracted us to Cassandra in the first place. We do however have some performance targets we want to hit: we have one use case where there will be about 30 million records in a table, and knowing that it can go up to 100 million records would be nice. As for speed, we would like to get thousands of writes and range reads per second. Given these requirements and our design, we will then have rows in Cassandra with millions of columns, from which we want to fetch large column slices.
>
> We set it all up on a single developer machine (MacPro, quad-core 2.66GHz) running Windows, and we used the Thrift compiler to generate a C# client library. We tested just the index part of our design, and these are the numbers we got:
>
> inserts (15 threads, batches of 10): 4000/second
> get_slices (10 threads, random range sizes, count 1000): 50/second at start, dies at about 6 million columns inserted (OutOfMemoryException)
> get_slices (10 threads, random range sizes, count 10): 200/second at start, slows down the more columns there are
>
> When we saw that the above results were bad, we tried a different approach, storing the indexvalues in the key instead, using the OrderPreservingPartitioner and get_range_slice to get ranges of rows, but we got even worse results:
>
> inserts (15 threads, in batches of 10): 4000/second
> get_range_slice (10 threads, random key ranges, count 1000): 20/second at start, 5/second with 30 million rows
>
> Finally, we did a similar test using MySQL instead, and then we got these numbers:
>
> inserts (15 threads, in batches of 10): 4000/second
> select (10 threads, limit 1000): 500/second
>
> So for us, the MySQL version delivers the speed that we want, but none of the scaling that Cassandra gives us.
>
> We set up our columnfamilies like this:
>
> <ColumnFamily CompareWith="BytesType" Name="Objects" RowsCached="0" KeysCached="0"/>
> <ColumnFamily CompareWith="BytesType" Name="Indexes" RowsCached="0" KeysCached="0"/>
>
> And we now have these questions:
> a) Is there a better way of structuring our data and building the virtual indexes?
> b) Are our Cassandra numbers too low, or is this the expected performance?
> c) Did we miss changing some important setting (in the conf XML or Java config), given that our rows are this large?
> d) Can we avoid hitting the OutOfMemoryException?
>
> /Henrik
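The two-step read path in the original design (get_slice on the index row, then multi_get on the objects columnfamily) can be sketched as a plain-Python model; the dicts stand in for the two columnfamilies, and all names here are illustrative:

```python
def range_query(index_row, objects, start, finish, count):
    """index_row: {indexvalue: objectkey} for one indexid (the Indexes CF);
    objects: {objectkey: serialized object} (the Objects CF).

    Step 1: slice the index row for values in [start, finish], capped at count.
    Step 2: multi-get the matching objects."""
    hits = sorted(v for v in index_row if start <= v <= finish)[:count]
    object_keys = [index_row[v] for v in hits]
    return {k: objects[k] for k in object_keys}

index_row = {b"\x01": "obj-a", b"\x05": "obj-b", b"\x09": "obj-c"}
objects = {"obj-a": "A", "obj-b": "B", "obj-c": "C"}
range_query(index_row, objects, b"\x02", b"\x09", 10)  # -> {"obj-b": "B", "obj-c": "C"}
```

Note this design concentrates one whole index into a single row, which is exactly the hotspot the replies below push back on.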
Re: Range scan performance in 0.6.0 beta2
On Thu, Mar 25, 2010 at 15:17, Sylvain Lebresne sylv...@yakaz.com wrote:
> I don't know if it could play any role, but if you have disabled assertions when running Cassandra (that is, you removed the -ea line in cassandra.in.sh), there was a bug in 0.6beta2 that makes reads in rows with lots of columns quite slow.

We tried it with beta3 and got the same results, so that didn't do anything.

> Another problem you may have is if the commitlog directory is on the same hard drive as the data directory. If that's the case and you read and write at the same time, that may be a reason for poor read performance (and write performance too).

We also tested doing only reads, and got about the same read speeds.

> As for the row with 30 million columns, you have to be aware that right now Cassandra deserializes whole rows during compaction (http://wiki.apache.org/cassandra/CassandraLimitations). So depending on the size of what you store in your columns, you could very well hit that limitation (that could be why you OOM). In which case, I see two choices: 1) add more RAM to the machine, or 2) change your data structure to avoid it (maybe you can split rows with too many columns somehow?).

Splitting the rows would be an option if we got anything near decent speed for small rows, but even if we only have a few hundred thousand columns in one row, the read speed is still slow. What kind of numbers are common for this type of operation? Say you have a row with 500,000 columns whose names range from 0x0 to 0x7A120, and you do get_slice operations on it with ranges of random numbers in that interval but a fixed count of 1000, multithreaded with ~10 threads; can't you get more than 50 reads/second? When we were reading up on Cassandra we saw posts saying that billions of columns in a row shouldn't be a problem, and sure enough, writing all that data goes pretty fast, but as soon as you want to retrieve it, it is really slow.

We also tried doing counts on the number of columns in a row, and that was really, really slow: it took half a minute to count the columns in a row with 500,000 columns, and when doing the same on a row with millions, it just crashed with an OOM exception after a few minutes.

/Henrik
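One client-side workaround for slow whole-row counts (which, per the thread, appear to load the entire row server-side) is to page through the row in bounded slices and count as you go. This sketch models the pattern on a plain sorted list rather than live Thrift calls; each loop iteration stands in for one get_slice with a moving start:

```python
from bisect import bisect_right

def count_by_paging(names_sorted, page_size=1000):
    """Count a wide row via repeated bounded slices: each round models a
    get_slice asking for up to page_size column names strictly after the
    last name seen, instead of one huge count over the whole row."""
    total = 0
    last = b""  # empty start means "from the beginning of the row"
    while True:
        page = names_sorted[bisect_right(names_sorted, last):][:page_size]
        if not page:
            return total
        total += len(page)
        last = page[-1]

count_by_paging(sorted(bytes([i]) for i in range(250)), page_size=100)  # -> 250
```

Memory use on the client stays bounded by page_size; against a real cluster, each page would still cost one round trip.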
Re: Range scan performance in 0.6.0 beta2
I noticed you turned key caching off in your ColumnFamily declaration. Have you tried experimenting with turning it on and tuning the key-caching configuration? Also, have you looked at the JMX output for what commands are pending execution? That is always helpful to me in hunting down bottlenecks.

-Nate

On Thu, Mar 25, 2010 at 9:31 AM, Henrik Schröder skro...@gmail.com wrote:
> On Thu, Mar 25, 2010 at 15:17, Sylvain Lebresne sylv...@yakaz.com wrote:
>> I don't know if it could play any role, but if you have disabled assertions when running Cassandra (that is, you removed the -ea line in cassandra.in.sh), there was a bug in 0.6beta2 that makes reads in rows with lots of columns quite slow.
>
> We tried it with beta3 and got the same results, so that didn't do anything.
>
>> Another problem you may have is if the commitlog directory is on the same hard drive as the data directory. If that's the case and you read and write at the same time, that may be a reason for poor read performance (and write performance too).
>
> We also tested doing only reads, and got about the same read speeds.
>
>> As for the row with 30 million columns, you have to be aware that right now Cassandra deserializes whole rows during compaction (http://wiki.apache.org/cassandra/CassandraLimitations). So depending on the size of what you store in your columns, you could very well hit that limitation (that could be why you OOM). In which case, I see two choices: 1) add more RAM to the machine, or 2) change your data structure to avoid it (maybe you can split rows with too many columns somehow?).
>
> Splitting the rows would be an option if we got anything near decent speed for small rows, but even if we only have a few hundred thousand columns in one row, the read speed is still slow. What kind of numbers are common for this type of operation? Say you have a row with 500,000 columns whose names range from 0x0 to 0x7A120, and you do get_slice operations on it with ranges of random numbers in that interval but a fixed count of 1000, multithreaded with ~10 threads; can't you get more than 50 reads/second? When we were reading up on Cassandra we saw posts saying that billions of columns in a row shouldn't be a problem, and sure enough, writing all that data goes pretty fast, but as soon as you want to retrieve it, it is really slow.
>
> We also tried doing counts on the number of columns in a row, and that was really, really slow: it took half a minute to count the columns in a row with 500,000 columns, and when doing the same on a row with millions, it just crashed with an OOM exception after a few minutes.
>
> /Henrik
Re: Range scan performance in 0.6.0 beta2
On Thu, Mar 25, 2010 at 5:31 PM, Henrik Schröder skro...@gmail.com wrote:
> On Thu, Mar 25, 2010 at 15:17, Sylvain Lebresne sylv...@yakaz.com wrote:
>> I don't know if it could play any role, but if you have disabled assertions when running Cassandra (that is, you removed the -ea line in cassandra.in.sh), there was a bug in 0.6beta2 that makes reads in rows with lots of columns quite slow.
>
> We tried it with beta3 and got the same results, so that didn't do anything.

I'm not sure the patch made it into beta3. If you haven't removed the assertions, then it's not your problem. If you have, I can only suggest you try the svn branch for 0.6 (svn checkout https://svn.apache.org/repos/asf/cassandra/branches/cassandra-0.6). Just saying.

>> Another problem you may have is if the commitlog directory is on the same hard drive as the data directory. If that's the case and you read and write at the same time, that may be a reason for poor read performance (and write performance too).
>
> We also tested doing only reads, and got about the same read speeds.
>
>> As for the row with 30 million columns, you have to be aware that right now Cassandra deserializes whole rows during compaction (http://wiki.apache.org/cassandra/CassandraLimitations). So depending on the size of what you store in your columns, you could very well hit that limitation (that could be why you OOM). In which case, I see two choices: 1) add more RAM to the machine, or 2) change your data structure to avoid it (maybe you can split rows with too many columns somehow?).
>
> Splitting the rows would be an option if we got anything near decent speed for small rows, but even if we only have a few hundred thousand columns in one row, the read speed is still slow. What kind of numbers are common for this type of operation? Say you have a row with 500,000 columns whose names range from 0x0 to 0x7A120, and you do get_slice operations on it with ranges of random numbers in that interval but a fixed count of 1000, multithreaded with ~10 threads; can't you get more than 50 reads/second? When we were reading up on Cassandra we saw posts saying that billions of columns in a row shouldn't be a problem, and sure enough, writing all that data goes pretty fast, but as soon as you want to retrieve it, it is really slow.
>
> We also tried doing counts on the number of columns in a row, and that was really, really slow: it took half a minute to count the columns in a row with 500,000 columns, and when doing the same on a row with millions, it just crashed with an OOM exception after a few minutes.
>
> /Henrik
Re: Range scan performance in 0.6.0 beta2
On Thu, Mar 25, 2010 at 8:33 AM, Henrik Schröder skro...@gmail.com wrote:
> Hi everyone,
>
> We're trying to implement a virtual datastore for our users, where they can set up tables and indexes to store objects and have them indexed on arbitrary properties. We did a test implementation for Cassandra in the following way:
>
> Objects are stored in one columnfamily; each key is made up of tableid + object key, and each row has one column whose value is the serialized object. This part is super simple, we're just using Cassandra as a key-value store, and it performs really well.
>
> The indexes are a bit trickier, but basically for each index and each object that is stored, we compute a fixed-length byte array, based on the object, that makes up the indexvalue. We then store these byte-array indexvalues in another columnfamily, with the indexid as row key, the indexvalue as the column name, and the object key as the column value.

So all the values for an entire index will be in one row? That doesn't sound good. You really want to put each index [and each table] in its own CF, but until we can do that dynamically (0.7) you could at least make the index row keys a tuple of (indexid, indexvalue) and the column names in each row the object keys (empty column values). This works pretty well for a lot of users, including Digg.

> We tested just the index part of our design, and these are the numbers we got:
>
> inserts (15 threads, batches of 10): 4000/second
> get_slices (10 threads, random range sizes, count 1000): 50/second at start, dies at about 6 million columns inserted (OutOfMemoryException)
> get_slices (10 threads, random range sizes, count 10): 200/second at start, slows down the more columns there are

Those are really low read numbers, but I'd make the schema change above before digging deeper there. Also, if you are OOMing, you're probably getting really crappy performance for some time before that, as the JVM tries desperately to collect enough space to keep going. The easiest solution is to just let it use more memory, assuming you can do so: http://wiki.apache.org/cassandra/RunningCassandra

-Jonathan
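On the memory point: in 0.6 the JVM heap is set via the options in bin/cassandra.in.sh (see the RunningCassandra wiki page linked above). A hedged sketch of what raising it looks like; the exact flag list varies by build, and the 4G figure is only an example, not a recommendation:

```shell
# Illustrative fragment of bin/cassandra.in.sh: raise -Xmx so compaction of
# very wide rows has more headroom. Keep -ea unless you know the beta2
# assertion-related slow-read bug is fixed in your build.
JVM_OPTS="-ea -Xms1G -Xmx4G -XX:+UseConcMarkSweepGC"
```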