understanding the cassandra storage scaling
I have a very basic question which I have been unable to find in online documentation on Cassandra. It seems like every node in a Cassandra cluster contains all the data ever stored in the cluster (i.e., all nodes are identical). I don't understand how you can scale this on commodity servers with merely internal hard disks. In other words, if I want to store 5 TB of data, does each node need a hard disk capacity of 5 TB? With HBase, memcached and other NoSQL solutions it is clearer how data is split up in the cluster and replicated for fault tolerance. Again, please excuse the rather basic question.
Re: understanding the cassandra storage scaling
There are two numbers to look at: N, the number of hosts in the ring (cluster), and R, the number of replicas for each data item. R is configurable per column family. Typically for large clusters N > R. For very small clusters it makes sense for R to be close to N, in which case Cassandra is useful so the database doesn't have a single point of failure, but not so much because of the size of the data. But for large clusters it rarely makes sense to have N=R; usually N > R. -- /Ran
unsubscribe
Massimo Carro www.liquida.it - www.liquida.com
Re: understanding the cassandra storage scaling
Thanks Ran. This helps a little, but unfortunately it's still a bit fuzzy to me. So is it not true that each node contains all the data in the cluster? I haven't come across any information on how clustered data is coordinated in Cassandra. How does my query get directed to the right node?
Re: understanding the cassandra storage scaling
> So is it not true that each node contains all the data in the cluster?

No, not in the general case; in fact it is rarely the case. Usually R < N. In my case I have N=6 and R=2. You configure R per CF under ReplicationFactor (v0.6.*) or replication_factor (v0.7.*). http://wiki.apache.org/cassandra/StorageConfiguration -- /Ran
Re: understanding the cassandra storage scaling
> This helps a little but unfortunately it's still a bit fuzzy to me. So is it not true that each node contains all the data in the cluster?

Not at all. Basically each node is responsible for only a part of the data (a range, really). But for each piece of data you can choose how many nodes it lives on; this is the Replication Factor. For instance, if you choose RF=1, each piece of data will be on exactly one node (this is usually a bad idea since it offers very weak durability guarantees, but nevertheless it can be done). If you choose RF=3, each piece of data is on 3 nodes (independently of the number of nodes your cluster has). You can have all data on all nodes, but for that you'll have to choose RF=#{nodes in the cluster}. But this is a very degenerate case.

> how does my query get directed to the right node?

Each node in the cluster knows the ranges of data each other node holds. I suggest you watch the first video linked on this page: http://wiki.apache.org/cassandra/ArticlesAndPresentations It explains this and more. -- Sylvain
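A minimal sketch of the mechanism Sylvain describes (plain Python, not Cassandra code; it assumes RandomPartitioner's 0..2**127 token space and SimpleStrategy-style placement): each key hashes to a token, and the RF nodes that follow that token clockwise around the ring hold the replicas.

```python
import hashlib
from bisect import bisect_right

RING_SIZE = 2 ** 127  # RandomPartitioner token space (illustrative assumption)

def key_token(key: str) -> int:
    # RandomPartitioner hashes row keys with MD5 onto the ring.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % RING_SIZE

def replicas(key: str, node_tokens: dict, rf: int) -> list:
    """The rf nodes responsible for key: the first rf nodes found
    walking clockwise from the key's token (SimpleStrategy-style)."""
    ring = sorted(node_tokens.items(), key=lambda kv: kv[1])
    tokens = [t for _, t in ring]
    start = bisect_right(tokens, key_token(key)) % len(ring)
    return [ring[(start + i) % len(ring)][0] for i in range(rf)]

# Ran's setup: N=6 nodes, R=2 -- every key lives on exactly 2 of the 6 nodes.
nodes = {"node%d" % i: i * RING_SIZE // 6 for i in range(6)}
print(replicas("user:42", nodes, rf=2))
```

Any node can act as coordinator because every node knows this ring layout, which is how a query gets directed to the right nodes.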
Re: understanding the cassandra storage scaling
Awesome! Thank you guys for the really quick answers and the links to the presentations.
N to N relationships
Hello,

For a specific case, we are thinking about representing an N-to-N relationship with an NxN matrix in Cassandra. The relations will exist only between a subset of elements, so the matrix will mostly contain empty elements. We have a set of questions concerning this:

- What is the best way to represent this matrix? What would have the best performance in reading? In writing?
  . a super column family with n column families, with n columns each
  . a column family with n columns and n lines

In the second case, we would need to extract 2 kinds of information:
- all the relations for a line: this should pose no specific problem;
- all the relations for a column: in that case we would need an index for the columns, right? And then get all the lines where the value of the column in question is not null... Is that the correct way to do it?

When using indexes, say we want to add another element N+1. What impact in terms of time would it have on the indexation job?

Thanks a lot for the answers, Best regards, Sébastien Druon
Re: N to N relationships
How about a regular CF where keys are n...@n ? Then, getting a matrix row would be the same cost as getting a matrix column (N gets), and it would be very easy to add element N+1.
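A dict-based sketch of David's layout (not a real Cassandra client; reading the archive-masked key "n...@n" as a row@column composite key is an assumption):

```python
# One CF row per matrix cell, keyed "row@col"; a dict stands in for the CF.
matrix = {}

def cell_key(r: int, c: int) -> str:
    return "%d@%d" % (r, c)

def put(r: int, c: int, value: str) -> None:
    matrix[cell_key(r, c)] = value

def get_row(r: int, n: int) -> dict:
    # A whole matrix row costs N point lookups ("N gets")...
    return {c: matrix[cell_key(r, c)] for c in range(n) if cell_key(r, c) in matrix}

def get_col(c: int, n: int) -> dict:
    # ...and a whole column costs the same N gets: rows and columns are
    # symmetric, and adding element N+1 requires no reindexing at all.
    return {r: matrix[cell_key(r, c)] for r in range(n) if cell_key(r, c) in matrix}

put(0, 1, "related")
print(get_row(0, n=3), get_col(1, n=3))
```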
Secondary indexes change everything?
It seems to me that secondary indexes (new in 0.7) change everything when it comes to data modeling.

- OOP becomes obsolete
- primary indexes become obsolete if you ever want to do a range query (which you probably will...), better to assign a random row id

Taken together, it's likely that very little will remain of your old database schema... Am I right?
Re: Secondary indexes change everything?
- OPP becomes obsolete (OOP is not obsolete!)
- primary indexes become obsolete if you ever want to do a range query (which you probably will...), better to assign a random row id

Taken together, it's likely that very little will remain of your old database schema... Am I right?
Quorum: killing 1 out of 3 server kills the cluster (?)
Hi! I've 3 servers running (0.7rc1) with a replication_factor of 2 and use quorum for writes. But when I shut down one of them, UnavailableExceptions are thrown. Why is that? Isn't that the point of quorum and a fault-tolerant DB, that it continues with the remaining 2 nodes and redistributes the data to the broken one as soon as it's up again? What may I be doing wrong? thx tcn
Re: Quorum: killing 1 out of 3 server kills the cluster (?)
Hi, the UnavailableExceptions will be thrown because a quorum of size 2 needs at least 2 nodes to be alive (as does a quorum of size 3). The data won't be automatically redistributed to other nodes. Thibaut
Re: unsubscribe
On Thu, 2010-12-09 at 11:42 +0100, Massimo Carro wrote: Massimo Carro www.liquida.it - www.liquida.com http://wiki.apache.org/cassandra/FAQ#unsubscribe -- Eric Evans eev...@rackspace.com
Re: Quorum: killing 1 out of 3 server kills the cluster (?)
Quorum is really only useful when RF > 2, since for a quorum to succeed RF/2+1 replicas must be available. This means for RF = 2, consistency levels QUORUM and ALL yield the same result. /d
Re: Quorum: killing 1 out of 3 server kills the cluster (?)
On Dec 9, 2010, at 16:50, Daniel Lundin wrote:
> Quorum is really only useful when RF > 2, since for a quorum to succeed RF/2+1 replicas must be available.

2/2+1==2 and I killed 1 of 3, so... don't get it.
RE: Quorum: killing 1 out of 3 server kills the cluster (?)
With 3 nodes and RF=2 you have 3 replica sets: N1+N2, N2+N3 and N3+N1. Killing N1 leaves only one fully alive set, N2+N3. For the other two sets (N1+N2 and N3+N1) only one of the two replicas is up, and a quorum of 2 out of 2 is actually ALL, so quorum requests against those sets fail.
Re: Quorum: killing 1 out of 3 server kills the cluster (?)
It's 2 out of the number of replicas, not the number of nodes. At RF=2, you have 2 replicas. And since quorum is also 2 with that replication factor, you cannot lose a node, otherwise some queries will end up as UnavailableException. Again, this is not related to the total number of nodes. Even with 200 nodes, if you use RF=2, you will have some queries that fail (although much less than what you are probably seeing).
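The arithmetic behind this, as a quick plain-Python sketch:

```python
def quorum(rf: int) -> int:
    # Quorum is derived from the replication factor, not the cluster size.
    return rf // 2 + 1

def quorum_succeeds(rf: int, replicas_up: int) -> bool:
    return replicas_up >= quorum(rf)

# RF=2: quorum is 2, so losing either replica fails QUORUM (same as ALL)...
print(quorum(2), quorum_succeeds(rf=2, replicas_up=1))  # 2 False
# ...while RF=3 tolerates one dead replica, whatever the node count.
print(quorum(3), quorum_succeeds(rf=3, replicas_up=2))  # 2 True
```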
Re: Quorum: killing 1 out of 3 server kills the cluster (?)
In other words, if you want to use QUORUM, you need to set RF=3. (I know because I had exactly the same problem.)
Re: Quorum: killing 1 out of 3 server kills the cluster (?)
On Dec 9, 2010, at 17:39, David Boxenhorn wrote:
> In other words, if you want to use QUORUM, you need to set RF=3.

I naively assume that if I kill either node that holds N1 (i.e. node 1 or 3), N1 will still remain on another node. Only if both fail do I actually lose data. But apparently this is not how it works...
Re: Quorum: killing 1 out of 3 server kills the cluster (?)
If that is what you want, use CL=ONE.
Re: Quorum: killing 1 out of 3 server kills the cluster (?)
On Thu, Dec 9, 2010 at 10:43 AM, Timo Nentwig wrote:
> I naively assume that if I kill either node that holds N1 (i.e. node 1 or 3), N1 will still remain on another node. Only if both fail do I actually lose data. But apparently this is not how it works...

No, this is correct: killing one node with a replication factor of 2 will not cause you to lose data. You are requiring a consistency level higher than what is available. Change your app to use CL.ONE and all data will be available even with one machine unavailable.
Re: Quorum: killing 1 out of 3 server kills the cluster (?)
> I naively assume that if I kill either node that holds N1 (i.e. node 1 or 3), N1 will still remain on another node. Only if both fail, I actually lose data. But apparently this is not how it works...

Sure, the data that N1 holds is also on another node and you won't lose it by only losing N1. But when you do a quorum query, you are saying to Cassandra: "Please, please, fail my request if you can't get a response from 2 nodes." So if only 1 node holding the data is up at the moment of the query, then Cassandra, which is very polite software, does what you asked and fails. If you want Cassandra to send you an answer with only one node up, use CL=ONE (as said by David).
Cassandra and disk space
I recently ran into a problem during a repair operation where my nodes completely ran out of space and my whole cluster was... well, clusterfucked. I want to make sure I know how to prevent this problem in the future. Should I make sure that at all times every node is under 50% of its disk space? Are there any normal day-to-day operations that would cause any one node to double in size that I should be aware of? If one or more nodes surpass the 50% mark, what should I plan to do? Thanks for any advice
Re: Quorum: killing 1 out of 3 server kills the cluster (?)
On Dec 9, 2010, at 17:55, Sylvain Lebresne wrote:
> Sure, the data that N1 holds is also on another node and you won't lose it by only losing N1. But when you do a quorum query, you are saying to Cassandra: fail my request if you can't get a response from 2 nodes.

And my application would fall back to ONE. Quorum writes will also fail, so I would also use ONE so that the app stays up. What would I have to do to make the data redistribute when the broken node is up again? Simply call nodetool repair on it?
Re: N to N relationships
Thanks a lot for the answer. What about the indexing when adding a new element? Is it incremental? Thanks again
Re: N to N relationships
What do you mean by indexing?
Re: Secondary indexes change everything?
OPP is not yet obsolete. The included secondary indexes still aren't good at finding keys for ranges of indexed values, such as name > 'b' and name < 'c'. This is something that an OPP index would be good at. Of course, you can do something similar with one or more rows, so it's not that big of an advantage for OPP. If you can make primary indexes useful, you might as well -- no reason to throw that away. The main thing that the secondary index support does is relieve you from having to write all of the indexing code and CFs by hand. - Tyler
Re: Quorum: killing 1 out of 3 server kills the cluster (?)
If you switch your writes to CL ONE when a failure occurs, you might as well use ONE for all writes. ONE and QUORUM behave the same when all nodes are working correctly. - Tyler
Re: Quorum: killing 1 out of 3 server kills the cluster (?)
> What would I have to do to make the data redistribute when the broken node is up again? Simply call nodetool repair on it?

There are 3 mechanisms for that:
- hinted handoff: basically, when the node is back up, the other nodes will send it what it missed.
- read repair: whenever you read data and an inconsistency is detected (because one node is not up to date), it gets repaired.
- calling nodetool repair

The first two are automatic; you have nothing to do. Nodetool repair is usually run only periodically (say, once a week) to repair cold data that wasn't dealt with by the first two mechanisms. -- Sylvain
Re: Cassandra and disk space
> I recently ran into a problem during a repair operation where my nodes completely ran out of space and my whole cluster was... well, clusterfucked. I want to make sure I know how to prevent this problem in the future.

Depending on which version you're on, you may be seeing this: https://issues.apache.org/jira/browse/CASSANDRA-1674

But regardless, disk space variations are a fact of life with Cassandra. Off the top of my head I'm not ready to say what the expectations are with respect to repair under all circumstances. Anyone?

> Should I make sure that at all times every node is under 50% of its disk space? Are there any normal day-to-day operations that would cause any one node to double in size that I should be aware of? If one or more nodes surpass the 50% mark, what should I plan to do?

Major compactions can potentially double the amount of disk used, if you have a single large column family that contributes almost all disk space. For such clusters, regular background compaction can indeed cause a doubling when the compaction happens to be a major one (i.e., happens to include all sstables).

-- / Peter Schuller
Re: Cassandra and disk space
If you are on 0.6, repair is particularly dangerous with respect to disk space usage. If your replica is sufficiently out of sync, you can triple your disk usage pretty easily. This has been improved in 0.7, so repairs should use about half as much disk space, on average.

In general, yes, keep your nodes under 50% disk usage at all times. Any of: compaction, cleanup, snapshotting, repair, or bootstrapping (the latter two are improved in 0.7) can double your disk usage temporarily. You should plan to add more disk space or add nodes when you get close to this limit. Once you go over 50%, it's more difficult to add nodes, at least in 0.6. - Tyler
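Put as back-of-the-envelope arithmetic (a rough sketch of the 50% rule above, not an exact Cassandra formula):

```python
def headroom_ok(live_data_gb: float, disk_gb: float, worst_case_factor: float = 2.0) -> bool:
    # Compaction, cleanup, snapshotting, repair, or bootstrap may temporarily
    # need up to worst_case_factor x the live data size on disk.
    return live_data_gb * worst_case_factor <= disk_gb

print(headroom_ok(live_data_gb=200, disk_gb=500))                         # True: under 50%
print(headroom_ok(live_data_gb=300, disk_gb=500))                         # False: past the 50% mark
print(headroom_ok(live_data_gb=200, disk_gb=500, worst_case_factor=3.0))  # False: 0.6 repair worst case
```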
Re: N to N relationships
I mean if I have secondary indexes. Apparently they are calculated in the background...
Re: Cassandra and disk space
Are there any plans to improve this in the future? For big data clusters this could be very expensive. Based on your comment, I would need 200TB of storage for 100TB of data to keep Cassandra running. -- Rustam.
Stuck with adding nodes
Hi good people.

I underestimated load during peak times and now I'm stuck with our production cluster. Right now it's 3 nodes, RF 3, so everything is everywhere. We have ~300GB data load, ~10MB/sec incoming traffic and ~50 (peak) reads/sec to the cluster. The problem derives from our quorum reads/writes: at peak hours one of the machines (that's random) will fall behind because it's a little slower than the others, and shortly after that it will drop most read requests. So right now the only way to survive is to take one machine down, making every read/write an ALL operation. It's necessary to take one machine down because otherwise users will wait for timeouts from that overwhelmed machine when the client lib chooses it. Since we are a real-time oriented thing, that's a killer.

So now we tried to add 2 more nodes. Problem is that anticompaction takes too long, meaning it is not done when peak hour arrives and the machine that would stream the data to the new node must be taken down. We tried to block ports 7000 and 9160 to that machine because we hoped that would stop traffic and let the machine finish anticompaction. But that did not work because we could not cut the already existing connections to the other nodes.

Currently I am copying all data files (that's all existing data) from one node to the new nodes in the hope that I can then manually assign them their new token range (nodetool move) and do cleanup. Obviously I will try this tomorrow (it's been a long day) on a test system, but any advice would be highly appreciated. Sighs and thanks. Daniel smeet.com Berlin
Re: Stuck with adding nodes
> Currently I am copying all data files (that's all existing data) from one node to the new nodes in the hope that I can then manually assign them their new token range (nodetool move) and do cleanup.

Unless I'm misunderstanding you, I believe you should be setting the initial token. nodetool move would be for a node already in the ring. And keep in mind that a nodetool move is currently a decommission+bootstrap - so if you're teetering on the edge of overload you will want to keep that in mind when moving a node, to avoid ending up in a worse situation as another node temporarily receives more load than usual as a result of increased ring ownership.

> Obviously I will try this tomorrow (it's been a long day) on a test system but any advice would be highly appreciated.

One possibility, if you have additional hardware to spare temporarily, is to add more nodes than you actually need. Then, once you are significantly over capacity, you have the flexibility to move nodes around to an optimum position and then decommission those machines that were only borrowed. I.e., initial bootstrap of nodes takes a shorter amount of time because you're giving them less token space per new node. And once all are in the ring, you're free to move things around and then free up the hardware.

(Another option may be to implement throttling of the anti-compaction so that it runs very slowly during peak hours, but that requires patching cassandra or else firewall/packet filtering fu, and is probably likely to be more risky than it's worth.)

-- / Peter Schuller
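For the initial-token route, the usual balanced-ring recipe for RandomPartitioner is token_i = i * 2**127 / N. A sketch (the spacing formula is the standard community recipe, not something stated in this thread):

```python
def balanced_tokens(n: int) -> list:
    # Evenly spaced tokens over RandomPartitioner's 0..2**127 space.
    return [i * 2 ** 127 // n for i in range(n)]

# Growing a 3-node ring to 5: bootstrap the new nodes at tokens the old
# layout doesn't use, then move the existing nodes to the remaining
# positions (and run cleanup) once the ring is stable.
old = balanced_tokens(3)
new = balanced_tokens(5)
print([t for t in new if t not in old])
```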
Re: Cassandra and disk space
That depends on your scenario. In the worst case of one big CF, there's not much that can be easily done for the disk usage of compaction and cleanup (which is essentially compaction). If, instead, you have several column families and no single CF makes up the majority of your data, you can push your disk usage a bit higher.

A fundamental idea behind Cassandra's architecture is that disk space is cheap (which, indeed, it is). If you are particularly sensitive to this, Cassandra might not be the best solution to your problem. Also keep in mind that Cassandra performs well with average disks, so you don't need to spend a lot there. Additionally, most people find that the replication protects their data enough to allow them to use RAID 0 instead of 1, 10, 5, or 6. - Tyler
Re: Cassandra and disk space
I recently finished a practice expansion of 4 nodes to 5 nodes, a series of nodetool move, nodetool cleanup and JMX GC steps. I found that in some of the steps, disk usage actually grew to 2.5x the base data size on one of the nodes. I'm using 0.6.4. -scott
Re: N to N relationships
Am assuming you have one matrix and you know the dimensions. Also, as you say, the most important queries are to get an entire column or an entire row. I would consider using a standard CF for the columns and one for the rows. The key for each would be the col / row number, each Cassandra column name would be the id of the other dimension, and the value whatever you want.

- when storing the data, update both the Column and Row CF
- reading a whole row/col is simply a read from the appropriate CF
- reading an intersection is a get_slice to either the col or row CF, using the column_names field to identify the other dimension

You would not need secondary indexes to serve these queries. Hope that helps. Aaron
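A dict-based sketch of Aaron's two-CF layout (not a real Cassandra client), trading a second write for whole-row and whole-column reads that each cost a single lookup instead of N gets:

```python
rows = {}     # stands in for the Rows CF: row id -> {col id: value}
columns = {}  # stands in for the Columns CF: col id -> {row id: value}

def put(r: int, c: int, value: str) -> None:
    # Every write is denormalized into both CFs.
    rows.setdefault(r, {})[c] = value
    columns.setdefault(c, {})[r] = value

def intersection(r: int, c: int):
    # Equivalent of a get_slice on either CF naming one column.
    return rows.get(r, {}).get(c)

put(0, 1, "related")
print(rows.get(0), columns.get(1), intersection(0, 1))
```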
Re: Secondary indexes change everything?
On Thu, Dec 9, 2010 at 12:16 PM, David Boxenhorn da...@lookin2.com wrote:
> What do you mean by, "The included secondary indexes still aren't good at finding keys for ranges of indexed values, such as name > 'b' and name < 'c'"? Do you mean that secondary indexes don't support range queries at all?

http://www.riptano.com/blog/whats-new-cassandra-07-secondary-indexes

-- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Re: Running multiple instances on a single server --micrandra ??
Overall, I don't think this is a crazy idea, though I think I'd prefer Cassandra to manage this setup. The problem you will run into is that because the storage port is assumed to be the same across the cluster, you'll only be able to do this if you can assign multiple IPs to each server (one for each process). (I know this because I proposed something similar last year :)) -ryan

On Tue, Dec 7, 2010 at 10:00 PM, Jonathan Ellis jbel...@gmail.com wrote: The major downside is you're going to want to let each instance have its own dedicated commitlog spindle too, unless you just don't have many updates.

On Tue, Dec 7, 2010 at 8:25 PM, Edward Capriolo edlinuxg...@gmail.com wrote: I am quite ready to be stoned for this thread, but I have been thinking about this for a while and I just wanted to bounce these ideas off some gurus. Cassandra does allow multiple data directories, but as far as I can tell no one runs in this configuration. This is something that is very different between the HBase architecture and the Cassandra architecture. HBase borrows the concept of JBOD configurations from Hadoop. HBase has many small-ish (~256 MB) regions managed with ZooKeeper. Cassandra has a few (1 per node) large, node-sized token ranges managed by gossip consensus.

Let's say a node has six 300 GB disks. You have the options of RAID5, RAID6, RAID10, or RAID0. The problem I have found with these configurations is that major compactions (or even large minor ones) can take a long time. Even if your disk is not heavily utilized, this is a lot of data to move through. Thus node joins take a long time. Node moves take a long time. The idea behind micrandra is, for a 6-disk system, run 6 instances of Cassandra, one per disk. Use the RackAwareSnitch to make sure no replicas live on the same node.

The downsides:
1) we would have to manage 6x the instances of cassandra
2) we would have some overhead for each JVM

The upsides?
1) a disk/instance failure only degrades the overall performance by 1/6th (with RAID0 you lose the entire node; RAID5 still takes a hit when down a disk)
2) moves and joins have less work to do
3) can scale up a single node by adding a single disk to an existing system (assuming the ram and cpu load is light)
4) OPP would be easier to balance out hot spots (maybe not relevant if not using OPP)

What does everyone think? Does it ever make sense to run this way?

-- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Obscured question about data size in a Column Family
Hi there,

Quoting information in the wiki about Cassandra limitations (http://wiki.apache.org/cassandra/CassandraLimitations): "... So all the data from a given columnfamily/key pair had to fit in memory, or 2GB ..."

Does this mean:
1. a ColumnFamily can only be 2GB of data, or
2. a Column (key/pair) can only be 2GB of data?

Thanks for the explanation. -- http://twitter.com/jpartogi http://twitter.com/scrum8
Re: Cassandra and disk space
> That depends on your scenario. In the worst case of one big CF, there's not much that can be easily done for the disk usage of compaction and cleanup (which is essentially compaction). If, instead, you have several column families and no single CF makes up the majority of your data, you can push your disk usage a bit higher.

Is there any formula to calculate this? Let's say I have 500GB in a single CF, so I need at least 500GB of free space for compaction. If I partition this CF and split it into 10 proportional CFs of 50GB each, does that mean I will need only 50GB of free space? Also, is there a recommended maximum of data size per node? Thanks.
Re: Cassandra and disk space
Yes, that's correct, but I wouldn't push it too far. You'll become much more sensitive to disk usage changes; in particular, rebalancing your cluster will be particularly difficult, and repair will also become dangerous. Disk performance also tends to drop when a disk nears capacity. There's no recommended maximum size -- it all depends on your access rates. Anywhere from 10 GB to 1 TB is typical. - Tyler
On Thu, Dec 9, 2010 at 5:52 PM, Rustam Aliyev rus...@code.az wrote: [ earlier thread quoted above ]
Re: Cassandra and disk space
Additionally, cleanup will fail to run when the disk is more than 50% full. Another reason to stay below 50%.
On Thu, Dec 9, 2010 at 6:03 PM, Tyler Hobbs ty...@riptano.com wrote: [ earlier thread quoted above ]
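Since cleanup refuses to run past the 50% mark, it is worth alerting well before it. A minimal monitoring sketch in Python; the mount point and threshold here are assumptions for illustration:

    import shutil

    DATA_MOUNT = "/var/lib/cassandra"  # hypothetical data directory mount
    THRESHOLD = 0.50                   # the thread's stay-under-50% guidance

    usage = shutil.disk_usage(DATA_MOUNT)
    fraction = usage.used / usage.total
    if fraction >= THRESHOLD:
        print("WARNING: %s is %.0f%% full; compaction, cleanup, or repair "
              "may no longer fit" % (DATA_MOUNT, fraction * 100))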
Re: Cassandra and disk space
Thanks Tyler, this is really useful. Also, I noticed that you can specify multiple data file directories located on different disks. Let's say I have a machine with 4 x 500GB drives; what would be the difference between the following two setups:
1. each drive mounted separately, each with a data file dir on it (so 4x data file dirs)
2. disks in RAID0, mounted as one drive with one data folder on it
In other words, does splitting the data folder into smaller ones bring any performance or stability advantages?
On 10/12/2010 00:03, Tyler Hobbs wrote: [ earlier thread quoted above ]
Re: Cassandra and disk space
On Thu, Dec 9, 2010 at 4:20 PM, Rustam Aliyev rus...@code.az wrote: Thanks Tyler, this is really useful. [ RAID0 vs JBOD question ] In other words, does splitting the data folder into smaller ones bring any performance or stability advantages? This is getting to be a FAQ, so here's my stock answer: There are non-zero production deployments which have experienced fail from multiple data directories in cassandra. There are zero production deployments which have experienced win from multiple data directories in cassandra. YMMV, of course! =Rob PS - Maybe we should remove the multiple data directory stuff, so people don't keep getting tempted to use it?
Re: Cassandra and disk space
On Thu, Dec 9, 2010 at 6:20 PM, Rustam Aliyev rus...@code.az wrote: [ RAID0 vs JBOD question ] It brings disadvantages. Your largest CF will be limited to the size of your smallest drive, and you won't be using them in parallel when compacting. RAID0 is the better option. -Brandon
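Brandon's first point is easy to see with the 4 x 500GB box from the question: with one data directory per disk, a compacted CF has to fit on a single drive, while RAID0 pools them. A small sketch of that constraint (my own illustration of the point above):

    drives_gb = [500, 500, 500, 500]

    # One data dir per drive: a CF's compacted SSTables must fit on one drive.
    max_cf_jbod_gb = min(drives_gb)
    # RAID0: one pooled volume.
    max_cf_raid0_gb = sum(drives_gb)

    print("largest practical CF with per-drive data dirs: ~%d GB" % max_cf_jbod_gb)
    print("largest practical CF with RAID0:               ~%d GB" % max_cf_raid0_gb)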
Re: [OT] shout out for riptano training
I second that as well. I actually found the training to be fun (love the new stuff in 0.7.0) and quite interesting. Now I'm looking forward to the next Cassandra Summit. Thank you, Riptano. On Thu, Dec 9, 2010 at 2:48 PM, Dave Viner davevi...@gmail.com wrote: Just wanted to give a shout-out to Jonathan Ellis and the Riptano team for the awesome training they provided yesterday in Santa Monica. It was awesome, and I'd highly recommend it for anyone who is using or seriously considering using Cassandra. Just freakin' awesome. Dave Viner -- Salvador Fuentes Jr.
Re: N to N relationships
I would also recommend two column families. Storing the key as NxN would require you to hit multiple machines to query for an entire row or column with RandomPartitioner. Even with OPP you would need to pick rows or columns to order by, and the other would require hitting multiple machines. Two column families avoids this and avoids any problems with choosing OPP.
On Thu, Dec 9, 2010 at 2:26 PM, Aaron Morton aa...@thelastpickle.com wrote: I'm assuming you have one matrix and you know the dimensions. Also, as you say, the most important queries are to get an entire column or an entire row. I would consider using a standard CF for the Columns and one for the Rows. The key for each would be the col/row number, each cassandra column name would be the id of the other dimension, and the value whatever you want.
- when storing the data, update both the Column and Row CF
- reading a whole row/col is simply a read from the appropriate CF
- reading an intersection is a get_slice to either the col or row CF, using the column_names field to identify the other dimension
You would not need secondary indexes to serve these queries. Hope that helps. Aaron
On 10 Dec, 2010, at 07:02 AM, Sébastien Druon sdr...@spotuse.com wrote: I mean if I have secondary indexes. Apparently they are calculated in the background... On 9 December 2010 18:33, David Boxenhorn da...@lookin2.com wrote: What do you mean by indexing? On Thu, Dec 9, 2010 at 7:30 PM, Sébastien Druon sdr...@spotuse.com wrote: Thanks a lot for the answer. What about the indexing when adding a new element? Is it incremental? Thanks again On 9 December 2010 14:38, David Boxenhorn da...@lookin2.com wrote: How about a regular CF where keys are n...@n ? Then, getting a matrix row would be the same cost as getting a matrix column (N gets), and it would be very easy to add element N+1. On Thu, Dec 9, 2010 at 1:48 PM, Sébastien Druon sdr...@spotuse.com wrote: Hello, For a specific case, we are thinking about representing an N-to-N relationship with an NxN matrix in Cassandra. The relations will be only between a subset of elements, so the matrix will mostly contain empty elements. We have a set of questions concerning this: what is the best way to represent this matrix? What would have the best performance in reading? In writing?
. a super column family with n column families, with n columns each
. a column family with n columns and n lines
In the second case, we would need to extract 2 kinds of information:
- all the relations for a line: this should be no specific problem;
- all the relations for a column: in that case we would need an index for the columns, right? And then get all the lines where the value of the column in question is not null... is that the correct way to do it?
When using indexes, say we want to add another element N+1. What impact in terms of time would it have on the indexation job? Thanks a lot for the answers, Best regards, Sébastien Druon
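A sketch of the two-column-family pattern the answers converge on, written against pycassa (which comes up elsewhere on this list); the keyspace and CF names are invented, and the exact API calls are assumptions that may vary by pycassa version:

    import pycassa

    # Hypothetical keyspace with two standard CFs: 'MatrixRows' keyed by
    # row number and 'MatrixCols' keyed by column number.
    pool = pycassa.ConnectionPool('MatrixKS')
    rows = pycassa.ColumnFamily(pool, 'MatrixRows')
    cols = pycassa.ColumnFamily(pool, 'MatrixCols')

    def put(i, j, value):
        # Store each relation under both orientations.
        rows.insert(str(i), {str(j): value})
        cols.insert(str(j), {str(i): value})

    def whole_row(i):
        return rows.get(str(i))   # one read, no secondary index needed

    def whole_col(j):
        return cols.get(str(j))

    def intersection(i, j):
        # get_slice-style read: name the one column we want.
        return rows.get(str(i), columns=[str(j)])[str(j)]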
[RELEASE] 0.7.0 rc2
I'd have thought all that turkey and stuffing would have done more damage to momentum, but judging by the number of bug-fixes in the last couple of weeks, that isn't the case. As usual, I'd be remiss if I didn't point out that this is not yet a stable release. It's getting pretty close, but we're not ready to stick a fork in it yet. Be sure to test it thoroughly before upgrading something important. Please be sure to read through the changes[1] and release notes[2]. Report any problems you find[3], and if you have any questions, don't hesitate to ask. Thanks! [1]: http://goo.gl/ZMQEe (CHANGES.txt) [2]: http://goo.gl/R35HH (NEWS.txt) [3]: https://issues.apache.org/jira/browse/CASSANDRA [4]: http://people.apache.org/~eevans/cassandra_0.7.0~rc2_all.deb -- Eric Evans eev...@rackspace.com
Re: Obscured question about data size in a Column Family
In <= 0.6 (but not 0.7), a row could not be larger than 2GB. 2GB is still the largest possible column value. On Thu, Dec 9, 2010 at 5:38 PM, Joshua Partogi joshua.j...@gmail.com wrote: [ question quoted above ] -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
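For context, the 2GB figure matches a signed 32-bit length: the wiki page quoted above attributes the limit to values having to fit in memory as a single Thrift-framed byte array, and 2^31 bytes is exactly 2GB. The arithmetic below is mine; the reasoning is hedged from the wiki, not from this thread:

    # 2**31 bytes -- the cap a signed 32-bit length imposes -- is 2.0 GB.
    print(2 ** 31 / (1024 ** 3), "GB")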
Re: NullPointerException in Beta3 and rc1
describe_schema_versions() returns a Map<String, List<String>> with one entry. The key is a UUID and the List<String> has one element, which is the IP of my machine. I think this has something to do with the 'truncate' command in the CLI. I can reproduce it by:
1. create a CF with column1 as a secondary index
2. add some rows
3. truncate the CF
4. add some rows and do a query where column1=someValue and column2=someValue.
It does not happen if the query just has column1=someValue. thanks
On Wed, Dec 8, 2010 at 3:17 PM, Aaron Morton aa...@thelastpickle.com wrote: Jonathan suggested your cluster has multiple schemas, caused by https://issues.apache.org/jira/browse/CASSANDRA-1824 Can you run the API command describe_schema_versions()? It's not listed on the wiki yet, but it will tell you how many schema versions are out there. pycassa supports it. Aaron
On 09 Dec, 2010, at 08:19 AM, Aaron Morton aa...@thelastpickle.com wrote: Please send this to the list rather than me personally. Aaron
Begin forwarded message: From: Wenjun Che wen...@openf.in Date: 08 December 2010 4:35:10 PM To: aa...@thelastpickle.com Subject: Re: NullPointerException in Beta3 and rc1
I created the CF on beta3 with: create column family RecipientChat with gc_grace=5 and comparator = 'AsciiType' and column_metadata=[{column_name:recipient,validation_class:BytesType,index_type:0}] After I added about 5000 rows, I got the error when querying the CF with recipient='somevalue' and anotherColumn='anotherValue'. I tried truncating the CF and was still getting the same error. The last thing I tried was upgrading to rc1, and I saw the same error. Thanks
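Aaron's describe_schema_versions() check is scriptable from pycassa, which he notes supports the call; treat the import path and method name below as assumptions based on later pycassa releases:

    from pycassa.system_manager import SystemManager

    sm = SystemManager('localhost:9160')
    versions = sm.describe_schema_versions()  # {schema_version: [node IPs]}
    if len(versions) > 1:
        print("schema disagreement across nodes:", versions)
    else:
        print("all live nodes agree on one schema version")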
Re: Cassandra and disk space
This is true, but for larger installations I end up needing more servers to hold the disks, and more racks to hold the servers, to the point where the overall cost per GB climbs (granted, the cost per IOP is probably still good). AIUI, a chunk of that 50% is replicated data, such that the truly available space in the cluster is lower than 50% for capacity-planning purposes? If so, for some workloads where it's just data pouring in with very few updates, that would have me thinking I'd want a tiered model, archiving cold data onto a filer/HDFS. Bill
On Thu, 2010-12-09 at 13:26 -0600, Tyler Hobbs wrote: [ earlier thread quoted above ]
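Bill's reading matches the usual arithmetic: replicas occupy disk like any other data, so raw capacity must cover logical data times the replication factor, and then the under-50% headroom on top of that. A sketch of that calculation (RF=3 is an assumed example, not a number from the thread):

    def raw_storage_needed_tb(logical_tb, replication_factor, max_fill=0.5):
        # Replicas count toward per-node usage, and the guidance above
        # says to keep every node under ~50% full.
        return logical_tb * replication_factor / max_fill

    print(raw_storage_needed_tb(100, 1))  # 200 TB: the figure quoted above
    print(raw_storage_needed_tb(100, 3))  # 600 TB once RF=3 replicas count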
Re: NullPointerException in Beta3 and rc1
Can you still reproduce this with rc2, after starting with an empty data and commitlog directory? There used to be a bug w/ truncate + 2ary indexes, but that should be fixed now.
On Thu, Dec 9, 2010 at 8:53 PM, Wenjun Che wen...@openf.in wrote: [ earlier thread quoted above ] -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Re: Running multiple instances on a single server --micrandra ??
On Tue, 2010-12-07 at 21:25 -0500, Edward Capriolo wrote: [ micrandra proposal quoted above ] Does it ever make sense to run this way?
It might for read-heavy loads. When I looked at this, it was pointed out to me that it's simpler to run fewer, bigger, coarser nodes and take the entire node/server out when something goes wrong. Basically, give each Cassandra a server. I wonder if it would be better to rethink compaction, if that's what's driving the idea. It seems to be what is biting everyone, along with GC. Bill