Lost everything after topology change
Hi everyone,

I'm coming to you because I'm in quite a pickle and need to get the Cassandra database working ASAP…

I tried to change the topology file and then ran a nodetool repair… Now, in cassandra-cli, when I try to list a column family, it tells me:

    null
    UnavailableException()
        at org.apache.cassandra.thrift.Cassandra$get_range_slices_result.read(Cassandra.java:12262)
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
        at org.apache.cassandra.thrift.Cassandra$Client.recv_get_range_slices(Cassandra.java:683)
        at org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(Cassandra.java:667)
        at org.apache.cassandra.cli.CliClient.executeList(CliClient.java:1373)
        at org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:264)
        at org.apache.cassandra.cli.CliMain.processStatementInteractive(CliMain.java:219)
        at org.apache.cassandra.cli.CliMain.main(CliMain.java:346)

I can't seem to find out why. In /var/lib/cassandra, all the database files seem to be fine, since they are the same size as before… They are not gone, which is good news… but what can I do to get them back?

Thank you!
Re: Lost everything after topology change
Actually, I was creating my second node… Since I wanted full replication, I changed the topology… I have reverted the topology file by taking the one from the tgz file, since it was the first time I had touched it… Now that I have reverted the file, I still can't get my data back. nodetool ring gives me this:

    Address        DC           Rack   Status  State   Load     Effective-Owership  Token
                                                                                    102515232201044920484540575422936921078
    127.0.0.1      datacenter1  rack1  Up      Normal  1,36 GB  0,00%               17406244052094587115982865059561225030
    88.190.62.134  datacenter1  rack1  Down    Normal  ?        0,00%               102515232201044920484540575422936921078

On 12 Aug 2013, at 18:47, Robert Coli rc...@eventbrite.com wrote:

> On Mon, Aug 12, 2013 at 9:36 AM, Morgan Segalis msega...@gmail.com wrote:
>> I'm coming to you because I'm quite in a pickle, and need to get the Cassandra database working asap…
>
> First, #cassandra on freenode is usually better for emergent cases like this.
>
>> I tried to change the topology file and tried a node tool repair…
>
> My first advice would be to change the topology file back.
>
>> in cassandra-cli when I tried to list a column family, it tells me
>
> What does it say when you do nodetool ring?
>
> =Rob
Having 2 nodes with 100% Ownership ?
Hi everyone,

I would like to have 100% Effective-Owership on both Cassandra nodes… I have just created the second node. ./nodetool ring gives me:

    Address                DC           Rack   Status  State   Load     Effective-Owership  Token
                                                                                            17406244052094587115982865059561225030
    my.first.cassandra.ip  datacenter1  rack1  Up      Normal  1,2 GB   89,77%              1
    my.sec.cassandra.ip    datacenter1  rack1  Up      Normal  1,37 GB  10,23%              17406244052094587115982865059561225030

What should I do to get 100% on both nodes?

Thank you.
Re: Having 2 nodes with 100% Ownership ?
Hi, thank you for your answer… I don't want 50%; I would like 100%, so that if one node is down the second can take over.

Thank you.

On 12 Aug 2013, at 21:09, Mohit Anchlia mohitanch...@gmail.com wrote:

> You need to get it to 50% on each to equally distribute the hash range. You need to 1) calculate the new token and 2) move the nodes to that token, or use vnodes. For the first option see: http://www.datastax.com/docs/0.8/install/cluster_init
>
> On Mon, Aug 12, 2013 at 12:06 PM, Morgan Segalis msega...@gmail.com wrote:
>> Hi everyone, I would like to have 100% Effective-Owership on both cassandra nodes…
Re: Lost everything after topology change
Thanks Aaron, but Robert helped me through every step on freenode #cassandra!

Regards,
Morgan.

On 12 Aug 2013, at 23:30, Aaron Morton aa...@thelastpickle.com wrote:

> I think you need to get the DOWN node out of there: run nodetool removenode. Then let us know what the ring looks like and what you want to change, and we should be able to help.
>
> Cheers
> -
> Aaron Morton
> Cassandra Consultant
> New Zealand
> @aaronmorton
> http://www.thelastpickle.com
>
> On 13/08/2013, at 4:52 AM, Morgan Segalis msega...@gmail.com wrote:
>> Actually I was creating my second node…
Re: Having 2 nodes with 100% Ownership ?
It is still a little fuzzy to me how to calculate a token for a 50% distribution… How do I do that? It is not as if I wanted to have 10,23% on one node and 89,77% on the other ;-)

I have found this website, http://blog.milford.io/cassandra-token-calculator/, but I'm not sure whether I should take the token from it with my eyes closed. For 2 nodes, it tells me to put 85070591730234615865843651857942052864 on the second node and 0 on the first.

How can I get 50% on both nodes if all data are replicated?

On 12 Aug 2013, at 21:20, Francisco Andrades Grassi bigjoc...@gmail.com wrote:

> Hi,
>
> You should use a 50% token distribution as Mohit pointed out, but configure a replication factor of 2, so all your rows will effectively be on both nodes.
>
> --
> Francisco Andrades Grassi
> www.bigjocker.com
> @bigjocker
>
> On Aug 12, 2013, at 2:44 PM, Morgan Segalis msega...@gmail.com wrote:
>> Hi, thank you for you answer… I don't want 50% I would like 100% so I one is down the second can take over. Thank you.
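The token question above comes down to simple arithmetic. As a minimal sketch, assuming the cluster uses RandomPartitioner (whose token space runs from 0 to 2**127) and that the helper name `balanced_tokens` is purely illustrative, evenly spaced tokens can be computed like this:

```python
# Evenly spaced initial tokens, assuming RandomPartitioner (token space 0 .. 2**127).
RING_SIZE = 2 ** 127


def balanced_tokens(node_count):
    """Return one token per node, spaced evenly around the ring."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return [i * RING_SIZE // node_count for i in range(node_count)]


if __name__ == "__main__":
    for index, token in enumerate(balanced_tokens(2)):
        print(f"node {index}: initial_token = {token}")
```

For two nodes this gives 0 and 85070591730234615865843651857942052864, the same values the online calculator suggested, so the calculator's answer can be checked by hand rather than taken on trust.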
Re: Having 2 nodes with 100% Ownership ?
Hi Robert,

Thanks for helping me (again). As you know, I'm a real newbie.

So I copied the whole apache-cassandra folder (not /var/lib/cassandra) from my first server to my second server, so I'm sure to be using the exact same version. I have changed the token of the second node to 85070591730234615865843651857942052864, as http://blog.milford.io/cassandra-token-calculator/ gave me.

Here's my current keyspace definition:

    create keyspace mykeyspace with placement_strategy = 'NetworkTopologyStrategy' and strategy_options = {DC1 : 1} and durable_writes = true;

So I should update it with:

    update keyspace KEYSPACE_NAME with strategy_options = {DC1 : 2};

and then run:

    $ nodetool repair (on both nodes)

Knowing that Server 1 and Server 2 are from the same provider but not in the same data center (the ping is really fast, about 0.350 ms, though I'm not sure whether that matters), should I change the cassandra-topology.properties file? (It is currently the out-of-the-box version.)

Thank you.
Morgan.

On 13 Aug 2013, at 01:22, Robert Coli rc...@eventbrite.com wrote:

> On Mon, Aug 12, 2013 at 4:19 PM, Morgan Segalis msega...@gmail.com wrote:
>> It is still a little fuzzy when it comes to calculate a token for 50% distribution… How do I do that, it is not like I wanted to have 10,23% on one node, and 89,77% and the other ;-)
>
> The feature which picks a random token and results in distributions like this has been removed from upstream.
>
>> for 2 nodes it tells me to put 85070591730234615865843651857942052864 on the second node. (For 2 node calculation) and the first one at 0. How come I can get 50% on both nodes if all data are replicated ?
>
> Briefly, you will get effective ownership = 100% if you up the RF so that it = N. Two nodes, each of which own 50% of the token range, but with a RF=2 means ownership is 50%, but effective ownership is 100%.
>
> =Rob
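Rob's distinction between ownership and effective ownership can be checked with a few lines of arithmetic. The following is only a sketch under stated assumptions: RandomPartitioner's 0 to 2**127 token space, a single data center and rack, and replicas placed on the next nodes clockwise around the ring. The function names are illustrative and are not part of any Cassandra tool.

```python
RING_SIZE = 2 ** 127  # RandomPartitioner token space (assumed)


def primary_ownership(tokens):
    """Fraction of the ring each token owns: the range (previous token, token]."""
    ordered = sorted(tokens)
    shares = {}
    for i, token in enumerate(ordered):
        previous = ordered[i - 1]  # for i == 0 this wraps around to the last token
        width = (token - previous) % RING_SIZE or RING_SIZE  # a lone node owns everything
        shares[token] = width / RING_SIZE
    return shares


def effective_ownership(tokens, replication_factor):
    """Fraction of all data each node stores once replicas are counted."""
    ordered = sorted(tokens)
    shares = primary_ownership(ordered)
    copies = min(replication_factor, len(ordered))
    effective = {}
    for i, token in enumerate(ordered):
        # A node holds its own range plus the ranges of the (copies - 1) nodes before it.
        effective[token] = sum(shares[ordered[i - k]] for k in range(copies))
    return effective


if __name__ == "__main__":
    unbalanced = [1, 17406244052094587115982865059561225030]
    balanced = [0, 85070591730234615865843651857942052864]
    print(effective_ownership(unbalanced, 1))
    print(effective_ownership(balanced, 1))
    print(effective_ownership(balanced, 2))
```

With the tokens from the original ring output and RF=1, the shares work out to about 89.77% and 10.23%, which matches what nodetool reported. With the balanced tokens, RF=1 gives 50% per node and RF=2 gives 100% per node.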
Re: Having 2 nodes with 100% Ownership ?
On 13 Aug 2013, at 01:50, Robert Coli rc...@eventbrite.com wrote:

> On Mon, Aug 12, 2013 at 4:41 PM, Morgan Segalis msega...@gmail.com wrote:
>> So I fetched the whole apache-cassandra (not the /var/lib/cassandra) folder from my first server to my second server.
>
> Including the data directory for your keyspace? That's the simplest way to do this operation in your case.

Actually no, I was talking about the Cassandra source folder, not the data folder. When I bootstrapped the second node, it took 10 sec. to transfer more than 10% (the servers have 10GB connections). The whole keyspace is apparently 16G. So if it takes about 100 sec for the whole data set, I'm not going to take any chances doing it by hand ;-)

>> So I'm sure to use the exact same version. I have changed the token of the second node to 85070591730234615865843651857942052864 as http://blog.milford.io/cassandra-token-calculator/ gave me.
>
> Ok.
>
>> so I should update with update keyspace KEYSPACE_NAME with strategy_options = {DC1 : 2}; and then : $ nodetool repair (on both nodes)
>
> Yes. If you pre-copy the sstables to the new node, this will go MUCH faster, because it will only have to sync the data that has come into the original node between the copy and the repair.
>
> =Rob

So I should not touch the cassandra-topology.properties file? And the fact that node 1 and node 2 are both DC1 RACK1 does not bother Cassandra?
Adding my first node to another one...
Hi everyone,

I'm trying to wrap my head around Cassandra's great ability to expand…

I set up my first Cassandra node a while ago… it was working great, and the data wasn't so important back then. Since I had a great experience with Cassandra, I decided to migrate my MySQL data to Cassandra step by step. Now the data is starting to be important, so I would like to create another node and add it. Since I had some issues with my data center, I wanted to have a copy (of sensitive data only) in another data center.

Quite frankly, I'm still a newbie with Cassandra and need your help, guys.

First things first…

The Cassandra node that is already up and running (called A):
- Do I need to change anything in cassandra.yaml to make sure that another node can connect? If yes, should I restart the node (because I would have to warn users about downtime)?
- Since this node should be a seed, and the seed list is already set to localhost, is that good enough?

The new node I want to add (called B):
- I know that before starting this node, I should modify the seed list in cassandra.yaml… Is that the only thing I need to do?

It is my first time doing this, so please be gentle ;-)

Thank you all,
Morgan.
Re: Adding my first node to another one...
Hi Arthur,

Thank you for your answer. I had read the section Adding Capacity to an Existing Cluster before posting my question.

Actually, I was thinking I would like Cassandra to choose the token by itself, since I want only some column families to be replicated across the whole cluster, and the other column families to stay where they are, regardless of balancing…

I cannot find anything about the configuration I should apply on the very first node (the only node so far) to start the replication. (The configuration of my Node A is pretty basic, almost out of the box; I might have changed the name.) How do I make this node know that it will be a seed?

My current Node A is running Cassandra 1.1.0. Is it compatible if I install a new node with Cassandra 1.2.8, or should I fetch 1.1.0 for Node B?

Thank you.
Morgan.

On 1 Aug 2013, at 20:32, Arthur Zubarev arthur.zuba...@aol.com wrote:

> Hi Morgan,
>
> The scaling out depends on several factors. The most intricate is perhaps calculating the tokens. Also the Cassandra version is important. At this point in time I suggest you read section Adding Capacity to an Existing Cluster at http://www.datastax.com/docs/1.0/operations/cluster_management and come back here with questions and more details.
>
> Regards,
> Arthur
>
> -Original Message-
> From: Morgan Segalis
> Sent: Thursday, August 01, 2013 11:24 AM
> To: user@cassandra.apache.org
> Subject: Adding my first node to another one...
>
>> Hi everyone, I'm trying to wrap my head around Cassandra great ability to expand…
Re: Adding my first node to another one...
Hi Rob,

On 2 Aug 2013, at 00:15, Robert Coli rc...@eventbrite.com wrote:

> On Thu, Aug 1, 2013 at 2:07 PM, Morgan Segalis msega...@gmail.com wrote:
>> Actually I was thinking I would like Cassandra choose by itself the token.
>
> You NEVER want Cassandra to choose its own token in production. There is no advantage to doing so and significant risk when used as a matter of course. The conf file even says you should manually specify tokens in production.

OK, then I'll try to understand this token thing.

>> How to make this node know that it will be a Seed.
>
> The only thing that makes a node a Seed is that any other node has it in its seed list.

Good to know, thanks!

>> My current Node A is using Cassandra 1.1.0
>
> You should not run 1.1.0, it contains significant and serious bugs. You should upgrade to the top of 1.1 series ASAP.

Of course I need to upgrade Cassandra, but I won't do that until I have another node that can take over while I'm upgrading.

>> Is it compatible if I install a new node with Cassandra 1.2.8 ? or should I fetch 1.1.0 for Node B ?
>
> It is not compatible, use 1.1.x with 1.1.x.

Yeah, that's what I thought!

> =Rob

Thank you for your tips.
Re: Store a timeline with uniques properties
Hi Aaron,

That's great news... Would you know the name of this feature, so I can look further into it?

Thanks,
Morgan.

On 31 Aug 2012, at 06:05, aaron morton aa...@thelastpickle.com wrote:

> Consider trying…
>
> UserTimeline CF
> row_key: user_id
> column_names: timestamp, other_user_id, action
> column_values: action details
>
> To get the changes between two times, specify the start and end timestamps and do not include the other components of the column name, e.g. from 1234, NULL, NULL to 6789, NULL, NULL
>
> Cheers
> -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 30/08/2012, at 11:32 PM, Morgan Segalis msega...@gmail.com wrote:
>> Sorry for the scheme that has not keep the right tabulation for some people...
Re: Store a timeline with uniques properties
Never mind, it is called composite columns. Thank you for your help.

Morgan.

On 31 Aug 2012, at 06:05, aaron morton aa...@thelastpickle.com wrote:

> Consider trying… UserTimeline CF, row_key: user_id, column_names: timestamp, other_user_id, action, column_values: action details
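To make the composite-column idea more concrete, here is a plain-Python model of a single timeline row. It is only a sketch and does not use any Cassandra client: column names are modeled as (timestamp, other_user_id, action) tuples kept in sorted order, as Aaron suggested, and a range query supplies only the timestamp component. The `_latest` lookup that replaces an older entry for the same friend and action is an assumption of this sketch; it plays the role of the second column family mentioned in the original question. Timestamps are assumed to be integers.

```python
import bisect


class UserTimeline:
    """In-memory model of one row whose column names are (timestamp, other_user_id, action)."""

    def __init__(self):
        self._names = []   # composite column names, kept sorted
        self._values = {}  # composite column name -> action details
        self._latest = {}  # (other_user_id, action) -> timestamp of the current entry

    def record(self, timestamp, other_user_id, action, details):
        """Store a change, replacing any older entry for the same friend and action."""
        key = (other_user_id, action)
        old_timestamp = self._latest.get(key)
        if old_timestamp is not None:
            old_name = (old_timestamp, other_user_id, action)
            self._names.remove(old_name)
            del self._values[old_name]
        name = (timestamp, other_user_id, action)
        bisect.insort(self._names, name)
        self._values[name] = details
        self._latest[key] = timestamp

    def changes_between(self, start, end):
        """Return entries with start <= timestamp <= end, oldest first."""
        # A one-element tuple sorts before every full name sharing that timestamp,
        # which mirrors slicing from (start, NULL, NULL) to (end, NULL, NULL).
        lo = bisect.bisect_left(self._names, (start,))
        hi = bisect.bisect_left(self._names, (end + 1,))
        return [(name, self._values[name]) for name in self._names[lo:hi]]


if __name__ == "__main__":
    timeline = UserTimeline()
    timeline.record(100, "user2", "status-change", "away")
    timeline.record(200, "user2", "pic-change", "old.png")
    timeline.record(300, "user4", "status-change", "busy")
    timeline.record(400, "user3", "pic-change", "cat.png")
    timeline.record(500, "user2", "name-change", "Bob")
    timeline.record(600, "user2", "pic-change", "new.png")  # replaces the entry at 200
    for name, details in timeline.changes_between(350, 700):
        print(name, details)
```

A user who last connected between timestamps 300 and 400 would then see only user3's picture change, user2's name change and user2's new picture, which is the behaviour described in the original question.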
Store a timeline with uniques properties
Hi everyone,

I'm trying to use Cassandra to store a timeline, but with values that must be unique (replaced). (So not really a timeline, but I didn't find a better word for it.)

Let me give you an example:
- A user has a list of friends
- Friends can change their nickname, status, profile picture, etc.

At the beginning, the CF will look like this for user1, where lte = latest-timestamp-entry, which is the timestamp of the entry (-1, -2, -3 mean that the timestamps are older):

    user1 row : | lte               | lte -1           | lte -2              | lte -3           | lte -4              |
    values    : | user2-name-change | user3-pic-change | user4-status-change | user2-pic-change | user2-status-change |

If, for example, user2 changes its picture, the row should look like this:

    user1 row : | lte              | lte -1            | lte -2           | lte -3              | lte -4              |
    values    : | user2-pic-change | user2-name-change | user3-pic-change | user4-status-change | user2-status-change |

Notice that user2-pic-change, at (lte -3) in the first representation, has moved to (lte) in the second representation.

That way, when user1 connects again, it can retrieve only the information about what has occurred since the last time he connected. For example, if user1's last connection date is between lte -2 and lte -3, then he will only be notified that:
- user2 has changed his picture
- user2 has changed his name
- user3 has changed his picture

I would not keep the old data, since the timeline is saved locally on the client and not on the server.

I would really like to avoid searching through each column to find user2-pic-change, which can take a long time, especially if the user has many friends.

Is there a simple way to do that with Cassandra, or am I bound to create another CF, with the column name holding the action (e.g. user2-pic-change) and, as its value, the timestamp at which it appears?

Thanks,
Morgan.
Re: Store a timeline with uniques properties
Sorry for the diagram, which has not kept the right tabulation for some people... Here's a version using spaces instead of tabs.

    user1 row : | lte               | lte -1           | lte -2              | lte -3           | lte -4              |
    values    : | user2-name-change | user3-pic-change | user4-status-change | user2-pic-change | user2-status-change |

If, for example, user2 changes its picture, the row should look like this:

    user1 row : | lte              | lte -1            | lte -2           | lte -3              | lte -4              |
    values    : | user2-pic-change | user2-name-change | user3-pic-change | user4-status-change | user2-status-change |

On 30 Aug 2012, at 13:22, Morgan Segalis wrote:

> Hi everyone, I'm trying to use cassandra in order to store a timeline, but with values that must be unique (replaced).
Re: Data model question, storing Queue Message
Hi Aaron,

Thank you for your answer; I was beginning to think that my question would never be answered ;-)

Actually, this is what I was going for, except for one thing: instead of partitioning rows per month, I thought about partitioning per day. That way, I launch the cleaning tool every day, and it deletes the day from X months earlier. I guess that will reduce the workload drastically. Does it have any downside compared to month partitioning?

At one point I was going to do something like the twissandra example: have a CF for the users' queues, and another CF per day storing the ID of every message of the day. That way, if I want to delete them, I only look into this row and use the IDs to delete the messages from the users' queue CF… Is that a good way to do it, or should I stick with the first implementation?

Best regards,
Morgan.

On 30 Apr 2012, at 05:52, aaron morton wrote:

> Message Queue is often not a great use case for Cassandra. For information on how to handle high delete workloads see http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra
>
> It is hard to create a model without some idea of the data load, but I would suggest you start with:
>
> CF: UserMessages
> Key: ReceiverID
> Columns: column name = TimeUUID; column value = message ID and Body
>
> That will order the messages by time.
>
> Depending on load (and to support deleting a previous month's messages) you may want to partition the rows by month:
>
> CF: UserMessagesMonth
> Key: ReceiverID+MM
> Columns: column name = TimeUUID; column value = message ID and Body
>
> Everything is the same as before, but now a user has a row for each month, which you can delete as a whole. This also helps avoid very big rows.
>
>> I really don't think that storage will be an issue, I have 2TB per nodes, messages are 1KB limited.
>
> I would suggest you keep the per node limit to 300 to 400 GB. It can take a long time to compact, repair and move the data when it gets above 400GB.
>
> Hope that helps.
>
> -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 27/04/2012, at 1:30 AM, Morgan Segalis wrote:
>
>> Hi everyone!
>>
>> I'm fairly new to Cassandra and I'm not yet quite familiar with the column-oriented NoSQL model. I have worked on it for a while, but I can't seem to find the best model for what I'm looking for.
>>
>> I have Erlang software that lets users connect and communicate with each other. When a user (A) sends a message to a disconnected user (B), it stores the message in the database, waits for user (B) to connect and retrieve the message queue, and then deletes it.
>>
>> Here are some key points:
>> - Users are identified by integer IDs
>> - Each message is unique by the combination of: Sender ID - Receiver ID - Message ID - time
>>
>> I have a message queue, and here are the operations I would need to do as fast as possible:
>> - Store from 1 to X messages per registered user
>> - Get the number of stored messages per user (this can be an incremental variable updated at each store; it is retrieved often)
>> - Retrieve all messages from a user at once
>> - Delete all messages from a user at once
>> - Delete all messages that are older than Y months (from all users)
>>
>> I really don't think that storage will be an issue: I have 2 TB per node, and messages are limited to 1 KB. I'm really looking for speed rather than storage optimization.
>>
>> My configuration is 2 dedicated servers, each with:
>> - 4 x Intel i7 2.66 GHz
>> - 64 bits
>> - 24 GB
>> - 2 TB
>>
>> Thank you all.
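The partitioning idea discussed above (one row per receiver per time bucket, deleted as a whole once it is old enough) can be sketched as follows. This is only an illustration under stated assumptions: the key format ReceiverID+YYYYMMDD, a daily bucket as Morgan proposes, and the helper names `row_key` and `keys_to_delete` are all hypothetical, and no Cassandra client is involved.

```python
from datetime import date, timedelta


def row_key(receiver_id, day):
    """Row key for one receiver and one calendar day, e.g. '42+20120430' (assumed format)."""
    return f"{receiver_id}+{day:%Y%m%d}"


def keys_to_delete(receiver_ids, today, retention_days):
    """Row keys a daily cleanup job would remove: the bucket that just left the window."""
    expired_day = today - timedelta(days=retention_days)
    return [row_key(receiver_id, expired_day) for receiver_id in receiver_ids]


if __name__ == "__main__":
    print(row_key(42, date(2012, 4, 30)))
    print(keys_to_delete([1, 2], date(2012, 4, 30), retention_days=90))
```

One design consequence the sketch makes visible: with per-user buckets, the cleanup job needs the list of receiver IDs that had messages on the expired day, which is what the extra per-day index CF in the twissandra-style variant would provide.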
Re: Data model question, storing Queue Message
Hi Samal, Thanks for the TTL feature, I wasn't aware of it's existence. Day's partitioning will be less wider than month partitionning (about 30 times less give or take ;-) ) Per day it should have something like 100 000 messages stored, most of it would be retrieved so deleted before the TTL feature should come do it's work. Le 30 avr. 2012 à 13:16, samal a écrit : On Mon, Apr 30, 2012 at 4:25 PM, Morgan Segalis msega...@gmail.com wrote: Hi Aaron, Thank you for your answer, I was beginning to think that my question would never be answered ;-) Actually, this is what I was going for, except one thing, instead of partitioning row per month, I though about partitioning per day, like that everyday I launch the cleaning tool, and it will delete the day from X month earlier. USE TTL feature of column as it will remove column after TTL is over (no need for manual job). I guess that will reduce the workload drastically, does it have any downside comparing to month partitioning? key belongs to particular node , so depending on size of your data day or month wise partitioning matters. Other wise it can lead to Fat row which will cause system problem. At one point I was going to do something like the twissandra example, Having a CF per User's queue, and another CF per day storing every message's ID of the day, in that way If I want to delete them, I only look into this row, and delete them using ID's for deleting them in the User's queue CF… Is that a good way to do ? Or should I stick with the first implementation ? Best regards, Morgan. Le 30 avr. 2012 à 05:52, aaron morton a écrit : Message Queue is often not a great use case for Cassandra. 
For information on how to handle high delete workloads see http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra It hard to create a model without some idea of the data load, but I would suggest you start with: CF: UserMessages Key: ReceiverID Columns : column name = TimeUUID ; column value = message ID and Body That will order the messages by time. Depending on load (and to support deleting a previous months messages) you may want to partition the rows by month: CF: UserMessagesMonth Key: ReceiverID+MM Columns : column name = TimeUUID ; column value = message ID and Body Everything the same as before. But now a user has a row for each month and which you can delete as a whole. This also helps avoid very big rows. I really don't think that storage will be an issue, I have 2TB per nodes, messages are 1KB limited. I would suggest you keep the per node limit to 300 to 400 GB. It can take a long time to compact, repair and move the data when it gets above 400GB. Hope that helps. - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 27/04/2012, at 1:30 AM, Morgan Segalis wrote: Hi everyone ! I'm fairly new to cassandra and I'm not quite yet familiarized with column oriented NoSQL model. I have worked a while on it, but I can't seems to find the best model for what I'm looking for. I have a Erlang software that let user connecting and communicate with each others, when an user (A) sends a message to a disconnected user (B), it stores it on the database and wait for the user (B) to connect and retrieve the message queue, and deletes it. 
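The month-partitioned layout suggested in Aaron's reply can be illustrated with a small in-memory sketch. This is not Cassandra client code: a plain dictionary stands in for the UserMessagesMonth column family, a `(sent_at, sequence)` tuple stands in for the TimeUUID column name, and the `receiver:YYYY-MM` key format (with the year included) is an assumption, since the thread only says "ReceiverID+MM". The function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from itertools import count

# Stand-in for the UserMessagesMonth CF: row key -> {column name -> value}.
user_messages_month = defaultdict(dict)
_sequence = count()  # tie-breaker so two messages with the same time stay distinct


def row_key(receiver_id, sent_at):
    """Partitioned row key; the 'receiver:YYYY-MM' format is an assumption."""
    return f"{receiver_id}:{sent_at:%Y-%m}"


def store(receiver_id, message_id, body, sent_at):
    """Add one column; (sent_at, sequence) plays the role of a TimeUUID name."""
    column_name = (sent_at, next(_sequence))
    user_messages_month[row_key(receiver_id, sent_at)][column_name] = (message_id, body)


def messages_in_month(receiver_id, month):
    """Return the month's messages ordered by column name, i.e. by time."""
    row = user_messages_month.get(row_key(receiver_id, month), {})
    return [row[name] for name in sorted(row)]


def drop_month(receiver_id, month):
    """Delete a whole monthly row at once; True if the row existed."""
    return user_messages_month.pop(row_key(receiver_id, month), None) is not None
```

The point of the sketch is that purging an old month touches one row key per user rather than every individual message; switching to day partitioning would only change the format string in `row_key`.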
Re: Data model question, storing Queue Message
Isn't Kafka too young for production use? It would clearly fit my needs much better, but I can't afford an early-stage project that is not ready for production. Is it?

On 30 Apr 2012, at 14:28, samal samalgo...@gmail.com wrote:

On Mon, Apr 30, 2012 at 5:52 PM, Morgan Segalis msega...@gmail.com wrote:

Hi Samal,

Thanks for pointing out the TTL feature; I wasn't aware of its existence.

Rows partitioned by day will be much narrower than rows partitioned by month (about 30 times narrower, give or take ;-) ). Each day should see something like 100,000 messages stored; most of them would be retrieved, and therefore deleted, before the TTL has to do its work.

samal: The TTL sets the latest point at which a column can exist in Cassandra; after that it is deleted. Deleting a column before its TTL expires is fine.

Have you considered Kafka? http://incubator.apache.org/kafka/

On 30 Apr 2012, at 13:16, samal wrote:

On Mon, Apr 30, 2012 at 4:25 PM, Morgan Segalis msega...@gmail.com wrote:

Hi Aaron,

Thank you for your answer; I was beginning to think that my question would never be answered ;-)

Actually, this is what I was going for, except for one thing: instead of partitioning rows per month, I thought about partitioning per day. That way, every day I launch the cleaning tool and it deletes the day from X months earlier.

samal: Use the TTL feature of columns; it removes a column once its TTL is over (no need for a manual job).

Morgan: I guess that will reduce the workload drastically. Does it have any downside compared to month partitioning?

samal: A key belongs to a particular node, so whether day-wise or month-wise partitioning is right depends on the size of your data. Otherwise it can lead to fat rows, which cause system problems.

Morgan: At one point I was going to do something like the twissandra example: a CF per user's queue, and another CF per day storing the ID of every message of that day. That way, if I want to delete them, I only look into this row and use the IDs to delete the messages in the user's queue CF… Is that a good way to do it?
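The TTL behaviour samal describes (a column disappears once its TTL is over, and deleting it earlier is fine) can be modelled with a toy in-memory row. This is only a sketch of the semantics, not Cassandra's implementation or API; the class name `TTLColumns` and its methods are hypothetical, and the injectable clock exists purely so the behaviour can be checked without waiting.

```python
import time


class TTLColumns:
    """Toy row whose columns expire after a per-column TTL given in seconds."""

    def __init__(self, clock=time.time):
        self._clock = clock   # injectable so tests can control "now"
        self._columns = {}    # column name -> (value, expiry timestamp)

    def insert(self, name, value, ttl):
        """Write a column that stops being visible ttl seconds from now."""
        self._columns[name] = (value, self._clock() + ttl)

    def delete(self, name):
        """Explicit delete; allowed at any time, including before expiry."""
        self._columns.pop(name, None)

    def live(self):
        """Return only the columns whose TTL has not yet elapsed."""
        now = self._clock()
        return {n: v for n, (v, expires) in self._columns.items() if expires > now}
```

In the use case discussed here, most messages would leave through `delete` when the receiver reconnects, and the TTL would only catch the ones nobody ever collected.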
Data model question, storing Queue Message
Hi everyone!

I'm fairly new to Cassandra and not yet familiar with the column-oriented NoSQL model. I have worked on it for a while, but I can't seem to find the best model for what I'm looking for.

I have an Erlang application that lets users connect and communicate with each other. When a user (A) sends a message to a disconnected user (B), it stores the message in the database, waits for user (B) to connect and retrieve the message queue, and then deletes it.

Here are some key points:
- Users are identified by integer IDs
- Each message is uniquely identified by the combination of: Sender ID - Receiver ID - Message ID - time

I have a message queue, and here are the operations I need to be as fast as possible:
- Store from 1 to X messages per registered user
- Get the number of stored messages per user (this can be an incremental variable updated at each store; it is retrieved often)
- Retrieve all messages for a user at once
- Delete all messages for a user at once
- Delete all messages that are older than Y months (for all users)

I really don't think that storage will be an issue: I have 2 TB per node, and messages are limited to 1 KB. I'm really looking for speed rather than storage optimization.

My configuration is 2 dedicated servers, each with:
- 4 x Intel i7 2.66 GHz
- 64 bits
- 24 GB
- 2 TB

Thank you all.
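To make the list of required operations concrete, here is a minimal in-memory sketch of the queue contract, including the "incremental variable" idea for the per-user count. It says nothing about how Cassandra would store the data; `OfflineQueue` and its method names are hypothetical, and timestamps are plain numbers for simplicity.

```python
from collections import defaultdict


class OfflineQueue:
    """In-memory model of the per-receiver message queue described above."""

    def __init__(self):
        # receiver_id -> list of (sent_at, sender_id, message_id, body)
        self._messages = defaultdict(list)
        # receiver_id -> number of stored messages, kept in step with writes
        self._counts = defaultdict(int)

    def store(self, receiver_id, sender_id, message_id, sent_at, body):
        self._messages[receiver_id].append((sent_at, sender_id, message_id, body))
        self._counts[receiver_id] += 1

    def count(self, receiver_id):
        """Cheap read of the maintained counter; no scan of the messages."""
        return self._counts.get(receiver_id, 0)

    def drain(self, receiver_id):
        """Retrieve all messages for a receiver, oldest first, and delete them."""
        messages = sorted(self._messages.pop(receiver_id, []), key=lambda m: m[0])
        self._counts.pop(receiver_id, None)
        return messages

    def purge_older_than(self, cutoff):
        """Drop every message, for all receivers, sent before cutoff."""
        for receiver_id in list(self._messages):
            kept = [m for m in self._messages[receiver_id] if m[0] >= cutoff]
            if kept:
                self._messages[receiver_id] = kept
                self._counts[receiver_id] = len(kept)
            else:
                del self._messages[receiver_id]
                self._counts.pop(receiver_id, None)
```

Note that `purge_older_than` has to walk every receiver here, which is exactly the cost that time-based partitioning or a TTL is meant to avoid.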