Re: NoSE: Automated schema design for Cassandra
Thanks for the feedback! I did change "column families" to "tables". I agree the documentation could use some work. If you're interested in seeing what the input and output look like, here's a sample: https://michael.mior.ca/projects/nose/rubis

So far we haven't had any schemas used directly in production, although the tool has provided advice on design alternatives. NoSE already contains a mechanism to execute different schema alternatives, which is what we used during our evaluation. However, it does not currently provide a mechanism for synthetic data generation. It would definitely be possible to add automated generation of test data in the future.

Cheers,
--
Michael Mior
mm...@uwaterloo.ca

2017-05-10 3:55 GMT-04:00 Jacques-Henri Berthemet <jacques-henri.berthe...@genesys.com>:

> Was this used to generate a schema used for production?
> Do you think it's possible to generate test code to validate the workload?
RE: NoSE: Automated schema design for Cassandra
Hi,

This is interesting. I'd just advise putting full examples and more documentation on how to use it online (the articles are a bit too detailed). Also, you should not mention "column families" but just tables.

Was this used to generate a schema used for production? Do you think it's possible to generate test code to validate the workload?

--
Jacques-Henri Berthemet

From: michael.m...@gmail.com [mailto:michael.m...@gmail.com] On Behalf Of Michael Mior
Sent: mardi 9 mai 2017 17:30
To: user <user@cassandra.apache.org>
Subject: NoSE: Automated schema design for Cassandra

> I wanted to share a tool I've been working on that tries to help automate the schema design process for Cassandra.
NoSE: Automated schema design for Cassandra
Hi all,

I wanted to share a tool I've been working on that tries to help automate the schema design process for Cassandra. The short description is that you provide information on the kind of data you want to store and the queries and updates you want to issue, and NoSE will perform a cost-based analysis to suggest an optimal schema.

There's lots of room for improvement, and many Cassandra features are not currently supported, but hopefully some in the community may still find it useful as a starting point. Link to more details and the source code below:

https://michael.mior.ca/projects/nose/

If you're interested in trying it out, don't hesitate to reach out and I'm happy to help!

Cheers,
--
Michael Mior
mm...@uwaterloo.ca
Read Performance / Schema Design
Hi everyone,

I have a question regarding read performance and schema design, if someone could help please.

Our requirement is to store, per user, many unique results (each is basically an attempt at some questions), so I had thought of having the userid as the row key and the result ids as columns. The keys for the result ids are maintained in a separate location, so they are known without having to perform any additional lookups.

My concern is that over time, reading a single result would incur the overhead of reading the entire row from disk, gradually slowing things down. So I was considering whether changing the row key to userid + result id would be a better solution?

Cheers
Ben
Re: Read Performance / Schema Design
On Wed, Oct 26, 2011 at 9:35 PM, Ben Gambley <ben.gamb...@intoscience.com> wrote:

> So I was considering if changing the row key to *userid + result id* would be a better solution?

Do you regularly need to read all of the results for a given userid? If not, go with the user_id + result_id approach. It will be more efficient for single-result lookups.

--
Tyler Hobbs
DataStax http://datastax.com/
Re: Read Performance / Schema Design
On Wed, Oct 26, 2011 at 7:35 PM, Ben Gambley <ben.gamb...@intoscience.com> wrote:

> So I was considering if changing the row key to *userid + result id* would be a better solution?

This is a clustering choice. Assuming your dataset is too big to fit in system memory, some considerations which should drive your decision are locality of access, cache efficiency, and worst-case performance.

1) If you access many results for a user-id around the same time, then putting them close together will get you better lookup throughput: once you incur a disk seek to access one result, the other results will be available in memory (until they fall out of cache). This is generally going to be somewhat true whether you use a key (userid/resultid) or a key (userid) with column (resultid).

2) If some (or one) result-id is accessed much more frequently than the others across all users, then you might want that result-id to get a good cache hit rate, while other, less frequently accessed result-ids need not be cached as much. To get this behavior, you'd want the most frequently accessed result-id packed together, and thus you'd want to either use a key such as result-id/user-id, or use a separate column family for hot result-ids.

3) How many results can a user have? If every user can have a very large number of results (say 40k), but you generally only need one result at a time, then you probably want a compound key (userid/resultid) rather than a key (userid) + column (resultid), because you don't want to have to read a large row just to get a small piece.

That said, Cassandra's handling of wide rows is improving, so perhaps in the future this will not be as large an issue.
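To make the two layouts in this thread concrete, here is a toy in-memory sketch (plain Python dictionaries, not the Cassandra API) contrasting the wide-row layout (one row per user, results as columns) with the compound-key layout (one small row per user/result pair). The names are illustrative only:

```python
wide = {}      # row key = userid; columns keyed by resultid
compound = {}  # row key = "userid:resultid"; one small row per result

def write_result(userid, resultid, value):
    # Wide layout: append a column to the user's (ever-growing) row.
    wide.setdefault(userid, {})[resultid] = value
    # Compound layout: each result lives in its own small row.
    compound[f"{userid}:{resultid}"] = value

def read_single(userid, resultid):
    # Wide layout: the whole user row must be located first.
    from_wide = wide[userid][resultid]
    # Compound layout: direct lookup of one small row.
    from_compound = compound[f"{userid}:{resultid}"]
    return from_wide, from_compound
```

The compound layout keeps every row small, which is why it favors single-result lookups, at the cost of losing the ability to slice all of a user's results from one row.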
Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?
Thanks Tyler!

On Thu, Feb 3, 2011 at 12:06 PM, Tyler Hobbs <ty...@datastax.com> wrote:

> It sounds reasonable to me. Since you're writing/reading all of the subcolumns at the same time, I would opt for a standard column with the tags serialized into a column value.
Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?
Hey all,

I need to store supercolumns, each with around 8 subcolumns. All the data for a supercolumn is written at once, and all subcolumns need to be retrieved together. The data in each subcolumn is not big; it just contains keys to other rows.

Would it be preferable to have a supercolumn family, or just a standard column family with all the subcolumn data serialized into a single column?

Thanks
Aditya Narayan
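The "serialize into a single column" alternative mentioned here can be as simple as packing the ~8 subcolumn values into one encoded blob. A minimal sketch, assuming JSON as the (hypothetical) serialization format:

```python
import json

# Pack ~8 "subcolumns" (each just a key to another row) into one
# serialized column value instead of using a supercolumn.
def pack_subcolumns(subcolumns: dict) -> str:
    # sort_keys makes the encoding deterministic, which helps if the
    # value is ever compared or diffed.
    return json.dumps(subcolumns, sort_keys=True)

def unpack_subcolumns(value: str) -> dict:
    return json.loads(value)
```

Since all subcolumns are written and read together anyway, the application loses nothing by serializing: every read already needs the whole set.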
Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?
Actually, I am trying to use Cassandra to display to users of my application the list of all reminders they have set for themselves. I need to store rows containing the timeline of daily reminders created by each user. The reminders need to be presented to the user in chronological order, like a news feed. Each reminder has certain tags associated with it (so that, at times, the user may also choose to see the reminders filtered by tag, in chronological order).

So I thought of a schema something like this:

- The details of each reminder are stored as a separate row in a column family.
- To present a user's timeline of reminders, the user's timeline row contains the ids/keys (of the reminder rows) as the supercolumn names, and the subcolumns inside each supercolumn contain the list of tags associated with that reminder. All tags are set at once during the first write. The number of tags (subcolumns) will be around 8 at most.

Any comments, suggestions and feedback on the schema design are requested..

Thanks
Aditya Narayan

On Wed, Feb 2, 2011 at 7:49 PM, Aditya Narayan <ady...@gmail.com> wrote:

> Would it be preferred to have a supercolumn family or just a standard column family containing all the subcolumns data serialized in single column(s)?
Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?
To reiterate, so I know we're both on the same page, your schema would be something like this:

- A column family (as you describe) to store the details of a reminder. One reminder per row. The row key would be a TimeUUID.
- A super column family to store the reminders for each user, for each day. The row key would be something like MMDD:user_id. The column names would simply be the TimeUUIDs of the messages. The sub column names would be the tag names of the various reminders.

The idea is that you would then get a slice of each row for a user, for a day, that would only contain sub column names with the tags you're looking for? Then, based upon the column names returned, you'd look up the reminders.

That seems like a solid schema to me.

Bill-

On 02/02/2011 09:37 AM, Aditya Narayan wrote:

> Any comments, suggestions and feedback on the schema design are requested..
Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?
I think you got exactly what I wanted to convey, except for a few things I want to clarify:

I was thinking of a single row containing all reminders (not split by day). A history of the reminders needs to be maintained for some time; after a certain period (say 3 or 6 months) they may be deleted via the TTL facility.

While presenting the reminders timeline to the user, the latest supercolumns (around 50, from one end of the row) will be picked up, their subcolumn values will be compared to the tags the user has chosen to see, and, for the matching subcolumn values (tags), the corresponding reminder-details rows will be fetched.

Is a supercolumn a preferable choice for this? Can there be a better schema than this?

-Aditya Narayan

On Wed, Feb 2, 2011 at 8:54 PM, William R Speirs <bill.spe...@gmail.com> wrote:

> That seems like a solid schema to me.
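The tag-filtered read described in this thread can be sketched as a small function (a toy model, not the Cassandra API): take the latest ~50 entries from the user's timeline row, keep the ones whose tags intersect the user's chosen tags, and return their reminder ids for the follow-up row lookups. All names here are illustrative:

```python
def filter_timeline(timeline_row, chosen_tags, limit=50):
    # timeline_row: chronologically ordered list of (reminder_id, tags)
    # pairs, standing in for the supercolumn-name / subcolumn layout.
    latest = timeline_row[-limit:]  # the most recent `limit` supercolumns
    chosen = set(chosen_tags)
    # Keep a reminder if any of its tags matches the user's filter.
    return [rid for rid, tags in latest if chosen & set(tags)]
```

The ids returned would then drive the second round of lookups against the reminder-details column family.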
Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?
Any time I see/hear "a single row containing all ..." I get nervous. That single row is going to reside on a single node, and that is potentially a lot of load (I don't know your system) for that single node. Why wouldn't you split it by at least user?

If it won't be a lot of load, then why are you using Cassandra? This seems like something that could easily fit into an SQL/relational style DB. If it's too much data (millions of users, 100s of millions of reminders) for a standard SQL/relational model, then it's probably too much for a single row.

I'm not familiar with the TTL functionality of Cassandra... sorry, I cannot help/comment there; still learning :-)

Yes, my $0.02 is that this is an effective way to leverage super columns.

Bill-

On 02/02/2011 10:43 AM, Aditya Narayan wrote:

> I was thinking of a single row containing all reminders (not split by day).
Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?
You got me wrong perhaps.. I am already splitting the rows on a per-user basis, of course; otherwise the schema wouldn't make sense for my usage. Each row contains only the *reminders of a single user*, sorted in chronological order. The reminder ids are stored as the supercolumn names, and the subcolumns contain the tags for that reminder.

On Wed, Feb 2, 2011 at 9:19 PM, William R Speirs <bill.spe...@gmail.com> wrote:

> Any time I see/hear "a single row containing all ..." I get nervous. [...] Why wouldn't you split it by at least user?
Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?
I did not understand before... sorry.

Again, depending upon how many reminders you have for a single user, this could be a long/wide row. It really comes down to how many reminders we are talking about and how often they will be read/written. While a single row can contain millions (maybe more) of columns, that doesn't mean it's a good idea.

I'm working on a logging system with Cassandra and ran into this same type of problem. Do I put all of the messages for a single system into a single row keyed off that system's name? I quickly came to the answer of no, and now I break my row keys into POSIX_timestamp:system, where my timestamps are buckets of five minutes each. This nicely distributes the load across the nodes in my system.

Bill-

On 02/02/2011 11:18 AM, Aditya Narayan wrote:

> The row contains only *reminders of a single user* sorted in chronological order.
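The five-minute bucketing above is easy to sketch: round the timestamp down to the start of its bucket and compose the row key from the bucket and the system name. A minimal sketch (key format and names are assumptions, following the POSIX_timestamp:system scheme described in the message):

```python
BUCKET_SECONDS = 300  # five-minute buckets, as in the message above

def bucketed_row_key(posix_timestamp: int, system: str) -> str:
    # Round down to the start of the bucket, so all messages that arrive
    # within the same five-minute window share one row.
    bucket = posix_timestamp - (posix_timestamp % BUCKET_SECONDS)
    return f"{bucket}:{system}"
```

Each bucket starts a fresh row, which caps row width and spreads a single system's messages across many rows (and hence nodes) over time.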
Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?
of reminders set by user to be presented to the user, the timeline row of each user would contain the Id/Key(s) (of the Reminder rows) as the supercolumn names and the subcolumns inside that supercolumns could contain the list of tags associated with particular reminder. All tags set at once during first write. The no of tags(subcolumns) will be around 8 maximum. Any comments, suggestions and feedback on the schema design are requested.. Thanks Aditya Narayan On Wed, Feb 2, 2011 at 7:49 PM, Aditya Narayanady...@gmail.com wrote: Hey all, I need to store supercolumns each with around 8 subcolumns; All the data for a supercolumn is written at once and all subcolumns need to be retrieved together. The data in each subcolumn is not big, it just contains keys to other rows. Would it be preferred to have a supercolumn family or just a standard column family containing all the subcolumns data serialized in single column(s) ? Thanks Aditya Narayan
Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?
On Wed, Feb 2, 2011 at 3:27 PM, Aditya Narayan ady...@gmail.com wrote: Can I have some more feedback about my schema, perhaps somewhat more critical/harsh?

It sounds reasonable to me. Since you're writing/reading all of the subcolumns at the same time, I would opt for a standard column family with the tags serialized into a column value. I don't think you need to worry about row lengths here.

Depending on the reminder size and how many times it's likely to be repeated in the timeline, you could explore denormalizing a bit more by storing the reminders in the timelines themselves, perhaps with a separate row per (user, tag) combination. This would cut down on your seeks quite a bit, but it may not be necessary at this point (or at all).

-- Tyler Hobbs Software Engineer, DataStax http://datastax.com/ Maintainer of the pycassa http://github.com/pycassa/pycassa Cassandra Python client library
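Tyler's suggestion of a standard column family with the tags serialized into a single column value could look roughly like this. A minimal sketch; using JSON as the serialization is an assumption, any compact encoding would do:

```python
import json

def pack_tags(tags):
    """Serialize the reminder's tag list into one column value."""
    return json.dumps(sorted(tags))

def unpack_tags(value):
    """Deserialize for client-side filtering after the read."""
    return json.loads(value)

# One column per reminder in the timeline row; the value holds all ~8 tags.
column_value = pack_tags(["urgent", "work"])
tags = unpack_tags(column_value)
```

Since all subcolumns were written and read together anyway, collapsing them into one value trades per-subcolumn addressing for a simpler standard CF.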
Re: Schema Design
fwiw, https://github.com/thobbs/logsandra is more recent.

2011/1/30 aaron morton aa...@thelastpickle.com: This project may be what you are looking for, or provide some inspiration: https://github.com/jbohman/logsandra Cloud Kick has an example of rolling up time series data: https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/

The schema below sounds reasonable. If you will always bring back the entire log record, consider using a Standard CF rather than a Super CF. Then pack the log message using your favourite serialisation, e.g. JSON. Hope that helps. Aaron

On 27 Jan 2011, at 16:26, Wangpei (Peter) wrote: I am also working on a system storing logs from hundreds of systems. In my scenario, most queries will look like this: let's look at the login logs (category EQ) of that proxy (host EQ) between this Monday and Wednesday (time range). My data model is like this:

- Only 1 CF; that's enough for this scenario.
- Group the logs from each host and day into one row. The key format is hostname.category.date.
- Store each log entry as a super column; the super column name is the TimeUUID of the log, and each attribute is a column.

Then this query can be done as 3 GETs; no need to do a key range scan. So I can use RP instead of OPP. If I used OPP, I would have to worry about load balancing myself. I hate that. However, if I need to do a time range access, I can still use a column slice. An additional benefit is that I can clean old logs very easily: we only store logs for 1 year, and just deleting by keys does this job well.

I think storing all logs for a host in a single row is not a good choice, for 2 reasons: 1) too few keys, so your data will not distribute well; 2) the data under a key will always increase, so Cassandra has to do more SSTable compaction.

-----Original Message----- From: William R Speirs [mailto:bill.spe...@gmail.com] Sent: January 27, 2011 9:15 To: user@cassandra.apache.org Subject: Re: Schema Design

It makes sense that the single row for a system (with a growing number of columns) will reside on a single machine.
With that in mind, here is my updated schema:

- A single column family for all the messages. The row keys will be the TimeUUID of the message, with the following columns: date/time (in UTC POSIX), system name/id (with an index for fast/easy gets), and the actual message payload.
- A column family for each system. The row keys will be UTC POSIX time with 1-second (maybe 1-minute) bucketing, and the column names will be the TimeUUIDs of any messages that were logged during that time bucket.

My only hesitation with this design is that buddhasystem warned that each column family is allocated a piece of memory on the server. I'm not sure what the implications of this are and/or if this would be a problem if I had a number of systems on the order of hundreds. Thanks... Bill-

On 01/26/2011 06:51 PM, Shu Zhang wrote: Each row can have a maximum of 2 billion columns, which a logging system will probably hit eventually. More importantly, you'll only have 1 row per set of system logs. Every row is stored on the same machine(s), which means you'll definitely not be able to distribute your load very well.

From: Bill Speirs [bill.spe...@gmail.com] Sent: Wednesday, January 26, 2011 1:23 PM To: user@cassandra.apache.org Subject: Re: Schema Design

I like this approach, but I have 2 questions: 1) What are the implications of continually adding columns to a single row? I'm unsure how Cassandra is able to grow. I realize you can have a virtually infinite number of columns, but what are the implications of growing the number of columns over time? 2) Maybe it's just a restriction of the CLI, but how do I issue a slice request? Also, what if the start (or end) columns don't exist? I'm guessing it's smart enough to get the columns in that range. Thanks! Bill-

On Wed, Jan 26, 2011 at 4:12 PM, David McNelis dmcne...@agentisenergy.com wrote: I would say in that case you might want to try a single column family where the key to the column is the system name.
Then, you could name your columns as the timestamp. Then when retrieving information from the data store you can, in your slice request, specify your start column as X and your end column as Y, and use the stored column name to know when an event occurred.

On Wed, Jan 26, 2011 at 2:56 PM, Bill Speirs bill.spe...@gmail.com wrote: I'm looking to use Cassandra to store log messages from various systems. A log message only has a message (UTF8Type) and a date/time. My thought is to create a column family for each system. The row key will be a TimeUUIDType. Each row will have 7 columns: year, month, day, hour, minute, second, and message. I then have indexes set up for each of the date/time columns. I was hoping this would allow me to answer queries like: what are all the log messages that were generated between X and Y? The problem is that I can
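Wangpei's hostname.category.date row-key scheme quoted above turns a host/category/date-range query into a handful of direct GETs. A minimal sketch of the key construction; the %Y%m%d date encoding is an assumption, the key format itself is from the message:

```python
from datetime import date, timedelta

def row_keys(host, category, start, end):
    """One hostname.category.date key per day in [start, end]."""
    days = (end - start).days
    return ["%s.%s.%s" % (host, category, (start + timedelta(days=d)).strftime("%Y%m%d"))
            for d in range(days + 1)]

# Monday through Wednesday -> 3 row keys, i.e. 3 GETs, no key range scan.
keys = row_keys("proxy01", "login", date(2011, 1, 24), date(2011, 1, 26))
```

Because each key is computed, the random partitioner works fine here; only the column slice within a row needs time ordering.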
Schema Design
I'm looking to use Cassandra to store log messages from various systems. A log message only has a message (UTF8Type) and a date/time. My thought is to create a column family for each system. The row key will be a TimeUUIDType. Each row will have 7 columns: year, month, day, hour, minute, second, and message. I then have indexes set up for each of the date/time columns. I was hoping this would allow me to answer queries like: what are all the log messages that were generated between X and Y?

The problem is that I can ONLY use the equals operator on these column values. For example, issuing: get system_x where month > 1; gives me this error: No indexed columns present in index clause with operator EQ. The equals operator works as expected though: get system_x where month = 1;

What schema would allow me to get date ranges? Thanks in advance... Bill-

* ColumnFamily description *
ColumnFamily: system_x_msg
Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
Row cache size / save period: 0.0/0
Key cache size / save period: 20.0/3600
Memtable thresholds: 1.1671875/249/60
GC grace seconds: 864000
Compaction min/max thresholds: 4/32
Read repair chance: 1.0
Built indexes: [proj_1_msg.646179, proj_1_msg.686f7572, proj_1_msg.6d696e757465, proj_1_msg.6d6f6e7468, proj_1_msg.7365636f6e64, proj_1_msg.79656172]
Column Metadata:
Column Name: year (year) Validation Class: org.apache.cassandra.db.marshal.IntegerType Index Type: KEYS
Column Name: month (month) Validation Class: org.apache.cassandra.db.marshal.IntegerType Index Type: KEYS
Column Name: second (second) Validation Class: org.apache.cassandra.db.marshal.IntegerType Index Type: KEYS
Column Name: minute (minute) Validation Class: org.apache.cassandra.db.marshal.IntegerType Index Type: KEYS
Column Name: hour (hour) Validation Class: org.apache.cassandra.db.marshal.IntegerType Index Type: KEYS
Column Name: day (day) Validation Class: org.apache.cassandra.db.marshal.IntegerType Index Type: KEYS
Re: Schema Design
I would say in that case you might want to try a single column family where the key to the column is the system name. Then, you could name your columns as the timestamp. Then when retrieving information from the data store you can, in your slice request, specify your start column as X and your end column as Y, and use the stored column name to know when an event occurred.

On Wed, Jan 26, 2011 at 2:56 PM, Bill Speirs bill.spe...@gmail.com wrote: I'm looking to use Cassandra to store log messages from various systems. A log message only has a message (UTF8Type) and a date/time. My thought is to create a column family for each system. The row key will be a TimeUUIDType. Each row will have 7 columns: year, month, day, hour, minute, second, and message. I then have indexes set up for each of the date/time columns. I was hoping this would allow me to answer queries like: what are all the log messages that were generated between X and Y? The problem is that I can ONLY use the equals operator on these column values. For example, issuing: get system_x where month > 1; gives me this error: No indexed columns present in index clause with operator EQ. The equals operator works as expected though: get system_x where month = 1; What schema would allow me to get date ranges? Thanks in advance...
Bill- * ColumnFamily description * ColumnFamily: system_x_msg Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type Row cache size / save period: 0.0/0 Key cache size / save period: 20.0/3600 Memtable thresholds: 1.1671875/249/60 GC grace seconds: 864000 Compaction min/max thresholds: 4/32 Read repair chance: 1.0 Built indexes: [proj_1_msg.646179, proj_1_msg.686f7572, proj_1_msg.6d696e757465, proj_1_msg.6d6f6e7468, proj_1_msg.7365636f6e64, proj_1_msg.79656172] Column Metadata: Column Name: year (year) Validation Class: org.apache.cassandra.db.marshal.IntegerType Index Type: KEYS Column Name: month (month) Validation Class: org.apache.cassandra.db.marshal.IntegerType Index Type: KEYS Column Name: second (second) Validation Class: org.apache.cassandra.db.marshal.IntegerType Index Type: KEYS Column Name: minute (minute) Validation Class: org.apache.cassandra.db.marshal.IntegerType Index Type: KEYS Column Name: hour (hour) Validation Class: org.apache.cassandra.db.marshal.IntegerType Index Type: KEYS Column Name: day (day) Validation Class: org.apache.cassandra.db.marshal.IntegerType Index Type: KEYS -- *David McNelis* Lead Software Engineer Agentis Energy www.agentisenergy.com o: 630.359.6395 c: 219.384.5143 *A Smart Grid technology company focused on helping consumers of energy control an often under-managed resource.*
Re: Schema Design
Having separate columns for year, month, etc. seems redundant. It's tons more efficient to keep, say, UTC time in POSIX format (basically an integer); it's easy to convert back and forth. If you want to get a range of dates, you might use the Order Preserving Partitioner and sort out which systems logged later in the client. Read up on the consequences of using OPP. Whether to shard data per system depends on how many you have; if more than a few, don't do that, as there are memory considerations. Cheers, Maxim -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Schema-Design-tp5964167p5964227.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
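Maxim's point about collapsing the six date/time columns into one UTC POSIX integer can be sketched as follows; `calendar.timegm` does the UTC round trip:

```python
import calendar
from datetime import datetime, timezone

def to_posix_utc(dt):
    """Collapse year/month/day/hour/minute/second into one integer column."""
    return calendar.timegm(dt.timetuple())

def from_posix_utc(ts):
    """Convert back for display; the original fields are all recoverable."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)

ts = to_posix_utc(datetime(2011, 1, 26, 21, 56, 0))
roundtrip = from_posix_utc(ts)
```

One integer column sorts and range-compares naturally, which is exactly what the six per-field KEYS indexes could not provide.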
Re: Schema Design
I like this approach, but I have 2 questions: 1) What are the implications of continually adding columns to a single row? I'm unsure how Cassandra is able to grow. I realize you can have a virtually infinite number of columns, but what are the implications of growing the number of columns over time? 2) Maybe it's just a restriction of the CLI, but how do I issue a slice request? Also, what if the start (or end) columns don't exist? I'm guessing it's smart enough to get the columns in that range. Thanks! Bill-

On Wed, Jan 26, 2011 at 4:12 PM, David McNelis dmcne...@agentisenergy.com wrote: I would say in that case you might want to try a single column family where the key to the column is the system name. Then, you could name your columns as the timestamp. Then when retrieving information from the data store you can, in your slice request, specify your start column as X and your end column as Y, and use the stored column name to know when an event occurred.

On Wed, Jan 26, 2011 at 2:56 PM, Bill Speirs bill.spe...@gmail.com wrote: I'm looking to use Cassandra to store log messages from various systems. A log message only has a message (UTF8Type) and a date/time. My thought is to create a column family for each system. The row key will be a TimeUUIDType. Each row will have 7 columns: year, month, day, hour, minute, second, and message. I then have indexes set up for each of the date/time columns. I was hoping this would allow me to answer queries like: what are all the log messages that were generated between X and Y? The problem is that I can ONLY use the equals operator on these column values. For example, issuing: get system_x where month > 1; gives me this error: No indexed columns present in index clause with operator EQ. The equals operator works as expected though: get system_x where month = 1; What schema would allow me to get date ranges? Thanks in advance...
Bill- * ColumnFamily description * ColumnFamily: system_x_msg Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type Row cache size / save period: 0.0/0 Key cache size / save period: 20.0/3600 Memtable thresholds: 1.1671875/249/60 GC grace seconds: 864000 Compaction min/max thresholds: 4/32 Read repair chance: 1.0 Built indexes: [proj_1_msg.646179, proj_1_msg.686f7572, proj_1_msg.6d696e757465, proj_1_msg.6d6f6e7468, proj_1_msg.7365636f6e64, proj_1_msg.79656172] Column Metadata: Column Name: year (year) Validation Class: org.apache.cassandra.db.marshal.IntegerType Index Type: KEYS Column Name: month (month) Validation Class: org.apache.cassandra.db.marshal.IntegerType Index Type: KEYS Column Name: second (second) Validation Class: org.apache.cassandra.db.marshal.IntegerType Index Type: KEYS Column Name: minute (minute) Validation Class: org.apache.cassandra.db.marshal.IntegerType Index Type: KEYS Column Name: hour (hour) Validation Class: org.apache.cassandra.db.marshal.IntegerType Index Type: KEYS Column Name: day (day) Validation Class: org.apache.cassandra.db.marshal.IntegerType Index Type: KEYS -- David McNelis Lead Software Engineer Agentis Energy www.agentisenergy.com o: 630.359.6395 c: 219.384.5143 A Smart Grid technology company focused on helping consumers of energy control an often under-managed resource.
Re: Schema Design
I have a basic understanding of OPP... if most of my messages come within a single hour then a few nodes could be storing all of my values, right? You totally lost me on "whether to shard data as per system"... Is my schema (one column family per system, and row keys as TimeUUIDType) sharding by system? I thought -- probably incorrectly -- that the row keys are used in the sharding process, not column families. Thanks... Bill-

On Wed, Jan 26, 2011 at 4:17 PM, buddhasystem potek...@bnl.gov wrote: Having separate columns for year, month, etc. seems redundant. It's tons more efficient to keep, say, UTC time in POSIX format (basically an integer); it's easy to convert back and forth. If you want to get a range of dates, you might use the Order Preserving Partitioner and sort out which systems logged later in the client. Read up on the consequences of using OPP. Whether to shard data per system depends on how many you have; if more than a few, don't do that, as there are memory considerations. Cheers, Maxim -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Schema-Design-tp5964167p5964227.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
Re: Schema Design
My CLI knowledge sucks so far, so I'll leave that to others; I'm doing most of my reading/writing through a Thrift client (Hector/Java based). As for the implications: as of the latest version of Cassandra there is no theoretical limit to the number of columns that a particular row can hold. Over time you've got a couple of different options. If you're concerned that you'll end up with too many columns to manage, you'd probably want to start thinking about a long-term warehousing strategy for your older records that involves expiring columns that are older than X in your Cassandra cluster. But for the most part you shouldn't *need* to do that.

On Wed, Jan 26, 2011 at 3:23 PM, Bill Speirs bill.spe...@gmail.com wrote: I like this approach, but I have 2 questions: 1) What are the implications of continually adding columns to a single row? I'm unsure how Cassandra is able to grow. I realize you can have a virtually infinite number of columns, but what are the implications of growing the number of columns over time? 2) Maybe it's just a restriction of the CLI, but how do I issue a slice request? Also, what if the start (or end) columns don't exist? I'm guessing it's smart enough to get the columns in that range. Thanks! Bill-

On Wed, Jan 26, 2011 at 4:12 PM, David McNelis dmcne...@agentisenergy.com wrote: I would say in that case you might want to try a single column family where the key to the column is the system name. Then, you could name your columns as the timestamp. Then when retrieving information from the data store you can, in your slice request, specify your start column as X and your end column as Y, and use the stored column name to know when an event occurred.

On Wed, Jan 26, 2011 at 2:56 PM, Bill Speirs bill.spe...@gmail.com wrote: I'm looking to use Cassandra to store log messages from various systems. A log message only has a message (UTF8Type) and a date/time. My thought is to create a column family for each system.
The row key will be a TimeUUIDType. Each row will have 7 columns: year, month, day, hour, minute, second, and message. I then have indexes set up for each of the date/time columns. I was hoping this would allow me to answer queries like: what are all the log messages that were generated between X and Y? The problem is that I can ONLY use the equals operator on these column values. For example, issuing: get system_x where month > 1; gives me this error: No indexed columns present in index clause with operator EQ. The equals operator works as expected though: get system_x where month = 1; What schema would allow me to get date ranges? Thanks in advance... Bill- * ColumnFamily description * ColumnFamily: system_x_msg Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type Row cache size / save period: 0.0/0 Key cache size / save period: 20.0/3600 Memtable thresholds: 1.1671875/249/60 GC grace seconds: 864000 Compaction min/max thresholds: 4/32 Read repair chance: 1.0 Built indexes: [proj_1_msg.646179, proj_1_msg.686f7572, proj_1_msg.6d696e757465, proj_1_msg.6d6f6e7468, proj_1_msg.7365636f6e64, proj_1_msg.79656172] Column Metadata: Column Name: year (year) Validation Class: org.apache.cassandra.db.marshal.IntegerType Index Type: KEYS Column Name: month (month) Validation Class: org.apache.cassandra.db.marshal.IntegerType Index Type: KEYS Column Name: second (second) Validation Class: org.apache.cassandra.db.marshal.IntegerType Index Type: KEYS Column Name: minute (minute) Validation Class: org.apache.cassandra.db.marshal.IntegerType Index Type: KEYS Column Name: hour (hour) Validation Class: org.apache.cassandra.db.marshal.IntegerType Index Type: KEYS Column Name: day (day) Validation Class: org.apache.cassandra.db.marshal.IntegerType Index Type: KEYS -- David McNelis Lead Software Engineer Agentis Energy www.agentisenergy.com o: 630.359.6395 c: 219.384.5143 A Smart Grid technology company focused on helping consumers of energy control an often under-managed
resource. -- *David McNelis* Lead Software Engineer Agentis Energy www.agentisenergy.com o: 630.359.6395 c: 219.384.5143 *A Smart Grid technology company focused on helping consumers of energy control an often under-managed resource.*
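David's warehousing idea above (expire columns older than some cutoff X) can be sketched client-side like this; in practice Cassandra's per-column TTLs at write time achieve the same thing server-side. Column names as POSIX timestamps are an assumption for illustration:

```python
RETENTION_SECONDS = 180 * 24 * 3600  # e.g. keep roughly six months of records

def expired_columns(columns, now):
    """Return the column names (timestamps) that fall outside the retention window."""
    cutoff = now - RETENTION_SECONDS
    return sorted(ts for ts in columns if ts < cutoff)

# Toy row: column name (timestamp) -> message payload.
cols = {100: "old message", 200: "newer message"}
to_delete = expired_columns(cols, now=RETENTION_SECONDS + 150)
```

Batch-deleting the returned column names keeps an ever-growing row bounded, which addresses the "columns always increase" concern raised earlier in the thread.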
Re: Schema Design
One thing you can do is create one CF, then as the row key use the application name + timestamp; with that you can do your range query using OPP, and then store whatever you want in the row. The problem would be if one app generates far more logs than the others. Nicolas Santini

On Thu, Jan 27, 2011 at 10:26 AM, Bill Speirs bill.spe...@gmail.com wrote: I have a basic understanding of OPP... if most of my messages come within a single hour then a few nodes could be storing all of my values, right? You totally lost me on "whether to shard data as per system"... Is my schema (one column family per system, and row keys as TimeUUIDType) sharding by system? I thought -- probably incorrectly -- that the row keys are used in the sharding process, not column families. Thanks... Bill-

On Wed, Jan 26, 2011 at 4:17 PM, buddhasystem potek...@bnl.gov wrote: Having separate columns for year, month, etc. seems redundant. It's tons more efficient to keep, say, UTC time in POSIX format (basically an integer); it's easy to convert back and forth. If you want to get a range of dates, you might use the Order Preserving Partitioner and sort out which systems logged later in the client. Read up on the consequences of using OPP. Whether to shard data per system depends on how many you have; if more than a few, don't do that, as there are memory considerations. Cheers, Maxim -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Schema-Design-tp5964167p5964227.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
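Nicolas's application-name + timestamp row key works with OPP because lexicographic key order then matches time order within one app. A sketch of the key layout; zero-padding the timestamp is an assumption needed to keep string order equal to numeric order:

```python
def row_key(app, posix_ts):
    """Composite key: app name plus zero-padded POSIX timestamp."""
    return "%s:%010d" % (app, posix_ts)

start = row_key("app1", 1296000000)
end = row_key("app1", 1296003600)

# With an order-preserving partitioner, keys are stored in this sorted order,
# so a key range scan from `start` to `end` touches only app1's time window.
all_keys = sorted([row_key("app1", 1296001800), row_key("app2", 1296000500),
                   row_key("app1", 1296000000)])
window = [k for k in all_keys if start <= k <= end]
```

The caveat in the message stands: a chatty app concentrates its keys in one contiguous token range, i.e. a hot spot.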
Re: Schema Design
I used the term "sharding" a bit frivolously, sorry. It's just that splitting semantically homogeneous data among CFs doesn't scale too well, as each CF is allocated a piece of memory on the server. -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Schema-Design-tp5964167p5964326.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
RE: Schema Design
Each row can have a maximum of 2 billion columns, which a logging system will probably hit eventually. More importantly, you'll only have 1 row per set of system logs. Every row is stored on the same machine(s), which means you'll definitely not be able to distribute your load very well.

From: Bill Speirs [bill.spe...@gmail.com] Sent: Wednesday, January 26, 2011 1:23 PM To: user@cassandra.apache.org Subject: Re: Schema Design

I like this approach, but I have 2 questions: 1) What are the implications of continually adding columns to a single row? I'm unsure how Cassandra is able to grow. I realize you can have a virtually infinite number of columns, but what are the implications of growing the number of columns over time? 2) Maybe it's just a restriction of the CLI, but how do I issue a slice request? Also, what if the start (or end) columns don't exist? I'm guessing it's smart enough to get the columns in that range. Thanks! Bill-

On Wed, Jan 26, 2011 at 4:12 PM, David McNelis dmcne...@agentisenergy.com wrote: I would say in that case you might want to try a single column family where the key to the column is the system name. Then, you could name your columns as the timestamp. Then when retrieving information from the data store you can, in your slice request, specify your start column as X and your end column as Y, and use the stored column name to know when an event occurred.

On Wed, Jan 26, 2011 at 2:56 PM, Bill Speirs bill.spe...@gmail.com wrote: I'm looking to use Cassandra to store log messages from various systems. A log message only has a message (UTF8Type) and a date/time. My thought is to create a column family for each system. The row key will be a TimeUUIDType. Each row will have 7 columns: year, month, day, hour, minute, second, and message. I then have indexes set up for each of the date/time columns. I was hoping this would allow me to answer queries like: what are all the log messages that were generated between X and Y?
The problem is that I can ONLY use the equals operator on these column values. For example, issuing: get system_x where month > 1; gives me this error: No indexed columns present in index clause with operator EQ. The equals operator works as expected though: get system_x where month = 1; What schema would allow me to get date ranges? Thanks in advance... Bill- * ColumnFamily description * ColumnFamily: system_x_msg Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type Row cache size / save period: 0.0/0 Key cache size / save period: 20.0/3600 Memtable thresholds: 1.1671875/249/60 GC grace seconds: 864000 Compaction min/max thresholds: 4/32 Read repair chance: 1.0 Built indexes: [proj_1_msg.646179, proj_1_msg.686f7572, proj_1_msg.6d696e757465, proj_1_msg.6d6f6e7468, proj_1_msg.7365636f6e64, proj_1_msg.79656172] Column Metadata: Column Name: year (year) Validation Class: org.apache.cassandra.db.marshal.IntegerType Index Type: KEYS Column Name: month (month) Validation Class: org.apache.cassandra.db.marshal.IntegerType Index Type: KEYS Column Name: second (second) Validation Class: org.apache.cassandra.db.marshal.IntegerType Index Type: KEYS Column Name: minute (minute) Validation Class: org.apache.cassandra.db.marshal.IntegerType Index Type: KEYS Column Name: hour (hour) Validation Class: org.apache.cassandra.db.marshal.IntegerType Index Type: KEYS Column Name: day (day) Validation Class: org.apache.cassandra.db.marshal.IntegerType Index Type: KEYS -- David McNelis Lead Software Engineer Agentis Energy www.agentisenergy.com o: 630.359.6395 c: 219.384.5143 A Smart Grid technology company focused on helping consumers of energy control an often under-managed resource.
Re: Schema Design
It makes sense that the single row for a system (with a growing number of columns) will reside on a single machine. With that in mind, here is my updated schema:

- A single column family for all the messages. The row keys will be the TimeUUID of the message, with the following columns: date/time (in UTC POSIX), system name/id (with an index for fast/easy gets), and the actual message payload.
- A column family for each system. The row keys will be UTC POSIX time with 1-second (maybe 1-minute) bucketing, and the column names will be the TimeUUIDs of any messages that were logged during that time bucket.

My only hesitation with this design is that buddhasystem warned that each column family is allocated a piece of memory on the server. I'm not sure what the implications of this are and/or if this would be a problem if I had a number of systems on the order of hundreds. Thanks... Bill-

On 01/26/2011 06:51 PM, Shu Zhang wrote: Each row can have a maximum of 2 billion columns, which a logging system will probably hit eventually. More importantly, you'll only have 1 row per set of system logs. Every row is stored on the same machine(s), which means you'll definitely not be able to distribute your load very well.

From: Bill Speirs [bill.spe...@gmail.com] Sent: Wednesday, January 26, 2011 1:23 PM To: user@cassandra.apache.org Subject: Re: Schema Design

I like this approach, but I have 2 questions: 1) What are the implications of continually adding columns to a single row? I'm unsure how Cassandra is able to grow. I realize you can have a virtually infinite number of columns, but what are the implications of growing the number of columns over time? 2) Maybe it's just a restriction of the CLI, but how do I issue a slice request? Also, what if the start (or end) columns don't exist? I'm guessing it's smart enough to get the columns in that range. Thanks!
Bill-

On Wed, Jan 26, 2011 at 4:12 PM, David McNelis dmcne...@agentisenergy.com wrote:
I would say in that case you might want to try a single column family where the key to the column is the system name. Then, you could name your columns as the timestamp. When retrieving information from the data store you can, in your slice request, specify your start column as X and end column as Y. Then you can use the stored column name to know when an event occurred.

On Wed, Jan 26, 2011 at 2:56 PM, Bill Speirs bill.spe...@gmail.com wrote:
I'm looking to use Cassandra to store log messages from various systems. A log message only has a message (UTF8Type) and a date/time. My thought is to create a column family for each system. The row key will be a TimeUUIDType. Each row will have 7 columns: year, month, day, hour, minute, second, and message. I then have indexes set up for each of the date/time columns. I was hoping this would allow me to answer queries like: what are all the log messages that were generated between X and Y? The problem is that I can ONLY use the equals operator on these column values. For example, issuing: get system_x where month > 1; gives me this error: No indexed columns present in index clause with operator EQ. The equals operator works as expected though: get system_x where month = 1; What schema would allow me to get date ranges? Thanks in advance...
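The bucketed per-system design proposed above can be sketched in plain Python, with a dict standing in for the per-system column family. The bucket size and helper names here are illustrative, not part of any Cassandra API:

```python
import uuid
from collections import defaultdict

BUCKET_SECONDS = 60  # 1-minute buckets, as proposed above

def bucket_key(posix_ts):
    """Row key for the per-system CF: the timestamp rounded down to its bucket."""
    return int(posix_ts) // BUCKET_SECONDS * BUCKET_SECONDS

# The per-system "column family": bucket row key -> {TimeUUID column -> timestamp}
system_rows = defaultdict(dict)

def log_message(posix_ts):
    """Record one message under its time bucket; the column name is a TimeUUID."""
    msg_id = uuid.uuid1()
    system_rows[bucket_key(posix_ts)][msg_id] = posix_ts
    return msg_id

def messages_between(start_ts, end_ts):
    """Answer 'all messages between X and Y' with one get per bucket row."""
    ids = []
    for key in range(bucket_key(start_ts), bucket_key(end_ts) + 1, BUCKET_SECONDS):
        ids.extend(mid for mid, ts in system_rows.get(key, {}).items()
                   if start_ts <= ts <= end_ts)
    return ids

log_message(1000)
log_message(1030)
log_message(1100)
print(len(messages_between(990, 1060)))  # 2
```

Because the row keys are coarse buckets rather than one row per system, writes spread across the cluster while a range query still touches only a handful of known keys.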
Re: Schema Design
Ah, sweet... thanks for the link! Bill-

On 01/26/2011 08:20 PM, buddhasystem wrote:
Bill, it's all explained here: http://wiki.apache.org/cassandra/MemtableThresholds#JVM_Heap_Size
Watch the number of CFs and the memtable sizes. In my experience, this all matters.
Re: Schema Design
I am also working on a system that stores logs from hundreds of systems. In my scenario, most queries will look like this: show the login logs (category EQ) of that proxy (host EQ) between this Monday and Wednesday (time range). My data model looks like this:
. Only 1 CF; that's enough for this scenario.
. Group the logs from each host and day into one row. The key format is hostname.category.date
. Store each log entry as a super column: the super column name is the TimeUUID of the log, and each attribute is a column.
Then this query can be done as 3 GETs, with no need to do a key range scan, so I can use RP instead of OPP. If I used OPP, I would have to worry about load balancing myself. I hate that. However, if I need to do a time range access, I can still use a column slice. An additional benefit is that I can clean up old logs very easily. We only keep logs for 1 year, so just deleting by keys does this job well.
I think storing all logs for a host in a single row is not a good choice, for 2 reasons: 1) too few keys, so your data will not distribute well; 2) the data under a key will always increase, so Cassandra has to do more SSTable compaction.

-----Original Message-----
From: William R Speirs [mailto:bill.spe...@gmail.com]
Sent: January 27, 2011 9:15
To: user@cassandra.apache.org
Subject: Re: Schema Design

It makes sense that the single row for a system (with a growing number of columns) will reside on a single machine. With that in mind, here is my updated schema:
- A single column family for all the messages. The row keys will be the TimeUUID of the message with the following columns: date/time (in UTC POSIX), system name/id (with an index for fast/easy gets), and the actual message payload.
- A column family for each system. The row keys will be UTC POSIX time with 1 second (maybe 1 minute) bucketing, and the column names will be the TimeUUID of any messages that were logged during that time bucket.
My only hesitation with this design is that buddhasystem warned that each column family is allocated a piece of memory on the server.
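The hostname.category.date key scheme described above can be sketched in Python (the helper names are illustrative). Enumerating the row keys for a Monday-to-Wednesday query shows why it reduces to a few straight GETs:

```python
from datetime import date, timedelta

def row_key(host, category, day):
    """One row per host/category/day, e.g. 'proxy1.login.2011-01-24'."""
    return f"{host}.{category}.{day.isoformat()}"

def keys_for_range(host, category, start, end):
    """Row keys covering [start, end]: one GET per day, no key range scan,
    so RandomPartitioner works and load stays balanced."""
    days = (end - start).days + 1
    return [row_key(host, category, start + timedelta(d)) for d in range(days)]

# Monday 2011-01-24 through Wednesday 2011-01-26 -> exactly 3 GETs
keys = keys_for_range("proxy1", "login", date(2011, 1, 24), date(2011, 1, 26))
print(keys)
```

Expiring old data is equally simple under this scheme: the keys for any day more than a year old can be computed the same way and deleted directly.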
Re: Is this a good schema design to implement a social application..
On Fri, Jan 7, 2011 at 11:38 PM, Rajkumar Gupta rajkumar@gmail.com wrote:
In the twissandra example, http://www.riptano.com/docs/0.6/data_model/twissandra#adding-friends , I find that they have split the materialized view of a user's homepage (like his followers list, tweets from friends) into several column families instead of putting supercolumns inside a single SupercolumnFamily, thereby making the rows skinnier. I was wondering which one will give better read performance. I think skinnier rows will definitely have the advantage of fewer row mutations and thus good read performance, since only the rows that are needed get retrieved, plus supercolumns for the follower list etc. are avoided (this sounds good, as the supercolumn indexing limitations will not suck), but I am still not sure whether splitting the materialized view of a single user into several column families, instead of a single row in a single SupercolumnFamily, would be beneficial in terms of performance numbers.

On Sat, Jan 8, 2011 at 2:05 AM, Rajkumar Gupta rajkumar@gmail.com wrote:
The fact that subcolumns inside the supercolumns aren't currently indexed may suck here, whenever a small number (10-20) of subcolumns needs to be retrieved from a large list of subcolumns of a supercolumn like MyPostsIdKeysList.

On Fri, Jan 7, 2011 at 9:58 PM, Raj rajkumar@gmail.com wrote:
My question is in the context of a social network schema design. I am thinking of the following schema for storing the data a user needs as he logs in and is led to his homepage. (I aimed at a schema design such that a single row read query retrieves all the data required to put up that user's homepage.)
UserSuperColumnFamily: { // Column Family
  UserIDKey: {
    columns: MyName, MyEmail, MyCity, ...etc
    supercolumns: MyFollowersList, MyFollowiesList, MyPostsIdKeysList, MyInterestsList, MyAlbumsIdKeysList, MyVideoIdKeysList, RecentNotificationsForUserList, MessagesReceivedList, MessagesSentList, AccountSettingsList, RecentSelfActivityList, UpdatesFromFollowiesList
  }
}

Thus a user's newsfeed would be generated using the super column UpdatesFromFollowiesList. But UpdatesFromFollowiesList would obviously contain only the Ids of the posts and not the entire post data.

Questions:
1) What could be the problems with this design? Any improvements?
2) Would frequent heavy overwrite operations/row mutations (for example, when propagating post updates for the news feed from some user to all his followies), which lead to rows ultimately being in several SSTables, lead to degraded read performance? Is it suitable to use the row cache (a very big row, but all the data is required while the user is logged in)? If I do not use the cache, it may be very expensive to pull the row each time data is required for the given user, since the row would be in several SSTables. How can I improve the read performance here?

The actual data of the posts from the network would be retrieved using PostIdKey through subsequent read queries from the column family PostsSuperColumnFamily, which would be as follows:

PostsSuperColumnFamily: {
  PostIdKey: {
    columns: PostOwnerId, PostBody
    supercolumns: TagsForPost {list of columns of all tags for the post}, PeopleWhoLikedThisPost {list of columns of the UserIdKey of all the likers}
  }
}

Is this the best design to go with, or are there any issues to consider here? Thanks in anticipation of your valuable comments!

From your description, UserSuperColumnFamily seems to be both a Standard Column Family and a Super Column Family. You cannot do that. However, you can encode things such as MyName, MyCity and MyState into a 'UserInfo' super column: UserInfo:MyState...
(As you mentioned) super columns are not indexed and have to be completely deserialized on each access. Because of this they are not widely used for anything but small keys with a few columns. This also applies to mutations: the row can exist in multiple SSTables until it finally gets compacted, which can result in much more storage used for an object that changes often. Most designs use composite keys, or something like JSON-encoded values with Standard Column Families, to achieve something like a Super Column. (SuperColumns are not always as Super as they seem :)
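The two super-column alternatives mentioned above (composite column names, or a JSON-encoded value) can be sketched like this. The column names and values are illustrative, and a plain dict stands in for a standard column family row:

```python
import json

# A standard column family row: column name (string) -> value (string).
row = {}

# Alternative 1: composite column names, one column per field.
# A column slice on the "UserInfo:" prefix would fetch the whole group.
row["UserInfo:MyState"] = "NY"
row["UserInfo:MyCity"] = "Albany"

# Alternative 2: pack the whole group into one JSON-encoded column value.
row["UserInfo"] = json.dumps({"MyName": "Raj", "MyCity": "Albany", "MyState": "NY"})

# Reading one field from the JSON variant deserializes a single small column,
# rather than forcing a whole super column to be deserialized.
info = json.loads(row["UserInfo"])
print(info["MyState"])  # NY
```

The trade-off: composite names let you read or update one field at a time, while the JSON value is a single read/write but must be rewritten wholesale on any change.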
Re: SQL Server to Cassandra Schema Design - Ideas Anyone?
The main idea is to denormalize your data into multiple CFs at write time so that each CF lets you answer a query from a single row. http://www.rackspacecloud.com/blog/2010/05/12/cassandra-by-example/ is a good place to start.

On Tue, Jun 22, 2010 at 3:43 PM, Craig Faulkner craig.faulk...@wallst.com wrote:
I'm having a little block in converting an existing SQL Server schema that we have into Cassandra Keyspace(s). The whole key-value thing has just not clicked yet. Do any of you know of any good examples that are more complex than the example in the readme file? We are looking to report on web traffic, so things like hits, page views, unique visitors... all the basic web stuff. I'm very sure that one of you, likely many more, is already doing this. Here are two queries just to give you a few key words related to the metrics that we want to move into Cassandra:

/* Data logged */
select t.Datetime, c.CustomerNumber, ct.cust_type, ws.SiteNumber, ws.SiteName,
  f.Session, wa.page, wa.Note, f.CPUTime, f.DCLWaitTime, f.DCLRequestCount,
  'clientip' = dbo.u_IpInt2Str(ClientIP)
from warehouse.dbo.fact_WebHit f
join Warehouse.dbo.dim_Time t on t.ID = f.TimeID
join Warehouse.dbo.dim_CustomerType ct on ct.ID = f.CustomerTypeID
join Warehouse.dbo.dim_Customer c on c.ID = f.CustomerID
join Warehouse.dbo.dim_Symbol s on s.ID = f.SymbolID
join Warehouse.dbo.dim_WebAction wa on wa.ID = f.WebActionID
join Warehouse.dbo.dim_WebSite ws on ws.ID = f.WebSiteID

/* Data with surrogate keys */
select f.Timeid, f.CustomerID, f.CustomerTypeID, f.WebSiteID,
  f.Session, f.WebActionID, f.CPUTime,
  f.DCLWaitTime, f.DCLRequestCount, ClientIP
from warehouse.dbo.fact_WebHit f

Any good info would be appreciated. I have of course checked the main web sites but I could have missed something along the way. Craig

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
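The denormalize-at-write-time advice above can be sketched in Python, with dicts standing in for column families (CF and field names are illustrative, modeled loosely on the web-traffic queries in the question):

```python
from collections import defaultdict

# One "CF" per query, each answerable from a single row -- no joins at read time.
hits_by_site_day = defaultdict(list)   # query: all hits for a site on a day
hits_by_customer = defaultdict(list)   # query: all activity for a customer

def record_hit(site, customer, ts, page):
    """Write the same event into every CF that some query will later read."""
    day = ts[:10]  # bucket by calendar day
    hit = {"customer": customer, "ts": ts, "page": page}
    hits_by_site_day[f"{site}|{day}"].append(hit)
    hits_by_customer[customer].append(hit)

record_hit("siteA", "cust42", "2010-06-22T15:43:00", "/home")
record_hit("siteA", "cust42", "2010-06-22T15:44:10", "/quote")

# Each question the SQL joins answered is now a single row lookup:
print(len(hits_by_site_day["siteA|2010-06-22"]))  # 2
print(len(hits_by_customer["cust42"]))            # 2
```

The cost is extra writes and storage per event; the payoff is that every report reads one row by key, which is the access pattern Cassandra is built for.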
SQL Server to Cassandra Schema Design - Ideas Anyone?
I'm having a little block in converting an existing SQL Server schema that we have into Cassandra Keyspace(s). The whole key-value thing has just not clicked yet. Do any of you know of any good examples that are more complex than the example in the readme file? We are looking to report on web traffic, so things like hits, page views, unique visitors... all the basic web stuff. I'm very sure that one of you, likely many more, is already doing this. Here are two queries just to give you a few key words related to the metrics that we want to move into Cassandra:

/* Data logged */
select t.Datetime, c.CustomerNumber, ct.cust_type, ws.SiteNumber, ws.SiteName,
  f.Session, wa.page, wa.Note, f.CPUTime, f.DCLWaitTime, f.DCLRequestCount,
  'clientip' = dbo.u_IpInt2Str(ClientIP)
from warehouse.dbo.fact_WebHit f
join Warehouse.dbo.dim_Time t on t.ID = f.TimeID
join Warehouse.dbo.dim_CustomerType ct on ct.ID = f.CustomerTypeID
join Warehouse.dbo.dim_Customer c on c.ID = f.CustomerID
join Warehouse.dbo.dim_Symbol s on s.ID = f.SymbolID
join Warehouse.dbo.dim_WebAction wa on wa.ID = f.WebActionID
join Warehouse.dbo.dim_WebSite ws on ws.ID = f.WebSiteID

/* Data with surrogate keys */
select f.Timeid, f.CustomerID, f.CustomerTypeID, f.WebSiteID,
  f.Session, f.WebActionID, f.CPUTime,
  f.DCLWaitTime, f.DCLRequestCount, ClientIP
from warehouse.dbo.fact_WebHit f

Any good info would be appreciated. I have of course checked the main web sites but I could have missed something along the way. Craig