Re: Dropping & Creating Column Families Never Returns
What would/could take so long for the nodes to agree? It's a small cluster (7 nodes), all on a local LAN and not being used by anything else. I think a delete & refresh might be in order...

Thanks!

Bill-

On 02/15/2011 09:13 PM, Jonathan Ellis wrote:

"command never returns" means "it's waiting for the nodes to agree on the new schema version." Bad Mojo will ensue if you issue more schema updates anyway.

On Tue, Feb 15, 2011 at 3:46 PM, Bill Speirs wrote:

Has anyone ever tried to drop a column family and/or create one and have the command not return from the cli? I'm using 0.7.1 and I tried to drop a column family and the command never returned. However, on another node it showed it was gone. I Ctrl-C'd out of the command, then issued a create for a column family of the same name, different schema. That command never returned, but again on another host it showed it was there. I went to describe and list this column family and got this:

[default@Logging] describe keyspace Logging;
Keyspace: Logging:
  Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
    Replication Factor: 3
  Column Families:
    ColumnFamily: Messages
      Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
      Row cache size / save period: 0.0/0
      Key cache size / save period: 20.0/14400
      Memtable thresholds: 0.5953125/127/60
      GC grace seconds: 864000
      Compaction min/max thresholds: 4/32
      Read repair chance: 1.0
      Built indexes: []
[default@Logging] list Messages;
Messages not found in current keyspace.

Any ideas?

Bill-
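As a rough illustration of what "the nodes agree on the schema version" means, the sketch below checks a mapping of schema version to the hosts reporting it. It assumes the mapping has the shape returned by the Thrift call describe_schema_versions(), with unreachable hosts grouped under the key "UNREACHABLE"; the function name and the sample data are hypothetical, and no client library is used here.

```python
def schema_agreed(versions):
    """Return True when every reachable node reports the same schema version.

    `versions` maps a schema version id to the list of hosts reporting it.
    Assumption: hosts that could not be contacted are grouped under the
    key "UNREACHABLE" and are ignored.
    """
    live = {v: hosts for v, hosts in versions.items() if v != "UNREACHABLE"}
    return len(live) == 1


# Hypothetical sample: one node still reports an older schema version.
sample = {
    "2a1b": ["10.0.0.1", "10.0.0.2"],
    "9f3c": ["10.0.0.3"],
}
print(schema_agreed(sample))  # False
```

A client could poll a check like this after each schema change and only issue the next change once it returns True.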
Re: postgis > cassandra?
I know nothing about postgis and little about spatial data, but if you're simply talking about data that relates to some latitude & longitude pair, you could have your row key simply be the concatenation of the two: lat:long. Can you provide more details about the type of data you're looking to store?

Thanks...

Bill-

On 02/05/2011 12:22 PM, Sean Ochoa wrote:

Can someone tell me how to represent spatial data (coming from postgis) in Cassandra?

- Sean
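A minimal sketch of the lat:long row key suggested in the reply above. The function name and the fixed number of decimal places are assumptions made for illustration; the thread only specifies the "lat:long" concatenation.

```python
def geo_row_key(lat, lon, precision=4):
    """Build a "lat:long" row key.

    Assumption: coordinates are rounded to a fixed number of decimal
    places so that the same point always produces the same key.
    """
    return f"{lat:.{precision}f}:{lon:.{precision}f}"


print(geo_row_key(47.6062, -122.3321))  # 47.6062:-122.3321
```

A key like this only supports exact-match lookups on a coordinate pair. It does not by itself answer "what is near this point" queries.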
Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?
I did not understand before... sorry. Again, depending upon how many reminders you have for a single user, this could be a long/wide row. Again, it really comes down to how many reminders are we talking about and how often will they be read/written. While a single row can contain millions (maybe more) columns, that doesn't mean it's a good idea. I'm working on a logging system with Cassandra and ran into this same type of problem. Do I put all of the messages for a single system into a single row keyed off that system's name? I quickly came to the answer of "no" and now I break my row keys into POSIX_timestamp:system where my timestamps are buckets for every 5 minutes. This nicely distributes the load across the nodes in my system. Bill- On 02/02/2011 11:18 AM, Aditya Narayan wrote: You got me wrong perhaps.. I am already splitting the row on per user basis ofcourse, otherwise the schema wont make sense for my usage. The row contains only *reminders of a single user* sorted in chronological order. The reminder Id are stored as supercolumn name and subcolumn contain tags for that reminder. On Wed, Feb 2, 2011 at 9:19 PM, William R Speirs wrote: Any time I see/hear "a single row containing all ..." I get nervous. That single row is going to reside on a single node. That is potentially a lot of load (don't know the system) for that single node. Why wouldn't you split it by at least user? If it won't be a lot of load, then why are you using Cassandra? This seems like something that could easily fit into an SQL/relational style DB. If it's too much data (millions of users, 100s of millions of reminders) for a standard SQL/relational model, then it's probably too much for a single row. I'm not familiar with the TTL functionality of Cassandra... sorry cannot help/comment there, still learning :-) Yea, my $0.02 is that this is an effective way to leverage super columns. 
Bill-

On 02/02/2011 10:43 AM, Aditya Narayan wrote:

I think you got exactly what I wanted to convey, except for a few things I want to clarify: I was thinking of a single row containing all reminders (& not split by day). History of the reminders needs to be maintained for some time. After a certain time (say 3 or 6 months) they may be deleted by the TTL facility.

"While presenting the reminders timeline to the user, the latest supercolumns (around 50) from the start_end will be picked up and their subcolumn values will be compared to the tags the user has chosen to see and, corresponding to the filtered subcolumn values (tags), the rows of the reminder details would be picked up."

Is a supercolumn a preferable choice for this? Can there be a better schema than this?

-Aditya Narayan

On Wed, Feb 2, 2011 at 8:54 PM, William R Speirs wrote:

To reiterate, so I know we're both on the same page, your schema would be something like this:

- A column family (as you describe) to store the details of a reminder. One reminder per row. The row key would be a TimeUUID.
- A super column family to store the reminders for each user, for each day. The row key would be something like: MMDD:user_id. The column names would simply be the TimeUUID of the messages. The sub column names would be the tag names of the various reminders.

The idea is that you would then get a slice of each row for a user, for a day, that would only contain sub column names with the tags you're looking for? Then, based upon the column names returned, you'd look up the reminders. That seems like a solid schema to me.

Bill-

On 02/02/2011 09:37 AM, Aditya Narayan wrote:

Actually, I am trying to use Cassandra to display to users of my application the list of all reminders they have set for themselves. I need to store rows containing the timeline of daily reminders put by the users, for themselves, on the application. The reminders need to be presented to the user in chronological order, like a news feed.
Each reminder has got certain tags associated with it(so that, at times, user may also choose to see the reminders filtered by tags in chronological order). So I thought of a schema something like this:- -Each Reminder details may be stored as separate rows in column family. -For presenting the timeline of reminders set by user to be presented to the user, the timeline row of each user would contain the Id/Key(s) (of the Reminder rows) as the supercolumn names and the subcolumns inside that supercolumns could contain the list of tags associated with particular reminder. All tags set at once during first write. The no of tags(subcolumns) will be around 8 maximum. Any comments, suggestions and feedback on the schema design are requested.. Thanks Aditya Narayan On Wed, Feb 2, 2011 at 7:49 PM, Aditya Narayan wrote: Hey all, I need to store supercolumns each with around 8 subcolumns; All the data for a supercolumn is written at on
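The POSIX_timestamp:system row keys with 5-minute buckets, mentioned in the most recent reply of this message, can be sketched as follows. The function name and the sample system name are hypothetical. The key layout and the bucket width come from the message.

```python
BUCKET_SECONDS = 5 * 60  # 5-minute buckets, as described in the message


def bucket_row_key(posix_ts, system):
    """Return the row key "<bucket start>:<system>" for a log message.

    The timestamp is rounded down to the start of its 5-minute bucket,
    so every message from one system within that window shares a row.
    """
    ts = int(posix_ts)
    bucket_start = ts - ts % BUCKET_SECONDS
    return f"{bucket_start}:{system}"


print(bucket_row_key(1296663500, "web01"))  # 1296663300:web01
```

Because the bucket start is part of the key, consecutive buckets for the same system hash to different rows, which is what spreads the write load across nodes.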
Re: Unsubscribe
JJ you need to be sending this to: user-unsubscr...@cassandra.apache.org Cheers... Bill- On 02/02/2011 10:58 AM, JJ wrote: Sent from my iPad
Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?
Any time I see/hear "a single row containing all ..." I get nervous. That single row is going to reside on a single node. That is potentially a lot of load (don't know the system) for that single node. Why wouldn't you split it by at least user? If it won't be a lot of load, then why are you using Cassandra? This seems like something that could easily fit into an SQL/relational style DB. If it's too much data (millions of users, 100s of millions of reminders) for a standard SQL/relational model, then it's probably too much for a single row. I'm not familiar with the TTL functionality of Cassandra... sorry cannot help/comment there, still learning :-) Yea, my $0.02 is that this is an effective way to leverage super columns. Bill- On 02/02/2011 10:43 AM, Aditya Narayan wrote: I think you got it exactly what I wanted to convey except for few things I want to clarify: I was thinking of a single row containing all reminders (& not split by day). History of the reminders need to be maintained for some time. After certain time (say 3 or 6 months) they may be deleted by ttl facility. "While presenting the reminders timeline to the user, latest supercolumns like around 50 from the start_end will be picked up and their subcolumns values will be compared to the Tags user has chosen to see and, corresponding to the filtered subcolumn values(tags), the rows of the reminder details would be picked up.." Is supercolumn a preferable choice for this ? Can there be a better schema than this ? -Aditya Narayan On Wed, Feb 2, 2011 at 8:54 PM, William R Speirs wrote: To reiterate, so I know we're both on the same page, your schema would be something like this: - A column family (as you describe) to store the details of a reminder. One reminder per row. The row key would be a TimeUUID. - A super column family to store the reminders for each user, for each day. The row key would be something like: MMDD:user_id. The column names would simply be the TimeUUID of the messages. 
The sub column names would be the tag names of the various reminders.

The idea is that you would then get a slice of each row for a user, for a day, that would only contain sub column names with the tags you're looking for? Then, based upon the column names returned, you'd look up the reminders. That seems like a solid schema to me.

Bill-

On 02/02/2011 09:37 AM, Aditya Narayan wrote:

Actually, I am trying to use Cassandra to display to users of my application the list of all reminders they have set for themselves. I need to store rows containing the timeline of daily reminders put by the users, for themselves, on the application. The reminders need to be presented to the user in chronological order, like a news feed. Each reminder has certain tags associated with it (so that, at times, the user may also choose to see the reminders filtered by tags in chronological order). So I thought of a schema something like this:

- Each reminder's details may be stored as a separate row in a column family.
- For presenting the timeline of reminders to the user, the timeline row of each user would contain the Id/Key(s) (of the reminder rows) as the supercolumn names, and the subcolumns inside those supercolumns could contain the list of tags associated with a particular reminder. All tags are set at once during the first write. The number of tags (subcolumns) will be around 8 maximum.

Any comments, suggestions and feedback on the schema design are requested.

Thanks
Aditya Narayan

On Wed, Feb 2, 2011 at 7:49 PM, Aditya Narayan wrote:

Hey all, I need to store supercolumns each with around 8 subcolumns; all the data for a supercolumn is written at once and all subcolumns need to be retrieved together. The data in each subcolumn is not big, it just contains keys to other rows. Would it be preferred to have a supercolumn family or just a standard column family containing "all the subcolumns data serialized in single column(s)"?

Thanks
Aditya Narayan
Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?
To reiterate, so I know we're both on the same page, your schema would be something like this: - A column family (as you describe) to store the details of a reminder. One reminder per row. The row key would be a TimeUUID. - A super column family to store the reminders for each user, for each day. The row key would be something like: MMDD:user_id. The column names would simply be the TimeUUID of the messages. The sub column names would be the tag names of the various reminders. The idea is that you would then get a slice of each row for a user, for a day, that would only contain sub column names with the tags you're looking for? Then based upon the column names returned, you'd look-up the reminders. That seems like a solid schema to me. Bill- On 02/02/2011 09:37 AM, Aditya Narayan wrote: Actually, I am trying to use Cassandra to display to users on my applicaiton, the list of all Reminders set by themselves for themselves, on the application. I need to store rows containing the timeline of daily Reminders put by the users, for themselves, on application. The reminders need to be presented to the user in a chronological order like a news feed. Each reminder has got certain tags associated with it(so that, at times, user may also choose to see the reminders filtered by tags in chronological order). So I thought of a schema something like this:- -Each Reminder details may be stored as separate rows in column family. -For presenting the timeline of reminders set by user to be presented to the user, the timeline row of each user would contain the Id/Key(s) (of the Reminder rows) as the supercolumn names and the subcolumns inside that supercolumns could contain the list of tags associated with particular reminder. All tags set at once during first write. The no of tags(subcolumns) will be around 8 maximum. Any comments, suggestions and feedback on the schema design are requested.. 
Thanks Aditya Narayan On Wed, Feb 2, 2011 at 7:49 PM, Aditya Narayan wrote: Hey all, I need to store supercolumns each with around 8 subcolumns; All the data for a supercolumn is written at once and all subcolumns need to be retrieved together. The data in each subcolumn is not big, it just contains keys to other rows. Would it be preferred to have a supercolumn family or just a standard column family containing "all the subcolumns data serialized in single column(s) " ? Thanks Aditya Narayan
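The two-part schema restated at the top of this message can be modelled in memory to show the read path: slice the per-user, per-day timeline row, keep the entries whose tags match, then look up the reminder rows. This is only a sketch with plain dictionaries. The function names, the string ids standing in for TimeUUIDs, and the sample data are assumptions, and no Cassandra client is involved.

```python
from collections import defaultdict

# Column family: one row per reminder, keyed by a TimeUUID
# (plain strings stand in for TimeUUIDs in this sketch).
reminders = {}

# Super column family: row key "MMDD:user_id"
#   -> {reminder id (super column name) -> {tag (sub column name) -> ""}}
timeline = defaultdict(dict)


def add_reminder(reminder_id, user_id, mmdd, text, tags):
    """Write the reminder row and its entry in the user's daily timeline."""
    reminders[reminder_id] = {"text": text}
    timeline[f"{mmdd}:{user_id}"][reminder_id] = {tag: "" for tag in tags}


def reminders_for(user_id, mmdd, wanted_tags):
    """Return reminder texts for one user and day that carry a wanted tag."""
    row = timeline.get(f"{mmdd}:{user_id}", {})
    wanted = set(wanted_tags)
    # sorted() stands in for the comparator ordering of super column names.
    ids = [rid for rid, tags in sorted(row.items()) if wanted.intersection(tags)]
    return [reminders[rid]["text"] for rid in ids]


add_reminder("uuid-1", "u42", "0202", "Call dentist", ["health"])
add_reminder("uuid-2", "u42", "0202", "Pay rent", ["money", "home"])
print(reminders_for("u42", "0202", ["money"]))  # ['Pay rent']
```

The second lookup (ids to reminder rows) is the extra round trip this design accepts in exchange for keeping the timeline rows small.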
Re: cassandra as session store
I'm still very new to Cassandra, but when I started reading about it the first thing I thought about was a session store. It's based (in part, from what I understand) on Dynamo, which is (again, I could be wrong) used at Amazon as the session store for your shopping cart. So I would certainly reach for Cassandra if I needed a reliable distributed session store.

Bill-

On 02/01/2011 03:24 PM, Sasha Dolgy wrote:

What I'm still unclear about, and where I think this is suitable, is Cassandra being used as a data warehouse for current and past sessions tied to a user. Yes, other things are great for session management, but I want to provide near real time session information to my users ... quick and simple, and I want to use Cassandra ... surely I can't be that bad for thinking this is a good idea?

-sd

On Tue, Feb 1, 2011 at 9:20 PM, Kallin Nagelberg <kallin.nagelb...@gmail.com> wrote:

nvm on the persistence, it seems like it does support it: 'Since version 1.1 the safer alternative is an append-only file (a journal) that is written as operations modifying the dataset in memory are processed. Redis is able to rewrite the append-only file in the background in order to avoid an indefinite growth of the journal.'

This thread probably shouldn't digress too much from Cassandra's suitability for session management though..
Re: Is it recommended to store two types of data (not related to each other but need to be retrieved together) in one super column family ?
I'm very new to Cassandra, but I'll pitch in my $0.02. Row look-ups are super fast, so why do you think it would be more efficient to store these two rows "together" in the super column method you describe? Why would you not just look up the rows, one after the other?

If I understand correctly, you have post_ids, user_ids, and groups. A group contains user_ids (people posting to the group) and post_ids (posts made to that group)? So I write a post to a group. You'd add this post_id to the row holding all the posts for this group (this might be bad if the number of posts/columns grows huge). You'd then have another row associated with the group where you'd insert my user_id. Am I close to what you want?

If you can give a more concrete example, I (or someone more familiar with Cassandra) could give you more help on designing a schema.

Bill-

On 01/29/2011 01:48 PM, Ertio Lew wrote:

Could someone please point me in the right direction by commenting on the above ideas?

On Fri, Jan 28, 2011 at 11:50 PM, Ertio Lew <ertio...@gmail.com> wrote:

Hi, I have two kinds of data that I would like to fit in one super column family; I am trying this in order to implement fast database retrievals by combining the data of two rows into just one row.

The first kind of data, in the supercolumn family, is named with timeUUIDs as supercolumn names; think of this as the postIds of posts in a Group. These posts will need to be sorted by time (so that the list of latest posts is retrieved). Thus each post has one supercolumn, with name as (timeUUID+userID) and sorted by timeUUIDtype.

The second kind of data would be just a single supercolumn containing columns of the userId of all members in a group (very small). (The number of members in a group will be around 40-50 max.) The name of this single supercolumn may be chosen suitably (perhaps the max. time in future) so as to keep this supercolumn at the beginning.
(The supercolumns are required as we need to store some additional data in the columns of 1st kind of data). So is it recommended to store these two types of data (not related to each other but need to be retrieved together) in one super column family ?
Re: Schema Design
Ah, sweet... thanks for the link! Bill- On 01/26/2011 08:20 PM, buddhasystem wrote: Bill, it's all explained here: http://wiki.apache.org/cassandra/MemtableThresholds#JVM_Heap_Size,the Watch the number of CFs and the memtable sizes. In my experience, this all matters.
Re: Schema Design
It makes sense that the single row for a system (with a growing number of columns) will reside on a single machine. With that in mind, here is my updated schema:

- A single column family for all the messages. The row keys will be the TimeUUID of the message, with the following columns: date/time (in UTC POSIX), system name/id (with an index for fast/easy gets), and the actual message payload.
- A column family for each system. The row keys will be UTC POSIX time with 1 second (maybe 1 minute) bucketing, and the column names will be the TimeUUID of any messages that were logged during that time bucket.

My only hesitation with this design is that buddhasystem warned that each column family "is allocated a piece of memory on the server." I'm not sure what the implications of this are and/or if this would be a problem if I had a number of systems on the order of hundreds.

Thanks...

Bill-

On 01/26/2011 06:51 PM, Shu Zhang wrote:

Each row can have a maximum of 2 billion columns, which a logging system will probably hit eventually. More importantly, you'll only have 1 row per set of system logs. Every row is stored on the same machine(s), which means you'll definitely not be able to distribute your load very well.

From: Bill Speirs [bill.spe...@gmail.com]
Sent: Wednesday, January 26, 2011 1:23 PM
To: user@cassandra.apache.org
Subject: Re: Schema Design

I like this approach, but I have 2 questions:

1) What are the implications of continually adding columns to a single row? I'm unsure how Cassandra is able to grow. I realize you can have a virtually infinite number of columns, but what are the implications of growing the number of columns over time?
2) Maybe it's just a restriction of the CLI, but how do I issue a slice request? Also, what if the start (or end) columns don't exist? I'm guessing it's smart enough to get the columns in that range.

Thanks!
Bill-

On Wed, Jan 26, 2011 at 4:12 PM, David McNelis wrote:

I would say in that case you might want to try a single column family where the key to the column is the system name. Then, you could name your columns as the timestamp. Then, when retrieving information from the data store, you can, in your slice request, specify your start column as X and end column as Y. Then you can use the stored column name to know when an event occurred.

On Wed, Jan 26, 2011 at 2:56 PM, Bill Speirs wrote:

I'm looking to use Cassandra to store log messages from various systems. A log message only has a message (UTF8Type) and a date/time. My thought is to create a column family for each system. The row key will be a TimeUUIDType. Each row will have 7 columns: year, month, day, hour, minute, second, and message. I then have indexes set up for each of the date/time columns.

I was hoping this would allow me to answer queries like: "What are all the log messages that were generated between X & Y?" The problem is that I can ONLY use the equals operator on these column values. For example, issuing:

get system_x where month > 1;

gives me this error: "No indexed columns present in index clause with operator EQ." The equals operator works as expected though:

get system_x where month = 1;

What schema would allow me to get date ranges? Thanks in advance...
Bill-

* ColumnFamily description *

ColumnFamily: system_x_msg
  Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
  Row cache size / save period: 0.0/0
  Key cache size / save period: 20.0/3600
  Memtable thresholds: 1.1671875/249/60
  GC grace seconds: 864000
  Compaction min/max thresholds: 4/32
  Read repair chance: 1.0
  Built indexes: [proj_1_msg.646179, proj_1_msg.686f7572, proj_1_msg.6d696e757465, proj_1_msg.6d6f6e7468, proj_1_msg.7365636f6e64, proj_1_msg.79656172]
  Column Metadata:
    Column Name: year (year)
      Validation Class: org.apache.cassandra.db.marshal.IntegerType
      Index Type: KEYS
    Column Name: month (month)
      Validation Class: org.apache.cassandra.db.marshal.IntegerType
      Index Type: KEYS
    Column Name: second (second)
      Validation Class: org.apache.cassandra.db.marshal.IntegerType
      Index Type: KEYS
    Column Name: minute (minute)
      Validation Class: org.apache.cassandra.db.marshal.IntegerType
      Index Type: KEYS
    Column Name: hour (hour)
      Validation Class: org.apache.cassandra.db.marshal.IntegerType
      Index Type: KEYS
    Column Name: day (day)
      Validation Class: org.apache.cassandra.db.marshal.IntegerType
      Index Type: KEYS

--
David McNelis
Lead Software Engineer
Agentis Energy
www.agentisenergy.com
o: 630.359.6395 c: 219.384.5143

A Smart Grid technology company focused on helping consumers of energy control an often under-managed resource.
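The slice request suggested earlier in this thread (columns named by timestamp, with a start column X and an end column Y) can be pictured as a range lookup over sorted column names. The sketch below uses a sorted Python list in place of a row, and the function name and sample timestamps are hypothetical. It also illustrates the question about missing endpoints: the start and end values only bound the range and need not exist as columns.

```python
from bisect import bisect_left, bisect_right


def slice_columns(sorted_names, start, end):
    """Return the column names in the inclusive range [start, end].

    `sorted_names` must already be in ascending order, as columns are
    kept sorted within a row by the column family's comparator.
    """
    lo = bisect_left(sorted_names, start)
    hi = bisect_right(sorted_names, end)
    return sorted_names[lo:hi]


# Hypothetical row: column names are POSIX timestamps of log messages.
row = [100, 160, 220, 300]
print(slice_columns(row, 150, 250))  # [160, 220]
```

This is why a column-name range works for "between X & Y" queries where the secondary indexes on year, month, and so on only accepted the equals operator.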