Re: Dropping & Creating Column Families Never Returns

2011-02-15 Thread William R Speirs
What would/could take so long for the nodes to agree? It's a small cluster (7 
nodes) all on local LAN and not being used by anything else.


I think a delete & refresh might be in order...

Thanks!

Bill-

On 02/15/2011 09:13 PM, Jonathan Ellis wrote:

"command never returns" means "it's waiting for the nodes to agree on
the new schema version."  Bad Mojo will ensue if you issue more schema
updates anyway.
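
A minimal sketch of the "wait for agreement" check described above. It assumes you already have a mapping from schema version to the hosts reporting it (the shape the Thrift call describe_schema_versions is understood to return, with unreachable nodes grouped under an "UNREACHABLE" key). The function name and the sample addresses are hypothetical.

```python
def schema_in_agreement(versions):
    """Return True when every reachable node reports one schema version.

    `versions` maps a schema version string to the list of hosts on it.
    Assumption: unreachable hosts are listed under the key "UNREACHABLE"
    and are ignored here; a stricter check could treat them as a failure.
    """
    live = {v: hosts for v, hosts in versions.items()
            if v != "UNREACHABLE" and hosts}
    return len(live) == 1


# Hypothetical sample data for a three-node cluster.
agreed = {"a1b2": ["10.0.0.1", "10.0.0.2", "10.0.0.3"]}
split = {"a1b2": ["10.0.0.1", "10.0.0.2"], "c3d4": ["10.0.0.3"]}

print(schema_in_agreement(agreed))
print(schema_in_agreement(split))
```

Polling a check like this before issuing the next schema update is one way to avoid stacking changes on a cluster that has not converged yet.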

On Tue, Feb 15, 2011 at 3:46 PM, Bill Speirs  wrote:

Has anyone ever tried to drop a column family and/or create one and
have the command not return from the cli? I'm using 0.7.1 and I tried
to drop a column family and the command never returned. However, on
another node it showed it was gone. I Ctrl-C out of the command, then
issued a create for a column family of the same name, different
schema. That command never returned, but again in other host it showed
it was there. I went to describe and list this column family and got
this:

[default@Logging] describe keyspace Logging;
Keyspace: Logging:
  Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
    Replication Factor: 3
  Column Families:
    ColumnFamily: Messages
      Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
      Row cache size / save period: 0.0/0
      Key cache size / save period: 20.0/14400
      Memtable thresholds: 0.5953125/127/60
      GC grace seconds: 864000
      Compaction min/max thresholds: 4/32
      Read repair chance: 1.0
      Built indexes: []
[default@Logging] list Messages;
Messages not found in current keyspace.


Any ideas?

Bill-







Re: postgis > cassandra?

2011-02-05 Thread William R Speirs
I know nothing about postgis and little about spatial data, but if you're simply 
talking about data that relates to some latitude & longitude pair, you could 
have your row key simply be the concatenation of the two: lat:long.
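
A rough sketch of that concatenated key, assuming the coordinates arrive as floats. The function name and the choice of four decimal places are illustrative assumptions, not anything postgis or Cassandra prescribes.

```python
def geo_row_key(lat, lon, places=4):
    """Build a "lat:long" row key with a fixed number of decimal places.

    Fixing the precision keeps keys for the same point identical even if
    the inputs carry different amounts of floating-point noise.
    """
    return f"{lat:.{places}f}:{lon:.{places}f}"


print(geo_row_key(47.6062, -122.3321))
```

Note that a key like this only supports exact look-ups of a point; it does not by itself give you "everything near here" queries.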


Can you provide more details about the type of data you're looking to store?

Thanks...

Bill-

On 02/05/2011 12:22 PM, Sean Ochoa wrote:

Can someone tell me how to represent spatial data (coming from postgis) in
Cassandra?

  - Sean


Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?

2011-02-02 Thread William R Speirs

I did not understand before... sorry.

Again, depending upon how many reminders you have for a single user, this could 
be a long/wide row. Again, it really comes down to how many reminders are we 
talking about and how often will they be read/written. While a single row can 
contain millions (maybe more) columns, that doesn't mean it's a good idea.


I'm working on a logging system with Cassandra and ran into this same type of 
problem. Do I put all of the messages for a single system into a single row 
keyed off that system's name? I quickly came to the answer of "no" and now I 
break my row keys into POSIX_timestamp:system where my timestamps are buckets 
for every 5 minutes. This nicely distributes the load across the nodes in my system.
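
For illustration, the bucketed key can be computed roughly like this. The function name and the sample system name "web01" are made up for the example.

```python
BUCKET_SECONDS = 5 * 60  # five-minute buckets, as described above


def bucket_row_key(posix_ts, system):
    """Return a "POSIX_timestamp:system" row key for the bucket holding posix_ts."""
    ts = int(posix_ts)
    bucket_start = ts - ts % BUCKET_SECONDS  # round down to the bucket boundary
    return f"{bucket_start}:{system}"


print(bucket_row_key(1296662400, "web01"))
print(bucket_row_key(1296662400 + 299, "web01"))  # same bucket
print(bucket_row_key(1296662400 + 300, "web01"))  # next bucket
```

Every message logged by one system within the same five minutes lands in one row, and the next five minutes start a new row that may hash to a different node.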


Bill-

On 02/02/2011 11:18 AM, Aditya Narayan wrote:

You got me wrong perhaps..

I am already splitting the row on a per-user basis, of course; otherwise
the schema won't make sense for my usage. The row contains only
*reminders of a single user* sorted in chronological order. The
reminder Id are stored as supercolumn name and subcolumn contain tags
for that reminder.



On Wed, Feb 2, 2011 at 9:19 PM, William R Speirs  wrote:

Any time I see/hear "a single row containing all ..." I get nervous. That
single row is going to reside on a single node. That is potentially a lot of
load (don't know the system) for that single node. Why wouldn't you split it
by at least user? If it won't be a lot of load, then why are you using
Cassandra? This seems like something that could easily fit into an
SQL/relational style DB. If it's too much data (millions of users, 100s of
millions of reminders) for a standard SQL/relational model, then it's
probably too much for a single row.

I'm not familiar with the TTL functionality of Cassandra... sorry cannot
help/comment there, still learning :-)

Yea, my $0.02 is that this is an effective way to leverage super columns.

Bill-

On 02/02/2011 10:43 AM, Aditya Narayan wrote:


I think you got exactly what I wanted to convey, except for a few
things I want to clarify:

I was thinking of a single row containing all reminders (& not split
by day). History of the reminders need to be maintained for some time.
After certain time (say 3 or 6 months) they may be deleted by ttl
facility.

"While presenting the reminders timeline to the user, latest
supercolumns like around 50 from the start_end will be picked up and
their subcolumns values will be compared to the Tags user has chosen
to see and, corresponding to the filtered subcolumn values(tags), the
rows of the reminder details would be picked up.."

Is supercolumn a preferable choice for this ? Can there be a better
schema than this ?


-Aditya Narayan



On Wed, Feb 2, 2011 at 8:54 PM, William R Speirs wrote:


To reiterate, so I know we're both on the same page, your schema would be
something like this:

- A column family (as you describe) to store the details of a reminder. One
reminder per row. The row key would be a TimeUUID.

- A super column family to store the reminders for each user, for each day.
The row key would be something like: MMDD:user_id. The column names
would simply be the TimeUUID of the messages. The sub column names would be
the tag names of the various reminders.

The idea is that you would then get a slice of each row for a user, for a
day, that would only contain sub column names with the tags you're looking
for? Then based upon the column names returned, you'd look-up the reminders.

That seems like a solid schema to me.

Bill-

On 02/02/2011 09:37 AM, Aditya Narayan wrote:


Actually, I am trying to use Cassandra to display to users on my
application, the list of all Reminders set by themselves for
themselves, on the application.

I need to store rows containing the timeline of daily Reminders put by
the users, for themselves, on application. The reminders need to be
presented to the user in a chronological order like a news feed.
Each reminder has got certain tags associated with it(so that, at
times, user may also choose to see the reminders filtered by tags in
chronological order).

So I thought of a schema something like this:-

-Each Reminder details may be stored as separate rows in column family.
-For presenting the timeline of reminders set by user to be presented
to the user, the timeline row of each user would contain the Id/Key(s)
(of the Reminder rows) as the supercolumn names and the subcolumns
inside that supercolumns could contain the list of tags associated
with particular reminder. All tags set at once during first write. The
no of tags(subcolumns) will be around 8 maximum.

Any comments, suggestions and feedback on the schema design are
requested..

Thanks
Aditya Narayan


On Wed, Feb 2, 2011 at 7:49 PM, Aditya Narayan wrote:


Hey all,

I need to store supercolumns each with around 8 subcolumns;
All the data for a supercolumn is written at on

Re: Unsubscribe

2011-02-02 Thread William R Speirs

JJ you need to be sending this to: user-unsubscr...@cassandra.apache.org

Cheers...

Bill-

On 02/02/2011 10:58 AM, JJ wrote:



Sent from my iPad


Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?

2011-02-02 Thread William R Speirs
Any time I see/hear "a single row containing all ..." I get nervous. That single 
row is going to reside on a single node. That is potentially a lot of load 
(don't know the system) for that single node. Why wouldn't you split it by at 
least user? If it won't be a lot of load, then why are you using Cassandra? This 
seems like something that could easily fit into an SQL/relational style DB. If 
it's too much data (millions of users, 100s of millions of reminders) for a 
standard SQL/relational model, then it's probably too much for a single row.


I'm not familiar with the TTL functionality of Cassandra... sorry cannot 
help/comment there, still learning :-)


Yea, my $0.02 is that this is an effective way to leverage super columns.

Bill-

On 02/02/2011 10:43 AM, Aditya Narayan wrote:

I think you got exactly what I wanted to convey, except for a few
things I want to clarify:

I was thinking of a single row containing all reminders (& not split
by day). History of the reminders need to be maintained for some time.
After certain time (say 3 or 6 months) they may be deleted by ttl
facility.

"While presenting the reminders timeline to the user, latest
supercolumns like around 50 from the start_end will be picked up and
their subcolumns values will be compared to the Tags user has chosen
to see and, corresponding to the filtered subcolumn values(tags), the
rows of the reminder details would be picked up.."

Is supercolumn a preferable choice for this ? Can there be a better
schema than this ?


-Aditya Narayan



On Wed, Feb 2, 2011 at 8:54 PM, William R Speirs  wrote:

To reiterate, so I know we're both on the same page, your schema would be
something like this:

- A column family (as you describe) to store the details of a reminder. One
reminder per row. The row key would be a TimeUUID.

- A super column family to store the reminders for each user, for each day.
The row key would be something like: MMDD:user_id. The column names
would simply be the TimeUUID of the messages. The sub column names would be
the tag names of the various reminders.

The idea is that you would then get a slice of each row for a user, for a
day, that would only contain sub column names with the tags you're looking
for? Then based upon the column names returned, you'd look-up the reminders.

That seems like a solid schema to me.

Bill-

On 02/02/2011 09:37 AM, Aditya Narayan wrote:


Actually, I am trying to use Cassandra to display to users on my
application, the list of all Reminders set by themselves for
themselves, on the application.

I need to store rows containing the timeline of daily Reminders put by
the users, for themselves, on application. The reminders need to be
presented to the user in a chronological order like a news feed.
Each reminder has got certain tags associated with it(so that, at
times, user may also choose to see the reminders filtered by tags in
chronological order).

So I thought of a schema something like this:-

-Each Reminder details may be stored as separate rows in column family.
-For presenting the timeline of reminders set by user to be presented
to the user, the timeline row of each user would contain the Id/Key(s)
(of the Reminder rows) as the supercolumn names and the subcolumns
inside that supercolumns could contain the list of tags associated
with particular reminder. All tags set at once during first write. The
no of tags(subcolumns) will be around 8 maximum.

Any comments, suggestions and feedback on the schema design are
requested..

Thanks
Aditya Narayan


On Wed, Feb 2, 2011 at 7:49 PM, Aditya Narayan wrote:


Hey all,

I need to store supercolumns each with around 8 subcolumns;
All the data for a supercolumn is written at once and all subcolumns
need to be retrieved together. The data in each subcolumn is not big,
it just contains keys to other rows.

Would it be preferred to have a supercolumn family or just a standard
column family containing "all the subcolumns data serialized in single
column(s) " ?

Thanks
Aditya Narayan





Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?

2011-02-02 Thread William R Speirs
To reiterate, so I know we're both on the same page, your schema would be 
something like this:


- A column family (as you describe) to store the details of a reminder. One 
reminder per row. The row key would be a TimeUUID.


- A super column family to store the reminders for each user, for each day. The 
row key would be something like: MMDD:user_id. The column names would simply 
be the TimeUUID of the messages. The sub column names would be the tag names of 
the various reminders.


The idea is that you would then get a slice of each row for a user, for a day, 
that would only contain sub column names with the tags you're looking for? Then 
based upon the column names returned, you'd look-up the reminders.
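
To make the two-step look-up concrete, here is a toy in-memory model using nested dicts in place of the two column families. It does not talk to Cassandra; the row key, reminder ids, tags and text are all made-up placeholders (real column names would be TimeUUIDs, which sort by time).

```python
# Standard column family: row key (reminder id) -> columns.
reminders = {
    "uuid-1": {"text": "Pay rent"},
    "uuid-2": {"text": "Send report"},
    "uuid-3": {"text": "Water plants"},
}

# Super column family: row key -> super column (reminder id) -> sub columns
# whose names are the tags. The sub column values are unused here.
timeline = {
    "0202:user42": {
        "uuid-1": {"home": "", "bills": ""},
        "uuid-2": {"work": ""},
        "uuid-3": {"home": ""},
    }
}


def reminders_with_tag(row_key, tag):
    """Step 1: pick reminder ids whose sub columns include `tag`.
    Step 2: look up each id in the details column family."""
    row = timeline.get(row_key, {})
    ids = [rid for rid, tags in row.items() if tag in tags]
    return [reminders[rid]["text"] for rid in ids]


print(reminders_with_tag("0202:user42", "home"))
```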


That seems like a solid schema to me.

Bill-

On 02/02/2011 09:37 AM, Aditya Narayan wrote:

Actually, I am trying to use Cassandra to display to users on my
application, the list of all Reminders set by themselves for
themselves, on the application.

I need to store rows containing the timeline of daily Reminders put by
the users, for themselves, on application. The reminders need to be
presented to the user in a chronological order like a news feed.
Each reminder has got certain tags associated with it(so that, at
times, user may also choose to see the reminders filtered by tags in
chronological order).

So I thought of a schema something like this:-

-Each Reminder details may be stored as separate rows in column family.
-For presenting the timeline of reminders set by user to be presented
to the user, the timeline row of each user would contain the Id/Key(s)
(of the Reminder rows) as the supercolumn names and the subcolumns
inside that supercolumns could contain the list of tags associated
with particular reminder. All tags set at once during first write. The
no of tags(subcolumns) will be around 8 maximum.

Any comments, suggestions and feedback on the schema design are requested..

Thanks
Aditya Narayan


On Wed, Feb 2, 2011 at 7:49 PM, Aditya Narayan  wrote:

Hey all,

I need to store supercolumns each with around 8 subcolumns;
All the data for a supercolumn is written at once and all subcolumns
need to be retrieved together. The data in each subcolumn is not big,
it just contains keys to other rows.

Would it be preferred to have a supercolumn family or just a standard
column family containing "all the subcolumns data serialized in single
column(s) " ?
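
As a sketch of the second option, the subcolumns could be packed into one column value along these lines. JSON is only one possible encoding, chosen here as an assumption, and the subcolumn names and row keys are placeholders.

```python
import json

# Hypothetical subcolumns for one supercolumn: name -> key of another row.
subcolumns = {"tag1": "rowkey-a", "tag2": "rowkey-b", "tag3": "rowkey-c"}

# Write path: serialize everything into a single column value.
packed = json.dumps(subcolumns, sort_keys=True)

# Read path: one column fetch, then decode back to the original mapping.
unpacked = json.loads(packed)

print(packed)
print(unpacked == subcolumns)
```

The trade-off is that a single subcolumn can no longer be read or updated on its own; that may be acceptable when, as stated above, the data is written once and always retrieved together.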

Thanks
Aditya Narayan



Re: cassandra as session store

2011-02-01 Thread William R Speirs
I'm still very new to Cassandra, but when I started reading about it the first 
thing I thought about was a session store. It's based (in part from what I 
understand) on Dynamo which is (again, I could be wrong) used at Amazon as the 
session store for your shopping cart.


So I would certainly reach for Cassandra if I needed a reliable distributed 
session store.


Bill-

On 02/01/2011 03:24 PM, Sasha Dolgy wrote:


What I'm still unclear about, and where I think this is suitable, is Cassandra
being used as a data warehouse for current and past sessions tied to a user.
  Yes, other things are great for session management, but I want to provide near
real time session information to my users ... quick and simple and i want to use
cassandra ... surely i can't be that bad for thinking this is a good idea?
-sd

On Tue, Feb 1, 2011 at 9:20 PM, Kallin Nagelberg <kallin.nagelb...@gmail.com> wrote:

nvm on the persistence, it seems like it does support it:

'Since version 1.1 the safer alternative is an append-only file (a
journal) that is written as operations modifying the dataset in memory
are processed. Redis is able to rewrite the append-only file in the
background in order to avoid an indefinite growth of the journal.'

This thread probably shouldn't digress too much from Cassandra's
suitability for session management though..



Re: Is it recommended to store two types of data (not related to each other but need to be retrieved together) in one super column family ?

2011-01-29 Thread William R Speirs

I'm very new to Cassandra, but I'll pitch in my $0.02.

Row look-ups are super fast, why do you think it would be more efficient to 
store these two rows "together" in the super column method you describe?


Why would you not just look-up the rows, one after the other?

If I understand correctly, you have post_ids, user_ids, and groups. A group 
contains user_ids (people posting to the group) and post_ids (posts made to that 
group)?


So I write a post to a group. You'd add this post_id to the row holding all the 
posts for this group (this might be bad if the number of posts/columns grows 
huge). You'd then have another row associated with the group where you'd insert 
my user_id. Am I close to what you want?


If you can give a more concrete example, I (or someone more familiar with 
Cassandra) could give you more help on designing a schema.


Bill-

On 01/29/2011 01:48 PM, Ertio Lew wrote:

Could someone please point me in right direction by commenting on the above 
ideas ?

On Fri, Jan 28, 2011 at 11:50 PM, Ertio Lew <ertio...@gmail.com> wrote:

Hi,

I have two kinds of data that I would like to fit in one super column
family; I am trying this, for the reasons of implementing fast
database retrievals by combining the data of two rows into just one
row.

First kind of data, in supercolumn family, is named with timeUUIDs as
supercolumn names; Think of this as, the postIds of posts in a Group.
These posts will need to be sorted by time (so that list of latest
posts is retrieved). Thus each post has one supercolumn each with name
as (timeUUID+userID) and sorted by timeUUIDtype.

Second kind of data would be just a single supercolumn containing
columns of userId of all members in a group(very small). (The no of
members in group will be around 40-50 max). The name of this single
supercolumn may be kept suitable(perhaps max. time in future ) so as
to keep this supercolumn to the beginning.

(The supercolumns are required as we need to store some additional
data in the columns of 1st kind of data).

So is it recommended to store these two types of data (not related to
each other but need to be retrieved together) in one super column
family ?




Re: Schema Design

2011-01-26 Thread William R Speirs

Ah, sweet... thanks for the link!

Bill-

On 01/26/2011 08:20 PM, buddhasystem wrote:


Bill, it's all explained here:

http://wiki.apache.org/cassandra/MemtableThresholds#JVM_Heap_Size,the

Watch the number of CFs and the memtable sizes.

In my experience, this all matters.


Re: Schema Design

2011-01-26 Thread William R Speirs
It makes sense that the single row for a system (with a growing number of 
columns) will reside on a single machine.


With that in mind, here is my updated schema:

- A single column family for all the messages. The row keys will be the TimeUUID 
of the message with the following columns: date/time (in UTC POSIX), system 
name/id (with an index for fast/easy gets), the actual message payload.


- A column family for each system. The row keys will be UTC POSIX time with 1 
second (maybe 1 minute) bucketing, and the column names will be the TimeUUID of 
any messages that were logged during that time bucket.
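
With this layout, a "between X and Y" query becomes a list of bucket row keys to fetch. A rough sketch, assuming 1-minute buckets and integer POSIX timestamps (the function name is hypothetical):

```python
def bucket_keys(start_ts, end_ts, width=60):
    """Return the start time of every bucket overlapping [start_ts, end_ts].

    Each returned value is a row key in the per-system column family; the
    column names in those rows are the TimeUUIDs of the matching messages.
    """
    first = start_ts - start_ts % width  # bucket containing start_ts
    return list(range(first, end_ts + 1, width))


print(bucket_keys(125, 250))
```

The first and last buckets can contain messages outside the requested range, so those two rows still need filtering by the actual message time.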


My only hesitation with this design is that buddhasystem warned that each column 
family, "is allocated a piece of memory on the server." I'm not sure what the 
implications of this are and/or if this would be a problem if I had a number 
of systems on the order of hundreds.


Thanks...

Bill-

On 01/26/2011 06:51 PM, Shu Zhang wrote:

Each row can have a maximum of 2 billion columns, which a logging system will 
probably hit eventually.
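
As a back-of-the-envelope check on "eventually", assuming one column per log message and a steady rate (the 100 messages/second figure is purely an assumption for illustration):

```python
MAX_COLUMNS_PER_ROW = 2_000_000_000  # limit quoted above
SECONDS_PER_DAY = 86_400


def days_until_row_full(messages_per_second):
    """Days until a single row reaches the column limit at a constant rate."""
    return MAX_COLUMNS_PER_ROW / messages_per_second / SECONDS_PER_DAY


print(round(days_until_row_full(100)))  # about 231 days
```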

More importantly, you'll only have 1 row per set of system logs. Every row is 
stored on the same machine(s), which means you'll definitely not be able to 
distribute your load very well.

From: Bill Speirs [bill.spe...@gmail.com]
Sent: Wednesday, January 26, 2011 1:23 PM
To: user@cassandra.apache.org
Subject: Re: Schema Design

I like this approach, but I have 2 questions:

1) what are the implications of continually adding columns to a single
row? I'm unsure how Cassandra is able to grow. I realize you can have
a virtually infinite number of columns, but what are the implications
of growing the number of columns over time?

2) maybe it's just a restriction of the CLI, but how do I issue a
slice request? Also, what if start (or end) columns don't exist? I'm
guessing it's smart enough to get the columns in that range.

Thanks!

Bill-

On Wed, Jan 26, 2011 at 4:12 PM, David McNelis wrote:

I would say in that case you might want  to try a  single column family
where the key to the column is the system name.
Then, you could name your columns as the timestamp.  Then when retrieving
information from the data store you can, in your slice request, specify
your start column as  X and end  column as Y.
Then you can use the stored column name to know when an event  occurred.
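
The slice idea can be illustrated with a toy in-memory row. This is not client code; it only mimics how a range over sorted column names behaves, on the understanding that both ends of a slice are inclusive and need not exist as columns. The timestamps and messages are invented.

```python
from bisect import bisect_left, bisect_right


def slice_columns(row, start, end):
    """Return (name, value) pairs with start <= name <= end, in sorted order.

    `row` maps column name (here a POSIX timestamp) to the log message.
    Neither `start` nor `end` has to be an existing column name.
    """
    names = sorted(row)
    lo = bisect_left(names, start)   # first name >= start
    hi = bisect_right(names, end)    # one past the last name <= end
    return [(name, row[name]) for name in names[lo:hi]]


# One row per system; column names are event timestamps.
row = {100: "boot", 160: "login", 220: "error", 300: "shutdown"}

print(slice_columns(row, 150, 250))
```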

On Wed, Jan 26, 2011 at 2:56 PM, Bill Speirs  wrote:


I'm looking to use Cassandra to store log messages from various
systems. A log message only has a message (UTF8Type) and a date/time.
My thought is to create a column family for each system. The row key
will be a TimeUUIDType. Each row will have 7 columns: year, month,
day, hour, minute, second, and message. I then have indexes setup for
each of the date/time columns.

I was hoping this would allow me to answer queries like: "What are all
the log messages that were generated between X & Y?" The problem is
that I can ONLY use the equals operator on these column values. For
example, issuing: get system_x where month > 1; gives me this
error: "No indexed columns present in index clause with operator EQ."
The equals operator works as expected though: get system_x where month
= 1;

What schema would allow me to get date ranges?

Thanks in advance...

Bill-

* ColumnFamily description *
ColumnFamily: system_x_msg
  Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
  Row cache size / save period: 0.0/0
  Key cache size / save period: 20.0/3600
  Memtable thresholds: 1.1671875/249/60
  GC grace seconds: 864000
  Compaction min/max thresholds: 4/32
  Read repair chance: 1.0
  Built indexes: [proj_1_msg.646179, proj_1_msg.686f7572,
    proj_1_msg.6d696e757465, proj_1_msg.6d6f6e7468,
    proj_1_msg.7365636f6e64, proj_1_msg.79656172]
  Column Metadata:
    Column Name: year (year)
      Validation Class: org.apache.cassandra.db.marshal.IntegerType
      Index Type: KEYS
    Column Name: month (month)
      Validation Class: org.apache.cassandra.db.marshal.IntegerType
      Index Type: KEYS
    Column Name: second (second)
      Validation Class: org.apache.cassandra.db.marshal.IntegerType
      Index Type: KEYS
    Column Name: minute (minute)
      Validation Class: org.apache.cassandra.db.marshal.IntegerType
      Index Type: KEYS
    Column Name: hour (hour)
      Validation Class: org.apache.cassandra.db.marshal.IntegerType
      Index Type: KEYS
    Column Name: day (day)
      Validation Class: org.apache.cassandra.db.marshal.IntegerType
      Index Type: KEYS




--
David McNelis
Lead Software Engineer
Agentis Energy
www.agentisenergy.com
o: 630.359.6395
c: 219.384.5143
A Smart Grid technology company focused on helping consumers of energy
control an often under-managed resource.