Re: NoSE: Automated schema design for Cassandra

2017-05-11 Thread Michael Mior
Thanks for the feedback! I did change column families to tables. I agree
the documentation could use some work. If you're interested in seeing what
the input and output look like, here's a sample:

https://michael.mior.ca/projects/nose/rubis

So far we haven't had any schemas used directly in production, although the tool
has provided advice on design alternatives. NoSE actually already
contains a mechanism to execute different schema alternatives, which is what
we used during our evaluation. However, it does not currently
provide a mechanism for synthetic data generation. It would certainly be
possible to add automated generation of test data in the future.

Cheers,
--
Michael Mior
mm...@uwaterloo.ca

2017-05-10 3:55 GMT-04:00 Jacques-Henri Berthemet <
jacques-henri.berthe...@genesys.com>:

> Hi,
>
> This is interesting, I’d just advise to put full examples and more
> documentation on how to use it (the articles are a bit too detailed).
> Also, you should not mention “column families” but just tables.
>
> Was this used to generate a schema used for production?
> Do you think it’s possible to generate test code to validate the workload?
>
> --
> Jacques-Henri Berthemet


RE: NoSE: Automated schema design for Cassandra

2017-05-10 Thread Jacques-Henri Berthemet
Hi,

This is interesting, I’d just advise to put full examples and more 
documentation on how to use it (the articles are a bit too detailed).
Also, you should not mention “column families” but just tables.

Was this used to generate a schema used for production?
Do you think it’s possible to generate test code to validate the workload?

--
Jacques-Henri Berthemet



NoSE: Automated schema design for Cassandra

2017-05-09 Thread Michael Mior
Hi all,

I wanted to share a tool I've been working on that tries to help automate
the schema design process for Cassandra. The short description is that you
provide information on the kind of data you want to store and the queries
and updates you want to issue, and NoSE will perform a cost-based analysis
to suggest an optimal schema.

There's lots of room for improvement, and many Cassandra features are
not currently supported, but hopefully some in the community may still find
it useful as a starting point.

Link to more details and the source code below:

https://michael.mior.ca/projects/nose/

If you're interested in trying it out, don't hesitate to reach out and I'm
happy to help!

Cheers,
--
Michael Mior
mm...@uwaterloo.ca


Read Performance / Schema Design

2011-10-26 Thread Ben Gambley

Hi Everyone

I have a question regarding read performance and schema design, if someone
could help please.


Our requirement is to store, per user, many unique results (each is basically
an attempt at some questions), so I had thought of having the userid as the
row key and the result ids as columns.

The keys for the result ids are maintained in a separate location, so they are
known without having to perform any additional lookups.

My concern is that over time, reading a single result would incur the overhead
of reading the entire row from disk and so gradually slow things down.


So I was considering whether changing the row key to userid + result id would
be a better solution?



Cheers
Ben



Re: Read Performance / Schema Design

2011-10-26 Thread Tyler Hobbs
On Wed, Oct 26, 2011 at 9:35 PM, Ben Gambley ben.gamb...@intoscience.com wrote:


 Hi Everyone

 I have a question with regards read performance and schema design if
 someone could help please.


 Our requirement is to store per user, many unique results (which is
 basically an attempt at some questions ..) so I had thought of having the
 userid as the row key and the result id as columns.

 The keys  for the result ids are maintained in a separate location so are
 known without having to perform any additional lookups.

 My concern is that over time reading a single result would incur the
 overhead of reading the entire row from disk so gradually slow things down.


 So I was considering if changing the row key to *userid + result id* would
 be a better solution ?



Do you regularly need to read all of the results for a given userid?  If
not, go with the user_id + result_id approach. It will be more efficient for
single-result lookups.

-- 
Tyler Hobbs
DataStax http://datastax.com/
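
To make that concrete, here is a minimal sketch of the user_id + result_id
layout in modern CQL (which post-dates this 2011 thread); the table and column
names are hypothetical, not from the original posts:

    -- One partition per (user_id, result_id): a single-result read touches only
    -- that small partition, no matter how many results a user accumulates.
    CREATE TABLE results_by_user_and_id (
        user_id   text,
        result_id text,
        result    text,
        PRIMARY KEY ((user_id, result_id))
    );

    -- Single-result lookup:
    SELECT result FROM results_by_user_and_id
    WHERE user_id = 'user42' AND result_id = 'result7';

The trade-off, as Tyler notes, is that reading all results for one user is no
longer a single-partition read.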


Re: Read Performance / Schema Design

2011-10-26 Thread David Jeske
On Wed, Oct 26, 2011 at 7:35 PM, Ben Gambley ben.gamb...@intoscience.com wrote:

 Our requirement is to store per user, many unique results (which is
 basically an attempt at some questions ..) so I had thought of having the
 userid as the row key and the result id as columns.

 The keys  for the result ids are maintained in a separate location so are
 known without having to perform any additional lookups.

 My concern is that over time reading a single result would incur the
 overhead of reading the entire row from disk so gradually slow things down.

 So I was considering if changing the row key to *userid + result id* would
 be a better solution ?


This is a clustering choice. Assuming your dataset is too big to fit in
system memory, some considerations which should drive your decision are
locality of access, cache efficiency, and worst-case performance.

1) If you access many results for a user-id around the same time, then
putting them close together will get you better lookup throughput, as once
you cause a disk seek to access a result, other results will be available in
memory (until it falls out of cache). This is generally going to be somewhat
true whether you use a key (userid/resultid) or a key (userid) and column
(resultid).

2) If some (or one) result-id is accessed much more frequently than others
for all users, then you might wish for most of that result-id to get good
cache hitrate, while other result-ids that are less frequently accessed do
not need to be cached as much. To get this behavior, you'd want the
hottest-most-frequently-accessed result-id to be packed together, and thus
you'd want to either use a key such as result-id/user-id, or use a
separate column family for hot result-ids.

3) How many results can a user have? If every user will have an unbounded
number of results (say > 40k), but you generally only need one result at a
time, then you probably want to use a compound key (userid/resultid) rather
than a key (userid) + column (resultid), because you don't want to have to deal
with large data just to get a small piece. That said, it seems that
Cassandra's handling of wide rows is improving, so perhaps in the future
this will not be as large an issue.
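
As an illustration of point (2) above, a minimal sketch in modern CQL (which
again post-dates this thread, with hypothetical names) of keying on the hot
result-id first so that its data is packed together:

    -- Partition by result_id: all entries for one hot result_id live together,
    -- so the frequently accessed result stays cache-resident while colder
    -- result_ids do not crowd it out.
    CREATE TABLE results_by_result_id (
        result_id text,
        user_id   text,
        result    text,
        PRIMARY KEY (result_id, user_id)
    );

    SELECT result FROM results_by_result_id
    WHERE result_id = 'hot_result' AND user_id = 'user42';

The obvious trade-off is that a very hot result_id becomes one wide partition,
which runs into point (3) about unbounded growth.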


Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?

2011-02-03 Thread Aditya Narayan
Thanks Tyler!


On Thu, Feb 3, 2011 at 12:06 PM, Tyler Hobbs ty...@datastax.com wrote:
 On Wed, Feb 2, 2011 at 3:27 PM, Aditya Narayan ady...@gmail.com wrote:

 Can I have some more feedback about my schema, perhaps somewhat more
 critical/harsh?

 It sounds reasonable to me.

 Since you're writing/reading all of the subcolumns at the same time, I would
 opt for a standard column with the tags serialized into a column value.

 I don't think you need to worry about row lengths here.

 Depending on the reminder size and how many times it's likely to be repeated
 in the timeline, you could explore denormalizing a bit more by storing the
 reminders in the timelines themselves, perhaps with a separate row per
 (user, tag) combination.  This would cut down on your seeks quite a bit, but
 it may not be necessary at this point (or at all).

 --
 Tyler Hobbs
 Software Engineer, DataStax
 Maintainer of the pycassa Cassandra Python client library




Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?

2011-02-02 Thread Aditya Narayan
Hey all,

I need to store supercolumns each with around 8 subcolumns;
All the data for a supercolumn is written at once and all subcolumns
need to be retrieved together. The data in each subcolumn is not big,
it just contains keys to other rows.

Would it be preferable to have a supercolumn family, or just a standard
column family containing all the subcolumn data serialized into a single
column (or a few columns)?

Thanks
Aditya Narayan


Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?

2011-02-02 Thread Aditya Narayan
Actually, I am trying to use Cassandra to show users of my application
the list of all reminders they have set for themselves.

I need to store rows containing the timeline of daily reminders created by
each user. The reminders need to be presented to the user in
chronological order, like a news feed. Each reminder has certain tags
associated with it (so that, at times, the user may also choose to see the
reminders filtered by tag, in chronological order).

So I thought of a schema something like this:

- The details of each reminder are stored as a separate row in a column family.
- To present the timeline of reminders to the user, the timeline row of
each user would contain the IDs/keys of the reminder rows as the
supercolumn names, and the subcolumns inside those supercolumns would
contain the list of tags associated with the particular reminder. All tags
are set at once during the first write. The number of tags (subcolumns)
will be around 8 at most.

Any comments, suggestions and feedback on the schema design are requested.

Thanks
Aditya Narayan





Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?

2011-02-02 Thread William R Speirs
To reiterate, so I know we're both on the same page, your schema would be 
something like this:


- A column family (as you describe) to store the details of a reminder. One 
reminder per row. The row key would be a TimeUUID.


- A super column family to store the reminders for each user, for each day. The 
row key would be something like: MMDD:user_id. The column names would simply 
be the TimeUUID of the messages. The sub column names would be the tag names of 
the various reminders.


The idea is that you would then get a slice of each row for a user, for a day, 
that would only contain sub column names with the tags you're looking for? Then 
based upon the column names returned, you'd look-up the reminders.


That seems like a solid schema to me.

Bill-
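
For reference, a minimal sketch of this day-bucketed timeline in modern CQL
terms (which did not exist at the time of this thread); the names, the tags
collection and the day format are hypothetical:

    -- Reminder details, one row per reminder:
    CREATE TABLE reminders (
        reminder_id timeuuid PRIMARY KEY,
        body        text
    );

    -- Timeline: one partition per (day, user); entries cluster by TimeUUID and
    -- carry their tags so the client can filter before fetching the details.
    CREATE TABLE reminder_timeline (
        day         text,       -- e.g. '20110202'
        user_id     text,
        reminder_id timeuuid,
        tags        set<text>,
        PRIMARY KEY ((day, user_id), reminder_id)
    );

    -- "Slice of each row for a user, for a day":
    SELECT reminder_id, tags FROM reminder_timeline
    WHERE day = '20110202' AND user_id = 'user42';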




Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?

2011-02-02 Thread Aditya Narayan
I think you got exactly what I wanted to convey, except for a few
things I want to clarify:

I was thinking of a single row containing all reminders (not split
by day). The history of the reminders needs to be maintained for some time;
after a certain period (say 3 or 6 months) they may be deleted via the TTL
facility.

While presenting the reminders timeline to the user, the latest
supercolumns (say the latest 50, sliced from one end of the row) will be picked
up, their subcolumn values will be compared to the tags the user has chosen
to see, and, corresponding to the filtered subcolumn values (tags), the
rows with the reminder details will be fetched.

Is a supercolumn family a preferable choice for this? Could there be a better
schema than this?


-Aditya Narayan





Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?

2011-02-02 Thread William R Speirs
Any time I see/hear a single row containing all ... I get nervous. That single 
row is going to reside on a single node. That is potentially a lot of load 
(don't know the system) for that single node. Why wouldn't you split it by at 
least user? If it won't be a lot of load, then why are you using Cassandra? This 
seems like something that could easily fit into an SQL/relational style DB. If 
it's too much data (millions of users, 100s of millions of reminders) for a 
standard SQL/relational model, then it's probably too much for a single row.


I'm not familiar with the TTL functionality of Cassandra... sorry cannot 
help/comment there, still learning :-)


Yea, my $0.02 is that this is an effective way to leverage super columns.

Bill-






Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?

2011-02-02 Thread Aditya Narayan
Perhaps you got me wrong.

I am already splitting the rows on a per-user basis, of course; otherwise
the schema wouldn't make sense for my usage. Each row contains only
*reminders of a single user*, sorted in chronological order. The
reminder IDs are stored as supercolumn names and the subcolumns contain
the tags for that reminder.





Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?

2011-02-02 Thread William R Speirs

I did not understand before... sorry.

Again, depending upon how many reminders you have for a single user, this could
be a long/wide row. It really comes down to how many reminders we are
talking about and how often they will be read/written. While a single row can
contain millions of columns (maybe more), that doesn't mean it's a good idea.

I'm working on a logging system with Cassandra and ran into this same type of
problem. Do I put all of the messages for a single system into a single row
keyed off that system's name? I quickly came to the answer of no, and now I
break my row keys into POSIX_timestamp:system, where my timestamps are buckets
of five minutes each. This nicely distributes the load across the nodes in my
system.


Bill-
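
A minimal sketch of that bucketing scheme in modern CQL (hypothetical names;
the original discussion used Thrift-era column families):

    -- One partition per (5-minute bucket, system); within it, messages cluster
    -- by TimeUUID, so a time slice is one contiguous read and the write load is
    -- spread across partitions as time advances.
    CREATE TABLE system_logs (
        bucket      bigint,     -- POSIX seconds rounded down to a 300 s boundary
        system_name text,
        ts          timeuuid,
        message     text,
        PRIMARY KEY ((bucket, system_name), ts)
    );

    -- The client computes bucket = (epoch_seconds / 300) * 300 before reading:
    SELECT message FROM system_logs
    WHERE bucket = 1296661500 AND system_name = 'proxy-1';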




Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?

2011-02-02 Thread Tyler Hobbs
On Wed, Feb 2, 2011 at 3:27 PM, Aditya Narayan ady...@gmail.com wrote:

 Can I have some more feedback about my schema, perhaps somewhat more
 critical/harsh?


It sounds reasonable to me.

Since you're writing/reading all of the subcolumns at the same time, I would
opt for a standard column with the tags serialized into a column value.

I don't think you need to worry about row lengths here.

Depending on the reminder size and how many times it's likely to be repeated
in the timeline, you could explore denormalizing a bit more by storing the
reminders in the timelines themselves, perhaps with a separate row per
(user, tag) combination.  This would cut down on your seeks quite a bit, but
it may not be necessary at this point (or at all).

-- 
Tyler Hobbs
Software Engineer, DataStax http://datastax.com/
Maintainer of the pycassa http://github.com/pycassa/pycassa Cassandra
Python client library
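
Two ways to realize this suggestion, sketched in modern CQL (which post-dates
the thread); names are hypothetical and the JSON encoding is just one possible
serialization:

    -- Option A, closest to the suggestion: all ~8 tags serialized into one
    -- value, written and read as a single column.
    CREATE TABLE reminder_timeline_serialized (
        user_id     text,
        reminder_id timeuuid,
        tags_json   text,       -- e.g. '["work","urgent"]'
        PRIMARY KEY (user_id, reminder_id)
    );

    -- Option B, the CQL-era equivalent: a set collection, which Cassandra
    -- serializes within the row for you.
    CREATE TABLE reminder_timeline_tags (
        user_id     text,
        reminder_id timeuuid,
        tags        set<text>,
        PRIMARY KEY (user_id, reminder_id)
    );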


Re: Schema Design

2011-01-30 Thread Jonathan Ellis
fwiw, https://github.com/thobbs/logsandra is more recent.


Re: Schema Design

2011-01-29 Thread aaron morton
This project may be what you are looking for, or provide some inspiration 
https://github.com/jbohman/logsandra

Cloud Kick has an example of rolling up time series data 
https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/

The schema below sounds reasonable. If you will always bring back the entire 
log record, consider using a Standard CF rather than a Super CF. Then pack the 
log message using your favourite serialisation, e.g. JSON. 

Hope that helps.
Aaron
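
A minimal sketch of the hostname.category.date layout described in the quoted
message below, with the record packed as JSON per the suggestion above, in
modern CQL (hypothetical names; CQL post-dates this thread):

    -- One partition per host/category/day; each log event is one clustered
    -- entry whose attributes are packed into a single serialized value.
    CREATE TABLE host_logs (
        host     text,
        category text,
        day      text,          -- e.g. '2011-01-26'
        event_id timeuuid,
        payload  text,          -- JSON-serialized log attributes
        PRIMARY KEY ((host, category, day), event_id)
    );

    -- "Login logs of that proxy between Monday and Wednesday" = three point
    -- reads, one per day, with no key range scan:
    SELECT payload FROM host_logs
    WHERE host = 'proxy-1' AND category = 'login' AND day = '2011-01-24';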

On 27 Jan 2011, at 16:26, Wangpei (Peter) wrote:

 I am also working on a system storing logs from hundreds of systems.
 In my scenario, most queries will look like this: let's look at the login logs 
 (category EQ) of that proxy (host EQ) between this Monday and Wednesday (time 
 range).
 My data model is like this:
 . only 1 CF. That's enough for this scenario.
 . group the logs from each host and day into one row. The key format is 
 hostname.category.date
 . store each log entry as a super column; the super column name is the TimeUUID 
 of the log, and each attribute is a column.
 
 Then this query can be done as 3 GETs, with no need to do a key range scan.
 Then I can use RP instead of OPP. If I used OPP, I would have to worry about load 
 balancing myself. I hate that.
 However, if I need to do a time range access, I can still use a column slice.
 
 An additional benefit is that I can clean out old logs very easily. We only store 
 logs for 1 year. Just deleting by key does this job well.
 
 I think storing all logs for a host in a single row is not a good choice, for 2 
 reasons:
 1, too few keys, so your data will not distribute well.
 2, the data under a key will always increase, so Cassandra has to do more 
 SSTable compaction.
 
 -----Original Message-----
 From: William R Speirs [mailto:bill.spe...@gmail.com]
 Sent: 27 January 2011 9:15
 To: user@cassandra.apache.org
 Subject: Re: Schema Design
 
 It makes sense that the single row for a system (with a growing number of 
 columns) will reside on a single machine.
 
 With that in mind, here is my updated schema:
 
 - A single column family for all the messages. The row keys will be the 
 TimeUUID 
 of the message with the following columns: date/time (in UTC POSIX), system 
 name/id (with an index for fast/easy gets), the actual message payload.
 
 - A column family for each system. The row keys will be UTC POSIX time with 1 
 second (maybe 1 minute) bucketing, and the column names will be the TimeUUID 
 of 
 any messages that were logged during that time bucket.
 
 My only hesitation with this design is that buddhasystem warned that each 
 column family is allocated a piece of memory on the server. I'm not sure what 
 the implications of this are and/or if this would be a problem if I had a 
 number of systems on the order of hundreds.
 
 Thanks...
 
 Bill-
 
 On 01/26/2011 06:51 PM, Shu Zhang wrote:
 Each row can have a maximum of 2 billion columns, which a logging system 
 will probably hit eventually.
 
 More importantly, you'll only have 1 row per set of system logs. Every row 
 is stored on the same machine(s), which means you'll definitely not be 
 able to distribute your load very well.
 

Schema Design

2011-01-26 Thread Bill Speirs
I'm looking to use Cassandra to store log messages from various
systems. A log message only has a message (UTF8Type) and a date/time.
My thought is to create a column family for each system. The row key
will be a TimeUUIDType. Each row will have 7 columns: year, month,
day, hour, minute, second, and message. I then have indexes setup for
each of the date/time columns.

I was hoping this would allow me to answer queries like: What are all
the log messages that were generated between X and Y? The problem is
that I can ONLY use the equals operator on these column values. For
example, issuing: get system_x where month > 1; gives me this
error: No indexed columns present in index clause with operator EQ.
The equals operator works as expected though: get system_x where month
= 1;

What schema would allow me to get date ranges?

Thanks in advance...

Bill-

* ColumnFamily description *
ColumnFamily: system_x_msg
  Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
  Row cache size / save period: 0.0/0
  Key cache size / save period: 20.0/3600
  Memtable thresholds: 1.1671875/249/60
  GC grace seconds: 864000
  Compaction min/max thresholds: 4/32
  Read repair chance: 1.0
  Built indexes: [proj_1_msg.646179, proj_1_msg.686f7572,
proj_1_msg.6d696e757465, proj_1_msg.6d6f6e7468,
proj_1_msg.7365636f6e64, proj_1_msg.79656172]
  Column Metadata:
Column Name: year (year)
  Validation Class: org.apache.cassandra.db.marshal.IntegerType
  Index Type: KEYS
Column Name: month (month)
  Validation Class: org.apache.cassandra.db.marshal.IntegerType
  Index Type: KEYS
Column Name: second (second)
  Validation Class: org.apache.cassandra.db.marshal.IntegerType
  Index Type: KEYS
Column Name: minute (minute)
  Validation Class: org.apache.cassandra.db.marshal.IntegerType
  Index Type: KEYS
Column Name: hour (hour)
  Validation Class: org.apache.cassandra.db.marshal.IntegerType
  Index Type: KEYS
Column Name: day (day)
  Validation Class: org.apache.cassandra.db.marshal.IntegerType
  Index Type: KEYS


Re: Schema Design

2011-01-26 Thread David McNelis
I would say in that case you might want to try a single column family
where the row key is the system name.

Then, you could name your columns with the timestamp. Then, when retrieving
information from the data store you can, in your slice request, specify
your start column as X and your end column as Y.

Then you can use the stored column name to know when an event occurred.
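
In modern CQL terms (not available at the time of this thread), the same idea
is a partition per system with a time-ordered clustering column, and the slice
becomes a range predicate. Names here are hypothetical:

    CREATE TABLE events_by_system (
        system_name text,
        ts          timestamp,
        message     text,
        PRIMARY KEY (system_name, ts)   -- events cluster by time within a system
    );

    -- Slice from start column X to end column Y:
    SELECT ts, message FROM events_by_system
    WHERE system_name = 'system_x'
      AND ts >= '2011-01-26 00:00:00+0000'
      AND ts <  '2011-01-27 00:00:00+0000';

As the rest of the thread points out, a single ever-growing partition per
system is the main caveat of this layout.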





-- 
David McNelis
Lead Software Engineer
Agentis Energy
www.agentisenergy.com
o: 630.359.6395
c: 219.384.5143

A Smart Grid technology company focused on helping consumers of energy
control an often under-managed resource.


Re: Schema Design

2011-01-26 Thread buddhasystem

Having separate columns for year, month etc. seems redundant. It's tons more
efficient to keep, say, UTC time in POSIX format (basically an integer). It's
easy to convert back and forth.

If you want to get a range of dates, in that case you might use the Order
Preserving Partitioner and sort out which systems logged what later, in the
client. Read up on the consequences of using OPP.

Whether to shard data per system depends on how many you have. If more
than a few, don't do that; there are memory considerations.

Cheers

Maxim



Re: Schema Design

2011-01-26 Thread Bill Speirs
I like this approach, but I have 2 questions:

1) What are the implications of continually adding columns to a single
row? I'm unsure how Cassandra is able to grow. I realize you can have
a virtually infinite number of columns, but what are the implications
of growing the number of columns over time?

2) Maybe it's just a restriction of the CLI, but how do I issue a
slice request? Also, what if the start (or end) columns don't exist? I'm
guessing it's smart enough to get the columns in that range.

Thanks!

Bill-





Re: Schema Design

2011-01-26 Thread Bill Speirs
I have a basic understanding of OPP... if most of my messages come
within a single hour then a few nodes could be storing all of my
values, right?

You totally lost me on whether to shard data per system... Is my
schema (one column family per system, and row keys as TimeUUIDType)
sharding by system? I thought -- probably incorrectly -- that the row
keys are used in the sharding process, not the column families.

Thanks...

Bill-




Re: Schema Design

2011-01-26 Thread David McNelis
My CLI knowledge sucks so far, so I'll leave that to others... I'm doing
most of my reading/writing through a Thrift client (Hector/Java based).

As for the implications, as of the latest version of Cassandra there is no
theoretical limit to the number of columns that a particular row can hold.
Over time you've got a couple of different options: if you're concerned
that you'll end up with too many columns to manage, then you'd probably want to
start thinking about a long-term warehousing strategy for your older records
that involves expiring columns that are older than X in your Cassandra
cluster. But for the most part you shouldn't *need* to do that.
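
The "expiring columns that are older than X" part maps onto Cassandra's column
TTLs; a minimal sketch, reusing the hypothetical events_by_system table from
the earlier sketch (the CQL syntax shown here post-dates this thread):

    -- Same hypothetical table as before:
    CREATE TABLE events_by_system (
        system_name text,
        ts          timestamp,
        message     text,
        PRIMARY KEY (system_name, ts)
    );

    -- Write with a TTL of ~90 days (in seconds); the column expires
    -- automatically, so old records age out without a manual warehousing job.
    INSERT INTO events_by_system (system_name, ts, message)
    VALUES ('system_x', '2011-01-26 21:12:00+0000', 'disk almost full')
    USING TTL 7776000;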



On Wed, Jan 26, 2011 at 3:23 PM, Bill Speirs bill.spe...@gmail.com wrote:

 I like this approach, but I have 2 questions:

 1) what are the implications of continually adding columns to a single
 row? I'm unsure how Cassandra is able to grow. I realize you can have
 a virtually infinite number of columns, but what are the implications
 of growing the number of columns over time?

 2) maybe it's just a restriction of the CLI, but how do I issue a
 slice request? Also, what if start (or end) columns don't exist? I'm
 guessing it's smart enough to get the columns in that range.

 Thanks!

 Bill-

 On Wed, Jan 26, 2011 at 4:12 PM, David McNelis
 dmcne...@agentisenergy.com wrote:
  I would say in that case you might want to try a single column family
  where the row key is the system name.
  Then, you could name your columns as the timestamp. Then when retrieving
  information from the data store you can, in your slice request, specify
  your start column as X and your end column as Y.
  Then you can use the stored column name to know when an event occurred.
 
  On Wed, Jan 26, 2011 at 2:56 PM, Bill Speirs bill.spe...@gmail.com
 wrote:
 
  I'm looking to use Cassandra to store log messages from various
  systems. A log message only has a message (UTF8Type) and a date/time.
  My thought is to create a column family for each system. The row key
  will be a TimeUUIDType. Each row will have 7 columns: year, month,
  day, hour, minute, second, and message. I then have indexes setup for
  each of the date/time columns.
 
  I was hoping this would allow me to answer queries like: What are all
  the log messages that were generated between X and Y? The problem is
  that I can ONLY use the equals operator on these column values. For
  example, issuing get system_x where month > 1; gives me this
  error: No indexed columns present in index clause with operator EQ.
  The equals operator works as expected though: get system_x where month
  = 1;
 
  What schema would allow me to get date ranges?
 
  Thanks in advance...
 
  Bill-
 
  * ColumnFamily description *
 ColumnFamily: system_x_msg
   Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
   Row cache size / save period: 0.0/0
   Key cache size / save period: 20.0/3600
   Memtable thresholds: 1.1671875/249/60
   GC grace seconds: 864000
   Compaction min/max thresholds: 4/32
   Read repair chance: 1.0
   Built indexes: [proj_1_msg.646179, proj_1_msg.686f7572,
  proj_1_msg.6d696e757465, proj_1_msg.6d6f6e7468,
  proj_1_msg.7365636f6e64, proj_1_msg.79656172]
   Column Metadata:
 Column Name: year (year)
   Validation Class: org.apache.cassandra.db.marshal.IntegerType
   Index Type: KEYS
 Column Name: month (month)
   Validation Class: org.apache.cassandra.db.marshal.IntegerType
   Index Type: KEYS
 Column Name: second (second)
   Validation Class: org.apache.cassandra.db.marshal.IntegerType
   Index Type: KEYS
 Column Name: minute (minute)
   Validation Class: org.apache.cassandra.db.marshal.IntegerType
   Index Type: KEYS
 Column Name: hour (hour)
   Validation Class: org.apache.cassandra.db.marshal.IntegerType
   Index Type: KEYS
 Column Name: day (day)
   Validation Class: org.apache.cassandra.db.marshal.IntegerType
   Index Type: KEYS
 
 
 
  --
  David McNelis
  Lead Software Engineer
  Agentis Energy
  www.agentisenergy.com
  o: 630.359.6395
  c: 219.384.5143
  A Smart Grid technology company focused on helping consumers of energy
  control an often under-managed resource.
 
 




-- 
*David McNelis*
Lead Software Engineer
Agentis Energy
www.agentisenergy.com
o: 630.359.6395
c: 219.384.5143

*A Smart Grid technology company focused on helping consumers of energy
control an often under-managed resource.*


Re: Schema Design

2011-01-26 Thread Nick Santini
One thing you can do is create one CF, then as the row key use the
application name + timestamp; with that you can do your range query using
OPP, then store whatever you want in the row.

The problem would be if one app generates far more logs than the others.
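
Roughly, it might look like this (untested pycassa sketch; the names are invented,
and it assumes OPP with plain string row keys):

# one CF, row key = "<app>:<timestamp>", scanned as a key range, which only
# stays in time order under an order-preserving partitioner
import time
import pycassa

pool = pycassa.ConnectionPool('Logging', ['localhost:9160'])
logs = pycassa.ColumnFamily(pool, 'AppLogs')

# write: the key encodes the application and the event time
key = 'billing:%010d' % int(time.time())
logs.insert(key, {'message': 'payment processed', 'level': 'INFO'})

# read: with OPP the keys are stored sorted, so a key range gives a time range
for key, cols in logs.get_range(start='billing:1296000000',
                                finish='billing:1296086400'):
    print('%s %s' % (key, cols['message']))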

Nicolas Santini


On Thu, Jan 27, 2011 at 10:26 AM, Bill Speirs bill.spe...@gmail.com wrote:

 I have a basic understanding of OPP... if most of my messages come
 within a single hour then a few nodes could be storing all of my
 values, right?

 You totally lost me on whether to shard data as per system... Is my
 schema (one column family per system, and row keys as TimeUUIDType)
 sharding by system? I thought -- probably incorrectly -- that the row
 keys are used in the sharding process, not column families.

 Thanks...

 Bill-

 On Wed, Jan 26, 2011 at 4:17 PM, buddhasystem potek...@bnl.gov wrote:
 
  Having separate columns for Year, Month etc seems redundant. It's tons
 more
  efficient to keep say UTC time in POSIX format (basically integer). It's
  easy to convert back and forth.
 
  If you want to get a range of dates, in that case you might use Order
  Preserving Partitioner, and sort out which systems logged later in
 client.
  Read up on consequences of using OPP.
 
  Whether to shard data as per system depends on how many you have. If more
  than a few, don't do that, there are memory considerations.
 
  Cheers
 
  Maxim
 
 



Re: Schema Design

2011-01-26 Thread buddhasystem

I used the term sharding a bit frivolously. Sorry. It's just that splitting
semantically homogeneous data among CFs doesn't scale too well, as each CF
is allocated a piece of memory on the server.


RE: Schema Design

2011-01-26 Thread Shu Zhang
Each row can have a maximum of 2 billion columns, which a logging system will 
probably hit eventually.

More importantly, you'll only have 1 row per set of system logs. Every row is 
stored on the same machine(s), which means you'll definitely not be able to 
distribute your load very well.

From: Bill Speirs [bill.spe...@gmail.com]
Sent: Wednesday, January 26, 2011 1:23 PM
To: user@cassandra.apache.org
Subject: Re: Schema Design

I like this approach, but I have 2 questions:

1) what are the implications of continually adding columns to a single
row? I'm unsure how Cassandra is able to grow. I realize you can have
a virtually infinite number of columns, but what are the implications
of growing the number of columns over time?

2) maybe it's just a restriction of the CLI, but how do I issue a
slice request? Also, what if start (or end) columns don't exist? I'm
guessing it's smart enough to get the columns in that range.

Thanks!

Bill-

On Wed, Jan 26, 2011 at 4:12 PM, David McNelis
dmcne...@agentisenergy.com wrote:
 I would say in that case you might want to try a single column family
 where the row key is the system name.
 Then, you could name your columns as the timestamp. Then when retrieving
 information from the data store you can, in your slice request, specify
 your start column as X and your end column as Y.
 Then you can use the stored column name to know when an event occurred.

 On Wed, Jan 26, 2011 at 2:56 PM, Bill Speirs bill.spe...@gmail.com wrote:

 I'm looking to use Cassandra to store log messages from various
 systems. A log message only has a message (UTF8Type) and a date/time.
 My thought is to create a column family for each system. The row key
 will be a TimeUUIDType. Each row will have 7 columns: year, month,
 day, hour, minute, second, and message. I then have indexes setup for
 each of the date/time columns.

 I was hoping this would allow me to answer queries like: What are all
 the log messages that were generated between X and Y? The problem is
 that I can ONLY use the equals operator on these column values. For
 example, issuing get system_x where month > 1; gives me this
 error: No indexed columns present in index clause with operator EQ.
 The equals operator works as expected though: get system_x where month
 = 1;

 What schema would allow me to get date ranges?

 Thanks in advance...

 Bill-

 * ColumnFamily description *
ColumnFamily: system_x_msg
  Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
  Row cache size / save period: 0.0/0
  Key cache size / save period: 20.0/3600
  Memtable thresholds: 1.1671875/249/60
  GC grace seconds: 864000
  Compaction min/max thresholds: 4/32
  Read repair chance: 1.0
  Built indexes: [proj_1_msg.646179, proj_1_msg.686f7572,
 proj_1_msg.6d696e757465, proj_1_msg.6d6f6e7468,
 proj_1_msg.7365636f6e64, proj_1_msg.79656172]
  Column Metadata:
Column Name: year (year)
  Validation Class: org.apache.cassandra.db.marshal.IntegerType
  Index Type: KEYS
Column Name: month (month)
  Validation Class: org.apache.cassandra.db.marshal.IntegerType
  Index Type: KEYS
Column Name: second (second)
  Validation Class: org.apache.cassandra.db.marshal.IntegerType
  Index Type: KEYS
Column Name: minute (minute)
  Validation Class: org.apache.cassandra.db.marshal.IntegerType
  Index Type: KEYS
Column Name: hour (hour)
  Validation Class: org.apache.cassandra.db.marshal.IntegerType
  Index Type: KEYS
Column Name: day (day)
  Validation Class: org.apache.cassandra.db.marshal.IntegerType
  Index Type: KEYS



 --
 David McNelis
 Lead Software Engineer
 Agentis Energy
 www.agentisenergy.com
 o: 630.359.6395
 c: 219.384.5143
 A Smart Grid technology company focused on helping consumers of energy
 control an often under-managed resource.




Re: Schema Design

2011-01-26 Thread William R Speirs
It makes sense that the single row for a system (with a growing number of 
columns) will reside on a single machine.


With that in mind, here is my updated schema:

- A single column family for all the messages. The row keys will be the TimeUUID 
of the message with the following columns: date/time (in UTC POSIX), system 
name/id (with an index for fast/easy gets), the actual message payload.


- A column family for each system. The row keys will be UTC POSIX time with 1 
second (maybe 1 minute) bucketing, and the column names will be the TimeUUID of 
any messages that were logged during that time bucket.
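
To make that concrete, here's a rough, untested sketch of writing into both column 
families (pycassa; all names are invented, and everything is kept as plain strings 
for simplicity):

# Messages: one row per message, keyed by its TimeUUID (as a string)
# SystemXBuckets: one row per second, columns = TimeUUIDs logged in that second
import time
import uuid
import pycassa

pool = pycassa.ConnectionPool('Logging', ['localhost:9160'])
messages = pycassa.ColumnFamily(pool, 'Messages')
system_x = pycassa.ColumnFamily(pool, 'SystemXBuckets')

def log_message(payload):
    msg_id = str(uuid.uuid1())         # TimeUUID of the message
    bucket = str(int(time.time()))     # 1-second bucket
    # full message row
    messages.insert(msg_id, {'time': bucket,
                             'system': 'system_x',
                             'message': payload})
    # index row: bucket -> message TimeUUIDs
    system_x.insert(bucket, {msg_id: ''})
    return msg_id

log_message('disk almost full')

The bucket rows stay small and spread across the ring, which should address the 
single-machine concern above.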


My only hesitation with this design is that buddhasystem warned that each column 
family is allocated a piece of memory on the server. I'm not sure what the 
implications of this are and/or if this would be a problem if I had a number 
of systems on the order of hundreds.


Thanks...

Bill-

On 01/26/2011 06:51 PM, Shu Zhang wrote:

Each row can have a maximum of 2 billion columns, which a logging system will 
probably hit eventually.

More importantly, you'll only have 1 row per set of system logs. Every row is 
stored on the same machine(s), which means you'll definitely not be able to 
distribute your load very well.

From: Bill Speirs [bill.spe...@gmail.com]
Sent: Wednesday, January 26, 2011 1:23 PM
To: user@cassandra.apache.org
Subject: Re: Schema Design

I like this approach, but I have 2 questions:

1) what are the implications of continually adding columns to a single
row? I'm unsure how Cassandra is able to grow. I realize you can have
a virtually infinite number of columns, but what are the implications
of growing the number of columns over time?

2) maybe it's just a restriction of the CLI, but how do I issue a
slice request? Also, what if start (or end) columns don't exist? I'm
guessing it's smart enough to get the columns in that range.

Thanks!

Bill-

On Wed, Jan 26, 2011 at 4:12 PM, David McNelis
dmcne...@agentisenergy.com  wrote:

I would say in that case you might want to try a single column family
where the row key is the system name.
Then, you could name your columns as the timestamp. Then when retrieving
information from the data store you can, in your slice request, specify
your start column as X and your end column as Y.
Then you can use the stored column name to know when an event occurred.

On Wed, Jan 26, 2011 at 2:56 PM, Bill Speirs bill.spe...@gmail.com  wrote:


I'm looking to use Cassandra to store log messages from various
systems. A log message only has a message (UTF8Type) and a date/time.
My thought is to create a column family for each system. The row key
will be a TimeUUIDType. Each row will have 7 columns: year, month,
day, hour, minute, second, and message. I then have indexes setup for
each of the date/time columns.

I was hoping this would allow me to answer queries like: What are all
the log messages that were generated between X and Y? The problem is
that I can ONLY use the equals operator on these column values. For
example, issuing get system_x where month > 1; gives me this
error: No indexed columns present in index clause with operator EQ.
The equals operator works as expected though: get system_x where month
= 1;

What schema would allow me to get date ranges?

Thanks in advance...

Bill-

* ColumnFamily description *
ColumnFamily: system_x_msg
  Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
  Row cache size / save period: 0.0/0
  Key cache size / save period: 20.0/3600
  Memtable thresholds: 1.1671875/249/60
  GC grace seconds: 864000
  Compaction min/max thresholds: 4/32
  Read repair chance: 1.0
  Built indexes: [proj_1_msg.646179, proj_1_msg.686f7572,
proj_1_msg.6d696e757465, proj_1_msg.6d6f6e7468,
proj_1_msg.7365636f6e64, proj_1_msg.79656172]
  Column Metadata:
Column Name: year (year)
  Validation Class: org.apache.cassandra.db.marshal.IntegerType
  Index Type: KEYS
Column Name: month (month)
  Validation Class: org.apache.cassandra.db.marshal.IntegerType
  Index Type: KEYS
Column Name: second (second)
  Validation Class: org.apache.cassandra.db.marshal.IntegerType
  Index Type: KEYS
Column Name: minute (minute)
  Validation Class: org.apache.cassandra.db.marshal.IntegerType
  Index Type: KEYS
Column Name: hour (hour)
  Validation Class: org.apache.cassandra.db.marshal.IntegerType
  Index Type: KEYS
Column Name: day (day)
  Validation Class: org.apache.cassandra.db.marshal.IntegerType
  Index Type: KEYS




--
David McNelis
Lead Software Engineer
Agentis Energy
www.agentisenergy.com
o: 630.359.6395
c: 219.384.5143
A Smart Grid technology company focused on helping consumers of energy
control an often under-managed resource.




Re: Schema Design

2011-01-26 Thread William R Speirs

Ah, sweet... thanks for the link!

Bill-

On 01/26/2011 08:20 PM, buddhasystem wrote:


Bill, it's all explained here:

http://wiki.apache.org/cassandra/MemtableThresholds#JVM_Heap_Size

Watch the number of CFs and the memtable sizes.

In my experience, this all matters.


Re: Schema Design

2011-01-26 Thread Wangpei (Peter)
I am also working on a system that stores logs from hundreds of systems.
In my scenario, most queries look like this: let's look at the login logs (category 
EQ) of that proxy (host EQ) between this Monday and Wednesday (time range).
My data model looks like this:
. only 1 CF; that's enough for this scenario.
. group the logs from each host and day into one row. The key format is 
hostname.category.date
. store each log entry as a super column; the super column name is the TimeUUID of the 
log, and each attribute is a column.

Then this query can be done as 3 GETs, with no need for a key range scan.
So I can use RP instead of OPP. If I use OPP, I have to worry about load 
balancing myself. I hate that.
However, if I need time range access, I can still use a column slice.

An additional benefit is that I can clean out old logs very easily. We only keep logs 
for 1 year, and just deleting by key does this job well.

I think storing all logs for a host in a single row is not a good choice, for 2 
reasons:
1, too few keys, so your data will not distribute well.
2, the data under a key will always grow, so Cassandra has to do more SSTable 
compaction.
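
A rough, untested sketch of this model (pycassa; the keyspace, CF and attribute 
names are invented, and it assumes a super CF with a TimeUUIDType comparator):

import uuid
import pycassa

pool = pycassa.ConnectionPool('Logging', ['localhost:9160'])
logs = pycassa.ColumnFamily(pool, 'HostLogs')   # defined as a super CF

# write one login event for proxy01 on 2011-01-26:
# row key = host.category.date, super column name = TimeUUID, attributes below it
logs.insert('proxy01.login.20110126',
            {uuid.uuid1(): {'user': 'alice', 'result': 'ok'}})

# "Monday through Wednesday" is three plain GETs (a multiget here), no key
# range scan, so the random partitioner is fine
days = ['20110124', '20110125', '20110126']
rows = logs.multiget(['proxy01.login.%s' % d for d in days])

# retention: drop whole rows once they are older than a year
logs.remove('proxy01.login.20100101')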

-----Original Message-----
From: William R Speirs [mailto:bill.spe...@gmail.com] 
Sent: January 27, 2011 9:15
To: user@cassandra.apache.org
Subject: Re: Schema Design

It makes sense that the single row for a system (with a growing number of 
columns) will reside on a single machine.

With that in mind, here is my updated schema:

- A single column family for all the messages. The row keys will be the 
TimeUUID 
of the message with the following columns: date/time (in UTC POSIX), system 
name/id (with an index for fast/easy gets), the actual message payload.

- A column family for each system. The row keys will be UTC POSIX time with 1 
second (maybe 1 minute) bucketing, and the column names will be the TimeUUID of 
any messages that were logged during that time bucket.

My only hesitation with this design is that buddhasystem warned that each 
column family is allocated a piece of memory on the server. I'm not sure what the 
implications of this are and/or if this would be a problem if I had a number 
of systems on the order of hundreds.

Thanks...

Bill-

On 01/26/2011 06:51 PM, Shu Zhang wrote:
 Each row can have a maximum of 2 billion columns, which a logging system will 
 probably hit eventually.

 More importantly, you'll only have 1 row per set of system logs. Every row is 
 stored on the same machine(s), which means you'll definitely not be able 
 to distribute your load very well.
 
 From: Bill Speirs [bill.spe...@gmail.com]
 Sent: Wednesday, January 26, 2011 1:23 PM
 To: user@cassandra.apache.org
 Subject: Re: Schema Design

 I like this approach, but I have 2 questions:

 1) what are the implications of continually adding columns to a single
 row? I'm unsure how Cassandra is able to grow. I realize you can have
 a virtually infinite number of columns, but what are the implications
 of growing the number of columns over time?

 2) maybe it's just a restriction of the CLI, but how do I issue a
 slice request? Also, what if start (or end) columns don't exist? I'm
 guessing it's smart enough to get the columns in that range.

 Thanks!

 Bill-

 On Wed, Jan 26, 2011 at 4:12 PM, David McNelis
 dmcne...@agentisenergy.com  wrote:
 I would say in that case you might want to try a single column family
 where the row key is the system name.
 Then, you could name your columns as the timestamp. Then when retrieving
 information from the data store you can, in your slice request, specify
 your start column as X and your end column as Y.
 Then you can use the stored column name to know when an event occurred.

 On Wed, Jan 26, 2011 at 2:56 PM, Bill Speirs bill.spe...@gmail.com  wrote:

 I'm looking to use Cassandra to store log messages from various
 systems. A log message only has a message (UTF8Type) and a date/time.
 My thought is to create a column family for each system. The row key
 will be a TimeUUIDType. Each row will have 7 columns: year, month,
 day, hour, minute, second, and message. I then have indexes setup for
 each of the date/time columns.

 I was hoping this would allow me to answer queries like: What are all
 the log messages that were generated between X and Y? The problem is
 that I can ONLY use the equals operator on these column values. For
 example, issuing get system_x where month > 1; gives me this
 error: No indexed columns present in index clause with operator EQ.
 The equals operator works as expected though: get system_x where month
 = 1;

 What schema would allow me to get date ranges?

 Thanks in advance...

 Bill-

 * ColumnFamily description *
 ColumnFamily: system_x_msg
   Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
   Row cache size / save period: 0.0/0
   Key cache size / save period: 20.0/3600
   Memtable thresholds: 1.1671875/249/60
   GC grace seconds: 864000
   Compaction min/max

Re: Is this a good schema design to implement a social application..

2011-01-08 Thread Edward Capriolo
On Fri, Jan 7, 2011 at 11:38 PM, Rajkumar Gupta rajkumar@gmail.com wrote:
 In the twissandra example,
 http://www.riptano.com/docs/0.6/data_model/twissandra#adding-friends ,
 I find that they have split the materialized view of a user's homepage
 (like his followers list and tweets from friends) into several
 column families instead of putting supercolumns inside a single
 supercolumn family, thereby making the rows skinnier. I was wondering
 which one will give better performance in terms of reads.
 I think skinnier rows will definitely have the advantage of fewer row
 mutations, and thus good read performance when only they need to be
 retrieved, plus supercolumns such as the follower list are avoided
 (this sounds good, since the supercolumn indexing limitations will not
 hurt), but I am still not sure whether it would be beneficial in terms
 of performance numbers to split the materialized view of a single user
 into several column families instead of a single row in a single
 supercolumn family.





 On Sat, Jan 8, 2011 at 2:05 AM, Rajkumar Gupta rajkumar@gmail.com wrote:
 The fact that subcolumns inside the supercolumns aren't indexed
 currently may suck here, whenever a small number (10-20) of subcolumns
 needs to be retrieved from a large list of subcolumns of a supercolumn
 like MyPostsIdKeysList.

 On Fri, Jan 7, 2011 at 9:58 PM, Raj rajkumar@gmail.com wrote:
 My question is in the context of a social network schema design

 I am thinking of the following schema for storing a user's data that is
 required as he logs in and is led to his homepage:
 (I aimed at a schema design such that, through a single row read query,
 all the data that would be required to put up the homepage of that
 user is retrieved.)

 UserSuperColumnFamily: {    // Column Family

 UserIDKey:
 {columns:            MyName, MyEmail, MyCity,...etc
  supercolumns:    MyFollowersList, MyFollowiesList, MyPostsIdKeysList,
 MyInterestsList, MyAlbumsIdKeysList, MyVideoIdKeysList,
 RecentNotificationsForUserList,  MessagesReceivedList,
 MessagesSentList, AccountSettingsList, RecentSelfActivityList,
 UpdatesFromFollowiesList
 }
 }

 Thus the user's news feed would be generated using the super column
 UpdatesFromFollowiesList. But UpdatesFromFollowiesList would
 obviously contain only the IDs of the posts and not the entire post data.

 Questions:

 1.) What could be the problems with this design? Any improvements?

 2.) Would frequent and heavy overwrite operations / row mutations (for
 example, when propagating the post updates for the news feed from some
 user to all his followies), which lead to rows ultimately being in
 several SSTables, result in degraded read performance? Is it
 suitable to use the row cache (a very big row, but all of its data is
 required while the user is logged in)? If I do not use the cache, it may
 be very expensive to pull the row each time data is required for the
 given user, since the row would be in several SSTables. How can I
 improve the read performance here?

 The actual data of the posts from the network would be retrieved using
 PostIdKey through subsequent read queries from the column family
 PostsSuperColumnFamily, which would be as follows:

 PostsSuperColumnFamily:{

 PostIdKey:
 {
 columns:            PostOwnerId, PostBody
 supercolumns:   TagsForPost {list of columns of all tags for the
 post}, PeopleWhoLikedThisPost {list of columns of UserIdKey of all the
 likers}
 }
 }

 Is this the best design to go with, or are there any issues to consider
 here? Thanks in anticipation of your valuable comments!




From your description, UserSuperColumnFamily seems to be both a
standard column family and a super column family. You cannot do that. However, you
can encode things such as MyName, MyCity and MyState into a 'UserInfo'
super column: UserInfo:MyState...

(As you mentioned) super columns are not indexed and have to be
completely de-serialized on each access. Because of this they are not
widely used for anything but small keys with a few columns. This also
applies to mutations: the row can exist in multiple SSTables
until it finally gets compacted. That can result in much more storage
being used for an object that changes often.

Most designs use composite keys, or something like JSON-encoded
values with standard column families, to achieve something like a super
column.
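
For example, a rough, untested sketch of both approaches (pycassa; all names are
invented, assuming a standard CF with UTF8 column names):

import json
import pycassa

pool = pycassa.ConnectionPool('Social', ['localhost:9160'])
users = pycassa.ColumnFamily(pool, 'Users')   # a standard CF

users.insert('user123', {
    # "UserInfo:"-style composite column names group related columns
    'UserInfo:MyName':  'Raj',
    'UserInfo:MyCity':  'Delhi',
    'UserInfo:MyState': 'Delhi',
    # a JSON-encoded value stands in for a small MyFollowersList super column
    'MyFollowersList':  json.dumps(['user456', 'user789']),
})

# a slice on the "UserInfo:" prefix pulls back just that group of columns
info = users.get('user123', column_start='UserInfo:', column_finish='UserInfo:~')

# the list comes back as one column and is decoded client side
followers = json.loads(
    users.get('user123', columns=['MyFollowersList'])['MyFollowersList'])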

(SuperColumns are not always as Super as they seem :)


Re: SQL Server to Cassandra Schema Design - Ideas Anyone?

2010-06-24 Thread Jonathan Ellis
The main idea is to denormalize your data into multiple CFs at write
time so that each CF lets you answer a query from a single row.

http://www.rackspacecloud.com/blog/2010/05/12/cassandra-by-example/ is
a good place to start.
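
For example, here's a rough, untested sketch of writing the same web hit into two
query-shaped CFs (pycassa; all names are invented, not taken from the blog post,
and it assumes TimeUUIDType column names):

import time
import uuid
import pycassa

pool = pycassa.ConnectionPool('WebStats', ['localhost:9160'])
hits_by_site = pycassa.ColumnFamily(pool, 'HitsBySiteDay')
hits_by_customer = pycassa.ColumnFamily(pool, 'HitsByCustomerDay')

def record_hit(site, customer, page):
    day = time.strftime('%Y%m%d')
    col = uuid.uuid1()    # time-ordered, unique per hit
    # row per site per day: answers "traffic for site X on day D"
    hits_by_site.insert('%s:%s' % (site, day), {col: '%s %s' % (customer, page)})
    # row per customer per day: answers "what did customer C do on day D"
    hits_by_customer.insert('%s:%s' % (customer, day), {col: '%s %s' % (site, page)})

record_hit('site42', 'cust-7', '/quotes/ibm')

Reads then become a single-row slice per CF.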

On Tue, Jun 22, 2010 at 3:43 PM, Craig Faulkner
craig.faulk...@wallst.com wrote:
 I'm having a little block in converting an existing SQL Server schema that we 
 have into Cassandra Keyspace(s).  The whole key-value thing has just not 
 clicked yet.  Do any of you know of any good examples that are more complex 
 than the example in the readme file?

 We are looking to report on web traffic so things like hits, page views, 
 unique visitors,...  All the basic web stuff.  I'm very sure that one of you, 
 likely many more, is already doing this.

 Here are two queries just to give you a few key words related to the metrics 
 that we want to move into Cassandra:


 /* Data logged */
 select t.Datetime,c.CustomerNumber,ct.cust_type,ws.SiteNumber,ws.SiteName
 ,f.Session,wa.page,wa.Note,f.CPUTime,f.DCLWaitTime,f.DCLRequestCount,'clientip'
  = dbo.u_IpInt2Str(ClientIP)
 from warehouse.dbo.fact_WebHit f
 join Warehouse.dbo.dim_Time t
      on t.ID = f.TimeID
 join Warehouse.dbo.dim_CustomerType ct
      on ct.ID = f.CustomerTypeID
 join Warehouse.dbo.dim_Customer c
      on c.ID = f.CustomerID
 join Warehouse.dbo.dim_Symbol s
      on s.ID = f.SymbolID
 join Warehouse.dbo.dim_WebAction wa
      on wa.ID = f.WebActionID
 join Warehouse.dbo.dim_WebSite ws
      on ws.ID = f.WebSiteID

 /* Data with surrogate keys */
 select f.Timeid,f.CustomerID,f.CustomerTypeID,f.WebSiteID 
 ,f.Session,f.WebActionID,f.CPUTime
 ,f.DCLWaitTime,f.DCLRequestCount,ClientIP
 from warehouse.dbo.fact_WebHit f


 Any good info would be appreciated.  I have of course checked the main web 
 sites but I could have missed something along the way.

 Craig




-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


SQL Server to Cassandra Schema Design - Ideas Anyone?

2010-06-22 Thread Craig Faulkner
I'm having a little block in converting an existing SQL Server schema that we 
have into Cassandra Keyspace(s).  The whole key-value thing has just not 
clicked yet.  Do any of you know of any good examples that are more complex 
than the example in the readme file?

We are looking to report on web traffic so things like hits, page views, unique 
visitors,...  All the basic web stuff.  I'm very sure that one of you, likely 
many more, is already doing this.

Here are two queries just to give you a few key words related to the metrics 
that we want to move into Cassandra:


/* Data logged */
select t.Datetime,c.CustomerNumber,ct.cust_type,ws.SiteNumber,ws.SiteName
,f.Session,wa.page,wa.Note,f.CPUTime,f.DCLWaitTime,f.DCLRequestCount,'clientip' 
= dbo.u_IpInt2Str(ClientIP)
from warehouse.dbo.fact_WebHit f
join Warehouse.dbo.dim_Time t
  on t.ID = f.TimeID
join Warehouse.dbo.dim_CustomerType ct
  on ct.ID = f.CustomerTypeID
join Warehouse.dbo.dim_Customer c
  on c.ID = f.CustomerID
join Warehouse.dbo.dim_Symbol s
  on s.ID = f.SymbolID
join Warehouse.dbo.dim_WebAction wa
  on wa.ID = f.WebActionID
join Warehouse.dbo.dim_WebSite ws
  on ws.ID = f.WebSiteID

/* Data with surrogate keys */
select f.Timeid,f.CustomerID,f.CustomerTypeID,f.WebSiteID 
,f.Session,f.WebActionID,f.CPUTime
,f.DCLWaitTime,f.DCLRequestCount,ClientIP
from warehouse.dbo.fact_WebHit f


Any good info would be appreciated.  I have of course checked the main web 
sites but I could have missed something along the way.

Craig