anyone using Cassandra as an analytics/data warehouse?

2011-01-04 Thread Dave Viner
Does anyone use Cassandra to power an analytics or data warehouse
implementation?

As a concrete example, one could imagine Cassandra storing data for
something that reports on page-views on a website.  The basic notions might
be simple (url as row-key and columns as timeuuids of viewers).  But, how
would one store things like ip-geolocation to set of pages viewed?  Or
hour-of-day to pages viewed?

Also, how would one do a query like
- tell me how many page views occurred between 12/01/2010 and 12/31/2010?
- tell me how many page views occurred between 12/01/2010 and 12/31/2010
from the US?
- tell me how many page views occurred between 12/01/2010 and 12/31/2010
from the US in the 9th hour of the day (in gmt)?

Time slicing and dimension slicing seems like it might be very challenging
(especially since the windows of time would not be known in advance).

Thanks
Dave Viner


Re: anyone using Cassandra as an analytics/data warehouse?

2011-01-04 Thread Peter Harrison
Okay, here is two ways to handle this, both are quite different from each
other.


A)

This approach does not depend on counters. You simply have a Column Family
with the row key being the Unix time divided by 60x60 and a column key of...
pretty much anything unique. Then have another process look at the current
row every hour to actually compile the numbers, and store the count in the
same Column Family. This will solve the first and third use cases, as it is
just a matter of looking at the right rows. The second case will require a
similar index, but one which includes a country code to be appended to the
row key.

The downside here is that you are storing lots of data on individual
requests and retaining it. If you don't want the detailed data you might add
a second process to purge the detail every hour.

B)

There is a counter feature added to the latest versions of Cassandra. I
have not used them, but they should be able to be used to achieve the same
effect without a second process cleaning up every hour. Also means it is
more of a real time system so you can see how many requests in the hour you
are currently in.



Basically you have to design your approach based on the query you will be
doing. Don't get too hung up on traditional data structures and queries as
they have little relationship to a Cassandra approach.


On Wed, Jan 5, 2011 at 2:34 PM, Dave Viner davevi...@gmail.com wrote:

 Does anyone use Cassandra to power an analytics or data warehouse
 implementation?

 As a concrete example, one could imagine Cassandra storing data for
 something that reports on page-views on a website.  The basic notions might
 be simple (url as row-key and columns as timeuuids of viewers).  But, how
 would one store things like ip-geolocation to set of pages viewed?  Or
 hour-of-day to pages viewed?

 Also, how would one do a query like
 - tell me how many page views occurred between 12/01/2010 and 12/31/2010?
 - tell me how many page views occurred between 12/01/2010 and 12/31/2010
 from the US?
 - tell me how many page views occurred between 12/01/2010 and 12/31/2010
 from the US in the 9th hour of the day (in gmt)?

 Time slicing and dimension slicing seems like it might be very challenging
 (especially since the windows of time would not be known in advance).

 Thanks
 Dave Viner



Re: anyone using Cassandra as an analytics/data warehouse?

2011-01-04 Thread Dave Viner
Hi Peter,

Thanks.  These are great ideas.  One comment tho.  I'm actually not as
worried about the logging into the system performance and more
speculating/imagining the querying out of the system.

Most traditional data warehouses have a cube or a star schema or something
similar.  I'm trying to imagine how one might use Cassandra in situations
where that sort of design has historically been applied.

But, I want to make sure I understand your suggestion A.

Is it something like this?

a Column Family with the row key being the Unix time divided by 60x60 and a
column key of... pretty much anything unique
LogCF[hour-day-in-epoch-seconds][timeuuid] = 1
where 'hour-day-in-epoch-seconds' is something like the first second of the
given hour of the day, so 01/04/2011 19:00:00 (in epoch
seconds: 1294167600); 'timeuuid' is a TimeUUID from cassandra, and '1' is
the value of the entry.

Then look at the current row every hour to actually compile the numbers,
and store the count in the same Column Family
LogCF[hour-day-in-epoch-seconds][total] = x
where 'x' is the sum of the number of timeuuid columns in the row?


Is that what you're envisioning in Option A?

Thanks
Dave Viner



On Tue, Jan 4, 2011 at 6:38 PM, Peter Harrison cheetah...@gmail.com wrote:

 Okay, here is two ways to handle this, both are quite different from each
 other.


 A)

 This approach does not depend on counters. You simply have a Column Family
 with the row key being the Unix time divided by 60x60 and a column key of...
 pretty much anything unique. Then have another process look at the current
 row every hour to actually compile the numbers, and store the count in the
 same Column Family. This will solve the first and third use cases, as it is
 just a matter of looking at the right rows. The second case will require a
 similar index, but one which includes a country code to be appended to the
 row key.

 The downside here is that you are storing lots of data on individual
 requests and retaining it. If you don't want the detailed data you might add
 a second process to purge the detail every hour.

 B)

 There is a counter feature added to the latest versions of Cassandra. I
 have not used them, but they should be able to be used to achieve the same
 effect without a second process cleaning up every hour. Also means it is
 more of a real time system so you can see how many requests in the hour you
 are currently in.



 Basically you have to design your approach based on the query you will be
 doing. Don't get too hung up on traditional data structures and queries as
 they have little relationship to a Cassandra approach.



 On Wed, Jan 5, 2011 at 2:34 PM, Dave Viner davevi...@gmail.com wrote:

 Does anyone use Cassandra to power an analytics or data warehouse
 implementation?

 As a concrete example, one could imagine Cassandra storing data for
 something that reports on page-views on a website.  The basic notions might
 be simple (url as row-key and columns as timeuuids of viewers).  But, how
 would one store things like ip-geolocation to set of pages viewed?  Or
 hour-of-day to pages viewed?

 Also, how would one do a query like
 - tell me how many page views occurred between 12/01/2010 and
 12/31/2010?
 - tell me how many page views occurred between 12/01/2010 and 12/31/2010
 from the US?
 - tell me how many page views occurred between 12/01/2010 and 12/31/2010
 from the US in the 9th hour of the day (in gmt)?

 Time slicing and dimension slicing seems like it might be very challenging
 (especially since the windows of time would not be known in advance).

 Thanks
 Dave Viner





Re: anyone using Cassandra as an analytics/data warehouse?

2011-01-04 Thread Jake Luciani
Some relevant information here:
https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/

On Tue, Jan 4, 2011 at 10:09 PM, Dave Viner davevi...@gmail.com wrote:

 Hi Peter,

 Thanks.  These are great ideas.  One comment tho.  I'm actually not as
 worried about the logging into the system performance and more
 speculating/imagining the querying out of the system.

 Most traditional data warehouses have a cube or a star schema or something
 similar.  I'm trying to imagine how one might use Cassandra in situations
 where that sort of design has historically been applied.

 But, I want to make sure I understand your suggestion A.

 Is it something like this?

 a Column Family with the row key being the Unix time divided by 60x60 and
 a column key of... pretty much anything unique
 LogCF[hour-day-in-epoch-seconds][timeuuid] = 1
 where 'hour-day-in-epoch-seconds' is something like the first second of the
 given hour of the day, so 01/04/2011 19:00:00 (in epoch
 seconds: 1294167600); 'timeuuid' is a TimeUUID from cassandra, and '1' is
 the value of the entry.

 Then look at the current row every hour to actually compile the numbers,
 and store the count in the same Column Family
 LogCF[hour-day-in-epoch-seconds][total] = x
 where 'x' is the sum of the number of timeuuid columns in the row?


 Is that what you're envisioning in Option A?

 Thanks
 Dave Viner



 On Tue, Jan 4, 2011 at 6:38 PM, Peter Harrison cheetah...@gmail.comwrote:

 Okay, here is two ways to handle this, both are quite different from each
 other.


 A)

 This approach does not depend on counters. You simply have a Column Family
 with the row key being the Unix time divided by 60x60 and a column key of...
 pretty much anything unique. Then have another process look at the current
 row every hour to actually compile the numbers, and store the count in the
 same Column Family. This will solve the first and third use cases, as it is
 just a matter of looking at the right rows. The second case will require a
 similar index, but one which includes a country code to be appended to the
 row key.

 The downside here is that you are storing lots of data on individual
 requests and retaining it. If you don't want the detailed data you might add
 a second process to purge the detail every hour.

 B)

 There is a counter feature added to the latest versions of Cassandra. I
 have not used them, but they should be able to be used to achieve the same
 effect without a second process cleaning up every hour. Also means it is
 more of a real time system so you can see how many requests in the hour you
 are currently in.



 Basically you have to design your approach based on the query you will be
 doing. Don't get too hung up on traditional data structures and queries as
 they have little relationship to a Cassandra approach.



 On Wed, Jan 5, 2011 at 2:34 PM, Dave Viner davevi...@gmail.com wrote:

 Does anyone use Cassandra to power an analytics or data warehouse
 implementation?

 As a concrete example, one could imagine Cassandra storing data for
 something that reports on page-views on a website.  The basic notions might
 be simple (url as row-key and columns as timeuuids of viewers).  But, how
 would one store things like ip-geolocation to set of pages viewed?  Or
 hour-of-day to pages viewed?

 Also, how would one do a query like
 - tell me how many page views occurred between 12/01/2010 and
 12/31/2010?
 - tell me how many page views occurred between 12/01/2010 and 12/31/2010
 from the US?
 - tell me how many page views occurred between 12/01/2010 and 12/31/2010
 from the US in the 9th hour of the day (in gmt)?

 Time slicing and dimension slicing seems like it might be very
 challenging (especially since the windows of time would not be known in
 advance).

 Thanks
 Dave Viner