Hello,

We have a column family which stores incoming requests, and we would like to 
perform some analytics on that data using Hadoop. The results don't need to be 
real-time, but they should be available fairly quickly, within an hour or so.
So we store the current hour number (calculated from the request timestamp) in 
a "partition_number" field with a secondary index.

Currently it looks like this (I skipped a few columns to avoid unnecessary 
details):

CREATE TABLE requests (
    request_id UUID PRIMARY KEY,
    partition_number INT,
    payload ASCII
);

CREATE INDEX ON requests(partition_number);
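
For completeness, the hour number is computed on the client side at write time, 
so the inserts look roughly like this (a simplified sketch; we derive the hour 
number from the request timestamp, e.g. unix seconds / 3600):

INSERT INTO requests (request_id, partition_number, payload)
VALUES (?, ?, ?);
-- partition_number = hour number derived from the request timestamp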

Every hour we launch Hadoop jobs to process the data for the previous hour, so 
Hadoop runs queries over the indexed "partition_number" column, roughly as 
sketched below.
With several million rows I'm observing very poor performance for such queries, 
and I realize that a secondary index on a high-cardinality field is a bad idea. 
However, I don't see a good alternative so far.
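
The query side is essentially this (the actual query is issued from the Hadoop 
job, so this is just an illustration):

SELECT request_id, payload
FROM requests
WHERE partition_number = ?;    -- hour number of the previous hour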

I was considering creating a temporary column family every hour, writing the 
data there, having Hadoop process it the next hour, and then throwing it away. 
However, there is a limitation: we need to keep the raw incoming data, since in 
the future we'll have to provide new types of analytics reports.

So my questions are the following:

1.  Is the approach of running hourly Hadoop jobs solid for near-real-time 
analytics (where results should be available within 1 hour), or would it be 
better to take a look at Storm or something like that?
2.  What's the recommended data schema for storing events "sharded" by hour, 
with the possibility of quickly retrieving them by hour later? (Assuming the 
amount of data for one hour is small enough to fit in one wide row.) A rough 
sketch of the kind of layout I mean is below.
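
To make question 2 concrete, this is roughly what I have in mind (table and 
column names are just placeholders, not something we've settled on):

CREATE TABLE requests_by_hour (
    hour_number INT,           -- hour bucket, e.g. hours since epoch
    request_id UUID,
    payload ASCII,
    PRIMARY KEY (hour_number, request_id)
);

-- fetch everything for a given hour from a single wide row
SELECT request_id, payload FROM requests_by_hour WHERE hour_number = ?;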


Thank you.


