We are thinking about using Cassandra to store our search logs. Can
someone point me in the right direction/lend some guidance on design? I
am new to Cassandra and I am having trouble wrapping my head around some
of these new concepts. My brain keeps wanting to go back to an RDBMS design.
We will be storing the user's query, the number of hits returned, and the
session id. We would like to be able to answer the following questions:
- What are the n most popular queries and their counts within the last x
(mins/hours/days/etc.)? Basically, the most popular searches within a
given time range.
- What is the most popular query within the last x where hits = 0? Same
as above but with an extra "where" clause.
- For session id x, give me all of that session's other queries.
- What are all the session ids that searched for 'foos'?
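
To make the four questions concrete, here is a rough sketch of a
query-first layout, where each question gets its own denormalized
structure that is written to on every search. This is plain Python, not
Cassandra client code: each dict only stands in for one table keyed by
its partition key, and every name in it (BUCKET_SECONDS, log_search,
top_queries, the four dict names) is made up for illustration. It also
assumes per-minute buckets, so time windows are rounded to whole minutes.

```python
from collections import Counter, defaultdict

# Hypothetical bucket width: counts are kept per one-minute bucket.
BUCKET_SECONDS = 60

# Each dict stands in for one table, keyed by what would be the partition key.
counts_by_bucket = defaultdict(Counter)           # bucket -> {query: count}
zero_hit_counts_by_bucket = defaultdict(Counter)  # bucket -> {query: count}, hits == 0 only
queries_by_session = defaultdict(list)            # session_id -> [(ts, query, hits)]
sessions_by_query = defaultdict(set)              # query -> {session_id}


def log_search(ts, session_id, query, hits):
    """Record one search by writing to every structure (denormalized writes)."""
    bucket = ts // BUCKET_SECONDS
    counts_by_bucket[bucket][query] += 1
    if hits == 0:
        zero_hit_counts_by_bucket[bucket][query] += 1
    queries_by_session[session_id].append((ts, query, hits))
    sessions_by_query[query].add(session_id)


def top_queries(table, now, window_seconds, n):
    """Sum the per-bucket counts that overlap the window and return the top n.

    The window is rounded outward to whole buckets, so the result is
    approximate at the edges.
    """
    first = (now - window_seconds) // BUCKET_SECONDS
    last = now // BUCKET_SECONDS
    total = Counter()
    for bucket in range(first, last + 1):
        total.update(table.get(bucket, {}))
    return total.most_common(n)
```

With this layout, the first two questions become
top_queries(counts_by_bucket, now, window, n) and
top_queries(zero_hit_counts_by_bucket, now, window, n), and the last two
are single-key lookups in queries_by_session and sessions_by_query. The
trade-off is four writes per search instead of one.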
We currently accomplish the above with MySQL using two tables: one for
the raw search log information and the other to keep the
aggregate/running counts of queries.
Would this sort of ad-hoc querying be better implemented using Hadoop +
Hive? If so, should I be storing all this information in Cassandra then
using Hadoop to retrieve it?
Thanks for your suggestions.