A couple of days ago I came across Countandra ( http://countandra.org/ ).
It seems that it might be a solution for you.

Gr. Robin

2012/1/20 Tamar Fraenkel <ta...@tok-media.com>

> **
>
>   Hi!
>
> I am a newbie to Cassandra and seeking some advice regarding the data
> model I should use to best address my needs.
>
> For simplicity, what I want to accomplish is:
>
> I have a system that has users (potentially ~10,000 per day) and they
> perform actions in the system (total of ~50,000 a day).
>
> Each User’s action is taking place in a certain point in time, and is also
> classified into categories (1 to 5) and tagged by 1-30 tags. Each action’s
> Categories and Tags has a score associated with it, the score is between 0
> to 1 (let’s assume precision of 0.0001).
>
> I want to be able to identify similar actions in the system (performed
> usually by more than one user). Similarity of actions is calculated based
> on their common Categories and Tags taking scores into account.
>
> I need the system to store:
>
>    - The list of my users with attributes like name, age etc
>    - For each action – the categories and tags associated with it and
>    their score, the time of the action, and the user who performed it.
>    - Groups of similar actions (ActionGroups) – the id’s of actions in
>    the group, the categories and tags describing the group, with their scores.
>    Those are calculated using an algorithm that takes into account the
>    categories and tags of the actions in the group.
>
> When a user performs a new action in the system, I want to add it to a
> fitting ActionGroups (with similar categories and tags).
>
> For this I need to be able to perform the following:
>
> Find all the recent ActionGroups (those who were updated with actions
> performed during the last T minutes), who has at list one of the new
> action’s categories AND at list one of the new action’s tags.
>
>
>
> I thought of two ways to address the issue and I would appreciate your
> insights.
>
>
>
> First one using secondary indexes
>
> Column Family: *Users*
>
> Key: userId
>
> Compare with Bytes Type
>
> Columns: name: <>, age: <> etc…
>
>
>
> Column Family: *Actions*
>
> Key: actionId
>
> Compare with Bytes Type
>
> Columns:  Category1 : <Score> ….
>
>           CategoriN: <Score>,
>
>           Tag1 : <Score>, ….
>
>           TagK:<Score>
>
>           Time: timestamp
>
>           user: userId
>
>
>
> Column Family: *ActionGroups*
>
> Key: actionGroupId
>
> Compare with Bytes Type
>
> Columns: Category1 : <Score> ….
>
>          CategoriN: <Score>,
>
>          Tag1 : <Score> ….
>
>          TagK:<Score>
>
>          lastUpdateTime: timestamp
>
>          actionId1: null, … ,
>
>          actionIdM: null
>
>
>
> I will then define secondary index on each tag columns, category columns,
> and the update time column.
>
> Let’s assume the new action I want to add to ActionGroup has
> NewActionCategory1 - NewActionCategoryK, and has NewActionTag1 –
> NewActionTagN. I will perform the following query:
>
> Select  * From ActionGroups where
>
>    (NewActionCategory1 > 0  … or NewActionCategoryK > 0) and
>
>    (NewActionTag1 > 0  … or NewActionTagN > 0) and
>
>    lastUpdateTime > T;
>
>
>
> Second solution
>
> Have the same CF as in the first solution *without the secondary* *index*, 
> and have two additional CF-ies:
>
> Column Family: *CategoriesToActionGroupId*
>
> Key: categoryId
>
> Compare with ByteType
>
> Columns: {Timestamp, ActionGroupsId1 } : null
>
>          {Timestamp, ActionGroupsId2} : null
>
>          ...
>
> *timestamp is the update time for the ActionGroup
>
>
>
> A similar CF will be defined for tags.
>
>
>
> I will then be able to run several queries on CategoriesToActionGroupId
> (one for each of the new story Categories), with column slice for the right
> update time of the ActionGroup.
>
> I will do the same for the TagsToActionGroupId.
>
> I will then use my client code to remove duplicates (ActionGroups who are
> associated with more than one Tag or Category).
>
>
>
> My questions are:
>
>    1. Are the two solutions viable? If yes, which is better
>    2. Is there any better way of doing this?
>    3. Can I use jdbc and CQL with both method, or do I have to use Hector
>    (I am using Java).
>
> Thanks
>
> Tamar
>
>
>
>
>

Reply via email to