Indeed, I took the no-delete approach. If the time bucket rows are not that big,
this is a good temporary solution.
I just finished implementing and testing it on a small staging environment.
So far so good.
Tamar

On May 21, 2012, at 9:11 PM, Filippo Diotalevi <fili...@ntoklo.com> wrote:

> Hi Tamar,
> the solution you propose is indeed a "temporary solution", but it might be 
> the best one.
> 
> Which approach did you follow?
> I'm a bit concerned about the deletion approach, since in the case of
> concurrent writes to the same counter you might "lose" the pointer to the
> column to delete.
> 
> -- 
> Filippo Diotalevi
> 
> 
> On Monday, 21 May 2012 at 18:51, Tamar Fraenkel wrote:
> 
>> I also had a similar problem. I have a temporary solution, which is not
>> ideal, but may be of help.
>> I have a counter CF to count events, but apart from that I also keep a
>> leaders CF:
>> leaders = {
>>   // key is the time bucket
>>   // column names are composite(rank, event), ordered by
>>   // descending rank
>>   // set a relevant TTL on the columns
>>   time_bucket1 : {
>>     composite(1000, event1) : "",
>>     composite(999, event2) : ""
>>   },
>>   ...
>> }
>> Whenever I increment the counter for a specific event, I also add a column
>> to the time bucket row of the leaders CF with the new counter value and the
>> event name.
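>> For illustration, here is a minimal sketch of that write path, assuming the
>> Hector Java client and a leaders CF created with comparator =
>> 'CompositeType(LongType(reversed=true),UTF8Type)'; the CF names and the
>> ttlSeconds parameter are made up:
>> 
>> import me.prettyprint.cassandra.serializers.CompositeSerializer;
>> import me.prettyprint.cassandra.serializers.LongSerializer;
>> import me.prettyprint.cassandra.serializers.StringSerializer;
>> import me.prettyprint.hector.api.Keyspace;
>> import me.prettyprint.hector.api.beans.Composite;
>> import me.prettyprint.hector.api.factory.HFactory;
>> import me.prettyprint.hector.api.mutation.Mutator;
>> 
>> public class LeadersWriter {
>>     // Increment the event counter, read the new value back, and mirror it
>>     // as a composite(count, event) column in the leaders CF.
>>     public static void recordEvent(Keyspace ks, String timeBucket,
>>                                    String eventName, int ttlSeconds) {
>>         Mutator<String> mutator = HFactory.createMutator(ks, StringSerializer.get());
>>         mutator.incrementCounter(timeBucket, "EventCounters", eventName, 1L);
>> 
>>         // No getAndSet for counters, so this read-back is the racy step.
>>         long newCount = HFactory.createCounterColumnQuery(ks,
>>                 StringSerializer.get(), StringSerializer.get())
>>             .setColumnFamily("EventCounters").setKey(timeBucket)
>>             .setName(eventName).execute().get().getValue();
>> 
>>         // Empty value: the composite column name carries all the data.
>>         Composite name = new Composite();
>>         name.addComponent(newCount, LongSerializer.get());
>>         name.addComponent(eventName, StringSerializer.get());
>>         mutator.addInsertion(timeBucket, "Leaders",
>>             HFactory.createColumn(name, "", ttlSeconds,
>>                 new CompositeSerializer(), StringSerializer.get()));
>>         mutator.execute();
>>     }
>> }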
>> There are two ways to go from here: either delete the old column(s) for that
>> event (the ones with lower counts) from the leaders CF, or let them be.
>> If you choose to delete, there is the complication that counters have no
>> getAndSet, so you may end up not deleting all the old columns.
>> If you choose not to delete old columns and live with duplicate columns per
>> event (each with a different count), your query to retrieve the leaders
>> will run longer.
>> Anyway, when you need to retrieve the leaders, you can do a slice query on
>> the leaders CF and ignore duplicate events on the client side (I use Java).
>> Duplicates will be rarer if you do delete old columns.
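>> For example, the client-side filtering can be a single pass over the slice,
>> since the columns come back in descending count order. A sketch, again
>> assuming the Hector client (the CF name and slice size are made up):
>> 
>> import java.util.LinkedHashMap;
>> import java.util.Map;
>> import me.prettyprint.cassandra.serializers.CompositeSerializer;
>> import me.prettyprint.cassandra.serializers.LongSerializer;
>> import me.prettyprint.cassandra.serializers.StringSerializer;
>> import me.prettyprint.hector.api.Keyspace;
>> import me.prettyprint.hector.api.beans.Composite;
>> import me.prettyprint.hector.api.beans.HColumn;
>> import me.prettyprint.hector.api.factory.HFactory;
>> 
>> public class LeadersReader {
>>     // Return the top n events for a time bucket, keeping only the first
>>     // (i.e. highest) column seen per event and skipping stale duplicates.
>>     public static Map<String, Long> topEvents(Keyspace ks, String timeBucket, int n) {
>>         Map<String, Long> leaders = new LinkedHashMap<String, Long>();
>>         for (HColumn<Composite, String> col : HFactory.createSliceQuery(ks,
>>                     StringSerializer.get(), new CompositeSerializer(),
>>                     StringSerializer.get())
>>                 .setColumnFamily("Leaders").setKey(timeBucket)
>>                 .setRange(null, null, false, 1000) // first 1000 columns
>>                 .execute().get().getColumns()) {
>>             long count = col.getName().get(0, LongSerializer.get());
>>             String event = col.getName().get(1, StringSerializer.get());
>>             if (!leaders.containsKey(event)) {   // ignore stale duplicates
>>                 leaders.put(event, count);
>>                 if (leaders.size() == n) break;
>>             }
>>         }
>>         return leaders;
>>     }
>> }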
>> 
>> Another option is not to use Cassandra for this purpose; http://redis.io/
>> is a nice tool.
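>> For comparison, a Redis sorted set gives you both the increment and the
>> ranking directly; a minimal sketch with the Jedis client (the key name and
>> expiry are made up):
>> 
>> import redis.clients.jedis.Jedis;
>> import redis.clients.jedis.Tuple;
>> 
>> public class RedisLeaders {
>>     public static void main(String[] args) {
>>         Jedis jedis = new Jedis("localhost");
>> 
>>         // Increment the event's score in the time bucket's sorted set, and
>>         // let the whole bucket expire, like the TTL'ed columns above.
>>         jedis.zincrby("leaders:2012-05-21", 1, "event1");
>>         jedis.expire("leaders:2012-05-21", 7 * 24 * 3600);
>> 
>>         // Top ten, already ordered by descending score; no duplicates to filter.
>>         for (Tuple t : jedis.zrevrangeWithScores("leaders:2012-05-21", 0, 9)) {
>>             System.out.println(t.getElement() + " -> " + (long) t.getScore());
>>         }
>>     }
>> }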
>> 
>> I will be happy to hear your comments.
>> Thanks,
>> 
>> Tamar Fraenkel 
>> Senior Software Engineer, TOK Media 
>> 
>> ta...@tok-media.com
>> Tel:   +972 2 6409736 
>> Mob:  +972 54 8356490 
>> Fax:   +972 2 5612956 
>> 
>> On Mon, May 21, 2012 at 8:05 PM, Filippo Diotalevi <fili...@ntoklo.com> 
>> wrote:
>>> Hi Romain,
>>> thanks for your suggestion.
>>> 
>>> When you say "build every day a ranking in a dedicated CF by iterating
>>> over events", do you mean:
>>> - load all the columns for the specified row key
>>> - iterate over each column, and write a new column in the inverted index?
>>> 
>>> That's my current approach, but since I have many of these wide rows (one
>>> per day), the process is extremely slow, as it involves moving an entire
>>> row from Cassandra to the client, inverting every column, and sending the
>>> data back to create the inverted index.
>>> 
>>> -- 
>>> Filippo Diotalevi
>>> 
>>> 
>>> On Monday, 21 May 2012 at 17:19, Romain HARDOUIN wrote:
>>> 
>>>> 
>>>> If I understand correctly, you've got a data model which looks like this: 
>>>> 
>>>> CF Events: 
>>>>     "row1": { "event1": 1050, "event2": 1200, "event3": 830, ... } 
>>>> 
>>>> You can't query on column values, but you can build a ranking every day
>>>> in a dedicated CF by iterating over the events: 
>>>> 
>>>> create column family Ranking 
>>>>     with comparator = 'LongType(reversed=true)'   
>>>>     ... 
>>>> 
>>>> CF Ranking: 
>>>>     "rank": { 1200: "event2", 1050: "event1", 830: "event3", ... } 
>>>>     
>>>> Then you can make a "top ten" or whatever you want, because the counter
>>>> values will be sorted. 
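>>>> As an illustration, the daily job could look roughly like this with the
>>>> Hector Java client (a sketch: the CF and key names follow the example
>>>> above, and a real version would page through the wide row instead of
>>>> slicing it in one call):
>>>> 
>>>> import me.prettyprint.cassandra.serializers.LongSerializer;
>>>> import me.prettyprint.cassandra.serializers.StringSerializer;
>>>> import me.prettyprint.hector.api.Keyspace;
>>>> import me.prettyprint.hector.api.beans.HCounterColumn;
>>>> import me.prettyprint.hector.api.factory.HFactory;
>>>> import me.prettyprint.hector.api.mutation.Mutator;
>>>> 
>>>> public class RankingBuilder {
>>>>     // Read the day's counters and write them inverted into Ranking; the
>>>>     // reversed LongType comparator keeps the columns sorted by count.
>>>>     // Caveat: two events with equal counts collide on the same column name.
>>>>     public static void buildRanking(Keyspace ks) {
>>>>         Mutator<String> m = HFactory.createMutator(ks, StringSerializer.get());
>>>>         for (HCounterColumn<String> c : HFactory.createCounterSliceQuery(ks,
>>>>                     StringSerializer.get(), StringSerializer.get())
>>>>                 .setColumnFamily("Events").setKey("row1")
>>>>                 .setRange(null, null, false, Integer.MAX_VALUE)
>>>>                 .execute().get().getColumns()) {
>>>>             m.addInsertion("rank", "Ranking",
>>>>                 HFactory.createColumn(c.getValue(), c.getName(),
>>>>                     LongSerializer.get(), StringSerializer.get()));
>>>>         }
>>>>         m.execute();
>>>>     }
>>>> }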
>>>> 
>>>> 
>>>> Filippo Diotalevi <fili...@ntoklo.com> wrote on 21/05/2012 16:59:43:
>>>> 
>>>> > Hi, 
>>>> > I'm trying to understand the best design for a simple "ranking" use
>>>> > case. 
>>>> > I have, in a row, a good number (10k to a few 100k) of counters, each
>>>> > one counting the occurrences of an event. At the end of the day, I
>>>> > want to create a ranking of the most frequently occurring events. 
>>>> > 
>>>> > What's the best approach to perform this task?  
>>>> > The brute-force approach of retrieving the row and ordering it doesn't
>>>> > work well (the call usually times out, especially if Cassandra is also
>>>> > under load); I also don't know in advance the full set of event names
>>>> > (column names), so it's difficult to slice the get call. 
>>>> > 
>>>> > Is there any trick to solve this problem? Maybe a way to retrieve 
>>>> > the row ordered by counter values? 
>>>> > 
>>>> > Thanks, 
>>>> > -- 
>>>> > Filippo Diotalevi
>>> 
>> 
> 
