Indeed, I took the no-delete approach. If the time-bucket rows are not that big, this is a good temporary solution. I just finished implementing and testing it on a small staging environment. So far so good.

Tamar
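For concreteness, the no-delete write path might look roughly like the following in Java. This is a sketch, not Tamar's actual code: it assumes the Hector client and hypothetical CF names ("Counters" for the counter CF, "Leaders" for the leaders CF with a CompositeType(LongType(reversed=true), UTF8Type) comparator, matching the model quoted below). The increment-then-read is not atomic, which is exactly why the no-delete variant tolerates it: a stale read just leaves one more duplicate column behind.

import me.prettyprint.cassandra.serializers.CompositeSerializer;
import me.prettyprint.cassandra.serializers.LongSerializer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.Composite;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class LeaderWriter {
    private final Keyspace keyspace;

    public LeaderWriter(Keyspace keyspace) {
        this.keyspace = keyspace;
    }

    public void recordEvent(String timeBucket, String event, int ttlSeconds) {
        Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());

        // 1. Increment the counter for this event.
        mutator.incrementCounter(timeBucket, "Counters", event, 1L);

        // 2. Read the new value back (no getAndSet for counters, so this
        //    may race with concurrent increments; see caveat above).
        long count = HFactory.createCounterColumnQuery(keyspace,
                        StringSerializer.get(), StringSerializer.get())
                .setColumnFamily("Counters")
                .setKey(timeBucket)
                .setName(event)
                .execute().get().getValue();

        // 3. Add a composite(count, event) column to the leaders row; the
        //    reversed LongType component sorts highest counts first. Old
        //    columns for the same event are left in place and expire via TTL.
        Composite name = new Composite();
        name.addComponent(count, LongSerializer.get());
        name.addComponent(event, StringSerializer.get());
        mutator.insert(timeBucket, "Leaders",
                HFactory.createColumn(name, "", ttlSeconds,
                        CompositeSerializer.get(), StringSerializer.get()));
    }
}

At read time a single slice over the time-bucket row returns columns highest-count-first, and the client keeps only the first occurrence of each event name to drop the duplicates.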
Sent from my iPod

On May 21, 2012, at 9:11 PM, Filippo Diotalevi <fili...@ntoklo.com> wrote:

> Hi Tamar,
> the solution you propose is indeed a "temporary solution", but it might be the best one.
>
> Which approach did you follow?
> I'm a bit concerned about the deletion approach, since in case of concurrent writes on the same counter you might "lose" the pointer to the column to delete.
>
> --
> Filippo Diotalevi
>
>
> On Monday, 21 May 2012 at 18:51, Tamar Fraenkel wrote:
>
>> I also had a similar problem. I have a temporary solution, which is not the best, but may be of help.
>> I have the counter CF to count events, but apart from that I hold a leaders CF:
>>
>> leaders = {
>>     // key is time bucket
>>     // values are composites (rank, event) ordered by
>>     // descending order of the rank
>>     // set relevant TTL on columns
>>     time_bucket1 : {
>>         composite(1000, event1) : ""
>>         composite(999, event2) : ""
>>     },
>>     ...
>> }
>>
>> Whenever I increment the counter for a specific event, I add a column in the time-bucket row of the leaders CF with the new value of the counter and the event name.
>> There are two ways to go here: either delete the old column(s) for that event (with lower counters) from the leaders CF, or let them be.
>> If you choose to delete, there is the complication of not having getAndSet for counters, so you may end up not deleting all the old columns.
>> If you choose not to delete old columns and live with duplicate columns for events (each with a different count), your query to retrieve the leaders will run longer.
>> Either way, when you need to retrieve the leaders, you can do a slice query on the leaders CF and ignore duplicate events in the client (I use Java). Duplicates will be rarer if you do delete old columns.
>>
>> Another option is not to use Cassandra for this purpose; http://redis.io/ is a nice tool.
>>
>> Will be happy to hear your comments.
>> Thanks,
>>
>> Tamar Fraenkel
>> Senior Software Engineer, TOK Media
>>
>> ta...@tok-media.com
>> Tel: +972 2 6409736
>> Mob: +972 54 8356490
>> Fax: +972 2 5612956
>>
>>
>> On Mon, May 21, 2012 at 8:05 PM, Filippo Diotalevi <fili...@ntoklo.com> wrote:
>>> Hi Romain,
>>> thanks for your suggestion.
>>>
>>> When you say "build every day a ranking in a dedicated CF by iterating over events", do you mean
>>> - load all the columns for the specified row key
>>> - iterate over each column, and write a new column in the inverted index?
>>>
>>> That's my current approach, but since I have many of these wide rows (one per day), the process is extremely slow, as it involves moving an entire row from Cassandra to the client, inverting every column, and sending the data back to create the inverted index.
>>>
>>> --
>>> Filippo Diotalevi
>>>
>>>
>>> On Monday, 21 May 2012 at 17:19, Romain HARDOUIN wrote:
>>>
>>>> If I understand correctly, you've got a data model which looks like this:
>>>>
>>>> CF Events:
>>>> "row1": { "event1": 1050, "event2": 1200, "event3": 830, ... }
>>>>
>>>> You can't query on column values, but you can build a ranking every day in a dedicated CF by iterating over the events:
>>>>
>>>> create column family Ranking
>>>>     with comparator = 'LongType(reversed=true)'
>>>>     ...
>>>>
>>>> CF Ranking:
>>>> "rank": { 1200: "event2", 1050: "event1", 830: "event3", ... }
>>>>
>>>> Then you can make a "top ten" or whatever you want, because counter values will be sorted.
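To make Romain's daily rebuild concrete: a rough Java sketch, again assuming the Hector client and hypothetical CF names ("Events", "Ranking"). Rather than fetching the whole wide row in one call, which is what times out for Filippo, it pages through the counter columns in fixed-size slices and batches the inverted writes:

import java.util.List;
import me.prettyprint.cassandra.serializers.LongSerializer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.HCounterColumn;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;
import me.prettyprint.hector.api.query.SliceCounterQuery;

public class RankingBuilder {
    private static final int PAGE_SIZE = 1000;

    public static void rebuild(Keyspace keyspace, String eventsRow, String rankingRow) {
        Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
        SliceCounterQuery<String, String> query = HFactory.createCounterSliceQuery(
                keyspace, StringSerializer.get(), StringSerializer.get());
        query.setColumnFamily("Events").setKey(eventsRow);

        String start = "";
        while (true) {
            query.setRange(start, "", false, PAGE_SIZE);
            List<HCounterColumn<String>> columns = query.execute().get().getColumns();
            for (HCounterColumn<String> column : columns) {
                if (column.getName().equals(start)) {
                    continue; // each page re-reads the previous page's last column
                }
                // Invert: the counter value becomes the column name and the
                // event name the value, so the Ranking row sorts by count
                // (descending, thanks to LongType(reversed=true)).
                mutator.addInsertion(rankingRow, "Ranking",
                        HFactory.createColumn(column.getValue(), column.getName(),
                                LongSerializer.get(), StringSerializer.get()));
            }
            mutator.execute(); // flush one batch per page
            if (columns.size() < PAGE_SIZE) {
                break; // a short page means the end of the row
            }
            start = columns.get(columns.size() - 1).getName();
        }
    }
}

One caveat of this model: two events with the same count collide on the Ranking column name, so one overwrites the other. A composite column name such as (count, event), as in Tamar's leaders CF, avoids the collision.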
>>>>
>>>> Filippo Diotalevi <fili...@ntoklo.com> wrote on 21/05/2012 16:59:43:
>>>>
>>>> > Hi,
>>>> > I'm trying to understand the best design for a simple "ranking" use case.
>>>> > I have, in a row, a good number (10k to a few 100k) of counters; each one counts the occurrences of an event. At the end of the day, I want to create a ranking of the most frequent events.
>>>> >
>>>> > What's the best approach to perform this task?
>>>> > The brute-force approach of retrieving the row and ordering it doesn't work well (the call usually times out, especially if Cassandra is also under load); I also don't know in advance the full set of event names (column names), so it's difficult to slice the get call.
>>>> >
>>>> > Is there any trick to solve this problem? Maybe a way to retrieve the row ordered by counter values?
>>>> >
>>>> > Thanks,
>>>> > --
>>>> > Filippo Diotalevi
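As a footnote to the Redis pointer in Tamar's message: a Redis sorted set does this ranking natively, since ZINCRBY keeps members ordered by score as you count. A minimal sketch with the Jedis client (the client choice, key name, and host are assumptions, not from the thread):

import redis.clients.jedis.Jedis;

public class RedisRanking {
    public static void main(String[] args) {
        Jedis jedis = new Jedis("localhost", 6379); // hypothetical host/port

        // Counting: ZINCRBY atomically bumps the event's score.
        jedis.zincrby("leaders:2012-05-21", 1, "event1");
        jedis.zincrby("leaders:2012-05-21", 1, "event2");
        jedis.zincrby("leaders:2012-05-21", 1, "event1");

        // End of day: top ten events by descending score, no client-side sort.
        for (String event : jedis.zrevrange("leaders:2012-05-21", 0, 9)) {
            System.out.println(event);
        }

        jedis.close();
    }
}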