Re: Counters and Top 10
In our case we didn't need an exact daily top-10 list of pages, just a good guess. So we inserted a column with a short TTL (e.g. 12 hours), using the page id as the column name. Then, when constructing the top-10 list, we'd slice through the entire list of unexpired page ids, fetch the actual activity data for each from another CF, and then sort. The theory is that if a page is popular, it will have been referenced at least once in the past 12 hours anyway. Depending on the size of your set of hot pages and how often you need the top-10 list, you can tune the TTL accordingly. We started at 24 hours, went down to 12, and then gradually lower. So while it's not guaranteed to be the precise top-10 list for the day, it is a fairly accurate sampling of one.

/Janne

On 23 Dec 2011, at 11:52, aaron morton wrote:

Counters only update the value of a column; they cannot be used as column names. So you cannot have a dynamically updating top-ten list using counters alone. You have a couple of options.

First, use something like Redis if that fits your use case. Redis could either be the database of record for the counts, or just an aggregation layer: write the data to Cassandra and to sorted sets in Redis, read the top ten from Redis, and use Cassandra to rebuild Redis if needed.

The other is to periodically pivot the counts into a top-ten row where you use regular integers for the column names. With only 10K users you could do this with a process that periodically reads all the user rows (or wherever the counters are) and updates the aggregate row. Depending on data size you could use Hive/Pig or whatever regular programming language you are happy with.

I guess you could also use Redis to keep the top ten sorted and then periodically dump that back to Cassandra and serve the read traffic from there.

Hope that helps

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 23/12/2011, at 3:46 AM, R.
Verlangen wrote:

I would suggest you create a CF with a single row (or multiple rows for historical data), with a date as the key (utf8, e.g. 2011-12-22) and a column for every user's score. The column name (utf8) would be the score plus something unique to the user (e.g. the hex representation of the TimeUUID); the value would be the user's TimeUUID. By default columns are sorted, so you can perform a slice to get the top 10.

2011/12/14 cbert...@libero.it:

Hi all, I'm using Cassandra in production for a small social network (~10,000 people). Now I have to assign credits to each user operation (login, write post, and so on) and then be capable of providing at any moment the top 10 most active users. I'm on Cassandra 0.7.6; I'd like to migrate to a newer version in order to use Counters for the user points, but ... what about the top 10? I was thinking about a specific row that always keeps the 10 most active users, but I think that would be heavy to write and to handle in a thread-safe way. Can counters provide something like a value-ordered list? Thanks for any help. Best regards, Carlo
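The two ideas in this thread can be sketched in plain Python, with in-memory dicts standing in for the column families. This is only an illustration of the logic, not real Cassandra client code; the names `touch_page`, `top_n`, `score_column`, and `top_n_from_row` are hypothetical, and the 12-hour TTL follows Janne's example.

```python
import time
import heapq

# --- Janne's approach: a row of TTL'd columns as a "recently seen" candidate set ---
recent = {}    # page_id -> expiry timestamp; stands in for TTL'd columns in one row
activity = {}  # page_id -> activity count; stands in for another CF

TTL = 12 * 3600  # seconds, per the 12-hour TTL in the post

def touch_page(page_id, now=None):
    """Record a page hit: bump its counter and refresh its TTL'd column."""
    now = time.time() if now is None else now
    recent[page_id] = now + TTL
    activity[page_id] = activity.get(page_id, 0) + 1

def top_n(n=10, now=None):
    """Slice the unexpired candidate set, look up real counts, then sort."""
    now = time.time() if now is None else now
    live = [p for p, exp in recent.items() if exp > now]
    return heapq.nlargest(n, live, key=lambda p: activity.get(p, 0))

# --- R. Verlangen's approach: encode the score into the column name ---
# A fixed-width, zero-padded score plus a unique user suffix sorts
# lexicographically, so a reverse slice of one date-keyed row yields the
# top scores directly.
def score_column(score, user_id):
    return "%010d:%s" % (score, user_id)

def top_n_from_row(row_columns, n=10):
    """row_columns: iterable of column names from one date-keyed row."""
    return sorted(row_columns, reverse=True)[:n]
```

Note the zero-padding in `score_column`: without a fixed width, "9" would sort after "10" in a utf8 comparator, breaking the slice order.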
Re: cassandra data to hadoop.
You could read using a Cassandra client and write to HDFS using the Hadoop FS API.

On Fri, Dec 23, 2011 at 11:20 PM, ravikumar visweswara talk2had...@gmail.com wrote:

Jeremy, we use the Cloudera distribution for our Hadoop cluster, and it may not be possible to migrate to Brisk quickly because of Flume/Hue dependencies. Did you successfully pull the data from an independent Cassandra cluster and dump it into a completely disconnected Hadoop cluster? It would be really helpful if you could elaborate on how to achieve this. -R

On Fri, Dec 23, 2011 at 9:28 AM, Jeremy Hanna jeremy.hanna1...@gmail.com wrote:

We do this all the time. Take a look at http://wiki.apache.org/cassandra/HadoopSupport for some details - you can use mapreduce or pig to get data out of cassandra. If it's going to a separate hadoop cluster, I don't think you'd need to co-locate task trackers or data nodes on your cassandra nodes - it would just need to copy over the network. We also use oozie for job scheduling, fwiw.

On Dec 23, 2011, at 9:12 AM, ravikumar visweswara wrote:

Hello all, I have a situation where I need to dump Cassandra data to a Hadoop cluster for further analytics. A lot of other relevant data which is not present in Cassandra is already available in HDFS for analysis. Both are independent clusters right now. Is there a suggested way to get the data periodically or continuously from Cassandra to HDFS? Any ideas or references would be very helpful. Thanks and Regards, R
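The "read with a client, write with the Hadoop FS API" route suggested above amounts to a simple export loop. Below is a hedged Python sketch of that shape: `fetch_rows` is a hypothetical stand-in for a range scan through a Cassandra client (e.g. iterating key ranges), and the in-memory buffer stands in for a file you would then push to HDFS with `hdfs dfs -put` or the Java FileSystem API.

```python
import csv
import io

def fetch_rows():
    """Hypothetical stand-in for a range scan via a Cassandra client.
    Yields (row_key, {column_name: value}) pairs."""
    yield ("user1", {"logins": "12", "posts": "3"})
    yield ("user2", {"logins": "7", "posts": "9"})

def export_tsv(rows, out):
    """Flatten (row_key, columns) pairs into one TSV line per column,
    a shape that is easy to load into Hive/Pig on the Hadoop side.
    Returns the number of lines written."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    count = 0
    for row_key, columns in rows:
        for name, value in sorted(columns.items()):
            writer.writerow([row_key, name, value])
            count += 1
    return count

buf = io.StringIO()
n = export_tsv(fetch_rows(), buf)
# buf.getvalue() would then be written out and copied to HDFS, e.g.
# with `hdfs dfs -put export.tsv /data/cassandra/`.
```

Running this periodically (e.g. from Oozie, as mentioned above) gives the batch-style sync the thread describes; the MapReduce/Pig route on the wiki page is the heavier-weight alternative for large data sizes.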
java.lang.AssertionError
I have a 4-node cluster, version 1.0.3, which was upgraded from 0.7.6. I keep getting a java.lang.AssertionError from time to time on my cluster. Is there anything I can do to fix the problem?

Thanks
Michael

NYC-Cass1 ERROR [NonPeriodicTasks:1] 2011-12-24 23:06:11,129 AbstractCassandraDaemon.java (line 133) Fatal exception in thread Thread
java.lang.AssertionError: attempted to delete non-existing file AttractionUserIdx.AttractionUserIdx_09partition_idx-h-1-Digest.sha1
    at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:49)
    at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:44)
    at org.apache.cassandra.io.sstable.SSTable.delete(SSTable.java:139)
    at org.apache.cassandra.io.sstable.SSTableDeletingTask.runMayThrow(SSTableDeletingTask.java:81)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:207)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)