Re: Counters and Top 10

2011-12-24 Thread Janne Jalkanen

In our case we didn't need an exact daily top-10 list of pages, just a good 
guess of it.  So the way we did it was to insert a column with a short TTL 
(e.g. 12 hours) with the page id as the column name.  Then, when constructing 
the top-10 list, we'd just slice through the entire list of unexpired page 
ids, get the actual activity data for each from another CF, and then sort.  The 
theory is that if a page is popular, it will have been referenced at least once in the 
past 12 hours anyway.  Depending on the size of your hot page set and the 
frequency at which you'd need the top-10 list, you can then tune the TTL 
accordingly.  We started at 24 hrs, then went down to 12 and then gradually 
downwards.

So while it's not guaranteed to be the precise top-10 list for the day, it is a 
fairly accurate sampling of one.
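The sampling scheme above can be sketched in a few lines of Python. This is only an illustration of the logic: in-memory dicts stand in for the two column families, and the names record_hit and top_n are made up for this sketch, not anything from our code.

```python
import time

TTL_SECONDS = 12 * 60 * 60  # the 12-hour TTL from the mail

# Stand-ins for two column families: one holds recently-seen page ids
# with an expiry time (the short-TTL columns, where Cassandra would
# expire them for us), the other holds the real activity counters.
recent_pages = {}     # page_id -> expiry timestamp
activity_counts = {}  # page_id -> total activity count

def record_hit(page_id, now=None):
    """On each page view: bump the real counter and refresh the TTL column."""
    now = time.time() if now is None else now
    activity_counts[page_id] = activity_counts.get(page_id, 0) + 1
    recent_pages[page_id] = now + TTL_SECONDS

def top_n(n=10, now=None):
    """Slice the unexpired page ids, look up their counters, and sort."""
    now = time.time() if now is None else now
    live = [pid for pid, expiry in recent_pages.items() if expiry > now]
    return sorted(live, key=lambda pid: activity_counts.get(pid, 0),
                  reverse=True)[:n]
```

A page that stops being hit simply drops out of recent_pages once its TTL passes, which is exactly why the result is a sampling of the top 10 rather than an exact one.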

/Janne

On 23 Dec 2011, at 11:52, aaron morton wrote:

 Counters only update the value of a column; they cannot be used as column 
 names. So you cannot have a dynamically updating top ten list using counters.
 
 You have a couple of options. First, use something like redis if that fits 
 your use case. Redis could either be the database of record for the counts, 
 or just an aggregation layer: write the data to cassandra and to sorted sets in 
 redis, then read the top ten from redis, and use cassandra to rebuild redis if 
 needed. 
 
 The other is to periodically pivot the counts into a top ten row where you 
 use regular integers for the column name. With only 10K users you could do 
 this with a process that periodically reads all the user rows, or wherever 
 the counters are, and updates the aggregate row. Depending on data size you 
 could use hive/pig or whatever regular programming language you are happy 
 with.
 
 I guess you could also use redis to keep the top ten sorted and then 
 periodically dump that back to cassandra and serve the read traffic from 
 there.  
 
 Hope that helps 
 
 
 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 23/12/2011, at 3:46 AM, R. Verlangen wrote:
 
 I would suggest creating a CF with a single row (or multiple rows for 
 historical data) with a date as key (utf8, e.g. 2011-12-22) and multiple 
 columns, one per user's score. The column name (utf8) would then be the score + 
 something unique to the user (e.g. the hex representation of the TimeUUID). The 
 value would be the TimeUUID of the user.
 
 By default columns will be sorted and you can perform a slice to get the top 
 10.
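A small Python sketch of this design, with a plain dict standing in for the day's row and sorted() standing in for the CF's comparator. One assumption I've added that the mail doesn't spell out: under a utf8 comparator the score must be zero-padded to a fixed width, otherwise "9" sorts after "10" lexicographically.

```python
import uuid

# One "row" per day; column names are zero-padded score + user uuid so
# that lexicographic (utf8) ordering matches numeric ordering.
day_row = {}  # column name -> user TimeUUID (the column value)

def write_score(user_uuid, score):
    # The zero-padding width is an illustrative choice, not from the mail.
    day_row["%010d:%s" % (score, user_uuid)] = user_uuid

def top_ten():
    # A reversed slice over the sorted column names yields the highest scores.
    names = sorted(day_row, reverse=True)[:10]
    return [day_row[name] for name in names]
```

Note that if the same user's score changes, the old column would also need to be deleted, or stale entries accumulate in the row.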
 
 2011/12/14 cbert...@libero.it cbert...@libero.it
 Hi all,
 I'm using Cassandra in production for a small social network (~10,000 people).
 Now I have to assign some credits to each user operation (login, write post
 and so on) and then be capable of providing at any moment the top 10 of
 the most active users. I'm on Cassandra 0.7.6; I'd like to migrate to a newer
 version in order to use Counters for the user points but ... what about the top
 10?
 I was thinking about a specific ROW that always keeps the 10 most active users
 ... but I think it would be heavy (to write, and to handle in a thread-safe
 way) ... can counters provide something like a value-ordered list?
 
 Thanks for any help.
 Best regards,
 
 Carlo
 
 
 
 



Re: cassandra data to hadoop.

2011-12-24 Thread Mohit Anchlia
You could read the data using a Cassandra client and write it to HDFS using the Hadoop FS API.
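In outline, that pattern is just a scan-and-write loop. The sketch below is heavily hedged: fetch_rows is a made-up placeholder for whatever Cassandra client scan you use, and a local CSV file stands in for the HDFS output stream you would open through Hadoop's FS API.

```python
import csv

def fetch_rows():
    """Placeholder for a Cassandra client scan (iterate a CF with your
    driver of choice); yields (row_key, columns) pairs.  The sample
    data here is invented for illustration."""
    yield "user1", {"score": "42"}
    yield "user2", {"score": "7"}

def export(path):
    # A local CSV stands in for the HDFS write; with Hadoop's FS API
    # you would open an output stream against an hdfs:// path instead.
    with open(path, "w", newline="") as out:
        writer = csv.writer(out)
        for key, cols in fetch_rows():
            writer.writerow([key] + ["%s=%s" % kv for kv in sorted(cols.items())])
```

Whether this is run ad hoc or on a schedule, the client-side loop stays the same; only the destination stream changes.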

On Fri, Dec 23, 2011 at 11:20 PM, ravikumar visweswara
talk2had...@gmail.com wrote:
 Jeremy,

 We use the Cloudera distribution for our hadoop cluster, and it may not be possible
 to migrate to brisk quickly because of flume/hue dependencies. Did you
 successfully pull the data from an independent cassandra cluster and dump it into
 a completely disconnected hadoop cluster? It would be really helpful if you
 could elaborate on how to achieve this.

 -R


 On Fri, Dec 23, 2011 at 9:28 AM, Jeremy Hanna jeremy.hanna1...@gmail.com
 wrote:

 We do this all the time.  Take a look at
 http://wiki.apache.org/cassandra/HadoopSupport for some details - you can
 use mapreduce or pig to get data out of cassandra.  If it's going to a
 separate hadoop cluster, I don't think you'd need to co-locate task trackers
 or data nodes on your cassandra nodes - the data would just be copied over the
 network.  We also use oozie for job scheduling, fwiw.

 On Dec 23, 2011, at 9:12 AM, ravikumar visweswara wrote:

  Hello All,
 
  I need to dump cassandra data to a hadoop cluster for further
  analytics. A lot of other relevant data which is not present in cassandra is
  already available in hdfs for analysis. Both are independent clusters right
  now.
  Is there a suggested way to get the data periodically or continuously into
  HDFS from cassandra? Any ideas or references would be very helpful.
 
  Thanks and Regards
  R




java.lang.AssertionError

2011-12-24 Thread Michael Vaknine
I have a 4-node cluster environment, version 1.0.3, which was upgraded from 0.7.6.

I keep getting java.lang.AssertionError from time to time on my cluster.

Is there anything I can do to fix the problem?

 

Thanks

Michael

 

NYC-Cass1 ERROR [NonPeriodicTasks:1] 2011-12-24 23:06:11,129
AbstractCassandraDaemon.java (line 133) Fatal exception in thread Thread
java.lang.AssertionError: attempted to delete non-existing file
AttractionUserIdx.AttractionUserIdx_09partition_idx-h-1-Digest.sha1
	at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:49)
	at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:44)
	at org.apache.cassandra.io.sstable.SSTable.delete(SSTable.java:139)
	at org.apache.cassandra.io.sstable.SSTableDeletingTask.runMayThrow(SSTableDeletingTask.java:81)
	at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:207)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:619)