A couple of thoughts: 400 row mutations in a batch may be a bit high. More is not
always better. Watch the thread pool stats (nodetool tpstats) to see if the
mutation pool is backing up excessively.
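
For example, with pycassa you can let the client buffer the batch and try a
smaller size (an untested sketch; the pool, column family, and loop names here
are made up, adjust to your schema):

    import pycassa

    # connect to the local node; 'MyKeyspace' and 'Similarities' are placeholders
    pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
    similarities = pycassa.ColumnFamily(pool, 'Similarities')

    # queue_size is how many mutations pycassa buffers client side before
    # sending one batch_mutate call; try 100 instead of 400 and compare
    with similarities.batch(queue_size=100) as batch:
        for item_a, item_b, score in calculate_similarities():  # your crunching loop
            batch.insert(item_a, {item_b: str(score)})
    # leaving the with block flushes whatever is still queued

Then watch nodetool tpstats while it runs; if Pending for the MutationStage
pool keeps climbing, the batches are arriving faster than the node can apply
them.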

Also, if you feel like having fun, take a look at the durable_writes config
setting for keyspaces. From the cli help…
- durable_writes: When set to false all RowMutations on keyspace will by-pass 
CommitLog.
  Set to true by default.

This removes disk access from the write path, which sounds OK in your case.
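
You can flip it with an update keyspace from the same cli (I think this is the
syntax, check "help update keyspace;" first):

    update keyspace MyKeyspace with durable_writes = false;

The trade-off is that anything still sitting in the memtables is lost if the
node dies, which should not matter when you can just re-run the crunching.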

When you are doing the reads, the fastest slice predicate is one with no start,
no finish, reversed = false (see
http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/). You can now reverse
the storage order of comparators, so if you are getting columns from the end of
the row, consider changing the storage order.
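
In pycassa terms the fast path is just a plain get with the defaults (again a
sketch, untested, using the same placeholder names as above):

    item_key = 'item-42'  # placeholder key
    # no column_start / column_finish and column_reversed=False lets the
    # server read the row in storage order without seeking to a start column
    cols = similarities.get(item_key, column_count=10000)

And if your comparator is, say, LongType and you mostly want columns from the
top of the range, declaring it as LongType(reversed=true) when you create the
column family stores the columns back to front, so the end of the row becomes
the cheap end to slice.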

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 19/08/2011, at 3:43 AM, Paul Loy wrote:

> Yeah, the data after crunching drops to just 65000 columns, so one Cassandra
> node is plenty. That will all go in memory on one box. It's only the crunching
> where we have lots of data and then need it arranged in a structured manner. 
> That's why I don't use flat files that I just append to. I need them in order 
> of similarity to generate the vectors.
> 
> Bulk loading looks interesting.
> 
> On Thu, Aug 18, 2011 at 4:21 PM, Jake Luciani <jak...@gmail.com> wrote:
> So you only have 1 cassandra node?
> 
> If you are interested only in getting the complete work done as fast as 
> possible before you begin reading, take a look at the new bulk loader in 
> cassandra:
> 
> http://www.datastax.com/dev/blog/bulk-loading
> 
> -Jake
> 
> 
> On Thu, Aug 18, 2011 at 11:03 AM, Paul Loy <ketera...@gmail.com> wrote:
> Yeah, we're processing item similarities. So we are writing single columns at
> a time, although we do batch these into 400 mutations before sending to
> Cassy. We currently perform almost 2 billion calculations that then write 
> almost 4 billion columns.
> 
> Once all similarities are calculated, we just grab a slice per item and 
> create a denormalised vector of similar items (trimmed down to topN and only 
> those above a certain threshold). This makes lookup super fast as we only get 
> one column from cassandra.
> 
> So we just want to optimise the crunching and storing phase as that's an
> O(n^2) complexity problem. The quicker we can make that, the quicker the whole
> process works.
> 
> I'm going to try disabling minor compactions as a start.
> 
> 
> > Is the loading disk, CPU, or network bound?
> 
> cpu is at 40% free
> only one cassy node on the same box as the processor for now so no network 
> traffic
> so I think it's disk access. Will find out for sure tomorrow after the 
> current test runs.
> 
> Thanks,
> 
> Paul.
> 
> 
> On Thu, Aug 18, 2011 at 2:23 PM, Jake Luciani <jak...@gmail.com> wrote:
> Are you writing lots of tiny rows or a few very large rows? Are you batching
> mutations? Is the loading disk, CPU, or network bound?
> 
> -Jake
> 
> On Thu, Aug 18, 2011 at 7:08 AM, Paul Loy <ketera...@gmail.com> wrote:
> Hi All,
> 
> I have a program that crunches through around 3 billion calculations. We 
> store the result of each of these in cassandra to later query once in order 
> to create some vectors. Our processing is limited by Cassandra now, rather 
> than the calculations themselves.
> 
> I was wondering what settings I can change to increase the write throughput. 
> Perhaps disabling all caching, etc, as I won't be able to keep it all in 
> memory anyway and only want to query the results once.
> 
> Any thoughts would be appreciated,
> 
> Paul.
> 
> -- 
> ---------------------------------------------
> Paul Loy
> p...@keteracel.com
> http://uk.linkedin.com/in/paulloy
> 
> 
> 
> -- 
> http://twitter.com/tjake
