Step 0: use multiple threads to insert

On Thu, Aug 18, 2011 at 10:03 AM, Paul Loy <ketera...@gmail.com> wrote:
> Yeah, we're processing item similarities. So we are writing single columns
> at a time. Although we do batch these into 400 mutations before sending to
> Cassy. We currently perform almost 2 billion calculations that then write
> almost 4 billion columns.
>
> Once all similarities are calculated, we just grab a slice per item and
> create a denormalised vector of similar items (trimmed down to topN and only
> those above a certain threshold). This makes lookup super fast as we only
> get one column from cassandra.
>
> So we just want to optimise the crunching and storing phase as that's a
> O(n^2) complexity problem. The quicker we can make that the quicker the
> whole process works.
>
> I'm going to try disabling minor compactions as a start.
>
>> is the loading disk or cpu or network bound?
>
> cpu is at 40% free
> only one cassy node on the same box as the processor for now so no network
> traffic
> so I think it's disk access. Will find out for sure tomorrow after the
> current test runs.
>
> Thanks,
>
> Paul.
>
> On Thu, Aug 18, 2011 at 2:23 PM, Jake Luciani <jak...@gmail.com> wrote:
>>
>> Are you writing lots of tiny rows or a few very large rows, are you
>> batching mutations? is the loading disk or cpu or network bound?
>> -Jake
>> On Thu, Aug 18, 2011 at 7:08 AM, Paul Loy <ketera...@gmail.com> wrote:
>>>
>>> Hi All,
>>>
>>> I have a program that crunches through around 3 billion calculations. We
>>> store the result of each of these in cassandra to later query once in order
>>> to create some vectors. Our processing is limited by Cassandra now, rather
>>> than the calculations themselves.
>>>
>>> I was wondering what settings I can change to increase the write
>>> throughput. Perhaps disabling all caching, etc, as I won't be able to keep
>>> it all in memory anyway and only want to query the results once.
>>>
>>> Any thoughts would be appreciated,
>>>
>>> Paul.
>>>
>>> --
>>> ---------------------------------------------
>>> Paul Loy
>>> p...@keteracel.com
>>> http://uk.linkedin.com/in/paulloy
>>
>>
>>
>> --
>> http://twitter.com/tjake
>
>
>
> --
> ---------------------------------------------
> Paul Loy
> p...@keteracel.com
> http://uk.linkedin.com/in/paulloy
>
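
For illustration only, here is a rough sketch of what "use multiple threads to insert" could look like when combined with the 400-mutation batching described in the thread. It is not taken from the thread and does not use any real Cassandra client API: send_batch is a hypothetical callable standing in for whatever batch-write call your client library provides, and the worker count of 8 is an arbitrary assumption to tune. A bounded queue keeps the producer from buffering billions of mutations in memory.

```python
import queue
import threading
from itertools import islice


def batches(mutations, size):
    """Yield successive lists of at most `size` mutations."""
    it = iter(mutations)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def parallel_insert(mutations, send_batch, workers=8, batch_size=400):
    """Write batches from several threads; return the number of mutations sent.

    `send_batch` is a hypothetical stand-in for the client's batch-write call.
    `workers=8` is an assumption; `batch_size=400` matches the thread.
    """
    # Bounded queue: the producer blocks once the workers fall behind.
    work = queue.Queue(maxsize=workers * 2)
    written = [0] * workers   # one counter per worker, so no lock is needed
    errors = []

    def worker(idx):
        while True:
            batch = work.get()
            if batch is None:          # sentinel: no more work
                return
            try:
                send_batch(batch)
                written[idx] += len(batch)
            except Exception as exc:   # keep draining so the producer never blocks forever
                errors.append(exc)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for batch in batches(mutations, batch_size):
        work.put(batch)
    for _ in threads:
        work.put(None)                 # one sentinel per worker
    for t in threads:
        t.join()
    if errors:
        raise errors[0]
    return sum(written)
```

Whether extra writer threads actually help depends on where the bottleneck is; if the node turns out to be disk bound, as suspected in the quoted message, the gain may be small.
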
--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com