Thank you Aaron, your advice about a newer client is really interesting.
We will take it into account!

Here are some numbers from our tests: we found that the inflection point
is at roughly 500k elements per request (rows multiplied by columns
requested), and asking for more than that only decreases performance. The
best-performing combination was querying 500 rows at a time with 1000
columns, while other combinations, such as 125 rows with 4000 columns or
1000 rows with 500 columns, were about 15% slower. Other combinations
showed even bigger differences.
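
To put the three combinations side by side, here is a small arithmetic
sketch in Python. All three request the same 500,000 elements but create
very different numbers of replica tasks, using Aaron's rule of thumb that
each row becomes roughly RF tasks. The replication factor of 3 is an
assumption inferred from his "1000 rows, roughly 3,000 tasks" example, and
request_cost is a hypothetical helper name.

```python
# Assumption: RF = 3, inferred from "1000 rows -> (roughly) 3,000 tasks".
REPLICATION_FACTOR = 3


def request_cost(rows, columns, rf=REPLICATION_FACTOR):
    """Return (elements requested, rough number of replica tasks)."""
    elements = rows * columns   # rows multiplied by columns requested
    replica_tasks = rows * rf   # each row becomes roughly RF tasks
    return elements, replica_tasks


# The three combinations mentioned in the tests above.
for rows, columns in [(500, 1000), (125, 4000), (1000, 500)]:
    elements, tasks = request_cost(rows, columns)
    print(f"{rows:>5} rows x {columns:>5} columns: "
          f"{elements} elements, ~{tasks} replica tasks")
```

This is only bookkeeping; it does not explain why 500 x 1000 came out
fastest in the measurements.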

It was a cluster of 16 nodes with 24 GB of RAM, SATA-2 SSDs and 8-core
CPUs @ 2.6 GHz.

The issue is that this memory limit can be reached with many combinations
of rows and columns. Broadly speaking, using more rows or more columns is
a trade-off between better parallelization and higher overhead.
If you consider that it also depends on the number of nodes in the
cluster, the memory available and the number of rows and columns the query
needs, the problem of how to optimally divide a request becomes quite
complex.
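
To make the splitting problem concrete, here is a minimal sketch in Python
of dividing one large request into sub-requests of a fixed shape. The
function name split_request, its parameters and the defaults (500 rows by
1000 columns, the best combination in the tests above) are hypothetical.
Columns are treated as abstract positions, and turning each sub-request
into an actual multiget_slice call is left out.

```python
def split_request(row_keys, total_columns,
                  rows_per_request=500, columns_per_request=1000):
    """Yield (row_key_batch, column_offset, column_count) sub-requests.

    Hypothetical helper: it only plans the split and sends nothing.
    """
    for r in range(0, len(row_keys), rows_per_request):
        batch = row_keys[r:r + rows_per_request]
        for c in range(0, total_columns, columns_per_request):
            # The last column chunk may be smaller than the others.
            count = min(columns_per_request, total_columns - c)
            yield batch, c, count


# Example: 1200 rows with 2500 columns each.
keys = [f"row{i}" for i in range(1200)]
sub_requests = list(split_request(keys, 2500))
print(len(sub_requests), "sub-requests")
```

The sketch fixes one shape for every sub-request. Choosing that shape from
the cluster size, the available memory and the query dimensions, and
deciding how many sub-requests to run in parallel, is the open part.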

Do these numbers make sense to you?

Cheers


2013/7/17 aaron morton <aa...@thelastpickle.com>

> >  In ours tests,  we found there's a significant performance difference
> between various  configurations and we are studying a policy to optimize
> it. The doubt is that, if the needing of issuing multiple requests is
> caused only by a fixable implementation detail, would make pointless do
> this study.
> if you provide your numbers we can see if you are getting expected results.
>
> There are some limiting factors. Using the thrift API the max message size
> is 15 MB. And each row you ask for becomes (roughly) RF number of tasks in
> the thread pools on replicas. When you ask for 1000 rows it creates
> (roughly) 3,000 tasks in the replicas. If you have other clients trying to
> do reads at the same time this can cause delays to their reads.
>
> Like everything in computing, more is not always better. Run some tests to
> try multi gets with different sizes and see where improvements in the
> overall throughput begin to decline.
>
> Also consider using a newer client with token aware balancing and async
> networking. Again though, if you try to read everything at once you are
> going to have a bad day.
>
> Cheers
>
> -----------------
> Aaron Morton
> Cassandra Consultant
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 17/07/2013, at 8:24 PM, cesare cugnasco <cesare.cugna...@gmail.com>
> wrote:
>
> > Hi Rob,
> > of course, we could issue multiple requests, but then we should
>  consider which is the optimal way to split the query in smaller ones.
> Moreover, we should choose how many of sub-query run in parallel.
> >  In ours tests,  we found there's a significant performance difference
> between various  configurations and we are studying a policy to optimize
> it. The doubt is that, if the needing of issuing multiple requests is
> caused only by a fixable implementation detail, would make pointless do
> this study.
> >
> > Does anyone made similar analysis?
> >
> >
> > 2013/7/16 Robert Coli <rc...@eventbrite.com>
> >
> > On Tue, Jul 16, 2013 at 4:46 AM, cesare cugnasco <
> cesare.cugna...@gmail.com> wrote:
> > We  are working on porting some life science applications to Cassandra,
> but we have to deal with its limits managing huge queries. Our queries are
> usually multiget_slice ones: many rows with many columns each.
> >
> > You are not getting much "win" by increasing request size in Cassandra,
> and you expose yourself to "lose" such as you have experienced.
> >
> > Is there some reason you cannot just issue multiple requests?
> >
> > =Rob
> >
>
>
