> The combination performing best was querying for 500 rows at a time with 1000 columns, while different combinations, such as 125 rows for 4000 columns or 1000 rows for 500 columns, were about 15% slower.

I would rarely go above 100 rows, especially if you are asking for 1000 columns.
> If you consider that it also depends on the number of nodes in the cluster, the memory available and the number of rows and columns the query needs, the problem of how to optimally divide a request becomes quite complex.

It sounds like you are targeting single read thread performance. If you want to go faster, make your client do smaller requests in parallel.

Cheers

-----------------
Aaron Morton
Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 19/07/2013, at 12:26 AM, cesare cugnasco <cesare.cugna...@gmail.com> wrote:

> Thank you Aaron, your advice about a newer client is really interesting. We will take it into account!
>
> Here are some numbers from our tests: we found that at more or less 500k elements (rows requested multiplied by columns requested) there was an inflection point, beyond which asking for more could only decrease performance. The combination performing best was querying for 500 rows at a time with 1000 columns, while different combinations, such as 125 rows for 4000 columns or 1000 rows for 500 columns, were about 15% slower. Other combinations showed even bigger differences.
>
> It was a cluster of 16 nodes, with 24 GB of RAM, SATA-2 SSDs and 8-core CPUs @ 2.6 GHz.
>
> The issue is that this memory limit can be reached with many combinations of rows and columns. Broadly speaking, in using more rows or columns there is a trade-off between better parallelization and higher overhead. If you consider that it also depends on the number of nodes in the cluster, the memory available and the number of rows and columns the query needs, the problem of how to optimally divide a request becomes quite complex.
>
> Do these numbers make sense to you?
>
> Cheers
>
>
> 2013/7/17 aaron morton <aa...@thelastpickle.com>
>
> > > In our tests, we found there's a significant performance difference between various configurations and we are studying a policy to optimize it.
> > > The doubt is that, if the need to issue multiple requests is caused only by a fixable implementation detail, it would make this study pointless.
>
> If you provide your numbers, we can see if you are getting expected results.
>
> There are some limiting factors. Using the Thrift API the max message size is 15 MB. And each row you ask for becomes (roughly) RF number of tasks in the thread pools on the replicas. When you ask for 1000 rows it creates (roughly) 3,000 tasks on the replicas. If you have other clients trying to do reads at the same time, this can cause delays to their reads.
>
> Like everything in computing, more is not always better. Run some tests to try multigets with different sizes and see where improvements in the overall throughput begin to decline.
>
> Also consider using a newer client with token aware balancing and async networking. Again though, if you try to read everything at once you are going to have a bad day.
>
> Cheers
>
> -----------------
> Aaron Morton
> Cassandra Consultant
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 17/07/2013, at 8:24 PM, cesare cugnasco <cesare.cugna...@gmail.com> wrote:
>
> > Hi Rob,
> > of course, we could issue multiple requests, but then we should consider the optimal way to split the query into smaller ones. Moreover, we should choose how many of the sub-queries to run in parallel.
> > In our tests, we found there's a significant performance difference between various configurations and we are studying a policy to optimize it. The doubt is that, if the need to issue multiple requests is caused only by a fixable implementation detail, it would make this study pointless.
> >
> > Has anyone made a similar analysis?
> > > > > > 2013/7/16 Robert Coli <rc...@eventbrite.com> > > > > On Tue, Jul 16, 2013 at 4:46 AM, cesare cugnasco > > <cesare.cugna...@gmail.com> wrote: > > We are working on porting some life science applications to Cassandra, but > > we have to deal with its limits managing huge queries. Our queries are > > usually multiget_slice ones: many rows with many columns each. > > > > You are not getting much "win" by increasing request size in Cassandra, and > > you expose yourself to "lose" such as you have experienced. > > > > Is there some reason you cannot just issue multiple requests? > > > > =Rob > > > >