> The combination performing best was querying for 500 rows at a time with 1000 columns, while different combinations, such as 125 rows for 4000 columns or 1000 rows for 500 columns, were about 15% slower.

I would rarely go above 100 rows, especially if you are asking for 1000 columns.
> If you consider that it also depends on the number of nodes in the cluster, the memory available and the number of rows and columns the query needs, the problem of how to optimally divide a request becomes quite complex.

It sounds like you are targeting single read thread performance. If you want to go faster, make your client do smaller requests in parallel.

Cheers

-----------------
Aaron Morton
Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 19/07/2013, at 12:26 AM, cesare cugnasco <cesare.cugna...@gmail.com> wrote:

> Thank you Aaron, your advice about a newer client is really interesting. We will take it into account!
>
> Here are some numbers from our tests: we found that at more or less 500k elements (rows requested multiplied by columns requested) there was an inflection point, beyond which asking for more could only decrease performance. The combination performing best was querying for 500 rows at a time with 1000 columns, while different combinations, such as 125 rows for 4000 columns or 1000 rows for 500 columns, were about 15% slower. Other combinations showed even bigger differences.
>
> It was a cluster of 16 nodes, with 24 GB of RAM, SATA-2 SSDs and 8-core CPUs @ 2.6 GHz.
>
> The issue is that this memory limit can be reached with many combinations of rows and columns. Broadly speaking, in using more rows or columns there is a trade-off between better parallelization and higher overhead. If you consider that it also depends on the number of nodes in the cluster, the memory available and the number of rows and columns the query needs, the problem of how to optimally divide a request becomes quite complex.
>
> Do these numbers make sense to you?
>
> Cheers
>
>
> 2013/7/17 aaron morton <aa...@thelastpickle.com>
>
> > > In our tests, we found there's a significant performance difference between various configurations and we are studying a policy to optimize it.
> > > The doubt is that, if the need to issue multiple requests is caused only by a fixable implementation detail, it would make this study pointless.
>
> If you provide your numbers, we can see if you are getting expected results.
>
> There are some limiting factors. Using the Thrift API the max message size is 15 MB. And each row you ask for becomes (roughly) RF number of tasks in the thread pools on the replicas. When you ask for 1000 rows it creates (roughly) 3,000 tasks on the replicas. If you have other clients trying to do reads at the same time, this can cause delays to their reads.
>
> Like everything in computing, more is not always better. Run some tests to try multigets with different sizes and see where improvements in the overall throughput begin to decline.
>
> Also consider using a newer client with token aware balancing and async networking. Again though, if you try to read everything at once you are going to have a bad day.
>
> Cheers
>
> -----------------
> Aaron Morton
> Cassandra Consultant
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 17/07/2013, at 8:24 PM, cesare cugnasco <cesare.cugna...@gmail.com> wrote:
>
> > Hi Rob,
> > of course, we could issue multiple requests, but then we should consider the optimal way to split the query into smaller ones. Moreover, we should choose how many of the sub-queries to run in parallel.
> > In our tests, we found there's a significant performance difference between various configurations and we are studying a policy to optimize it. The doubt is that, if the need to issue multiple requests is caused only by a fixable implementation detail, it would make this study pointless.
> >
> > Has anyone made a similar analysis?
> > > > > > 2013/7/16 Robert Coli <rc...@eventbrite.com> > > > > On Tue, Jul 16, 2013 at 4:46 AM, cesare cugnasco > > <cesare.cugna...@gmail.com> wrote: > > We are working on porting some life science applications to Cassandra, but > > we have to deal with its limits managing huge queries. Our queries are > > usually multiget_slice ones: many rows with many columns each. > > > > You are not getting much "win" by increasing request size in Cassandra, and > > you expose yourself to "lose" such as you have experienced. > > > > Is there some reason you cannot just issue multiple requests? > > > > =Rob > > > >