Just an FYI: my benchmarking of the new python driver, which uses the asynchronous CQL native transport, indicates that client-to-node latency effects can largely be overcome with a suitable level of concurrency and non-blocking techniques.
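To illustrate the idea, here is a minimal sketch of keeping a window of queries in flight at once. The `execute_async` stub below is a hypothetical stand-in; with the DataStax python-driver you would call `session.execute_async(statement, params)` and get back a `ResponseFuture` instead, but the windowing pattern is the same:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the driver's session.execute_async().
# Real code would issue a non-blocking query to Cassandra here.
_pool = ThreadPoolExecutor(max_workers=8)

def execute_async(key):
    return _pool.submit(lambda k: "row-for-%s" % k, key)

def fetch_all(keys, window=100):
    """Keep up to `window` queries in flight at once, instead of
    paying one full round-trip of latency per key."""
    results = []
    in_flight = []
    for key in keys:
        in_flight.append(execute_async(key))
        if len(in_flight) >= window:
            # Wait on the oldest future before issuing more.
            results.append(in_flight.pop(0).result())
    # Drain whatever is still outstanding.
    results.extend(f.result() for f in in_flight)
    return results
```

With a window of ~100 per worker subprocess, the round-trip latencies overlap rather than accumulate.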
Of course response size and other factors come into play, but having a hundred or so queries simultaneously in the pipeline from each worker subprocess is a big help.

On Thu, Jun 12, 2014 at 10:46 AM, Jeremy Jongsma <jer...@barchart.com> wrote:

> Good to know, thanks Peter. I am worried about client-to-node latency if I
> have to do 20,000 individual queries, but that makes it clearer that at
> least batching in smaller sizes is a good idea.
>
> On Wed, Jun 11, 2014 at 6:34 PM, Peter Sanford <psanf...@retailnext.net> wrote:
>
>> On Wed, Jun 11, 2014 at 10:12 AM, Jeremy Jongsma <jer...@barchart.com> wrote:
>>
>>> The big problem seems to have been requesting a large number of row keys
>>> combined with a large number of named columns in a query. 20K rows with
>>> 20K columns destroyed my cluster. Splitting it into slices of 100
>>> sequential queries fixed the performance issue.
>>>
>>> When updating 20K rows at a time, I saw a different issue -
>>> BrokenPipeException from all nodes. Splitting into slices of 1000 fixed
>>> that issue.
>>>
>>> Is there any documentation on this? Obviously these limits will vary by
>>> cluster capacity, but for new users it would be great to know that you
>>> can run into problems with large queries, and how they present themselves
>>> when you hit them. The errors I saw are pretty opaque, and took me a
>>> couple days to track down.
>>
>> The first thing that comes to mind is the Multiget section on the
>> Datastax anti-patterns page:
>> http://www.datastax.com/documentation/cassandra/1.2/cassandra/architecture/architecturePlanningAntiPatterns_c.html?scroll=concept_ds_emm_hwl_fk__multiple-gets
>>
>> -psanford
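The slicing approach described in the quoted messages (20K keys read in slices of 100, updated in slices of 1000) amounts to chunking the key list before querying. A minimal sketch, with `chunks` as an illustrative helper rather than anything from the driver:

```python
def chunks(items, size):
    """Split a large list of row keys into fixed-size slices so
    each query stays small enough for the cluster to handle."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# e.g. 20,000 row keys, read in slices of 100:
row_keys = list(range(20000))
read_slices = list(chunks(row_keys, 100))
```

Each slice would then be issued as its own query (sequentially, or pipelined as above), instead of one enormous multiget.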