Just an FYI: my benchmarking of the new Python driver, which uses the
asynchronous CQL native transport, indicates that you can largely overcome
client-to-node latency effects by employing a suitable level of
concurrency and non-blocking techniques.

Of course, response size and other factors come into play, but keeping a
hundred or so queries simultaneously in the pipeline from each worker
subprocess is a big help.
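
In case it's useful, here is a minimal sketch of that pattern using the
driver's execute_async(). The contact point, keyspace/table, key format,
and the CONCURRENCY value are all placeholders to adapt, not a definitive
implementation:

    # Keep ~CONCURRENCY queries in flight per worker using the DataStax
    # python-driver's asynchronous API. Names below are hypothetical.
    from cassandra.cluster import Cluster

    CONCURRENCY = 100  # rough in-flight depth per worker subprocess

    def fetch_rows(session, keys):
        """Fetch one row per key, pipelining ~CONCURRENCY queries."""
        query = "SELECT * FROM my_keyspace.my_table WHERE id = %s"
        results = []
        for i in range(0, len(keys), CONCURRENCY):
            chunk = keys[i:i + CONCURRENCY]
            # Fire off the whole chunk without blocking...
            futures = [session.execute_async(query, (k,)) for k in chunk]
            # ...then collect responses; each result() blocks until that
            # query completes.
            for future in futures:
                results.extend(future.result())
        return results

    if __name__ == "__main__":
        cluster = Cluster(["127.0.0.1"])  # placeholder contact point
        session = cluster.connect()
        rows = fetch_rows(session, ["key-%d" % n for n in range(20000)])
        cluster.shutdown()

Each worker subprocess would run something like fetch_rows() over its own
slice of the keys; the right concurrency level depends on response size
and cluster capacity.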


On Thu, Jun 12, 2014 at 10:46 AM, Jeremy Jongsma <jer...@barchart.com>
wrote:

> Good to know, thanks Peter. I am worried about client-to-node latency if I
> have to do 20,000 individual queries, but that makes it clearer that at
> least batching in smaller sizes is a good idea.
>
>
> On Wed, Jun 11, 2014 at 6:34 PM, Peter Sanford <psanf...@retailnext.net>
> wrote:
>
>> On Wed, Jun 11, 2014 at 10:12 AM, Jeremy Jongsma <jer...@barchart.com>
>> wrote:
>>
>>> The big problem seems to have been requesting a large number of row keys
>>> combined with a large number of named columns in a query. 20K rows with 20K
>>> columns destroyed my cluster. Splitting it into slices of 100 sequential
>>> queries fixed the performance issue.
>>>
>>> When updating 20K rows at a time, I saw a different issue:
>>> BrokenPipeException from all nodes. Splitting into slices of 1000 fixed
>>> that issue.
>>>
>>> Is there any documentation on this? Obviously these limits will vary by
>>> cluster capacity, but for new users it would be great to know that you can
>>> run into problems with large queries, and how they present themselves when
>>> you hit them. The errors I saw are pretty opaque, and took me a couple days
>>> to track down.
>>>
>>>
>> The first thing that comes to mind is the Multiget section on the
>> Datastax anti-patterns page:
>> http://www.datastax.com/documentation/cassandra/1.2/cassandra/architecture/architecturePlanningAntiPatterns_c.html?scroll=concept_ds_emm_hwl_fk__multiple-gets
>>
>>
>>
>> -psanford
>>
>>
>>
>