Not a problem - it's good to hash this stuff out and understand the technical reasons why something works or doesn't work.
On Sat Dec 13 2014 at 10:07:10 AM Jonathan Haddad <j...@jonhaddad.com> wrote:

> On Sat Dec 13 2014 at 10:00:16 AM Eric Stevens <migh...@gmail.com> wrote:
>
>> Isn't the net effect of coordination overhead incurred by batches basically the same as the overhead incurred by RoundRobin or other non-token-aware request routing? As the cluster size increases, each node would coordinate the same percentage of writes in batches under token awareness as it would under a more naive single-statement routing strategy. If write volume per time unit is the same in both approaches, each node ends up coordinating mostly non-local writes under either strategy as the cluster grows.
>
> If you're not token aware, there's extra coordinator overhead, yes. If you are token aware, that's not the case. I'm operating under the assumption that you'd want to be token aware, since I don't see a point in not doing so :)
>
> Unfortunately my Scala isn't the best, so I'm going to have to take a little bit to wade through the code.
>
> It may be useful to run cassandra-stress (it doesn't seem to have a mode for batches) to get a baseline on non-batches. I'm curious to know whether you get different numbers than the Scala profiler.
>
>> GC pressure in the cluster is a concern of course, as you observe. But the performance delta is *substantial* from what I can see. As in the case where you're bumping up against retries, this will cause you to fall over much more rapidly as you approach your tipping point, but in a healthy cluster it's the same write volume, just a longer tenancy in eden. If reasonably sized batches are producing GC survivors, you're not far off from falling over anyway.
>>
>> On Sat, Dec 13, 2014 at 10:04 AM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>
>>> One thing to keep in mind is that the overhead of a batch goes up as the number of servers increases. Talking to 3 is going to have a much different performance profile than talking to 20. Keep in mind that the coordinator is going to be talking to every server in the cluster with a big batch. The number of local writes will decrease as it owns a smaller portion of the ring. All you've done is add an extra network hop between your client and where the data should actually be. You also start to have an impact on GC in a very negative way.
>>>
>>> Your point is valid about topology changes, but that's a relatively rare occurrence, and the driver is notified pretty quickly, so I wouldn't optimize for that case.
>>>
>>> Can you post your test code in a gist or something? I can't really talk about your benchmark without seeing it, and you're basing your stance on the premise that it is correct, which it may not be.
>>>
>>> On Sat Dec 13 2014 at 8:45:21 AM Eric Stevens <migh...@gmail.com> wrote:
>>>
>>>> You can see what the partition key strategies are for each of the tables; test5 shows the least improvement. The set (aid, end) should be unique, and bckt is derived from end. Some of these layouts result in clustering on the same partition keys; that's actually tunable with the "~15 per bucket" reported (the exact number of entries per bucket will vary, but should have a mean of 15 in that run - it's an input parameter to my tests). "test5" obviously ends up with exclusively unique partitions, one per record.
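For reference, a minimal sketch of the token-aware setup Jon is assuming above, using the 2.1-era DataStax Java driver from Scala (the contact point and keyspace are placeholders, not from the thread):

    import com.datastax.driver.core.Cluster
    import com.datastax.driver.core.policies.{DCAwareRoundRobinPolicy, TokenAwarePolicy}

    // Wrap the default DC-aware round-robin policy so the driver routes each
    // statement to a replica that owns its partition key whenever the routing
    // key is known (prepared statements generally give the driver that info).
    val cluster = Cluster.builder()
      .addContactPoint("10.0.0.1")  // placeholder contact point
      .withLoadBalancingPolicy(new TokenAwarePolicy(new DCAwareRoundRobinPolicy()))
      .build()

    val session = cluster.connect("test_keyspace")  // placeholder keyspace

Ryan's shuffle question further down presumably refers to the two-argument constructor, new TokenAwarePolicy(child, true), which randomizes the replica order instead of always preferring the first replica.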
>>>> Your points about:
>>>> 1) Failed batches having a higher cost than failed single statements
>>>> 2) In my test, every node was a replica for all data
>>>> These are both very good points.
>>>>
>>>> For #1, since the worst-case scenario is nearly twice as fast in batches as its single-statement equivalent, in terms of impact on the client you'd have to be retrying half your batches before you broke even there (but of course those retries are not free to the cluster, so you probably hit the performance tipping point a lot sooner). This alone may be cause to justify avoiding batches, or at least severely limiting their size (hey, that's what this discussion is about!).
>>>>
>>>> For #2, that's certainly a good point for this test cluster; I should at least re-run with RF=1 so that proxying times start to matter. If you're not using a token-aware client, or not using a token-aware policy for whatever reason, this should even out though, no? Each node will end up coordinating 1/(nodecount - RF + 1) of the mutations, regardless of whether they are batched or single statements. The DS driver is very careful to caution that the topology map it maintains makes no guarantees on freshness, so you may see a significant performance penalty in your client when the topology changes if you're depending on token-aware routing as part of your performance requirements.
>>>>
>>>> I'm curious what your thoughts are on grouping statements by primary replica according to the routing policy, and executing unlogged batches that way (so that with token-aware routing all statements are executed on a replica; for other policies it would make no difference). Retries are still more expensive, but you still get to avoid the coordinator proxying. It's pretty easy to do in Scala:
>>>>
>>>>   def groupByFirstReplica(statements: Iterable[Statement])(implicit session: Session): Map[Host, Iterable[Statement]] = {
>>>>     val meta = session.getCluster.getMetadata
>>>>     statements.groupBy { st =>
>>>>       // group by the first replica the driver would route this statement to
>>>>       meta.getReplicas(st.getKeyspace, st.getRoutingKey).iterator().next
>>>>     }
>>>>   }
>>>>
>>>>   val result = Future.traverse(groupByFirstReplica(statements).values) { sts =>
>>>>     newBatch(sts).executeAsync()  // newBatch builds an unlogged batch from the group
>>>>   }
>>>>
>>>> Let me get my test code together; it depends on some existing utilities we use elsewhere, such as implicit conversions between Google and Scala native futures. I'll try to put this together in a format that's runnable for you in a Scala REPL console without having to resolve our internal dependencies. This may not be today, though.
>>>>
>>>> Also, @Ryan, I don't think that shuffling would make a difference for my above tests since, as Jon observed, all my nodes were already replicas there.
>>>>
>>>> On Sat, Dec 13, 2014 at 7:37 AM, Ryan Svihla <rsvi...@datastax.com> wrote:
>>>>
>>>>> Also... what happens when you turn on shuffle with token aware? http://www.datastax.com/drivers/java/2.1/com/datastax/driver/core/policies/TokenAwarePolicy.html
>>>>>
>>>>> On Sat, Dec 13, 2014 at 8:21 AM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>>>
>>>>>> To add to Ryan's (extremely valid!) point, your test works because the coordinator is always a replica. Try again using 20 (or 50) nodes. Batching works great at RF=N=3 because the coordinator always gets to write locally and talk to exactly 2 other servers on every request. Consider what happens when the coordinator needs to talk to 100 servers. It's unnecessary overhead on the server side.
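As a minimal sketch of the "many async statements" alternative Jon keeps recommending (with a token-aware policy, as in the earlier example, each statement goes straight to a replica, so no single coordinator has to fan out to the whole cluster):

    import com.datastax.driver.core.{ResultSetFuture, Session, Statement}

    // Fire each mutation as its own asynchronous request. With token-aware routing,
    // each statement is sent to a node that owns its partition, so the fan-out that
    // a large multi-partition batch forces onto one coordinator never happens.
    def executeIndividually(statements: Seq[Statement])(implicit session: Session): Seq[ResultSetFuture] =
      statements.map(st => session.executeAsync(st))

The caller would then wait on (or bridge) the returned futures, for example with the Guava-to-Scala future conversions Eric mentions above.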
>>>>>> To save network overhead, Cassandra 2.1 added support for response grouping (see http://www.datastax.com/dev/blog/cassandra-2-1-now-over-50-faster), which massively helps performance. It provides the benefit of batches, but without the coordinator overhead.
>>>>>>
>>>>>> Can you post your benchmark code?
>>>>>>
>>>>>> On Sat Dec 13 2014 at 6:10:36 AM Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>>>>
>>>>>>> There are cases where it can. For instance, if you batch multiple mutations to the same partition (and talk to a replica for that partition), they can reduce network overhead because they're effectively a single mutation in the eyes of the cluster. However, if you're not doing that (and most people aren't!), you end up putting additional pressure on the coordinator because now it has to talk to several other servers. If you have 100 servers and perform a mutation on 100 partitions, you could have a coordinator that's
>>>>>>>
>>>>>>> 1) talking to every machine in the cluster, and
>>>>>>> 2) waiting on a response from a significant portion of them
>>>>>>>
>>>>>>> before it can respond success or fail. Any delay, from GC to a bad disk, can affect the performance of the entire batch.
>>>>>>>
>>>>>>> On Sat Dec 13 2014 at 4:17:33 AM Jack Krupansky <j...@basetechnology.com> wrote:
>>>>>>>
>>>>>>>> Jonathan and Ryan,
>>>>>>>>
>>>>>>>> Jonathan says "It is absolutely not going to help you if you're trying to lump queries together to reduce network & server overhead - in fact it'll do the opposite", but I would note that the CQL3 spec says "The BATCH statement ... serves several purposes: 1. It saves network round-trips between the client and the server (and sometimes between the server coordinator and the replicas) when batching multiple updates." Is the spec inaccurate? I mean, it seems in conflict with your statement.
>>>>>>>>
>>>>>>>> See: https://cassandra.apache.org/doc/cql3/CQL.html
>>>>>>>>
>>>>>>>> I see the spec as gospel – if it's not accurate, let's propose a change to make it accurate.
>>>>>>>>
>>>>>>>> The DataStax CQL doc is more nuanced: "Batching multiple statements can save network exchanges between the client/server and server coordinator/replicas. However, because of the distributed nature of Cassandra, spread requests across nearby nodes as much as possible to optimize performance. Using batches to optimize performance is usually not successful, as described in the Using and misusing batches section. For information about the fastest way to load data, see 'Cassandra: Batch loading without the Batch keyword.'"
>>>>>>>>
>>>>>>>> Maybe what we really need is a "client/driver-side batch", which is simply a way to collect "batches" of operations in the client/driver and then let the driver determine what degree of batching and asynchronous operation is appropriate.
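To make the one genuinely helpful case concrete (Jonathan's "multiple mutations to the same partition" above), here's a rough sketch of a single-partition unlogged batch; the table and column names are invented for illustration, loosely modeled on Eric's (aid, end) description:

    import com.datastax.driver.core.{BatchStatement, Session}

    // Every statement in this batch shares the partition key "aid", so a token-aware
    // driver can hand the whole batch to a replica that owns it, and the cluster
    // applies it as what is effectively a single mutation.
    def samePartitionBatch(session: Session, aid: String, events: Seq[(java.lang.Long, String)]) = {
      val insert = session.prepare("INSERT INTO test_keyspace.events (aid, end, payload) VALUES (?, ?, ?)")
      val batch = new BatchStatement(BatchStatement.Type.UNLOGGED)
      events.foreach { case (end, payload) => batch.add(insert.bind(aid, end, payload)) }
      session.execute(batch)
    }

A batch that spans many partitions, by contrast, is exactly the coordinator fan-out case described above.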
>>>>>>>> It might also be nice to be able to ask the cluster what batch size is optimal (number of mutations in a batch and number of simultaneous connections), and to have that be dynamic based on overall cluster load.
>>>>>>>>
>>>>>>>> I would also note that the example in the spec has multiple inserts with different partition key values, which flies in the face of the admonition to refrain from using server-side distribution of requests.
>>>>>>>>
>>>>>>>> At a minimum, the CQL spec should make a clearer statement of intent and non-intent for BATCH.
>>>>>>>>
>>>>>>>> -- Jack Krupansky
>>>>>>>>
>>>>>>>> *From:* Jonathan Haddad <j...@jonhaddad.com>
>>>>>>>> *Sent:* Friday, December 12, 2014 12:58 PM
>>>>>>>> *To:* user@cassandra.apache.org; Ryan Svihla <rsvi...@datastax.com>
>>>>>>>> *Subject:* Re: batch_size_warn_threshold_in_kb
>>>>>>>>
>>>>>>>> The really important thing to take away from Ryan's original post is that batches are not there for performance. The only case I consider batches to be useful for is when you absolutely need to know that several tables all get a mutation (via logged batches). The use case for this is when you've got multiple tables serving as different views of the same data. It is absolutely not going to help you if you're trying to lump queries together to reduce network & server overhead - in fact it'll do the opposite. If you're trying to do that, instead perform many async queries. The overhead of batches in Cassandra is significant, and you're going to hit a lot of problems if you use them excessively (timeouts / failures).
>>>>>>>>
>>>>>>>> tl;dr: you probably don't want batch, you most likely want many async calls.
>>>>>>>>
>>>>>>>> On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller <moham...@glassbeam.com> wrote:
>>>>>>>>
>>>>>>>>> Ryan,
>>>>>>>>>
>>>>>>>>> Thanks for the quick response.
>>>>>>>>>
>>>>>>>>> I did see that JIRA before posting my question on this list. However, I didn't see any information about why 5kb+ batches would cause instability. 5kb or even 50kb seems too small. For example, if each mutation is 1000+ bytes, then with just 5 mutations you will hit that threshold.
>>>>>>>>>
>>>>>>>>> In addition, Patrick is saying that he does not recommend more than 100 mutations per batch. So why not warn users just on the # of mutations in a batch?
>>>>>>>>>
>>>>>>>>> Mohammed
>>>>>>>>>
>>>>>>>>> *From:* Ryan Svihla [mailto:rsvi...@datastax.com]
>>>>>>>>> *Sent:* Thursday, December 11, 2014 12:56 PM
>>>>>>>>> *To:* user@cassandra.apache.org
>>>>>>>>> *Subject:* Re: batch_size_warn_threshold_in_kb
>>>>>>>>>
>>>>>>>>> Nothing magic, just put in there based on experience. You can find the story behind the original recommendation here:
>>>>>>>>>
>>>>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-6487
>>>>>>>>>
>>>>>>>>> Key reasoning for the threshold comes from Patrick McFadin:
>>>>>>>>>
>>>>>>>>> "Yes that was in bytes. Just in my own experience, I don't recommend more than ~100 mutations per batch. Doing some quick math I came up with 5k as 100 x 50 byte mutations. Totally up for debate."
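If you do end up batching, a simple client-side guard along the lines of Patrick's ~100-mutations rule of thumb might look like the following sketch (the 100 here is his rule of thumb, not a measured limit; per the rest of the thread, individual async writes are usually the better default):

    import com.datastax.driver.core.{BatchStatement, Session, Statement}

    // Split a large set of statements into small unlogged batches so that no single
    // batch grows past roughly 100 mutations (by Patrick's math, about the 5kb
    // batch_size_warn_threshold_in_kb default for ~50-byte mutations).
    def inSmallBatches(statements: Seq[Statement], maxPerBatch: Int = 100)(implicit session: Session): Unit =
      statements.grouped(maxPerBatch).foreach { group =>
        val batch = new BatchStatement(BatchStatement.Type.UNLOGGED)
        group.foreach(st => batch.add(st))
        session.execute(batch)
      }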
>>>>>>>>> It's totally changeable; however, it's there in no small part because so many people mistake the BATCH keyword for a performance optimization, and this helps flag those cases of misuse.
>>>>>>>>>
>>>>>>>>> On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller <moham...@glassbeam.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi –
>>>>>>>>>
>>>>>>>>> The cassandra.yaml file has a property called *batch_size_warn_threshold_in_kb*.
>>>>>>>>>
>>>>>>>>> The default size is 5kb and, according to the comments in the yaml file, it is used to log a WARN on any batch size exceeding this value in kilobytes. It says caution should be taken when increasing this threshold, as it can lead to node instability.
>>>>>>>>>
>>>>>>>>> Does anybody know the significance of this magic number 5kb? Why would a higher number (say 10kb) lead to node instability?
>>>>>>>>>
>>>>>>>>> Mohammed
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Svihla
>>>>>>>>> Solution Architect, DataStax
>>>>>
>>>>> --
>>>>> Ryan Svihla
>>>>> Solution Architect, DataStax