And by the way, Jon and Ryan, I want to thank you for engaging in this conversation. I hope I'm not coming across as argumentative or combative. I would genuinely like to reconcile my measurements with the recommended practices so that I can make good decisions about how to model my application.
On Sat, Dec 13, 2014 at 10:58 AM, Eric Stevens <migh...@gmail.com> wrote:

Isn't the net effect of coordination overhead incurred by batches basically the same as the overhead incurred by RoundRobin or other non-token-aware request routing? As the cluster size increases, each node would coordinate the same percentage of writes in batches under token awareness as it would under a more naive single-statement routing strategy. If write volume per time unit is the same in both approaches, each node ends up coordinating the majority of writes under either strategy as the cluster grows.

GC pressure in the cluster is a concern of course, as you observe. But the performance delta is *substantial* from what I can see. As in the case where you're bumping up against retries, this will cause you to fall over much more rapidly as you approach your tipping point, but in a healthy cluster it's the same write volume, just a longer tenancy in eden. If reasonably sized batches are causing survivors, you're not far off from falling over anyway.

On Sat, Dec 13, 2014 at 10:04 AM, Jonathan Haddad <j...@jonhaddad.com> wrote:

One thing to keep in mind is that the overhead of a batch goes up as the number of servers increases. Talking to 3 is going to have a much different performance profile than talking to 20. Keep in mind that the coordinator is going to be talking to every server in the cluster with a big batch. The amount of local writes will decrease as it owns a smaller portion of the ring. All you've done is add an extra network hop between your client and where the data should actually be. You also start to have a very negative impact on GC.

Your point about topology changes is valid, but that's a relatively rare occurrence, and the driver is notified pretty quickly, so I wouldn't optimize for that case.

Can you post your test code in a gist or something? I can't really talk about your benchmark without seeing it, and you're basing your stance on the premise that it is correct, which it may not be.

On Sat Dec 13 2014 at 8:45:21 AM Eric Stevens <migh...@gmail.com> wrote:

You can see what the partition key strategies are for each of the tables; test5 shows the least improvement. The set (aid, end) should be unique, and bckt is derived from end. Some of these layouts result in clustering on the same partition keys; that's actually tunable with the "~15 per bucket" reported (the exact number of entries per bucket will vary, but it should have a mean of 15 in that run - it's an input parameter to my tests). "test5" obviously ends up being exclusively unique partitions for each record.

Your points were about:
1) Failed batches having a higher cost than failed single statements.
2) In my test, every node was a replica for all data.

These are both very good points.

For #1, since the worst case scenario is nearly twice as fast in batches as its single-statement equivalent, in terms of impact on the client you'd have to be retrying half your batches before you broke even there (though of course those retries are not free to the cluster, so you probably reach the performance tipping point a lot faster). This alone may be cause to justify avoiding batches, or at least severely limiting their size (hey, that's what this discussion is about!).
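Since token-aware versus non-token-aware routing is central to this comparison, here is roughly how token awareness is switched on in the DataStax 2.1 Java driver, including the optional replica shuffling that comes up later in the thread (a sketch; the contact point is a placeholder):

    import com.datastax.driver.core.Cluster
    import com.datastax.driver.core.policies.{DCAwareRoundRobinPolicy, TokenAwarePolicy}

    // Token-aware routing sends each statement directly to a replica for its
    // partition. The second constructor argument enables replica shuffling.
    val cluster = Cluster.builder()
      .addContactPoint("127.0.0.1") // placeholder
      .withLoadBalancingPolicy(new TokenAwarePolicy(new DCAwareRoundRobinPolicy(), true))
      .build()

Without shuffling, the policy always prefers the same first replica for a given partition; with it, load spreads across the replicas at some cost to locality.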
For #2, that's certainly a good point; for this test cluster, I should at least re-run with RF=1 so that proxying times start to matter. If you're not using a token-aware client, or not using a token-aware policy for whatever reason, this should even out though, no? Each node will end up coordinating 1/(nodecount-rf+1) mutations, regardless of whether they are batched or single statements. The DS driver is very careful to caution that the topology map it maintains makes no guarantees on freshness, so you may see a significant performance penalty in your client when the topology changes if you're depending on token-aware routing as part of your performance requirements.

I'm curious what your thoughts are on grouping statements by primary replica according to the routing policy, and executing unlogged batches that way (so that with token-aware routing all statements are executed on a replica, while for other policies it makes no difference). Retries are still more expensive, but you still get the proxying avoidance that token-aware routing provides. It's pretty easy to do in Scala:

    def groupByFirstReplica(statements: Iterable[Statement])(implicit session: Session): Map[Host, Iterable[Statement]] = {
      val meta = session.getCluster.getMetadata
      statements.groupBy { st =>
        // First replica for this statement's partition, per the driver's metadata
        meta.getReplicas(st.getKeyspace, st.getRoutingKey).iterator().next()
      }
    }

    // Requires an implicit ExecutionContext and a ListenableFuture -> Future
    // conversion in scope; newBatch builds an unlogged batch from the group.
    val result = Future.traverse(groupByFirstReplica(statements).values) { sts =>
      newBatch(sts).executeAsync()
    }

Let me get together my test code; it depends on some existing utilities we use elsewhere, such as implicit conversions between Google and Scala native futures. I'll try to put this together in a format that's runnable for you in a Scala REPL console without having to resolve our internal dependencies. This may not be today though.

Also, @Ryan, I don't think that shuffling would make a difference for my above tests since, as Jon observed, all my nodes were already replicas there.
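For reference, the kind of Google-to-Scala future bridge those utilities provide generally looks something like this (a sketch with invented names, not the actual internal utility):

    import com.google.common.util.concurrent.{FutureCallback, Futures, ListenableFuture}
    import scala.concurrent.{Future, Promise}
    import scala.language.implicitConversions

    // Complete a Scala Promise from a Guava callback, so driver futures
    // (e.g. ResultSetFuture) can be used with Future.traverse and friends.
    implicit def listenableToScalaFuture[T](lf: ListenableFuture[T]): Future[T] = {
      val p = Promise[T]()
      Futures.addCallback(lf, new FutureCallback[T] {
        def onSuccess(result: T): Unit = p.success(result)
        def onFailure(t: Throwable): Unit = p.failure(t)
      })
      p.future
    }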
On Sat, Dec 13, 2014 at 7:37 AM, Ryan Svihla <rsvi...@datastax.com> wrote:

Also... what happens when you turn on shuffle with token aware? http://www.datastax.com/drivers/java/2.1/com/datastax/driver/core/policies/TokenAwarePolicy.html

On Sat, Dec 13, 2014 at 8:21 AM, Jonathan Haddad <j...@jonhaddad.com> wrote:

To add to Ryan's (extremely valid!) point, your test works because the coordinator is always a replica. Try again using 20 (or 50) nodes. Batching works great at RF=N=3 because the coordinator always gets to write locally and talk to exactly 2 other servers on every request. Consider what happens when the coordinator needs to talk to 100 servers. It's unnecessary overhead on the server side.

To save network overhead, Cassandra 2.1 added support for response grouping (see http://www.datastax.com/dev/blog/cassandra-2-1-now-over-50-faster), which massively helps performance. It provides the benefit of batches, but without the coordinator overhead.

Can you post your benchmark code?

On Sat Dec 13 2014 at 6:10:36 AM Jonathan Haddad <j...@jonhaddad.com> wrote:

There are cases where it can. For instance, if you batch multiple mutations to the same partition (and talk to a replica for that partition), they can reduce network overhead because they're effectively a single mutation in the eyes of the cluster. However, if you're not doing that (and most people aren't!), you end up putting additional pressure on the coordinator because now it has to talk to several other servers. If you have 100 servers and perform a mutation on 100 partitions, you could have a coordinator that's:

1) talking to every machine in the cluster, and
2) waiting on a response from a significant portion of them

before it can respond success or fail. Any delay, from GC to a bad disk, can affect the performance of the entire batch.
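To make the favorable case above concrete: grouping mutations by partition, so that each unlogged batch is effectively a single mutation, might look like the following (an untested sketch; it assumes the driver can compute a routing key for every statement):

    import com.datastax.driver.core.{BatchStatement, Session, Statement}

    // One unlogged batch per partition: statements sharing a routing key land
    // in the same batch. Note getRoutingKey can return null for statements the
    // driver can't introspect; real code would need to handle those separately.
    def batchPerPartition(statements: Seq[Statement])(implicit session: Session) =
      statements.groupBy(_.getRoutingKey).values.map { group =>
        val batch = new BatchStatement(BatchStatement.Type.UNLOGGED)
        group.foreach(batch.add)
        session.executeAsync(batch)
      }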
On Sat Dec 13 2014 at 4:17:33 AM Jack Krupansky <j...@basetechnology.com> wrote:

Jonathan and Ryan,

Jonathan says “It is absolutely not going to help you if you're trying to lump queries together to reduce network & server overhead - in fact it'll do the opposite”, but I would note that the CQL3 spec says “The BATCH statement ... serves several purposes: 1. It saves network round-trips between the client and the server (and sometimes between the server coordinator and the replicas) when batching multiple updates.” Is the spec inaccurate? It seems to be in conflict with your statement.

See: https://cassandra.apache.org/doc/cql3/CQL.html

I see the spec as gospel – if it’s not accurate, let’s propose a change to make it accurate.

The DataStax CQL doc is more nuanced: “Batching multiple statements can save network exchanges between the client/server and server coordinator/replicas. However, because of the distributed nature of Cassandra, spread requests across nearby nodes as much as possible to optimize performance. Using batches to optimize performance is usually not successful, as described in Using and misusing batches section. For information about the fastest way to load data, see "Cassandra: Batch loading without the Batch keyword."”

Maybe what we really need is a “client/driver-side batch”, which is simply a way to collect “batches” of operations in the client/driver and then let the driver determine what degree of batching and asynchronous operation is appropriate.

It might also be nice to have an inquiry for the cluster as to what batch size is most optimal, like the number of mutations in a batch and the number of simultaneous connections, and to have that be dynamic based on overall cluster load.

I would also note that the example in the spec has multiple inserts with different partition key values, which flies in the face of the admonition to refrain from using server-side distribution of requests.

At a minimum, the CQL spec should make a clearer statement of intent and non-intent for BATCH.

-- Jack Krupansky

From: Jonathan Haddad <j...@jonhaddad.com>
Sent: Friday, December 12, 2014 12:58 PM
To: user@cassandra.apache.org; Ryan Svihla <rsvi...@datastax.com>
Subject: Re: batch_size_warn_threshold_in_kb

The really important thing to take away from Ryan's original post is that batches are not there for performance. The only case I consider batches to be useful for is when you absolutely need to know that several tables all get a mutation (via logged batches). The use case for this is when you've got multiple tables that are serving as different views of the data. It is absolutely not going to help you if you're trying to lump queries together to reduce network & server overhead - in fact it'll do the opposite. If you're trying to do that, instead perform many async queries. The overhead of batches in Cassandra is significant, and you're going to hit a lot of problems if you use them excessively (timeouts / failures).

tl;dr: you probably don't want batch, you most likely want many async calls
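In code, "many async calls" amounts to something like the following against the 2.1 Java driver from Scala (an untested sketch; real code would also bound the number of in-flight requests rather than dispatching everything at once):

    import com.datastax.driver.core.{ResultSetFuture, Session, Statement}
    import com.google.common.util.concurrent.Futures
    import scala.collection.JavaConverters._

    // Issue each mutation as its own async request; with token-aware routing
    // each statement goes straight to a replica, so no coordinator has to fan
    // a large batch out across the cluster.
    def writeIndividually(statements: Seq[Statement])(implicit session: Session): Unit = {
      val futures: Seq[ResultSetFuture] = statements.map(st => session.executeAsync(st))
      // Wait once at the end; allAsList fails fast if any single write fails.
      Futures.allAsList(futures.asJava).get()
    }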
>>>>>>>> * >>>>>>>> >>>>>>>> The default size is 5kb and according to the comments in the yaml >>>>>>>> file, it is used to log WARN on any batch size exceeding this value in >>>>>>>> kilobytes. It says caution should be taken on increasing the size of >>>>>>>> this >>>>>>>> threshold as it can lead to node instability. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Does anybody know the significance of this magic number 5kb? Why >>>>>>>> would a higher number (say 10kb) lead to node instability? >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Mohammed >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> >>>>>>>> [image: datastax_logo.png] <http://www.datastax.com/> >>>>>>>> >>>>>>>> Ryan Svihla >>>>>>>> >>>>>>>> Solution Architect >>>>>>>> >>>>>>>> >>>>>>>> [image: twitter.png] <https://twitter.com/foundev>[image: >>>>>>>> linkedin.png] <http://www.linkedin.com/pub/ryan-svihla/12/621/727/> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> DataStax is the fastest, most scalable distributed database >>>>>>>> technology, delivering Apache Cassandra to the world’s most innovative >>>>>>>> enterprises. Datastax is built to be agile, always-on, and predictably >>>>>>>> scalable to any size. With more than 500 customers in 45 countries, >>>>>>>> DataStax >>>>>>>> is the database technology and transactional backbone of choice for the >>>>>>>> worlds most innovative companies such as Netflix, Adobe, Intuit, and >>>>>>>> eBay. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> >>>> >>>> -- >>>> >>>> [image: datastax_logo.png] <http://www.datastax.com/> >>>> >>>> Ryan Svihla >>>> >>>> Solution Architect >>>> >>>> [image: twitter.png] <https://twitter.com/foundev> [image: >>>> linkedin.png] <http://www.linkedin.com/pub/ryan-svihla/12/621/727/> >>>> >>>> DataStax is the fastest, most scalable distributed database technology, >>>> delivering Apache Cassandra to the world’s most innovative enterprises. >>>> Datastax is built to be agile, always-on, and predictably scalable to any >>>> size. With more than 500 customers in 45 countries, DataStax is the >>>> database technology and transactional backbone of choice for the worlds >>>> most innovative companies such as Netflix, Adobe, Intuit, and eBay. >>>> >>>> >>> >