Not a problem - it's good to hash this stuff out and understand the technical reasons why something works or doesn't work.
On Sat Dec 13 2014 at 10:07:10 AM Jonathan Haddad <j...@jonhaddad.com> wrote:

> On Sat Dec 13 2014 at 10:00:16 AM Eric Stevens <migh...@gmail.com> wrote:
>
>> Isn't the net effect of coordination overhead incurred by batches basically the same as the overhead incurred by RoundRobin or other non-token-aware request routing? As the cluster size increases, each node would coordinate the same percentage of writes in batches under token awareness as it would under a more naive single-statement routing strategy. If write volume per time unit is the same in both approaches, each node ends up coordinating mostly non-local writes under either strategy as the cluster grows.
>
> If you're not token aware, there's extra coordinator overhead, yes. If you are token aware, that's not the case. I'm operating under the assumption that you'd want to be token aware, since I don't see a point in not doing so :)
>
> Unfortunately my Scala isn't the best, so I'm going to have to take a little bit to wade through the code.
>
> It may be useful to run cassandra-stress (it doesn't seem to have a mode for batches) to get a baseline on non-batches. I'm curious to know whether you get different numbers than the Scala profiler.
>
>> GC pressure in the cluster is a concern of course, as you observe. But the performance delta is *substantial* from what I can see. As in the case where you're bumping up against retries, this will cause you to fall over much more rapidly as you approach your tipping point, but in a healthy cluster it's the same write volume, just a longer tenancy in eden. If reasonably sized batches are producing GC survivors, you're not far off from falling over anyway.
>>
>> On Sat, Dec 13, 2014 at 10:04 AM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>
>>> One thing to keep in mind is that the overhead of a batch goes up as the number of servers increases. Talking to 3 is going to have a much different performance profile than talking to 20. Keep in mind that the coordinator is going to be talking to every server in the cluster with a big batch. The number of local writes will decrease as it owns a smaller portion of the ring. All you've done is add an extra network hop between your client and where the data should actually be. You also start to have an impact on GC in a very negative way.
>>>
>>> Your point is valid about topology changes, but that's a relatively rare occurrence, and the driver is notified pretty quickly, so I wouldn't optimize for that case.
>>>
>>> Can you post your test code in a gist or something? I can't really talk about your benchmark without seeing it, and you're basing your stance on the premise that it is correct, which it may not be.
>>>
>>> On Sat Dec 13 2014 at 8:45:21 AM Eric Stevens <migh...@gmail.com> wrote:
>>>
>>>> You can see what the partition key strategies are for each of the tables; test5 shows the least improvement. The set (aid, end) should be unique, and bckt is derived from end. Some of these layouts result in clustering on the same partition keys; that's actually tunable with the "~15 per bucket" reported (the exact number of entries per bucket will vary, but should have a mean of 15 in that run - it's an input parameter to my tests). "test5" obviously ends up with exclusively unique partitions, one per record.
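For reference, a minimal sketch of the token-aware setup Jon is assuming above, using the 2.1-era DataStax Java driver from Scala (the contact point and keyspace are placeholders, not from the thread):

    import com.datastax.driver.core.Cluster
    import com.datastax.driver.core.policies.{DCAwareRoundRobinPolicy, TokenAwarePolicy}

    // Wrap the default DC-aware round-robin policy so the driver routes each
    // statement to a replica that owns its partition key whenever the routing
    // key is known (prepared statements generally give the driver that info).
    val cluster = Cluster.builder()
      .addContactPoint("10.0.0.1")  // placeholder contact point
      .withLoadBalancingPolicy(new TokenAwarePolicy(new DCAwareRoundRobinPolicy()))
      .build()

    val session = cluster.connect("test_keyspace")  // placeholder keyspace

Ryan's shuffle question further down presumably refers to the two-argument constructor, new TokenAwarePolicy(child, true), which randomizes the replica order instead of always preferring the first replica.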
>>>> Your points about:
>>>> 1) Failed batches having a higher cost than failed single statements
>>>> 2) In my test, every node was a replica for all data
>>>> These are both very good points.
>>>>
>>>> For #1, since the worst-case scenario is nearly twice as fast in batches as its single-statement equivalent, in terms of impact on the client you'd have to be retrying half your batches before you broke even there (but of course those retries are not free to the cluster, so you probably hit the performance tipping point a lot sooner). This alone may be cause to justify avoiding batches, or at least severely limiting their size (hey, that's what this discussion is about!).
>>>>
>>>> For #2, that's certainly a good point for this test cluster; I should at least re-run with RF=1 so that proxying times start to matter. If you're not using a token-aware client, or not using a token-aware policy for whatever reason, this should even out though, no? Each node will end up coordinating 1/(nodecount - RF + 1) of the mutations, regardless of whether they are batched or single statements. The DS driver is very careful to caution that the topology map it maintains makes no guarantees on freshness, so you may see a significant performance penalty in your client when the topology changes if you're depending on token-aware routing as part of your performance requirements.
>>>>
>>>> I'm curious what your thoughts are on grouping statements by primary replica according to the routing policy, and executing unlogged batches that way (so that with token-aware routing all statements are executed on a replica; for other policies it would make no difference). Retries are still more expensive, but you still get to avoid the coordinator proxying. It's pretty easy to do in Scala:
>>>>
>>>>   def groupByFirstReplica(statements: Iterable[Statement])(implicit session: Session): Map[Host, Iterable[Statement]] = {
>>>>     val meta = session.getCluster.getMetadata
>>>>     statements.groupBy { st =>
>>>>       // group by the first replica the driver would route this statement to
>>>>       meta.getReplicas(st.getKeyspace, st.getRoutingKey).iterator().next
>>>>     }
>>>>   }
>>>>
>>>>   val result = Future.traverse(groupByFirstReplica(statements).values) { sts =>
>>>>     newBatch(sts).executeAsync()  // newBatch builds an unlogged batch from the group
>>>>   }
>>>>
>>>> Let me get my test code together; it depends on some existing utilities we use elsewhere, such as implicit conversions between Google and Scala native futures. I'll try to put this together in a format that's runnable for you in a Scala REPL console without having to resolve our internal dependencies. This may not be today, though.
>>>>
>>>> Also, @Ryan, I don't think that shuffling would make a difference for my above tests since, as Jon observed, all my nodes were already replicas there.
>>>>
>>>> On Sat, Dec 13, 2014 at 7:37 AM, Ryan Svihla <rsvi...@datastax.com> wrote:
>>>>
>>>>> Also... what happens when you turn on shuffle with token aware? http://www.datastax.com/drivers/java/2.1/com/datastax/driver/core/policies/TokenAwarePolicy.html
>>>>>
>>>>> On Sat, Dec 13, 2014 at 8:21 AM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>>>
>>>>>> To add to Ryan's (extremely valid!) point, your test works because the coordinator is always a replica. Try again using 20 (or 50) nodes. Batching works great at RF=N=3 because the coordinator always gets to write locally and talk to exactly 2 other servers on every request. Consider what happens when the coordinator needs to talk to 100 servers. It's unnecessary overhead on the server side.
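As a minimal sketch of the "many async statements" alternative Jon keeps recommending (with a token-aware policy, as in the earlier example, each statement goes straight to a replica, so no single coordinator has to fan out to the whole cluster):

    import com.datastax.driver.core.{ResultSetFuture, Session, Statement}

    // Fire each mutation as its own asynchronous request. With token-aware routing,
    // each statement is sent to a node that owns its partition, so the fan-out that
    // a large multi-partition batch forces onto one coordinator never happens.
    def executeIndividually(statements: Seq[Statement])(implicit session: Session): Seq[ResultSetFuture] =
      statements.map(st => session.executeAsync(st))

The caller would then wait on (or bridge) the returned futures, for example with the Guava-to-Scala future conversions Eric mentions above.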
>>>>>> To save network overhead, Cassandra 2.1 added support for response grouping (see http://www.datastax.com/dev/blog/cassandra-2-1-now-over-50-faster), which massively helps performance. It provides the benefit of batches, but without the coordinator overhead.
>>>>>>
>>>>>> Can you post your benchmark code?
>>>>>>
>>>>>> On Sat Dec 13 2014 at 6:10:36 AM Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>>>>
>>>>>>> There are cases where it can. For instance, if you batch multiple mutations to the same partition (and talk to a replica for that partition), they can reduce network overhead because they're effectively a single mutation in the eyes of the cluster. However, if you're not doing that (and most people aren't!), you end up putting additional pressure on the coordinator because now it has to talk to several other servers. If you have 100 servers and perform a mutation on 100 partitions, you could have a coordinator that's
>>>>>>>
>>>>>>> 1) talking to every machine in the cluster, and
>>>>>>> 2) waiting on a response from a significant portion of them
>>>>>>>
>>>>>>> before it can respond success or fail. Any delay, from GC to a bad disk, can affect the performance of the entire batch.
>>>>>>>
>>>>>>> On Sat Dec 13 2014 at 4:17:33 AM Jack Krupansky <j...@basetechnology.com> wrote:
>>>>>>>
>>>>>>>> Jonathan and Ryan,
>>>>>>>>
>>>>>>>> Jonathan says "It is absolutely not going to help you if you're trying to lump queries together to reduce network & server overhead - in fact it'll do the opposite", but I would note that the CQL3 spec says "The BATCH statement ... serves several purposes: 1. It saves network round-trips between the client and the server (and sometimes between the server coordinator and the replicas) when batching multiple updates." Is the spec inaccurate? I mean, it seems in conflict with your statement.
>>>>>>>>
>>>>>>>> See: https://cassandra.apache.org/doc/cql3/CQL.html
>>>>>>>>
>>>>>>>> I see the spec as gospel – if it's not accurate, let's propose a change to make it accurate.
>>>>>>>>
>>>>>>>> The DataStax CQL doc is more nuanced: "Batching multiple statements can save network exchanges between the client/server and server coordinator/replicas. However, because of the distributed nature of Cassandra, spread requests across nearby nodes as much as possible to optimize performance. Using batches to optimize performance is usually not successful, as described in the Using and misusing batches section. For information about the fastest way to load data, see 'Cassandra: Batch loading without the Batch keyword.'"
>>>>>>>>
>>>>>>>> Maybe what we really need is a "client/driver-side batch", which is simply a way to collect "batches" of operations in the client/driver and then let the driver determine what degree of batching and asynchronous operation is appropriate.
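To make the one genuinely helpful case concrete (Jonathan's "multiple mutations to the same partition" above), here's a rough sketch of a single-partition unlogged batch; the table and column names are invented for illustration, loosely modeled on Eric's (aid, end) description:

    import com.datastax.driver.core.{BatchStatement, Session}

    // Every statement in this batch shares the partition key "aid", so a token-aware
    // driver can hand the whole batch to a replica that owns it, and the cluster
    // applies it as what is effectively a single mutation.
    def samePartitionBatch(session: Session, aid: String, events: Seq[(java.lang.Long, String)]) = {
      val insert = session.prepare("INSERT INTO test_keyspace.events (aid, end, payload) VALUES (?, ?, ?)")
      val batch = new BatchStatement(BatchStatement.Type.UNLOGGED)
      events.foreach { case (end, payload) => batch.add(insert.bind(aid, end, payload)) }
      session.execute(batch)
    }

A batch that spans many partitions, by contrast, is exactly the coordinator fan-out case described above.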
>>>>>>>> It might also be nice to be able to ask the cluster what batch size is optimal (number of mutations in a batch and number of simultaneous connections), and to have that be dynamic based on overall cluster load.
>>>>>>>>
>>>>>>>> I would also note that the example in the spec has multiple inserts with different partition key values, which flies in the face of the admonition to refrain from using server-side distribution of requests.
>>>>>>>>
>>>>>>>> At a minimum, the CQL spec should make a clearer statement of intent and non-intent for BATCH.
>>>>>>>>
>>>>>>>> -- Jack Krupansky
>>>>>>>>
>>>>>>>> *From:* Jonathan Haddad <j...@jonhaddad.com>
>>>>>>>> *Sent:* Friday, December 12, 2014 12:58 PM
>>>>>>>> *To:* user@cassandra.apache.org; Ryan Svihla <rsvi...@datastax.com>
>>>>>>>> *Subject:* Re: batch_size_warn_threshold_in_kb
>>>>>>>>
>>>>>>>> The really important thing to take away from Ryan's original post is that batches are not there for performance. The only case I consider batches to be useful for is when you absolutely need to know that several tables all get a mutation (via logged batches). The use case for this is when you've got multiple tables serving as different views of the same data. It is absolutely not going to help you if you're trying to lump queries together to reduce network & server overhead - in fact it'll do the opposite. If you're trying to do that, instead perform many async queries. The overhead of batches in Cassandra is significant, and you're going to hit a lot of problems if you use them excessively (timeouts / failures).
>>>>>>>>
>>>>>>>> tl;dr: you probably don't want batch, you most likely want many async calls.
>>>>>>>>
>>>>>>>> On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller <moham...@glassbeam.com> wrote:
>>>>>>>>
>>>>>>>>> Ryan,
>>>>>>>>>
>>>>>>>>> Thanks for the quick response.
>>>>>>>>>
>>>>>>>>> I did see that JIRA before posting my question on this list. However, I didn't see any information about why 5kb+ batches would cause instability. 5kb or even 50kb seems too small. For example, if each mutation is 1000+ bytes, then with just 5 mutations you will hit that threshold.
>>>>>>>>>
>>>>>>>>> In addition, Patrick is saying that he does not recommend more than 100 mutations per batch. So why not warn users just on the # of mutations in a batch?
>>>>>>>>>
>>>>>>>>> Mohammed
>>>>>>>>>
>>>>>>>>> *From:* Ryan Svihla [mailto:rsvi...@datastax.com]
>>>>>>>>> *Sent:* Thursday, December 11, 2014 12:56 PM
>>>>>>>>> *To:* user@cassandra.apache.org
>>>>>>>>> *Subject:* Re: batch_size_warn_threshold_in_kb
>>>>>>>>>
>>>>>>>>> Nothing magic, just put in there based on experience. You can find the story behind the original recommendation here:
>>>>>>>>>
>>>>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-6487
>>>>>>>>>
>>>>>>>>> Key reasoning for the threshold comes from Patrick McFadin:
>>>>>>>>>
>>>>>>>>> "Yes that was in bytes. Just in my own experience, I don't recommend more than ~100 mutations per batch. Doing some quick math I came up with 5k as 100 x 50 byte mutations. Totally up for debate."
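If you do end up batching, a simple client-side guard along the lines of Patrick's ~100-mutations rule of thumb might look like the following sketch (the 100 here is his rule of thumb, not a measured limit; per the rest of the thread, individual async writes are usually the better default):

    import com.datastax.driver.core.{BatchStatement, Session, Statement}

    // Split a large set of statements into small unlogged batches so that no single
    // batch grows past roughly 100 mutations (by Patrick's math, about the 5kb
    // batch_size_warn_threshold_in_kb default for ~50-byte mutations).
    def inSmallBatches(statements: Seq[Statement], maxPerBatch: Int = 100)(implicit session: Session): Unit =
      statements.grouped(maxPerBatch).foreach { group =>
        val batch = new BatchStatement(BatchStatement.Type.UNLOGGED)
        group.foreach(st => batch.add(st))
        session.execute(batch)
      }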
>>>>>>>>> It's totally changeable; however, it's there in no small part because so many people mistake the BATCH keyword for a performance optimization, and this helps flag those cases of misuse.
>>>>>>>>>
>>>>>>>>> On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller <moham...@glassbeam.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi –
>>>>>>>>>
>>>>>>>>> The cassandra.yaml file has a property called *batch_size_warn_threshold_in_kb*.
>>>>>>>>>
>>>>>>>>> The default size is 5kb and, according to the comments in the yaml file, it is used to log a WARN on any batch size exceeding this value in kilobytes. It says caution should be taken when increasing this threshold, as it can lead to node instability.
>>>>>>>>>
>>>>>>>>> Does anybody know the significance of this magic number 5kb? Why would a higher number (say 10kb) lead to node instability?
>>>>>>>>>
>>>>>>>>> Mohammed
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Svihla
>>>>>>>>> Solution Architect, DataStax
>>>>>
>>>>> --
>>>>> Ryan Svihla
>>>>> Solution Architect, DataStax