[ https://issues.apache.org/jira/browse/CASSANDRA-3483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13148280#comment-13148280 ]
Chris Goffinet commented on CASSANDRA-3483:
-------------------------------------------

Some discussion from irc:

{noformat}
23:43 < goffinet> has datastax ever had a customer add a new datacenter to an existing cluster? No docs or info on web suggest anyone has done this before
23:44 < driftx> yeah
23:44 < goffinet> how is it done? we are running a case where if i modify strategy options before adding nodes, writes will fail since no endpoints for DC have been added
23:44 < goffinet> we were expecting this might work because we want to bootstrap the new DC to the existing cluster
23:44 < goffinet> take on writes + stream data with RF factor
23:45 < driftx> general best practice is (jbellis can correct if I'm outdated) add the dc at rf:0, add the nodes/update snitch, repair
23:45 < driftx> err, update rf, repair
23:46 < goffinet> yeah mind if i open up a jira? that seems extreme to make the cluster do that .. ?
23:46 < goffinet> or is repair smart enough to just stream ranges instead of AES?
23:46 < driftx> 'instead of AES?' that's what repair is, but if just streams ranges
23:46 < driftx> s/if/it/
23:47 < goffinet> right but AES builds merkle tree, scans through all data ?
23:47 < goffinet> isn't bootstrap a different operation?
23:47 < goffinet> when streaming just sstables
23:47 < driftx> yeah, it is
23:47 < goffinet> yeah thats more heavy. dont understand why we couldnt use that instead
23:47 < goffinet> like bootstrap
23:48 < stuhood> now that i think about it, it doesn't really make sense that a CL.ONE write fails if a DC isn't available
23:48 < stuhood> independent of the bootstrap case, that sounds like the real issue
23:49 < stuhood> goffinet: ^
23:50 < driftx> hmm, yeah that doesn't
23:50 < driftx> but the problem with bootstrapping a dc is the first node you bootstrap gets everything
23:50 < goffinet> stuhood: yeah. it was complaining about not enough endpoints
23:50 < goffinet> driftx: why is that? if you are doubling the cluster, and assign the tokens manually ?
23:51 < driftx> still have to do them 2 mins apart, and they're probably going to be part of the same replica set which I think is troublesome too
23:51 < goffinet> driftx: maybe we can make repair a bit more intelligent? if no data exists on the node .. just stream the ranges instead of using AES
23:52 < driftx> problem is we're pushing AES to do the entire replica set (which it nearly does now)
23:52 < stuhood> goffinet: it shouldn't be as heavyweight as you're thinking
23:53 < goffinet> stuhood: but we have a way currently that is less heavy
23:53 < goffinet> i dont understand why we couldnt use that method
23:53 < stuhood> not implemented =)
23:53 < goffinet> don't cut corners :)
23:53 < stuhood> human time vs cpu time =P
23:54 < driftx> you could almost do something like #3452 and then have a jmx call to say 'ok, finish'
23:54 < CassBotJr> https://issues.apache.org/jira/browse/CASSANDRA-3452 : Create an 'infinite bootstrap' mode for sampling live traffic
23:54 < driftx> except the first one that tries is going to have every node pound it with all the writes
23:54 < goffinet> driftx: ill make a jira ticket so we can discuss there, it doesn't seem like it would be too much trouble to support this use case
23:54 < goffinet> we'd be happy to write the patch after some input
23:55 < driftx> trickier than it sounds I'll bet, but sgtm
23:57 < stuhood> alternatively, is now the right time to add back group bootstrap?
23:58 < stuhood> so you'd 1) add the dc to the strategy, 2) do a group bootstrap of the entire dc
23:58 < stuhood> would also have to fix the CL.ONE problem though.
23:59 < goffinet> how did group bootstrap work again?
23:59 < driftx> #2434 is relevant
23:59 < CassBotJr> https://issues.apache.org/jira/browse/CASSANDRA-2434 : range movements can violate consistency
--- Day changed Fri Nov 11 2011
00:00 < stuhood> goffinet: bootstrapping many nodes at once without the 2 minute wait
00:01 < goffinet> why was it removed?
00:01 < stuhood> used zookeeper
00:01 < goffinet> oh.
00:01 < stuhood> but come to think of it, removing the 2 minute wait would seem to be relatively easy
00:02 < goffinet> stuhood, i thought the 2 minute wait was just waiting for ring state to settle?
00:02 < goffinet> before it streamed from nodes
00:02 < stuhood> goffinet: yea: you could form a "group" bootstrap by inverting things and waiting until you -hadn't- seen a new node in 2-10 minutes before you chose a token and started bootstrapping
00:03 < stuhood> so, not terribly simple, but.
00:04 < stuhood> you'd basically have a bunch of nodes sitting around waiting until no new nodes started, and then they have to deterministically choose tokens.
00:05 < goffinet> yes
00:05 < stuhood> well, alternatively, you wouldn't need a new way to deterministically choose tokens
00:05 < stuhood> (easier)
00:05 < stuhood> no… scratch that. you would need a way
00:05 < stuhood> for this DC case, all of the nodes are entering an empty ring
00:06 < stuhood> so the group would need to choose something balanced
00:06 < goffinet> empty ring?
00:06 < stuhood> yea, essentially… there are no tokens in that dc
00:06 < goffinet> but we were going to provide the tokens manually?
00:06 < goffinet> were you thinking of making it automatic?
00:07 < stuhood> yea. fixing bootstrapping groups of nodes would make automatic safe again
00:08 < stuhood> so… whatever state a node is in when it is sitting and waiting for enough information to choose a token, it should just stay that way and watch what other nodes enter that state
00:08 < goffinet> so i have a question about the 120 second window you have to wait..
00:09 < stuhood> mm
00:09 < driftx> hmm, what if they started up at rf:0 but stayed in some dead state (hibernate might work) without doing anything until you changed the rf, then actually bootstrapped?
00:09 < goffinet> so imagine i startup all the nodes in DC2 at same time, does join_ring=false not grab gossip info at all? I was thinking it would be good if we could just start gossip on all nodes, but until operator says 'go' then i could bootstrap them all at same time
00:09 < goffinet> since i would only have to wait at most 120 seconds before kicking them all off
00:10 < stuhood> driftx: yea, that could work too… but you'd still need to choose tokens. (also, the rf=0 thing shouldn't be necessary, right? that's the CL.ONE bug)
00:11 < driftx> well, you really want to choose tokens anyway
00:11 < stuhood> goffinet: it does get gossip… i think that's basically equivalent to the pre-join state
00:11 < driftx> I guess you don't need rf=0 if all the nodes are in hibernate
00:12 < goffinet> yeah i think you do need hibernate in this case, because if i set tokens upfront, i want all nodes to know about ATL ones too
00:12 < goffinet> before i kick off bootstrap
00:12 < stuhood> driftx: i'm confused… what is the difference between rf=0 and not being there?
00:12 < stuhood> is that a workaround for the CL.ONE bug?
00:13 < driftx> you know there's a dc with rf:0, can add one with impacting anything
00:13 < driftx> err, without
00:14 -!- boaz__ (0819c319@gateway/web/freenode/ip.8.25.195.25) has joined #cassandra-dev
00:14 < stuhood> so what was the point of adding it? that's why i'm confused...
00:14 < goffinet> im fine with rf:0, its so you can add the nodes to the cluster before calling repair
00:14 < goffinet> before you add nodes
00:15 < driftx> because the dc is in the schema
00:15 < driftx> so you need it there to have nodes be in it
00:15 < stuhood> ah
00:16 < goffinet> driftx: any reason why we couldnt just fix that? so dc2:3 wont throw an error if nodes are down?
00:16 < goffinet> that way you would needed to do two steps
00:16 < goffinet> dc2:0, add nodes, dc2:3
00:16 < goffinet> wouldn't*
00:16 < driftx> I don't understand, you can already do that
00:17 < driftx> you just have to repair afterwards
00:17 < goffinet> it throws an error currently? if you set dc2:3 and no nodes exist for dc2
00:17 < goffinet> we'll double check on that
00:18 < goffinet> for writes
00:18 < driftx> oh, it does
00:19 < driftx> but only for writes
00:19 < goffinet> yeah
00:19 < goffinet> so thats fine, thats fixable
00:19 < goffinet> im just curious about a) how can we bootstrap nodes without 120s delays between N nodes b) stream from DC1 without AES
00:21 < stuhood> goffinet: if you figure out a, i don't think b is necessary?
00:22 < stuhood> assuming they are aware of the other joining nodes, and can all join the same range
00:22 < stuhood> that would be the keystone for some kind of group bootstrap
00:23 < goffinet> let me test out join_ring, because im curious. if join_ring=false still gossips but doesnt officially join.. it would be nice if node 2 in DC2 knew about that node too somehow?
00:23 < driftx> that's why I proposed cheating, add them all as non-members, then ask them to bootstrap
00:23 < goffinet> because then .. i could just run a command on each node at same time
00:23 < goffinet> since they all know about each other in a hibernate state
00:23 < goffinet> driftx: yes i like that
00:24 < driftx> private void joinTokenRing(int delay) throws IOException, org.apache.cassandra.config.ConfigurationException
00:24 < driftx> {
00:24 < driftx> logger_.info("Starting up server gossip");
00:24 < driftx> they don't use gossip with join_ring off
00:24 < stuhood> but will that actually allow them to all join the same range?
00:24 < goffinet> okay cool, yeah we would need to make it join in that special state then
00:25 < stuhood> i think there is an edgecase here… if multiple nodes are joining the same range, and one of them fails, then should they all fail?
00:25 < driftx> no, it basically saves you server startup time that is not ring-related :)
00:25 < goffinet> stuhood, they all know the tokens ahead of time?
00:25 < goffinet> they just need to know the current global state of things
00:25 < stuhood> goffinet: right, but if they are streaming the range that they will be responsible for...
00:26 -!- mw1 (~Adium@8.25.195.29) has quit (Quit: Leaving.)
00:26 < stuhood> Joining nodes don't stick around if they fail
00:26 < goffinet> they shouldnt be allowed to do that until they joined ?
00:26 < stuhood> nah, you stream while you are joining… unless you are talking about repair
00:26 < goffinet> stuhood: was that removed? i thought u had to still remove the node
00:26 < goffinet> using the new options in 1.0
00:26 < stuhood> don't know about 1.0
00:27 < driftx> no, a failed non-member is just a fat client and disappears
00:27 < goffinet> but i thought there was a timeout for fat client ?
00:27 < goffinet> is it 30s or something?
00:27 < driftx> yes
00:28 < goffinet> so nodes that arent fat clients, why might we remove them ? if we didnt..
00:28 < goffinet> and let the operator do it
00:28 < goffinet> or have a larger timeout
00:28 < goffinet> might make this a non-issue
00:28 < driftx> what does a larger timeout/keeping them around buy you?
00:29 < goffinet> because if they go away, and i bootstrap after they failed, wont my view of ring be skewed?
00:29 < stuhood> driftx: i guess in this case, the node would resume bootstrapping from where it left off
00:29 < driftx> it would've missed writes in the meantime and require a repair afterwards anyway
00:29 < stuhood> sorry… "resume" in the sense of "start over", but yea
00:31 < stuhood> that would be a pretty big change, but it might make sense
00:31 < goffinet> stuhood: what would you change
00:31 < stuhood> what you said, about nodes in joining staying in joining
00:31 < stuhood> so if the machine restarts, it begins joining at the same position again
00:33 < goffinet> if we supported that + letting nodes gossip in hibernate, would allow us to add capacity at operator control
{noformat}

> Support bringing up a new datacenter to existing cluster without repair
> ------------------------------------------------------------------------
>
>                 Key: CASSANDRA-3483
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3483
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 1.0.2
>            Reporter: Chris Goffinet
>
> Was talking to Brandon in irc, and we ran into a case where we want to bring
> up a new DC to an existing cluster. He suggested from jbellis the way to do
> it currently was to set strategy options of dc2:0, then add the nodes. After
> the nodes are up, change the RF of dc2, and run repair.
> I'd like to avoid a repair as it runs AES and is a bit more intense than how
> bootstrap works currently by just streaming ranges from the SSTables. Would
> it be possible to improve this functionality (adding a new DC to an existing
> cluster) over the proposed method? We'd be happy to do a patch if we got
> some input on the best way to go about it.
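For anyone else who lands here, the interim procedure from the discussion above ("add the dc at rf:0, add the nodes/update snitch, update rf, repair") would look roughly like the following. This is a sketch only: the keyspace and DC names are invented, and the exact cassandra-cli syntax for strategy_options differs between releases.

{noformat}
# 1. add the new DC to the schema at rf 0, so nothing is replicated to it yet
update keyspace MyKS with placement_strategy = 'NetworkTopologyStrategy'
  and strategy_options = {DC1 : 3, DC2 : 0};

# 2. update the snitch/topology so the new nodes map to DC2, then start the
#    DC2 nodes (tokens chosen up front) and wait for them to join the ring

# 3. raise the replication factor for the new DC
update keyspace MyKS with strategy_options = {DC1 : 3, DC2 : 3};

# 4. repair on each DC2 node to pull the data over; this is the AES step
#    this ticket would like to avoid
nodetool -h <dc2-node> repair
{noformat}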
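On stuhood's CL.ONE point: as reported, the failure amounts to something like the sketch below (illustrative logic with invented names, not the actual 1.0 write path). The coordinator sizes the replica set from the configured RF of every DC, so a freshly added DC with rf > 0 but no nodes yet fails even a CL.ONE write, although a single live DC1 replica already satisfies CL.ONE.

{noformat}
public class ClOneSketch
{
    public static void main(String[] args)
    {
        int configuredReplicas = 3 + 3; // DC1:3 + DC2:3 in strategy options
        int locatableEndpoints = 3;     // only the DC1 nodes exist so far
        int requiredForClOne = 1;       // CL.ONE needs one live replica

        // the reported behavior: this trips before any per-CL check
        if (locatableEndpoints < configuredReplicas)
            System.out.println("UnavailableException: not enough endpoints");

        // the check that should be the only gate for CL.ONE
        if (locatableEndpoints >= requiredForClOne)
            System.out.println("CL.ONE is actually satisfiable");
    }
}
{noformat}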
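The "deterministically choose tokens" step for a group bootstrap into an empty DC doesn't need coordination beyond gossip: once a joiner has seen no new joining node for the settle window, every joiner can sort the joiner set identically and carve the ring evenly. A runnable sketch (invented class and method names, not an existing Cassandra API):

{noformat}
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class GroupBootstrapTokens
{
    // RandomPartitioner's token range is [0, 2**127)
    private static final BigInteger RING = BigInteger.valueOf(2).pow(127);

    // every node computes the same token because every node sorts the same
    // (settled) set of joiners the same way
    public static BigInteger tokenFor(List<String> joiners, String me)
    {
        List<String> sorted = new ArrayList<String>(joiners);
        Collections.sort(sorted);
        int i = sorted.indexOf(me);
        return RING.multiply(BigInteger.valueOf(i))
                   .divide(BigInteger.valueOf(sorted.size()));
    }

    public static void main(String[] args)
    {
        List<String> joiners = Arrays.asList("10.0.2.1", "10.0.2.2", "10.0.2.3");
        for (String node : joiners)
            System.out.println(node + " -> " + tokenFor(joiners, node));
    }
}
{noformat}

In practice each DC2 token would also be offset slightly (the usual +1 trick) so it never collides with an existing DC1 token.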
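And the "jmx call to say 'ok, finish'" driftx mentions could look like the following from the operator side. Two assumptions to be loud about: the nodes would have to gossip in some hibernate/pre-join state first (per the joinTokenRing snippet in the log, join_ring=false today skips gossip entirely), and this assumes StorageService exposes a joinRing() operation over JMX.

{noformat}
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// connect to each hibernating DC2 node and ask it to join at the same time
public class KickOffJoin
{
    public static void main(String[] args) throws Exception
    {
        for (String host : args)
        {
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
            JMXConnector conn = JMXConnectorFactory.connect(url);
            try
            {
                MBeanServerConnection mbs = conn.getMBeanServerConnection();
                ObjectName ss = new ObjectName("org.apache.cassandra.db:type=StorageService");
                mbs.invoke(ss, "joinRing", new Object[0], new String[0]);
                System.out.println(host + ": join requested");
            }
            finally
            {
                conn.close();
            }
        }
    }
}
{noformat}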