[ https://issues.apache.org/jira/browse/CASSANDRA-3483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13148280#comment-13148280 ]
Chris Goffinet commented on CASSANDRA-3483:
-------------------------------------------

Some discussion from irc:

{noformat}
23:43 < goffinet> has datastax ever had a customer add a new datacenter to an existing cluster? No docs or info on web suggest anyone has done this before
23:44 < driftx> yeah
23:44 < goffinet> how is it done? we are running a case where if i modify strategy options before adding nodes, writes will fail since no endpoints for DC have been added
23:44 < goffinet> we were expecting this might work because we want to bootstrap the new DC to the existing cluster
23:44 < goffinet> take on writes + stream data with RF factor
23:45 < driftx> general best practice is (jbellis can correct if I'm outdated) add the dc at rf:0, add the nodes/update snitch, repair
23:45 < driftx> err, update rf, repair
23:46 < goffinet> yeah mind if i open up a jira? that seems extreme to make the cluster do that .. ?
23:46 < goffinet> or is repair smart enough to just stream ranges instead of AES?
23:46 < driftx> 'instead of AES?' that's what repair is, but if just streams ranges
23:46 < driftx> s/if/it/
23:47 < goffinet> right but AES builds merkle tree, scans through all data ?
23:47 < goffinet> isn't bootstrap a different operation?
23:47 < goffinet> when streaming just sstables
23:47 < driftx> yeah, it is
23:47 < goffinet> yeah thats more heavy. dont understand why we couldnt use that instead
23:47 < goffinet> like bootstrap
23:48 < stuhood> now that i think about it, it doesn't really make sense that a CL.ONE write fails if a DC isn't available
23:48 < stuhood> independent of the bootstrap case, that sounds like the real issue
23:49 < stuhood> goffinet: ^
23:50 < driftx> hmm, yeah that doesn't
23:50 < driftx> but the problem with bootstrapping a dc is the first node you bootstrap gets everything
23:50 < goffinet> stuhood: yeah. it was complaining about not enough endpoints
23:50 < goffinet> driftx: why is that? if you are doubling the cluster, and assign the tokens manually ?
23:51 < driftx> still have to do them 2 mins apart, and they're probably going to be part of the same replica set which I think is troublesome too
23:51 < goffinet> driftx: maybe we can make repair a bit more intelligent? if no data exists on the node .. just stream the ranges instead of using AES
23:52 < driftx> problem is we're pushing AES to do the entire replica set (which it nearly does now)
23:52 < stuhood> goffinet: it shouldn't be as heavyweight as you're thinking
23:53 < goffinet> stuhood: but we have a way currently that is less heavy
23:53 < goffinet> i dont understand why we couldnt use that method
23:53 < stuhood> not implemented =)
23:53 < goffinet> don't cut corners :)
23:53 < stuhood> human time vs cpu time =P
23:54 < driftx> you could almost do something like #3452 and then have a jmx call to say 'ok, finish'
23:54 < CassBotJr> https://issues.apache.org/jira/browse/CASSANDRA-3452 : Create an 'infinite bootstrap' mode for sampling live traffic
23:54 < driftx> except the first one that tries is going to have every node pound it with all the writes
23:54 < goffinet> driftx: ill make a jira ticket so we can discuss there, it doesn't seem like it would be too much trouble to support this use case
23:54 < goffinet> we'd be happy to write the patch after some input
23:55 < driftx> trickier than it sounds I'll bet, but sgtm
23:57 < stuhood> alternatively, is now the right time to add back group bootstrap?
23:58 < stuhood> so you'd 1) add the dc to the strategy, 2) do a group bootstrap of the entire dc
23:58 < stuhood> would also have to fix the CL.ONE problem though.
23:59 < goffinet> how did group bootstrap work again?
23:59 < driftx> #2434 is relevant
23:59 < CassBotJr> https://issues.apache.org/jira/browse/CASSANDRA-2434 : range movements can violate consistency
--- Day changed Fri Nov 11 2011
00:00 < stuhood> goffinet: bootstrapping many nodes at once without the 2 minute wait
00:01 < goffinet> why was it removed?
00:01 < stuhood> used zookeeper
00:01 < goffinet> oh.
00:01 < stuhood> but come to think of it, removing the 2 minute wait would seem to be relatively easy
00:02 < goffinet> stuhood, i thought the 2 minute wait was just waiting for ring state to settle?
00:02 < goffinet> before it streamed from nodes
00:02 < stuhood> goffinet: yea: you could form a "group" bootstrap by inverting things and waiting until you -hadn't- seen a new node in 2-10 minutes before you chose a token and started bootstrapping
00:03 < stuhood> so, not terribly simple, but.
00:04 < stuhood> you'd basically have a bunch of nodes sitting around waiting until no new nodes started, and then they have to deterministically choose tokens.
00:05 < goffinet> yes
00:05 < stuhood> well, alternatively, you wouldn't need a new way to deterministically choose tokens
00:05 < stuhood> (easier)
00:05 < stuhood> no… scratch that. you would need a way
00:05 < stuhood> for this DC case, all of the nodes are entering an empty ring
00:06 < stuhood> so the group would need to choose something balanced
00:06 < goffinet> empty ring?
00:06 < stuhood> yea, essentially… there are no tokens in that dc
00:06 < goffinet> but we were going to provide the tokens manually?
00:06 < goffinet> were you thinking of making it automatic?
00:07 < stuhood> yea. fixing bootstrapping groups of nodes would make automatic safe again
00:08 < stuhood> so… whatever state a node is in when it is sitting and waiting for enough information to choose a token, it should just stay that way and watch what other nodes enter that state
00:08 < goffinet> so i have a question about the 120 second window you have to wait..
00:09 < stuhood> mm
00:09 < driftx> hmm, what if they started up at rf:0 but stayed in some dead state (hibernate might work) without doing anything until you changed the rf, then actually bootstrapped?
00:09 < goffinet> so imagine i startup all the nodes in DC2 at same time, does join_ring=false not grab gossip info at all? I was thinking it would be good if we could just start gossip on all nodes, but until operator says 'go' then i could bootstrap them all at same time
00:09 < goffinet> since i would only have to wait at most 120 seconds before kicking them all off
00:10 < stuhood> driftx: yea, that could work too… but you'd still need to choose tokens. (also, the rf=0 thing shouldn't be necessary, right? that's the CL.ONE bug)
00:11 < driftx> well, you really want to choose tokens anyway
00:11 < stuhood> goffinet: it does get gossip… i think that's basically equivalent to the pre-join state
00:11 < driftx> I guess you don't need rf=0 if all the nodes are in hibernate
00:12 < goffinet> yeah i think you do need hibernate in this case, because if i set tokens upfront, i want all nodes to know about ATL ones too
00:12 < goffinet> before i kick off bootstrap
00:12 < stuhood> driftx: i'm confused… what is the difference between rf=0 and not being there?
00:12 < stuhood> is that a workaround for the CL.ONE bug?
00:13 < driftx> you know there's a dc with rf:0, can add one with impacting anything
00:13 < driftx> err, without
00:14 -!- boaz__ (0819c319@gateway/web/freenode/ip.8.25.195.25) has joined #cassandra-dev
00:14 < stuhood> so what was the point of adding it? that's why i'm confused...
00:14 < goffinet> im fine with rf:0, its so you can add the nodes to the cluster before calling repair
00:14 < goffinet> before you add nodes
00:15 < driftx> because the dc is in the schema
00:15 < driftx> so you need it there to have nodes be in it
00:15 < stuhood> ah
00:16 < goffinet> driftx: any reason why we couldnt just fix that? so dc2:3 wont throw an error if nodes are down?
00:16 < goffinet> that way you would needed to do two steps
00:16 < goffinet> dc2:0, add nodes, dc2:3
00:16 < goffinet> wouldn't*
00:16 < driftx> I don't understand, you can already do that
00:17 < driftx> you just have to repair afterwards
00:17 < goffinet> it throws an error currently? if you set dc2:3 and no nodes exist for dc2
00:17 < goffinet> we'll double check on that
00:18 < goffinet> for writes
00:18 < driftx> oh, it does
00:19 < driftx> but only for writes
00:19 < goffinet> yeah
00:19 < goffinet> so thats fine, thats fixable
00:19 < goffinet> im just curious about a) how can we bootstrap nodes without 120s delays between N nodes b) stream from DC1 without AES
00:21 < stuhood> goffinet: if you figure out a, i don't think b is necessary?
00:22 < stuhood> assuming they are aware of the other joining nodes, and can all join the same range
00:22 < stuhood> that would be the keystone for some kind of group bootstrap
00:23 < goffinet> let me test out join_ring, because im curious. if join_ring=false still gossips but doesnt officially join.. it would be nice if node 2 in DC2 knew about that node too somehow?
00:23 < driftx> that's why I proposed cheating, add them all as non-members, then ask them to bootstrap
00:23 < goffinet> because then .. i could just run a command on each node at same time
00:23 < goffinet> since they all know about each other in a hibernate state
00:23 < goffinet> driftx: yes i like that
00:24 < driftx> private void joinTokenRing(int delay) throws IOException, org.apache.cassandra.config.ConfigurationException
00:24 < driftx> {
00:24 < driftx> logger_.info("Starting up server gossip");
00:24 < driftx> they don't use gossip with join_ring off
00:24 < stuhood> but will that actually allow them to all join the same range?
00:24 < goffinet> okay cool, yeah we would need to make it join in that special state then
00:25 < stuhood> i think there is an edgecase here… if multiple nodes are joining the same range, and one of them fails, then should they all fail?
00:25 < driftx> no, it basically saves you server startup time that is not ring-related :)
00:25 < goffinet> stuhood, they all know the tokens ahead of time?
00:25 < goffinet> they just need to know the current global state of things
00:25 < stuhood> goffinet: right, but if they are streaming the range that they will be responsible for...
00:26 -!- mw1 (~Adium@8.25.195.29) has quit (Quit: Leaving.)
00:26 < stuhood> Joining nodes don't stick around if they fail
00:26 < goffinet> they shouldnt be allowed to do that until they joined ?
00:26 < stuhood> nah, you stream while you are joining… unless you are talking about repair
00:26 < goffinet> stuhood: was that removed? i thought u had to still remove the node
00:26 < goffinet> using the new options in 1.0
00:26 < stuhood> don't know about 1.0
00:27 < driftx> no, a failed non-member is just a fat client and disappears
00:27 < goffinet> but i thought there was a timeout for fat client ?
00:27 < goffinet> is it 30s or something?
00:27 < driftx> yes
00:28 < goffinet> so nodes that arent fat clients, why might we remove them ? if we didnt..
00:28 < goffinet> and let the operator do it
00:28 < goffinet> or have a larger timeout
00:28 < goffinet> might make this a non-issue
00:28 < driftx> what does a larger timeout/keeping them around buy you?
00:29 < goffinet> because if they go away, and i bootstrap after they failed, wont my view of ring be skewed?
00:29 < stuhood> driftx: i guess in this case, the node would resume bootstrapping from where it left off
00:29 < driftx> it would've missed writes in the meantime and require a repair afterwards anyway
00:29 < stuhood> sorry… "resume" in the sense of "start over", but yea
00:31 < stuhood> that would be a pretty big change, but it might make sense
00:31 < goffinet> stuhood: what would you change
00:31 < stuhood> what you said, about nodes in joining staying in joining
00:31 < stuhood> so if the machine restarts, it begins joining at the same position again
00:33 < goffinet> if we supported that + letting nodes gossip in hibernate, would allow us to add capacity at operator control
{noformat}

> Support bringing up a new datacenter to existing cluster without repair
> ------------------------------------------------------------------------
>
>                 Key: CASSANDRA-3483
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3483
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 1.0.2
>            Reporter: Chris Goffinet
>
> Was talking to Brandon in irc, and we ran into a case where we want to bring
> up a new DC to an existing cluster. He suggested from jbellis the way to do
> it currently was to set strategy options of dc2:0, then add the nodes. After
> the nodes are up, change the RF of dc2, and run repair.
> I'd like to avoid a repair as it runs AES and is a bit more intense than how
> bootstrap works currently by just streaming ranges from the SSTables. Would
> it be possible to improve this functionality (adding a new DC to an existing
> cluster) over the proposed method? We'd be happy to do a patch if we got
> some input on the best way to go about it.
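For anyone else who lands here, the interim procedure from the discussion above ("add the dc at rf:0, add the nodes/update snitch, update rf, repair") would look roughly like the following. This is a sketch only: the keyspace and DC names are invented, and the exact cassandra-cli syntax for strategy_options differs between releases.

{noformat}
# 1. add the new DC to the schema at rf 0, so nothing is replicated to it yet
update keyspace MyKS with placement_strategy = 'NetworkTopologyStrategy'
  and strategy_options = {DC1 : 3, DC2 : 0};

# 2. update the snitch/topology so the new nodes map to DC2, then start the
#    DC2 nodes (tokens chosen up front) and wait for them to join the ring

# 3. raise the replication factor for the new DC
update keyspace MyKS with strategy_options = {DC1 : 3, DC2 : 3};

# 4. repair on each DC2 node to pull the data over; this is the AES step
#    this ticket would like to avoid
nodetool -h <dc2-node> repair
{noformat}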
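On stuhood's CL.ONE point: as reported, the failure amounts to something like the sketch below (illustrative logic with invented names, not the actual 1.0 write path). The coordinator sizes the replica set from the configured RF of every DC, so a freshly added DC with rf > 0 but no nodes yet fails even a CL.ONE write, although a single live DC1 replica already satisfies CL.ONE.

{noformat}
public class ClOneSketch
{
    public static void main(String[] args)
    {
        int configuredReplicas = 3 + 3; // DC1:3 + DC2:3 in strategy options
        int locatableEndpoints = 3;     // only the DC1 nodes exist so far
        int requiredForClOne = 1;       // CL.ONE needs one live replica

        // the reported behavior: this trips before any per-CL check
        if (locatableEndpoints < configuredReplicas)
            System.out.println("UnavailableException: not enough endpoints");

        // the check that should be the only gate for CL.ONE
        if (locatableEndpoints >= requiredForClOne)
            System.out.println("CL.ONE is actually satisfiable");
    }
}
{noformat}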
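The "deterministically choose tokens" step for a group bootstrap into an empty DC doesn't need coordination beyond gossip: once a joiner has seen no new joining node for the settle window, every joiner can sort the joiner set identically and carve the ring evenly. A runnable sketch (invented class and method names, not an existing Cassandra API):

{noformat}
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class GroupBootstrapTokens
{
    // RandomPartitioner's token range is [0, 2**127)
    private static final BigInteger RING = BigInteger.valueOf(2).pow(127);

    // every node computes the same token because every node sorts the same
    // (settled) set of joiners the same way
    public static BigInteger tokenFor(List<String> joiners, String me)
    {
        List<String> sorted = new ArrayList<String>(joiners);
        Collections.sort(sorted);
        int i = sorted.indexOf(me);
        return RING.multiply(BigInteger.valueOf(i))
                   .divide(BigInteger.valueOf(sorted.size()));
    }

    public static void main(String[] args)
    {
        List<String> joiners = Arrays.asList("10.0.2.1", "10.0.2.2", "10.0.2.3");
        for (String node : joiners)
            System.out.println(node + " -> " + tokenFor(joiners, node));
    }
}
{noformat}

In practice each DC2 token would also be offset slightly (the usual +1 trick) so it never collides with an existing DC1 token.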
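And the "jmx call to say 'ok, finish'" driftx mentions could look like the following from the operator side. Two assumptions to be loud about: the nodes would have to gossip in some hibernate/pre-join state first (per the joinTokenRing snippet in the log, join_ring=false today skips gossip entirely), and this assumes StorageService exposes a joinRing() operation over JMX.

{noformat}
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// connect to each hibernating DC2 node and ask it to join at the same time
public class KickOffJoin
{
    public static void main(String[] args) throws Exception
    {
        for (String host : args)
        {
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
            JMXConnector conn = JMXConnectorFactory.connect(url);
            try
            {
                MBeanServerConnection mbs = conn.getMBeanServerConnection();
                ObjectName ss = new ObjectName("org.apache.cassandra.db:type=StorageService");
                mbs.invoke(ss, "joinRing", new Object[0], new String[0]);
                System.out.println(host + ": join requested");
            }
            finally
            {
                conn.close();
            }
        }
    }
}
{noformat}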