Adar Dembo has posted comments on this change.

Change subject: docs: design for handling permanent master failures
......................................................................


Patch Set 3:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/3393/3/docs/design-docs/master-perm-failure-1.0.md
File docs/design-docs/master-perm-failure-1.0.md:

Line 52: 3. Copy the master's entire data/WAL directory from **X** to **Y**.
> hrm, this is odd -- I thought in step 2, X died. how are we going to copy i
To be honest I didn't delve into the various ways in which this condition (that 
X is "dead" but the data is salvageable) could be satisfied. Here are some 
possibilities:
1. X is super old and we'd like to decommission it. It'll be considered "dead" 
after the copy.
2. X has a bad DIMM that causes faults rarely. Maybe we'll rip out the bad 
DIMM, boot, do the copy, then decommission it.
3. Some other piece of X's hardware is gone, in which case yes, we may move the 
disk.

Do you think these are too contrived? Should I just rewrite this to dispel any 
notion that today's Kudu can recover from some kinds of permanent failure?


Line 114: 2. Find new master machines, creating DNS cnames for all of them. 
Create a DNS
> how will this work in the context of a management tool like CM? wouldn't th
I haven't given much thought to CM since it's out of scope for the Kudu 
_project_, but yeah, we may need that. Is there a similar concept in HDFS?


Line 136: 2. Implement new command line tool to rewrite cmeta files.
> can we combine these two? something that leads you through the process?
I'd rather have both: a command line tool that can perform each specific task 
on its own, and a script that ties them together.

Now that I've implemented this, though, it's proving difficult to combine since 
different pieces of work happen on different machines:
1. On each new master, run new "format" command to create FS.
2. On each new master, run kudu-fs_dump "list_uuid" to get the FS's UUID.
3. On the old master, run new "cmeta rewrite" command with the new UUIDs, 
hostports, and existing UUID/hostport.
4. On each new master, run new "tablet copy" command to fetch the master tablet.

I guess it can be done with a shell script that uses ssh to get to each 
machine. That won't work in every environment, though.
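
To make the cross-machine ordering concrete, here is a rough sketch of how such 
a driver might sequence the four steps. Everything in it is an assumption for 
illustration: the host names, the Step type, and the action names are 
hypothetical, and the remote command is left as a placeholder because the real 
command lines aren't spelled out above. It only builds and prints the plan; it 
doesn't run anything.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    host: str    # machine the step must run on
    action: str  # symbolic step name (hypothetical), not a real command line


def build_plan(old_master: str, new_masters: List[str]) -> List[Step]:
    """Order the four steps so that each runs on the machine that owns it."""
    plan: List[Step] = []
    # Steps 1 and 2: on each new master, create the FS, then read its UUID.
    for host in new_masters:
        plan.append(Step(host, "format"))
        plan.append(Step(host, "list_uuid"))
    # Step 3: on the old master, rewrite cmeta with the UUIDs gathered above.
    plan.append(Step(old_master, "cmeta_rewrite"))
    # Step 4: on each new master, fetch the master tablet.
    for host in new_masters:
        plan.append(Step(host, "tablet_copy"))
    return plan


def ssh_argv(step: Step) -> List[str]:
    """Render a step as an ssh invocation; the remote command is a placeholder."""
    return ["ssh", step.host, f"<{step.action} command goes here>"]


if __name__ == "__main__":
    # Dry run: print what would be executed, in order.
    for step in build_plan("old-master", ["new-master-1", "new-master-2"]):
        print(" ".join(ssh_argv(step)))
```

Step 3 has to wait for step 2 on every new master, which is the part that 
makes a single-machine tool awkward and pushes this toward an ssh-style driver.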


-- 
To view, visit http://gerrit.cloudera.org:8080/3393
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2f05c319c89cf37e2d71fdc4b7ec951b2932a2b2
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dral...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-HasComments: Yes
