[kudu-CR] [docs] Document how to recover from a majority failed tablet
Will Berkeley has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8402 )

Change subject: [docs] Document how to recover from a majority failed tablet
..

[docs] Document how to recover from a majority failed tablet

This adds some docs on how to recover when a tablet can no longer find
a majority due to the permanent failure of replicas.

I tested this procedure by failing tablets in various ways:
- deleting important bits like cmeta or tablet metadata
- deleting entire data dirs
- tombstoning 2/3 replicas (and disabling tombstoned voting)
and I was always able to recover using these instructions.

Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
Reviewed-on: http://gerrit.cloudera.org:8080/8402
Reviewed-by: Mike Percy
Tested-by: Will Berkeley
---
M docs/administration.adoc
1 file changed, 65 insertions(+), 0 deletions(-)

Approvals:
  Mike Percy: Looks good to me, approved
  Will Berkeley: Verified

--
To view, visit http://gerrit.cloudera.org:8080/8402
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
Gerrit-Change-Number: 8402
Gerrit-PatchSet: 10
Gerrit-Owner: Will Berkeley
Gerrit-Reviewer: Alex Rodoni
Gerrit-Reviewer: Jean-Daniel Cryans
Gerrit-Reviewer: Mike Percy
Gerrit-Reviewer: Todd Lipcon
Gerrit-Reviewer: Will Berkeley
[kudu-CR] [docs] Document how to recover from a majority failed tablet
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/8402 )

Patch Set 9: Verified+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8402/8/docs/administration.adoc
File docs/administration.adoc:

http://gerrit.cloudera.org:8080/#/c/8402/8/docs/administration.adoc@814
PS8, Line 814: . Only attem
> s/ is possible//
Done

Gerrit-Comment-Date: Tue, 13 Feb 2018 21:07:42 +
[kudu-CR] [docs] Document how to recover from a majority failed tablet
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8402 )

Patch Set 9: Code-Review+2

Gerrit-Comment-Date: Tue, 13 Feb 2018 20:41:43 +
[kudu-CR] [docs] Document how to recover from a majority failed tablet
Hello Alex Rodoni, Mike Percy, Jean-Daniel Cryans, Kudu Jenkins, Todd Lipcon,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8402

to look at the new patch set (#9).

[docs] Document how to recover from a majority failed tablet

This adds some docs on how to recover when a tablet can no longer find
a majority due to the permanent failure of replicas.

I tested this procedure by failing tablets in various ways:
- deleting important bits like cmeta or tablet metadata
- deleting entire data dirs
- tombstoning 2/3 replicas (and disabling tombstoned voting)
and I was always able to recover using these instructions.

Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
---
M docs/administration.adoc
1 file changed, 65 insertions(+), 0 deletions(-)

git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/02/8402/9
[kudu-CR] [docs] Document how to recover from a majority failed tablet
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8402 )

Patch Set 8: Code-Review+1

(1 comment)

lgtm, only a nit

http://gerrit.cloudera.org:8080/#/c/8402/8/docs/administration.adoc
File docs/administration.adoc:

http://gerrit.cloudera.org:8080/#/c/8402/8/docs/administration.adoc@814
PS8, Line 814: is possible
s/ is possible//

Gerrit-Comment-Date: Tue, 13 Feb 2018 01:38:12 +
[kudu-CR] [docs] Document how to recover from a majority failed tablet
Hello Alex Rodoni, Mike Percy, Jean-Daniel Cryans, Kudu Jenkins, Todd Lipcon,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8402

to look at the new patch set (#8).

[docs] Document how to recover from a majority failed tablet

This adds some docs on how to recover when a tablet can no longer find
a majority due to the permanent failure of replicas.

I tested this procedure by failing tablets in various ways:
- deleting important bits like cmeta or tablet metadata
- deleting entire data dirs
- tombstoning 2/3 replicas (and disabling tombstoned voting)
and I was always able to recover using these instructions.

Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
---
M docs/administration.adoc
1 file changed, 65 insertions(+), 0 deletions(-)

git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/02/8402/8
[kudu-CR] [docs] Document how to recover from a majority failed tablet
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/8402 )

Patch Set 7:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/8402/7/docs/administration.adoc
File docs/administration.adoc:

http://gerrit.cloudera.org:8080/#/c/8402/7/docs/administration.adoc@811
PS7, Line 811: majority
> is there a quantifiable definition of "majority"?
The normal definition of a majority, i.e. "50% plus one".

http://gerrit.cloudera.org:8080/#/c/8402/7/docs/administration.adoc@813
PS7, Line 813: and so
> potentially resulting in permanent data loss.
Done

http://gerrit.cloudera.org:8080/#/c/8402/7/docs/administration.adoc@840
PS7, Line 840: ,
> remove ","
Done

http://gerrit.cloudera.org:8080/#/c/8402/7/docs/administration.adoc@844
PS7, Line 844: ,
> Remove ,
Done

http://gerrit.cloudera.org:8080/#/c/8402/7/docs/administration.adoc@844
PS7, Line 844: those
> Those
Done

Gerrit-Comment-Date: Thu, 08 Feb 2018 18:40:39 +
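To make the "50% plus one" answer above concrete, here is a minimal sketch of the arithmetic. It assumes the usual integer form `floor(n / 2) + 1` for a configuration of `n` voting replicas; the function names `majority` and `tolerated_failures` are made up for this illustration and are not part of Kudu.

```shell
#!/usr/bin/env bash
# Illustrative helpers only; these names are not part of the kudu CLI.

# Smallest number of replicas that forms a majority: floor(n / 2) + 1.
majority() {
  local replicas=$1
  echo $(( replicas / 2 + 1 ))
}

# Largest number of replicas that can fail while a majority still remains.
tolerated_failures() {
  local replicas=$1
  echo $(( replicas - (replicas / 2 + 1) ))
}

for n in 3 5 7; do
  echo "replicas=$n majority=$(majority "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

With three replicas the majority is two, so losing two of three replicas (as in the tombstoning test from the commit message) is exactly the case where the tablet can no longer elect a leader on its own.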
[kudu-CR] [docs] Document how to recover from a majority failed tablet
Alex Rodoni has posted comments on this change. ( http://gerrit.cloudera.org:8080/8402 )

Patch Set 7:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/8402/7/docs/administration.adoc
File docs/administration.adoc:

http://gerrit.cloudera.org:8080/#/c/8402/7/docs/administration.adoc@811
PS7, Line 811: majority
is there a quantifiable definition of "majority"?

http://gerrit.cloudera.org:8080/#/c/8402/7/docs/administration.adoc@813
PS7, Line 813: and so
potentially resulting in permanent data loss.

http://gerrit.cloudera.org:8080/#/c/8402/7/docs/administration.adoc@840
PS7, Line 840: ,
remove ","

http://gerrit.cloudera.org:8080/#/c/8402/7/docs/administration.adoc@844
PS7, Line 844: ,
Remove ,

http://gerrit.cloudera.org:8080/#/c/8402/7/docs/administration.adoc@844
PS7, Line 844: those
Those

Gerrit-Comment-Date: Tue, 06 Feb 2018 23:37:08 +
[kudu-CR] [docs] Document how to recover from a majority failed tablet
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/8402 )

Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8402/6/docs/administration.adoc
File docs/administration.adoc:

http://gerrit.cloudera.org:8080/#/c/8402/6/docs/administration.adoc@841
PS6, Line 841: .I
> tiny nit: missing space
Done

Gerrit-Comment-Date: Mon, 05 Feb 2018 17:58:39 +
[kudu-CR] [docs] Document how to recover from a majority failed tablet
Hello Mike Percy, Jean-Daniel Cryans, Kudu Jenkins, Todd Lipcon,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8402

to look at the new patch set (#7).

[docs] Document how to recover from a majority failed tablet

This adds some docs on how to recover when a tablet can no longer find
a majority due to the permanent failure of replicas.

I tested this procedure by failing tablets in various ways:
- deleting important bits like cmeta or tablet metadata
- deleting entire data dirs
- tombstoning 2/3 replicas (and disabling tombstoned voting)
and I was always able to recover using these instructions.

Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
---
M docs/administration.adoc
1 file changed, 65 insertions(+), 0 deletions(-)

git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/02/8402/7
[kudu-CR] [docs] Document how to recover from a majority failed tablet
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/8402 )

Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8402/6/docs/administration.adoc
File docs/administration.adoc:

http://gerrit.cloudera.org:8080/#/c/8402/6/docs/administration.adoc@841
PS6, Line 841: .I
tiny nit: missing space

Gerrit-Comment-Date: Thu, 25 Jan 2018 00:13:14 +
[kudu-CR] [docs] Document how to recover from a majority failed tablet
Hello Mike Percy, Jean-Daniel Cryans, Kudu Jenkins, Todd Lipcon,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8402

to look at the new patch set (#6).

[docs] Document how to recover from a majority failed tablet

This adds some docs on how to recover when a tablet can no longer find
a majority due to the permanent failure of replicas.

I tested this procedure by failing tablets in various ways:
- deleting important bits like cmeta or tablet metadata
- deleting entire data dirs
- tombstoning 2/3 replicas (and disabling tombstoned voting)
and I was always able to recover using these instructions.

Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
---
M docs/administration.adoc
1 file changed, 65 insertions(+), 0 deletions(-)

git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/02/8402/6
[kudu-CR] [docs] Document how to recover from a majority failed tablet
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/8402 )

Patch Set 5:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/8402/5/docs/administration.adoc
File docs/administration.adoc:

http://gerrit.cloudera.org:8080/#/c/8402/5/docs/administration.adoc@809
PS5, Line 809: that's
> style: I think it's easier to read "that has"
Done

http://gerrit.cloudera.org:8080/#/c/8402/5/docs/administration.adoc@812
PS5, Line 812: Permanent data loss is
: possible
> I think this isn't quite clear that permanent data loss is possible _by fol
Done

http://gerrit.cloudera.org:8080/#/c/8402/5/docs/administration.adoc@839
PS5, Line 839: To revive the tablet
> maybe here say something like "to accept the potential data loss and restor
Done

http://gerrit.cloudera.org:8080/#/c/8402/5/docs/administration.adoc@845
PS5, Line 845: r tserver-00
> nit: use `...` around hostnames
Done

Gerrit-Comment-Date: Thu, 11 Jan 2018 18:53:52 +
[kudu-CR] [docs] Document how to recover from a majority failed tablet
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/8402 )

Patch Set 5:

(4 comments)

This doc is nice. I wonder if we could automate the whole thing, though, into something like 'kudu tablet unsafe_promote_minority' or somesuch? (not that we shouldn't commit this in the meantime)

http://gerrit.cloudera.org:8080/#/c/8402/5/docs/administration.adoc
File docs/administration.adoc:

http://gerrit.cloudera.org:8080/#/c/8402/5/docs/administration.adoc@809
PS5, Line 809: that's
style: I think it's easier to read "that has"

http://gerrit.cloudera.org:8080/#/c/8402/5/docs/administration.adoc@812
PS5, Line 812: Permanent data loss is
: possible
I think this isn't quite clear that permanent data loss is possible _by following these steps_, i.e. even if you run these steps, you may have lost the most recent edits from the tablet. The way it's written makes it sound like "maybe these steps won't work".

http://gerrit.cloudera.org:8080/#/c/8402/5/docs/administration.adoc@839
PS5, Line 839: To revive the tablet
maybe here say something like "to accept the potential data loss and restore from the remaining replica"

http://gerrit.cloudera.org:8080/#/c/8402/5/docs/administration.adoc@845
PS5, Line 845: r tserver-00
nit: use `...` around hostnames

Gerrit-Comment-Date: Thu, 11 Jan 2018 02:01:45 +
[kudu-CR] [docs] Document how to recover from a majority failed tablet
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8402 )

Patch Set 5:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc
File docs/administration.adoc:

http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc@723
PS4, Line 723: +
> The procedure works if the leader doesn't survive, but yes the chance of da
Ah, that's right, the leader doesn't have to survive.

http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc@760
PS4, Line 760:
> OK, if you're sure about this. I had a couple of situations in my testing w
Are you sure? The only scenario I know of where automatic deletion doesn't work is if the tserver somehow changed its UUID.

Gerrit-Comment-Date: Fri, 05 Jan 2018 21:01:31 +
[kudu-CR] [docs] Document how to recover from a majority failed tablet
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/8402 )

Patch Set 4:

Rendering at https://github.com/wdberkeley/kudu/blob/majorityrecoverydocs/docs/administration.adoc#tablet_majority_down_recovery

Gerrit-Comment-Date: Thu, 04 Jan 2018 18:37:09 +
[kudu-CR] [docs] Document how to recover from a majority failed tablet
Hello Mike Percy, Jean-Daniel Cryans, Kudu Jenkins, Todd Lipcon,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8402

to look at the new patch set (#5).

[docs] Document how to recover from a majority failed tablet

This adds some docs on how to recover when a tablet can no longer find
a majority due to the permanent failure of replicas.

I tested this procedure by failing tablets in various ways:
- deleting important bits like cmeta or tablet metadata
- deleting entire data dirs
- tombstoning 2/3 replicas (and disabling tombstoned voting)
and I was always able to recover using these instructions.

Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
---
M docs/administration.adoc
1 file changed, 59 insertions(+), 0 deletions(-)

git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/02/8402/5
[kudu-CR] [docs] Document how to recover from a majority failed tablet
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/8402 )

Patch Set 4:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc
File docs/administration.adoc:

http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc@709
PS4, Line 709: Reviving a tablet that's lost a majority of replicas
> how about: Bringing a tablet that's lost a majority of replicas back online
Done

http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc@711
PS4, Line 711: If a tablet has permanently lost a majority of its replicas, it cannot recover
> It is critical to emphasize that in a majority-lost scenario, permanent dat
Done

http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc@723
PS4, Line 723: 638a20403e3e4ae3b55d4d07d920e6de (tserver-00:7150): RUNNING [LEADER]
> This is kind of a cool scenario but this whole thing only works if the lead
The procedure works if the leader doesn't survive, but yes the chance of data loss is much higher then.

http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc@760
PS4, Line 760: $ kudu remote_replica delete tserver-01:7150 e822cab6c0584bc0858219d1539a17e6 "delete failed replica"
> this is not actually required; the master should do it automatically once t
OK, if you're sure about this. I had a couple of situations in my testing where I had to do the deletion manually, but they were mock situations that probably should be treated as disk failure if they actually happened, e.g. deleting consensus metadata files.

http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc@767
PS4, Line 767: [source,bash]
:
: $ kudu remote_replica unsafe_change_config ...
:
> I found this confusing. It seems like a command, I was trying to figure out
Done

http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc@775
PS4, Line 775: [source,bash]
> If you are going to put this in, at least mark it with a label like "Exampl
Removed

http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc@777
PS4, Line 777: $ kudu remote_replica unsafe_change_config tserver-00:7150 e822cab6c0584bc0858219d1539a17e6 638a20403e3e4ae3b55d4d07d920e6de
> Because having a long UUID for tablet_id and UUID for tablet server id can
Done

Gerrit-Comment-Date: Thu, 04 Jan 2018 18:31:26 +
[kudu-CR] [docs] Document how to recover from a majority failed tablet
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8402 )

Patch Set 4:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc
File docs/administration.adoc:

http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc@709
PS4, Line 709: Reviving a tablet that's lost a majority of replicas
how about: Bringing a tablet that's lost a majority of replicas back online

http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc@711
PS4, Line 711: If a tablet has permanently lost a majority of its replicas, it cannot recover
It is critical to emphasize that in a majority-lost scenario, permanent data loss is likely, and in fact there is no guarantee that any data can be recovered. It may only be due to luck that they get some or all of their data back after this procedure. We should also emphasize that this procedure should only be performed if it is not possible to bring the majority back online.

http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc@723
PS4, Line 723: 638a20403e3e4ae3b55d4d07d920e6de (tserver-00:7150): RUNNING [LEADER]
This is kind of a cool scenario but this whole thing only works if the leader survives. I think it's worth indicating how to handle this when the leader did not survive as well and a discussion around the implications of that.

Actually, if the leader survives, the likelihood of losing data is much lower (although not zero, because it could have been an old, partitioned leader in some nasty cases)

http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc@760
PS4, Line 760: $ kudu remote_replica delete tserver-01:7150 e822cab6c0584bc0858219d1539a17e6 "delete failed replica"
this is not actually required; the master should do it automatically once they get evicted when we do the unsafe config change

http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc@767
PS4, Line 767: [source,bash]
:
: $ kudu remote_replica unsafe_change_config ...
:
I found this confusing. It seems like a command, I was trying to figure out who uuid1 and uuid2 were and why we're changing the config to those two, etc. I think we need to pick one of the "prototype" or the "example" for the same command. I actually think the prototype (this example) is more useful than the one below, except that you indicate a "uuid2" which doesn't apply here.

http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc@775
PS4, Line 775: [source,bash]
If you are going to put this in, at least mark it with a label like "Example:"

http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc@777
PS4, Line 777: $ kudu remote_replica unsafe_change_config tserver-00:7150 e822cab6c0584bc0858219d1539a17e6 638a20403e3e4ae3b55d4d07d920e6de
Because having a long UUID for tablet_id and UUID for tablet server id can be confusing, and these example uuids are never going to actually be what a user would paste in, I think something that is sort of a compromise of what you wrote on line 770 and what is here on line 777 would be ideal:

$ kudu remote_replica unsafe_change_config tserver-00:7150

explaining that tserver-000-uuid would be the tablet server UUID of the remaining replica on tserver-00

Gerrit-Comment-Date: Fri, 15 Dec 2017 22:46:36 +
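One way to address the concern about two long, look-alike identifiers is to give each argument a name before assembling the command. The sketch below is a hypothetical wrapper, not part of the kudu CLI: `is_uuid` and `unsafe_change_config_cmd` are made-up names, and the check for 32 lowercase hex characters is an assumption based only on the two identifiers quoted in this thread. It prints the command rather than running it.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the kudu CLI. It only prints a command.

# Assumption: ids look like the ones quoted in the review (32 lowercase hex chars).
is_uuid() {
  [[ $1 =~ ^[0-9a-f]{32}$ ]]
}

# Arguments: <tserver address> <tablet id> <uuid of the surviving tablet server>
unsafe_change_config_cmd() {
  local tserver_addr=$1 tablet_id=$2 tserver_uuid=$3
  if ! is_uuid "$tablet_id"; then
    echo "tablet id does not look like a 32-character hex id: $tablet_id" >&2
    return 1
  fi
  if ! is_uuid "$tserver_uuid"; then
    echo "tablet server uuid does not look like a 32-character hex id: $tserver_uuid" >&2
    return 1
  fi
  echo "kudu remote_replica unsafe_change_config $tserver_addr $tablet_id $tserver_uuid"
}

# Values taken from the example on PS4, Line 777.
unsafe_change_config_cmd tserver-00:7150 \
  e822cab6c0584bc0858219d1539a17e6 \
  638a20403e3e4ae3b55d4d07d920e6de
```

Naming the positions makes it harder to swap the tablet id and the tablet server UUID by accident, though the format check cannot tell one kind of id from the other.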
[kudu-CR] [docs] Document how to recover from a majority failed tablet
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/8402 )

Patch Set 4:

> Mind pushing a rev of this to your personal GitHub so we can read
> it rendered?

Check out https://github.com/wdberkeley/kudu/blob/showdocs/docs/administration.adoc#tablet_majority_down_recovery

Gerrit-Comment-Date: Fri, 15 Dec 2017 18:45:51 +
[kudu-CR] [docs] Document how to recover from a majority failed tablet
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8402 )

Patch Set 4:

Mind pushing a rev of this to your personal GitHub so we can read it rendered?

Gerrit-Comment-Date: Fri, 15 Dec 2017 01:25:35 +
[kudu-CR] [docs] Document how to recover from a majority failed tablet
Mike Percy has removed a vote on this change.

Removed Verified-1 by Kudu Jenkins (120)
[kudu-CR] [docs] Document how to recover from a majority failed tablet
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8402 ) Change subject: [docs] Document how to recover from a majority failed tablet .. Patch Set 4: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/8402 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215 Gerrit-Change-Number: 8402 Gerrit-PatchSet: 4 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Jean-Daniel Cryans Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 14 Dec 2017 19:05:17 + Gerrit-HasComments: No
[kudu-CR] [docs] Document how to recover from a majority failed tablet
Hello Kudu Jenkins,

I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8402 to look at the new patch set (#4).

Change subject: [docs] Document how to recover from a majority failed tablet
..

[docs] Document how to recover from a majority failed tablet

This adds some docs on how to recover when a tablet can no longer find a majority due to the permanent failure of replicas. Manual intervention is required, and basically boils down to:

1. Tombstone the failed replicas. This deletes their data and allows Kudu to overwrite the failed replicas, if necessary. Failing to do this in certain situations prevents the automatic recovery of the tablet after step 2.
2. Eject the failed replicas from the consensus configuration, so the remaining healthy replicas can elect a leader. From this point, the master orchestrates automatic re-replication of the tablet.

I tested this procedure by failing tablets in various ways:

- deleting important bits like cmeta or tablet metadata
- deleting entire data dirs
- tombstoning 2/3 replicas (and disabling tombstoned voting)

and I was always able to recover using these instructions.

Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
---
M docs/administration.adoc
1 file changed, 80 insertions(+), 0 deletions(-)

git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/02/8402/4
--
To view, visit http://gerrit.cloudera.org:8080/8402
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
Gerrit-Change-Number: 8402
Gerrit-PatchSet: 4
Gerrit-Owner: Will Berkeley
Gerrit-Reviewer: Kudu Jenkins
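Why ejecting the failed replicas lets the survivors elect a leader can be illustrated with a toy majority calculation. This is only a sketch of the general majority-vote rule, not Kudu code: the names `majority_size` and `can_elect_leader` and the replica labels `A`, `B`, `C` are hypothetical, and the model ignores terms, logs, and everything else a real consensus implementation tracks.

```python
from __future__ import annotations


def majority_size(num_voters: int) -> int:
    """Smallest number of votes that is a strict majority of num_voters."""
    return num_voters // 2 + 1


def can_elect_leader(config: set[str], healthy: set[str]) -> bool:
    """True if the healthy members of the config can form a majority.

    Only replicas listed in the consensus config count as voters, so a
    healthy replica outside the config does not help.
    """
    live_voters = config & healthy
    return len(live_voters) >= majority_size(len(config))


# Hypothetical 3-replica tablet where B and C have failed permanently.
config = {"A", "B", "C"}
healthy = {"A"}

# One live voter out of three is not a majority, so no leader can be elected.
print(can_elect_leader(config, healthy))

# Ejecting the failed replicas shrinks the config to the healthy ones;
# the single remaining voter is now a majority of one.
shrunk_config = config & healthy
print(can_elect_leader(shrunk_config, healthy))
```

In this model the tablet is stuck with the original three-member configuration and becomes electable once the configuration contains only the healthy replica, which corresponds to step 2 of the commit message. Re-replication back to full strength is handled separately, by the master according to the message.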
[kudu-CR] [docs] Document how to recover from a majority failed tablet
Hello Kudu Jenkins,

I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8402 to look at the new patch set (#3).

Change subject: [docs] Document how to recover from a majority failed tablet
..

[docs] Document how to recover from a majority failed tablet

This adds some docs on how to recover when a tablet can no longer find a majority due to the permanent failure of replicas. Manual intervention is required, and basically boils down to:

1. copy the data from a healthy replica to where the revived replicas will be
2. set the consensus configuration of the tablet so it matches the new locations of replicas

Step 2 requires downtime even for healthy replicas, since new servers can't be added to consensus configs without either rewriting the on-disk cmeta or having a majority available. It might be worth allowing a tool to bypass this restriction so that healthy tablet servers don't need to be shut down in order to recover tablets on unhealthy ones.

I tested this procedure by failing tablets in various ways:

- deleting important bits like cmeta or tablet metadata
- deleting entire data dirs
- tombstoning 2/3 replicas (and disabling tombstoned voting)

and I was always able to recover using these instructions.

Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
---
M docs/administration.adoc
1 file changed, 104 insertions(+), 0 deletions(-)

git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/02/8402/3
--
To view, visit http://gerrit.cloudera.org:8080/8402
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
Gerrit-Change-Number: 8402
Gerrit-PatchSet: 3
Gerrit-Owner: Will Berkeley
Gerrit-Reviewer: Kudu Jenkins