[kudu-CR] [docs] Document how to recover from a majority failed tablet

2018-02-13 Thread Will Berkeley (Code Review)
Will Berkeley has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/8402 )

Change subject: [docs] Document how to recover from a majority failed tablet
..

[docs] Document how to recover from a majority failed tablet

This adds some docs on how to recover when a tablet can no longer find
a majority due to the permanent failure of replicas.

I tested this procedure by failing tablets in various ways:
- deleting important bits like cmeta or tablet metadata
- deleting entire data dirs
- tombstoning 2/3 replicas (and disabling tombstoned voting)
and I was always able to recover using these instructions.

Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
Reviewed-on: http://gerrit.cloudera.org:8080/8402
Reviewed-by: Mike Percy 
Tested-by: Will Berkeley 
---
M docs/administration.adoc
1 file changed, 65 insertions(+), 0 deletions(-)

Approvals:
  Mike Percy: Looks good to me, approved
  Will Berkeley: Verified

--
To view, visit http://gerrit.cloudera.org:8080/8402
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
Gerrit-Change-Number: 8402
Gerrit-PatchSet: 10
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Alex Rodoni 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Will Berkeley 


[kudu-CR] [docs] Document how to recover from a majority failed tablet

2018-02-13 Thread Will Berkeley (Code Review)
Will Berkeley has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8402 )

Change subject: [docs] Document how to recover from a majority failed tablet
..


Patch Set 9: Verified+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8402/8/docs/administration.adoc
File docs/administration.adoc:

http://gerrit.cloudera.org:8080/#/c/8402/8/docs/administration.adoc@814
PS8, Line 814: . Only attem
> s/ is possible//
Done



--

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
Gerrit-Change-Number: 8402
Gerrit-PatchSet: 9
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Alex Rodoni 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Will Berkeley 
Gerrit-Comment-Date: Tue, 13 Feb 2018 21:07:42 +
Gerrit-HasComments: Yes


[kudu-CR] [docs] Document how to recover from a majority failed tablet

2018-02-13 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8402 )

Change subject: [docs] Document how to recover from a majority failed tablet
..


Patch Set 9: Code-Review+2


--

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
Gerrit-Change-Number: 8402
Gerrit-PatchSet: 9
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Alex Rodoni 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Will Berkeley 
Gerrit-Comment-Date: Tue, 13 Feb 2018 20:41:43 +
Gerrit-HasComments: No


[kudu-CR] [docs] Document how to recover from a majority failed tablet

2018-02-13 Thread Will Berkeley (Code Review)
Hello Alex Rodoni, Mike Percy, Jean-Daniel Cryans, Kudu Jenkins, Todd Lipcon,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8402

to look at the new patch set (#9).

Change subject: [docs] Document how to recover from a majority failed tablet
..

[docs] Document how to recover from a majority failed tablet

This adds some docs on how to recover when a tablet can no longer find
a majority due to the permanent failure of replicas.

I tested this procedure by failing tablets in various ways:
- deleting important bits like cmeta or tablet metadata
- deleting entire data dirs
- tombstoning 2/3 replicas (and disabling tombstoned voting)
and I was always able to recover using these instructions.

Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
---
M docs/administration.adoc
1 file changed, 65 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/02/8402/9
--

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
Gerrit-Change-Number: 8402
Gerrit-PatchSet: 9
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Alex Rodoni 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Will Berkeley 


[kudu-CR] [docs] Document how to recover from a majority failed tablet

2018-02-12 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8402 )

Change subject: [docs] Document how to recover from a majority failed tablet
..


Patch Set 8: Code-Review+1

(1 comment)

lgtm, only a nit

http://gerrit.cloudera.org:8080/#/c/8402/8/docs/administration.adoc
File docs/administration.adoc:

http://gerrit.cloudera.org:8080/#/c/8402/8/docs/administration.adoc@814
PS8, Line 814:  is possible
s/ is possible//



--

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
Gerrit-Change-Number: 8402
Gerrit-PatchSet: 8
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Alex Rodoni 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Will Berkeley 
Gerrit-Comment-Date: Tue, 13 Feb 2018 01:38:12 +
Gerrit-HasComments: Yes


[kudu-CR] [docs] Document how to recover from a majority failed tablet

2018-02-08 Thread Will Berkeley (Code Review)
Hello Alex Rodoni, Mike Percy, Jean-Daniel Cryans, Kudu Jenkins, Todd Lipcon,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8402

to look at the new patch set (#8).

Change subject: [docs] Document how to recover from a majority failed tablet
..

[docs] Document how to recover from a majority failed tablet

This adds some docs on how to recover when a tablet can no longer find
a majority due to the permanent failure of replicas.

I tested this procedure by failing tablets in various ways:
- deleting important bits like cmeta or tablet metadata
- deleting entire data dirs
- tombstoning 2/3 replicas (and disabling tombstoned voting)
and I was always able to recover using these instructions.

Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
---
M docs/administration.adoc
1 file changed, 65 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/02/8402/8
--

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
Gerrit-Change-Number: 8402
Gerrit-PatchSet: 8
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Alex Rodoni 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Will Berkeley 


[kudu-CR] [docs] Document how to recover from a majority failed tablet

2018-02-08 Thread Will Berkeley (Code Review)
Will Berkeley has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8402 )

Change subject: [docs] Document how to recover from a majority failed tablet
..


Patch Set 7:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/8402/7/docs/administration.adoc
File docs/administration.adoc:

http://gerrit.cloudera.org:8080/#/c/8402/7/docs/administration.adoc@811
PS7, Line 811: majority
> is there a quantifiable definition of "majority"?
The normal definition of a majority, i.e. "50% plus one": strictly more than half of the replicas, so 2 of 3 or 3 of 5.
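
To make that concrete, here is a minimal sketch, assuming the usual consensus reading of "majority" as strictly more than half of a tablet's voting replicas. The function names `majority_size` and `has_majority` are illustrative only and are not part of Kudu.

```shell
# Smallest number of replicas that is strictly more than half of the total.
# Integer division rounds down, so 3 replicas -> 2 and 5 replicas -> 3.
majority_size() {
  echo $(( $1 / 2 + 1 ))
}

# Succeeds (exit status 0) when the healthy replicas still form a majority.
# Usage: has_majority <healthy_replicas> <total_replicas>
has_majority() {
  [ "$1" -ge "$(majority_size "$2")" ]
}
```

With three replicas, `has_majority 1 3` fails: a single surviving replica cannot form a majority on its own, which is the situation this doc section addresses.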


http://gerrit.cloudera.org:8080/#/c/8402/7/docs/administration.adoc@813
PS7, Line 813: and so
> potentially resulting in permanent data loss.
Done


http://gerrit.cloudera.org:8080/#/c/8402/7/docs/administration.adoc@840
PS7, Line 840: ,
> remove ","
Done


http://gerrit.cloudera.org:8080/#/c/8402/7/docs/administration.adoc@844
PS7, Line 844: ,
> Remove ,
Done


http://gerrit.cloudera.org:8080/#/c/8402/7/docs/administration.adoc@844
PS7, Line 844: those
> Those
Done



--

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
Gerrit-Change-Number: 8402
Gerrit-PatchSet: 7
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Alex Rodoni 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Will Berkeley 
Gerrit-Comment-Date: Thu, 08 Feb 2018 18:40:39 +
Gerrit-HasComments: Yes


[kudu-CR] [docs] Document how to recover from a majority failed tablet

2018-02-06 Thread Alex Rodoni (Code Review)
Alex Rodoni has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8402 )

Change subject: [docs] Document how to recover from a majority failed tablet
..


Patch Set 7:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/8402/7/docs/administration.adoc
File docs/administration.adoc:

http://gerrit.cloudera.org:8080/#/c/8402/7/docs/administration.adoc@811
PS7, Line 811: majority
is there a quantifiable definition of "majority"?


http://gerrit.cloudera.org:8080/#/c/8402/7/docs/administration.adoc@813
PS7, Line 813: and so
potentially resulting in permanent data loss.


http://gerrit.cloudera.org:8080/#/c/8402/7/docs/administration.adoc@840
PS7, Line 840: ,
remove ","


http://gerrit.cloudera.org:8080/#/c/8402/7/docs/administration.adoc@844
PS7, Line 844: ,
Remove ,


http://gerrit.cloudera.org:8080/#/c/8402/7/docs/administration.adoc@844
PS7, Line 844: those
Those



--

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
Gerrit-Change-Number: 8402
Gerrit-PatchSet: 7
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Alex Rodoni 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Will Berkeley 
Gerrit-Comment-Date: Tue, 06 Feb 2018 23:37:08 +
Gerrit-HasComments: Yes


[kudu-CR] [docs] Document how to recover from a majority failed tablet

2018-02-05 Thread Will Berkeley (Code Review)
Will Berkeley has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8402 )

Change subject: [docs] Document how to recover from a majority failed tablet
..


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8402/6/docs/administration.adoc
File docs/administration.adoc:

http://gerrit.cloudera.org:8080/#/c/8402/6/docs/administration.adoc@841
PS6, Line 841: .I
> tiny nit: missing space
Done



--

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
Gerrit-Change-Number: 8402
Gerrit-PatchSet: 6
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Will Berkeley 
Gerrit-Comment-Date: Mon, 05 Feb 2018 17:58:39 +
Gerrit-HasComments: Yes


[kudu-CR] [docs] Document how to recover from a majority failed tablet

2018-02-05 Thread Will Berkeley (Code Review)
Hello Mike Percy, Jean-Daniel Cryans, Kudu Jenkins, Todd Lipcon,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8402

to look at the new patch set (#7).

Change subject: [docs] Document how to recover from a majority failed tablet
..

[docs] Document how to recover from a majority failed tablet

This adds some docs on how to recover when a tablet can no longer find
a majority due to the permanent failure of replicas.

I tested this procedure by failing tablets in various ways:
- deleting important bits like cmeta or tablet metadata
- deleting entire data dirs
- tombstoning 2/3 replicas (and disabling tombstoned voting)
and I was always able to recover using these instructions.

Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
---
M docs/administration.adoc
1 file changed, 65 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/02/8402/7
--

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
Gerrit-Change-Number: 8402
Gerrit-PatchSet: 7
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Will Berkeley 


[kudu-CR] [docs] Document how to recover from a majority failed tablet

2018-01-24 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8402 )

Change subject: [docs] Document how to recover from a majority failed tablet
..


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8402/6/docs/administration.adoc
File docs/administration.adoc:

http://gerrit.cloudera.org:8080/#/c/8402/6/docs/administration.adoc@841
PS6, Line 841: .I
tiny nit: missing space



--

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
Gerrit-Change-Number: 8402
Gerrit-PatchSet: 6
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Will Berkeley 
Gerrit-Comment-Date: Thu, 25 Jan 2018 00:13:14 +
Gerrit-HasComments: Yes


[kudu-CR] [docs] Document how to recover from a majority failed tablet

2018-01-11 Thread Will Berkeley (Code Review)
Hello Mike Percy, Jean-Daniel Cryans, Kudu Jenkins, Todd Lipcon,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8402

to look at the new patch set (#6).

Change subject: [docs] Document how to recover from a majority failed tablet
..

[docs] Document how to recover from a majority failed tablet

This adds some docs on how to recover when a tablet can no longer find
a majority due to the permanent failure of replicas.

I tested this procedure by failing tablets in various ways:
- deleting important bits like cmeta or tablet metadata
- deleting entire data dirs
- tombstoning 2/3 replicas (and disabling tombstoned voting)
and I was always able to recover using these instructions.

Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
---
M docs/administration.adoc
1 file changed, 65 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/02/8402/6
--

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
Gerrit-Change-Number: 8402
Gerrit-PatchSet: 6
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Will Berkeley 


[kudu-CR] [docs] Document how to recover from a majority failed tablet

2018-01-11 Thread Will Berkeley (Code Review)
Will Berkeley has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8402 )

Change subject: [docs] Document how to recover from a majority failed tablet
..


Patch Set 5:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/8402/5/docs/administration.adoc
File docs/administration.adoc:

http://gerrit.cloudera.org:8080/#/c/8402/5/docs/administration.adoc@809
PS5, Line 809: that's
> style: I think it's easier to read "that has"
Done


http://gerrit.cloudera.org:8080/#/c/8402/5/docs/administration.adoc@812
PS5, Line 812:  Permanent data loss is
 : possible
> I think this isn't quite clear that permanent data loss is possible _by fol
Done


http://gerrit.cloudera.org:8080/#/c/8402/5/docs/administration.adoc@839
PS5, Line 839: To revive the tablet
> maybe here say something like "to accept the potential data loss and restor
Done


http://gerrit.cloudera.org:8080/#/c/8402/5/docs/administration.adoc@845
PS5, Line 845: r tserver-00
> nit: use `...` around hostnames
Done



--

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
Gerrit-Change-Number: 8402
Gerrit-PatchSet: 5
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Will Berkeley 
Gerrit-Comment-Date: Thu, 11 Jan 2018 18:53:52 +
Gerrit-HasComments: Yes


[kudu-CR] [docs] Document how to recover from a majority failed tablet

2018-01-10 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8402 )

Change subject: [docs] Document how to recover from a majority failed tablet
..


Patch Set 5:

(4 comments)

This doc is nice. I wonder if we could automate the whole thing, though, into 
something like 'kudu tablet unsafe_promote_minority' or somesuch? (not that we 
shouldn't commit this in the meantime)

http://gerrit.cloudera.org:8080/#/c/8402/5/docs/administration.adoc
File docs/administration.adoc:

http://gerrit.cloudera.org:8080/#/c/8402/5/docs/administration.adoc@809
PS5, Line 809: that's
style: I think it's easier to read "that has"


http://gerrit.cloudera.org:8080/#/c/8402/5/docs/administration.adoc@812
PS5, Line 812:  Permanent data loss is
 : possible
I think this doesn't make it clear that permanent data loss is possible _by 
following these steps_, i.e. even if you run these steps, you may have lost the 
most recent edits from the tablet. The way it's written makes it sound like 
"maybe these steps won't work".


http://gerrit.cloudera.org:8080/#/c/8402/5/docs/administration.adoc@839
PS5, Line 839: To revive the tablet
maybe here say something like "to accept the potential data loss and restore 
from the remaining replica"


http://gerrit.cloudera.org:8080/#/c/8402/5/docs/administration.adoc@845
PS5, Line 845: r tserver-00
nit: use `...` around hostnames



--

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
Gerrit-Change-Number: 8402
Gerrit-PatchSet: 5
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Will Berkeley 
Gerrit-Comment-Date: Thu, 11 Jan 2018 02:01:45 +
Gerrit-HasComments: Yes


[kudu-CR] [docs] Document how to recover from a majority failed tablet

2018-01-05 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8402 )

Change subject: [docs] Document how to recover from a majority failed tablet
..


Patch Set 5:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc
File docs/administration.adoc:

http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc@723
PS4, Line 723: +
> The procedure works if the leader doesn't survive, but yes the chance of da
Ah, that's right, the leader doesn't have to survive.


http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc@760
PS4, Line 760:
> OK, if you're sure about this. I had a couple of situations in my testing w
Are you sure? The only scenario I know of where automatic deletion doesn't work 
is if the tserver somehow changed its UUID.



--

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
Gerrit-Change-Number: 8402
Gerrit-PatchSet: 5
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Will Berkeley 
Gerrit-Comment-Date: Fri, 05 Jan 2018 21:01:31 +
Gerrit-HasComments: Yes


[kudu-CR] [docs] Document how to recover from a majority failed tablet

2018-01-04 Thread Will Berkeley (Code Review)
Will Berkeley has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8402 )

Change subject: [docs] Document how to recover from a majority failed tablet
..


Patch Set 4:

Rendering at 
https://github.com/wdberkeley/kudu/blob/majorityrecoverydocs/docs/administration.adoc#tablet_majority_down_recovery


--

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
Gerrit-Change-Number: 8402
Gerrit-PatchSet: 4
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Will Berkeley 
Gerrit-Comment-Date: Thu, 04 Jan 2018 18:37:09 +
Gerrit-HasComments: No


[kudu-CR] [docs] Document how to recover from a majority failed tablet

2018-01-04 Thread Will Berkeley (Code Review)
Hello Mike Percy, Jean-Daniel Cryans, Kudu Jenkins, Todd Lipcon,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8402

to look at the new patch set (#5).

Change subject: [docs] Document how to recover from a majority failed tablet
..

[docs] Document how to recover from a majority failed tablet

This adds some docs on how to recover when a tablet can no longer find
a majority due to the permanent failure of replicas.

I tested this procedure by failing tablets in various ways:
- deleting important bits like cmeta or tablet metadata
- deleting entire data dirs
- tombstoning 2/3 replicas (and disabling tombstoned voting)
and I was always able to recover using these instructions.

Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
---
M docs/administration.adoc
1 file changed, 59 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/02/8402/5
--

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
Gerrit-Change-Number: 8402
Gerrit-PatchSet: 5
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Will Berkeley 


[kudu-CR] [docs] Document how to recover from a majority failed tablet

2018-01-04 Thread Will Berkeley (Code Review)
Will Berkeley has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8402 )

Change subject: [docs] Document how to recover from a majority failed tablet
..


Patch Set 4:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc
File docs/administration.adoc:

http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc@709
PS4, Line 709: Reviving a tablet that's lost a majority of replicas
> how about: Bringing a tablet that's lost a majority of replicas back online
Done


http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc@711
PS4, Line 711: If a tablet has permanently lost a majority of its replicas, it 
cannot recover
> It is critical to emphasize that in a majority-lost scenario, permanent dat
Done


http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc@723
PS4, Line 723:   638a20403e3e4ae3b55d4d07d920e6de (tserver-00:7150): RUNNING [LEADER]
> This is kind of a cool scenario but this whole thing only works if the lead
The procedure still works if the leader doesn't survive, but yes, the chance of 
data loss is much higher in that case.


http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc@760
PS4, Line 760: $ kudu remote_replica delete tserver-01:7150 e822cab6c0584bc0858219d1539a17e6 "delete failed replica"
> this is not actually required; the master should do it automatically once t
OK, if you're sure about this. I had a couple of situations in my testing where 
I had to do the deletion manually, but they were mock situations that probably 
should be treated as disk failure if they actually happened, e.g. deleting 
consensus metadata files.
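
For reference, the manual deletion discussed here is the command quoted at line 760 of the patch. Below is a minimal dry-run sketch that only prints the command instead of running it. The helper name `print_replica_delete` is hypothetical, and the argument order (tablet server address, tablet ID, reason) is an assumption taken from that quoted example.

```shell
# Print, but do not execute, the manual replica deletion command.
# Usage: print_replica_delete <tserver_address> <tablet_id> <reason>
# (argument order assumed from the example quoted in this review)
print_replica_delete() {
  printf 'kudu remote_replica delete %s %s "%s"\n' "$1" "$2" "$3"
}

# Values copied from the example quoted at line 760 of the patch.
print_replica_delete tserver-01:7150 e822cab6c0584bc0858219d1539a17e6 "delete failed replica"
```

Printing first makes it easy to double-check the tablet ID and target tablet server before deleting anything by hand.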


http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc@767
PS4, Line 767: [source,bash]
 : 
 : $ kudu remote_replica unsafe_change_config  
   ...
 : 
> I found this confusing. It seems like a command, I was trying to figure out
Done


http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc@775
PS4, Line 775: [source,bash]
> If you are going to put this in, at least mark it with a label like "Exampl
Removed


http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc@777
PS4, Line 777: $ kudu remote_replica unsafe_change_config tserver-00:7150 e822cab6c0584bc0858219d1539a17e6 638a20403e3e4ae3b55d4d07d920e6de
> Because having a long UUID for tablet_id and UUID for tablet server id can
Done



--

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
Gerrit-Change-Number: 8402
Gerrit-PatchSet: 4
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Will Berkeley 
Gerrit-Comment-Date: Thu, 04 Jan 2018 18:31:26 +
Gerrit-HasComments: Yes


[kudu-CR] [docs] Document how to recover from a majority failed tablet

2017-12-15 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8402 )

Change subject: [docs] Document how to recover from a majority failed tablet
..


Patch Set 4:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc
File docs/administration.adoc:

http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc@709
PS4, Line 709: Reviving a tablet that's lost a majority of replicas
how about: Bringing a tablet that's lost a majority of replicas back online


http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc@711
PS4, Line 711: If a tablet has permanently lost a majority of its replicas, it 
cannot recover
It is critical to emphasize that in a majority-lost scenario, permanent data 
loss is likely, and in fact there is no guarantee that any data can be 
recovered. It may only be due to luck that they get some or all of their data 
back after this procedure. We should also emphasize that this procedure should 
only be performed if it is not possible to bring the majority back online.


http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc@723
PS4, Line 723:   638a20403e3e4ae3b55d4d07d920e6de (tserver-00:7150): RUNNING [LEADER]
This is kind of a cool scenario but this whole thing only works if the leader 
survives. I think it's worth indicating how to handle this when the leader did 
not survive as well and a discussion around the implications of that. Actually, 
if the leader survives, the likelihood of losing data is much lower (although 
not zero, because it could have been an old, partitioned leader in some nasty 
cases)


http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc@760
PS4, Line 760: $ kudu remote_replica delete tserver-01:7150 e822cab6c0584bc0858219d1539a17e6 "delete failed replica"
This is not actually required; the master should delete the failed replicas 
automatically once they are evicted by the unsafe config change.


http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc@767
PS4, Line 767: [source,bash]
 : 
 : $ kudu remote_replica unsafe_change_config  
   ...
 : 
I found this confusing. It seems like a command, I was trying to figure out who 
uuid1 and uuid2 were and why we're changing the config to those two, etc. I 
think we need to pick one of the "prototype" or the "example" for the same 
command. I actually think the prototype (this example) is more useful than the 
one below, except that you indicate a "uuid2" which doesn't apply here.


http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc@775
PS4, Line 775: [source,bash]
If you are going to put this in, at least mark it with a label like "Example:"


http://gerrit.cloudera.org:8080/#/c/8402/4/docs/administration.adoc@777
PS4, Line 777: $ kudu remote_replica unsafe_change_config tserver-00:7150 e822cab6c0584bc0858219d1539a17e6 638a20403e3e4ae3b55d4d07d920e6de
Because having a long UUID for tablet_id and UUID for tablet server id can be 
confusing, and these example uuids are never going to actually be what a user 
would paste in, I think something that is sort of a compromise of what you 
wrote on line 770 and what is here on line 777 would be ideal:

$ kudu remote_replica unsafe_change_config tserver-00:7150  


explaining that tserver-000-uuid would be the tablet server UUID of the 
remaining replica on tserver-00
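
A minimal sketch of the compromise form suggested above. The placeholder names did not survive in this message, so the argument order used here (tablet server address, tablet ID, UUID of the tablet server hosting the remaining replica) is an assumption based on the example quoted from line 777. `print_unsafe_change_config` is a hypothetical helper that only prints the command and never runs it.

```shell
# Print, but do not execute, the unsafe config change for a tablet whose
# only remaining replica lives on one tablet server.
# Usage: print_unsafe_change_config <tserver_address> <tablet_id> <tserver_uuid>
# (argument order assumed from the example quoted in this review)
print_unsafe_change_config() {
  printf 'kudu remote_replica unsafe_change_config %s %s %s\n' "$1" "$2" "$3"
}

# Example values copied from the command quoted at line 777 of the patch.
TABLET_ID=e822cab6c0584bc0858219d1539a17e6
TSERVER_00_UUID=638a20403e3e4ae3b55d4d07d920e6de
print_unsafe_change_config tserver-00:7150 "$TABLET_ID" "$TSERVER_00_UUID"
```

Naming the two 32-character IDs separately addresses the concern above that a tablet ID and a tablet server UUID are easy to confuse when pasted side by side.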



--

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
Gerrit-Change-Number: 8402
Gerrit-PatchSet: 4
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Will Berkeley 
Gerrit-Comment-Date: Fri, 15 Dec 2017 22:46:36 +
Gerrit-HasComments: Yes


[kudu-CR] [docs] Document how to recover from a majority failed tablet

2017-12-15 Thread Will Berkeley (Code Review)
Will Berkeley has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8402 )

Change subject: [docs] Document how to recover from a majority failed tablet
..


Patch Set 4:

> Mind pushing a rev of this to your personal GitHub so we can read
 > it rendered?

Check out 
https://github.com/wdberkeley/kudu/blob/showdocs/docs/administration.adoc#tablet_majority_down_recovery


--

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
Gerrit-Change-Number: 8402
Gerrit-PatchSet: 4
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Will Berkeley 
Gerrit-Comment-Date: Fri, 15 Dec 2017 18:45:51 +
Gerrit-HasComments: No


[kudu-CR] [docs] Document how to recover from a majority failed tablet

2017-12-14 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8402 )

Change subject: [docs] Document how to recover from a majority failed tablet
..


Patch Set 4:

Mind pushing a rev of this to your personal GitHub so we can read it rendered?


--

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
Gerrit-Change-Number: 8402
Gerrit-PatchSet: 4
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Fri, 15 Dec 2017 01:25:35 +
Gerrit-HasComments: No


[kudu-CR] [docs] Document how to recover from a majority failed tablet

2017-12-14 Thread Mike Percy (Code Review)
Mike Percy has removed a vote on this change.

Change subject: [docs] Document how to recover from a majority failed tablet
..


Removed Verified-1 by Kudu Jenkins (120)
--

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteVote
Gerrit-Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
Gerrit-Change-Number: 8402
Gerrit-PatchSet: 4
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] [docs] Document how to recover from a majority failed tablet

2017-12-14 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8402 )

Change subject: [docs] Document how to recover from a majority failed tablet
..


Patch Set 4: Verified+1


--

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
Gerrit-Change-Number: 8402
Gerrit-PatchSet: 4
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Thu, 14 Dec 2017 19:05:17 +
Gerrit-HasComments: No


[kudu-CR] [docs] Document how to recover from a majority failed tablet

2017-10-27 Thread Will Berkeley (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8402

to look at the new patch set (#4).

Change subject: [docs] Document how to recover from a majority failed tablet
..

[docs] Document how to recover from a majority failed tablet

This adds some docs on how to recover when a tablet can no longer find
a majority due to the permanent failure of replicas. Manual
intervention is required, and it boils down to two steps:

1. Tombstone the failed replicas. This deletes their data and
allows Kudu to overwrite the failed replicas, if necessary. Failing
to do this in certain situations prevents the automatic recovery of the
tablet after step 2.
2. Eject the failed replicas from the consensus configuration, so the
remaining healthy replicas can elect a leader. From this point, the
master orchestrates automatic re-replication of the tablet.
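
As a rough illustration of the two steps above, here is a dry-run sketch
that only prints the commands an operator might run; it does not contact
any server. The `unsafe_change_config` invocation follows the form quoted
elsewhere in this thread. Using `kudu remote_replica delete` for the
tombstoning step is an assumption on my part and may not be the command
the patch documents. The function name, the tablet ID, the UUID, and the
`tserver-01`/`tserver-02` addresses are all hypothetical placeholders.

```shell
#!/bin/sh
# Dry-run sketch: prints the recovery commands instead of executing them.
# Usage: plan_recovery TABLET_ID HEALTHY_ADDR HEALTHY_UUID FAILED_ADDR...
plan_recovery() {
  tablet_id=$1
  healthy_addr=$2
  healthy_uuid=$3
  shift 3

  # Step 1: tombstone each failed replica so Kudu may overwrite it later.
  # ASSUMPTION: `remote_replica delete` is used here for illustration.
  for failed_addr in "$@"; do
    printf '%s\n' "kudu remote_replica delete $failed_addr $tablet_id 'majority recovery'"
  done

  # Step 2: shrink the consensus config to the healthy replica only, so it
  # can elect itself leader and the master can re-replicate the tablet.
  printf '%s\n' "kudu remote_replica unsafe_change_config $healthy_addr $tablet_id $healthy_uuid"
}

# Placeholder values; substitute the real tablet ID and tablet server UUID.
plan_recovery TABLET_ID tserver-00:7150 TSERVER_00_UUID tserver-01:7150 tserver-02:7150
```

Printing the plan first gives the operator a chance to check the tablet
ID and UUIDs before running anything with "unsafe" in its name.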

I tested this procedure by failing tablets in various ways:
- deleting important bits like cmeta or tablet metadata
- deleting entire data dirs
- tombstoning 2/3 replicas (and disabling tombstoned voting)
and I was always able to recover using these instructions.

Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
---
M docs/administration.adoc
1 file changed, 80 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/02/8402/4
--

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
Gerrit-Change-Number: 8402
Gerrit-PatchSet: 4
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Kudu Jenkins


[kudu-CR] [docs] Document how to recover from a majority failed tablet

2017-10-26 Thread Will Berkeley (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8402

to look at the new patch set (#3).

Change subject: [docs] Document how to recover from a majority failed tablet
..

[docs] Document how to recover from a majority failed tablet

This adds some docs on how to recover when a tablet can no longer find
a majority due to the permanent failure of replicas. Manual
intervention is required, and it boils down to two steps:

1. copy the data from a healthy replica to where the revived replicas
will be
2. set the consensus configuration of the tablet so it matches the new
locations of replicas

Step 2 requires downtime even for healthy replicas, since new servers
can't be added to consensus configs without either rewriting the on-disk
cmeta or having a majority available. It might be worth allowing a tool
to bypass this restriction so that healthy tablet servers don't need to
be shut down in order to recover tablets on unhealthy ones.
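
The "majority available" condition mentioned above is simple arithmetic,
and a minimal sketch may make it concrete: a configuration of N voters
can make progress only while more than N/2 of them are healthy. The
`has_majority` helper is hypothetical and is not part of the kudu CLI.

```shell
#!/bin/sh
# has_majority HEALTHY TOTAL
# Succeeds (exit 0) when HEALTHY replicas form a strict majority of TOTAL.
has_majority() {
  [ $(( $1 * 2 )) -gt "$2" ]
}

# Example: 1 healthy replica left out of 3, as when 2/3 are tombstoned.
if has_majority 1 3; then
  echo "majority available: normal config changes can proceed"
else
  echo "majority lost: manual recovery is needed"
fi
```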

I tested this procedure by failing tablets in various ways:
- deleting important bits like cmeta or tablet metadata
- deleting entire data dirs
- tombstoning 2/3 replicas (and disabling tombstoned voting)
and I was always able to recover using these instructions.

Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
---
M docs/administration.adoc
1 file changed, 104 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/02/8402/3
--

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
Gerrit-Change-Number: 8402
Gerrit-PatchSet: 3
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Kudu Jenkins