[kudu-CR] disk failure: coordinate error handling

2017-07-31 Thread Adar Dembo (Code Review)
Adar Dembo has submitted this change and it was merged.

Change subject: disk failure: coordinate error handling
..


disk failure: coordinate error handling

IO errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
FsErrorManager, is added to coordinate error handling.

Callbacks are registered with the FsErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the FsErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Actual handling and
testing will come in a later patch.

This is a part of a series of patches to handle disk failure. See
section 2.2 of:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit?usp=sharing

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Reviewed-on: http://gerrit.cloudera.org:8080/7029
Reviewed-by: Adar Dembo 
Tested-by: Adar Dembo 
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tserver/tablet_server.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
15 files changed, 184 insertions(+), 28 deletions(-)

Approvals:
  Adar Dembo: Looks good to me, approved; Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 27
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] disk failure: coordinate error handling

2017-07-31 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: disk failure: coordinate error handling
..


Patch Set 26: Code-Review+2 Verified+1

Overriding Jenkins, I agree with Andrew that the failure looks unrelated to 
this change.

-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 26
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] disk failure: coordinate error handling

2017-07-31 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change.

Change subject: disk failure: coordinate error handling
..


Patch Set 26:

Failure seems unrelated.

-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 26
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] disk failure: coordinate error handling

2017-07-31 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change.

Change subject: disk failure: coordinate error handling
..


Patch Set 26:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7029/25/src/kudu/fs/block_manager-stress-test.cc
File src/kudu/fs/block_manager-stress-test.cc:

Line 241:   unique_ptr test_error_manager_;
> Nit: can drop std:: prefix; we have "using std::unique_ptr" at the top of t
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 26
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] disk failure: coordinate error handling

2017-07-31 Thread Andrew Wong (Code Review)
Hello Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#26).

Change subject: disk failure: coordinate error handling
..

disk failure: coordinate error handling

IO errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
FsErrorManager, is added to coordinate error handling.

Callbacks are registered with the FsErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the FsErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Actual handling and
testing will come in a later patch.

This is a part of a series of patches to handle disk failure. See
section 2.2 of:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit?usp=sharing

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tserver/tablet_server.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
15 files changed, 184 insertions(+), 28 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/26
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 26
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] disk failure: coordinate error handling

2017-07-27 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change.

Change subject: disk failure: coordinate error handling
..


Patch Set 25: Code-Review+1

lgtm. I know Adar was looking at this at one point, so not sure if he has any 
further comments.

-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 25
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] disk failure: coordinate error handling

2017-07-27 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change.

Change subject: disk failure: coordinate error handling
..


Patch Set 25:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/7029/24/src/kudu/fs/block_manager.h
File src/kudu/fs/block_manager.h:

Line 272: 
> missing a word?
You're right that this belongs in the next patch.


http://gerrit.cloudera.org:8080/#/c/7029/24/src/kudu/fs/fs_manager.h
File src/kudu/fs/fs_manager.h:

PS24, Line 125:   // relevant DataDir's UUID as its input parameter.
> I think it's clearer to have this information with the typedef of ErrorNoti
Done


Line 130:   // This must be called before the callback's callee is destroyed. 
Calls to
> nit: can you add whether this is idempotent? safe to call if nothing is set
Done


http://gerrit.cloudera.org:8080/#/c/7029/24/src/kudu/fs/log_block_manager.h
File src/kudu/fs/log_block_manager.h:

PS24, Line 169: d 'error_
> what's this mean? I don't think this line adds much value since the fact th
Fair point, had it for symmetry with file_block_manager. I'll just remove these 
lines and leave the below as notes.


Line 195:   FRIEND_TEST(LogBlockManagerTest, TestMetadataTruncation);
> is this used? tried grepping for it and didn't see any call sites, but mayb
Yeah, used in a future patch. Moved.


http://gerrit.cloudera.org:8080/#/c/7029/24/src/kudu/tserver/ts_tablet_manager.h
File src/kudu/tserver/ts_tablet_manager.h:

Line 189:   void FailTabletsInDataDir(const std::string& uuid);
> nit: std::string in header
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 25
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] disk failure: coordinate error handling

2017-07-27 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#25).

Change subject: disk failure: coordinate error handling
..

disk failure: coordinate error handling

IO errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
FsErrorManager is added to coordinate error handling.

Callbacks are registered with the FsErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the FsErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Actual handling and
testing will come in a later patch.

This is a part of a series of patches to handle disk failure. See
section 2.2 of:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit?usp=sharing

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tserver/tablet_server.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
15 files changed, 184 insertions(+), 28 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/25
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 25
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] disk failure: coordinate error handling

2017-07-26 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change.

Change subject: disk failure: coordinate error handling
..


Patch Set 24:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/7029/24/src/kudu/fs/block_manager.h
File src/kudu/fs/block_manager.h:

Line 272:   // Exposes the FsErrorManager used to fs errors.
missing a word?


http://gerrit.cloudera.org:8080/#/c/7029/24/src/kudu/fs/fs_manager.h
File src/kudu/fs/fs_manager.h:

PS24, Line 125:   // The input to the callback is expected to be the UUID of a 
failed DataDir.
I think it's clearer to have this information with the typedef of 
ErrorNotificationCb instead. Here, it sounds like it's describing some input to 
this call, not the parameter passed when the callback is invoked.

Maybe: "If a disk error is detected, this callback will be invoked with the 
relevant DataDir UUID as its parameter." or something?


Line 130:   // This must be called before the callback's callee is destroyed.
nit: can you add whether this is idempotent? safe to call if nothing is set yet?


http://gerrit.cloudera.org:8080/#/c/7029/24/src/kudu/fs/log_block_manager.h
File src/kudu/fs/log_block_manager.h:

PS24, Line 169: in-memory
what's this mean? I don't think this line adds much value since the fact that 
it's a constructor already sort of tells us that it creates an instance


Line 195:   FsErrorManager* error_manager() override { return error_manager_; }
is this used? tried grepping for it and didn't see any call sites, but maybe 
it's coming in the next patch


http://gerrit.cloudera.org:8080/#/c/7029/24/src/kudu/tserver/ts_tablet_manager.h
File src/kudu/tserver/ts_tablet_manager.h:

Line 189:   void FailTabletsInDataDir(const string& uuid);
nit: std::string in header


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 24
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] disk failure: coordinate error handling

2017-07-25 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change.

Change subject: disk failure: coordinate error handling
..


Patch Set 22:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7029/22/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 118: using std::set;
> warning: using decl 'set' is unused [misc-unused-using-decls]
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 22
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] disk failure: coordinate error handling

2017-07-24 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#23).

Change subject: disk failure: coordinate error handling
..

disk failure: coordinate error handling

IO errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
FsErrorManager is added to coordinate error handling.

Callbacks are registered with the FsErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the FsErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Actual handling and
testing will come in a later patch.

This is a part of a series of patches to handle disk failure. See
section 2.2 of:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit?usp=sharing

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tserver/tablet_server.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
16 files changed, 194 insertions(+), 30 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/23
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 23
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] disk failure: coordinate error handling

2017-07-24 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#22).

Change subject: disk failure: coordinate error handling
..

disk failure: coordinate error handling

IO errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
FsErrorManager is added to coordinate error handling.

Callbacks are registered with the FsErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the FsErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Actual handling and
testing will come in a later patch.

This is a part of a series of patches to handle disk failure. See
section 2.2 of:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit?usp=sharing

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tserver/tablet_server.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
16 files changed, 196 insertions(+), 30 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/22
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 22
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] disk failure: coordinate error handling

2017-07-19 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#21).

Change subject: disk failure: coordinate error handling
..

disk failure: coordinate error handling

IO errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
FsErrorManager is added to coordinate error handling.

Callbacks are registered with the FsErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the FsErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Actual handling and
testing will come in a later patch.

This is a part of a series of patches to handle disk failure. See
section 2.2 of:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit?usp=sharing

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tserver/tablet_server.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
16 files changed, 202 insertions(+), 30 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/21
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 21
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] disk failure: coordinate error handling

2017-07-18 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change.

Change subject: disk failure: coordinate error handling
..


Patch Set 20:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/7029/20/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 1072: void TSTabletManager::FailTabletsInDataDir(const string& uuid) {
Why don't we just have this be a thunk that says "something failed" and then 
have the TSTabletManager do something like this:

  for (const auto& tablet_id : dd_mgr->GetFailedTablets()) {
Tablet* t = Lookup(tablet_id);
t->Shutdown();
  }

Maybe I'm missing something for this part, but we need a lock somewhere to 
track which data dirs are "broken" so that we can ensure that asynchronous 
things like tablet copy will also abort if it's in the process of copying a 
tablet that has blocks now on a failed disk.


Line 1081: LOG(INFO) << "Shutting down tablet (not implemented in this 
patch): " << tablet_id;
If this doesn't do anything, why not just have an empty body of the function?


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 20
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] disk failure: coordinate error handling

2017-07-18 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change.

Change subject: disk failure: coordinate error handling
..


Patch Set 20:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/7029/20/src/kudu/fs/data_dirs.h
File src/kudu/fs/data_dirs.h:

Line 297:   uint16_t* GetUuidIdxForUuid(const std::string& uuid) {
I looked through here and I believe this is safe.

That said, this API is a bit scary at first glance. What lifetime guarantees do 
we make about the pointee here? (To me it looks like the ptr is good as long as 
the DataDirManager has been opened and was not yet destroyed.) We should at 
least document them.

Still, I think it would be a little more intuitive if this returned a bool or a 
Status and took a uint16_t* out-parameter.


http://gerrit.cloudera.org:8080/#/c/7029/20/src/kudu/fs/error_manager.h
File src/kudu/fs/error_manager.h:

Line 34: typedef Callback ErrorNotificationCb;
doc callback argument


Line 57:   void RunErrorNotificationCb(const std::string& uuid) const {
nit: arguments need docs


Line 58: notify_cb_.Run(uuid);
It makes me a little nervous not to submit this to a thread pool. We'd have to 
be careful to release any locks held which may not be easy to do at a low level 
if we are calling back into a higher level class. Although I guess if we don't 
need it then "YAGNI"


Line 61:   void RunErrorNotificationCb(const DataDir* dir) const {
Can we remove this one?


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 20
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] disk failure: coordinate error handling

2017-07-18 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: disk failure: coordinate error handling
..


Patch Set 16:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/7029/16/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 1067:   uint16_t* uuid_idx = 
fs_manager_->dd_manager()->GetUuidIdxForUuid(uuid);
> Maybe pull out fs_manager_->dd_manager() into a local variable so you don't
Did you miss this?


http://gerrit.cloudera.org:8080/#/c/7029/20/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

PS20, Line 1074: (const st
Didn't think this case was possible.


http://gerrit.cloudera.org:8080/#/c/7029/20/src/kudu/tserver/ts_tablet_manager.h
File src/kudu/tserver/ts_tablet_manager.h:

Line 188: 
I still think it'd be OK to make this private and wire/unwire in Init/Shutdown 
(as opposed to the ctor/dtor). Did Todd object to that specific kind of wiring? 
I thought the 'weirdness' he felt had to do with side effects in the ctor.


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 16
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] disk failure: coordinate error handling

2017-07-18 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#20).

Change subject: disk failure: coordinate error handling
..

disk failure: coordinate error handling

I/O errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
ErrorManager is added to coordinate error handling.

Callbacks are registered with the ErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the ErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Actual handling and
testing will come in a later patch.

This is a part of a series of patches to handle disk failure. See
section 2.2 of:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit?usp=sharing

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
M src/kudu/fs/data_dirs.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tserver/tablet_server.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
15 files changed, 188 insertions(+), 24 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/20
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 20
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] disk failure: coordinate error handling

2017-07-14 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#19).

Change subject: disk failure: coordinate error handling
..

disk failure: coordinate error handling

I/O errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
ErrorManager is added to coordinate error handling.

Callbacks are registered with the ErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the ErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Actual handling and
testing will come in a later patch.

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
M src/kudu/fs/data_dirs.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
12 files changed, 163 insertions(+), 21 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/19
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 19
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] disk failure: coordinate error handling

2017-07-07 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#17).

Change subject: disk failure: coordinate error handling
..

disk failure: coordinate error handling

I/O errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
ErrorManager is added to coordinate error handling.

Callbacks are registered with the ErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the ErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Actual handling and
testing will come in a later patch.

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
M src/kudu/fs/data_dirs.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tablet/mvcc.cc
M src/kudu/tserver/tablet_server.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
16 files changed, 184 insertions(+), 26 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/17
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 17
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] disk failure: coordinate error handling

2017-06-30 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: disk failure: coordinate error handling
..


Patch Set 16:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7029/16/src/kudu/tserver/tablet_server.cc
File src/kudu/tserver/tablet_server.cc:

Line 109:   
fs_manager_->SetErrorNotificationCb(Bind(::FailTabletsInDataDir,
> The idea here is to keep all the "wiring" at as high a level as possible. T
I can't speak for Todd, but I think his beef was primarily with the ctor/dtor 
aspect of the wiring. Init/Shutdown within TSTabletManager seems like it 
respects that, and it has the added bonus that the dependency affects two 
modules (FsManager and TSTabletManager) rather than three (FsManager, 
TSTabletManager, and TabletServer).

Your second point is valid, though if that happens we can always refactor.


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 16
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] disk failure: coordinate error handling

2017-06-30 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change.

Change subject: disk failure: coordinate error handling
..


Patch Set 16:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7029/16/src/kudu/tserver/tablet_server.cc
File src/kudu/tserver/tablet_server.cc:

Line 109:   
fs_manager_->SetErrorNotificationCb(Bind(::FailTabletsInDataDir,
> Why not do this in TSTabletManager::Init (and unset it in TSTabletManager::
The idea here is to keep all the "wiring" at as high a level as possible. Todd 
pointed out that it seems a bit strange for the initialization of a 
TSTabletManager to have an effect on the FsManager (these were originally in 
the c'tor and d'tor of TSTabletManager).

Another thought might be that if the error manager is ever used for other types 
of errors, setting these callbacks in one place may be favorable vs spreading 
them across multiple Init()s.


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 16
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] disk failure: coordinate error handling

2017-06-30 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: disk failure: coordinate error handling
..


Patch Set 16:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/7029/16/src/kudu/tserver/tablet_server.cc
File src/kudu/tserver/tablet_server.cc:

Line 109:   
fs_manager_->SetErrorNotificationCb(Bind(::FailTabletsInDataDir,
Why not do this in TSTabletManager::Init (and unset it in 
TSTabletManager::Shutdown)? Then FailTabletsInDataDir() could be private.


http://gerrit.cloudera.org:8080/#/c/7029/16/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 1067:   uint16_t* uuid_idx = 
fs_manager_->dd_manager()->GetUuidIdxForUuid(uuid);
Maybe pull out fs_manager_->dd_manager() into a local variable so you don't 
have to repeat it?


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 16
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] disk failure: coordinate error handling

2017-06-30 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change.

Change subject: disk failure: coordinate error handling
..


Patch Set 16:

(22 comments)

http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/block_manager.h
File src/kudu/fs/block_manager.h:

Line 199: 
> Since this is a raw pointer, you should say something about lifespans here.
Done


http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/error_manager.h
File src/kudu/fs/error_manager.h:

Line 32: typedef Callback ErrorNotificationCb;
> Hmm, why must this be defined here and in fs_manager.h? Surely we could def
Done, keeping it here and putting error_manager.h in fs_manager.h.


Line 36: // multiple layers, many of which we prefer to keep separate. The 
FsErrorManager
> These macros probably don't belong in this patch; I don't see them used at 
Done


PS15, Line 62: 
> What does "blocks" mean here? If this is ReadableBlock/WritableBlock, there
Done


Line 66: }  // namespace kudu
> Nit: separate on each line.
Done


Line 82
> This is a pretty meaty method; why not move it to a .cc file?
Moved all the logic out of the error manager.


Line 100
> Should this be a CHECK/DCHECK? Under what circumstances would you expect th
You're right that this should never happen. Done


PS15, Line 106: 
  : 
  : 
> This is too much cross-layer mingling. Reword it so it's scoped to only how
Done


http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/file_block_manager.cc
File src/kudu/fs/file_block_manager.cc:

Line 388:   FileReadableBlock(const FileBlockManager* block_manager, BlockId 
block_id,
> Why did you have to change this? And below?
For a future patch, it was needed to set the DataDirManager in the 
FsErrorManager. This isn't a requirement anymore so I'll revert it.


Line 544:"file_block_manager",
> I don't particularly care for this. Why not hoist the DataDirManager out of
I've restructured this a bit so the error manager doesn't need to know about 
any external state other than the callbacks.

I do intend on moving the DataDirManager out of the BlockManager (and also 
likely renaming it to DirectoryManager or something), but in another patch.


http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/file_block_manager.h
File src/kudu/fs/file_block_manager.h:

Line 106
> FWIW, this is a style that I employ, reminiscent of Javadoc, where the logi
I see, I've seen this style around, just not with such short sections. Reverting


Line 98: 
> Declare this virtual in BlockManager and override it here. Same for log_blo
Done


http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/fs_manager.h
File src/kudu/fs/fs_manager.h:

Line 49: 
> I've only finished reviewing this patch (and not the rest) so I don't even 
The key, I think, was pushing all of the logic into the clearly-defined layers 
(e.g. fail the tablets in the TSTabletManager instead of here).

Doing so also opens it up for a bit more flexibility (using a string instead of 
set as an input to this cb).


Line 49: 
> did you consider a design where the callback just notifies the failure by U
Done


Line 124:   // This callback will be run when block operations run into disk 
failures.
> Should also explain what this means. When will this callback be invoked? Wh
Done


PS15, Line 125: e input to the callback
> think this would be better SetErrorNotificationCallback to match the type
Done


http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/log_block_manager.h
File src/kudu/fs/log_block_manager.h:

Line 365: 
> Add a little comment here.
Done


http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

PS15, Line 164:   // Search for tablets in the metadata dir.
  :   vector tablet_ids;
> it seems a bit messy to have this being "wired" from the constructor. usual
Done, moved to the wiring to TabletServer Start()/Shutdown()


PS15, Line 175:   int loaded_count = 0;
  :   for (const string& tablet_id : tablet_ids) {
> same, seems a little not-quite-right here
Done


Line 611:   TabletDataState data_state = 
replica->tablet_metadata()->tablet_data_state();
> Did you intend for this decomposition to be in this patch? Since there's st
Hrm, yeah originally there was a usage that was very specific to disk failure 
handling that's in a separate patch. Will revert.


http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/tserver/ts_tablet_manager.h
File src/kudu/tserver/ts_tablet_manager.h:

PS15, Line 132: scoped_refptr* 
replica) const;
  : 
  :   // Same as LookupTablet but doesn't acquired the shared lock.
  :   bool LookupTabletUnlocked(const std::string& tablet_id,
  : 
scoped_refptr* replica) const;
  : 
  :   virtual Status 

[kudu-CR] disk failure: coordinate error handling

2017-06-30 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#16).

Change subject: disk failure: coordinate error handling
..

disk failure: coordinate error handling

I/O errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
ErrorManager is added to coordinate error handling.

Callbacks are registered with the ErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the ErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Actual handling and
testing will come in a later patch.

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
M src/kudu/fs/data_dirs.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tablet/mvcc.cc
M src/kudu/tserver/tablet_server.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
16 files changed, 184 insertions(+), 26 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/16
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 16
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] disk failure: coordinate error handling

2017-06-26 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: disk failure: coordinate error handling
..


Patch Set 15:

(16 comments)

http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/block_manager.h
File src/kudu/fs/block_manager.h:

Line 199:   // Manager for coordinating error-handling.
Since this is a raw pointer, you should say something about lifespans here. I 
bet the FsErrorManager has to outlive the BlockManager, right?

Actually, since this isn't optional, I'd add it as a standalone argument to the 
FBM/LBM constructors instead.


http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/error_manager.h
File src/kudu/fs/error_manager.h:

Line 32: typedef Callback 
ErrorNotificationCallback;
Hmm, why must this be defined here and in fs_manager.h? Surely we could define 
it in the "dependent" header and have the "depending" header include it?


Line 36: // Evaluates the expression and handles it if it results in an error.
These macros probably don't belong in this patch; I don't see them used at all.


PS15, Line 62: blocks
What does "blocks" mean here? If this is ReadableBlock/WritableBlock, there's 
not enough context to understand why they'd need to call the callback.


Line 66:   FsErrorManager() : dd_manager_(nullptr), 
notify_cb_(Bind()) {}
Nit: separate on each line.


Line 82:   void FailTabletsInDataDir(DataDir* dir) {
This is a pretty meaty method; why not move it to a .cc file?


Line 100: LOG(ERROR) << strings::Substitute("Dir $0 not tracked by 
DataDirManager", dir->dir());
Should this be a CHECK/DCHECK? Under what circumstances would you expect this 
to fire in production?


PS15, Line 106:   // Callback that fails the TSTabletManager's tablet peers.
  :   // The referenced TSTabletManager may be deleted before the 
FsErrorManager,
  :   // so this may be null.
This is too much cross-layer mingling. Reword it so it's scoped to only how 
FsErrorManager interacts with it, and how registrants should behave when it's 
called.


http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/file_block_manager.cc
File src/kudu/fs/file_block_manager.cc:

Line 388:   FileReadableBlock(FileBlockManager* block_manager, BlockId block_id,
Why did you have to change this? And below?


Line 544:   error_manager_->SetDataDirManager(_manager_);
I don't particularly care for this. Why not hoist the DataDirManager out of the 
block managers and let it be owned by the FsManager? At that point the 
FsManager creates the BlockManager, DataDirManager, and FsErrorManager, and it 
can order construction so that everyone gets the right pointer. Probably 
DataDirManager first, then FsErrorManager, then BlockManager?


http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/file_block_manager.h
File src/kudu/fs/file_block_manager.h:

Line 106
FWIW, this is a style that I employ, reminiscent of Javadoc, where the logical 
sections of the function's description are separated by empty lines to 
emphasize their differences.


Line 98:   FsErrorManager* error_manager() { return error_manager_; }
Declare this virtual in BlockManager and override it here. Same for 
log_block_manager.h.


http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/fs_manager.h
File src/kudu/fs/fs_manager.h:

Line 49: typedef Callback 
ErrorNotificationCallback;
> did you consider a design where the callback just notifies the failure by U
I've only finished reviewing this patch (and not the rest) so I don't even know 
what set means (could use a comment!). But I agree with Todd that fewer 
cross-references between these classes would probably lead to a clearer 
implementation.


Line 124:   // Registers a callback with the FsErrorManager.
Should also explain what this means. When will this callback be invoked? When 
invoked, what's the significance of its argument? Is it important to unregister 
it?


http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/log_block_manager.h
File src/kudu/fs/log_block_manager.h:

Line 365:   FsErrorManager* error_manager_;
Add a little comment here.


http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 611: Status TSTabletManager::TransitionTabletState(
Did you intend for this decomposition to be in this patch? Since there's still 
just one call into TransitionTabletState(), it's not clear why bother to 
decompose here. Todd was sort of getting at the same thing when he asked why 
this is public; presumably its for a different patch?


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 15
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 

[kudu-CR] disk failure: coordinate error handling

2017-06-23 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change.

Change subject: disk failure: coordinate error handling
..


Patch Set 8:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/7029/14/src/kudu/fs/block_manager-stress-test.cc
File src/kudu/fs/block_manager-stress-test.cc:

Line 29: #include "kudu/fs/log_block_manager-test-util.h"
> warning: #includes are not sorted properly [llvm-include-order]
Done


http://gerrit.cloudera.org:8080/#/c/7029/14/src/kudu/fs/log_block_manager-test.cc
File src/kudu/fs/log_block_manager-test.cc:

Line 28: #include "kudu/fs/log_block_manager-test-util.h"
> warning: #includes are not sorted properly [llvm-include-order]
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] disk failure: coordinate error handling

2017-06-22 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#14).

Change subject: disk failure: coordinate error handling
..

disk failure: coordinate error handling

I/O errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
FsErrorManager is added to coordinate error handling.

Callbacks are registered with the FsErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the FsErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Actual handling and
testing will come in a later patch.

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tablet/mvcc.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
14 files changed, 233 insertions(+), 34 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/14
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 14
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] disk failure: coordinate error handling

2017-06-21 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#12).

Change subject: disk failure: coordinate error handling
..

disk failure: coordinate error handling

I/O errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
FsErrorManager is added to coordinate error handling.

Callbacks are registered with the FsErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the FsErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Actual handling and
testing will come in a later patch.

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tablet/mvcc.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
14 files changed, 231 insertions(+), 34 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/12
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 12
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] disk failure: coordinate error handling

2017-06-21 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#11).

Change subject: disk failure: coordinate error handling
..

disk failure: coordinate error handling

I/O errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
FsErrorManager is added to coordinate error handling.

Callbacks are registered with the FsErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the FsErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Actual handling and
testing will come in a later patch.

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/master/sys_catalog.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
14 files changed, 260 insertions(+), 33 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/11
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 11
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] disk failure: coordinate error handling

2017-06-20 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change.

Change subject: disk failure: coordinate error handling
..


Patch Set 3:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/7029/8/src/kudu/fs/error_manager.h
File src/kudu/fs/error_manager.h:

PS8, Line 60: LOG(WARNING) << strings::Substitute("Dir $0 failed, marking 
for failure", dir->dir());
: uint16_t uuid_idx;
: if (dd_manager_->FindUuidIndexByDataDir(dir, _idx)) {
:   // If the directory is registered as failed, ignore it.
:   if (dd_manager_->IsDataDirFailed(
> hm, I dont think "fallthrough intended" is really necessary here. Usually s
Done


PS8, Line 74: 
:   return;
> not sure what this means
Done


PS8, Line 80: 
> maybe use a typedef for this?
Done


Line 81:   // Must outlive the eio error manager.
> std::move(cb)?
Done


Line 125
> nit: I think it's better to keep the specifics of what the callback _does_ 
Done. I agree the error_manager is meant to handle more than just disk failures.

I was thinking of having something like (failure type => get_observers), 
(failure type => notify_cb), (failure type => do other stuff) might be 
warranted later on. Or just leave it completely open-ended (failure type => 
handle error). The question is: design that now or when we have more handleable 
errors?


http://gerrit.cloudera.org:8080/#/c/7029/8/src/kudu/fs/fs_manager.h
File src/kudu/fs/fs_manager.h:

PS8, Line 121: *
> why is this taking a pointer to a callback instead of just taking a callbac
Done


Line 214:   void SetTabletFailedCallback(Callback* 
cb);
> same
Done


http://gerrit.cloudera.org:8080/#/c/7029/8/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 684: LOG(WARNING) << s.ToString();
> not quite following the purpose of this parameter. In the case that delete_
Done. Split out the above code to transition tablet state. Only that and 
replica->Shutdown() are needed from what I can tell.


Line 804:   MonoTime start(MonoTime::Now());
> is there a DCHECK we could add here that the replica is already marked fail
Moving this logic out of this patch, but done.


http://gerrit.cloudera.org:8080/#/c/7029/8/src/kudu/tserver/ts_tablet_manager.h
File src/kudu/tserver/ts_tablet_manager.h:

Line 131: 
> can you update the doc as to why you would want to DeleteTablet without del
Took your other suggestion of bring out the needed logic out into its own 
function.


PS8, Line 190:   // TODO(awong): rather than deleting the tablet, don't do 
anything to the
 :   // tablet metadata and just store disk failure state instead.
 :   void FailTabletReplicas(const std::set& 
tablet_ids);
 : 
> the docs don't seem to match the name here. Is this shutting it down or try
Done


Line 294: 
> I think if you made the other classes take the callback by value instead of
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] disk failure: coordinate error handling

2017-06-20 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#10).

Change subject: disk failure: coordinate error handling
..

disk failure: coordinate error handling

I/O errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
FsErrorManager is added to coordinate error handling.

Callbacks are registered with the FsErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the FsErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Usage and testing will
come in a later patch.

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/master/sys_catalog.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
13 files changed, 267 insertions(+), 37 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/10
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] disk failure: coordinate error handling

2017-06-20 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#9).

Change subject: disk failure: coordinate error handling
..

disk failure: coordinate error handling

I/O errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
FsErrorManager is added to coordinate error handling.

Callbacks are registered with the FsErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the FsErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Usage and testing will
come in a later patch.

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
12 files changed, 269 insertions(+), 37 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/9
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] disk failure: coordinate error handling

2017-06-18 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change.

Change subject: disk failure: coordinate error handling
..


Patch Set 8:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/7029/8/src/kudu/fs/error_manager.h
File src/kudu/fs/error_manager.h:

PS8, Line 60: case EIO: // Fallthrough intended
: case ENODEV: // Fallthrough intended
: case ENOSPC: // Fallthrough intended
: case ENXIO: // Fallthrough intended
: case EROFS: // Fallthrough intended
hm, I dont think "fallthrough intended" is really necessary here. Usually such 
a comment would be used when there is actual code inside of a 'case' label and 
then falls through to the below one, like:

switch (foo) {
  case A:
DoSomething();
FALLTHROUGH_INTENDED;
  case B:
DoSomethingElse();
break;
}

(see gutil/macros.h for the above macro and example usage -- it actually has 
some effect on clang warnings as well)


PS8, Line 74:  that blocks known to the
: // BlockManager can call.
not sure what this means


PS8, Line 80: Callback* cb
maybe use a typedef for this?


Line 81: shutdown_replicas_cb_ = cb;
std::move(cb)?


Line 125:   Callback* shutdown_replicas_cb_;
nit: I think it's better to keep the specifics of what the callback _does_ out 
of this class. i.e from this class's perspectives, it's not necessary that it 
is going to shutdown replicas. Instead maybe just something like notify_cb_?


http://gerrit.cloudera.org:8080/#/c/7029/8/src/kudu/fs/fs_manager.h
File src/kudu/fs/fs_manager.h:

PS8, Line 121: *
why is this taking a pointer to a callback instead of just taking a callback?


Line 214:   void SetTabletFailedCallback(Callback* 
cb);
same


http://gerrit.cloudera.org:8080/#/c/7029/8/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 684:   if (delete_tablet_data) {
not quite following the purpose of this parameter. In the case that 
delete_tablet_data is false, then how much of the rest of this function is 
still relevant? eg opt_last_logged_opid won't be used, the above CAS-related 
stuff isn't used, right? Maybe refactoring the function so the common bits can 
be reused, but introducing a new function like MarkFailedAndShutdown(tablet_id) 
would be clearer?


Line 804:   // Disk failures should have been handled.
is there a DCHECK we could add here that the replica is already marked failed 
if not?


http://gerrit.cloudera.org:8080/#/c/7029/8/src/kudu/tserver/ts_tablet_manager.h
File src/kudu/tserver/ts_tablet_manager.h:

Line 131:   bool delete_tablet_data = true);
can you update the doc as to why you would want to DeleteTablet without 
deleting the data?


PS8, Line 190:   // Asynchronously shut down a tablet replica by deleting the 
tablet.
 :   // TODO(awong): rather than deleting the tablet, don't do 
anything to the
 :   // tablet metadata and just store disk failure state instead.
 :   void FailTabletReplicas(const std::set& 
tablet_ids);
the docs don't seem to match the name here. Is this shutting it down or trying 
to delete on-disk data, etc?


Line 294:   Callback shutdown_replicas_cb_;
I think if you made the other classes take the callback by value instead of by 
pointer, you wouldn't need this member


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] disk failure: coordinate error handling

2017-06-16 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#8).

Change subject: disk failure: coordinate error handling
..

disk failure: coordinate error handling

I/O errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
FsErrorManager is added to coordinate error handling.

Callbacks are registered with the FsErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the FsErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Usage and testing will
come in a later patch.

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
11 files changed, 236 insertions(+), 28 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/8
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] disk failure: coordinate error handling

2017-06-16 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#7).

Change subject: disk failure: coordinate error handling
..

disk failure: coordinate error handling

I/O errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
FsErrorManager is added to coordinate error handling.

Callbacks are registered with the FsErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the FsErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Usage and testing will
come in a later patch.

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
11 files changed, 238 insertions(+), 28 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/7
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] disk failure: coordinate error handling

2017-06-16 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#6).

Change subject: disk failure: coordinate error handling
..

disk failure: coordinate error handling

I/O errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
FsErrorManager is added to coordinate error handling.

Callbacks are registered with the FsErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the FsErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Usage and testing will
come in a later patch.

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
11 files changed, 239 insertions(+), 28 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/6
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] disk failure: coordinate error handling

2017-06-16 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#5).

Change subject: disk failure: coordinate error handling
..

disk failure: coordinate error handling

I/O errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
FsErrorManager is added to coordinate error handling.

Callbacks are registered with the FsErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the FsErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Usage and testing will
come in a later patch.

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
11 files changed, 238 insertions(+), 28 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/5
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] disk failure: coordinate error handling

2017-06-16 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change.

Change subject: disk failure: coordinate error handling
..


Patch Set 4:

For now I'm keeping usage and testing separate since placement and lifespan are 
discussion points that can be localized to a single patch.

-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] disk failure: coordinate error handling

2017-06-16 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7029

to look at the new patch set (#4).

Change subject: disk failure: coordinate error handling
..

disk failure: coordinate error handling

I/O errors are spawned when calls to env functions fail. The response to
these errors may vary and may span many layers of abstraction.

For instance, if an EIO were triggered while writing a block, one
response might be to shut down the tablet whose operation just failed.
In order to do this without messing up the layering, a new entity, the
FsErrorManager is added to coordinate error handling.

Callbacks are registered with the FsErrorManager as responses to errors,
and callers to at-risk functions trigger these callbacks as errors are
spawned.

This patch includes the FsErrorManager and its placement w.r.t. the
BlockManager, FsManager, TSTabletManager, etc. Usage and testing will
come in a later patch.

Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
---
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/block_manager.h
A src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/file_block_manager.h
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
11 files changed, 233 insertions(+), 26 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/4
-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon