[kudu-CR] disk failure: coordinate error handling
Adar Dembo has submitted this change and it was merged. Change subject: disk failure: coordinate error handling .. disk failure: coordinate error handling IO errors are spawned when calls to env functions fail. The response to these errors may vary and may span many layers of abstraction. For instance, if an EIO were triggered while writing a block, one response might be to shut down the tablet whose operation just failed. In order to do this without messing up the layering, a new entity, the FsErrorManager, is added to coordinate error handling. Callbacks are registered with the FsErrorManager as responses to errors, and callers to at-risk functions trigger these callbacks as errors are spawned. This patch includes the FsErrorManager and its placement w.r.t. the BlockManager, FsManager, TSTabletManager, etc. Actual handling and testing will come in a later patch. This is a part of a series of patches to handle disk failure. See section 2.2 of: https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit?usp=sharing Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Reviewed-on: http://gerrit.cloudera.org:8080/7029 Reviewed-by: Adar DemboTested-by: Adar Dembo --- M src/kudu/fs/block_manager-stress-test.cc M src/kudu/fs/block_manager-test.cc M src/kudu/fs/data_dirs.cc M src/kudu/fs/data_dirs.h A src/kudu/fs/error_manager.h M src/kudu/fs/file_block_manager.cc M src/kudu/fs/file_block_manager.h M src/kudu/fs/fs_manager.cc M src/kudu/fs/fs_manager.h M src/kudu/fs/log_block_manager-test.cc M src/kudu/fs/log_block_manager.cc M src/kudu/fs/log_block_manager.h M src/kudu/tserver/tablet_server.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/ts_tablet_manager.h 15 files changed, 184 insertions(+), 28 deletions(-) Approvals: Adar Dembo: Looks good to me, approved; Verified -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 27 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] disk failure: coordinate error handling
Adar Dembo has posted comments on this change. Change subject: disk failure: coordinate error handling .. Patch Set 26: Code-Review+2 Verified+1 Overriding Jenkins, I agree with Andrew that the failure looks unrelated to this change. -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 26 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] disk failure: coordinate error handling
Andrew Wong has posted comments on this change. Change subject: disk failure: coordinate error handling .. Patch Set 26: Failure seems unrelated. -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 26 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] disk failure: coordinate error handling
Andrew Wong has posted comments on this change. Change subject: disk failure: coordinate error handling .. Patch Set 26: (1 comment) http://gerrit.cloudera.org:8080/#/c/7029/25/src/kudu/fs/block_manager-stress-test.cc File src/kudu/fs/block_manager-stress-test.cc: Line 241: unique_ptr test_error_manager_; > Nit: can drop std:: prefix; we have "using std::unique_ptr" at the top of t Done -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 26 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] disk failure: coordinate error handling
Hello Todd Lipcon, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7029 to look at the new patch set (#26). Change subject: disk failure: coordinate error handling .. disk failure: coordinate error handling IO errors are spawned when calls to env functions fail. The response to these errors may vary and may span many layers of abstraction. For instance, if an EIO were triggered while writing a block, one response might be to shut down the tablet whose operation just failed. In order to do this without messing up the layering, a new entity, the FsErrorManager, is added to coordinate error handling. Callbacks are registered with the FsErrorManager as responses to errors, and callers to at-risk functions trigger these callbacks as errors are spawned. This patch includes the FsErrorManager and its placement w.r.t. the BlockManager, FsManager, TSTabletManager, etc. Actual handling and testing will come in a later patch. This is a part of a series of patches to handle disk failure. See section 2.2 of: https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit?usp=sharing Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 --- M src/kudu/fs/block_manager-stress-test.cc M src/kudu/fs/block_manager-test.cc M src/kudu/fs/data_dirs.cc M src/kudu/fs/data_dirs.h A src/kudu/fs/error_manager.h M src/kudu/fs/file_block_manager.cc M src/kudu/fs/file_block_manager.h M src/kudu/fs/fs_manager.cc M src/kudu/fs/fs_manager.h M src/kudu/fs/log_block_manager-test.cc M src/kudu/fs/log_block_manager.cc M src/kudu/fs/log_block_manager.h M src/kudu/tserver/tablet_server.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/ts_tablet_manager.h 15 files changed, 184 insertions(+), 28 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/26 -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 26 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] disk failure: coordinate error handling
Todd Lipcon has posted comments on this change. Change subject: disk failure: coordinate error handling .. Patch Set 25: Code-Review+1 lgtm. I know Adar was looking at this at one point, so not sure if he has any further comments. -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 25 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] disk failure: coordinate error handling
Andrew Wong has posted comments on this change. Change subject: disk failure: coordinate error handling .. Patch Set 25: (6 comments) http://gerrit.cloudera.org:8080/#/c/7029/24/src/kudu/fs/block_manager.h File src/kudu/fs/block_manager.h: Line 272: > missing a word? You're right that this belongs in the next patch. http://gerrit.cloudera.org:8080/#/c/7029/24/src/kudu/fs/fs_manager.h File src/kudu/fs/fs_manager.h: PS24, Line 125: // relevant DataDir's UUID as its input parameter. > I think it's clearer to have this information with the typedef of ErrorNoti Done Line 130: // This must be called before the callback's callee is destroyed. Calls to > nit: can you add whether this is idempotent? safe to call if nothing is set Done http://gerrit.cloudera.org:8080/#/c/7029/24/src/kudu/fs/log_block_manager.h File src/kudu/fs/log_block_manager.h: PS24, Line 169: d 'error_ > what's this mean? I don't think this line adds much value since the fact th Fair point, had it for symmetry with file_block_manager. I'll just remove these lines and leave the below as notes. Line 195: FRIEND_TEST(LogBlockManagerTest, TestMetadataTruncation); > is this used? tried grepping for it and didn't see any call sites, but mayb Yeah, used in a future patch. Moved. http://gerrit.cloudera.org:8080/#/c/7029/24/src/kudu/tserver/ts_tablet_manager.h File src/kudu/tserver/ts_tablet_manager.h: Line 189: void FailTabletsInDataDir(const std::string& uuid); > nit: std::string in header Done -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 25 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] disk failure: coordinate error handling
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7029 to look at the new patch set (#25). Change subject: disk failure: coordinate error handling .. disk failure: coordinate error handling IO errors are spawned when calls to env functions fail. The response to these errors may vary and may span many layers of abstraction. For instance, if an EIO were triggered while writing a block, one response might be to shut down the tablet whose operation just failed. In order to do this without messing up the layering, a new entity, the FsErrorManager is added to coordinate error handling. Callbacks are registered with the FsErrorManager as responses to errors, and callers to at-risk functions trigger these callbacks as errors are spawned. This patch includes the FsErrorManager and its placement w.r.t. the BlockManager, FsManager, TSTabletManager, etc. Actual handling and testing will come in a later patch. This is a part of a series of patches to handle disk failure. See section 2.2 of: https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit?usp=sharing Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 --- M src/kudu/fs/block_manager-stress-test.cc M src/kudu/fs/block_manager-test.cc M src/kudu/fs/data_dirs.cc M src/kudu/fs/data_dirs.h A src/kudu/fs/error_manager.h M src/kudu/fs/file_block_manager.cc M src/kudu/fs/file_block_manager.h M src/kudu/fs/fs_manager.cc M src/kudu/fs/fs_manager.h M src/kudu/fs/log_block_manager-test.cc M src/kudu/fs/log_block_manager.cc M src/kudu/fs/log_block_manager.h M src/kudu/tserver/tablet_server.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/ts_tablet_manager.h 15 files changed, 184 insertions(+), 28 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/25 -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 25 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] disk failure: coordinate error handling
Todd Lipcon has posted comments on this change. Change subject: disk failure: coordinate error handling .. Patch Set 24: (6 comments) http://gerrit.cloudera.org:8080/#/c/7029/24/src/kudu/fs/block_manager.h File src/kudu/fs/block_manager.h: Line 272: // Exposes the FsErrorManager used to fs errors. missing a word? http://gerrit.cloudera.org:8080/#/c/7029/24/src/kudu/fs/fs_manager.h File src/kudu/fs/fs_manager.h: PS24, Line 125: // The input to the callback is expected to be the UUID of a failed DataDir. I think it's clearer to have this information with the typedef of ErrorNotificationCb instead. Here, it sounds like it's describing some input to this call, not the parameter passed when the callback is invoked. Maybe: "If a disk error is detected, this callback will be invoked with the relevant DataDir UUID as its parameter." or something? Line 130: // This must be called before the callback's callee is destroyed. nit: can you add whether this is idempotent? safe to call if nothing is set yet? http://gerrit.cloudera.org:8080/#/c/7029/24/src/kudu/fs/log_block_manager.h File src/kudu/fs/log_block_manager.h: PS24, Line 169: in-memory what's this mean? I don't think this line adds much value since the fact that it's a constructor already sort of tells us that it creates an instance Line 195: FsErrorManager* error_manager() override { return error_manager_; } is this used? tried grepping for it and didn't see any call sites, but maybe it's coming in the next patch http://gerrit.cloudera.org:8080/#/c/7029/24/src/kudu/tserver/ts_tablet_manager.h File src/kudu/tserver/ts_tablet_manager.h: Line 189: void FailTabletsInDataDir(const string& uuid); nit: std::string in header -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 24 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] disk failure: coordinate error handling
Andrew Wong has posted comments on this change. Change subject: disk failure: coordinate error handling .. Patch Set 22: (1 comment) http://gerrit.cloudera.org:8080/#/c/7029/22/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: Line 118: using std::set; > warning: using decl 'set' is unused [misc-unused-using-decls] Done -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 22 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] disk failure: coordinate error handling
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7029 to look at the new patch set (#23). Change subject: disk failure: coordinate error handling .. disk failure: coordinate error handling IO errors are spawned when calls to env functions fail. The response to these errors may vary and may span many layers of abstraction. For instance, if an EIO were triggered while writing a block, one response might be to shut down the tablet whose operation just failed. In order to do this without messing up the layering, a new entity, the FsErrorManager is added to coordinate error handling. Callbacks are registered with the FsErrorManager as responses to errors, and callers to at-risk functions trigger these callbacks as errors are spawned. This patch includes the FsErrorManager and its placement w.r.t. the BlockManager, FsManager, TSTabletManager, etc. Actual handling and testing will come in a later patch. This is a part of a series of patches to handle disk failure. See section 2.2 of: https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit?usp=sharing Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 --- M src/kudu/fs/block_manager-stress-test.cc M src/kudu/fs/block_manager-test.cc M src/kudu/fs/block_manager.h M src/kudu/fs/data_dirs.cc M src/kudu/fs/data_dirs.h A src/kudu/fs/error_manager.h M src/kudu/fs/file_block_manager.cc M src/kudu/fs/file_block_manager.h M src/kudu/fs/fs_manager.cc M src/kudu/fs/fs_manager.h M src/kudu/fs/log_block_manager-test.cc M src/kudu/fs/log_block_manager.cc M src/kudu/fs/log_block_manager.h M src/kudu/tserver/tablet_server.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/ts_tablet_manager.h 16 files changed, 194 insertions(+), 30 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/23 -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 23 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] disk failure: coordinate error handling
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7029 to look at the new patch set (#22). Change subject: disk failure: coordinate error handling .. disk failure: coordinate error handling IO errors are spawned when calls to env functions fail. The response to these errors may vary and may span many layers of abstraction. For instance, if an EIO were triggered while writing a block, one response might be to shut down the tablet whose operation just failed. In order to do this without messing up the layering, a new entity, the FsErrorManager is added to coordinate error handling. Callbacks are registered with the FsErrorManager as responses to errors, and callers to at-risk functions trigger these callbacks as errors are spawned. This patch includes the FsErrorManager and its placement w.r.t. the BlockManager, FsManager, TSTabletManager, etc. Actual handling and testing will come in a later patch. This is a part of a series of patches to handle disk failure. See section 2.2 of: https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit?usp=sharing Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 --- M src/kudu/fs/block_manager-stress-test.cc M src/kudu/fs/block_manager-test.cc M src/kudu/fs/block_manager.h M src/kudu/fs/data_dirs.cc M src/kudu/fs/data_dirs.h A src/kudu/fs/error_manager.h M src/kudu/fs/file_block_manager.cc M src/kudu/fs/file_block_manager.h M src/kudu/fs/fs_manager.cc M src/kudu/fs/fs_manager.h M src/kudu/fs/log_block_manager-test.cc M src/kudu/fs/log_block_manager.cc M src/kudu/fs/log_block_manager.h M src/kudu/tserver/tablet_server.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/ts_tablet_manager.h 16 files changed, 196 insertions(+), 30 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/22 -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 22 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] disk failure: coordinate error handling
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7029 to look at the new patch set (#21). Change subject: disk failure: coordinate error handling .. disk failure: coordinate error handling IO errors are spawned when calls to env functions fail. The response to these errors may vary and may span many layers of abstraction. For instance, if an EIO were triggered while writing a block, one response might be to shut down the tablet whose operation just failed. In order to do this without messing up the layering, a new entity, the FsErrorManager is added to coordinate error handling. Callbacks are registered with the FsErrorManager as responses to errors, and callers to at-risk functions trigger these callbacks as errors are spawned. This patch includes the FsErrorManager and its placement w.r.t. the BlockManager, FsManager, TSTabletManager, etc. Actual handling and testing will come in a later patch. This is a part of a series of patches to handle disk failure. See section 2.2 of: https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit?usp=sharing Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 --- M src/kudu/fs/block_manager-stress-test.cc M src/kudu/fs/block_manager-test.cc M src/kudu/fs/block_manager.h M src/kudu/fs/data_dirs.cc M src/kudu/fs/data_dirs.h A src/kudu/fs/error_manager.h M src/kudu/fs/file_block_manager.cc M src/kudu/fs/file_block_manager.h M src/kudu/fs/fs_manager.cc M src/kudu/fs/fs_manager.h M src/kudu/fs/log_block_manager-test.cc M src/kudu/fs/log_block_manager.cc M src/kudu/fs/log_block_manager.h M src/kudu/tserver/tablet_server.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/ts_tablet_manager.h 16 files changed, 202 insertions(+), 30 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/21 -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 21 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] disk failure: coordinate error handling
Mike Percy has posted comments on this change. Change subject: disk failure: coordinate error handling .. Patch Set 20: (2 comments) http://gerrit.cloudera.org:8080/#/c/7029/20/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: Line 1072: void TSTabletManager::FailTabletsInDataDir(const string& uuid) { Why don't we just have this be a thunk that says "something failed" and then have the TSTabletManager do something like this: for (const auto& tablet_id : dd_mgr->GetFailedTablets()) { Tablet* t = Lookup(tablet_id); t->Shutdown(); } Maybe I'm missing something for this part, but we need a lock somewhere to track which data dirs are "broken" so that we can ensure that asynchronous things like tablet copy will also abort if it's in the process of copying a tablet that has blocks now on a failed disk. Line 1081: LOG(INFO) << "Shutting down tablet (not implemented in this patch): " << tablet_id; If this doesn't do anything, why not just have an empty body of the function? -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 20 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] disk failure: coordinate error handling
Mike Percy has posted comments on this change. Change subject: disk failure: coordinate error handling .. Patch Set 20: (5 comments) http://gerrit.cloudera.org:8080/#/c/7029/20/src/kudu/fs/data_dirs.h File src/kudu/fs/data_dirs.h: Line 297: uint16_t* GetUuidIdxForUuid(const std::string& uuid) { I looked through here and I believe this is safe. That said, this API is a bit scary at first glance. What lifetime guarantees do we make about the pointee here? (To me it looks like the ptr is good as long as the DataDirManager has been opened and was not yet destroyed.) We should at least document them. Still, I think it would be a little more intuitive if this returned a bool or a Status and took a uint16_t* out-parameter. http://gerrit.cloudera.org:8080/#/c/7029/20/src/kudu/fs/error_manager.h File src/kudu/fs/error_manager.h: Line 34: typedef CallbackErrorNotificationCb; doc callback argument Line 57: void RunErrorNotificationCb(const std::string& uuid) const { nit: arguments need docs Line 58: notify_cb_.Run(uuid); It makes me a little nervous not to submit this to a thread pool. We'd have to be careful to release any locks held which may not be easy to do at a low level if we are calling back into a higher level class. Although I guess if we don't need it then "YAGNI" Line 61: void RunErrorNotificationCb(const DataDir* dir) const { Can we remove this one? -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 20 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] disk failure: coordinate error handling
Adar Dembo has posted comments on this change. Change subject: disk failure: coordinate error handling .. Patch Set 16: (3 comments) http://gerrit.cloudera.org:8080/#/c/7029/16/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: Line 1067: uint16_t* uuid_idx = fs_manager_->dd_manager()->GetUuidIdxForUuid(uuid); > Maybe pull out fs_manager_->dd_manager() into a local variable so you don't Did you miss this? http://gerrit.cloudera.org:8080/#/c/7029/20/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: PS20, Line 1074: (const st Didn't think this case was possible. http://gerrit.cloudera.org:8080/#/c/7029/20/src/kudu/tserver/ts_tablet_manager.h File src/kudu/tserver/ts_tablet_manager.h: Line 188: I still think it'd be OK to make this private and wire/unwire in Init/Shutdown (as opposed to the ctor/dtor). Did Todd object to that specific kind of wiring? I thought the 'weirdness' he felt had to do with side effects in the ctor. -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 16 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] disk failure: coordinate error handling
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7029 to look at the new patch set (#20). Change subject: disk failure: coordinate error handling .. disk failure: coordinate error handling I/O errors are spawned when calls to env functions fail. The response to these errors may vary and may span many layers of abstraction. For instance, if an EIO were triggered while writing a block, one response might be to shut down the tablet whose operation just failed. In order to do this without messing up the layering, a new entity, the ErrorManager is added to coordinate error handling. Callbacks are registered with the ErrorManager as responses to errors, and callers to at-risk functions trigger these callbacks as errors are spawned. This patch includes the ErrorManager and its placement w.r.t. the BlockManager, FsManager, TSTabletManager, etc. Actual handling and testing will come in a later patch. This is a part of a series of patches to handle disk failure. See section 2.2 of: https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit?usp=sharing Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 --- M src/kudu/fs/block_manager-stress-test.cc M src/kudu/fs/block_manager-test.cc M src/kudu/fs/block_manager.h M src/kudu/fs/data_dirs.h A src/kudu/fs/error_manager.h M src/kudu/fs/file_block_manager.cc M src/kudu/fs/file_block_manager.h M src/kudu/fs/fs_manager.cc M src/kudu/fs/fs_manager.h M src/kudu/fs/log_block_manager-test.cc M src/kudu/fs/log_block_manager.cc M src/kudu/fs/log_block_manager.h M src/kudu/tserver/tablet_server.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/ts_tablet_manager.h 15 files changed, 188 insertions(+), 24 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/20 -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 20 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] disk failure: coordinate error handling
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7029 to look at the new patch set (#19). Change subject: disk failure: coordinate error handling .. disk failure: coordinate error handling I/O errors are spawned when calls to env functions fail. The response to these errors may vary and may span many layers of abstraction. For instance, if an EIO were triggered while writing a block, one response might be to shut down the tablet whose operation just failed. In order to do this without messing up the layering, a new entity, the ErrorManager is added to coordinate error handling. Callbacks are registered with the ErrorManager as responses to errors, and callers to at-risk functions trigger these callbacks as errors are spawned. This patch includes the ErrorManager and its placement w.r.t. the BlockManager, FsManager, TSTabletManager, etc. Actual handling and testing will come in a later patch. Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 --- M src/kudu/fs/block_manager-stress-test.cc M src/kudu/fs/block_manager-test.cc M src/kudu/fs/block_manager.h M src/kudu/fs/data_dirs.h A src/kudu/fs/error_manager.h M src/kudu/fs/file_block_manager.cc M src/kudu/fs/file_block_manager.h M src/kudu/fs/fs_manager.cc M src/kudu/fs/fs_manager.h M src/kudu/fs/log_block_manager-test.cc M src/kudu/fs/log_block_manager.cc M src/kudu/fs/log_block_manager.h 12 files changed, 163 insertions(+), 21 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/19 -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 19 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] disk failure: coordinate error handling
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7029 to look at the new patch set (#17). Change subject: disk failure: coordinate error handling .. disk failure: coordinate error handling I/O errors are spawned when calls to env functions fail. The response to these errors may vary and may span many layers of abstraction. For instance, if an EIO were triggered while writing a block, one response might be to shut down the tablet whose operation just failed. In order to do this without messing up the layering, a new entity, the ErrorManager is added to coordinate error handling. Callbacks are registered with the ErrorManager as responses to errors, and callers to at-risk functions trigger these callbacks as errors are spawned. This patch includes the ErrorManager and its placement w.r.t. the BlockManager, FsManager, TSTabletManager, etc. Actual handling and testing will come in a later patch. Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 --- M src/kudu/fs/block_manager-stress-test.cc M src/kudu/fs/block_manager-test.cc M src/kudu/fs/block_manager.h M src/kudu/fs/data_dirs.h A src/kudu/fs/error_manager.h M src/kudu/fs/file_block_manager.cc M src/kudu/fs/file_block_manager.h M src/kudu/fs/fs_manager.cc M src/kudu/fs/fs_manager.h M src/kudu/fs/log_block_manager-test.cc M src/kudu/fs/log_block_manager.cc M src/kudu/fs/log_block_manager.h M src/kudu/tablet/mvcc.cc M src/kudu/tserver/tablet_server.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/ts_tablet_manager.h 16 files changed, 184 insertions(+), 26 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/17 -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 17 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] disk failure: coordinate error handling
Adar Dembo has posted comments on this change. Change subject: disk failure: coordinate error handling .. Patch Set 16: (1 comment) http://gerrit.cloudera.org:8080/#/c/7029/16/src/kudu/tserver/tablet_server.cc File src/kudu/tserver/tablet_server.cc: Line 109: fs_manager_->SetErrorNotificationCb(Bind(::FailTabletsInDataDir, > The idea here is to keep all the "wiring" at as high a level as possible. T I can't speak for Todd, but I think his beef was primarily with the ctor/dtor aspect of the wiring. Init/Shutdown within TSTabletManager seems like it respects that, and it has the added bonus that the dependency affects two modules (FsManager and TSTabletManager) rather than three (FsManager, TSTabletManager, and TabletServer). Your second point is valid, though if that happens we can always refactor. -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 16 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] disk failure: coordinate error handling
Andrew Wong has posted comments on this change. Change subject: disk failure: coordinate error handling .. Patch Set 16: (1 comment) http://gerrit.cloudera.org:8080/#/c/7029/16/src/kudu/tserver/tablet_server.cc File src/kudu/tserver/tablet_server.cc: Line 109: fs_manager_->SetErrorNotificationCb(Bind(::FailTabletsInDataDir, > Why not do this in TSTabletManager::Init (and unset it in TSTabletManager:: The idea here is to keep all the "wiring" at as high a level as possible. Todd pointed out that it seems a bit strange for the initialization of a TSTabletManager to have an effect on the FsManager (these were originally in the c'tor and d'tor of TSTabletManager). Another thought might be that if the error manager is ever used for other types of errors, setting these callbacks in one place may be favorable vs spreading them across multiple Init()s. -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 16 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] disk failure: coordinate error handling
Adar Dembo has posted comments on this change. Change subject: disk failure: coordinate error handling .. Patch Set 16: (2 comments) http://gerrit.cloudera.org:8080/#/c/7029/16/src/kudu/tserver/tablet_server.cc File src/kudu/tserver/tablet_server.cc: Line 109: fs_manager_->SetErrorNotificationCb(Bind(::FailTabletsInDataDir, Why not do this in TSTabletManager::Init (and unset it in TSTabletManager::Shutdown)? Then FailTabletsInDataDir() could be private. http://gerrit.cloudera.org:8080/#/c/7029/16/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: Line 1067: uint16_t* uuid_idx = fs_manager_->dd_manager()->GetUuidIdxForUuid(uuid); Maybe pull out fs_manager_->dd_manager() into a local variable so you don't have to repeat it? -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 16 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] disk failure: coordinate error handling
Andrew Wong has posted comments on this change. Change subject: disk failure: coordinate error handling .. Patch Set 16: (22 comments) http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/block_manager.h File src/kudu/fs/block_manager.h: Line 199: > Since this is a raw pointer, you should say something about lifespans here. Done http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/error_manager.h File src/kudu/fs/error_manager.h: Line 32: typedef CallbackErrorNotificationCb; > Hmm, why must this be defined here and in fs_manager.h? Surely we could def Done, keeping it here and putting error_manager.h in fs_manager.h. Line 36: // multiple layers, many of which we prefer to keep separate. The FsErrorManager > These macros probably don't belong in this patch; I don't see them used at Done PS15, Line 62: > What does "blocks" mean here? If this is ReadableBlock/WritableBlock, there Done Line 66: } // namespace kudu > Nit: separate on each line. Done Line 82 > This is a pretty meaty method; why not move it to a .cc file? Moved all the logic out of the error manager. Line 100 > Should this be a CHECK/DCHECK? Under what circumstances would you expect th You're right that this should never happen. Done PS15, Line 106: : : > This is too much cross-layer mingling. Reword it so it's scoped to only how Done http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/file_block_manager.cc File src/kudu/fs/file_block_manager.cc: Line 388: FileReadableBlock(const FileBlockManager* block_manager, BlockId block_id, > Why did you have to change this? And below? For a future patch, it was needed to set the DataDirManager in the FsErrorManager. This isn't a requirement anymore so I'll revert it. Line 544:"file_block_manager", > I don't particularly care for this. Why not hoist the DataDirManager out of I've restructured this a bit so the error manager doesn't need to know about any external state other than the callbacks. I do intend on moving the DataDirManager out of the BlockManager (and also likely renaming it to DirectoryManager or something), but in another patch. http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/file_block_manager.h File src/kudu/fs/file_block_manager.h: Line 106 > FWIW, this is a style that I employ, reminiscent of Javadoc, where the logi I see, I've seen this style around, just not with such short sections. Reverting Line 98: > Declare this virtual in BlockManager and override it here. Same for log_blo Done http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/fs_manager.h File src/kudu/fs/fs_manager.h: Line 49: > I've only finished reviewing this patch (and not the rest) so I don't even The key, I think, was pushing all of the logic into the clearly-defined layers (e.g. fail the tablets in the TSTabletManager instead of here). Doing so also opens it up for a bit more flexibility (using a string instead of set as an input to this cb). Line 49: > did you consider a design where the callback just notifies the failure by U Done Line 124: // This callback will be run when block operations run into disk failures. > Should also explain what this means. When will this callback be invoked? Wh Done PS15, Line 125: e input to the callback > think this would be better SetErrorNotificationCallback to match the type Done http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/log_block_manager.h File src/kudu/fs/log_block_manager.h: Line 365: > Add a little comment here. Done http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: PS15, Line 164: // Search for tablets in the metadata dir. : vector tablet_ids; > it seems a bit messy to have this being "wired" from the constructor. usual Done, moved to the wiring to TabletServer Start()/Shutdown() PS15, Line 175: int loaded_count = 0; : for (const string& tablet_id : tablet_ids) { > same, seems a little not-quite-right here Done Line 611: TabletDataState data_state = replica->tablet_metadata()->tablet_data_state(); > Did you intend for this decomposition to be in this patch? Since there's st Hrm, yeah originally there was a usage that was very specific to disk failure handling that's in a separate patch. Will revert. http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/tserver/ts_tablet_manager.h File src/kudu/tserver/ts_tablet_manager.h: PS15, Line 132: scoped_refptr* replica) const; : : // Same as LookupTablet but doesn't acquired the shared lock. : bool LookupTabletUnlocked(const std::string& tablet_id, : scoped_refptr* replica) const; : : virtual Status
[kudu-CR] disk failure: coordinate error handling
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7029 to look at the new patch set (#16). Change subject: disk failure: coordinate error handling .. disk failure: coordinate error handling I/O errors are spawned when calls to env functions fail. The response to these errors may vary and may span many layers of abstraction. For instance, if an EIO were triggered while writing a block, one response might be to shut down the tablet whose operation just failed. In order to do this without messing up the layering, a new entity, the ErrorManager is added to coordinate error handling. Callbacks are registered with the ErrorManager as responses to errors, and callers to at-risk functions trigger these callbacks as errors are spawned. This patch includes the ErrorManager and its placement w.r.t. the BlockManager, FsManager, TSTabletManager, etc. Actual handling and testing will come in a later patch. Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 --- M src/kudu/fs/block_manager-stress-test.cc M src/kudu/fs/block_manager-test.cc M src/kudu/fs/block_manager.h M src/kudu/fs/data_dirs.h A src/kudu/fs/error_manager.h M src/kudu/fs/file_block_manager.cc M src/kudu/fs/file_block_manager.h M src/kudu/fs/fs_manager.cc M src/kudu/fs/fs_manager.h M src/kudu/fs/log_block_manager-test.cc M src/kudu/fs/log_block_manager.cc M src/kudu/fs/log_block_manager.h M src/kudu/tablet/mvcc.cc M src/kudu/tserver/tablet_server.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/ts_tablet_manager.h 16 files changed, 184 insertions(+), 26 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/16 -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 16 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] disk failure: coordinate error handling
Adar Dembo has posted comments on this change. Change subject: disk failure: coordinate error handling .. Patch Set 15: (16 comments) http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/block_manager.h File src/kudu/fs/block_manager.h: Line 199: // Manager for coordinating error-handling. Since this is a raw pointer, you should say something about lifespans here. I bet the FsErrorManager has to outlive the BlockManager, right? Actually, since this isn't optional, I'd add it as a standalone argument to the FBM/LBM constructors instead. http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/error_manager.h File src/kudu/fs/error_manager.h: Line 32: typedef CallbackErrorNotificationCallback; Hmm, why must this be defined here and in fs_manager.h? Surely we could define it in the "dependent" header and have the "depending" header include it? Line 36: // Evaluates the expression and handles it if it results in an error. These macros probably don't belong in this patch; I don't see them used at all. PS15, Line 62: blocks What does "blocks" mean here? If this is ReadableBlock/WritableBlock, there's not enough context to understand why they'd need to call the callback. Line 66: FsErrorManager() : dd_manager_(nullptr), notify_cb_(Bind()) {} Nit: separate on each line. Line 82: void FailTabletsInDataDir(DataDir* dir) { This is a pretty meaty method; why not move it to a .cc file? Line 100: LOG(ERROR) << strings::Substitute("Dir $0 not tracked by DataDirManager", dir->dir()); Should this be a CHECK/DCHECK? Under what circumstances would you expect this to fire in production? PS15, Line 106: // Callback that fails the TSTabletManager's tablet peers. : // The referenced TSTabletManager may be deleted before the FsErrorManager, : // so this may be null. This is too much cross-layer mingling. Reword it so it's scoped to only how FsErrorManager interacts with it, and how registrants should behave when it's called. http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/file_block_manager.cc File src/kudu/fs/file_block_manager.cc: Line 388: FileReadableBlock(FileBlockManager* block_manager, BlockId block_id, Why did you have to change this? And below? Line 544: error_manager_->SetDataDirManager(_manager_); I don't particularly care for this. Why not hoist the DataDirManager out of the block managers and let it be owned by the FsManager? At that point the FsManager creates the BlockManager, DataDirManager, and FsErrorManager, and it can order construction so that everyone gets the right pointer. Probably DataDirManager first, then FsErrorManager, then BlockManager? http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/file_block_manager.h File src/kudu/fs/file_block_manager.h: Line 106 FWIW, this is a style that I employ, reminiscent of Javadoc, where the logical sections of the function's description are separated by empty lines to emphasize their differences. Line 98: FsErrorManager* error_manager() { return error_manager_; } Declare this virtual in BlockManager and override it here. Same for log_block_manager.h. http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/fs_manager.h File src/kudu/fs/fs_manager.h: Line 49: typedef Callback ErrorNotificationCallback; > did you consider a design where the callback just notifies the failure by U I've only finished reviewing this patch (and not the rest) so I don't even know what set means (could use a comment!). But I agree with Todd that fewer cross-references between these classes would probably lead to a clearer implementation. Line 124: // Registers a callback with the FsErrorManager. Should also explain what this means. When will this callback be invoked? When invoked, what's the significance of its argument? Is it important to unregister it? http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/log_block_manager.h File src/kudu/fs/log_block_manager.h: Line 365: FsErrorManager* error_manager_; Add a little comment here. http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: Line 611: Status TSTabletManager::TransitionTabletState( Did you intend for this decomposition to be in this patch? Since there's still just one call into TransitionTabletState(), it's not clear why bother to decompose here. Todd was sort of getting at the same thing when he asked why this is public; presumably its for a different patch? -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 15 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo
[kudu-CR] disk failure: coordinate error handling
Andrew Wong has posted comments on this change. Change subject: disk failure: coordinate error handling .. Patch Set 8: (2 comments) http://gerrit.cloudera.org:8080/#/c/7029/14/src/kudu/fs/block_manager-stress-test.cc File src/kudu/fs/block_manager-stress-test.cc: Line 29: #include "kudu/fs/log_block_manager-test-util.h" > warning: #includes are not sorted properly [llvm-include-order] Done http://gerrit.cloudera.org:8080/#/c/7029/14/src/kudu/fs/log_block_manager-test.cc File src/kudu/fs/log_block_manager-test.cc: Line 28: #include "kudu/fs/log_block_manager-test-util.h" > warning: #includes are not sorted properly [llvm-include-order] Done -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 8 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] disk failure: coordinate error handling
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7029 to look at the new patch set (#14). Change subject: disk failure: coordinate error handling .. disk failure: coordinate error handling I/O errors are spawned when calls to env functions fail. The response to these errors may vary and may span many layers of abstraction. For instance, if an EIO were triggered while writing a block, one response might be to shut down the tablet whose operation just failed. In order to do this without messing up the layering, a new entity, the FsErrorManager is added to coordinate error handling. Callbacks are registered with the FsErrorManager as responses to errors, and callers to at-risk functions trigger these callbacks as errors are spawned. This patch includes the FsErrorManager and its placement w.r.t. the BlockManager, FsManager, TSTabletManager, etc. Actual handling and testing will come in a later patch. Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 --- M src/kudu/fs/block_manager-stress-test.cc M src/kudu/fs/block_manager-test.cc M src/kudu/fs/block_manager.h A src/kudu/fs/error_manager.h M src/kudu/fs/file_block_manager.cc M src/kudu/fs/file_block_manager.h M src/kudu/fs/fs_manager.cc M src/kudu/fs/fs_manager.h M src/kudu/fs/log_block_manager-test.cc M src/kudu/fs/log_block_manager.cc M src/kudu/fs/log_block_manager.h M src/kudu/tablet/mvcc.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/ts_tablet_manager.h 14 files changed, 233 insertions(+), 34 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/14 -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 14 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] disk failure: coordinate error handling
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7029 to look at the new patch set (#12). Change subject: disk failure: coordinate error handling .. disk failure: coordinate error handling I/O errors are spawned when calls to env functions fail. The response to these errors may vary and may span many layers of abstraction. For instance, if an EIO were triggered while writing a block, one response might be to shut down the tablet whose operation just failed. In order to do this without messing up the layering, a new entity, the FsErrorManager is added to coordinate error handling. Callbacks are registered with the FsErrorManager as responses to errors, and callers to at-risk functions trigger these callbacks as errors are spawned. This patch includes the FsErrorManager and its placement w.r.t. the BlockManager, FsManager, TSTabletManager, etc. Actual handling and testing will come in a later patch. Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 --- M src/kudu/fs/block_manager-stress-test.cc M src/kudu/fs/block_manager-test.cc M src/kudu/fs/block_manager.h A src/kudu/fs/error_manager.h M src/kudu/fs/file_block_manager.cc M src/kudu/fs/file_block_manager.h M src/kudu/fs/fs_manager.cc M src/kudu/fs/fs_manager.h M src/kudu/fs/log_block_manager-test.cc M src/kudu/fs/log_block_manager.cc M src/kudu/fs/log_block_manager.h M src/kudu/tablet/mvcc.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/ts_tablet_manager.h 14 files changed, 231 insertions(+), 34 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/12 -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 12 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] disk failure: coordinate error handling
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7029 to look at the new patch set (#11). Change subject: disk failure: coordinate error handling .. disk failure: coordinate error handling I/O errors are spawned when calls to env functions fail. The response to these errors may vary and may span many layers of abstraction. For instance, if an EIO were triggered while writing a block, one response might be to shut down the tablet whose operation just failed. In order to do this without messing up the layering, a new entity, the FsErrorManager is added to coordinate error handling. Callbacks are registered with the FsErrorManager as responses to errors, and callers to at-risk functions trigger these callbacks as errors are spawned. This patch includes the FsErrorManager and its placement w.r.t. the BlockManager, FsManager, TSTabletManager, etc. Actual handling and testing will come in a later patch. Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 --- M src/kudu/fs/block_manager-stress-test.cc M src/kudu/fs/block_manager-test.cc M src/kudu/fs/block_manager.h A src/kudu/fs/error_manager.h M src/kudu/fs/file_block_manager.cc M src/kudu/fs/file_block_manager.h M src/kudu/fs/fs_manager.cc M src/kudu/fs/fs_manager.h M src/kudu/fs/log_block_manager-test.cc M src/kudu/fs/log_block_manager.cc M src/kudu/fs/log_block_manager.h M src/kudu/master/sys_catalog.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/ts_tablet_manager.h 14 files changed, 260 insertions(+), 33 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/11 -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 11 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] disk failure: coordinate error handling
Andrew Wong has posted comments on this change. Change subject: disk failure: coordinate error handling .. Patch Set 3: (12 comments) http://gerrit.cloudera.org:8080/#/c/7029/8/src/kudu/fs/error_manager.h File src/kudu/fs/error_manager.h: PS8, Line 60: LOG(WARNING) << strings::Substitute("Dir $0 failed, marking for failure", dir->dir()); : uint16_t uuid_idx; : if (dd_manager_->FindUuidIndexByDataDir(dir, _idx)) { : // If the directory is registered as failed, ignore it. : if (dd_manager_->IsDataDirFailed( > hm, I dont think "fallthrough intended" is really necessary here. Usually s Done PS8, Line 74: : return; > not sure what this means Done PS8, Line 80: > maybe use a typedef for this? Done Line 81: // Must outlive the eio error manager. > std::move(cb)? Done Line 125 > nit: I think it's better to keep the specifics of what the callback _does_ Done. I agree the error_manager is meant to handle more than just disk failures. I was thinking of having something like (failure type => get_observers), (failure type => notify_cb), (failure type => do other stuff) might be warranted later on. Or just leave it completely open-ended (failure type => handle error). The question is: design that now or when we have more handleable errors? http://gerrit.cloudera.org:8080/#/c/7029/8/src/kudu/fs/fs_manager.h File src/kudu/fs/fs_manager.h: PS8, Line 121: * > why is this taking a pointer to a callback instead of just taking a callbac Done Line 214: void SetTabletFailedCallback(Callback* cb); > same Done http://gerrit.cloudera.org:8080/#/c/7029/8/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: Line 684: LOG(WARNING) << s.ToString(); > not quite following the purpose of this parameter. In the case that delete_ Done. Split out the above code to transition tablet state. Only that and replica->Shutdown() are needed from what I can tell. Line 804: MonoTime start(MonoTime::Now()); > is there a DCHECK we could add here that the replica is already marked fail Moving this logic out of this patch, but done. http://gerrit.cloudera.org:8080/#/c/7029/8/src/kudu/tserver/ts_tablet_manager.h File src/kudu/tserver/ts_tablet_manager.h: Line 131: > can you update the doc as to why you would want to DeleteTablet without del Took your other suggestion of bring out the needed logic out into its own function. PS8, Line 190: // TODO(awong): rather than deleting the tablet, don't do anything to the : // tablet metadata and just store disk failure state instead. : void FailTabletReplicas(const std::set& tablet_ids); : > the docs don't seem to match the name here. Is this shutting it down or try Done Line 294: > I think if you made the other classes take the callback by value instead of Done -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] disk failure: coordinate error handling
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7029 to look at the new patch set (#10). Change subject: disk failure: coordinate error handling .. disk failure: coordinate error handling I/O errors are spawned when calls to env functions fail. The response to these errors may vary and may span many layers of abstraction. For instance, if an EIO were triggered while writing a block, one response might be to shut down the tablet whose operation just failed. In order to do this without messing up the layering, a new entity, the FsErrorManager is added to coordinate error handling. Callbacks are registered with the FsErrorManager as responses to errors, and callers to at-risk functions trigger these callbacks as errors are spawned. This patch includes the FsErrorManager and its placement w.r.t. the BlockManager, FsManager, TSTabletManager, etc. Usage and testing will come in a later patch. Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 --- M src/kudu/fs/block_manager-test.cc M src/kudu/fs/block_manager.h A src/kudu/fs/error_manager.h M src/kudu/fs/file_block_manager.cc M src/kudu/fs/file_block_manager.h M src/kudu/fs/fs_manager.cc M src/kudu/fs/fs_manager.h M src/kudu/fs/log_block_manager-test.cc M src/kudu/fs/log_block_manager.cc M src/kudu/fs/log_block_manager.h M src/kudu/master/sys_catalog.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/ts_tablet_manager.h 13 files changed, 267 insertions(+), 37 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/10 -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 10 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] disk failure: coordinate error handling
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7029 to look at the new patch set (#9). Change subject: disk failure: coordinate error handling .. disk failure: coordinate error handling I/O errors are spawned when calls to env functions fail. The response to these errors may vary and may span many layers of abstraction. For instance, if an EIO were triggered while writing a block, one response might be to shut down the tablet whose operation just failed. In order to do this without messing up the layering, a new entity, the FsErrorManager is added to coordinate error handling. Callbacks are registered with the FsErrorManager as responses to errors, and callers to at-risk functions trigger these callbacks as errors are spawned. This patch includes the FsErrorManager and its placement w.r.t. the BlockManager, FsManager, TSTabletManager, etc. Usage and testing will come in a later patch. Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 --- M src/kudu/fs/block_manager-test.cc M src/kudu/fs/block_manager.h A src/kudu/fs/error_manager.h M src/kudu/fs/file_block_manager.cc M src/kudu/fs/file_block_manager.h M src/kudu/fs/fs_manager.cc M src/kudu/fs/fs_manager.h M src/kudu/fs/log_block_manager-test.cc M src/kudu/fs/log_block_manager.cc M src/kudu/fs/log_block_manager.h M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/ts_tablet_manager.h 12 files changed, 269 insertions(+), 37 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/9 -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 9 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] disk failure: coordinate error handling
Todd Lipcon has posted comments on this change. Change subject: disk failure: coordinate error handling .. Patch Set 8: (12 comments) http://gerrit.cloudera.org:8080/#/c/7029/8/src/kudu/fs/error_manager.h File src/kudu/fs/error_manager.h: PS8, Line 60: case EIO: // Fallthrough intended : case ENODEV: // Fallthrough intended : case ENOSPC: // Fallthrough intended : case ENXIO: // Fallthrough intended : case EROFS: // Fallthrough intended hm, I dont think "fallthrough intended" is really necessary here. Usually such a comment would be used when there is actual code inside of a 'case' label and then falls through to the below one, like: switch (foo) { case A: DoSomething(); FALLTHROUGH_INTENDED; case B: DoSomethingElse(); break; } (see gutil/macros.h for the above macro and example usage -- it actually has some effect on clang warnings as well) PS8, Line 74: that blocks known to the : // BlockManager can call. not sure what this means PS8, Line 80: Callback* cb maybe use a typedef for this? Line 81: shutdown_replicas_cb_ = cb; std::move(cb)? Line 125: Callback * shutdown_replicas_cb_; nit: I think it's better to keep the specifics of what the callback _does_ out of this class. i.e from this class's perspectives, it's not necessary that it is going to shutdown replicas. Instead maybe just something like notify_cb_? http://gerrit.cloudera.org:8080/#/c/7029/8/src/kudu/fs/fs_manager.h File src/kudu/fs/fs_manager.h: PS8, Line 121: * why is this taking a pointer to a callback instead of just taking a callback? Line 214: void SetTabletFailedCallback(Callback * cb); same http://gerrit.cloudera.org:8080/#/c/7029/8/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: Line 684: if (delete_tablet_data) { not quite following the purpose of this parameter. In the case that delete_tablet_data is false, then how much of the rest of this function is still relevant? eg opt_last_logged_opid won't be used, the above CAS-related stuff isn't used, right? Maybe refactoring the function so the common bits can be reused, but introducing a new function like MarkFailedAndShutdown(tablet_id) would be clearer? Line 804: // Disk failures should have been handled. is there a DCHECK we could add here that the replica is already marked failed if not? http://gerrit.cloudera.org:8080/#/c/7029/8/src/kudu/tserver/ts_tablet_manager.h File src/kudu/tserver/ts_tablet_manager.h: Line 131: bool delete_tablet_data = true); can you update the doc as to why you would want to DeleteTablet without deleting the data? PS8, Line 190: // Asynchronously shut down a tablet replica by deleting the tablet. : // TODO(awong): rather than deleting the tablet, don't do anything to the : // tablet metadata and just store disk failure state instead. : void FailTabletReplicas(const std::set& tablet_ids); the docs don't seem to match the name here. Is this shutting it down or trying to delete on-disk data, etc? Line 294: Callback shutdown_replicas_cb_; I think if you made the other classes take the callback by value instead of by pointer, you wouldn't need this member -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 8 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] disk failure: coordinate error handling
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7029 to look at the new patch set (#8). Change subject: disk failure: coordinate error handling .. disk failure: coordinate error handling I/O errors are spawned when calls to env functions fail. The response to these errors may vary and may span many layers of abstraction. For instance, if an EIO were triggered while writing a block, one response might be to shut down the tablet whose operation just failed. In order to do this without messing up the layering, a new entity, the FsErrorManager is added to coordinate error handling. Callbacks are registered with the FsErrorManager as responses to errors, and callers to at-risk functions trigger these callbacks as errors are spawned. This patch includes the FsErrorManager and its placement w.r.t. the BlockManager, FsManager, TSTabletManager, etc. Usage and testing will come in a later patch. Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 --- M src/kudu/fs/block_manager-test.cc M src/kudu/fs/block_manager.h A src/kudu/fs/error_manager.h M src/kudu/fs/file_block_manager.cc M src/kudu/fs/file_block_manager.h M src/kudu/fs/fs_manager.cc M src/kudu/fs/fs_manager.h M src/kudu/fs/log_block_manager.cc M src/kudu/fs/log_block_manager.h M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/ts_tablet_manager.h 11 files changed, 236 insertions(+), 28 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/8 -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 8 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] disk failure: coordinate error handling
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7029 to look at the new patch set (#7). Change subject: disk failure: coordinate error handling .. disk failure: coordinate error handling I/O errors are spawned when calls to env functions fail. The response to these errors may vary and may span many layers of abstraction. For instance, if an EIO were triggered while writing a block, one response might be to shut down the tablet whose operation just failed. In order to do this without messing up the layering, a new entity, the FsErrorManager is added to coordinate error handling. Callbacks are registered with the FsErrorManager as responses to errors, and callers to at-risk functions trigger these callbacks as errors are spawned. This patch includes the FsErrorManager and its placement w.r.t. the BlockManager, FsManager, TSTabletManager, etc. Usage and testing will come in a later patch. Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 --- M src/kudu/fs/block_manager-test.cc M src/kudu/fs/block_manager.h A src/kudu/fs/error_manager.h M src/kudu/fs/file_block_manager.cc M src/kudu/fs/file_block_manager.h M src/kudu/fs/fs_manager.cc M src/kudu/fs/fs_manager.h M src/kudu/fs/log_block_manager.cc M src/kudu/fs/log_block_manager.h M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/ts_tablet_manager.h 11 files changed, 238 insertions(+), 28 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/7 -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 7 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] disk failure: coordinate error handling
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7029 to look at the new patch set (#6). Change subject: disk failure: coordinate error handling .. disk failure: coordinate error handling I/O errors are spawned when calls to env functions fail. The response to these errors may vary and may span many layers of abstraction. For instance, if an EIO were triggered while writing a block, one response might be to shut down the tablet whose operation just failed. In order to do this without messing up the layering, a new entity, the FsErrorManager is added to coordinate error handling. Callbacks are registered with the FsErrorManager as responses to errors, and callers to at-risk functions trigger these callbacks as errors are spawned. This patch includes the FsErrorManager and its placement w.r.t. the BlockManager, FsManager, TSTabletManager, etc. Usage and testing will come in a later patch. Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 --- M src/kudu/fs/block_manager-test.cc M src/kudu/fs/block_manager.h A src/kudu/fs/error_manager.h M src/kudu/fs/file_block_manager.cc M src/kudu/fs/file_block_manager.h M src/kudu/fs/fs_manager.cc M src/kudu/fs/fs_manager.h M src/kudu/fs/log_block_manager.cc M src/kudu/fs/log_block_manager.h M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/ts_tablet_manager.h 11 files changed, 239 insertions(+), 28 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/6 -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] disk failure: coordinate error handling
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7029 to look at the new patch set (#5). Change subject: disk failure: coordinate error handling .. disk failure: coordinate error handling I/O errors are spawned when calls to env functions fail. The response to these errors may vary and may span many layers of abstraction. For instance, if an EIO were triggered while writing a block, one response might be to shut down the tablet whose operation just failed. In order to do this without messing up the layering, a new entity, the FsErrorManager is added to coordinate error handling. Callbacks are registered with the FsErrorManager as responses to errors, and callers to at-risk functions trigger these callbacks as errors are spawned. This patch includes the FsErrorManager and its placement w.r.t. the BlockManager, FsManager, TSTabletManager, etc. Usage and testing will come in a later patch. Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 --- M src/kudu/fs/block_manager-test.cc M src/kudu/fs/block_manager.h A src/kudu/fs/error_manager.h M src/kudu/fs/file_block_manager.cc M src/kudu/fs/file_block_manager.h M src/kudu/fs/fs_manager.cc M src/kudu/fs/fs_manager.h M src/kudu/fs/log_block_manager.cc M src/kudu/fs/log_block_manager.h M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/ts_tablet_manager.h 11 files changed, 238 insertions(+), 28 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/5 -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 5 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] disk failure: coordinate error handling
Andrew Wong has posted comments on this change. Change subject: disk failure: coordinate error handling .. Patch Set 4: For now I'm keeping usage and testing separate since placement and lifespan are discussion points that can be localized to a single patch. -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] disk failure: coordinate error handling
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7029 to look at the new patch set (#4). Change subject: disk failure: coordinate error handling .. disk failure: coordinate error handling I/O errors are spawned when calls to env functions fail. The response to these errors may vary and may span many layers of abstraction. For instance, if an EIO were triggered while writing a block, one response might be to shut down the tablet whose operation just failed. In order to do this without messing up the layering, a new entity, the FsErrorManager is added to coordinate error handling. Callbacks are registered with the FsErrorManager as responses to errors, and callers to at-risk functions trigger these callbacks as errors are spawned. This patch includes the FsErrorManager and its placement w.r.t. the BlockManager, FsManager, TSTabletManager, etc. Usage and testing will come in a later patch. Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 --- M src/kudu/fs/block_manager-test.cc M src/kudu/fs/block_manager.h A src/kudu/fs/error_manager.h M src/kudu/fs/file_block_manager.cc M src/kudu/fs/file_block_manager.h M src/kudu/fs/fs_manager.cc M src/kudu/fs/fs_manager.h M src/kudu/fs/log_block_manager.cc M src/kudu/fs/log_block_manager.h M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/ts_tablet_manager.h 11 files changed, 233 insertions(+), 26 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/29/7029/4 -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon