[kudu-CR] fs: defer failure from metadata load to bootstrap when data dir is missing
Adar Dembo has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8376 ) Change subject: fs: defer failure from metadata load to bootstrap when data dir is missing .. fs: defer failure from metadata load to bootstrap when data dir is missing An upcoming patch adds a CLI tool action to remove data directories. When a data dir is removed, all tablets with data on it will fail. Today that failure manifests as a FindOrDie in DataDirGroup::FromPB; we need to make that a little bit more graceful. This patch modifies the DataDirGroup FromPB/CopyToPB methods to return a failure when a data dir is missing. It further changes TabletMetadata to treat such failures as non-fatal, and adds checks to TabletBootstrap so that the failures manifest there instead. No tests in this patch because: 1. Andrew has already merged tablet-level tests for failed disks, and 2. The CLI tool patch adds coverage at the itest-level. Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad Reviewed-on: http://gerrit.cloudera.org:8080/8376 Reviewed-by: Todd Lipcon Reviewed-by: Andrew Wong Tested-by: Kudu Jenkins --- M src/kudu/fs/block_manager-stress-test.cc M src/kudu/fs/block_manager-test.cc M src/kudu/fs/data_dirs-test.cc M src/kudu/fs/data_dirs.cc M src/kudu/fs/data_dirs.h M src/kudu/fs/log_block_manager-test.cc M src/kudu/tablet/tablet_bootstrap.cc M src/kudu/tablet/tablet_metadata.cc M src/kudu/tserver/tablet_copy_client.cc M src/kudu/tserver/ts_tablet_manager.cc 10 files changed, 136 insertions(+), 69 deletions(-) Approvals: Todd Lipcon: Looks good to me, approved Andrew Wong: Looks good to me, approved Kudu Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/8376 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad Gerrit-Change-Number: 8376 Gerrit-PatchSet: 10 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon
[kudu-CR] fs: defer failure from metadata load to bootstrap when data dir is missing
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8376 ) Change subject: fs: defer failure from metadata load to bootstrap when data dir is missing .. Patch Set 9: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/8376 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad Gerrit-Change-Number: 8376 Gerrit-PatchSet: 9 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Sat, 20 Jan 2018 01:14:22 + Gerrit-HasComments: No
[kudu-CR] fs: defer failure from metadata load to bootstrap when data dir is missing
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/8376 ) Change subject: fs: defer failure from metadata load to bootstrap when data dir is missing .. Patch Set 9: Code-Review+2 (1 comment) http://gerrit.cloudera.org:8080/#/c/8376/8/src/kudu/tablet/tablet_bootstrap.cc File src/kudu/tablet/tablet_bootstrap.cc: http://gerrit.cloudera.org:8080/#/c/8376/8/src/kudu/tablet/tablet_bootstrap.cc@557 PS8, Line 557: // Ensure the tablet's data dirs are present and healthy before it is opened. > The old location was in TSTabletManager::OpenTablet, which is run out of th k... not in love with it but you're right it seems potentially tricky -- To view, visit http://gerrit.cloudera.org:8080/8376 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad Gerrit-Change-Number: 8376 Gerrit-PatchSet: 9 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Sat, 20 Jan 2018 00:51:13 + Gerrit-HasComments: Yes
[kudu-CR] fs: defer failure from metadata load to bootstrap when data dir is missing
Hello Kudu Jenkins, Andrew Wong, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8376 to look at the new patch set (#9). Change subject: fs: defer failure from metadata load to bootstrap when data dir is missing .. fs: defer failure from metadata load to bootstrap when data dir is missing An upcoming patch adds a CLI tool action to remove data directories. When a data dir is removed, all tablets with data on it will fail. Today that failure manifests as a FindOrDie in DataDirGroup::FromPB; we need to make that a little bit more graceful. This patch modifies the DataDirGroup FromPB/CopyToPB methods to return a failure when a data dir is missing. It further changes TabletMetadata to treat such failures as non-fatal, and adds checks to TabletBootstrap so that the failures manifest there instead. No tests in this patch because: 1. Andrew has already merged tablet-level tests for failed disks, and 2. The CLI tool patch adds coverage at the itest-level. Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad --- M src/kudu/fs/block_manager-stress-test.cc M src/kudu/fs/block_manager-test.cc M src/kudu/fs/data_dirs-test.cc M src/kudu/fs/data_dirs.cc M src/kudu/fs/data_dirs.h M src/kudu/fs/log_block_manager-test.cc M src/kudu/tablet/tablet_bootstrap.cc M src/kudu/tablet/tablet_metadata.cc M src/kudu/tserver/tablet_copy_client.cc M src/kudu/tserver/ts_tablet_manager.cc 10 files changed, 136 insertions(+), 69 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/76/8376/9 -- To view, visit http://gerrit.cloudera.org:8080/8376 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad Gerrit-Change-Number: 8376 Gerrit-PatchSet: 9 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon
[kudu-CR] fs: defer failure from metadata load to bootstrap when data dir is missing
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/8376 ) Change subject: fs: defer failure from metadata load to bootstrap when data dir is missing .. Patch Set 8: (5 comments) http://gerrit.cloudera.org:8080/#/c/8376/8/src/kudu/fs/data_dirs.cc File src/kudu/fs/data_dirs.cc: http://gerrit.cloudera.org:8080/#/c/8376/8/src/kudu/fs/data_dirs.cc@327 PS8, Line 327: FindOrDie(uuid_by_uuid_idx, uuid_idx) > is this not just 'uuid' above? Yeah, not sure what happened here. http://gerrit.cloudera.org:8080/#/c/8376/8/src/kudu/fs/data_dirs.cc@330 PS8, Line 330: pb->Swap(&group); > I think now that we are on newish protobuf we could do '*pb = std::move(gro Done http://gerrit.cloudera.org:8080/#/c/8376/8/src/kudu/tablet/tablet_bootstrap.cc File src/kudu/tablet/tablet_bootstrap.cc: http://gerrit.cloudera.org:8080/#/c/8376/8/src/kudu/tablet/tablet_bootstrap.cc@557 PS8, Line 557: // Ensure the tablet's data dirs are present and healthy before it is opened. > not 100% sure I follow why you defer it all the way until bootstrap time vs The old location was in TSTabletManager::OpenTablet, which is run out of the bootstrap threadpool too. If you're suggesting I do it in TabletMetadata::LoadFromSuperBlock (called synchronously for each tablet in TSTabletManager::Init), it'd require more plumbing. Why did I move it at all? My preference is for TSTabletManager to hold tserver-specific behavior only, and while the "disk failure handling" logic isn't particularly useful for the master, it could be in the future, so locating it in TabletBootstrap is an easy way to ensure it makes it into every place that uses a tablet. Tablet::Open would be equally reasonable, I think. http://gerrit.cloudera.org:8080/#/c/8376/8/src/kudu/tablet/tablet_bootstrap.cc@561 PS8, Line 561: error retrieving tablet data dir group > will this be easy enough for an operator to understand? perhaps we should s Done http://gerrit.cloudera.org:8080/#/c/8376/8/src/kudu/tablet/tablet_bootstrap.cc@563 PS8, Line 563: return Status::IOError("tablet data is in a failed directory"); > perhaps more accurate to say that _some_ tablet data is in a failed directo I'll clarify with 'some', but if you follow the return chain up the stack you'll see that TSTabletManager::OpenTablet prefixes the message with the tablet ID before it logs it. -- To view, visit http://gerrit.cloudera.org:8080/8376 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad Gerrit-Change-Number: 8376 Gerrit-PatchSet: 8 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 18 Jan 2018 22:57:43 + Gerrit-HasComments: Yes
[kudu-CR] fs: defer failure from metadata load to bootstrap when data dir is missing
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/8376 ) Change subject: fs: defer failure from metadata load to bootstrap when data dir is missing .. Patch Set 8: (5 comments) http://gerrit.cloudera.org:8080/#/c/8376/8/src/kudu/fs/data_dirs.cc File src/kudu/fs/data_dirs.cc: http://gerrit.cloudera.org:8080/#/c/8376/8/src/kudu/fs/data_dirs.cc@327 PS8, Line 327: FindOrDie(uuid_by_uuid_idx, uuid_idx) is this not just 'uuid' above? http://gerrit.cloudera.org:8080/#/c/8376/8/src/kudu/fs/data_dirs.cc@330 PS8, Line 330: pb->Swap(&group); I think now that we are on newish protobuf we could do '*pb = std::move(group);' which reads a bit better http://gerrit.cloudera.org:8080/#/c/8376/8/src/kudu/tablet/tablet_bootstrap.cc File src/kudu/tablet/tablet_bootstrap.cc: http://gerrit.cloudera.org:8080/#/c/8376/8/src/kudu/tablet/tablet_bootstrap.cc@557 PS8, Line 557: // Ensure the tablet's data dirs are present and healthy before it is opened. not 100% sure I follow why you defer it all the way until bootstrap time vs marking it FAILED in TSTabletManager::Init immediately after it opens the TabletMeta (the old behavior)? Seems that way we'd have the correct failed status for missing tablets as early as possible vs the possibility that those missing tablets end up waiting on the bootstrap threadpool queue for many minutes behind non-failed tablets doing their log replay http://gerrit.cloudera.org:8080/#/c/8376/8/src/kudu/tablet/tablet_bootstrap.cc@561 PS8, Line 561: error retrieving tablet data dir group will this be easy enough for an operator to understand? perhaps we should say something explanatory like "(data directory likely removed)" or somesuch? http://gerrit.cloudera.org:8080/#/c/8376/8/src/kudu/tablet/tablet_bootstrap.cc@563 PS8, Line 563: return Status::IOError("tablet data is in a failed directory"); perhaps more accurate to say that _some_ tablet data is in a failed directory? Do you think we could include the uuid in the message for shits and giggles? -- To view, visit http://gerrit.cloudera.org:8080/8376 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad Gerrit-Change-Number: 8376 Gerrit-PatchSet: 8 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 18 Jan 2018 18:51:48 + Gerrit-HasComments: Yes
[kudu-CR] fs: defer failure from metadata load to bootstrap when data dir is missing
Hello Kudu Jenkins, Andrew Wong, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8376 to look at the new patch set (#8). Change subject: fs: defer failure from metadata load to bootstrap when data dir is missing .. fs: defer failure from metadata load to bootstrap when data dir is missing An upcoming patch adds a CLI tool action to remove data directories. When a data dir is removed, all tablets with data on it will fail. Today that failure manifests as a FindOrDie in DataDirGroup::FromPB; we need to make that a little bit more graceful. This patch modifies the DataDirGroup FromPB/CopyToPB methods to return a failure when a data dir is missing. It further changes TabletMetadata to treat such failures as non-fatal, and adds checks to TabletBootstrap so that the failures manifest there instead. No tests in this patch because: 1. Andrew has already merged tablet-level tests for failed disks, and 2. The CLI tool patch adds coverage at the itest-level. Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad --- M src/kudu/fs/block_manager-stress-test.cc M src/kudu/fs/block_manager-test.cc M src/kudu/fs/data_dirs-test.cc M src/kudu/fs/data_dirs.cc M src/kudu/fs/data_dirs.h M src/kudu/fs/log_block_manager-test.cc M src/kudu/tablet/tablet_bootstrap.cc M src/kudu/tablet/tablet_metadata.cc M src/kudu/tserver/tablet_copy_client.cc M src/kudu/tserver/ts_tablet_manager.cc 10 files changed, 136 insertions(+), 69 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/76/8376/8 -- To view, visit http://gerrit.cloudera.org:8080/8376 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad Gerrit-Change-Number: 8376 Gerrit-PatchSet: 8 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon
[kudu-CR] fs: defer failure from metadata load to bootstrap when data dir is missing
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/8376 ) Change subject: fs: defer failure from metadata load to bootstrap when data dir is missing .. Patch Set 7: (2 comments) http://gerrit.cloudera.org:8080/#/c/8376/6//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/8376/6//COMMIT_MSG@9 PS6, Line 9: remove data di > nit: just remove at this point Done http://gerrit.cloudera.org:8080/#/c/8376/6//COMMIT_MSG@20 PS6, Line 20: has already merged > nit: has already worked Done -- To view, visit http://gerrit.cloudera.org:8080/8376 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad Gerrit-Change-Number: 8376 Gerrit-PatchSet: 7 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 17 Jan 2018 22:32:09 + Gerrit-HasComments: Yes
[kudu-CR] fs: defer failure from metadata load to bootstrap when data dir is missing
Hello Kudu Jenkins, Andrew Wong, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8376 to look at the new patch set (#7). Change subject: fs: defer failure from metadata load to bootstrap when data dir is missing .. fs: defer failure from metadata load to bootstrap when data dir is missing An upcoming patch adds CLI tool actions to remove data directories. When a data dir is removed, all tablets with data on it will fail. Today that failure manifests as a FindOrDie in DataDirGroup::FromPB; we need to make that a little bit more graceful. This patch modifies the DataDirGroup FromPB/CopyToPB methods to return a failure when a data dir is missing. It further changes TabletMetadata to treat such failures as non-fatal, and adds checks to TabletBootstrap so that the failures manifest there instead. No tests in this patch because: 1. Andrew has already merged tablet-level tests for failed disks, and 2. The CLI tool patch adds coverage at the itest-level. Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad --- M src/kudu/fs/block_manager-stress-test.cc M src/kudu/fs/block_manager-test.cc M src/kudu/fs/data_dirs-test.cc M src/kudu/fs/data_dirs.cc M src/kudu/fs/data_dirs.h M src/kudu/fs/log_block_manager-test.cc M src/kudu/tablet/tablet_bootstrap.cc M src/kudu/tablet/tablet_metadata.cc M src/kudu/tserver/tablet_copy_client.cc M src/kudu/tserver/ts_tablet_manager.cc 10 files changed, 136 insertions(+), 69 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/76/8376/7 -- To view, visit http://gerrit.cloudera.org:8080/8376 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad Gerrit-Change-Number: 8376 Gerrit-PatchSet: 7 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon
[kudu-CR] fs: defer failure from metadata load to bootstrap when data dir is missing
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8376 ) Change subject: fs: defer failure from metadata load to bootstrap when data dir is missing .. Patch Set 6: Code-Review+1 (2 comments) http://gerrit.cloudera.org:8080/#/c/8376/6//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/8376/6//COMMIT_MSG@9 PS6, Line 9: add and remove nit: just remove at this point http://gerrit.cloudera.org:8080/#/c/8376/6//COMMIT_MSG@20 PS6, Line 20: is already working nit: has already worked -- To view, visit http://gerrit.cloudera.org:8080/8376 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad Gerrit-Change-Number: 8376 Gerrit-PatchSet: 6 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 17 Jan 2018 19:38:31 + Gerrit-HasComments: Yes
[kudu-CR] fs: defer failure from metadata load to bootstrap when data dir is missing
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8376 ) Change subject: fs: defer failure from metadata load to bootstrap when data dir is missing .. Patch Set 5: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/8376 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad Gerrit-Change-Number: 8376 Gerrit-PatchSet: 5 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Sat, 13 Jan 2018 03:28:38 + Gerrit-HasComments: No
[kudu-CR] fs: defer failure from metadata load to bootstrap when data dir is missing
Hello Kudu Jenkins, Andrew Wong, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8376 to look at the new patch set (#3). Change subject: fs: defer failure from metadata load to bootstrap when data dir is missing .. fs: defer failure from metadata load to bootstrap when data dir is missing An upcoming patch adds CLI tool actions to add and remove data directories. When a data dir is removed, all tablets with data on it will fail. Today that failure manifests as a FindOrDie in DataDirGroup::FromPB; we need to make that a little bit more graceful. This patch modifies the DataDirGroup FromPB/CopyToPB methods to return a failure when a data dir is missing. It further changes TabletMetadata to treat such failures as non-fatal, and adds checks to TabletBootstrap so that the failures manifest there instead. No tests in this patch because: 1. Andrew is already working on tablet-level tests for failed disks, and 2. The CLI tool patch adds coverage at the itest-level. Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad --- M src/kudu/fs/block_manager-stress-test.cc M src/kudu/fs/block_manager-test.cc M src/kudu/fs/data_dirs-test.cc M src/kudu/fs/data_dirs.cc M src/kudu/fs/data_dirs.h M src/kudu/fs/log_block_manager-test.cc M src/kudu/tablet/tablet_bootstrap.cc M src/kudu/tablet/tablet_metadata.cc M src/kudu/tserver/tablet_copy_client.cc M src/kudu/tserver/ts_tablet_manager.cc 10 files changed, 136 insertions(+), 69 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/76/8376/3 -- To view, visit http://gerrit.cloudera.org:8080/8376 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad Gerrit-Change-Number: 8376 Gerrit-PatchSet: 3 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon
[kudu-CR] fs: defer failure from metadata load to bootstrap when data dir is missing
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/8376 ) Change subject: fs: defer failure from metadata load to bootstrap when data dir is missing .. Patch Set 2: (6 comments) http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.h File src/kudu/fs/data_dirs.h: http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.h@81 PS2, Line 81: Status LoadFromPB(const UuidIndexByUuidMap& uuid_idx_by_uuid, > worth short docs explaining the cases of bad-status here. Done http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.h@84 PS2, Line 84: Status CopyToPB(const UuidByUuidIndexMap& uuid_by_uuid_idx, > and here Done http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.h@276 PS2, Line 276: data dir : // is missing. > only looked at the headers so far (not the impl) but this left me a little Given my response to your other comment (in the impl), is this still relevant? http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.cc File src/kudu/fs/data_dirs.cc: http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.cc@318 PS2, Line 318: if (!FindCopy(uuid_by_uuid_idx, uuid_idx, &uuid)) { : return Status::NotFound(Substitute( : "could not find data dir with uuid index $0", uuid_idx)); > would this not be a programmer error? how would you end up with a uuid_idx It is, but I thought the symmetry with LoadFromPB was worthwhile because it makes the class' behavior more predictable. http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.cc@831 PS2, Line 831: RETURN_NOT_OK(group->CopyToPB(uuid_by_idx_, pb)); > I guess my confusion in the header was wrong, but per my comments above, I Indeed, but since I want CopyToPB to be symmetric with LoadFromPB, a RETURN_NOT_OK here seemed more appropriate. http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/tablet/tablet_metadata.cc File src/kudu/tablet/tablet_metadata.cc: http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/tablet/tablet_metadata.cc@440 PS2, Line 440: DeleteOrphanedBlocks(orphaned_blocks); > pondering this a bit more, I wonder whether we need to actually start stori This was a legitimate issue that Andrew has since fixed in commit 47b81c452194e75da7fd966f07766de4bdcdeab0. -- To view, visit http://gerrit.cloudera.org:8080/8376 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad Gerrit-Change-Number: 8376 Gerrit-PatchSet: 2 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 09 Jan 2018 20:02:40 + Gerrit-HasComments: Yes
[kudu-CR] fs: defer failure from metadata load to bootstrap when data dir is missing
Adar Dembo has restored this change. ( http://gerrit.cloudera.org:8080/8376 ) Change subject: fs: defer failure from metadata load to bootstrap when data dir is missing .. Restored -- To view, visit http://gerrit.cloudera.org:8080/8376 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: restore Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad Gerrit-Change-Number: 8376 Gerrit-PatchSet: 2 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon
[kudu-CR] fs: defer failure from metadata load to bootstrap when data dir is missing
Adar Dembo has abandoned this change. ( http://gerrit.cloudera.org:8080/8376 ) Change subject: fs: defer failure from metadata load to bootstrap when data dir is missing .. Abandoned It's too difficult to do this safely, due to the various ways in which block IDs can be reused. KUDU-2202 links to this patch and discusses the problem further. -- To view, visit http://gerrit.cloudera.org:8080/8376 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: abandon Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad Gerrit-Change-Number: 8376 Gerrit-PatchSet: 2 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon
[kudu-CR] fs: defer failure from metadata load to bootstrap when data dir is missing
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/8376 ) Change subject: fs: defer failure from metadata load to bootstrap when data dir is missing .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/tablet/tablet_metadata.cc File src/kudu/tablet/tablet_metadata.cc: http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/tablet/tablet_metadata.cc@440 PS2, Line 440: DeleteOrphanedBlocks(orphaned_blocks); > I think we discussed this elsewhere, but I"m concerned about this interacti pondering this a bit more, I wonder whether we need to actually start storing a separate "orphaned blocks" list for each data dir -- that way when we delete the failed tablet we can safely just delete the blocks that are known to be on the remaining disks? Alternatively we could feed the orphaned-blocks list from each tablet into the LBM at startup time to ensure that we don't reuse block IDs, but that's getting into relatively complex layer-crossing territory -- To view, visit http://gerrit.cloudera.org:8080/8376 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad Gerrit-Change-Number: 8376 Gerrit-PatchSet: 2 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Fri, 27 Oct 2017 00:08:46 + Gerrit-HasComments: Yes
[kudu-CR] fs: defer failure from metadata load to bootstrap when data dir is missing
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/8376 ) Change subject: fs: defer failure from metadata load to bootstrap when data dir is missing .. Patch Set 2: (6 comments) http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.h File src/kudu/fs/data_dirs.h: http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.h@81 PS2, Line 81: Status LoadFromPB(const UuidIndexByUuidMap& uuid_idx_by_uuid, worth short docs explaining the cases of bad-status here. http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.h@84 PS2, Line 84: Status CopyToPB(const UuidByUuidIndexMap& uuid_by_uuid_idx, and here http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.h@276 PS2, Line 276: data dir : // is missing. only looked at the headers so far (not the impl) but this left me a little confused. This seems to indicate that the above LoadDataDirGroupFromPB actually has a side effect even if it returns a bad status that indicates a missing dir? Does this return a 'pb' even with a bad status? http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.cc File src/kudu/fs/data_dirs.cc: http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.cc@318 PS2, Line 318: if (!FindCopy(uuid_by_uuid_idx, uuid_idx, &uuid)) { : return Status::NotFound(Substitute( : "could not find data dir with uuid index $0", uuid_idx)); would this not be a programmer error? how would you end up with a uuid_idx in a DataDirGroup which doesn't point to a valid entry in the UUIDs array? I would have expected a FindOrDie or LOG(FATAL) here http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.cc@831 PS2, Line 831: RETURN_NOT_OK(group->CopyToPB(uuid_by_idx_, pb)); I guess my confusion in the header was wrong, but per my comments above, I think any failure of this code would be indicative of a bug, no? http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/tablet/tablet_metadata.cc File src/kudu/tablet/tablet_metadata.cc: http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/tablet/tablet_metadata.cc@440 PS2, Line 440: DeleteOrphanedBlocks(orphaned_blocks); I think we discussed this elsewhere, but I"m concerned about this interacting with blockid re-use. If we have the following scenario: - block 1-5 on disk A - block 6-10 on disk B tablet has orphaned block #6 in its metadata disk B is failed or removed, and we restart the tserver - LBM sees '5' as max_block_id - allocates a new block 6 on disk A - we open the tablet, and it decides to delete block 6 -- incorrectly deletes the new block 6 instead of the old block 6 ie I think it's important not to go to proceed on a tablet when we have it pointing to a drive that has been removed. -- To view, visit http://gerrit.cloudera.org:8080/8376 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad Gerrit-Change-Number: 8376 Gerrit-PatchSet: 2 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 26 Oct 2017 20:58:36 + Gerrit-HasComments: Yes
[kudu-CR] fs: defer failure from metadata load to bootstrap when data dir is missing
Adar Dembo has uploaded this change for review. ( http://gerrit.cloudera.org:8080/8376 Change subject: fs: defer failure from metadata load to bootstrap when data dir is missing .. fs: defer failure from metadata load to bootstrap when data dir is missing An upcoming patch adds CLI tool actions to add and remove data directories. When a data dir is removed, all tablets with data on it will fail. Today that failure manifests as a FindOrDie in DataDirGroup::FromPB; we need to make that a little bit more graceful. This patch modifies the DataDirGroup FromPB/CopyToPB methods to return a failure when a data dir is missing. It further changes TabletMetadata to treat such failures as non-fatal, and adds checks to TabletBootstrap so that the failures manifest there instead. No tests in this patch because: 1. Andrew is already working on tablet-level tests for failed disks, and 2. The CLI tool patch adds coverage at the itest-level. Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad --- M src/kudu/fs/block_manager-stress-test.cc M src/kudu/fs/block_manager-test.cc M src/kudu/fs/data_dirs-test.cc M src/kudu/fs/data_dirs.cc M src/kudu/fs/data_dirs.h M src/kudu/fs/error_manager.h M src/kudu/fs/log_block_manager-test.cc M src/kudu/tablet/tablet_bootstrap.cc M src/kudu/tablet/tablet_metadata.cc M src/kudu/tserver/tablet_copy_client.cc M src/kudu/tserver/ts_tablet_manager.cc 11 files changed, 128 insertions(+), 70 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/76/8376/1 -- To view, visit http://gerrit.cloudera.org:8080/8376 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad Gerrit-Change-Number: 8376 Gerrit-PatchSet: 1 Gerrit-Owner: Adar Dembo