Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/8441 )
Change subject: disk failure: make various delta paths more robust to errors ...................................................................... Patch Set 3: (7 comments) http://gerrit.cloudera.org:8080/#/c/8441/3/src/kudu/tablet/delta_compaction.cc File src/kudu/tablet/delta_compaction.cc: http://gerrit.cloudera.org:8080/#/c/8441/3/src/kudu/tablet/delta_compaction.cc@372 PS3, Line 372: RETURN_NOT_OK(tracker->OpenDeltaReaders(new_redo_blocks, &new_redo_stores, REDO)); Can you add a TODO here that these OpenDeltaReaders calls could actually happen outside of the component_lock? it seems like these calls do some IO so it's not the best to be holding the DRS component lock during it since it'll block concurrent ops. http://gerrit.cloudera.org:8080/#/c/8441/3/src/kudu/tablet/delta_tracker.h File src/kudu/tablet/delta_tracker.h: http://gerrit.cloudera.org:8080/#/c/8441/3/src/kudu/tablet/delta_tracker.h@164 PS3, Line 164: in-memory delta stores and then persists them to disk I think this is kind of unclear. More precisely this updates the in-memory *list* of delta stores, and then persists the *metadata* to disk. As written it sounds like it's changing the delta stores themselves, and flushing deltafiles or somesuch http://gerrit.cloudera.org:8080/#/c/8441/3/src/kudu/tablet/delta_tracker.cc File src/kudu/tablet/delta_tracker.cc: http://gerrit.cloudera.org:8080/#/c/8441/3/src/kudu/tablet/delta_tracker.cc@307 PS3, Line 307: if (start_it != end_it) { is this necessary? erase() with an empty range should be safe http://gerrit.cloudera.org:8080/#/c/8441/3/src/kudu/tablet/delta_tracker.cc@719 PS3, Line 719: // In case the flush fails, remove the DeltaMemStore from the store list. I dont think this is safe. Another update could have landed in the new 'dms_', so the fact that you've now dropped it on the floor means we've lost a user's update that they might think was committed, no? http://gerrit.cloudera.org:8080/#/c/8441/3/src/kudu/tablet/delta_tracker.cc@720 PS3, Line 720: auto cleanup_redo_delta_stores = MakeScopedCleanup([&] { nit: can use the new SCOPED_CLEANUP macro here http://gerrit.cloudera.org:8080/#/c/8441/3/src/kudu/tablet/delta_tracker.cc@757 PS3, Line 757: // TODO(unknown): wherever we write stuff, we should write to a tmp path and I think this can just be removed now, it's super ancient (goes back to https://github.com/apache/kudu/commit/0e3bd37ab8645a00001758fa5de84861ac443eef !) http://gerrit.cloudera.org:8080/#/c/8441/3/src/kudu/tablet/diskrowset.cc File src/kudu/tablet/diskrowset.cc: http://gerrit.cloudera.org:8080/#/c/8441/3/src/kudu/tablet/diskrowset.cc@575 PS3, Line 575: // revert changes in case of an error. not sure this is safe in the general case, again because as soon as we swap in a new DMS, we can have operations coming in and committing with their dms_id set to the new DMS's id, and I'm not entirely clear on how we prevent losing them. The answer of course could be that any failure immediately causes everything to shut down, but if you are shutting down then you also probably don't really need to re-flush the tablet metadata? -- To view, visit http://gerrit.cloudera.org:8080/8441 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0241cd4206ce77ef1fb334458b091bc2092f4141 Gerrit-Change-Number: 8441 Gerrit-PatchSet: 3 Gerrit-Owner: Andrew Wong <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-Comment-Date: Tue, 07 Nov 2017 01:51:42 +0000 Gerrit-HasComments: Yes
