Adar Dembo has posted comments on this change.

Change subject: Allow for reserving disk space for non-Kudu processes
......................................................................
Patch Set 4:

(16 comments)

I reviewed everything but the LBM changes. I've indicated enough high-level issues that I think we should discuss them first, in case they require some reimplementation.

http://gerrit.cloudera.org:8080/#/c/3135/4//COMMIT_MSG
Commit Message:

Line 13: reservation limit then a crash will result.
Why did you choose to crash instead of returning an error up the stack? Granted, I doubt we can handle these errors in a meaningful way during a flush (if we can't write to disk, we'll run out of memory eventually), but we certainly can on a WAL write or a compaction.

In general I think "leaf" code should return errors and "non-leaf" code should, if necessary, crash. That way when we do crash we have more context: not only do we know we ran out of disk space, but we know we were in the middle of e.g. a flush. There's a sketch of what I mean at the end of these comments.

http://gerrit.cloudera.org:8080/#/c/3135/4/src/kudu/integration-tests/disk_reservation-itest.cc
File src/kudu/integration-tests/disk_reservation-itest.cc:

Line 55: // Note: This functionality is not implemented in the file block manager, so
: // this test is disabled on non-Linux platforms.
Can't we compile the code on all platforms but, at runtime, test the kind of block manager we're using, and no-op the test if it's not the LBM? (See the last sketch at the end of these comments.)

Line 67: workload.set_num_write_threads(4);
: workload.set_timeout_allowed(true);
: workload.set_write_timeout_millis(500);
How did you arrive at these non-default values? If they're important, they're worth comments. num_replicas(1) I understand (you've only got one TS).

Line 73: // Write a few rows to get some rows flushing.
: while (workload.rows_inserted() < 10) {
:   SleepFor(MonoDelta::FromMilliseconds(10));
: }
Why do we need to do this if we're also looping and waiting for containers below?

Line 80: // Wait until we have 2 active containers on TS1.
Nit: TS1 is the only TS; why refer to it as TS1? Below too.

Line 95: 1L * 1024 * 1024 * 1024)));
1 GB seems unnecessarily high. How about 128 MB, or even lower? How long does this test take to run on a spinning disk? On an SSD?

Line 126: ts_flags.push_back("--enable_maintenance_manager=false"); // No flushing for this test.
Why is this important?

Line 132: workload.set_timeout_allowed(true);
: workload.set_write_timeout_millis(500);
: workload.set_num_write_threads(8);
: workload.set_write_batch_size(1024);
As above, please justify the non-default values with comments.

http://gerrit.cloudera.org:8080/#/c/3135/4/src/kudu/util/env.h
File src/kudu/util/env.h:

Line 212: // Returns information about a mounted filesystem.
No unit test in env-test for StatVfs?

Line 214: virtual Status StatVfs(const std::string& path, struct statvfs* buf) = 0;
Not a fan of leaking a POSIX-specific struct into the rest of Kudu. AFAICT this is a first. Could you instead adapt what you need from statvfs into a new Kudu struct or class? (See the first sketch at the end of these comments.)

http://gerrit.cloudera.org:8080/#/c/3135/4/src/kudu/util/env_posix.cc
File src/kudu/util/env_posix.cc:

Line 993: virtual Status StatVfs(const std::string& path, struct statvfs* buf) OVERRIDE {
Need to add ThreadRestrictions::AssertIOAllowed(). Also should add a TRACE_EVENT() call.

Line 997: return Status::IOError("statvfs: " + path, ErrnoToString(errno), errno);
You should use the IOError() function so that FLAGS_suicide_on_eio will be honored.
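To make the env.h and env_posix.cc comments concrete, here's a rough sketch of what I have in mind. This is illustrative only: SpaceInfo and GetSpaceInfo are names I just made up, and the arithmetic assumes statvfs semantics (f_frsize * f_bavail is the space available to unprivileged processes):

  // In env.h: a Kudu struct instead of exposing 'struct statvfs'.
  struct SpaceInfo {
    int64_t capacity_bytes; // Total capacity of the filesystem.
    int64_t free_bytes;     // Bytes available to non-privileged processes.
  };
  virtual Status GetSpaceInfo(const std::string& path, SpaceInfo* info) = 0;

  // In env_posix.cc (needs <sys/statvfs.h>):
  virtual Status GetSpaceInfo(const std::string& path,
                              SpaceInfo* info) OVERRIDE {
    TRACE_EVENT1("io", "PosixEnv::GetSpaceInfo", "path", path);
    ThreadRestrictions::AssertIOAllowed();
    struct statvfs buf;
    if (statvfs(path.c_str(), &buf) == -1) {
      // The file-local IOError() helper honors FLAGS_suicide_on_eio.
      return IOError(path, errno);
    }
    info->capacity_bytes = static_cast<int64_t>(buf.f_frsize) * buf.f_blocks;
    info->free_bytes = static_cast<int64_t>(buf.f_frsize) * buf.f_bavail;
    return Status::OK();
  }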
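Similarly, for my commit message comment about crashing vs. returning errors, a sketch of the leaf/non-leaf split. CheckFreeSpace is invented for illustration and builds on the GetSpaceInfo sketch above; the caller fragment assumes locals like data_dir, bytes_needed, and tablet_id:

  // "Leaf" code: detect the problem and return it up the stack.
  Status CheckFreeSpace(Env* env, const std::string& path,
                        int64_t requested_bytes) {
    SpaceInfo info;
    RETURN_NOT_OK(env->GetSpaceInfo(path, &info));
    if (info.free_bytes - FLAGS_disk_reserved_bytes < requested_bytes) {
      return Status::IOError("insufficient disk space under " + path);
    }
    return Status::OK();
  }

  // "Non-leaf" code: decide whether the error is fatal and, if so, crash
  // with the extra context (i.e. that we were in the middle of a flush).
  Status s = CheckFreeSpace(env, data_dir, bytes_needed);
  if (PREDICT_FALSE(!s.ok())) {
    LOG(FATAL) << "Aborting flush of tablet " << tablet_id << ": "
               << s.ToString();
  }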
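And for the disk_reservation-itest.cc comment about compiling on all platforms, the kind of runtime check I mean (a sketch; it assumes the test can see the block_manager gflag via DECLARE_string):

  DECLARE_string(block_manager);

  // At the top of each affected test body:
  if (FLAGS_block_manager != "log") {
    LOG(INFO) << "Disk reservation requires the log block manager; "
                 "skipping test.";
    return;
  }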
http://gerrit.cloudera.org:8080/#/c/3135/4/src/kudu/util/env_util.cc
File src/kudu/util/env_util.cc:

Line 37: DEFINE_int64(disk_reserved_bytes, 0,
: "Number of bytes to reserve on each filesystem for non-Kudu usage");
I don't know whether this will be sufficient. I'm thinking we may want to configure this separately for WALs and for data blocks. The rationale: WALs live on one device, and that's typically a fast, non-spinning device (like an SSD or NVRAM). Said device is often smaller and may be used for the OS too, so you'd want to handle reservations differently for it. I'm guessing HDFS doesn't work this way because all it stores are blocks, so its data directories are uniform more often than not.

This means the reservations should be handled separately in the block managers and in the WAL code, and then there's no real use for this centralized code. (There's a sketch of what I'm picturing below these comments.)

Line 43: Number of inodes to reserve on each filesystem for non-Kudu usage"
Why would an operator care to reserve inodes? Do other systems do this? What's their rationale?

Line 125: if (PREDICT_FALSE(FLAGS_disk_reserved_bytes_free_for_testing > -1)) {
If these for_testing flags are set, why bother with the StatVfs() call at all? (See the second sketch below.)

http://gerrit.cloudera.org:8080/#/c/3135/4/src/kudu/util/env_util.h
File src/kudu/util/env_util.h:

Line 45: Status AssertSufficientDiskSpace(Env *env, const std::string& path, int64_t bytes);
Nit: maybe replace "Assert" with "Check" or "Verify"? When I see "Assert" I think of a gtest assert.
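Going back to the Line 37 comment on env_util.cc, I'm picturing the reservation flags living with the code that owns each directory, along these lines (the flag names are hypothetical, mirroring the patch's existing DEFINE_int64):

  // In the WAL code:
  DEFINE_int64(fs_wal_dir_reserved_bytes, 0,
               "Number of bytes to reserve on the WAL directory's filesystem "
               "for non-Kudu usage");

  // In the block managers:
  DEFINE_int64(fs_data_dirs_reserved_bytes, 0,
               "Number of bytes to reserve on each data directory's "
               "filesystem for non-Kudu usage");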
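And for the Line 125 comment, the shape I'd expect (a fragment; it uses the patch's existing StatVfs() signature and flags, and assumes the free-space math is f_frsize * f_bavail):

  int64_t free_bytes;
  if (PREDICT_FALSE(FLAGS_disk_reserved_bytes_free_for_testing > -1)) {
    // The test override is authoritative; skip the filesystem query entirely.
    free_bytes = FLAGS_disk_reserved_bytes_free_for_testing;
  } else {
    struct statvfs buf;
    RETURN_NOT_OK(env->StatVfs(path, &buf));
    free_bytes = static_cast<int64_t>(buf.f_frsize) * buf.f_bavail;
  }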
--
To view, visit http://gerrit.cloudera.org:8080/3135
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ifd0451d4dbddc1783019a53302de0263080939c7
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes