The object reclaim doesn't support hypervolume yet. However, hypervolume currently cannot be used as a virtual disk anyway (neither qemu nor tgt supports it), and removing the old vdi deletion path is acceptable for hypervolume because it doesn't support snapshots, etc. So I think this patchset can be applied to the master branch.
The same code is pushed to:
https://github.com/sheepdog/sheepdog/tree/snapshot-object-reclaim

There is a problem that can be caused by the discard operation, but it
can be solved as a separate topic. I'll post a patchset for it later.

The leak problem was (also) caused by bugs in QEMU's sheepdog driver.
The fixed version of the QEMU driver is here:
https://github.com/sheepdog/qemu/tree/inode-sync
I'll post it to the QEMU list later.

v10:
 - rebase on master

v9:
 - rebase on master
 - add Tested-by of Valerio

v8:
 - make COW and snapshot mutually exclusive
   -- this change introduces a limitation that "dog vdi snapshot" must be
      executed on the same node that runs QEMU, but it is temporary and
      can be removed easily
 - migrate VDI deletion from sheep to dog
 - pass Valerio's test

v7:
 - reorder invalid tests group
 - rebase on master
 - correct invalid inheritance of reference count
   -- add a new test for checking the case
 - refine test output

v6:
 - recycle old child_vdi_id and move btree_counter into its place
 - update CHANGELOG.md

v5:
 - rebase on master
 - update store version and break compatibility explicitly
 - remove coding style problems
   -- don't use pthread_mutex_t directly
   -- rename inode.gref -> inode.gref
 - resolve regression in http feature (10th patch)
 - temporarily forbid creating snapshots of hypervolumes
 - enhance tests (14th patch)
 - remove various problems exposed by the patchset (1st - 3rd patches)

v4:
 - fix some bugs
 - refine tests
   -- the patchset can pass every test in the vdi group

v3:
 - rebase onto master

v2:
 - rebase onto master
 - add test to check object reclaim feature
 - rename *_obj() to *_object()
 - add documentation about the algorithm

Currently, sheepdog doesn't reclaim objects until all the snapshots are
deleted. For example:

  $ dog vdi create image
  $ dog vdi write image < some_data
  $ dog vdi snapshot image -s snap1
  $ dog vdi write image < some_data
  $ dog vdi delete image          <- this doesn't reclaim the objects of
                                     the image
  $ dog vdi delete image -s snap1 <- this reclaims all the data objects
                                     of both image and image:snap1

To resolve this problem, we need to introduce something like a reference
counting mechanism. However, a naive approach doesn't work well on
Sheepdog: when we create a snapshot or clone a vdi, we would have to
increment the reference counts of all of its data objects, sending many
requests to nodes all over the cluster.

This series uses generation reference counting [1], which is a more
appropriate algorithm for a distributed system like Sheepdog. With this
series, we can reclaim objects as soon as they are no longer necessary.

[1] B. Goldberg, Generation reference counting: A reduced communication
    distributed storage reclamation scheme, PLDI '89

==

Here is a brief explanation of the object reclaim algorithm. If you want
to know the details or the correctness proof of generational reference
counting, please refer to the original paper.

Each data object has the following structure in its inode object:

  struct generation_reference {
          int32_t generation;
          int32_t count;
  } gref;

The generation field identifies the generation of the reference, and the
count field records the number of references copied from this particular
reference.

We also introduce a new type of object, a ledger object. A ledger object
exists for each data object and contains an array of int32_t values. The
i-th element of the array contains the number of outstanding data object
references in the i-th generation.
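To make the layout concrete, here is a minimal C sketch of the two
structures described above. Only struct generation_reference comes from
this series; MAX_DATA_OBJS as a fixed constant, struct inode_grefs, and
struct ledger_object are assumptions made for this illustration, not the
actual sheepdog definitions.

  /*
   * Illustrative sketch only: struct generation_reference matches the
   * description above, but MAX_DATA_OBJS (as a fixed constant),
   * struct inode_grefs, and struct ledger_object are invented names.
   */
  #include <stdint.h>

  #define MAX_DATA_OBJS 1024   /* placeholder for the per-vdi object count */

  struct generation_reference {
          int32_t generation;  /* generation this reference belongs to */
          int32_t count;       /* references copied from this reference */
  };

  /* Per-vdi view: one gref per data object slot in the inode. */
  struct inode_grefs {
          struct generation_reference gref[MAX_DATA_OBJS];
  };

  /*
   * One ledger object per data object; entry i tracks the outstanding
   * references of generation i (the description below bounds it with
   * MAX_DATA_OBJS as well).
   */
  struct ledger_object {
          int32_t ref_count[MAX_DATA_OBJS];
  };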
Generational reference counting is performed as follows:

1. When a vdi A is created, the initial data object references are
   initialized as follows:

     for (i = 0; i < MAX_DATA_OBJS; i++) {
             A.gref[i].generation = 0;
             A.gref[i].count = 0;
     }

   and a ledger object for each data object is initialized as follows:

     ledger_object[0] = 1;
     for (i = 1; i < MAX_DATA_OBJS; i++) {
             ledger_object[i] = 0;
     }

   In practice, the ledger object is created and initialized when the
   object is accessed for the first time, so we don't have to create all
   the ledger objects when the vdi is created.

2. When a vdi B is created as a clone (or snapshot) of A, the new
   reference B is initialized as follows:

     for (i = 0; i < MAX_DATA_OBJS; i++) {
             B.gref[i].generation = A.gref[i].generation + 1;
             B.gref[i].count = 0;
     }

   In addition, A.gref.count's are incremented:

     for (i = 0; i < MAX_DATA_OBJS; i++) {
             A.gref[i].count++;
     }

3. When an object o is removed, a decrement message containing the
   values of o.generation and o.count is sent to its ledger object.

4. When a decrement message to o is received, in which the generation
   field has the value i and the count field has the value c, the
   following actions are performed:

     o.ledger[i]--;
     o.ledger[i + 1] += c;

   When every element of the ledger becomes zero (o.ledger[i] == 0 for
   all i >= 0), o can be reclaimed. Note that some elements of the
   ledger may hold negative values.
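As a rough sketch of steps 3 and 4, this is how a ledger object might
apply a decrement message and test the reclaim condition. The names
(struct decrement_msg, ledger_apply_decrement, ledger_is_reclaimable)
and the ledger representation reuse the hypothetical definitions from
the earlier sketch; the real handling is done by the sheep patches in
this series.

  /*
   * Illustrative sketch only: the names below are invented for this
   * example and do not match the actual code in this series.
   */
  #include <stdbool.h>
  #include <stdint.h>

  #define MAX_DATA_OBJS 1024   /* same placeholder as in the earlier sketch */

  struct ledger_object {
          int32_t ref_count[MAX_DATA_OBJS];   /* indexed by generation */
  };

  /* Step 3: the removed reference's gref is shipped to the ledger. */
  struct decrement_msg {
          int32_t generation;  /* o.gref.generation of the removed reference */
          int32_t count;       /* o.gref.count of the removed reference */
  };

  /* Step 4: o.ledger[i]--; o.ledger[i + 1] += c; (bounds checks omitted). */
  static void ledger_apply_decrement(struct ledger_object *ledger,
                                     const struct decrement_msg *msg)
  {
          ledger->ref_count[msg->generation]--;
          ledger->ref_count[msg->generation + 1] += msg->count;
  }

  /*
   * The data object can be reclaimed once every entry is zero; single
   * entries may hold negative values, as noted above.
   */
  static bool ledger_is_reclaimable(const struct ledger_object *ledger)
  {
          for (int i = 0; i < MAX_DATA_OBJS; i++)
                  if (ledger->ref_count[i] != 0)
                          return false;
          return true;
  }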
Hitoshi Mitake (19):
  sheep: introduce generational reference counting for object reclaim
  sheep: introduce ledger objects
  sheep: decrement generational reference count on vdi deletion
  sheep: decrement generational reference count on copy-on-write
  sheep: remove max children limit
  sheep: introduce sparse objects
  sheepdog proto: recycle old child_vdi_id for storing btree_counter
  sheep: get_request() for deferred response
  sheep, dog: prevent COW during snapshot creation
  sheep: reduce needless remove request during object reclamation
  sheep, dog: migrate deletion process from sheep to dog
  dog: print status of generational reference of inodes in dump-inode
  sheep: update store format version to v5
  tests: show ledger objects in list
  doc: add documentation of object reclaim algorithm
  tests/functional: add test for object reclaim
  update CHANGELOG.md for new object reclaim scheme
  tests/functional: add a new test of the object reclaim for cloned VDIs
  tests/functional: new test of object reclaim for a case of continuously
    snapshotted VDI

 CHANGELOG.md               |    3 +
 doc/object-reclaim.txt     |   73 +++
 dog/vdi.c                  |   99 +++-
 include/internal_proto.h   |    4 +
 include/sheepdog_proto.h   |   59 ++-
 include/util.h             |    3 +
 lib/util.c                 |   62 +++
 sheep/config.c             |    2 +-
 sheep/gateway.c            |   99 +++-
 sheep/group.c              |    3 +
 sheep/migrate.c            |   16 +
 sheep/ops.c                |  169 +++++++
 sheep/plain_store.c        |   83 +++-
 sheep/request.c            |   42 +-
 sheep/sheep_priv.h         |   40 ++
 sheep/store.c              |   26 ++
 sheep/vdi.c                |  387 ++++------------
 tests/functional/016.out   |   14 +-
 tests/functional/042       |    4 +-
 tests/functional/042.out   |   44 +-
 tests/functional/044.out   |   14 +-
 tests/functional/057.out   |  140 +++---
 tests/functional/058.out   |   14 +-
 tests/functional/063.out   |   28 +-
 tests/functional/064.out   |   38 +-
 tests/functional/086       |   36 +-
 tests/functional/086.out   | 1068 ++------------------------------------------
 tests/functional/092       |   47 ++
 tests/functional/092.out   |   42 ++
 tests/functional/093       |   32 ++
 tests/functional/093.out   |   26 ++
 tests/functional/common.rc |   12 +-
 tests/functional/group     |    4 +-
 33 files changed, 1206 insertions(+), 1527 deletions(-)
 create mode 100644 doc/object-reclaim.txt
 create mode 100755 tests/functional/092
 create mode 100644 tests/functional/092.out
 create mode 100755 tests/functional/093
 create mode 100644 tests/functional/093.out

--
1.9.1

--
sheepdog mailing list
[email protected]
http://lists.wpkg.org/mailman/listinfo/sheepdog
