Hi,
I think the code in the bouyer-quota2 branch is stable now and ready to be merged to HEAD. Unless there are objections, I'll merge it in about 2 weeks.

To get a diff:
cvs -d anon...@anoncvs.netbsd.org:/cvsroot -kk -u -r bouyer-quota2 -r bouyer-quota2-base src

This branch is for the development of a modernized disk quota system. The 2 main changes are: a new quotactl(2) interface and a new on-disk format, compatible with journaled ffs.

The new quotactl(2) uses a plist format to send commands and exchange data with the kernel. Using plists for this has several advantages:
- the plist format can change without the need to version the syscall; only the plist parser needs to be changed, and backward compatibility can be handled at the parser level.
- the plist format can easily be extended to fit filesystems other than ufs.
- it is easy to pass it back to puffs servers.
- it is easy to use in scripts.
The format used is documented in quotactl(2).

A new quotactl(8) command has been added, which lets userland send and receive plists; the idea is to make it easier to manage quotas from scripts.

The branch has code under COMPAT_50 to deal with the old syscall. The in-tree quota commands quota(1), edquota(8), repquota(8), rpc.rquotad(8), quotacheck(8) and quotaon(8) have been updated to use the new syscall interface.

I also took this opportunity to change the semantics of the values reported by these utilities (which are also the values used in plists): 0 now means "nothing allowed" (currently that is 1), and "no limit" is represented by the string "-" or "unlimited" (in the plist as well as in the new on-disk format this is UQUAD_MAX, i.e. 0xffffffffffffffff). The old disk format still uses 0 as unlimited and 1 as nothing allowed; the semantic difference is handled by kernel and userland conversion utilities (see quota1_subr.c).

repquota gains a -x option, which exports the quotas as a "set" plist command that can be fed directly to quotactl(8). This is one way to move limits from one fs to another (or to convert to the new on-disk format).

A new on-disk format has been added (called quota2, see quota2.h).
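
To illustrate the change in limit semantics described above, here is a minimal sketch of a conversion between old-style and new-style limits. It is not copied from quota1_subr.c: the function names and the Q2_NOLIMIT macro are hypothetical, and the off-by-one mapping for values other than 0 and 1, as well as the clamping of large values, are assumptions extrapolated from "old 1 becomes new 0".

```c
#include <stdint.h>

/* "No limit" in the new format (the post calls this value UQUAD_MAX).
 * Q2_NOLIMIT is a hypothetical name used only in this sketch. */
#define Q2_NOLIMIT UINT64_MAX

/* Old (quota1) limit -> new (plist / quota2) limit. */
static uint64_t
q1_to_q2_limit(uint32_t lim)
{
	if (lim == 0)
		return Q2_NOLIMIT;	/* old 0 meant "unlimited" */
	/* old 1 meant "nothing allowed", which is 0 in the new scheme;
	 * assumption: every other value shifts down by one as well */
	return (uint64_t)lim - 1;
}

/* New limit -> old (quota1) limit. */
static uint32_t
q2_to_q1_limit(uint64_t lim)
{
	if (lim == Q2_NOLIMIT)
		return 0;		/* old encoding of "unlimited" */
	/* assumption: a 64-bit limit that does not fit in 32 bits is
	 * clamped to the largest finite old-style value */
	if (lim >= UINT32_MAX)
		return UINT32_MAX;
	return (uint32_t)(lim + 1);
}
```

With this mapping, an old limit of 1 converts to 0 ("nothing allowed") and an old limit of 0 converts to 0xffffffffffffffff ("no limit"), matching the two cases spelled out above.
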
In the quota2 format, the usages and limits are stored in unlinked inodes (one for user quotas and one for group quotas); they cannot be stored outside of the filesystem any more. This ensures that quotas are covered by the filesystem clean flag or journal. A quota file has a header containing some persistent parameters, a default quota entry, and the free list and hash lists of quota entries. The quota file is not sparse; quota entries are held in hash lists. The kernel keeps a cache of quota entries, which records each entry's offset in the file to avoid walking the list on each lookup.

This new format has grown 64-bit limits and usages (32 bits is not enough for modern storage sizes), and 2 new features:
- a default quota entry is used as a template for new quota entries allocated when a new uid/gid shows up on the filesystem. This template is configurable, so that a sysadmin can choose what to allow to unknown users.
- per-user/group grace time.

Quotas are enabled with tunefs -q user and/or -q group (and disabled with -q nouser -q nogroup), or at newfs time with the same -q option. After a tunefs -q, a fsck of the filesystem is required.

There is no quotacheck/quotaon anymore for quota version 2. Quota usages are checked in fsck_ffs(8) at the same time as other filesystem metadata. Usages are computed in phase 1 (and adjusted in other phases if fsck needs to create or delete files, or change block allocations) and checked against the recorded usages in phase 6. Phase 6 will also do other consistency checks on the quota inodes, or even create them if none exists (e.g. just after a tunefs). While doing this I discovered some pieces missing in fsck_ffs about block accounting when allocating inodes and blocks, which I fixed (this is why ffs_clusteracct() moved to ffs_subr.c; as a bonus, it's one less function replicated in makefs(8)).

Instead of being kept in memory and synced to disk on sync or at umount time, quota usages are now updated like other metadata, in real time (or by delayed write, depending on mount options).
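
The quota file layout described above (a default entry, a free list, and hash lists of entries) can be pictured with a small in-memory model. This is only a sketch under loud assumptions: the struct layout, the field and function names, the table sizes, and the hash function are all hypothetical and are not taken from quota2.h; array indices stand in for file offsets, and the kernel's offset cache is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NENT	8		/* entries in the toy "file" (hypothetical) */
#define NHASH	4		/* hash buckets, a power of two (hypothetical) */
#define NIL	UINT32_MAX	/* end-of-list marker (hypothetical) */

struct entry {
	uint32_t id;		/* uid or gid */
	uint64_t hardlimit;	/* 64-bit limit */
	uint64_t usage;		/* 64-bit usage */
	uint32_t next;		/* index of the next entry on the same list */
};

struct qfile {
	struct entry defent;	/* default entry, template for new ids */
	uint32_t freelist;	/* head of the free-entry list */
	uint32_t hash[NHASH];	/* heads of the hash lists */
	struct entry ent[NENT];	/* preallocated entries (file is not sparse) */
};

/* Put every entry on the free list and set the default limit. */
static void
qfile_init(struct qfile *q, uint64_t default_limit)
{
	q->defent = (struct entry){ 0, default_limit, 0, NIL };
	for (uint32_t i = 0; i < NHASH; i++)
		q->hash[i] = NIL;
	for (uint32_t i = 0; i < NENT; i++)
		q->ent[i].next = (i + 1 < NENT) ? i + 1 : NIL;
	q->freelist = 0;
}

/* Find the entry for id; if it is new, take one from the free list,
 * copy the default entry into it, and link it on its hash list.
 * Returns NULL when the free list is empty. */
static struct entry *
qfile_get(struct qfile *q, uint32_t id)
{
	uint32_t h = id & (NHASH - 1);

	for (uint32_t i = q->hash[h]; i != NIL; i = q->ent[i].next)
		if (q->ent[i].id == id)
			return &q->ent[i];

	if (q->freelist == NIL)
		return NULL;
	uint32_t i = q->freelist;
	q->freelist = q->ent[i].next;
	q->ent[i] = q->defent;		/* new ids start from the template */
	q->ent[i].id = id;
	q->ent[i].next = q->hash[h];
	q->hash[h] = i;
	return &q->ent[i];
}
```

In this model, a first call to qfile_get() for an unknown id returns an entry carrying the default limit, and later calls for the same id return the same entry. A real implementation would also have to grow the file when the free list runs out, which this sketch does not attempt.
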
Updating usages together with the other metadata also means they are covered by the journal (the usage update is in the same WAPBL transaction as the one allocating/freeing inodes or blocks), so usages should be accurate after a log replay (quotacheck(8) is basically a pass 1 fsck, and the time it requires for today's storage sizes is just not acceptable).

This code has been tested in several ways. In addition to the atf tests in the branch, which test basic functionality (as well as some corruption scenarios for fsck_ffs), I did stress tests on a XEN3_DOMU with 256MB of RAM as well as on a dual-core i5 (with hyperthreading, so the kernel sees 4 CPUs) with 2GB of RAM. One of the stress tests was to run 5 bonnie++ in a loop under 5 different uids, while at the same time running quota(1), repquota(8) and quotactl(8) commands in loops, on both logged and non-logged filesystems. I also ran a bonnie++ in a loop while taking and deleting snapshots of the filesystems, also in loops. All issues discovered this way have been fixed.

To get fsck_ffs to report no errors when run against a snapshot, I had to make a wider change. I added a per-inode flag, "SF_SNAPINVAL", used to mark a snapshot inode as invalid. Right now, a snapshot inode shows up as a 0-size regular file in the snapshot, and userland tools don't know it is a snapshot inode. The result is that quota usages are miscomputed by fsck_ffs, as snapshot inodes are not included in the usage. Now snapshot inodes in the snapshot are marked SF_SNAPSHOT | SF_SNAPINVAL, so userland tools know it's a snapshot (as a bonus, dump can ignore them as well), while the kernel can refuse to use it as a snapshot. I believe this flag can also be used to speed up snapshot creation, but this won't be investigated as part of the branch.

Finally, here are some bonnie++ results on the i5 machine above (I'll add that the disk used for the tests is a 500GB WDC WD5000AADS-00S9B0 on an ahcisata controller):
- "plain" is HEAD with plain ffs.
- "log" is the same, mounted with -o log.
- "quota1" is "plain" with user quota1 enabled (the quota file is at the root of the test filesystem).
- "quota2" is "plain" with the new quota enabled for users.
- "quota2log" is "quota2" mounted with -o log (quota1 and log are mutually exclusive).
As you can see there is no measurable performance impact.

Version 1.03e       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
plain            4G 71199  43 71717  11 30440   5 73216  77 74573   9 183.4   0
plain            4G 71972  44 71906  12 30446   6 73959  77 74637   9 178.4   0
plain            4G 71922  44 71800  11 30438   6 73756  77 74669   9 177.8   0
log              4G 69776  43 71641  13 30732   6 73709  77 74653   9 176.1   0
log              4G 71254  44 71404  12 30548   6 73968  77 74653   9 176.5   0
log              4G 71183  44 71581  13 30499   6 73400  77 74812   9 176.5   0
quota1           4G 70320  43 71792  12 30694   6 73787  77 74637   9 180.3   0
quota1           4G 71567  43 71772  12 30781   6 73774  77 74541   9 178.8   0
quota1           4G 71829  44 71669  12 30393   5 73324  77 74796   9 179.1   0
quota2           4G 70349  43 71311  12 30502   5 71670  75 74636   9 181.2   0
quota2           4G 72125  44 71486  12 30560   6 73385  77 74621   9 178.0   0
quota2           4G 71411  43 71379  12 30606   6 73772  77 74621   9 179.9   0
quota2log        4G 69453  43 71947  13 30700   6 73554  77 74748   9 177.7   0
quota2log        4G 70718  44 71635  13 30433   6 74192  78 74716   9 174.3   0
quota2log        4G 72394  45 71641  13 30681   6 73601  77 74684   9 177.7   0

                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
plain            16  1693  25 +++++ +++  5220  14  1835  27 12712  99  3718  28
plain            16  1707  25 +++++ +++  5225  14  1783  26 12647  99  3665  28
plain            16  1800  26 +++++ +++  5127  15  1830  27 12697  99  3402  26
log              16  8687  88 +++++ +++ +++++ +++ 10006  99 12608  99 23303  99
log              16  9051  91 +++++ +++ +++++ +++ 10014  99 12652  99 23148  99
log              16  9868  99 +++++ +++ +++++ +++ 10027  99 12675  99 23300 100
quota1           16  1639  24 +++++ +++  5220  14  1704  25 12713 100  3614  27
quota1           16  1718  25 +++++ +++  5222  14  1628  24 12744 100  3659  28
quota1           16  1742  25 +++++ +++  4535  13  1854  27 12643  99  3720  28
quota2           16  1729  25 +++++ +++  5188  15  1940  28 12626  99  3743  29
quota2           16  1839  27 +++++ +++  5178  15  1750  25 12699  99  3647  28
quota2           16  1755  26 +++++ +++  5208  15  1739  25 12570  99  3581  27
quota2log        16  9227  94 +++++ +++ +++++ +++  9957  99 12686  99 23035 100
quota2log        16  9807  99 +++++ +++ +++++ +++  9252  92 12649  99 23301  99
quota2log        16  9789  99 +++++ +++ +++++ +++  9263  93 12682  99 23032  99

-- 
Manuel Bouyer <bou...@antioche.eu.org>
     NetBSD: 26 years of experience will always make the difference
--