Hi, tl;dr: btrfs_trim_fs, IMHO, needs more thorough surgery.
Thanks for providing the new patch. I think it will work in the case that "fstrim" is called without specifying a range to trim (that is, defaulting to the whole filesystem), but I didn't test that yet, sorry. Instead, I have been thinking about what happens if a range smaller than the whole filesystem is specified. After all, the observation that in my filesystem the smallest "objectid" is already larger than the filesystem size still holds even in that case, and wanting to trim only part of the filesystem seems to be a valid wish, too. So I dug into the code myself and came to the conclusion that the way "btrfs_trim_fs" interprets the range passed from the ioctl is fundamentally broken. Instead of papering over that I'd very much prefer a more thorough fix here, which in addition might make it unnecessary to treat the "trim the complete filesystem" case specially. Please read on for the details: The current implementation of "btrfs_trim_fs" simply compares the objectid's it finds in the chunk tree against the range passed from the ioctl, seemingly assuming that both kinds of numbers span the same range. But this is clearly not true: Even without adding and deleting of devices, in a simple mirror the objectids will span only about half the size of the filesystem. With suitable add/delete of devices I can construct a filesystem with a smallest objectid of 0 and a largest objectid much larger than the size of the filesystem (so, obviously, with large holes in the set of used objectid's), or, as in my filesystem mentioned above, with a smallest objectid larger than the size of the filesystem. So, in the function the statement cache = btrfs_lookup_block_group(fs_info, range->start); will lead to nothing being trimmed if the smallest objectid is too large; if the if (cache->key.objectid >= (range->start + range->len)) is left in, too little will be trimmed; if it is left out, the following calculations start = max(range->start, cache->key.objectid); end = min(range->start + range->len, cache->key.objectid + cache->key.offset); if (end - start >= range->minlen) are bound to misbehave, for example start being larger than end seems possible, overflowing the subtraction and, I believe, leading to the whole filesystem being trimmed where the user only requested a part. Here is the start of the chunk tree of my filesystem for reference, showing the device sizes and the first objectid: chunk tree node 241620221952 level 1 items 2 free 119 generation 16831 owner 3 fs uuid 88af7576-3027-4a3b-a5ae-34bfd167982f chunk uuid 9a7d25e3-ebcf-4f31-bdd4-f44531a2fb1c key (DEV_ITEMS DEV_ITEM 3) block 241620226048 (58989313) gen 16831 key (FIRST_CHUNK_TREE CHUNK_ITEM 231956545536) block 241620230144 (58989314) gen 16831 leaf 241620226048 items 19 free space 1420 generation 16831 owner 3 fs uuid 88af7576-3027-4a3b-a5ae-34bfd167982f chunk uuid 9a7d25e3-ebcf-4f31-bdd4-f44531a2fb1c item 0 key (DEV_ITEMS DEV_ITEM 3) itemoff 3897 itemsize 98 dev item devid 3 total_bytes 80789987328 bytes used 44090523648 item 1 key (DEV_ITEMS DEV_ITEM 4) itemoff 3799 itemsize 98 dev item devid 4 total_bytes 80025313280 bytes used 44090523648 item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 213702934528) itemoff 3687 itemsize 112 chunk length 1073741824 owner 2 type 20 num_stripes 2 stripe 0 devid 4 offset 1074790400 stripe 1 devid 3 offset 40870346752 So, to make bulk discard of ranges useful, it seems the incoming range should be interpreted relative to the size of the filesystem and not to the allocated chunks. As AFAIK the size of the filesystem is just the sum of the sizes of its devices it might be possible to map the range onto a virtual concatenation of the devices, these perhaps ordered by devid, and then to find the free space by searching for the resulting devid(s) and device-relative offsets in the device tree? I understand that currently unallocated space is never trimmed? To nevertheless do that might be useful after a balance or at file system creation time? If there is a way to find the unallocated space for a device, this could be improved at the same time, too. Kind regards, Lutz -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html