> When testing Btrfs with fio 4k random write,

That's an exceptionally narrowly defined workload. Also it is narrower
than that, because it must be without 'fsync' after each write, or else
there would be no accumulation of dirty blocks in memory at all.
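As a minimal sketch of that point (plain C with POSIX calls; the path
and sizes are made up for illustration, not taken from the report), the
only difference between the two variants is whether the loop drains the
dirty pages as it goes:

  /* Sketch only: a buffered 4k random-write loop, with and without 'fsync'.
   * The file name and sizes are illustrative, not from the original report. */
  #include <fcntl.h>
  #include <stdlib.h>
  #include <unistd.h>

  int main(void)
  {
          enum { BLKSZ = 4096, NBLOCKS = 1024, NWRITES = 4096 };
          char buf[BLKSZ] = { 0 };
          int fd = open("/mnt/btrfs/testfile", O_WRONLY);
          if (fd < 0)
                  return 1;

          for (int i = 0; i < NWRITES; i++) {
                  off_t off = (off_t)(rand() % NBLOCKS) * BLKSZ;
                  pwrite(fd, buf, BLKSZ, off); /* dirty pages accumulate in memory */
                  /* fsync(fd); */             /* uncommenting this forces every write
                                                  out immediately, so nothing accumulates */
          }
          close(fd);
          return 0;
  }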
> I found that volume with smaller free space available has
> lower performance.

That's an inappropriate use of "performance"... The speed may be
lower; the performance is another matter.

> It seems that the smaller the free space of volume is, the
> smaller amount of dirty page filesystem could have.

Is this a problem? Consider: all filesystems do less well when there is
less free space (a smaller chance of finding spatially compact
allocations), and it is usually good to minimize the amount of dirty
pages anyhow (even if there are reasons to delay writing them out).

> [ ... ] btrfs will reserve metadata for every write. The
> amount to reserve is calculated as follows: nodesize *
> BTRFS_MAX_LEVEL(8) * 2, i.e., it reserves 256KB of metadata.
> The maximum amount of metadata reservation depends on size of
> metadata currently in used and free space within volume(free
> chunk size /16) When metadata reaches the limit, btrfs will
> need to flush the data to release the reservation.

I don't understand here: under POSIX semantics filesystems are not
really allowed to avoid flushing *metadata* to disk for most
operations; that is, metadata operations have an implied 'fsync'. In
your case of the "4k random write" with "cow disabled", the only
metadata that should get updated is the last-modified timestamp,
unless the user/application has been so amazingly stupid as to not
preallocate the file, and then they deserve whatever they get.

> 1. Is there any logic behind the value (free chunk size /16)

> /*
>  * If we have dup, raid1 or raid10 then only half of the free
>  * space is actually useable.  For raid56, the space info used
>  * doesn't include the parity drive, so we don't have to
>  * change the math
>  */
> if (profile & (BTRFS_BLOCK_GROUP_DUP |
>                BTRFS_BLOCK_GROUP_RAID1 |
>                BTRFS_BLOCK_GROUP_RAID10))
>         avail >>= 1;

As written there is a plausible logic, but it is quite crude.

> /*
>  * If we aren't flushing all things, let us overcommit up to
>  * 1/2th of the space. If we can flush, don't let us overcommit
>  * too much, let it overcommit up to 1/8 of the space.
>  */
> if (flush == BTRFS_RESERVE_FLUSH_ALL)
>         avail >>= 3;
> else
>         avail >>= 1;

Presumably overcommitting brings some benefits on other workloads. In
particular other parts of Btrfs don't behave awesomely well when free
space runs out. (An arithmetic sketch of how these shifts compose is
at the end of this message.)

> 2. Is there any way to improve this problem?

Again, is it a problem? More interestingly, if it is a problem, is a
solution available that does not impact other workloads? It is simply
impossible to optimize a filesystem perfectly for every workload.

I'll try to summarize your report as I understand it:

* If:
  - The workload is "4k random write" (without 'fsync').
  - On a "cow disabled" file.
  - The file is not preallocated.
  - There is not much free space available.

* Then allocation overcommitting results in a higher frequency of
  unrequested metadata flushes, and those metadata flushes slow down a
  specific benchmark.
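Appendix: a minimal arithmetic sketch (plain C, not the kernel code;
the 16KB nodesize and the 1GB of free chunk space are assumptions for
illustration only) of how the quoted per-write reservation and the
quoted shifts come out:

  /* Arithmetic sketch only -- not the kernel implementation.
   * Assumes a 16 KiB nodesize and 1 GiB of free chunk space purely
   * for illustration. */
  #include <stdio.h>

  int main(void)
  {
          unsigned long long nodesize = 16 * 1024;        /* assumed nodesize */
          unsigned long long reserve  = nodesize * 8 * 2; /* nodesize * BTRFS_MAX_LEVEL(8) * 2 */
          unsigned long long avail    = 1ULL << 30;       /* pretend 1 GiB free chunk space */

          avail >>= 1; /* DUP/RAID1/RAID10: only half the raw space is usable */
          avail >>= 3; /* BTRFS_RESERVE_FLUSH_ALL: overcommit at most 1/8 */

          /* Combined: 1 GiB / 2 / 8 == 1 GiB / 16, i.e. the "free chunk size / 16" figure. */
          printf("per-write metadata reservation: %llu KiB\n", reserve / 1024);        /* 256 */
          printf("overcommit ceiling: %llu MiB (free / 16)\n", avail / (1024 * 1024)); /* 64  */
          return 0;
  }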