On Sat, Jan 16, 2021 at 08:21:16PM +0300, Andrei Borzenkov wrote:
> 16.01.2021 18:19, Adam Borowski wrote:
> > On Sat, Jan 16, 2021 at 10:39:51AM +0300, Andrei Borzenkov wrote:
> >> 15.01.2021 06:54, Zygo Blaxell wrote:
> >>> On the other hand, I'm in favor of deprecating the whole discard option
> >>> and going with fstrim instead. discard in its current form tends to
> >>> increase write wear rather than decrease it, especially on metadata-heavy
> >>> workloads. discard is roughly equivalent to running fstrim thousands
> >>> of times a day, which is clearly bad for many (most? all?) SSDs.
> >>
> >> My (probably naive) understanding so far was that trim on SSD marks
> >> areas as "unused" which means SSD need to copy less residual data from
> >> erase block when reusing it. Assuming TRIM unit is (significantly)
> >> smaller than erase block.
> >>
> >> I would appreciate if you elaborate how trim results in more write on SSD?
> >
> > The areas are not only marked as unused, but also zeroed. To keep the
> > zeroing semantic, every discard must be persisted, thus requiring a write
> > to the SSD's metadata (not btrfs metadata) area.
>
> There is no requirement that TRIM did it. If device sets RZAT SUPPORTED
> bit, it should return zeroes for trimmed range, but there is no need to
> physically zero anything - simply return zeroes for areas marked as
> unallocated. Discard must be persisted in allocation table, but then
> every write must be persisted in allocation table anyway.
That is exactly the problem--the persistence is a write that counts
against total drive wear. That is why TRIM variants that leave the
contents of the discarded LBAs undefined are better than those which
define the contents as zero.

The effect seems to be the equivalent of a small write, i.e. a 16K write
might be the same cost as any length of contiguous discard. So it's OK
to discard block-group-sized regions, but not OK to issue one discard
for every metadata free page hole. Different drives have different
ratios between these costs, so parity might occur at 4K or 256K
depending on the drive. AIUI there is a minimum discard length filter
implemented in btrfs already, so maybe it just needs tuning?

> Moreover, to actually zero on TRIM either trim request must be issued
> for the full erase block or device must perform garbage collection.
>
> Do you have any links that show that discards increase write load on
> physical media? I am really curious.

I have no links; it's a directly observed result.

It's fairly straightforward to replicate: Set up a machine to do git
checkouts of each Linux kernel tag in random order, in a loop (maybe
multiple instances of this if needed to get the SSD device IO saturated).
While that happens, watch the percentage used endurance indicator
reported on the drives (smartctl -x). Wait for the indicator to
increment twice, and measure the time between the first and second
increment. Use a low-cost consumer or OEM SSD so you get results in
less than a few hundred hours. Then mount -o discard=async and wait
for two more increments.

Assuming the workload produces constant amounts of IO over time, and the
percentage used endurance indicator variable from SMART is not a
complete lie, the time between increments should roughly indicate the
wear rates of the different workloads.

In the field, we discovered this on CI builder workloads (lots of tiny
files created, destroyed, and created again in rapid succession). They
get almost double the SSD wear rate with discard on vs. discard off.

We have monitoring on the p-u-e-i variable, and use it to project the
date when 100% endurance will be reached. If that date lands within the
date range when we want to be using the SSD, we get an alert. When
discard is accidentally enabled on a CI server due to a configuration
failure, we get an alert about a week later, as it shortens our drives'
projected lifespan from more than 6 years to less than 4.

Other workloads are less sensitive to this. If the workload has fewer
metadata updates, bigger files, and sequential writes, then discard
doesn't have a negative effect--though to be fair, it doesn't seem to
have a positive effect either, at least not by this measurement method.
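To put rough numbers on the 16K-equivalent cost above, here is a
back-of-the-envelope sketch. The per-discard cost is the estimate from
this mail; the daily discard counts are invented purely for illustration
and will vary wildly by workload.

# Back-of-the-envelope sketch of the per-discard persistence cost. The
# 16K-equivalent write cost is the estimate above; the daily discard
# counts are invented for illustration only.
PER_DISCARD_COST = 16 * 1024   # assumed flash bytes written to persist one discard

def overhead_gib_per_day(discards_per_day):
    """Flash bytes written per day just to persist discards, in GiB."""
    return discards_per_day * PER_DISCARD_COST / 2**30

# One discard per freed metadata page hole on a busy box (hypothetical count):
print(overhead_gib_per_day(2_000_000))   # ~30 GiB/day of pure write overhead
# A few hundred block-group-sized discards per day instead:
print(overhead_gib_per_day(300))         # ~0.005 GiB/day, negligible

The crossover obviously moves with the per-drive cost, which is the
4K-vs-256K parity point mentioned above.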
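For anyone who wants to replicate the measurement above, here is a
minimal sketch of the monitoring half (the load generator is just a
loop over git checkouts and is left out). The device path, poll
interval, and smartctl output parsing are assumptions to adjust per
drive: NVMe drives report a "Percentage Used" line, SATA drives report
"Percentage Used Endurance Indicator" in the device statistics log with
the value in an earlier column, and other drives may differ. Run it
once with discard off and once with discard=async, with the workload
otherwise unchanged.

#!/usr/bin/env python3
# Poll smartctl and record when the drive's percentage-used endurance
# indicator ticks up; the time between two increments is the wear-rate
# measurement described above.  Needs root for smartctl.
import re
import subprocess
import time

DEVICE = "/dev/sda"      # hypothetical device under test
POLL_SECONDS = 600       # how often to re-read SMART data

def read_pct_used(dev):
    out = subprocess.run(["smartctl", "-x", dev],
                         capture_output=True, text=True).stdout
    m = re.search(r"Percentage Used:\s*(\d+)%", out)       # NVMe style
    if m:
        return int(m.group(1))
    for line in out.splitlines():                          # SATA style
        if "Percentage Used Endurance Indicator" in line:
            nums = re.findall(r"\d+", line)
            if nums:
                return int(nums[-1])
    return None

last = read_pct_used(DEVICE)
ticks = []               # timestamps of observed increments
while len(ticks) < 2:
    time.sleep(POLL_SECONDS)
    cur = read_pct_used(DEVICE)
    if last is not None and cur is not None and cur > last:
        ticks.append(time.time())
        print(f"{time.ctime()}: indicator went {last} -> {cur}")
    last = cur

print(f"time between increments: {(ticks[1] - ticks[0]) / 3600:.1f} hours")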
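The projection and alerting we run on the percentage-used value is
nothing fancier than a linear extrapolation; a minimal sketch, with
made-up sample values and a hypothetical 5-year service window:

# Linearly extrapolate the percentage-used value to 100% and alert if
# that date falls inside the planned service life.  Sample values and
# the 5-year window are made up for illustration.
from datetime import datetime, timedelta

def projected_eol(t0, pct0, t1, pct1):
    """Date at which percentage-used reaches 100%, assuming wear continues
    at the rate observed between the two samples (None if no wear seen)."""
    rate = (pct1 - pct0) / (t1 - t0).total_seconds()   # percent per second
    if rate <= 0:
        return None
    return t1 + timedelta(seconds=(100 - pct1) / rate)

now = datetime(2021, 1, 16)
eol = projected_eol(datetime(2020, 1, 16), 2, now, 6)  # 4%/year of wear
print("projected 100% wear date:", eol)                # roughly mid-2044
if eol is not None and eol < now + timedelta(days=5 * 365):
    print("ALERT: drive projected to wear out within the service window")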