On 2017-11-02 14:09, Dave wrote:
On Thu, Nov 2, 2017 at 7:17 AM, Austin S. Hemmelgarn
<ahferro...@gmail.com> wrote:
And the worst-performing machine was the one with the most RAM, a fast
NVMe drive, and top-of-the-line hardware.
Somewhat counterintuitively, I'll bet that NVMe is a contributing factor in this
particular case. NVMe has particularly bad performance with the old block
I/O schedulers (though it is NVMe, so it should still be better than a SATA
or SAS SSD); the new blk-mq framework only got scheduling support in
4.12, and only got reasonably good scheduling options in 4.13. I doubt it's
the entirety of the issue, but it's probably part of it.
Thanks for that news. Based on that, I assume the advice here (to use
noop for NVMe) is now outdated?
https://stackoverflow.com/a/27664577/463994
Is the solution as simple as running a kernel >= 4.13? Or do I need to
specify which scheduler to use?
I just checked one computer:
uname -a
Linux morpheus 4.13.5-1-ARCH #1 SMP PREEMPT Fri Oct 6 09:58:47 CEST
2017 x86_64 GNU/Linux
$ sudo find /sys -name scheduler -exec grep . {} +
/sys/devices/pci0000:00/0000:00:1d.0/0000:08:00.0/nvme/nvme0/nvme0n1/queue/scheduler:[none] mq-deadline kyber bfq
From this article, it sounds like (maybe) I should use kyber. I see
kyber listed in the output above, so I assume that means it is
available. I also think [none] is the current scheduler being used, as
it is in brackets.
I checked this:
https://www.kernel.org/doc/Documentation/block/switching-sched.txt
Based on that, I assume I would do this at runtime:
echo kyber > /sys/devices/pci0000:00/0000:00:1d.0/0000:08:00.0/nvme/nvme0/nvme0n1/queue/scheduler
I assume this is equivalent:
echo kyber > /sys/block/nvme0n1/queue/scheduler
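I assume I could then confirm the change by reading the file back, since
the active scheduler is the one shown in brackets (the output below is
what I'd expect to see, not something I've actually run yet):

cat /sys/block/nvme0n1/queue/scheduler
none mq-deadline [kyber] bfq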
How would I set it permanently at boot time?
It's kind of complicated overall. As of 4.14, there are four options
for the blk-mq path. The 'none' scheduler is the old behavior prior to
4.13, and does no scheduling. 'mq-deadline' is the default AFAIK, and
behaves like the old deadline I/O scheduler (not sure if it supports I/O
priorities). 'bfq' is a blk-mq port of a scheduler originally designed
to replace the default CFQ scheduler from the old block layer. 'kyber'
I know essentially nothing about; I never saw the patches on LKML (not
sure if I just missed them, or if they only went to topic lists), and I
haven't tried it myself.
I have no personal experience with anything but the 'none' scheduler on
NVMe devices, so I can't really comment much beyond two observations:
I've seen a huge difference on the SATA SSD's I use, first when the
deadline scheduler became the default and then again when I switched to
BFQ on my systems, and I've seen reports of the deadline scheduler
improving things on NVMe.
As far as setting it at boot time, there's currently no kernel
configuration option to set a default like there is for the old block
interface, and I don't know of any kernel command line option to set it
either, but a udev rule setting it as an attribute works reliably. I'm
using something like the following to set all my SATA devices to use BFQ
by default:
KERNEL=="sd?", SUBSYSTEM=="block", ACTION=="add",
ATTR{queue/scheduler}="bfq"
While Firefox and Linux in general have their performance "issues",
that's not relevant here. I'm comparing the same distros, same Firefox
versions, same Firefox add-ons, etc. I eventually tested many hardware
configurations: different CPU's, motherboards, GPU's, SSD's, RAM, etc.
The only remaining difference I can find is that the computer with
acceptable performance uses LVM + EXT4 while all the others use BTRFS.
With all the great feedback I have gotten here, I'm now ready to
retest this after implementing all the BTRFS-related suggestions I
have received. Maybe that will solve the problem or maybe this mystery
will continue...
Hmm, if you're only using SSD's, that may partially explain things. I don't
remember if it was mentioned earlier in this thread, but you might try
adding 'nossd' to the mount options. The 'ssd' mount option (which gets set
automatically if the device reports as non-rotational) impacts how the block
allocator works, and that can have a pretty insane impact on performance.
I will test the "nossd" mount option.
If you're already on the newest kernels (I hadn't realized you were
running 4.13 on anything), you might not see much impact from doing
this. I'd also suggest running a full balance prior to testing _after_
switching the option, since part of the performance impact comes from
the resultant on-disk layout.
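For example, assuming the filesystem is mounted at /mnt/data (adjust to
your actual mount point, and add nossd to the options field of the
matching fstab entry so it sticks across reboots), the test sequence
would look roughly like:

mount -o remount,nossd /mnt/data               # switch the allocator behavior
btrfs balance start --full-balance /mnt/data   # rewrite existing block groups with the new layout

The balance can take a long time on a large filesystem, so plan for that
before benchmarking.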
Additionally, independently from that, try toggling the 'discard' mount
option: if you have it enabled, disable it; if you have it disabled,
enable it. Inline discards can be very expensive on some hardware,
especially older SSD's, and discards happen pretty frequently in a COW
filesystem.
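If you just want to flip it for a test (again, /mnt/data is a
placeholder), it can be toggled with a remount rather than editing fstab
and rebooting:

mount -o remount,discard /mnt/data     # turn inline discards on
mount -o remount,nodiscard /mnt/data   # turn them back off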
I have been following this advice, so I have never enabled discard for
an NVMe drive. Do you think it is worth testing?
Solid State Drives/NVMe - ArchWiki
https://wiki.archlinux.org/index.php/Solid_State_Drives/NVMe
Discards:
Note: Although continuous TRIM is an option (albeit not recommended)
for SSDs, NVMe devices should not be issued discards.
I've never heard this particular advice before, and it offers no source
for the claim. I have seen the Intel advice they quote below that
before, though, and would tend to agree with it for most users. The part
that makes this all complicated is that different devices handle batched
discards (what the Arch people call 'Periodic TRIM') and on-demand
discards (what the Arch people call 'Continuous TRIM') differently.
Some devices (especially old ones) do better with batched discards,
while others seem to do better with on-demand discards. On top of that,
there's significant variance based on the actual workload (including
that from the filesystem itself).
Based on my own experience using BTRFS on SATA SSD's, it's usually
better to do batched discards unless you only write to the filesystem
infrequently, because:
1. Each COW operation triggers an associated discard (this can seriously
kill your performance).
2. Because old copies of blocks get discarded immediately, it's much
harder to recover a damaged filesystem.
There are some odd exceptions though. If for example you're running
BTRFS on a ramdisk or ZRAM device, you should just use on-demand
discards, as that will free up memory immediately.
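If you do go the batched route, plain fstrim from util-linux is all it
takes, either run by hand or via the fstrim.timer unit if your distro
packages it (the mount point is again just an example):

fstrim -v /mnt/data                  # one-off batched discard of unused space
systemctl enable --now fstrim.timer  # or let systemd run it weekly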