> On 9 Feb 2017, at 7:11 pm, Mikael <mikael.ml...@gmail.com> wrote:
>
> 2017-02-09 16:41 GMT+08:00 David Gwynne <da...@gwynne.id.au>:
> ..
> hey mikael,
>
> can you be more specific about what you mean by multiqueuing for disks?
> even a reference to an implementation of what you’re asking about would
> help me answer this question.
>
> ill write up a bigger reply after my kids are in bed.
>
> cheers,
> dlg
>
> Hi David,
>
> Thank you for your answer.
>
> The other OpenBSD users I talked to also used the wording "multiqueue".
> My understanding of the kernel's workings here is too limited.
>
> If I were to give a reference to some implementation out there, I guess
> it would be the one introduced in Linux 3.13/3.16:
>
> "Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems"
> http://kernel.dk/blk-mq.pdf
>
> "Linux Multi-Queue Block IO Queueing Mechanism (blk-mq)"
> https://www.thomas-krenn.com/en/wiki/Linux_Multi-Queue_Block_IO_Queueing_Mechanism_(blk-mq)
>
> "The multiqueue block layer"
> https://lwn.net/Articles/552904/
>
> Looking forward a lot to your followup.
sorry, i fell asleep too.

thanks for the links to info on the linux mq stuff. i can understand what it provides. however, in the situation you are testing im not sure it is necessarily the means to addressing the difference in performance you're seeing in your environment.

anyway, tldr: you're suffering under the kernel's big giant lock.

according to the dmesg you provided you're testing a single ssd (a samsung 850) connected to a sata controller (ahci). with this equipment all operations between the computer and the actual disk are issued through ahci. because of the way ahci operates, operations on a specific disk are effectively serialised at this point. in your setup you have multiple cpus though, and it sounds like your benchmark runs on them concurrently, issuing io through the kernel to the disk via ahci.

two things are obviously different between linux and openbsd that would affect this benchmark.

the first is that io to physical devices is limited to a value called MAXPHYS in the kernel, which is 64 kilobytes. any larger read operation issued by userland to the kernel gets cut up into a series of 64k reads against the disk (there's a rough sketch of this chunking below). ahci itself can handle 4 meg per transfer.

the other difference is that, like most of the kernel, read() is serialised by the big lock. the result of this is that if you have userland on multiple cpus creating a heavily io bound workload, all the cpus end up waiting for each other to run. while one cpu is running through the io stack down to ahci, every other cpu is spinning waiting for its turn to do the same thing. the distance between userland and ahci is relatively long. going through the buffer cache (i.e., /dev/sd0) is longer than bypassing it (through /dev/rsd0). your test results confirm this.

the solution to this problem is to look at taking the big lock away from the io paths. this is non-trivial work though. i have already spent time working on making sd(4) and the scsi midlayer mpsafe, but haven't been able to take advantage of that work because both sides of the scsi subsystem (adapters like ahci on one side, and the block layer and syscalls on the other) still need the big lock. some adapters have been made mpsafe, but i dont think ahci was on that list.

when i was playing with mpsafe scsi, i gave up the big lock at the start of sd(4) and ran it, the midlayer, and mpi(4) or mpii(4) unlocked. if i remember correctly, even just unlocking that part of the stack doubled the throughput of the system.

the work ive done in the midlayer should mean that if we can access it without the biglock, accesses to disks behind adapters like ahci should scale pretty well across cpu cores because of how io is handed over to the midlayer. concurrent submissions by multiple cpus end up delegating one of the cpus to operate on the adapter on behalf of all the cpus. while that first cpu is still submitting to the hardware, the other cpus are not blocked from queueing more work and returning to userland (there's a rough sketch of that handoff below too).

i can go into more detail if you want.

cheers,
dlg
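to make the MAXPHYS point concrete, here is a rough illustrative sketch, not the actual kernel code: MAXPHYS really is 64k, but read_big() and issue_io() are made-up names standing in for the read path.

/*
 * rough sketch only: how a big read gets chopped into MAXPHYS-sized
 * pieces before hitting the disk. each issue_io() call stands in for
 * a full trip down the io stack to ahci and back.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define MAXPHYS (64 * 1024)	/* largest single transfer to a device */

/* stand-in for one transfer issued to the disk */
static int
issue_io(uint64_t offset, size_t len)
{
	printf("io at offset %llu, %zu bytes\n",
	    (unsigned long long)offset, len);
	return (0);
}

/* cut a large request into MAXPHYS-sized chunks */
static int
read_big(uint64_t offset, size_t len)
{
	size_t done = 0;

	while (done < len) {
		size_t chunk = len - done;

		if (chunk > MAXPHYS)
			chunk = MAXPHYS;
		if (issue_io(offset + done, chunk) != 0)
			return (-1);
		done += chunk;
	}
	return (0);
}

int
main(void)
{
	/* a 1 meg read from userland becomes 16 sequential 64k transfers */
	return (read_big(0, 1024 * 1024));
}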
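and to show what i mean by one cpu operating on the adapter on behalf of the others, here is a rough userland sketch of that handoff pattern. it uses pthreads and made-up names, not the actual midlayer code, and just illustrates the shape of the idea.

/*
 * rough sketch of the handoff: submitters queue work under a mutex,
 * and whichever cpu finds nobody draining the queue becomes the one
 * that feeds the adapter on behalf of everyone else.
 */
#include <sys/queue.h>
#include <pthread.h>
#include <stdlib.h>

struct io {
	TAILQ_ENTRY(io)	io_entry;
	/* request details would live here */
};

static TAILQ_HEAD(, io) io_queue = TAILQ_HEAD_INITIALIZER(io_queue);
static pthread_mutex_t io_mtx = PTHREAD_MUTEX_INITIALIZER;
static int io_running;		/* is some cpu already working the queue? */

/* stand-in for programming the hardware, one command at a time */
static void
run_adapter(struct io *io)
{
	free(io);
}

void
io_submit(struct io *io)
{
	pthread_mutex_lock(&io_mtx);
	TAILQ_INSERT_TAIL(&io_queue, io, io_entry);
	if (io_running) {
		/* someone else is feeding the adapter; just queue and go */
		pthread_mutex_unlock(&io_mtx);
		return;
	}
	io_running = 1;

	/* this cpu drains the queue on behalf of all submitters */
	while ((io = TAILQ_FIRST(&io_queue)) != NULL) {
		TAILQ_REMOVE(&io_queue, io, io_entry);
		pthread_mutex_unlock(&io_mtx);
		run_adapter(io);
		pthread_mutex_lock(&io_mtx);
	}
	io_running = 0;
	pthread_mutex_unlock(&io_mtx);
}

the point of the pattern is that the mutex is only held long enough to move an entry on or off the queue, so cpus that are just submitting can get back to userland quickly instead of spinning behind the big lock while another cpu talks to the hardware.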