> On 9 Feb 2017, at 7:11 pm, Mikael <mikael.ml...@gmail.com> wrote:
>
> 2017-02-09 16:41 GMT+08:00 David Gwynne <da...@gwynne.id.au>:
> ..
> hey mikael,
>
> can you be more specific about what you mean by multiqueuing for disks?
> even a reference to an implementation of what you’re asking about would
> help me answer this question.
>
> ill write up a bigger reply after my kids are in bed.
>
> cheers,
> dlg
>
> Hi David,
>
> Thank you for your answer.
>
> The other OpenBSD users I talked to also used the wording "multiqueue".
> My understanding of the kernel's workings here is too limited.
>
> If I were to give a reference to some implementation out there, I guess
> it would be the one introduced in Linux 3.13/3.16:
>
> "Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems"
> http://kernel.dk/blk-mq.pdf
>
> "Linux Multi-Queue Block IO Queueing Mechanism (blk-mq)"
> https://www.thomas-krenn.com/en/wiki/Linux_Multi-Queue_Block_IO_Queueing_Mechanism_(blk-mq)
>
> "The multiqueue block layer"
> https://lwn.net/Articles/552904/
>
> Looking forward a lot to your followup.
sorry, i fell asleep too.

thanks for the links to info on the linux mq stuff. i can understand what it provides. however, in the situation you are testing im not sure it is necessarily the means to addressing the difference in performance you're seeing in your environment.

anyway, tldr: you're suffering under the kernel's big giant lock.

according to the dmesg you provided you're testing a single ssd (a samsung 850) connected to a sata controller (ahci). with this equipment all operations between the computer and the actual disk are issued through ahci. because of the way ahci operates, operations on a specific disk are effectively serialised at this point. in your setup you have multiple cpus though, and it sounds like your benchmark runs on them concurrently, issuing io through the kernel to the disk via ahci.

two things are obviously different between linux and openbsd that would affect this benchmark.

the first is that io to physical devices is limited to a value called MAXPHYS in the kernel, which is 64 kilobytes. any larger read operation issued by userland to the kernel gets cut up into a series of 64k reads against the disk (there's a rough sketch of this chunking below). ahci itself can handle 4 meg per transfer.

the other difference is that, like most of the kernel, read() is serialised by the big lock. the result of this is that if you have userland on multiple cpus creating a heavily io bound workload, all the cpus end up waiting for each other to run. while one cpu is running through the io stack down to ahci, every other cpu is spinning waiting for its turn to do the same thing. the distance between userland and ahci is relatively long. going through the buffer cache (i.e., /dev/sd0) is longer than bypassing it (through /dev/rsd0). your test results confirm this.

the solution to this problem is to look at taking the big lock away from the io paths. this is non-trivial work though. i have already spent time working on making sd(4) and the scsi midlayer mpsafe, but haven't been able to take advantage of that work because both sides of the scsi subsystem (adapters like ahci on one side, and the block layer and syscalls on the other) still need the big lock. some adapters have been made mpsafe, but i dont think ahci was on that list.

when i was playing with mpsafe scsi, i gave up the big lock at the start of sd(4) and ran it, the midlayer, and mpi(4) or mpii(4) unlocked. if i remember correctly, even just unlocking that part of the stack doubled the throughput of the system.

the work ive done in the midlayer should mean that if we can access it without the biglock, accesses to disks behind adapters like ahci should scale pretty well across cpu cores because of how io is handed over to the midlayer. concurrent submissions by multiple cpus end up delegating one of the cpus to operate on the adapter on behalf of all the cpus. while that first cpu is still submitting to the hardware, the other cpus are not blocked from queueing more work and returning to userland (there's a rough sketch of that handoff below too).

i can go into more detail if you want.

cheers,
dlg
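to make the MAXPHYS point concrete, here is a rough illustrative sketch, not the actual kernel code: MAXPHYS really is 64k, but read_big() and issue_io() are made-up names standing in for the read path.

/*
 * rough sketch only: how a big read gets chopped into MAXPHYS-sized
 * pieces before hitting the disk. each issue_io() call stands in for
 * a full trip down the io stack to ahci and back.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define MAXPHYS (64 * 1024)	/* largest single transfer to a device */

/* stand-in for one transfer issued to the disk */
static int
issue_io(uint64_t offset, size_t len)
{
	printf("io at offset %llu, %zu bytes\n",
	    (unsigned long long)offset, len);
	return (0);
}

/* cut a large request into MAXPHYS-sized chunks */
static int
read_big(uint64_t offset, size_t len)
{
	size_t done = 0;

	while (done < len) {
		size_t chunk = len - done;

		if (chunk > MAXPHYS)
			chunk = MAXPHYS;
		if (issue_io(offset + done, chunk) != 0)
			return (-1);
		done += chunk;
	}
	return (0);
}

int
main(void)
{
	/* a 1 meg read from userland becomes 16 sequential 64k transfers */
	return (read_big(0, 1024 * 1024));
}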
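and to show what i mean by one cpu operating on the adapter on behalf of the others, here is a rough userland sketch of that handoff pattern. it uses pthreads and made-up names, not the actual midlayer code, and just illustrates the shape of the idea.

/*
 * rough sketch of the handoff: submitters queue work under a mutex,
 * and whichever cpu finds nobody draining the queue becomes the one
 * that feeds the adapter on behalf of everyone else.
 */
#include <sys/queue.h>
#include <pthread.h>
#include <stdlib.h>

struct io {
	TAILQ_ENTRY(io)	io_entry;
	/* request details would live here */
};

static TAILQ_HEAD(, io) io_queue = TAILQ_HEAD_INITIALIZER(io_queue);
static pthread_mutex_t io_mtx = PTHREAD_MUTEX_INITIALIZER;
static int io_running;		/* is some cpu already working the queue? */

/* stand-in for programming the hardware, one command at a time */
static void
run_adapter(struct io *io)
{
	free(io);
}

void
io_submit(struct io *io)
{
	pthread_mutex_lock(&io_mtx);
	TAILQ_INSERT_TAIL(&io_queue, io, io_entry);
	if (io_running) {
		/* someone else is feeding the adapter; just queue and go */
		pthread_mutex_unlock(&io_mtx);
		return;
	}
	io_running = 1;

	/* this cpu drains the queue on behalf of all submitters */
	while ((io = TAILQ_FIRST(&io_queue)) != NULL) {
		TAILQ_REMOVE(&io_queue, io, io_entry);
		pthread_mutex_unlock(&io_mtx);
		run_adapter(io);
		pthread_mutex_lock(&io_mtx);
	}
	io_running = 0;
	pthread_mutex_unlock(&io_mtx);
}

the point of the pattern is that the mutex is only held long enough to move an entry on or off the queue, so cpus that are just submitting can get back to userland quickly instead of spinning behind the big lock while another cpu talks to the hardware.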