Hi Alvaro,

On 24.01.2017 at 19:36, Alvaro Herrera wrote:
> Tobias Oberstein wrote:
>
>> I am benchmarking IOPS, and while doing so, it becomes apparent that at
>> these scales it does matter _how_ IO is done.
>>
>> The most efficient way is libaio. I get 9.7 million/sec IOPS with low CPU
>> load. Using any synchronous IO engine is slower and produces higher load.
>>
>> I do understand that switching to libaio isn't going to fly for PG
>> (completely different approach).
>
> Maybe it is possible to write a new f_smgr implementation (parallel to
> md.c) that uses libaio.  There is no "seek" in that interface, at least,
> though the interface does assume that the implementation is blocking.


FWIW, I have now systematically compared IOPS across different IO methods, normalized for the system load each method induces.

I use the FIO ioengine terminology:

sync = lseek/read/write
psync = pread/pwrite
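
To make the difference concrete, here is a minimal sketch of the per-IO syscall pattern each engine exercises (illustrative C, not fio's actual engine code; fd/buffer setup omitted):

#include <sys/types.h>
#include <unistd.h>

/* "sync" engine pattern: two syscalls per random read */
ssize_t read_sync(int fd, void *buf, size_t len, off_t offset)
{
    if (lseek(fd, offset, SEEK_SET) == (off_t) -1)
        return -1;
    return read(fd, buf, len);
}

/* "psync" engine pattern: one syscall per random read */
ssize_t read_psync(int fd, void *buf, size_t len, off_t offset)
{
    return pread(fd, buf, len, offset);
}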

Here:

https://github.com/oberstet/scratchbox/raw/master/cruncher/engines-compared/normalized-iops.pdf

Conclusion:

psync has 1.15x the normalized IOPS compared to sync
libaio has up to 6.5x the normalized IOPS compared to sync
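
For reference, what makes libaio cheap per IO is the batched submit/reap model: many IOs per syscall, instead of one (or two) syscalls per IO. A minimal sketch of that pattern (assuming a preexisting test file opened with O_DIRECT; QUEUE_DEPTH, BLOCK_SIZE and the file path are just illustrative; link with -laio):

#define _GNU_SOURCE
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define QUEUE_DEPTH 32
#define BLOCK_SIZE  4096

int main(void)
{
    /* O_DIRECT to bypass the page cache, as in the benchmarks */
    int fd = open("/path/to/testfile", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    io_context_t ctx = 0;
    if (io_setup(QUEUE_DEPTH, &ctx) < 0) {
        fprintf(stderr, "io_setup failed\n");
        return 1;
    }

    struct iocb cb[QUEUE_DEPTH], *cbs[QUEUE_DEPTH];
    void *buf[QUEUE_DEPTH];

    for (int i = 0; i < QUEUE_DEPTH; i++) {
        if (posix_memalign(&buf[i], BLOCK_SIZE, BLOCK_SIZE) != 0)
            return 1;
        io_prep_pread(&cb[i], fd, buf[i], BLOCK_SIZE,
                      (long long) i * BLOCK_SIZE);
        cbs[i] = &cb[i];
    }

    /* submit QUEUE_DEPTH reads with a single syscall ... */
    if (io_submit(ctx, QUEUE_DEPTH, cbs) != QUEUE_DEPTH) {
        fprintf(stderr, "io_submit failed\n");
        return 1;
    }

    /* ... and reap all completions with a single syscall */
    struct io_event events[QUEUE_DEPTH];
    int n = io_getevents(ctx, QUEUE_DEPTH, QUEUE_DEPTH, events, NULL);
    printf("completed %d IOs\n", n);

    io_destroy(ctx);
    close(fd);
    return 0;
}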

---

These measurements were done on 16 NVMe block devices.

As mentioned, when Linux MD comes into the game, the difference between sync and psync is much bigger - there is lock contention in MD.

The reason: once MD is involved, even our massive CPUs can no longer hide the inefficiency of the double syscall per IO.

This MD issue is the bigger problem for us (compared to PG using sync/psync). I am going to post to the linux-raid list about it, as advised by the FIO developers.

---

That being said, when it comes to getting maximum performance out of NVMes at minimal system load, the real deal probably isn't libaio either, but kernel bypass (hinted to me by the FIO devs):

http://www.spdk.io/

FIO has a plugin for SPDK, which I am going to explore to establish a final conclusive baseline for maximum IOPS normalized for load.

There are similar approaches in networking (BSD netmap, DPDK) that bypass the kernel altogether (zero copy to userland, polling instead of interrupts, etc.). With hardware like this (NVMe, 100GbE, etc.), the kernel just gets in the way ..

Anyway, this is now probably OT as far as PG is concerned ;)

Cheers,
/Tobias





