Hi Alvaro,

On 24.01.2017 at 19:36, Alvaro Herrera wrote:
> Tobias Oberstein wrote:
>
>> I am benchmarking IOPS, and while doing so, it becomes apparent that at
>> these scales it does matter _how_ IO is done.
>>
>> The most efficient way is libaio. I get 9.7 million/sec IOPS with low CPU
>> load. Using any synchronous IO engine is slower and produces higher load.
>>
>> I do understand that switching to libaio isn't going to fly for PG
>> (completely different approach).
>
> Maybe it is possible to write a new f_smgr implementation (parallel to
> md.c) that uses libaio.  There is no "seek" in that interface, at least,
> though the interface does assume that the implementation is blocking.


FWIW, I have now systematically compared IOPS across different IO methods, normalized for the system load each method induces.

I use the FIO ioengine terminology:

sync = lseek/read/write
psync = pread/pwrite
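
To make the difference concrete, here is a minimal sketch of the per-IO syscall pattern each engine exercises (illustrative C, not fio's actual engine code; fd/buffer setup omitted):

#include <sys/types.h>
#include <unistd.h>

/* "sync" engine pattern: two syscalls per random read */
ssize_t read_sync(int fd, void *buf, size_t len, off_t offset)
{
    if (lseek(fd, offset, SEEK_SET) == (off_t) -1)
        return -1;
    return read(fd, buf, len);
}

/* "psync" engine pattern: one syscall per random read */
ssize_t read_psync(int fd, void *buf, size_t len, off_t offset)
{
    return pread(fd, buf, len, offset);
}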

Here:

https://github.com/oberstet/scratchbox/raw/master/cruncher/engines-compared/normalized-iops.pdf

Conclusion:

psync has 1.15x the normalized IOPS compared to sync
libaio has up to 6.5x the normalized IOPS compared to sync
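
For reference, what makes libaio cheap per IO is the batched submit/reap model: many IOs per syscall, instead of one (or two) syscalls per IO. A minimal sketch of that pattern (assuming a preexisting test file opened with O_DIRECT; QUEUE_DEPTH, BLOCK_SIZE and the file path are just illustrative; link with -laio):

#define _GNU_SOURCE
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define QUEUE_DEPTH 32
#define BLOCK_SIZE  4096

int main(void)
{
    /* O_DIRECT to bypass the page cache, as in the benchmarks */
    int fd = open("/path/to/testfile", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    io_context_t ctx = 0;
    if (io_setup(QUEUE_DEPTH, &ctx) < 0) {
        fprintf(stderr, "io_setup failed\n");
        return 1;
    }

    struct iocb cb[QUEUE_DEPTH], *cbs[QUEUE_DEPTH];
    void *buf[QUEUE_DEPTH];

    for (int i = 0; i < QUEUE_DEPTH; i++) {
        if (posix_memalign(&buf[i], BLOCK_SIZE, BLOCK_SIZE) != 0)
            return 1;
        io_prep_pread(&cb[i], fd, buf[i], BLOCK_SIZE,
                      (long long) i * BLOCK_SIZE);
        cbs[i] = &cb[i];
    }

    /* submit QUEUE_DEPTH reads with a single syscall ... */
    if (io_submit(ctx, QUEUE_DEPTH, cbs) != QUEUE_DEPTH) {
        fprintf(stderr, "io_submit failed\n");
        return 1;
    }

    /* ... and reap all completions with a single syscall */
    struct io_event events[QUEUE_DEPTH];
    int n = io_getevents(ctx, QUEUE_DEPTH, QUEUE_DEPTH, events, NULL);
    printf("completed %d IOs\n", n);

    io_destroy(ctx);
    close(fd);
    return 0;
}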

---

These measurements were done on 16 NVMe block devices.

As mentioned, when Linux MD comes into the game, the difference between sync and psync is much bigger - there is lock contention in MD.

The reason: once MD is involved, even our massive CPUs can no longer hide the inefficiency of the double syscall per IO.

This MD issue is the bigger problem for us (compared to PG using sync/psync). I am going to post to the linux-raid list about it, as advised by the FIO developers.

---

That being said, when it comes to getting maximum performance out of NVMes at minimal system load, the real deal probably isn't libaio either, but kernel bypass (hinted to me by the FIO devs):

http://www.spdk.io/

FIO has a plugin for SPDK, which I am going to explore to establish a final conclusive baseline for maximum IOPS normalized for load.

There are similar approaches in networking (BSD netmap, DPDK) that bypass the kernel altogether (zero copy to userland, polling instead of interrupts, etc.). With hardware like this (NVMe, 100GbE, etc.), the kernel just gets in the way ..

Anyway, this is now probably OT as far as PG is concerned ;)

Cheers,
/Tobias





