On Tue, Jun 08, 2021 at 08:19:20PM +0800, zhenwei pi wrote:
> On 6/8/21 4:07 PM, Stefan Hajnoczi wrote:
> > On Tue, Jun 08, 2021 at 10:52:05AM +0800, zhenwei pi wrote:
> > > On 6/7/21 11:08 PM, Stefan Hajnoczi wrote:
> > > > On Mon, Jun 07, 2021 at 09:32:52PM +0800, zhenwei pi wrote:
> > > > > Since 2020, I have been developing a userspace NVMF initiator library:
> > > > > https://github.com/bytedance/libnvmf
> > > > > and released v0.1 recently.
> > > > >
> > > > > I also developed a block driver for the QEMU side:
> > > > > https://github.com/pizhenwei/qemu/tree/block-nvmf
> > > > >
> > > > > Tested with the Linux kernel NVMF target (TCP), QEMU gets about
> > > > > 220K IOPS, which seems good.
> > > >
> > > > How does the performance compare to the Linux kernel NVMeoF initiator?
> > > >
> > > > In case you're interested, some Red Hat developers have started
> > > > working on a new library called libblkio. For now it supports io_uring
> > > > but PCI NVMe and virtio-blk are on the roadmap. The library supports
> > > > blocking, event-driven, and polling modes. There isn't a direct overlap
> > > > with libnvmf but maybe they can learn from each other.
> > > > https://gitlab.com/libblkio/libblkio/-/blob/main/docs/blkio.rst
> > > >
> > > > Stefan
> > >
> > > I'm sorry that I didn't provide enough information about the QEMU block
> > > nvmf driver and libnvmf.
> > >
> > > Kernel initiator & userspace initiator
> > > Rather than the io_uring/libaio + kernel initiator solution (read 500K+
> > > IOPS & write 200K+ IOPS), I prefer QEMU block nvmf + libnvmf (RW 200K+
> > > IOPS):
> > > 1. I don't have to upgrade the host kernel, and I can also run it on an
> > > older kernel version.
> > > 2. During reconnection, if the target side hits a panic, the initiator
> > > side does not end up in the 'D' state (uninterruptible state in the
> > > kernel); QEMU can always be killed.
> > > 3. It's easier to troubleshoot a userspace application.
> >
> > I see, thanks for sharing.
> >
> > > Default NVMe-oF IO queues
> > > The mechanism of QEMU+libnvmf:
> > > 1. The QEMU iothread creates a request and dispatches it to an NVMe-oF
> > > IO queue thread via a lockless list.
> > > 2. The QEMU iothread tries to kick the NVMe-oF IO queue thread.
> > > 3. The NVMe-oF IO queue thread processes the request and returns the
> > > response to the QEMU iothread.
> > >
> > > When the QEMU iothread reaches its limit, 4 NVMe-oF IO queues give
> > > better performance.
> >
> > Can you explain this bottleneck? Even with 4 NVMe-oF IO queues there is
> > still just 1 IOThread submitting requests, so why are 4 IO queues faster
> > than 1?
> >
> > Stefan
>
> The QEMU + libiscsi solution uses the iothread to send/recv over TCP and
> process iSCSI PDUs directly; it gets about 60K IOPS. Let's look at the perf
> report of the iothread:
> + 35.06% [k] entry_SYSCALL_64_after_hwframe
> + 33.13% [k] do_syscall_64
> + 19.70% [.] 0x0000000100000000
> + 18.31% [.] __libc_send
> + 18.02% [.] iscsi_tcp_service
> + 16.30% [k] __x64_sys_sendto
> + 16.24% [k] __sys_sendto
> + 15.69% [k] sock_sendmsg
> + 15.56% [k] tcp_sendmsg
> + 14.25% [k] __tcp_transmit_skb
> + 13.94% [k] 0x0000000000001000
> + 13.78% [k] tcp_sendmsg_locked
> + 13.67% [k] __ip_queue_xmit
> + 13.00% [k] tcp_write_xmit
> + 12.07% [k] __tcp_push_pending_frames
> + 11.91% [k] inet_recvmsg
> + 11.78% [k] tcp_recvmsg
> + 11.73% [k] ip_output
>
> The bottleneck in this case is TCP, so libnvmf dispatches requests to other
> threads via a lockless list to reduce the TCP overhead. This makes it more
> efficient to process requests from the guest.
Is the IOThread's CPU utilization (%usr and %sys) close to 100%?

Stefan
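
As a rough illustration of the dispatch mechanism zhenwei describes above
(the IOThread pushing requests onto a lockless list and kicking an NVMe-oF IO
queue thread), here is a minimal C sketch. All names (nvmf_req, ioq,
ioq_submit, ...) are hypothetical and are not the libnvmf or QEMU APIs; it
only shows the single-producer lock-free list plus eventfd kick pattern under
that assumption.

/*
 * Hypothetical sketch: an IOThread pushes requests onto a lock-free singly
 * linked list and "kicks" the NVMe-oF IO queue thread through an eventfd.
 * Names are made up for illustration; this is not libnvmf or QEMU code.
 */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/eventfd.h>

struct nvmf_req {
    struct nvmf_req *next;
    uint64_t lba;
    uint32_t len;
};

struct ioq {
    _Atomic(struct nvmf_req *) pending;  /* lock-free LIFO of new requests */
    int kick_fd;                         /* eventfd used to wake the queue thread */
};

/* Producer side (IOThread): push a request and kick the queue thread. */
static void ioq_submit(struct ioq *q, struct nvmf_req *req)
{
    struct nvmf_req *old = atomic_load_explicit(&q->pending, memory_order_relaxed);
    do {
        req->next = old;
    } while (!atomic_compare_exchange_weak_explicit(&q->pending, &old, req,
                                                    memory_order_release,
                                                    memory_order_relaxed));

    uint64_t one = 1;
    (void)write(q->kick_fd, &one, sizeof(one));   /* the "kick" in step 2 */
}

/* Consumer side (NVMe-oF IO queue thread): wait for a kick, take the whole
 * pending list with one atomic exchange, then process the requests. */
static void *ioq_thread(void *arg)
{
    struct ioq *q = arg;
    for (;;) {
        uint64_t cnt;
        if (read(q->kick_fd, &cnt, sizeof(cnt)) < 0) {
            break;
        }
        struct nvmf_req *req = atomic_exchange_explicit(&q->pending, NULL,
                                                        memory_order_acquire);
        while (req) {
            struct nvmf_req *next = req->next;
            /* A real implementation would build an NVMe-oF capsule, send it
             * over the TCP socket and later complete the request back to the
             * IOThread; here we just print it. */
            printf("processing request lba=%llu len=%u\n",
                   (unsigned long long)req->lba, req->len);
            free(req);
            req = next;
        }
    }
    return NULL;
}

int main(void)
{
    struct ioq q = { .pending = NULL, .kick_fd = eventfd(0, 0) };
    pthread_t tid;
    pthread_create(&tid, NULL, ioq_thread, &q);

    struct nvmf_req *req = calloc(1, sizeof(*req));
    req->lba = 2048;
    req->len = 4096;
    ioq_submit(&q, req);          /* roughly what steps 1 and 2 above would do */

    sleep(1);                     /* let the queue thread drain, then exit */
    return 0;
}

The point of the pattern is that the consumer grabs the whole pending list
with a single atomic exchange, so the producer never takes a lock and the
IOThread's per-request cost stays small while the TCP send/recv work moves to
the IO queue threads.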