On 5/28/26 8:40 AM, Geliang Tang wrote:
From: Geliang Tang<[email protected]>

Add NVMe iopolicy testing to mptcp_nvme.sh, with the default set to
"numa". It can be set to "round-robin" or "queue-depth".

Test results with 4 NVMe multipath paths and round-robin iopolicy show
that TCP and MPTCP achieve similar bandwidth:

  # ./mptcp_nvme.sh tcp 4 round-robin
    READ: bw=455MiB/s (478MB/s), 455MiB/s-455MiB/s (478MB/s-478MB/s),
                io=4665MiB (4891MB), run=10242-10242msec
   WRITE: bw=455MiB/s (477MB/s), 455MiB/s-455MiB/s (477MB/s-477MB/s),
                io=4633MiB (4858MB), run=10184-10184msec

  # ./mptcp_nvme.sh mptcp 4 round-robin
    READ: bw=445MiB/s (466MB/s), 445MiB/s-445MiB/s (466MB/s-466MB/s),
                io=4575MiB (4797MB), run=10287-10287msec
   WRITE: bw=445MiB/s (467MB/s), 445MiB/s-445MiB/s (467MB/s-467MB/s),
                io=4572MiB (4794MB), run=10267-10267msec

A "loss" argument is added to simulate network packet loss. When loss=1,
each veth interface is configured with "delay 5ms loss 0.5%" using tc
qdisc. Under this scenario, TCP throughput is roughly 3x (read) to 4.3x
(write) lower than MPTCP:

  # ./mptcp_nvme.sh tcp 4 round-robin 1
    READ: bw=144MiB/s (151MB/s), 144MiB/s-144MiB/s (151MB/s-151MB/s),
                io=1909MiB (2001MB), run=13231-13231msec
   WRITE: bw=100.0MiB/s (105MB/s), 100.0MiB/s-100.0MiB/s (105MB/s-105MB/s),
                io=1397MiB (1465MB), run=13980-13980msec

  # ./mptcp_nvme.sh mptcp 4 round-robin 1
    READ: bw=428MiB/s (449MB/s), 428MiB/s-428MiB/s (449MB/s-449MB/s),
                io=4524MiB (4743MB), run=10564-10564msec
   WRITE: bw=431MiB/s (452MB/s), 431MiB/s-431MiB/s (452MB/s-452MB/s),
                io=4513MiB (4732MB), run=10481-10481msec

These results demonstrate that MPTCP has better resilience against
packet loss compared to TCP, as it can leverage multiple subflows to
mitigate network degradation.
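
For reference, the loss scenario described above could be set up roughly as
in the sketch below. It assumes the netem qdisc is what provides the delay
and loss, and the interface names (veth0..veth3) are placeholders; the
script's actual invocation may differ. The helper only prints the commands
so they can be reviewed before being run as root in the right namespace.

```shell
#!/bin/sh
# Sketch only: print one tc command per interface instead of running it.
# "delay 5ms loss 0.5%" comes from the patch description; using netem at
# the root of each device is an assumption about how the script applies it.
netem_cmds() {
	for dev in "$@"; do
		echo "tc qdisc add dev $dev root netem delay 5ms loss 0.5%"
	done
}

# Placeholder interface names, one per NVMe path.
netem_cmds veth0 veth1 veth2 veth3
```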

There are a few observations I'd like to raise:

1. It is difficult to reason about the throughput results when NVMe native
   multipath is enabled together with MPTCP. In this topology, four NVMe paths
   are created and the round-robin I/O policy is configured. As a result, each
   I/O first goes through the NVMe multipath scheduler, which selects a path,
   and is then further subjected to the MPTCP scheduler, which selects a TCP
   subflow. This means there are two independent schedulers influencing I/O
   placement, making it difficult to attribute the observed throughput
   improvements to either NVMe multipath or MPTCP.

   For throughput comparisons, it may be more meaningful to disable NVMe native
   multipath (e.g., modprobe nvme_core multipath=n) when testing MPTCP. This
   would ensure that all I/O is sent through a single NVMe/TCP path while
   allowing MPTCP alone to distribute traffic across available subflows. Such a
   setup would provide a clearer comparison between TCP and MPTCP.

2. The current test uses only a 128 KiB I/O size. It would be useful to include
   additional I/O sizes as well, such as 4 KiB, 8 KiB, and 32 KiB, since MPTCP
   and NVMe multipath may behave differently under different workload
   characteristics.

3. The fio runtime is only 10 seconds, which is relatively short for performance
   evaluation. The results may be influenced by startup transients and may not
   accurately reflect steady-state behavior. It would be preferable to run the
   tests for a longer duration, for example 120 seconds, to obtain more stable
   measurements.

4. The tests are run on the same host by setting up veth interfaces and running
   host and target under different network namespaces. It'd be useful if you
   could run these tests between real host and target systems.
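
To make points 1-3 concrete, something along these lines is what I have in
mind. This is only a sketch: the device path, the rw mode, and the 10 second
ramp time are my assumptions and not taken from mptcp_nvme.sh. It prints the
fio command lines rather than running them, and reports whether native
multipath is currently enabled by reading the nvme_core module parameter.

```shell
#!/bin/sh
# Illustrative defaults; none of these values come from mptcp_nvme.sh.
DEV="${DEV:-/dev/nvme0n1}"
RW="${RW:-randrw}"
RUNTIME="${RUNTIME:-120}"
RAMP="${RAMP:-10}"

# Print one fio command per I/O size, with a longer time-based run and a
# ramp period so that startup transients are excluded from the results.
fio_sweep_cmds() {
	for bs in 4k 8k 32k 128k; do
		echo "fio --name=bs-$bs --filename=$DEV --rw=$RW --bs=$bs" \
			"--ioengine=libaio --direct=1 --time_based" \
			"--runtime=$RUNTIME --ramp_time=$RAMP --group_reporting"
	done
}

# Report the native multipath setting (Y or N), or "unknown" if the
# parameter file is not readable. The path can be overridden for testing.
native_multipath_state() {
	param="${1:-/sys/module/nvme_core/parameters/multipath}"
	if [ -r "$param" ]; then
		cat "$param"
	else
		echo "unknown"
	fi
}

echo "nvme_core multipath: $(native_multipath_state)"
fio_sweep_cmds
```

Running the same sweep once for tcp and once for mptcp, with native multipath
reported as N, should make the two sets of numbers directly comparable.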

Thanks,
--Nilay

