Re: [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB

Peter Xu Tue, 09 Jul 2024 11:44:19 -0700

On Tue, Jul 09, 2024 at 08:42:59AM +0000, Liu, Yuan1 wrote:
> > -----Original Message-----
> > From: Yichen Wang <yichen.w...@bytedance.com>
> > Sent: Saturday, July 6, 2024 2:29 AM
> > To: Paolo Bonzini <pbonz...@redhat.com>; Daniel P. Berrangé
> > <berra...@redhat.com>; Eduardo Habkost <edua...@habkost.net>; Marc-André
> > Lureau <marcandre.lur...@redhat.com>; Thomas Huth <th...@redhat.com>;
> > Philippe Mathieu-Daudé <phi...@linaro.org>; Peter Xu <pet...@redhat.com>;
> > Fabiano Rosas <faro...@suse.de>; Eric Blake <ebl...@redhat.com>; Markus
> > Armbruster <arm...@redhat.com>; Laurent Vivier <lviv...@redhat.com>; qemu-
> > de...@nongnu.org
> > Cc: Hao Xiang <hao.xi...@linux.dev>; Liu, Yuan1 <yuan1....@intel.com>;
> > Zou, Nanhai <nanhai....@intel.com>; Ho-Ren (Jack) Chuang
> > <horenchu...@bytedance.com>; Wang, Yichen <yichen.w...@bytedance.com>
> > Subject: [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB
> > 
> > v4:
> > - Rebase changes on top of 1a2d52c7fcaeaaf4f2fe8d4d5183dccaeab67768
> > - Move the IOV initialization to qatzip implementation
> > - Only use qatzip to compress normal pages
> > 
> > v3:
> > - Rebase changes on top of master
> > - Merge two patches per Fabiano Rosas's comment
> > - Add versions into comments and documentations
> > 
> > v2:
> > - Rebase changes on top of recent multifd code changes.
> > - Use QATzip API 'qzMalloc' and 'qzFree' to allocate QAT buffers.
> > - Remove parameter tuning and use QATzip's defaults for better
> >   performance.
> > - Add parameter to enable QAT software fallback.
> > 
> > v1:
> > https://lists.nongnu.org/archive/html/qemu-devel/2023-12/msg03761.html
> > 
> > * Performance
> > 
> > We present updated performance results. For circumstantial reasons, v1
> > presented performance on a low-bandwidth (1Gbps) network.
> > 
> > Here, we present updated results with a similar setup as before but with
> > two main differences:
> > 
> > 1. Our machines have a ~50Gbps connection, tested using 'iperf3'.
> > 2. We had a bug in our memory allocation causing us to only use ~1/2 of
> > the VM's RAM. Now we properly allocate and fill nearly all of the VM's
> > RAM.
> > 
> > Thus, the test setup is as follows:
> > 
> > We perform multifd live migration over TCP using a VM with 64GB memory.
> > We prepare the machine's memory by powering it on, allocating a large
> > amount of memory (60GB) as a single buffer, and filling the buffer with
> > the repeated contents of the Silesia corpus[0]. This is in lieu of a more
> > realistic memory snapshot, which proved troublesome to acquire.
> > 
> > We analyze CPU usage by averaging the output of 'top' every second
> > during migration. This is admittedly imprecise, but we feel that it
> > accurately portrays the different degrees of CPU usage of varying
> > compression methods.
> > 
> > We present the latency, throughput, and CPU usage results for all of the
> > compression methods, with varying numbers of multifd threads (4, 8, and
> > 16).
> > 
> > [0] The Silesia corpus can be accessed here:
> > https://sun.aei.polsl.pl//~sdeor/index.php?page=silesia
> > 
> > ** Results
> > 
> > 4 multifd threads:
> > 
> >     |---------------|---------------|----------------|---------|---------|
> >     |method         |time(sec)      |throughput(mbps)|send cpu%|recv cpu%|
> >     |---------------|---------------|----------------|---------|---------|
> >     |qatzip         | 23.13         | 8749.94        |117.50   |186.49   |
> >     |---------------|---------------|----------------|---------|---------|
> >     |zlib           |254.35         |  771.87        |388.20   |144.40   |
> >     |---------------|---------------|----------------|---------|---------|
> >     |zstd           | 54.52         | 3442.59        |414.59   |149.77   |
> >     |---------------|---------------|----------------|---------|---------|
> >     |none           | 12.45         |43739.60        |159.71   |204.96   |
> >     |---------------|---------------|----------------|---------|---------|
> > 
> > 8 multifd threads:
> > 
> >     |---------------|---------------|----------------|---------|---------|
> >     |method         |time(sec)      |throughput(mbps)|send cpu%|recv cpu%|
> >     |---------------|---------------|----------------|---------|---------|
> >     |qatzip         | 16.91         |12306.52        |186.37   |391.84   |
> >     |---------------|---------------|----------------|---------|---------|
> >     |zlib           |130.11         | 1508.89        |753.86   |289.35   |
> >     |---------------|---------------|----------------|---------|---------|
> >     |zstd           | 27.57         | 6823.23        |786.83   |303.80   |
> >     |---------------|---------------|----------------|---------|---------|
> >     |none           | 11.82         |46072.63        |163.74   |238.56   |
> >     |---------------|---------------|----------------|---------|---------|
> > 
> > 16 multifd threads:
> > 
> >     |---------------|---------------|----------------|---------|---------|
> >     |method         |time(sec)      |throughput(mbps)|send cpu%|recv cpu%|
> >     |---------------|---------------|----------------|---------|---------|
> >     |qatzip         |18.64          |11044.52        | 573.61  |437.65   |
> >     |---------------|---------------|----------------|---------|---------|
> >     |zlib           |66.43          | 2955.79        |1469.68  |567.47   |
> >     |---------------|---------------|----------------|---------|---------|
> >     |zstd           |14.17          |13290.66        |1504.08  |615.33   |
> >     |---------------|---------------|----------------|---------|---------|
> >     |none           |16.82          |32363.26        | 180.74  |217.17   |
> >     |---------------|---------------|----------------|---------|---------|
> > 
> > ** Observations
> > 
> > - In general, not using compression outperforms using compression in a
> >   non-network-bound environment.
> > - 'qatzip' outperforms other compression workers with 4 and 8 workers,
> >   achieving a ~91% latency reduction over 'zlib' with 4 workers, and a
> >   ~58% latency reduction over 'zstd' with 4 workers.
> > - 'qatzip' maintains comparable performance with 'zstd' at 16 workers,
> >   showing a ~32% increase in latency. This performance difference
> >   becomes more noticeable with more workers, as CPU compression is highly
> >   parallelizable.
> > - 'qatzip' compression uses considerably less CPU than other compression
> >   methods. At 8 workers, 'qatzip' demonstrates a ~75% reduction in
> >   compression CPU usage compared to 'zstd' and 'zlib'.
> > - 'qatzip' decompression CPU usage is less impressive, and is even
> >   slightly worse than 'zstd' and 'zlib' CPU usage at 4 and 16 workers.
> 
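
The percentages in the observations above can be re-derived from the tables. A minimal Python sketch follows; the helper names are made up for illustration, and the numbers are copied from the 4, 8, and 16 thread tables:

```python
def pct_reduction(new: float, old: float) -> float:
    """Percentage by which `new` is lower than `old`."""
    return (1 - new / old) * 100


def pct_increase(new: float, old: float) -> float:
    """Percentage by which `new` is higher than `old`."""
    return (new / old - 1) * 100


# Migration time in seconds, taken from the quoted tables.
time_4 = {"qatzip": 23.13, "zlib": 254.35, "zstd": 54.52}
time_16 = {"qatzip": 18.64, "zstd": 14.17}
# Sender CPU% at 8 multifd threads, taken from the quoted tables.
send_cpu_8 = {"qatzip": 186.37, "zlib": 753.86, "zstd": 786.83}

print(f"4 threads, qatzip vs zlib time:  -{pct_reduction(time_4['qatzip'], time_4['zlib']):.1f}%")
print(f"4 threads, qatzip vs zstd time:  -{pct_reduction(time_4['qatzip'], time_4['zstd']):.1f}%")
print(f"16 threads, qatzip vs zstd time: +{pct_increase(time_16['qatzip'], time_16['zstd']):.1f}%")
print(f"8 threads, qatzip vs zstd send CPU: -{pct_reduction(send_cpu_8['qatzip'], send_cpu_8['zstd']):.1f}%")
print(f"8 threads, qatzip vs zlib send CPU: -{pct_reduction(send_cpu_8['qatzip'], send_cpu_8['zlib']):.1f}%")
```

The results line up with the rounded ~91%, ~58%, ~32%, and ~75% figures quoted above.
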
> Hi Peter & Yichen
> 
> I have a test based on the V4 patch set
> VM configuration:16 vCPU, 64G memory, 
> VM Workload: all vCPUs are idle and 54G memory is filled with Silesia data.
> QAT Devices: 4
> 
> Sender migration parameters
> migrate_set_capability multifd on
> migrate_set_parameter multifd-channels 2/4/8
> migrate_set_parameter max-bandwidth 1G/10G


Ah, I think this means GBps... not Gbps, then.

> migrate_set_parameter multifd-compression qatzip/zstd
> 
> Receiver migration parameters
> migrate_set_capability multifd on
> migrate_set_parameter multifd-channels 2
> migrate_set_parameter multifd-compression qatzip/zstd
> 
> max-bandwidth: 1GBps
>      |-----------|--------|---------|----------|------|------|
>      |2 Channels |Total   |down     |throughput| send | recv |
>      |           |time(ms)|time(ms) |(mbps)    | cpu %| cpu% |
>      |-----------|--------|---------|----------|------|------|
>      |qatzip     |   21607|       77|      8051|    88|   125|
>      |-----------|--------|---------|----------|------|------|
>      |zstd       |   78351|       96|      2199|   204|    80|
>      |-----------|--------|---------|----------|------|------|
> 
>      |-----------|--------|---------|----------|------|------|
>      |4 Channels |Total   |down     |throughput| send | recv |
>      |           |time(ms)|time(ms) |(mbps)    | cpu %| cpu% |
>      |-----------|--------|---------|----------|------|------|
>      |qatzip     |   20336|       25|      8557|   110|   190|
>      |-----------|--------|---------|----------|------|------|
>      |zstd       |   39324|       31|      4389|   406|   160|
>      |-----------|--------|---------|----------|------|------|
> 
>      |-----------|--------|---------|----------|------|------|
>      |8 Channels |Total   |down     |throughput| send | recv |
>      |           |time(ms)|time(ms) |(mbps)    | cpu %| cpu% |
>      |-----------|--------|---------|----------|------|------|
>      |qatzip     |   20208|       22|      8613|   125|   300|
>      |-----------|--------|---------|----------|------|------|
>      |zstd       |   20515|       22|      8438|   800|   340|
>      |-----------|--------|---------|----------|------|------|
> 
> max-bandwidth: 10GBps
>      |-----------|--------|---------|----------|------|------|
>      |2 Channels |Total   |down     |throughput| send | recv |
>      |           |time(ms)|time(ms) |(mbps)    | cpu %| cpu% |
>      |-----------|--------|---------|----------|------|------|
>      |qatzip     |   22450|       77|      7748|    80|   125|
>      |-----------|--------|---------|----------|------|------|
>      |zstd       |   78339|       76|      2199|   204|    80|
>      |-----------|--------|---------|----------|------|------|
> 
>      |-----------|--------|---------|----------|------|------|
>      |4 Channels |Total   |down     |throughput| send | recv |
>      |           |time(ms)|time(ms) |(mbps)    | cpu %| cpu% |
>      |-----------|--------|---------|----------|------|------|
>      |qatzip     |   13017|       24|     13401|   180|   285|
>      |-----------|--------|---------|----------|------|------|
>      |zstd       |   39466|       21|      4373|   406|   160|
>      |-----------|--------|---------|----------|------|------|
> 
>      |-----------|--------|---------|----------|------|------|
>      |8 Channels |Total   |down     |throughput| send | recv |
>      |           |time(ms)|time(ms) |(mbps)    | cpu %| cpu% |
>      |-----------|--------|---------|----------|------|------|
>      |qatzip     |   10255|       22|     17037|   280|   590|
>      |-----------|--------|---------|----------|------|------|
>      |zstd       |   20126|       77|      8595|   810|   340|
>      |-----------|--------|---------|----------|------|------|

PS: this 77ms downtime smells like it hit a spike during save/load.  It
doesn't look reproducible compared to the rest of the data.

> 
> If the user has enabled compression in live migration, using QAT
> can save the host CPU resources.
> 
> When compression is enabled, the bottleneck of migration is usually
> the compression throughput on the sender side, since CPU decompression
> throughput is higher than compression, some reference data 
> https://github.com/inikep/lzbench, so more CPU resources need to be 
> allocated to the sender side.

Thank you, Yuan.

> 
> Summary:
> 1. In the 1GBps case, QAT only uses 88% CPU utilization to reach 1GBps, 
>    but ZSTD needs 800%.
> 2. In the 10Gbps case, QAT uses 180% CPU utilization to reach 10GBps
>    But ZSTD still cannot reach 10Gbps even if it uses 810%.

So I assume you meant GBps throughout the test results, as only that
matches the max-bandwidth parameter.

In that case 10GBps is actually 80Gbps, which is not a low-bandwidth
test.
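
To make the unit conversion concrete, here is a minimal Python sketch. The helper name is made up for illustration, and whether the "G" suffix is decimal (10**9) or binary (2**30) is an assumption, so both readings are printed:

```python
def bytes_per_sec_to_gbps(bytes_per_sec: float) -> float:
    """Convert a bytes-per-second rate to gigabits per second (10**9 bits)."""
    return bytes_per_sec * 8 / 1e9


# Assumption: the caps are given in bytes per second with a "G" suffix.
for label, cap in (("1G", 1), ("10G", 10)):
    decimal = bytes_per_sec_to_gbps(cap * 10**9)  # G read as 10**9
    binary = bytes_per_sec_to_gbps(cap * 2**30)   # G read as 2**30
    print(f"{label}: {decimal:.2f} Gbps (decimal G), {binary:.2f} Gbps (binary G)")
```

Either way, a 10G cap in bytes per second is on the order of 80Gbps, and a 1G cap is roughly 8 to 8.6 Gbps, which is close to the ~8.4k to ~8.6k mbps throughput reported in the 8-channel 1GBps table.
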

The case I'm most curious about is no compression on a low-bandwidth
network.  Would you mind running one more test with the same workload, but
with: no-comp, 8 channels, 10Gbps (or 1GBps)?

I think multifd shouldn't matter much in this case, but let's still enable
it and treat that as the baseline / default setup.  I would expect the
result to show a clear win for the compressors, but it is worth checking.
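
As a rough idea of what the no-comp baseline could look like, a back-of-the-envelope lower bound is simply the data size divided by the link rate. The sketch below assumes 54G means 54 GiB, and it ignores dirty-page retransmission, zero-page handling, and protocol overhead, so it is only a sanity estimate, not a prediction. The function name is hypothetical:

```python
def transfer_time_sec(size_bytes: float, link_gbps: float) -> float:
    """Lower bound on the time to push size_bytes over a link_gbps link."""
    return size_bytes * 8 / (link_gbps * 1e9)


# Assumption: the 54G workload is 54 GiB of incompressible-on-the-wire data.
workload_bytes = 54 * 2**30

for gbps in (10, 8):
    secs = transfer_time_sec(workload_bytes, gbps)
    print(f"no-comp lower bound at {gbps} Gbps: {secs:.1f} s")
```

Under these assumptions the bound is roughly 46 seconds at 10Gbps, more than double the ~20 seconds qatzip took with 8 channels at the 1GBps cap.
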

> 3. The QAT decompression CPU utilization is higher than compression and ZSTD,
>    from my analysis
>    3.1 when using QAT compression, the data needs to be copied to the QAT
>        memory (for DMA operations), and the same for decompression. However,
>        do_user_addr_fault will be triggered during decompression because the
>        QAT decompressed data is copied to the VM address space for the first
>        time, in addition, both compression and decompression are processed by
>        QAT and do not consume CPU resources, so the CPU utilization of the
>        receiver is slightly higher than the sender.

I thought you hit this same issue when working on QPL and I remember you
used -mem-prealloc.  Why not use it here?

>    
>    3.2 Since zstd decompression decompresses data directly into the VM
>        address space, there is one less memory copy than QAT, so the CPU
>        utilization on the receiver is better than QAT. For the 1GBps case,
>        the receiver CPU utilization is 125%, and the memory copy occupies
>        ~80% of CPU utilization.

Hmm, yes, I read that part of the code and thought the copy was a design
decision; the comment said "it is faster".  So it's not?

I think we can definitely submit compression tasks per-page rather than
buffering, if that would be better.

> 
>    I think this is acceptable. Considering the overall CPU usage of the
>    sender and receiver, the QAT benefit is good.

Yes, I don't think there's any major issue blocking this from being
supported; it's more that while we are at it, we'd better figure everything
out.

For example, I think we previously discussed the use case where a 100G*2
network is deployed, but the admin may still want to move some control plane
VMs around using very limited bandwidth for QoS.  In that case, I wonder
whether any of you have thought about using postcopy?  I assume the control
plane workload isn't super critical in this case, or it wouldn't be
provisioned with low bandwidth for migrations, so maybe it would also be fine
to switch to postcopy after one round of precopy on the slow network.

Again, I don't think the answer blocks this feature in any way for anyone
who simply wants to use a compressor; I'm just asking.

Thanks,

-- 
Peter Xu
