On Wed, Jan 03, 2024 at 07:28:47PM +0800, Yuan Liu wrote:
> Hi,

Hi, Yuan,

I have a few comments and questions.  Many of them may be pure questions,
as I don't know enough about these new technologies.

> 
> I am writing to submit a code change aimed at enhancing live migration
> acceleration by leveraging the compression capability of the Intel
> In-Memory Analytics Accelerator (IAA).
> 
> The implementation of the IAA (de)compression code is based on the Intel
> Query Processing Library (QPL), an open-source software project designed
> for high-level software programming of the IAA: https://github.com/intel/qpl
> 
> In the last version, there was some discussion about whether to
> introduce a new compression algorithm for IAA. Since the compression
> algorithm of the IAA hardware is based on deflate, and QPL already
> supports Zlib, in this version I implemented IAA as an accelerator for
> the Zlib compression method. However, for some reasons, QPL is currently
> not compatible with the existing Zlib method: data compressed with Zlib
> cannot be decompressed by QPL, and vice versa.
> 
> I have some concerns about the existing Zlib compression:
>   1. Will you consider having one channel support multi-stream
>      compression? Of course, this may lead to a reduction in compression
>      ratio, but it will allow the hardware to process the streams
>      concurrently. We can have each stream process multiple pages,
>      reducing the loss of compression ratio. For example, 128 pages are
>      divided into 16 streams for independent compression. I will provide
>      early performance data in the next version (v4).

I think Juan used to ask a similar question: how much does this help if
multifd can already achieve some form of concurrency over the pages?
Couldn't the user simply specify more multifd channels if they want to
grant more CPU resources for comp/decomp purposes?

IOW, how many concurrent streams can QPL provide?  What is the suggested
level of concurrency there?
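
Not asking for this to go into the patch; just to make the question
concrete, here is a minimal sketch (assuming the public QPL job API; the
compress_streams() helper and N_STREAMS are made-up names, and error
handling is dropped) of what per-channel multi-stream concurrency could
look like, with each group of pages compressed as an independent deflate
stream:

  /*
   * Hypothetical sketch only: one multifd channel keeping N_STREAMS
   * IAA compression jobs in flight via QPL's asynchronous job API.
   * Error handling and buffer management are omitted.
   */
  #include <qpl/qpl.h>
  #include <stdint.h>
  #include <stdlib.h>

  #define N_STREAMS 16

  static void compress_streams(uint8_t **src, uint32_t *src_len,
                               uint8_t **dst, uint32_t dst_cap,
                               uint32_t *dst_len)
  {
      qpl_job *jobs[N_STREAMS];
      uint32_t job_size;

      qpl_get_job_size(qpl_path_hardware, &job_size);

      /* Submit one independent deflate stream per group of pages. */
      for (int i = 0; i < N_STREAMS; i++) {
          jobs[i] = malloc(job_size);
          qpl_init_job(qpl_path_hardware, jobs[i]);
          jobs[i]->op            = qpl_op_compress;
          jobs[i]->level         = qpl_default_level;
          jobs[i]->next_in_ptr   = src[i];
          jobs[i]->available_in  = src_len[i];
          jobs[i]->next_out_ptr  = dst[i];
          jobs[i]->available_out = dst_cap;
          jobs[i]->flags         = QPL_FLAG_FIRST | QPL_FLAG_LAST |
                                   QPL_FLAG_DYNAMIC_HUFFMAN |
                                   QPL_FLAG_OMIT_VERIFY;
          qpl_submit_job(jobs[i]);    /* async: the device starts working */
      }

      /* Collect results; the device processed the streams concurrently. */
      for (int i = 0; i < N_STREAMS; i++) {
          qpl_wait_job(jobs[i]);
          dst_len[i] = jobs[i]->total_out;
          qpl_fini_job(jobs[i]);
          free(jobs[i]);
      }
  }

Is that roughly the model, and if so, how does it compare to simply adding
more multifd channels?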

> 
>   2. Will you consider using QPL/IAA as an independent compression
>      algorithm instead of an accelerator? In this way, we can better
>      utilize the hardware performance and some of its features, such as
>      IAA's canned mode, in which a Huffman table can be dynamically
>      generated from statistics of the data to improve the compression
>      ratio.

Maybe one more knob will work?  If it's not compatible with the deflate
algo then maybe it should never be the default.  IOW, the accelerator knob
may be extended into this (based on what you already proposed):

  - auto ("qpl" first, "none" second; never "qpl-optimized")
  - none (old zlib)
  - qpl (zlib compatible)
  - qpl-optimized (not zlib compatible)

Then "auto"/"none"/"qpl" will always be compatible, only the last doesn't,
user can select it explicit, but only on both sides of QEMU.
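
Roughly what I have in mind (purely illustrative; the enum values and the
qpl_is_usable() helper below are invented names, not from this series):

  #include <stdbool.h>

  /* Illustrative sketch only: all names below are invented. */
  typedef enum {
      MULTIFD_COMPRESSION_ACCEL_AUTO,          /* "qpl" if usable, else "none" */
      MULTIFD_COMPRESSION_ACCEL_NONE,          /* plain software zlib          */
      MULTIFD_COMPRESSION_ACCEL_QPL,           /* IAA, zlib-compatible output  */
      MULTIFD_COMPRESSION_ACCEL_QPL_OPTIMIZED, /* IAA native, not compatible   */
  } MultiFDCompressionAccel;

  /* Hypothetical probe; a real one would query IAA/QPL availability. */
  static bool qpl_is_usable(void)
  {
      return false;
  }

  static MultiFDCompressionAccel accel_resolve(MultiFDCompressionAccel accel)
  {
      if (accel == MULTIFD_COMPRESSION_ACCEL_AUTO) {
          /* "auto" must never silently pick the incompatible mode. */
          return qpl_is_usable() ? MULTIFD_COMPRESSION_ACCEL_QPL
                                 : MULTIFD_COMPRESSION_ACCEL_NONE;
      }
      return accel;
  }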

> 
> Test condition:
>   1. Host CPUs are based on Sapphire Rapids, with the frequency locked
>      to 3.4GHz
>   2. VM type: 16 vCPUs and 64GB of memory
>   3. The Idle workload means no workload is running in the VM
>   4. The Redis workload means YCSB workload B + a Redis server are
>      running in the VM; about 20GB or more of memory is used
>   5. Source side migration configuration commands
>      a. migrate_set_capability multifd on
>      b. migrate_set_parameter multifd-channels 2/4/8
>      c. migrate_set_parameter downtime-limit 300
>      d. migrate_set_parameter multifd-compression zlib
>      e. migrate_set_parameter multifd-compression-accel none/qpl
>      f. migrate_set_parameter max-bandwidth 100G
>   6. Destination side migration configuration commands
>      a. migrate_set_capability multifd on
>      b. migrate_set_parameter multifd-channels 2/4/8
>      c. migrate_set_parameter multifd-compression zlib
>      d. migrate_set_parameter multifd-compression-accel none/qpl
>      e. migrate_set_parameter max-bandwidth 100G

How is the zlib level set up?  Default (1)?

Btw, it seems neither the zlib nor the zstd level can actually be
configured right now.. probably overlooked in migrate_params_apply().
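
For reference, I'd expect something like the below in
migrate_params_apply(), mirroring how the other parameters are copied over
(untested and written from memory, so the field names may be off):

  if (params->has_multifd_zlib_level) {
      s->parameters.multifd_zlib_level = params->multifd_zlib_level;
  }
  if (params->has_multifd_zstd_level) {
      s->parameters.multifd_zstd_level = params->multifd_zstd_level;
  }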

> 
> Early migration result, each result is the average of three tests
>  +--------+-------------+--------+--------+---------+----------+
>  |        | The number  |total   |downtime|network  |pages per |
>  |        | of channels |time(ms)|(ms)    |bandwidth|second    |
>  |        | and mode    |        |        |(mbps)   |          |
>  |        +-------------+--------+--------+---------+----------+
>  |        | 2 chl, Zlib | 20647  | 22     | 195     | 137767   |
>  |        +-------------+--------+--------+---------+----------+
>  | Idle   | 2 chl, IAA  | 17022  | 36     | 286     | 460289   |
>  |workload+-------------+--------+--------+---------+----------+
>  |        | 4 chl, Zlib | 18835  | 29     | 241     | 299028   |
>  |        +-------------+--------+--------+---------+----------+
>  |        | 4 chl, IAA  | 16280  | 32     | 298     | 652456   |
>  |        +-------------+--------+--------+---------+----------+
>  |        | 8 chl, Zlib | 17379  | 32     | 275     | 470591   |
>  |        +-------------+--------+--------+---------+----------+
>  |        | 8 chl, IAA  | 15551  | 46     | 313     | 1315784  |

The numbers are slightly confusing to me.  If IAA can send ~3x more pages
per second, shouldn't the total migration time be roughly 1/3 of the
other's when the guest is idle?  E.g. with 2 channels, IAA reports 460289
vs 137767 pages per second (~3.3x), yet the total time only drops from
20647ms to 17022ms.  The total times seem to be pretty close no matter the
number of channels.  Maybe I missed something?

>  +--------+-------------+--------+--------+---------+----------+
> 
>  +--------+-------------+--------+--------+---------+----------+
>  |        | The number  |total   |downtime|network  |pages per |
>  |        | of channels |time(ms)|(ms)    |bandwidth|second    |
>  |        | and mode    |        |        |(mbps)   |          |
>  |        +-------------+--------+--------+---------+----------+
>  |        | 2 chl, Zlib | 100% failure, timeout is 120s        |
>  |        +-------------+--------+--------+---------+----------+
>  | Redis  | 2 chl, IAA  | 62737  | 115    | 4547    | 387911   |
>  |workload+-------------+--------+--------+---------+----------+
>  |        | 4 chl, Zlib | 30% failure, timeout is 120s         |
>  |        +-------------+--------+--------+---------+----------+
>  |        | 4 chl, IAA  | 54645  | 177    | 5382    | 656865   |
>  |        +-------------+--------+--------+---------+----------+
>  |        | 8 chl, Zlib | 93488  | 74     | 1264    | 129486   |
>  |        +-------------+--------+--------+---------+----------+
>  |        | 8 chl, IAA  | 24367  | 303    | 6901    | 964380   |
>  +--------+-------------+--------+--------+---------+----------+

The Redis results look much more favorable for IAA than the idle tests do.
Does that mean IAA works less well with zero pages in general (assuming
those make up the majority in the idle test)?

From the manual, I see that IAA also supports encryption/decryption.  Would
it be able to accelerate TLS?

How should one choose between IAA and QAT?  What is the major difference?
I see that IAA requires IOMMU scalable mode; why?  Is it because the IAA HW
is something attached to the PCIe bus (I assume QAT is the same)?

Thanks,

-- 
Peter Xu

