> -----Original Message-----
> From: Fabiano Rosas <faro...@suse.de>
> Sent: Thursday, June 6, 2024 9:52 PM
> To: Liu, Yuan1 <yuan1....@intel.com>; pet...@redhat.com;
> pbonz...@redhat.com; marcandre.lur...@redhat.com; berra...@redhat.com;
> th...@redhat.com; phi...@linaro.org
> Cc: qemu-devel@nongnu.org; Zou, Nanhai <nanhai....@intel.com>;
> shameerali.kolothum.th...@huawei.com
> Subject: RE: [PATCH v7 6/7] migration/multifd: implement qpl compression
> and decompression
> 
> "Liu, Yuan1" <yuan1....@intel.com> writes:
> 
> >> -----Original Message-----
> >> From: Fabiano Rosas <faro...@suse.de>
> >> Sent: Thursday, June 6, 2024 6:26 AM
> >> To: Liu, Yuan1 <yuan1....@intel.com>; pet...@redhat.com;
> >> pbonz...@redhat.com; marcandre.lur...@redhat.com; berra...@redhat.com;
> >> th...@redhat.com; phi...@linaro.org
> >> Cc: qemu-devel@nongnu.org; Liu, Yuan1 <yuan1....@intel.com>; Zou, Nanhai
> >> <nanhai....@intel.com>; shameerali.kolothum.th...@huawei.com
> >> Subject: Re: [PATCH v7 6/7] migration/multifd: implement qpl compression
> >> and decompression
> >>
> >> Yuan Liu <yuan1....@intel.com> writes:
> >>
> >> > QPL compression and decompression will use IAA hardware first.
> >> > If IAA hardware is not available, it will automatically fall
> >> > back to QPL software path, if the software job also fails,
> >> > the uncompressed page is sent directly.
> >> >
> >> > Signed-off-by: Yuan Liu <yuan1....@intel.com>
> >> > Reviewed-by: Nanhai Zou <nanhai....@intel.com>
> >> > ---
> >> >  migration/multifd-qpl.c | 412 +++++++++++++++++++++++++++++++++++++++-
> >> >  1 file changed, 408 insertions(+), 4 deletions(-)
> >> >
> >> > diff --git a/migration/multifd-qpl.c b/migration/multifd-qpl.c
> >> > index 6791a204d5..18b3384bd5 100644
> >> > --- a/migration/multifd-qpl.c
> >> > +++ b/migration/multifd-qpl.c
> >> > @@ -13,9 +13,14 @@
> >> >  #include "qemu/osdep.h"
> >> >  #include "qemu/module.h"
> >> >  #include "qapi/error.h"
> >> > +#include "qapi/qapi-types-migration.h"
> >> > +#include "exec/ramblock.h"
> >> >  #include "multifd.h"
> >> >  #include "qpl/qpl.h"
> >> >
> >> > +/* Maximum number of retries to resubmit a job if IAA work queues are full */
> >> > +#define MAX_SUBMIT_RETRY_NUM (3)
> >> > +
> >> >  typedef struct {
> >> >      /* the QPL hardware path job */
> >> >      qpl_job *job;
> >> > @@ -260,6 +265,219 @@ static void multifd_qpl_send_cleanup(MultiFDSendParams *p, Error **errp)
> >> >      p->iov = NULL;
> >> >  }
> >> >
> >> > +/**
> >> > + * multifd_qpl_prepare_job: prepare the job
> >> > + *
> >> > + * Set the QPL job parameters and properties.
> >> > + *
> >> > + * @job: pointer to the qpl_job structure
> >> > + * @is_compression: indicates compression and decompression
> >> > + * @input: pointer to the input data buffer
> >> > + * @input_len: the length of the input data
> >> > + * @output: pointer to the output data buffer
> >> > + * @output_len: the length of the output data
> >> > + */
> >> > +static void multifd_qpl_prepare_job(qpl_job *job, bool is_compression,
> >> > +                                    uint8_t *input, uint32_t input_len,
> >> > +                                    uint8_t *output, uint32_t output_len)
> >> > +{
> >> > +    job->op = is_compression ? qpl_op_compress : qpl_op_decompress;
> >> > +    job->next_in_ptr = input;
> >> > +    job->next_out_ptr = output;
> >> > +    job->available_in = input_len;
> >> > +    job->available_out = output_len;
> >> > +    job->flags = QPL_FLAG_FIRST | QPL_FLAG_LAST | QPL_FLAG_OMIT_VERIFY;
> >> > +    /* only supports compression level 1 */
> >> > +    job->level = 1;
> >> > +}
> >> > +
> >> > +/**
> >> > + * multifd_qpl_prepare_job: prepare the compression job
> >>
> >> function name is wrong
> >
> > Thanks, I will fix this next version.
> >
> >> > + *
> >> > + * Set the compression job parameters and properties.
> >> > + *
> >> > + * @job: pointer to the qpl_job structure
> >> > + * @input: pointer to the input data buffer
> >> > + * @input_len: the length of the input data
> >> > + * @output: pointer to the output data buffer
> >> > + * @output_len: the length of the output data
> >> > + */
> >> > +static void multifd_qpl_prepare_comp_job(qpl_job *job, uint8_t *input,
> >> > +                                         uint32_t input_len, uint8_t *output,
> >> > +                                         uint32_t output_len)
> >> > +{
> >> > +    multifd_qpl_prepare_job(job, true, input, input_len, output, output_len);
> >> > +}
> >> > +
> >> > +/**
> >> > + * multifd_qpl_prepare_job: prepare the decompression job
> >
> > Thanks, I will fix this next version.
> >
> >> > + *
> >> > + * Set the decompression job parameters and properties.
> >> > + *
> >> > + * @job: pointer to the qpl_job structure
> >> > + * @input: pointer to the input data buffer
> >> > + * @input_len: the length of the input data
> >> > + * @output: pointer to the output data buffer
> >> > + * @output_len: the length of the output data
> >> > + */
> >> > +static void multifd_qpl_prepare_decomp_job(qpl_job *job, uint8_t *input,
> >> > +                                           uint32_t input_len, uint8_t *output,
> >> > +                                           uint32_t output_len)
> >> > +{
> >> > +    multifd_qpl_prepare_job(job, false, input, input_len, output, output_len);
> >> > +}
> >> > +
> >> > +/**
> >> > + * multifd_qpl_fill_iov: fill in the IOV
> >> > + *
> >> > + * Fill in the QPL packet IOV
> >> > + *
> >> > + * @p: Params for the channel being used
> >> > + * @data: pointer to the IOV data
> >> > + * @len: The length of the IOV data
> >> > + */
> >> > +static void multifd_qpl_fill_iov(MultiFDSendParams *p, uint8_t *data,
> >> > +                                 uint32_t len)
> >> > +{
> >> > +    p->iov[p->iovs_num].iov_base = data;
> >> > +    p->iov[p->iovs_num].iov_len = len;
> >> > +    p->iovs_num++;
> >> > +    p->next_packet_size += len;
> >> > +}
> >> > +
> >> > +/**
> >> > + * multifd_qpl_fill_packet: fill the compressed page into the QPL packet
> >> > + *
> >> > + * Fill the compressed page length and IOV into the QPL packet
> >> > + *
> >> > + * @idx: The index of the compressed length array
> >> > + * @p: Params for the channel being used
> >> > + * @data: pointer to the compressed page buffer
> >> > + * @len: The length of the compressed page
> >> > + */
> >> > +static void multifd_qpl_fill_packet(uint32_t idx, MultiFDSendParams *p,
> >> > +                                    uint8_t *data, uint32_t len)
> >> > +{
> >> > +    QplData *qpl = p->compress_data;
> >> > +
> >> > +    qpl->zlen[idx] = cpu_to_be32(len);
> >> > +    multifd_qpl_fill_iov(p, data, len);
> >> > +}
> >> > +
> >> > +/**
> >> > + * multifd_qpl_submit_job: submit a job to the hardware
> >> > + *
> >> > + * Submit a QPL hardware job to the IAA device
> >> > + *
> >> > + * Returns true if the job is submitted successfully, otherwise false.
> >> > + *
> >> > + * @job: pointer to the qpl_job structure
> >> > + */
> >> > +static bool multifd_qpl_submit_job(qpl_job *job)
> >> > +{
> >> > +    qpl_status status;
> >> > +    uint32_t num = 0;
> >> > +
> >> > +retry:
> >> > +    status = qpl_submit_job(job);
> >> > +    if (status == QPL_STS_QUEUES_ARE_BUSY_ERR) {
> >> > +        if (num < MAX_SUBMIT_RETRY_NUM) {
> >> > +            num++;
> >> > +            goto retry;
> >> > +        }
> >> > +    }
> >> > +    return (status == QPL_STS_OK);
> >>
> >> How often do we expect this to fail? Will the queues be busy frequently
> >> or is this an unlikely event? I'm thinking whether we really need to
> >> allow a fallback for the hw path. Sorry if this has been discussed
> >> already, I don't remember.
> >
> > In some scenarios, this may happen frequently, such as configuring 4
> > channels but only one IAA device is available. In the case of insufficient
> > IAA hardware resources, retry and fallback can help optimize performance.
> > I have a comparison test below
> >
> > 1. Retry + SW fallback:
> >    total time: 14649 ms
> >    downtime: 25 ms
> >    throughput: 17666.57 mbps
> >    pages-per-second: 1509647
> >
> > 2. No fallback, always wait for work queues to become available
> >    total time: 18381 ms
> >    downtime: 25 ms
> >    throughput: 13698.65 mbps
> >    pages-per-second: 859607
> 
> Thanks for the data, this is helpful. Let's include it in the commit
> message, it's important to let people know you actually did that
> analysis. I put a suggestion below:
> 
> ---
> QPL compression and decompression will use IAA hardware path if the IAA
> hardware is available. Otherwise the QPL library software path is used.
> 
> The hardware path will automatically fall back to QPL software path if
> the IAA queues are busy. In some scenarios, this may happen frequently,
> such as configuring 4 channels but only one IAA device is available. In
> the case of insufficient IAA hardware resources, retry and fallback can
> help optimize performance:
> 
>  1. Retry + SW fallback:
>     total time: 14649 ms
>     downtime: 25 ms
>     throughput: 17666.57 mbps
>     pages-per-second: 1509647
> 
>  2. No fallback, always wait for work queues to become available
>     total time: 18381 ms
>     downtime: 25 ms
>     throughput: 13698.65 mbps
>     pages-per-second: 859607
> 
> If both the hardware and software paths fail, the uncompressed page is
> sent directly.

Thank you very much for your comments; I will add these to the commit message.

> >> > +}
> >> > +
> >> > +/**
> >> > + * multifd_qpl_compress_pages_slow_path: compress pages using slow path
> >> > + *
> >> > + * Compress the pages using software. If compression fails, the page will
> >> > + * be sent directly.
> >> > + *
> >> > + * @p: Params for the channel being used
> >> > + */
> >> > +static void multifd_qpl_compress_pages_slow_path(MultiFDSendParams *p)
> >> > +{
> >> > +    QplData *qpl = p->compress_data;
> >> > +    uint32_t size = p->page_size;
> >> > +    qpl_job *job = qpl->sw_job;
> >> > +    uint8_t *zbuf = qpl->zbuf;
> >> > +    uint8_t *buf;
> >> > +
> >> > +    for (int i = 0; i < p->pages->normal_num; i++) {
> >> > +        buf = p->pages->block->host + p->pages->offset[i];
> >> > +        /* Set output length to less than the page to reduce decompression */
> >> > +        multifd_qpl_prepare_comp_job(job, buf, size, zbuf, size - 1);
> >> > +        if (qpl_execute_job(job) == QPL_STS_OK) {
> >> > +            multifd_qpl_fill_packet(i, p, zbuf, job->total_out);
> >> > +        } else {
> >> > +            /* send the page directly */
> >>
> >> s/directly/uncompressed/
> >>
> >> a bit clearer.
> >
> > Sure, I will fix it next version.
> >
> >> > +            multifd_qpl_fill_packet(i, p, buf, size);
> >> > +        }
> >> > +        zbuf += size;
> >> > +    }
> >> > +}
> >> > +
> >> > +/**
> >> > + * multifd_qpl_compress_pages: compress pages
> >> > + *
> >> > + * Submit the pages to the IAA hardware for compression. If hardware
> >> > + * compression fails, it falls back to software compression. If software
> >> > + * compression also fails, the page is sent directly
> >> > + *
> >> > + * @p: Params for the channel being used
> >> > + */
> >> > +static void multifd_qpl_compress_pages(MultiFDSendParams *p)
> >> > +{
> >> > +    QplData *qpl = p->compress_data;
> >> > +    MultiFDPages_t *pages = p->pages;
> >> > +    uint32_t size = p->page_size;
> >> > +    QplHwJob *hw_job;
> >> > +    uint8_t *buf;
> >> > +    uint8_t *zbuf;
> >> > +
> >>
> >> Let's document the output size choice more explicitly:
> >>
> >>     /*
> >>      * Set output length to less than the page size to force the job to
> >>      * fail in case it compresses to a larger size. We'll send that page
> >>      * without compression and skip the decompression operation on the
> >>      * destination.
> >>      */
> >>      out_size = size - 1;
> >>
> >> you can then omit the other comments.
> >
> > Thanks for the comments, I will refine this next version.
> >
> >> > +    for (int i = 0; i < pages->normal_num; i++) {
> >> > +        buf = pages->block->host + pages->offset[i];
> >> > +        zbuf = qpl->zbuf + (size * i);
> >> > +        hw_job = &qpl->hw_jobs[i];
> >> > +        /* Set output length to less than the page to reduce decompression */
> >> > +        multifd_qpl_prepare_comp_job(hw_job->job, buf, size, zbuf, size - 1);
> >> > +        if (multifd_qpl_submit_job(hw_job->job)) {
> >> > +            hw_job->fallback_sw_path = false;
> >> > +        } else {
> >> > +            hw_job->fallback_sw_path = true;
> >> > +            /* Set output length less than page size to reduce decompression */
> >> > +            multifd_qpl_prepare_comp_job(qpl->sw_job, buf, size, zbuf,
> >> > +                                         size - 1);
> >> > +            if (qpl_execute_job(qpl->sw_job) == QPL_STS_OK) {
> >> > +                hw_job->sw_output = zbuf;
> >> > +                hw_job->sw_output_len = qpl->sw_job->total_out;
> >> > +            } else {
> >> > +                hw_job->sw_output = buf;
> >> > +                hw_job->sw_output_len = size;
> >> > +            }
> >>
> >> Hmm, these look a bit cumbersome, would it work if we moved the fallback
> >> qpl_execute_job() down into the other loop? We could then avoid the
> >> extra fields. Something like:
> >>
> >> static void multifd_qpl_compress_pages(MultiFDSendParams *p)
> >> {
> >>     QplData *qpl = p->compress_data;
> >>     MultiFDPages_t *pages = p->pages;
> >>     uint32_t out_size, size = p->page_size;
> >>     uint8_t *buf, *zbuf;
> >>
> >>     /*
> >>      * Set output length to less than the page size to force the job to
> >>      * fail in case it compresses to a larger size. We'll send that page
> >>      * without compression to skip the decompression operation on the
> >>      * destination.
> >>      */
> >>     out_size = size - 1;
> >>
> >>     for (int i = 0; i < pages->normal_num; i++) {
> >>         QplHwJob *hw_job = &qpl->hw_jobs[i];
> >>
> >>         hw_job->fallback_sw_path = false;
> >>         buf = pages->block->host + pages->offset[i];
> >>         zbuf = qpl->zbuf + (size * i);
> >>
> >>         multifd_qpl_prepare_comp_job(hw_job->job, buf, size, zbuf,
> >>                                      out_size);
> >>
> >>         if (!multifd_qpl_submit_job(hw_job->job)) {
> >>             hw_job->fallback_sw_path = true;
> >>         }
> >>     }
> >>
> >>     for (int i = 0; i < pages->normal_num; i++) {
> >>         QplHwJob *hw_job = &qpl->hw_jobs[i];
> >>         qpl_job *job;
> >>         qpl_status ret;
> >>
> >>         buf = pages->block->host + pages->offset[i];
> >>         zbuf = qpl->zbuf + (size * i);
> >>
> >>         if (hw_job->fallback_sw_path) {
> >>             job = qpl->sw_job;
> >>             multifd_qpl_prepare_comp_job(job, buf, size, zbuf, out_size);
> >>             ret = qpl_execute_job(job);
> >>         } else {
> >>             job = hw_job->job;
> >>             ret = qpl_wait_job(job);
> >>         }
> >>
> >>         if (ret == QPL_STS_OK) {
> >>             multifd_qpl_fill_packet(i, p, zbuf, job->total_out);
> >>         } else {
> >>             multifd_qpl_fill_packet(i, p, buf, size);
> >>         }
> >>     }
> >> }
> >
> > Thanks very much for the reference code. I have tested it and the
> > performance is not good. When the work queue is full, after a hardware job
> > fails to be submitted, subsequent job submissions will most likely fail as
> > well, so my idea is to switch to software job execution immediately, while
> > all subsequent jobs still give priority to the hardware path.
> 
> So let me see if I get this, you're saying that going with the sw path
> immediately after a hw path failure is beneficial because the time it
> takes to call the sw path serves as a backoff time for the hw path?

Exactly, I want to use the sw path as the backoff time for the hardware path.
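
To make the intended flow concrete, here is a minimal sketch of the per-page
submission loop (reusing the hw_jobs/sw_job fields from this patch; the
fill-packet step in the second loop stays as in the patch):

    for (int i = 0; i < pages->normal_num; i++) {
        QplHwJob *hw_job = &qpl->hw_jobs[i];
        uint8_t *buf = pages->block->host + pages->offset[i];
        uint8_t *zbuf = qpl->zbuf + (size * i);

        multifd_qpl_prepare_comp_job(hw_job->job, buf, size, zbuf, size - 1);
        if (multifd_qpl_submit_job(hw_job->job)) {
            hw_job->fallback_sw_path = false;
            continue;
        }
        /*
         * The queue is full; compress this page in software right away.
         * The time spent in qpl_execute_job() also acts as backoff before
         * the next hardware submission attempt.
         */
        hw_job->fallback_sw_path = true;
        multifd_qpl_prepare_comp_job(qpl->sw_job, buf, size, zbuf, size - 1);
        if (qpl_execute_job(qpl->sw_job) == QPL_STS_OK) {
            hw_job->sw_output = zbuf;
            hw_job->sw_output_len = qpl->sw_job->total_out;
        } else {
            hw_job->sw_output = buf;
            hw_job->sw_output_len = size;
        }
    }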

> Do you have an idea on the time difference of waiting for sw path
> vs. introducing a delay to multifd_qpl_submit_job()? Aren't we leaving
> performance on the table by going with a much slower sw path instead of
> waiting for the queues to open up? Or some other strategy, such as going
> once again over the not-submitted pages.

Choosing a specific delay time that guarantees performance is difficult right
now, because this solution only supports the shared work queue mode: while live
migration waits for the specified delay, other workloads may still fill the
device work queue, so the next live migration job submission can fail anyway.

I agree with your point. Using the software path as backoff may also cost
performance because of the software path overhead. I will consider how to
improve this in the future.
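
For reference, a delay-based variant I may experiment with later could look
roughly like the sketch below (hypothetical and untested; the 100us backoff is
a placeholder value, not tuned):

    static bool multifd_qpl_submit_job(qpl_job *job)
    {
        qpl_status status;
        uint32_t num = 0;

        for (;;) {
            status = qpl_submit_job(job);
            if (status != QPL_STS_QUEUES_ARE_BUSY_ERR ||
                num >= MAX_SUBMIT_RETRY_NUM) {
                break;
            }
            num++;
            /* placeholder backoff, needs tuning against real workloads */
            g_usleep(100);
        }
        return (status == QPL_STS_OK);
    }

The open question remains how to pick the delay when other workloads share the
same work queue.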

> I understand there's a tradeoff here between your effort to investigate
> these things and the amount of performance to be had, so feel free to
> leave this question unanswered. We could choose to simply document this
> with a comment:

Sure, I'll add this comment so I can improve it in the future.

>     if (multifd_qpl_submit_job(hw_job->job)) {
>         hw_job->fallback_sw_path = false;
>         continue;
>     }
> 
>     /*
>      * The IAA work queue is full, any immediate subsequent job
>      * submission is likely to fail, sending the page via the QPL
>      * software path at this point gives us a better chance of
>      * finding the queue open for the next pages.
>      */
>     hw_job->fallback_sw_path = true;
>     ...
> 
> > There is almost no overhead in job submission because Intel uses the new
> > "enqcmd" instruction, which allows the user program to submit the job
> > directly to the hardware.
> >
> > With the reference code, when a job fails to be submitted, there is a high
> > probability that ALL subsequent jobs will also fail to be submitted and
> > fall back to software compression, so the IAA hardware is not fully
> > utilized.
> >
> > For the 4-channel, 1-IAA-device test case, the reference code reduces IAA
> > throughput from 3.4 GBps to 2.2 GBps, which hurts live migration
> > performance (total time goes from 14s to 18s).
> >
