> -----Original Message-----
> From: Fabiano Rosas <faro...@suse.de>
> Sent: Thursday, June 6, 2024 9:52 PM
> To: Liu, Yuan1 <yuan1....@intel.com>; pet...@redhat.com;
> pbonz...@redhat.com; marcandre.lur...@redhat.com; berra...@redhat.com;
> th...@redhat.com; phi...@linaro.org
> Cc: qemu-devel@nongnu.org; Zou, Nanhai <nanhai....@intel.com>;
> shameerali.kolothum.th...@huawei.com
> Subject: RE: [PATCH v7 6/7] migration/multifd: implement qpl compression
> and decompression
>
> "Liu, Yuan1" <yuan1....@intel.com> writes:
>
> >> -----Original Message-----
> >> From: Fabiano Rosas <faro...@suse.de>
> >> Sent: Thursday, June 6, 2024 6:26 AM
> >> To: Liu, Yuan1 <yuan1....@intel.com>; pet...@redhat.com;
> >> pbonz...@redhat.com; marcandre.lur...@redhat.com; berra...@redhat.com;
> >> th...@redhat.com; phi...@linaro.org
> >> Cc: qemu-devel@nongnu.org; Liu, Yuan1 <yuan1....@intel.com>; Zou, Nanhai
> >> <nanhai....@intel.com>; shameerali.kolothum.th...@huawei.com
> >> Subject: Re: [PATCH v7 6/7] migration/multifd: implement qpl compression
> >> and decompression
> >>
> >> Yuan Liu <yuan1....@intel.com> writes:
> >>
> >> > QPL compression and decompression will use IAA hardware first.
> >> > If IAA hardware is not available, it will automatically fall
> >> > back to QPL software path, if the software job also fails,
> >> > the uncompressed page is sent directly.
> >> >
> >> > Signed-off-by: Yuan Liu <yuan1....@intel.com>
> >> > Reviewed-by: Nanhai Zou <nanhai....@intel.com>
> >> > ---
> >> >  migration/multifd-qpl.c | 412 +++++++++++++++++++++++++++++++++++++++-
> >> >  1 file changed, 408 insertions(+), 4 deletions(-)
> >> >
> >> > diff --git a/migration/multifd-qpl.c b/migration/multifd-qpl.c
> >> > index 6791a204d5..18b3384bd5 100644
> >> > --- a/migration/multifd-qpl.c
> >> > +++ b/migration/multifd-qpl.c
> >> > @@ -13,9 +13,14 @@
> >> >  #include "qemu/osdep.h"
> >> >  #include "qemu/module.h"
> >> >  #include "qapi/error.h"
> >> > +#include "qapi/qapi-types-migration.h"
> >> > +#include "exec/ramblock.h"
> >> >  #include "multifd.h"
> >> >  #include "qpl/qpl.h"
> >> >
> >> > +/* Maximum number of retries to resubmit a job if IAA work queues are full */
> >> > +#define MAX_SUBMIT_RETRY_NUM (3)
> >> > +
> >> >  typedef struct {
> >> >      /* the QPL hardware path job */
> >> >      qpl_job *job;
> >> > @@ -260,6 +265,219 @@ static void multifd_qpl_send_cleanup(MultiFDSendParams *p, Error **errp)
> >> >      p->iov = NULL;
> >> >  }
> >> >
> >> > +/**
> >> > + * multifd_qpl_prepare_job: prepare the job
> >> > + *
> >> > + * Set the QPL job parameters and properties.
> >> > + *
> >> > + * @job: pointer to the qpl_job structure
> >> > + * @is_compression: indicates compression and decompression
> >> > + * @input: pointer to the input data buffer
> >> > + * @input_len: the length of the input data
> >> > + * @output: pointer to the output data buffer
> >> > + * @output_len: the length of the output data
> >> > + */
> >> > +static void multifd_qpl_prepare_job(qpl_job *job, bool is_compression,
> >> > +                                    uint8_t *input, uint32_t input_len,
> >> > +                                    uint8_t *output, uint32_t output_len)
> >> > +{
> >> > +    job->op = is_compression ? qpl_op_compress : qpl_op_decompress;
> >> > +    job->next_in_ptr = input;
> >> > +    job->next_out_ptr = output;
> >> > +    job->available_in = input_len;
> >> > +    job->available_out = output_len;
> >> > +    job->flags = QPL_FLAG_FIRST | QPL_FLAG_LAST | QPL_FLAG_OMIT_VERIFY;
> >> > +    /* only supports compression level 1 */
> >> > +    job->level = 1;
> >> > +}
> >> > +
> >> > +/**
> >> > + * multifd_qpl_prepare_job: prepare the compression job
> >>
> >> function name is wrong
> >
> > Thanks, I will fix this next version.
> >
> >> > + *
> >> > + * Set the compression job parameters and properties.
> >> > + *
> >> > + * @job: pointer to the qpl_job structure
> >> > + * @input: pointer to the input data buffer
> >> > + * @input_len: the length of the input data
> >> > + * @output: pointer to the output data buffer
> >> > + * @output_len: the length of the output data
> >> > + */
> >> > +static void multifd_qpl_prepare_comp_job(qpl_job *job, uint8_t *input,
> >> > +                                         uint32_t input_len, uint8_t *output,
> >> > +                                         uint32_t output_len)
> >> > +{
> >> > +    multifd_qpl_prepare_job(job, true, input, input_len, output, output_len);
> >> > +}
> >> > +
> >> > +/**
> >> > + * multifd_qpl_prepare_job: prepare the decompression job
> >
> > Thanks, I will fix this next version.
> >
> >> > + *
> >> > + * Set the decompression job parameters and properties.
> >> > + *
> >> > + * @job: pointer to the qpl_job structure
> >> > + * @input: pointer to the input data buffer
> >> > + * @input_len: the length of the input data
> >> > + * @output: pointer to the output data buffer
> >> > + * @output_len: the length of the output data
> >> > + */
> >> > +static void multifd_qpl_prepare_decomp_job(qpl_job *job, uint8_t *input,
> >> > +                                           uint32_t input_len, uint8_t *output,
> >> > +                                           uint32_t output_len)
> >> > +{
> >> > +    multifd_qpl_prepare_job(job, false, input, input_len, output, output_len);
> >> > +}
> >> > +
> >> > +/**
> >> > + * multifd_qpl_fill_iov: fill in the IOV
> >> > + *
> >> > + * Fill in the QPL packet IOV
> >> > + *
> >> > + * @p: Params for the channel being used
> >> > + * @data: pointer to the IOV data
> >> > + * @len: The length of the IOV data
> >> > + */
> >> > +static void multifd_qpl_fill_iov(MultiFDSendParams *p, uint8_t *data,
> >> > +                                 uint32_t len)
> >> > +{
> >> > +    p->iov[p->iovs_num].iov_base = data;
> >> > +    p->iov[p->iovs_num].iov_len = len;
> >> > +    p->iovs_num++;
> >> > +    p->next_packet_size += len;
> >> > +}
> >> > +
> >> > +/**
> >> > + * multifd_qpl_fill_packet: fill the compressed page into the QPL packet
> >> > + *
> >> > + * Fill the compressed page length and IOV into the QPL packet
> >> > + *
> >> > + * @idx: The index of the compressed length array
> >> > + * @p: Params for the channel being used
> >> > + * @data: pointer to the compressed page buffer
> >> > + * @len: The length of the compressed page
> >> > + */
> >> > +static void multifd_qpl_fill_packet(uint32_t idx, MultiFDSendParams *p,
> >> > +                                    uint8_t *data, uint32_t len)
> >> > +{
> >> > +    QplData *qpl = p->compress_data;
> >> > +
> >> > +    qpl->zlen[idx] = cpu_to_be32(len);
> >> > +    multifd_qpl_fill_iov(p, data, len);
> >> > +}
> >> > +
> >> > +/**
> >> > + * multifd_qpl_submit_job: submit a job to the hardware
> >> > + *
> >> > + * Submit a QPL hardware job to the IAA device
> >> > + *
> >> > + * Returns true if the job is submitted successfully, otherwise false.
> >> > + *
> >> > + * @job: pointer to the qpl_job structure
> >> > + */
> >> > +static bool multifd_qpl_submit_job(qpl_job *job)
> >> > +{
> >> > +    qpl_status status;
> >> > +    uint32_t num = 0;
> >> > +
> >> > +retry:
> >> > +    status = qpl_submit_job(job);
> >> > +    if (status == QPL_STS_QUEUES_ARE_BUSY_ERR) {
> >> > +        if (num < MAX_SUBMIT_RETRY_NUM) {
> >> > +            num++;
> >> > +            goto retry;
> >> > +        }
> >> > +    }
> >> > +    return (status == QPL_STS_OK);
> >>
> >> How often do we expect this to fail? Will the queues be busy frequently
> >> or is this an unlikely event? I'm thinking whether we really need to
> >> allow a fallback for the hw path. Sorry if this has been discussed
> >> already, I don't remember.
> >
> > In some scenarios, this may happen frequently, such as configuring 4
> > channels but only one IAA device is available. In the case of
> > insufficient IAA hardware resources, retry and fallback can help
> > optimize performance. I have a comparison test below:
> >
> > 1. Retry + SW fallback:
> >    total time: 14649 ms
> >    downtime: 25 ms
> >    throughput: 17666.57 mbps
> >    pages-per-second: 1509647
> >
> > 2. No fallback, always wait for work queues to become available
> >    total time: 18381 ms
> >    downtime: 25 ms
> >    throughput: 13698.65 mbps
> >    pages-per-second: 859607
>
> Thanks for the data, this is helpful. Let's include it in the commit
> message, it's important to let people know you actually did that
> analysis. I put a suggestion below:
>
> ---
> QPL compression and decompression will use IAA hardware path if the IAA
> hardware is available. Otherwise the QPL library software path is used.
>
> The hardware path will automatically fall back to QPL software path if
> the IAA queues are busy. In some scenarios, this may happen frequently,
> such as configuring 4 channels but only one IAA device is available.
> In the case of insufficient IAA hardware resources, retry and fallback
> can help optimize performance:
>
> 1. Retry + SW fallback:
>    total time: 14649 ms
>    downtime: 25 ms
>    throughput: 17666.57 mbps
>    pages-per-second: 1509647
>
> 2. No fallback, always wait for work queues to become available
>    total time: 18381 ms
>    downtime: 25 ms
>    throughput: 13698.65 mbps
>    pages-per-second: 859607
>
> If both the hardware and software paths fail, the uncompressed page is
> sent directly.
Thank you very much for your comments; I will add these to the commit message.

> >> > +}
> >> > +
> >> > +/**
> >> > + * multifd_qpl_compress_pages_slow_path: compress pages using slow path
> >> > + *
> >> > + * Compress the pages using software. If compression fails, the page will
> >> > + * be sent directly.
> >> > + *
> >> > + * @p: Params for the channel being used
> >> > + */
> >> > +static void multifd_qpl_compress_pages_slow_path(MultiFDSendParams *p)
> >> > +{
> >> > +    QplData *qpl = p->compress_data;
> >> > +    uint32_t size = p->page_size;
> >> > +    qpl_job *job = qpl->sw_job;
> >> > +    uint8_t *zbuf = qpl->zbuf;
> >> > +    uint8_t *buf;
> >> > +
> >> > +    for (int i = 0; i < p->pages->normal_num; i++) {
> >> > +        buf = p->pages->block->host + p->pages->offset[i];
> >> > +        /* Set output length to less than the page to reduce decompression */
> >> > +        multifd_qpl_prepare_comp_job(job, buf, size, zbuf, size - 1);
> >> > +        if (qpl_execute_job(job) == QPL_STS_OK) {
> >> > +            multifd_qpl_fill_packet(i, p, zbuf, job->total_out);
> >> > +        } else {
> >> > +            /* send the page directly */
> >>
> >> s/directly/uncompressed/
> >>
> >> a bit clearer.
> >
> > Sure, I will fix it next version.
> >
> >> > +            multifd_qpl_fill_packet(i, p, buf, size);
> >> > +        }
> >> > +        zbuf += size;
> >> > +    }
> >> > +}
> >> > +
> >> > +/**
> >> > + * multifd_qpl_compress_pages: compress pages
> >> > + *
> >> > + * Submit the pages to the IAA hardware for compression. If hardware
> >> > + * compression fails, it falls back to software compression.
> >> > + * If software compression also fails, the page is sent directly.
> >> > + *
> >> > + * @p: Params for the channel being used
> >> > + */
> >> > +static void multifd_qpl_compress_pages(MultiFDSendParams *p)
> >> > +{
> >> > +    QplData *qpl = p->compress_data;
> >> > +    MultiFDPages_t *pages = p->pages;
> >> > +    uint32_t size = p->page_size;
> >> > +    QplHwJob *hw_job;
> >> > +    uint8_t *buf;
> >> > +    uint8_t *zbuf;
> >> > +
> >>
> >> Let's document the output size choice more explicitly:
> >>
> >>     /*
> >>      * Set output length to less than the page size to force the job to
> >>      * fail in case it compresses to a larger size. We'll send that page
> >>      * without compression and skip the decompression operation on the
> >>      * destination.
> >>      */
> >>     out_size = size - 1;
> >>
> >> you can then omit the other comments.
> >
> > Thanks for the comments, I will refine this next version.
> >
> >> > +    for (int i = 0; i < pages->normal_num; i++) {
> >> > +        buf = pages->block->host + pages->offset[i];
> >> > +        zbuf = qpl->zbuf + (size * i);
> >> > +        hw_job = &qpl->hw_jobs[i];
> >> > +        /* Set output length to less than the page to reduce decompression */
> >> > +        multifd_qpl_prepare_comp_job(hw_job->job, buf, size, zbuf, size - 1);
> >> > +        if (multifd_qpl_submit_job(hw_job->job)) {
> >> > +            hw_job->fallback_sw_path = false;
> >> > +        } else {
> >> > +            hw_job->fallback_sw_path = true;
> >> > +            /* Set output length less than page size to reduce decompression */
> >> > +            multifd_qpl_prepare_comp_job(qpl->sw_job, buf, size, zbuf,
> >> > +                                         size - 1);
> >> > +            if (qpl_execute_job(qpl->sw_job) == QPL_STS_OK) {
> >> > +                hw_job->sw_output = zbuf;
> >> > +                hw_job->sw_output_len = qpl->sw_job->total_out;
> >> > +            } else {
> >> > +                hw_job->sw_output = buf;
> >> > +                hw_job->sw_output_len = size;
> >> > +            }
> >>
> >> Hmm, these look a bit cumbersome, would it work if we moved the fallback
> >> qpl_execute_job() down into the other loop?
> >> We could then avoid the extra fields. Something like:
> >>
> >> static void multifd_qpl_compress_pages(MultiFDSendParams *p)
> >> {
> >>     QplData *qpl = p->compress_data;
> >>     MultiFDPages_t *pages = p->pages;
> >>     uint32_t out_size, size = p->page_size;
> >>     uint8_t *buf, *zbuf;
> >>     qpl_status ret;
> >>
> >>     /*
> >>      * Set output length to less than the page size to force the job to
> >>      * fail in case it compresses to a larger size. We'll send that page
> >>      * without compression to skip the decompression operation on the
> >>      * destination.
> >>      */
> >>     out_size = size - 1;
> >>
> >>     for (int i = 0; i < pages->normal_num; i++) {
> >>         QplHwJob *hw_job = &qpl->hw_jobs[i];
> >>
> >>         hw_job->fallback_sw_path = false;
> >>         buf = pages->block->host + pages->offset[i];
> >>         zbuf = qpl->zbuf + (size * i);
> >>
> >>         multifd_qpl_prepare_comp_job(hw_job->job, buf, size, zbuf, out_size);
> >>
> >>         if (!multifd_qpl_submit_job(hw_job->job)) {
> >>             hw_job->fallback_sw_path = true;
> >>         }
> >>     }
> >>
> >>     for (int i = 0; i < pages->normal_num; i++) {
> >>         QplHwJob *hw_job = &qpl->hw_jobs[i];
> >>         qpl_job *job;
> >>
> >>         buf = pages->block->host + pages->offset[i];
> >>         zbuf = qpl->zbuf + (size * i);
> >>
> >>         if (hw_job->fallback_sw_path) {
> >>             job = qpl->sw_job;
> >>             multifd_qpl_prepare_comp_job(job, buf, size, zbuf, out_size);
> >>             ret = qpl_execute_job(job);
> >>         } else {
> >>             job = hw_job->job;
> >>             ret = qpl_wait_job(job);
> >>         }
> >>
> >>         if (ret == QPL_STS_OK) {
> >>             multifd_qpl_fill_packet(i, p, zbuf, job->total_out);
> >>         } else {
> >>             multifd_qpl_fill_packet(i, p, buf, size);
> >>         }
> >>     }
> >> }
> >
> > Thank you very much for the reference code. I have tested it, and the
> > performance is not good. When the work queue is full, after a hardware
> > job fails to be submitted, the subsequent job submissions will most
> > likely fail as well, so my idea is to use software job execution
> > instead immediately, while all subsequent jobs still give priority to
> > the hardware path.
>
> So let me see if I get this: you're saying that going with the sw path
> immediately after a hw path failure is beneficial because the time it
> takes to call the sw path serves as a backoff time for the hw path?

Exactly, I want to use the sw path as the backoff time for the hardware path.

> Do you have an idea of the time difference of waiting for the sw path
> vs. introducing a delay to multifd_qpl_submit_job()? Aren't we leaving
> performance on the table by going with a much slower sw path instead of
> waiting for the queues to open up? Or some other strategy, such as going
> once again over the not-submitted pages.

Using a specific delay time to guarantee performance may be difficult now,
because the solution only supports shared work queue mode, and once live
migration starts waiting for a specified time, other workloads may still
fill the device work queue, causing the live migration job submission to
fail again after the wait.

I agree with your point. Currently, using the software path as the backoff
time may also cause a performance drop due to software path overhead. I
will consider how to solve this in the future.

> I understand there's a tradeoff here between your effort to investigate
> these things and the amount of performance to be had, so feel free to
> leave this question unanswered. We could choose to simply document this
> with a comment:

Sure, I will add the comment so I can improve this in the future.

>     if (multifd_qpl_submit_job(hw_job->job)) {
>         hw_job->fallback_sw_path = false;
>         continue;
>     }
>
>     /*
>      * The IAA work queue is full, any immediate subsequent job
>      * submission is likely to fail, sending the page via the QPL
>      * software path at this point gives us a better chance of
>      * finding the queue open for the next pages.
>      */
>     hw_job->fallback_sw_path = true;
>     ...
>
> > There is almost no overhead in job submission because Intel uses the
> > new "enqcmd" instruction, which allows the user program to submit the
> > job directly to the hardware.
> >
> > According to the implementation of the reference code, when a job fails
> > to be submitted, there is a high probability that "ALL" subsequent jobs
> > will fail to be submitted and then use software compression, resulting
> > in the IAA hardware not being fully utilized.
> >
> > For the 4-channel, 1 IAA device test case, using the reference code
> > reduces IAA throughput from 3.4 GBps to 2.2 GBps, thus affecting live
> > migration performance (total time from 14s to 18s).