On 7/7/25 21:49, Adam Kalisz wrote:
Hi Dominik,
this seems like a more elegant solution. In my testing on the Ryzen system:
Thanks for testing!
progress 100% (read 53624176640 bytes, zeroes = 5% (2776629248 bytes), duration 52 sec)
progress 96% (read 53628370944 bytes, zeroes = 5% (2776629248 bytes), duration 52 sec)
progress 97% (read 53645148160 bytes, zeroes = 5% (2776629248 bytes), duration 52 sec)
progress 96% (read 53657731072 bytes, zeroes = 5% (2776629248 bytes), duration 52 sec)
progress 97% (read 53666119680 bytes, zeroes = 5% (2776629248 bytes), duration 52 sec)
progress 96% (read 53678702592 bytes, zeroes = 5% (2776629248 bytes), duration 52 sec)
progress 97% (read 53682896896 bytes, zeroes = 5% (2776629248 bytes), duration 52 sec)
restore image complete (bytes=53687091200, duration=52.35s, speed=977.94MB/s)
As you can see, the speed is still much better than the original, but lower than with my solution, and the progress reporting also shows the lack of proper ordering.
Ah yes, I can see why the ordering is wrong, I'll fix that up in my next patch; for more info see below.
This is with 8 futures:
progress 97% (read 52076478464 bytes, zeroes = 2% (1199570944 bytes), duration 57 sec)
progress 96% (read 52080672768 bytes, zeroes = 2% (1199570944 bytes), duration 57 sec)
progress 97% (read 52084867072 bytes, zeroes = 2% (1203765248 bytes), duration 57 sec)
progress 98% (read 52583989248 bytes, zeroes = 3% (1702887424 bytes), duration 57 sec)
progress 99% (read 53120860160 bytes, zeroes = 4% (2239758336 bytes), duration 57 sec)
progress 100% (read 53657731072 bytes, zeroes = 5% (2776629248 bytes), duration 57 sec)
progress 97% (read 53661925376 bytes, zeroes = 5% (2776629248 bytes), duration 57 sec)
restore image complete (bytes=53687091200, duration=57.29s, speed=893.66MB/s)
With 32 futures I got 984 MB/s, so I guess that's the limit.
They do have checksum acceleration using SHA NI.
What I have noticed is that in my solution the zero chunks seem to get written very early, while the restore process is still fetching other chunks, which greatly speeds up images with mostly zeroes.
This shows on the Xeon system where I restore 3 drives:
progress 80% (read 10716446720 bytes, zeroes = 69% (7415529472 bytes), duration 10 sec)
progress 79% (read 10720641024 bytes, zeroes = 69% (7415529472 bytes), duration 10 sec)
progress 80% (read 10737418240 bytes, zeroes = 69% (7415529472 bytes), duration 10 sec)
restore image complete (bytes=10737418240, duration=10.18s, speed=1006.31MB/s)
--> this can be 5x faster with my solution, as it contains largely zero blocks
progress 99% (read 106292051968 bytes, zeroes = 1% (1606418432 bytes), duration 314 sec)
progress 98% (read 106296246272 bytes, zeroes = 1% (1606418432 bytes), duration 314 sec)
progress 99% (read 106304634880 bytes, zeroes = 1% (1606418432 bytes), duration 314 sec)
progress 100% (read 107374182400 bytes, zeroes = 1% (2139095040 bytes), duration 316 sec)
restore image complete (bytes=107374182400, duration=316.35s, speed=323.70MB/s)
--> here I would get ~1 GB/s with my solution
The other image is similarly slow.
This system does not have checksum acceleration using SHA NI, maybe
that could explain the large difference in speed between the solutions?
Is the sha256 calculation a bottleneck when fetching chunks?
With the futures solution, mpstat shows only 1-2 cores are busy on these two-socket systems. The speed is very similar to that of the single-threaded solution that was there before, but I confirmed multiple times that I am using the new version (the progress output shows that as well, which is why I included it).
Ah, the speed increase is not as much as I hoped, but understandably so, since I have now learned that 'buffer_unordered' does not really behave like I thought it would (the documentation is also rather sparse): it does not buffer the results of the futures in the background, but seemingly only polls them while we're awaiting the stream. So it only does work when we wait for a new chunk, which is slightly better than single-threaded, but not by much (as seen in your benchmarks).
I'll send a new version of my patch with that fixed, by starting background tokio tasks in the future itself, so they will run to completion even while we're still writing; that should give better improvements. (I'll also update my benchmark data.)
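For illustration, a rough sketch of that approach, reusing the identifiers from the patch below (chunk_reader, index, zero_chunk_digest, MAX_BUFFERED_FUTURES) and assuming an anyhow-style Error; this is not the final version:

    // sketch: spawn each read as a tokio task so it runs to completion in the
    // background, even while we're blocked in a synchronous write callback
    let read_queue = (0..index.index_count()).map(|pos| {
        let chunk_reader = chunk_reader.clone();
        // copy the digest so the spawned task does not borrow from `index`
        let digest = *index.index_digest(pos).unwrap();
        let offset = (pos * index.chunk_size) as u64;
        tokio::spawn(async move {
            let chunk = if digest == zero_chunk_digest {
                None
            } else {
                Some(AsyncReadChunk::read_chunk(&chunk_reader, &digest).await?)
            };
            Ok::<_, Error>((chunk, pos, offset))
        })
    });

    let mut stream = futures::stream::iter(read_queue).buffer_unordered(MAX_BUFFERED_FUTURES);

    while let Some(joined) = stream.next().await {
        // first ? propagates a JoinError, second ? the chunk read error
        let (chunk, pos, offset) = joined??;
        // ... write zeroes/data and update the progress output as in the patch below ...
    }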
Thanks
Dominik
Best regards
Adam
On Mon, 2025-07-07 at 10:14 +0200, Dominik Csapak wrote:
By using async futures to load chunks and stream::buffer_unordered to buffer up to 16 of them, depending on write/load speed, we don't need to increase the number of threads in the runtime to trigger parallel reads and network traffic to us. This way it's only limited by the CPU if decoding and/or decrypting is the bottleneck.
I measured restoring a VM backup with about 30 GiB of data and fast storage over a local network link (from a PBS VM to the host). I let it do multiple runs, but the variance was not that big, so here's some representative log output with various MAX_BUFFERED_FUTURES values:
no MAX_BUFFERED_FUTURES: duration=43.18s, speed=758.82MB/s
4: duration=38.61s, speed=848.77MB/s
8: duration=33.79s, speed=969.85MB/s
16: duration=31.45s, speed=1042.06MB/s
Note that increasing the number has diminishing returns after 10-12 on my system, but I guess it depends on the exact configuration. (For more than 16 I did not see any improvement, but this is probably just my setup.)
I saw an increase in CPU usage (from ~75% to ~100% of one core), which is very likely due to the additional chunks being decoded.
In general I'd like to limit the buffering somehow, but I don't think there is a good automatic metric we can use, and giving the admin a knob whose actual ramifications are hard to explain is also not good, so I settled for a value that showed improvement but does not seem too high.
In any case, if the target and/or source storage is too slow, there will be back/forward pressure, and this change should only matter for storage systems where IO depth plays a role.
This patch is loosely based on the patch from Adam Kalisz[0], but removes the need to increase the number of blocking threads and uses the (actually always used) underlying async implementation for reading remote chunks.
0: https://lore.proxmox.com/pve-devel/mailman.719.1751052794.395.pve-de...@lists.proxmox.com/
Signed-off-by: Dominik Csapak <d.csa...@proxmox.com>
Based-on-patch-by: Adam Kalisz <adam.kal...@notnullmakers.com>
---
@Adam could you please test this patch too, to see if you still see the improvements you saw in your version?
Also, I sent it as an RFC to discuss how we decide how many chunks we want to buffer/how many threads we want to allocate. This is a non-trivial topic, and as I wrote, we don't have a real metric to decide upfront, but giving the admin complicated knobs is also not the best solution.
My instinct would be to simply increase to 16 (as I have done here) and maybe expose this number in /etc/vzdump.conf or /etc/pve/datacenter.cfg.
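If we do expose it, a purely hypothetical sketch of how a bounded override could look (the variable name PBS_RESTORE_BUFFERED_CHUNKS is made up here; the real knob would have to be wired through vzdump.conf/datacenter.cfg instead):

    // hypothetical: clamp an admin-supplied value to a sane range and fall
    // back to the current default
    fn max_buffered_futures() -> usize {
        std::env::var("PBS_RESTORE_BUFFERED_CHUNKS")
            .ok()
            .and_then(|v| v.parse::<usize>().ok())
            .map(|n| n.clamp(1, 64))
            .unwrap_or(MAX_BUFFERED_FUTURES)
    }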
Also, I tried to make the writes multi-threaded too, but my QEMU knowledge is not very deep for this kind of thing, and I wanted to get this version out there soon. (Increasing the write threads can still be done afterwards if this change is enough for now.)
I developed this patch on top of the 'stable-bookworm' branch, but it
should apply cleanly on master as well.
 src/restore.rs | 57 +++++++++++++++++++++++++++++++++++++-------------
1 file changed, 42 insertions(+), 15 deletions(-)
diff --git a/src/restore.rs b/src/restore.rs
index 5a5a398..741b3e1 100644
--- a/src/restore.rs
+++ b/src/restore.rs
@@ -2,6 +2,7 @@ use std::convert::TryInto;
use std::sync::{Arc, Mutex};
use anyhow::{bail, format_err, Error};
+use futures::StreamExt;
use once_cell::sync::OnceCell;
use tokio::runtime::Runtime;
@@ -13,7 +14,7 @@ use pbs_datastore::cached_chunk_reader::CachedChunkReader;
use pbs_datastore::data_blob::DataChunkBuilder;
use pbs_datastore::fixed_index::FixedIndexReader;
use pbs_datastore::index::IndexFile;
-use pbs_datastore::read_chunk::ReadChunk;
+use pbs_datastore::read_chunk::AsyncReadChunk;
use pbs_datastore::BackupManifest;
use pbs_key_config::load_and_decrypt_key;
use pbs_tools::crypt_config::CryptConfig;
@@ -29,6 +30,9 @@ struct ImageAccessInfo {
archive_size: u64,
}
+// use at max 16 buffered futures to make loading of chunks more concurrent
+const MAX_BUFFERED_FUTURES: usize = 16;
+
pub(crate) struct RestoreTask {
setup: BackupSetup,
runtime: Arc<Runtime>,
@@ -165,24 +169,47 @@ impl RestoreTask {
         let start_time = std::time::Instant::now();
-        for pos in 0..index.index_count() {
+        let read_queue = (0..index.index_count()).map(|pos| {
             let digest = index.index_digest(pos).unwrap();
             let offset = (pos * index.chunk_size) as u64;
-            if digest == &zero_chunk_digest {
-                let res = write_zero_callback(offset, index.chunk_size as u64);
-                if res < 0 {
-                    bail!("write_zero_callback failed ({})", res);
+            let chunk_reader = chunk_reader.clone();
+            async move {
+                let chunk = if digest == &zero_chunk_digest {
+                    None
+                } else {
+                    let raw_data = AsyncReadChunk::read_chunk(&chunk_reader, digest).await?;
+                    Some(raw_data)
+                };
+
+                Ok::<_, Error>((chunk, pos, offset))
+            }
+        });
+
+        // this buffers futures and pre-fetches some chunks for us
+        let mut stream = futures::stream::iter(read_queue).buffer_unordered(MAX_BUFFERED_FUTURES);
+
+        while let Some(res) = stream.next().await {
+            let res = res?;
+            let pos = match res {
+                (None, pos, offset) => {
+                    let res = write_zero_callback(offset, index.chunk_size as u64);
+                    if res < 0 {
+                        bail!("write_zero_callback failed ({})", res);
+                    }
+                    bytes += index.chunk_size;
+                    zeroes += index.chunk_size;
+                    pos
                 }
-                bytes += index.chunk_size;
-                zeroes += index.chunk_size;
-            } else {
-                let raw_data = ReadChunk::read_chunk(&chunk_reader, digest)?;
-                let res = write_data_callback(offset, &raw_data);
-                if res < 0 {
-                    bail!("write_data_callback failed ({})", res);
+                (Some(raw_data), pos, offset) => {
+                    let res = write_data_callback(offset, &raw_data);
+                    if res < 0 {
+                        bail!("write_data_callback failed ({})", res);
+                    }
+                    bytes += raw_data.len();
+                    pos
                 }
-                bytes += raw_data.len();
-            }
+            };
+
             if verbose {
                 let next_per = ((pos + 1) * 100) / index.index_count();
                 if per != next_per {