On 04/12/2020 11:36, Arjen via pve-user wrote:
On Fri, 2020-12-04 at 11:22 +0100, Frank Thommen wrote:

On 04/12/2020 09:30, Frank Thommen wrote:
On Thursday, December 3, 2020 10:16 PM, Frank Thommen
<[email protected]> wrote:


Dear all,

on our PVE cluster, the backup of a specific VM always fails (which
worries us, as it is our GitLab instance). The general backup plan is
"back up all VMs at 00:30". In the confirmation email we see that the
backup of this specific VM takes six to seven hours and then fails.
The error message in the overview table used to be:

vma_queue_write: write error - Broken pipe

With detailed log

-------------------------------------------------------------


123: 2020-12-01 02:53:08 INFO: Starting Backup of VM 123 (qemu)
123: 2020-12-01 02:53:08 INFO: status = running
123: 2020-12-01 02:53:09 INFO: update VM 123: -lock backup
123: 2020-12-01 02:53:09 INFO: VM Name: odcf-vm123
123: 2020-12-01 02:53:09 INFO: include disk 'virtio0' 'ceph-rbd:vm-123-disk-0' 20G
123: 2020-12-01 02:53:09 INFO: include disk 'virtio1' 'ceph-rbd:vm-123-disk-2' 1000G
123: 2020-12-01 02:53:09 INFO: include disk 'virtio2' 'ceph-rbd:vm-123-disk-3' 2T
123: 2020-12-01 02:53:09 INFO: backup mode: snapshot
123: 2020-12-01 02:53:09 INFO: ionice priority: 7
123: 2020-12-01 02:53:09 INFO: creating archive '/mnt/pve/cephfs/dump/vzdump-qemu-123-2020_12_01-02_53_08.vma.lzo'
123: 2020-12-01 02:53:09 INFO: started backup task 'a38ff50a-f474-4b0a-a052-01a835d5c5c7'
123: 2020-12-01 02:53:12 INFO: status: 0% (167772160/3294239916032), sparse 0% (31563776), duration 3, read/write 55/45 MB/s
[... ecc. ecc. ...]
123: 2020-12-01 09:42:14 INFO: status: 35% (1170252365824/3294239916032), sparse 0% (26845003776), duration 24545, read/write 59/56 MB/s
123: 2020-12-01 09:42:14 ERROR: vma_queue_write: write error - Broken pipe
123: 2020-12-01 09:42:14 INFO: aborting backup job
123: 2020-12-01 09:42:15 ERROR: Backup of VM 123 failed - vma_queue_write: write error - Broken pipe

-------------------------------------------------------------

Recently (since the upgrade to the newest PVE release) it has been:

VM 123 qmp command 'query-backup' failed - got timeout

with log

-------------------------------------------------------------


123: 2020-12-03 03:29:00 INFO: Starting Backup of VM 123 (qemu)
123: 2020-12-03 03:29:00 INFO: status = running
123: 2020-12-03 03:29:00 INFO: VM Name: odcf-vm123
123: 2020-12-03 03:29:00 INFO: include disk 'virtio0' 'ceph-rbd:vm-123-disk-0' 20G
123: 2020-12-03 03:29:00 INFO: include disk 'virtio1' 'ceph-rbd:vm-123-disk-2' 1000G
123: 2020-12-03 03:29:00 INFO: include disk 'virtio2' 'ceph-rbd:vm-123-disk-3' 2T
123: 2020-12-03 03:29:01 INFO: backup mode: snapshot
123: 2020-12-03 03:29:01 INFO: ionice priority: 7
123: 2020-12-03 03:29:01 INFO: creating vzdump archive '/mnt/pve/cephfs/dump/vzdump-qemu-123-2020_12_03-03_29_00.vma.lzo'
123: 2020-12-03 03:29:01 INFO: started backup task 'cc7cde4e-20e8-4e26-a89a-f6f1aa9e9612'
123: 2020-12-03 03:29:01 INFO: resuming VM again
123: 2020-12-03 03:29:04 INFO: 0% (284.0 MiB of 3.0 TiB) in 3s, read: 94.7 MiB/s, write: 51.7 MiB/s
[... ecc. ecc. ...]
123: 2020-12-03 09:05:08 INFO: 36% (1.1 TiB of 3.0 TiB) in 5h 36m 7s, read: 57.3 MiB/s, write: 53.6 MiB/s
123: 2020-12-03 09:22:57 ERROR: VM 123 qmp command 'query-backup' failed - got timeout
123: 2020-12-03 09:22:57 INFO: aborting backup job
123: 2020-12-03 09:32:57 ERROR: VM 123 qmp command 'backup-cancel' failed - unable to connect to VM 123 qmp socket - timeout after 5981 retries
123: 2020-12-03 09:32:57 ERROR: Backup of VM 123 failed - VM 123 qmp command 'query-backup' failed - got timeout

-------------------------------------------------------------
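
For what it's worth, the progress lines already allow a rough estimate
of how long a complete run would take. The following is only a minimal
Python sketch and not part of vzdump: the helper name
estimate_total_seconds is made up for illustration, and it assumes the
transfer rate stays constant for the rest of the run. It parses the
old-style "status:" line from the 2020-12-01 log:

```python
import re

# Matches the pre-upgrade progress format seen in the 2020-12-01 log:
#   status: 35% (<bytes done>/<bytes total>), sparse ..., duration <seconds>, ...
STATUS_RE = re.compile(r"status: \d+% \((\d+)/(\d+)\).*?duration (\d+)")


def estimate_total_seconds(line):
    """Extrapolate the full backup duration from a single progress line.

    Assumption: the average rate observed so far holds until the end.
    """
    match = STATUS_RE.search(line)
    if match is None:
        raise ValueError("not an old-style vzdump progress line")
    done, total, elapsed = (int(group) for group in match.groups())
    return elapsed * total / done


# Last progress line before the "Broken pipe" error, copied from the log.
last_line = (
    "123: 2020-12-01 09:42:14 INFO: status: 35% "
    "(1170252365824/3294239916032), sparse 0% (26845003776), "
    "duration 24545, read/write 59/56 MB/s"
)

total_s = estimate_total_seconds(last_line)
print(f"estimated full run: {total_s / 3600:.1f} h")
```

At the observed rate a complete run would need roughly 19 hours, so
both jobs abort after only about a third of the expected runtime.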


The VM has some quite big vdisks (20G, 1T and 2T). All stored in Ceph.
There is still plenty of space in Ceph.

Can anyone give us some hint on how to investigate and debug this
further?

Because it is a write error, maybe we should look at the backup
destination. Maybe it is a network connection issue? Maybe something
wrong with the host? Maybe the disk is full?
Which storage are you using for backup? Can you show us the
corresponding entry in /etc/pve/storage.cfg?

We are backing up to cephfs, which still has about 8 TB free.

/etc/pve/storage.cfg is
------------
dir: local
         path /var/lib/vz
         content vztmpl,backup,iso

dir: data
         path /data
         content snippets,images,backup,iso,rootdir,vztmpl

cephfs: cephfs
         path /mnt/pve/cephfs
         content backup,vztmpl,iso
         maxfiles 5

rbd: ceph-rbd
         content images,rootdir
         krbd 0
         pool pve-pool1
------------


The problem has reached a new level of urgency: for the past two days,
each time a backup fails, the VM becomes inaccessible and has to be
stopped and started manually from the PVE UI.

I don't see anything wrong with the configuration that you shared.
Was anything changed in the last few days since the last successful
backup? Any updates from Proxmox? Changes to the network?
I know very little about Ceph and clusters, sorry.
What makes this VM different, except for the size of the disks?

On December 1st the hypervisor was updated to PVE 6.3-2 (I think from 6.1-3). After that the error message changed slightly and, in hindsight, since then the VM has become inaccessible after each failed backup.

However: The VM never ever backed up successfully, not even before the PVE upgrade. It's just that no one really took notice of it.

The VM is not really special. It's our only Debian VM (but I hope that's not an issue :-) and the VM has been migrated 1:1 from oVirt by migrating and importing the disk images. But we have a few other such VMs and they run and back up just fine.

No network changes. Basically nothing changed that I could think of.

But to be clear: Our current main problem is the failing backup, not the crash.


Cheers, Frank




_______________________________________________
pve-user mailing list
[email protected]
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user