Re: race condition? virsh migrate --copy-storage-all

2022-04-20 Thread Valentijn Sessink

Hi, not really virsh related anymore, but just to summarize: it was in
fact an ethernet problem:

On 19-04-2022 16:21, Valentijn Sessink wrote:
192.168.112.31.39324 > 192.168.112.12.22: Flags [P.], cksum 0x61e7 
(incorrect -> 0x3220), seq 9618:9686, ack 5038, win 501, options 
[nop,nop,TS val 3380045136 ecr 1586940949], length 68


Probably related to https://bugzilla.kernel.org/show_bug.cgi?id=206969 
(same adapter).


# ethtool -K enp1s0 tx off
... helped.
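
(For the archives: the current offload state can be checked with ethtool's 
lowercase -k, and the workaround has to be re-applied after every boot. The 
snippet below is just one way to do that, with my interface name hard-coded.)

  # show the current offload settings (lowercase -k reads, uppercase -K writes)
  ethtool -k enp1s0 | grep checksum

  # one-shot systemd unit to disable TX checksum offload at boot (sketch only)
  cat > /etc/systemd/system/disable-tx-offload.service <<'EOF'
  [Unit]
  Description=Disable TX checksum offload on enp1s0 (NIC checksum bug workaround)
  After=network-online.target

  [Service]
  Type=oneshot
  ExecStart=/usr/sbin/ethtool -K enp1s0 tx off

  [Install]
  WantedBy=multi-user.target
  EOF
  systemctl daemon-reload && systemctl enable disable-tx-offload.service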

Thanks again for your help. Best regards,

Valentijn
--
http://www.openoffice.nl/   Open Office - Linux Office Solutions
Valentijn Sessink  v.sess...@openoffice.nl  +31(0)20-4214059



Re: race condition? virsh migrate --copy-storage-all

2022-04-19 Thread Valentijn Sessink

Hi,

On 19-04-2022 16:07, Peter Krempa wrote:

So at this point I suspect that something in the network broke and the
migration was aborted in the storage copy phase, but it could have been
in any other.


Hmm, thank you. My problem is much clearer now - and probably not 
getting easier:


192.168.112.31.39324 > 192.168.112.12.22: Flags [P.], cksum 0x61e7 
(incorrect -> 0x3220), seq 9618:9686, ack 5038, win 501, options 
[nop,nop,TS val 3380045136 ecr 1586940949], length 68
(Many more of these - then a timeout. And mind you: this is not related 
to checksum offloading on some virtual interface or anything like that; 
it's the physical machine.)
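
(Side note: on the sending host, tcpdump often reports "incorrect" checksums 
simply because TX checksum offload hands the packet to the NIC before the 
checksum is filled in, so the conclusive capture is on the receiving host, 
where a bad checksum is real. Something like the following - the interface 
name is just an example:)

  # on the receiving host (192.168.112.12): verify checksums of arriving packets
  tcpdump -i enp1s0 -nn -vvv 'host 192.168.112.31 and tcp port 22'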


Anyway, thanks for your help.

Best regards,

Valentijn
--



Re: race condition? virsh migrate --copy-storage-all

2022-04-19 Thread Peter Krempa
On Tue, Apr 19, 2022 at 15:51:32 +0200, Valentijn Sessink wrote:
> Hi Peter,
> 
> Thanks.
> 
> On 19-04-2022 13:22, Peter Krempa wrote:
> > It would be helpful if you provide the VM XML file to see how your disks
> > are configured and the debug log file when the bug reproduces:
> 
> I created a random VM to show the effect. XML file attached.
> 
> > Without that my only hunch would be that you ran out of disk space on
> > the destination which caused the I/O error.
> 
> ... it's an LVM2 volume with exactly the same size as on the source machine, so
> that would be rather odd ;-)

Oh, you are using raw disks backed by block volumes. That was not
obvious before ;)
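
(For context, a raw disk backed by a block volume is typically defined along
these lines in the domain XML - the volume path and target name here are just
examples, not taken from the attached XML:)

  <disk type='block' device='disk'>
    <driver name='qemu' type='raw'/>
    <source dev='/dev/vg0/some-lv'/>
    <target dev='vdc' bus='virtio'/>
  </disk>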

> 
> I'm guessing that it's this weird message at the destination machine:
> 
> 2022-04-19 13:31:09.394+: 1412559: error : virKeepAliveTimerInternal:137
> : internal error: connection closed due to keepalive timeout

That certainly could be a hint ...

> 
> Source machine says:
> 2022-04-19 13:31:09.432+: 2641309: debug :
> qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds":
> 1650375069, "microseconds": 432613}, "event": "BLOCK_JOB_ERROR", "data":
> {"device": "drive-virtio-disk2", "operation": "write", "action": "report"}}]
> 2022-04-19 13:31:09.432+: 2641309: debug : virJSONValueFromString:1822 :
> string={"timestamp": {"seconds": 1650375069, "microseconds": 432613},
> "event": "BLOCK_JOB_ERROR", "data": {"device": "drive-virtio-disk2",
> "operation": "write", "action": "report"}}

The migration of non-shared storage works as follows:

1) libvirt sets up everything
2) libvirt asks destination qemu to open an NBD server exporting the
   disk backends
3) source libvirt instructs qemu to copy the disks to the NBD server via
   a block-copy job
4) when the block jobs converge, source qemu is instructed to migrate
   memory
5) when memory migrates, source qemu is killed and destination is
   resumed
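
In QMP terms this is roughly the following - much simplified, with made-up
device names, address and port, and the exact commands differ between
libvirt/QEMU versions:

  # on the destination qemu (step 2): export the target disk over NBD
  {"execute": "nbd-server-start",
   "arguments": {"addr": {"type": "inet",
                          "data": {"host": "192.168.112.12", "port": "49152"}}}}
  {"execute": "nbd-server-add",
   "arguments": {"device": "drive-virtio-disk2", "writable": true}}

  # on the source qemu (step 3): mirror the disk into that export
  {"execute": "drive-mirror",
   "arguments": {"device": "drive-virtio-disk2",
                 "target": "nbd:192.168.112.12:49152:exportname=drive-virtio-disk2",
                 "sync": "full", "mode": "existing"}}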

Now, from the keepalive failure on the destination, it seems that the
network connection, at least between the migration controller and the
destination libvirt, broke. That might also cause the NBD connection to
break, in which case the block job gets an I/O error.

The I/O error is thus actually caused by the network connection, not by
any storage issue.

So at this point I suspect that something in the network broke and the
migration was aborted in the storage copy phase, but it could have been
in any other.
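
If the link is merely slow or lossy rather than fully broken, the keepalive
timeout can be relaxed; libvirt exposes this in /etc/libvirt/libvirtd.conf
(server side) and /etc/libvirt/libvirt.conf (client side). The values below
are only an example:

  # /etc/libvirt/libvirtd.conf (restart libvirtd afterwards)
  keepalive_interval = 5    # seconds between keepalive probes
  keepalive_count = 10      # unanswered probes before the connection is closed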



Re: race condition? virsh migrate --copy-storage-all

2022-04-19 Thread Valentijn Sessink

Hi Peter,

Thanks.

On 19-04-2022 13:22, Peter Krempa wrote:

It would be helpful if you provide the VM XML file to see how your disks
are configured and the debug log file when the bug reproduces:


I created a random VM to show the effect. XML file attached.


Without that my only hunch would be that you ran out of disk space on
the destination which caused the I/O error.


... it's an LVM2 volume with exactly the same size as on the source machine, 
so that would be rather odd ;-)


I'm guessing that it's this weird message at the destination machine:

2022-04-19 13:31:09.394+: 1412559: error : 
virKeepAliveTimerInternal:137 : internal error: connection closed due to 
keepalive timeout


Source machine says:
2022-04-19 13:31:09.432+: 2641309: debug : 
qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds": 
1650375069, "microseconds": 432613}, "event": "BLOCK_JOB_ERROR", "data": 
{"device": "drive-virtio-disk2", "operation": "write", "action": "report"}}]
2022-04-19 13:31:09.432+: 2641309: debug : 
virJSONValueFromString:1822 : string={"timestamp": {"seconds": 
1650375069, "microseconds": 432613}, "event": "BLOCK_JOB_ERROR", "data": 
{"device": "drive-virtio-disk2", "operation": "write", "action": "report"}}
2022-04-19 13:31:09.432+: 2641309: info : 
qemuMonitorJSONIOProcessLine:234 : QEMU_MONITOR_RECV_EVENT: 
mon=0x7f70080028a0 event={"timestamp": {"seconds": 1650375069, 
"microseconds": 432613}, "event": "BLOCK_JOB_ERROR", "data": {"device": 
"drive-virtio-disk2", "operation": "write", "action": "report"}}
2022-04-19 13:31:09.432+: 2641309: debug : qemuMonitorEmitEvent:1198 
: mon=0x7f70080028a0 event=BLOCK_JOB_ERROR
2022-04-19 13:31:09.432+: 2641309: debug : 
qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds": 
1650375069, "microseconds": 432668}, "event": "BLOCK_JOB_ERROR", "data": 
{"device": "drive-virtio-disk2", "operation": "write", "action": "report"}}]
2022-04-19 13:31:09.432+: 2641309: debug : 
virJSONValueFromString:1822 : string={"timestamp": {"seconds": 
1650375069, "microseconds": 432668}, "event": "BLOCK_JOB_ERROR", "data": 
{"device": "drive-virtio-disk2", "operation": "write", "action": "report"}}
2022-04-19 13:31:09.433+: 2641309: info : 
qemuMonitorJSONIOProcessLine:234 : QEMU_MONITOR_RECV_EVENT: 
mon=0x7f70080028a0 event={"timestamp": {"seconds": 1650375069, 
"microseconds": 432668}, "event": "BLOCK_JOB_ERROR", "data": {"device": 
"drive-virtio-disk2", "operation": "write", "action": "report"}}
2022-04-19 13:31:09.433+: 2641309: debug : qemuMonitorEmitEvent:1198 
: mon=0x7f70080028a0 event=BLOCK_JOB_ERROR
2022-04-19 13:31:09.433+: 2641309: debug : 
qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds": 
1650375069, "microseconds": 432688}, "event": "BLOCK_JOB_ERROR", "data": 
{"device": "drive-virtio-disk2", "operation": "write", "action": "report"}}]
2022-04-19 13:31:09.433+: 2641309: debug : 
virJSONValueFromString:1822 : string={"timestamp": {"seconds": 
1650375069, "microseconds": 432688}, "event": "BLOCK_JOB_ERROR", "data": 
{"device": "drive-virtio-disk2", "operation": "write", "action": "report"}}
2022-04-19 13:31:09.433+: 2641309: info : 
qemuMonitorJSONIOProcessLine:234 : QEMU_MONITOR_RECV_EVENT: 
mon=0x7f70080028a0 event={"timestamp": {"seconds": 1650375069, 
"microseconds": 432688}, "event": "BLOCK_JOB_ERROR", "data": {"device": 
"drive-virtio-disk2", "operation": "write", "action": "report"}}
2022-04-19 13:31:09.433+: 2641309: debug : qemuMonitorEmitEvent:1198 
: mon=0x7f70080028a0 event=BLOCK_JOB_ERROR


... and more of these. XML file attached.

Does that show anything? Please note that there is no real "block error" 
anywhere: there is a matching LVM volume on the other side. I'm using a 
script to extract the name of the volume on the source, then I read the 
source volume size and create a destination volume of exactly that size 
before I start the migration. The disks are RAID volumes and there are 
no read or write errors.
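
Roughly, that pre-creation step does something like this (the volume group 
and volume names are just examples, not the literal ones from my script):

  # on the source host: exact size of the logical volume, in bytes
  SIZE=$(lvs --noheadings --nosuffix --units b -o lv_size /dev/vg0/some-lv | tr -d ' ')

  # on the destination host: pre-create a volume of exactly that size
  ssh duikboot "lvcreate -L ${SIZE}b -n some-lv vg0"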


Best regards,

Valentijn
--
Durgerdamstraat 29, 1507 JL Zaandam; phone 075-7100071
[Attachment: domain XML for the test VM, mangled by the list archive. 
Recoverable fragments: name "water", UUID 
959c1a50-5784-e3f4-1006-1bac01d513e5, libosinfo reference 
http://ubuntu.com/ubuntu/20.04, 4194304 KiB memory (1740804 KiB current), 
1 vCPU, hvm machine type, CPU model Westmere, on_poweroff destroy, 
on_reboot/on_crash restart, emulator /usr/bin/kvm, seclabel 
libvirt-959c1a50-5784-e3f4-1006-1bac01d513e5, DAC label +111:+111. The 
disk and other device definitions did not survive in the archive.]




Re: race condition? virsh migrate --copy-storage-all

2022-04-19 Thread Peter Krempa
On Fri, Apr 15, 2022 at 16:58:08 +0200, Valentijn Sessink wrote:
> Hi list,
> 
> I'm trying to migrate a few qemu virtual machines between two 1G ethernet
> connected hosts, with local storage only. I got endless "error: operation
> failed: migration of disk vda failed: Input/output error" errors and
> thought: something wrong with settings.
> 
> However, then, suddenly: I succeeded without changing anything. And, hey:
>  while ! time virsh migrate --live --persistent --undefinesource
> --copy-storage-all ubuntu20.04 qemu+ssh://duikboot/system; do a=$(( $a + 1
> )); echo $a; done
> 
> ... retried 8 times, but then: success. This smells like a race condition,
> doesn't it? What is a bit odd is that the migration seems to succeed
> every time when copying from spinning disks to SSD, while the other way
> around gives this Input/output error.
> 
> There are some messages in /var/log/syslog, but not at the time of the
> failure, and no disk errors. These disks are LVM2 volumes and they live on
> RAID arrays, so there is no real (as in physical) I/O error.
> 
> Source system has SSD's, target system has regular disks.
> 
> 1) is this the right mailing list? I'm not 100% sure.
> 2) how can I research this further? Spending hours on a "while / then" loop
> to try and retry live migration looks like a dull job for my poor computers
> ;-)

It would be helpful if you provide the VM XML file to see how your disks
are configured and the debug log file when the bug reproduces:

https://www.libvirt.org/kbase/debuglogs.html#less-verbose-logging-for-qemu-vms
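
For reference, the gist of that page is to set log filters and a log output
in /etc/libvirt/libvirtd.conf and restart libvirtd; the filter string below
is only illustrative, see the page for the currently recommended one:

  # /etc/libvirt/libvirtd.conf
  log_filters="1:qemu 1:libvirt 4:object 4:json 4:event 1:util"
  log_outputs="1:file:/var/log/libvirt/libvirtd.log"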

Without that my only hunch would be that you ran out of disk space on
the destination which caused the I/O error.



race condition? virsh migrate --copy-storage-all

2022-04-15 Thread Valentijn Sessink

Hi list,

I'm trying to migrate a few qemu virtual machines between two 1G 
ethernet connected hosts, with local storage only. I got endless "error: 
operation failed: migration of disk vda failed: Input/output error" 
errors and thought: something wrong with settings.


However, then, suddenly: I succeeded without changing anything. And, hey:
 while ! time virsh migrate --live --persistent --undefinesource 
--copy-storage-all ubuntu20.04 qemu+ssh://duikboot/system; do a=$(( $a + 
1 )); echo $a; done


... retried 8 times, but then: success. This smells like a race 
condition, doesn't it? What is a bit odd is that the migration seems to 
succeed every time when copying from spinning disks to SSD, while the 
other way around gives this Input/output error.


There are some messages in /var/log/syslog, but not at the time of the 
failure, and no disk errors. These disks are LVM2 volumes and they live 
on RAID arrays, so there is no real (as in physical) I/O error.


Source system has SSD's, target system has regular disks.

1) is this the right mailing list? I'm not 100% sure.
2) how can I research this further? Spending hours on a "while / then" 
loop to try and retry live migration looks like a dull job for my poor 
computers ;-)


Best regards,

Valentijn