Re: race condition? virsh migrate --copy-storage-all
Hi,

On 19-04-2022 16:07, Peter Krempa wrote:
> So at this point I suspect that something within the network broke and
> the migration was aborted in the storage copy phase, but it could have
> been in any other.

Hmm, thank you. My problem is much clearer now - and probably not getting easier:

192.168.112.31.39324 > 192.168.112.12.22: Flags [P.], cksum 0x61e7 (incorrect -> 0x3220), seq 9618:9686, ack 5038, win 501, options [nop,nop,TS val 3380045136 ecr 1586940949], length 68

(Many more of these - then a timeout. And mind you: this is not related to any virtual checksum offloading or anything like that; this is the physical machine.)

Anyway, thanks for your help.

Best regards, Valentijn
--
Re: race condition? virsh migrate --copy-storage-all
On Tue, Apr 19, 2022 at 15:51:32 +0200, Valentijn Sessink wrote:
> Hi Peter,
>
> Thanks.
>
> On 19-04-2022 13:22, Peter Krempa wrote:
> > It would be helpful if you provide the VM XML file to see how your disks
> > are configured and the debug log file when the bug reproduces:
>
> I created a random VM to show the effect. XML file attached.
>
> > Without that my only hunch would be that you ran out of disk space on
> > the destination which caused the I/O error.
>
> ... it's an LVM2 volume with exactly the same size as the source machine,
> so that would be rather odd ;-)

Oh, you are using raw disks backed by block volumes. That was not obvious before ;)

> I'm guessing that it's this weird message at the destination machine:
>
> 2022-04-19 13:31:09.394+: 1412559: error : virKeepAliveTimerInternal:137 : internal error: connection closed due to keepalive timeout

That certainly could be a hint ...

> Source machine says:
> 2022-04-19 13:31:09.432+: 2641309: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds": 1650375069, "microseconds": 432613}, "event": "BLOCK_JOB_ERROR", "data": {"device": "drive-virtio-disk2", "operation": "write", "action": "report"}}]
> 2022-04-19 13:31:09.432+: 2641309: debug : virJSONValueFromString:1822 : string={"timestamp": {"seconds": 1650375069, "microseconds": 432613}, "event": "BLOCK_JOB_ERROR", "data": {"device": "drive-virtio-disk2", "operation": "write", "action": "report"}}

The migration of non-shared storage works as follows:

1) libvirt sets up everything
2) libvirt asks the destination qemu to open an NBD server exporting the disk backends
3) source libvirt instructs qemu to copy the disks to the NBD server via a block-copy job
4) when the block jobs converge, the source qemu is instructed to migrate memory
5) when memory migrates, the source qemu is killed and the destination is resumed

Now, from the keepalive failure on the destination it seems that the network connection broke, at least between the migration controller and the destination libvirt. That might also cause the NBD connection to break, in which case the block job gets an I/O error. The I/O error is thus caused by the network connection, not by any storage issue.

So at this point I suspect that something within the network broke and the migration was aborted in the storage copy phase, but it could have been in any other.
Re: race condition? virsh migrate --copy-storage-all
Hi Peter,

Thanks.

On 19-04-2022 13:22, Peter Krempa wrote:
> It would be helpful if you provide the VM XML file to see how your disks
> are configured and the debug log file when the bug reproduces:

I created a random VM to show the effect. XML file attached.

> Without that my only hunch would be that you ran out of disk space on
> the destination which caused the I/O error.

... it's an LVM2 volume with exactly the same size as the source machine, so that would be rather odd ;-)

I'm guessing that it's this weird message at the destination machine:

2022-04-19 13:31:09.394+: 1412559: error : virKeepAliveTimerInternal:137 : internal error: connection closed due to keepalive timeout

Source machine says:

2022-04-19 13:31:09.432+: 2641309: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds": 1650375069, "microseconds": 432613}, "event": "BLOCK_JOB_ERROR", "data": {"device": "drive-virtio-disk2", "operation": "write", "action": "report"}}]
2022-04-19 13:31:09.432+: 2641309: debug : virJSONValueFromString:1822 : string={"timestamp": {"seconds": 1650375069, "microseconds": 432613}, "event": "BLOCK_JOB_ERROR", "data": {"device": "drive-virtio-disk2", "operation": "write", "action": "report"}}
2022-04-19 13:31:09.432+: 2641309: info : qemuMonitorJSONIOProcessLine:234 : QEMU_MONITOR_RECV_EVENT: mon=0x7f70080028a0 event={"timestamp": {"seconds": 1650375069, "microseconds": 432613}, "event": "BLOCK_JOB_ERROR", "data": {"device": "drive-virtio-disk2", "operation": "write", "action": "report"}}
2022-04-19 13:31:09.432+: 2641309: debug : qemuMonitorEmitEvent:1198 : mon=0x7f70080028a0 event=BLOCK_JOB_ERROR
2022-04-19 13:31:09.432+: 2641309: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds": 1650375069, "microseconds": 432668}, "event": "BLOCK_JOB_ERROR", "data": {"device": "drive-virtio-disk2", "operation": "write", "action": "report"}}]
2022-04-19 13:31:09.432+: 2641309: debug : virJSONValueFromString:1822 : string={"timestamp": {"seconds": 1650375069, "microseconds": 432668}, "event": "BLOCK_JOB_ERROR", "data": {"device": "drive-virtio-disk2", "operation": "write", "action": "report"}}
2022-04-19 13:31:09.433+: 2641309: info : qemuMonitorJSONIOProcessLine:234 : QEMU_MONITOR_RECV_EVENT: mon=0x7f70080028a0 event={"timestamp": {"seconds": 1650375069, "microseconds": 432668}, "event": "BLOCK_JOB_ERROR", "data": {"device": "drive-virtio-disk2", "operation": "write", "action": "report"}}
2022-04-19 13:31:09.433+: 2641309: debug : qemuMonitorEmitEvent:1198 : mon=0x7f70080028a0 event=BLOCK_JOB_ERROR
2022-04-19 13:31:09.433+: 2641309: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds": 1650375069, "microseconds": 432688}, "event": "BLOCK_JOB_ERROR", "data": {"device": "drive-virtio-disk2", "operation": "write", "action": "report"}}]
2022-04-19 13:31:09.433+: 2641309: debug : virJSONValueFromString:1822 : string={"timestamp": {"seconds": 1650375069, "microseconds": 432688}, "event": "BLOCK_JOB_ERROR", "data": {"device": "drive-virtio-disk2", "operation": "write", "action": "report"}}
2022-04-19 13:31:09.433+: 2641309: info : qemuMonitorJSONIOProcessLine:234 : QEMU_MONITOR_RECV_EVENT: mon=0x7f70080028a0 event={"timestamp": {"seconds": 1650375069, "microseconds": 432688}, "event": "BLOCK_JOB_ERROR", "data": {"device": "drive-virtio-disk2", "operation": "write", "action": "report"}}
2022-04-19 13:31:09.433+: 2641309: debug : qemuMonitorEmitEvent:1198 : mon=0x7f70080028a0 event=BLOCK_JOB_ERROR

... and more of these.

XML file attached. Does that show anything?

Please note that there is no real "block error" anywhere: there is an LVM volume of exactly the same size on the other side. I'm using a script to extract the name of the volume at the source; then I read the source volume size and create a destination volume of the exact same size before I start the migration. The disks are RAID volumes and there are no read or write errors.
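The repeated entries above are raw QMP events that libvirt reads line by line from the qemu monitor socket (each `Line [...]` is one newline-delimited JSON message). As an illustration only (not libvirt's actual code), a minimal stdlib sketch of tallying such events from a debug excerpt; the sample line is copied from the log above:

```python
import json
from collections import Counter


def parse_qmp_events(lines):
    """Parse newline-delimited QMP JSON messages and tally
    (event name, device) pairs, skipping non-event messages."""
    tally = Counter()
    for line in lines:
        msg = json.loads(line)
        if "event" in msg:
            dev = msg.get("data", {}).get("device", "?")
            tally[(msg["event"], dev)] += 1
    return tally


# One event line lifted verbatim from the debug log above.
log = [
    '{"timestamp": {"seconds": 1650375069, "microseconds": 432613}, '
    '"event": "BLOCK_JOB_ERROR", "data": {"device": "drive-virtio-disk2", '
    '"operation": "write", "action": "report"}}',
]
print(parse_qmp_events(log))
```

Running this over a full debug log quickly shows which device and operation the errors cluster on.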
Best regards, Valentijn
--
Durgerdamstraat 29, 1507 JL Zaandam; phone 075-7100071

[Attachment: libvirt domain XML for the guest "water" (UUID 959c1a50-5784-e3f4-1006-1bac01d513e5): libosinfo metadata for Ubuntu 20.04, memory 4194304, currentMemory 1740804, 1 vCPU, hvm machine with a Westmere CPU model, emulator /usr/bin/kvm; the rest of the XML was mangled by the archive.]
Re: race condition? virsh migrate --copy-storage-all
On Fri, Apr 15, 2022 at 16:58:08 +0200, Valentijn Sessink wrote:
> Hi list,
>
> I'm trying to migrate a few qemu virtual machines between two 1G ethernet
> connected hosts, with local storage only. I got endless "error: operation
> failed: migration of disk vda failed: Input/output error" errors and
> thought: something wrong with settings.
>
> However, then, suddenly: I succeeded without changing anything. And, hey:
>
> while ! time virsh migrate --live --persistent --undefinesource \
>     --copy-storage-all ubuntu20.04 qemu+ssh://duikboot/system; do
>     a=$(( $a + 1 )); echo $a
> done
>
> ... retried 8 times, but then: success. This smells like a race condition,
> doesn't it? A bit weird is the fact that the migration seems to succeed
> every time while copying from revolving disks to SSD; but the other way
> around has this Input/output error.
>
> There are some messages in /var/log/syslog, but not at the time of the
> failure, and no disk errors. These disks are LVM2 volumes and they live on
> RAID arrays - and/so there is no real, as in physical, I/O error.
>
> Source system has SSDs, target system has regular disks.
>
> 1) is this the right mailing list? I'm not 100% sure.
> 2) how can I research this further? Spending hours on a "while / then"
> loop to try and retry live migration looks like a dull job for my poor
> computers ;-)

It would be helpful if you provide the VM XML file to see how your disks are configured and the debug log file when the bug reproduces:

https://www.libvirt.org/kbase/debuglogs.html#less-verbose-logging-for-qemu-vms

Without that my only hunch would be that you ran out of disk space on the destination which caused the I/O error.
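As an aside, the retry loop from the original message can be made a little less dull with a small helper that counts attempts. A sketch only; the commented-out virsh invocation reuses the host name from the message above:

```shell
#!/bin/sh
# retry CMD...: run CMD until it succeeds, reporting how often it failed.
retry() {
    n=0
    until "$@"; do
        n=$((n + 1))
        echo "attempt $n failed, retrying" >&2
    done
    echo "succeeded after $n failed attempts"
}

# Example use (host and domain names as in the original message):
# retry virsh migrate --live --persistent --undefinesource \
#     --copy-storage-all ubuntu20.04 qemu+ssh://duikboot/system
```

Capturing the attempt count per run over a few dozen runs would at least quantify how often the race hits.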
Re: Virtio-scsi and block mirroring
On Thu, Apr 14, 2022 at 16:36:38 +, Bjoern Teipel wrote:
> Hello everyone,

Hi,

> I’m looking at an issue where I see guests freezing (in "Dl" process
> state) during a block disk mirror from one storage to another (NFS),
> where the network stack of the guest can freeze for up to 10 seconds.
> Looking at the storage and I/O, I noticed good throughput and low latency
> (<3 ms), and I'm having trouble tracking down the source of the issue, as
> neither storage nor networking show problems. Interestingly, when I do
> the same test with virtio-blk I do not really see the process freezes at
> the frequency or duration seen with virtio-scsi, which seems to indicate
> a client-side rather than storage-side problem.

Hmm, this is really weird if the difference is in the guest-facing device frontend. Since libvirt merely sets up the block job for the copy, and the copy itself is handled by qemu, I suggest you contact the qemu-bl...@nongnu.org mailing list.

Unfortunately you didn't provide any information on the disk configuration (the VM XML) or on how you start the block job, which I could translate for you into qemu specifics. If you provide such information I can do that to ensure that the qemu folks have all the relevant information.
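Since neither storage nor networking show problems, it may also help to timestamp the freezes from inside the guest while the mirror runs. A small stdlib Python probe (an illustration, not a libvirt or qemu tool; the threshold is an assumption) that records wall-clock gaps longer than expected between ticks:

```python
import time


def detect_stalls(duration_s=10.0, tick_s=0.1, threshold_s=1.0):
    """Sleep in short ticks and record any wall-clock gap that exceeds
    the tick interval by more than `threshold_s`. Inside a guest, such
    gaps appear when the vCPU or its I/O path stalls."""
    stalls = []
    deadline = time.monotonic() + duration_s
    last = time.monotonic()
    while time.monotonic() < deadline:
        time.sleep(tick_s)
        now = time.monotonic()
        gap = now - last
        if gap - tick_s > threshold_s:
            stalls.append(round(gap, 3))
        last = now
    return stalls


# Run inside the guest during the block mirror, e.g.:
# print(detect_stalls(duration_s=600))
```

Correlating the recorded gaps with the mirror's progress should show whether the 10-second freezes line up with specific phases of the copy.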
Re: Help with libvirt
On 11/4/22 at 15:06, Eduardo Kiassucumuca wrote:
> Good morning. I'm Eduardo, a computer science student, and I'm doing a
> final course project focused on virtualization. The project consists of
> creating virtual machines on a server and allowing ssh access to the
> virtual machines on that server, which runs qemu/kvm/libvirt. The problem
> is that I can't access the virtual machines from an external network,
> though I can access them from inside the server. Since we want to have a
> single public IP and some kind of reverse proxy to reach the virtual
> machines, what would you recommend from your experience?

Hello Eduardo.

I am afraid I am promoting our own project now. :) We have a feature in Ravada VDI to easily expose ports. It was a feature created for students' virtual machines in a classroom, but it can be applied anywhere.

https://ravada.readthedocs.io/en/latest/docs/expose_ports.html

Hope this helps.
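For ssh access specifically, plain OpenSSH can also cover the single-public-IP case without extra software, by using the virtualization host as a jump host. A hedged sketch; all host names and addresses below are made up, and the VM address assumes libvirt's default NAT network:

```
# ~/.ssh/config on the student's machine (hypothetical names/addresses)
Host vm1
    HostName 192.168.122.11          # VM's address on the host's NAT network
    User student
    ProxyJump public-server.example.org
```

With that in place, `ssh vm1` first connects to the public host, then hops to the VM over the host's internal network.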
Re: When does the balloon-change or control-error event occur
On Thu, Apr 07, 2022 at 05:16:45PM +0800, Han Han wrote:
> Hi developers,
> I have questions about the balloon-change and control-error events:
> 1. What's the meaning of these events?
> 2. When do the events occur?
>
> The comments of their callbacks don't mention that:
> https://gitlab.com/libvirt/libvirt/-/blob/master/include/libvirt/libvirt-domain.h#L4130
> https://gitlab.com/libvirt/libvirt/-/blob/master/include/libvirt/libvirt-domain.h#L3736

'balloon-change' is emitted any time the guest OS changes the balloon inflation level. E.g. if the host admin sets the balloon target to 1 GB and the guest is currently using 2 GB, it might not be able to immediately drop down to the 1 GB mark. balloon-change events will be emitted as it makes progress towards the 1 GB mark.

'control-error' is emitted when libvirt has some kind of problem controlling the VM. The VM is still running, but libvirt may not be able to make changes to its config. This can happen if libvirt has problems parsing JSON from QMP. In practice it is highly unlikely for this to ever happen.

With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
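The convergence described above can be illustrated with a toy model (purely illustrative, not the libvirt API): the guest steps its actual balloon level toward the target, and one balloon-change event corresponds to each new level it reaches.

```python
def balloon_progress(current_kib, target_kib, step_kib):
    """Toy model of balloon deflation: the guest releases memory in
    fixed-size steps, emitting a 'balloon-change' event at each new
    actual level until it reaches the target."""
    events = []
    while current_kib > target_kib:
        current_kib = max(target_kib, current_kib - step_kib)
        events.append(("balloon-change", current_kib))
    return events


# A 2 GiB guest asked to drop to 1 GiB, releasing 256 MiB at a time
# (all sizes in KiB): several events fire before the target is reached.
print(balloon_progress(2 * 1024**2, 1 * 1024**2, 256 * 1024))
```

This is why a single change of the target can produce a burst of balloon-change events rather than exactly one.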
Re: Libvirt vs QEMU benefits
On Wed, Apr 06, 2022 at 12:44:37PM +, M, Shivakumar wrote:
> Hello,
>
> For one of our validation cases, we were previously using direct QEMU
> commands for VM creation, as it was easier to configure the VMs. Inside
> the VM we run a real-time latency test.
> Recently we switched to libvirt for VM creation and deletion.
> Surprisingly, we see a significant improvement in real-time latency for
> the VMs launched through libvirt.
>
> Configuration-wise both VMs are the same; we just converted the existing
> QEMU commands into libvirt XMLs.

It would be useful to share the QEMU command line args seen both when directly running QEMU and via libvirt. I'd be really surprised if your direct config was exactly the same as libvirt's.

> I am wondering which of libvirt's features are improving this
> performance.

If it isn't related to QEMU configuration, then the most likely candidate is the use of cgroups for placing VMs.

With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
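A quick way to do the comparison suggested above is to dump both argv vectors from /proc and diff them. The PIDs and file names here are placeholders for the two running QEMU processes:

```shell
#!/bin/sh
# dump_args PID FILE: write PID's argv one entry per line
# (entries in /proc/PID/cmdline are NUL-separated).
dump_args() {
    tr '\0' '\n' < /proc/"$1"/cmdline > "$2"
}

# Hypothetical usage with the two QEMU processes:
# dump_args "$DIRECT_QEMU_PID"  direct.args
# dump_args "$LIBVIRT_QEMU_PID" libvirt.args
# diff -u direct.args libvirt.args
```

Any difference that shows up in the diff (machine type, CPU flags, iothreads, memory backing) is a candidate explanation before looking at cgroups.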